<!DOCTYPE html>
<html class="writer-html5" lang="en" data-content_root="./">
<head>
<meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Executor API &mdash; tensorrt_llm documentation</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css?v=80d5e7a1" />
<link rel="stylesheet" type="text/css" href="_static/css/theme.css?v=19f00094" />
<!--[if lt IE 9]>
<script src="_static/js/html5shiv.min.js"></script>
<![endif]-->
<script src="_static/jquery.js?v=5d32c60e"></script>
<script src="_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
<script src="_static/documentation_options.js?v=5929fcd5"></script>
<script src="_static/doctools.js?v=9a2dae69"></script>
<script src="_static/sphinx_highlight.js?v=dc90522c"></script>
<script src="_static/js/theme.js"></script>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="index.html" class="icon icon-home">
tensorrt_llm
</a>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="search.html" method="get">
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
<p class="caption" role="heading"><span class="caption-text">Getting Started</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="overview.html">Overview</a></li>
<li class="toctree-l1"><a class="reference internal" href="quick-start-guide.html">Quick Start Guide</a></li>
<li class="toctree-l1"><a class="reference internal" href="release-notes.html">Release Notes</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Installation</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="installation/linux.html">Installing on Linux</a></li>
<li class="toctree-l1"><a class="reference internal" href="installation/build-from-source-linux.html">Building from Source Code on Linux</a></li>
<li class="toctree-l1"><a class="reference internal" href="installation/windows.html">Installing on Windows</a></li>
<li class="toctree-l1"><a class="reference internal" href="installation/build-from-source-windows.html">Building from Source Code on Windows</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Architecture</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="architecture/overview.html">TensorRT-LLM Architecture</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture/core-concepts.html">Model Definition</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture/core-concepts.html#compilation">Compilation</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture/core-concepts.html#runtime">Runtime</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture/core-concepts.html#multi-gpu-and-multi-node-support">Multi-GPU and Multi-Node Support</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture/checkpoint.html">TensorRT-LLM Checkpoint</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture/workflow.html">TensorRT-LLM Build Workflow</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture/add-model.html">Adding a Model</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Advanced</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="advanced/gpt-attention.html">Multi-Head, Multi-Query, and Group-Query Attention</a></li>
<li class="toctree-l1"><a class="reference internal" href="advanced/gpt-runtime.html">C++ GPT Runtime</a></li>
<li class="toctree-l1"><a class="reference internal" href="advanced/graph-rewriting.html">Graph Rewriting Module</a></li>
<li class="toctree-l1"><a class="reference internal" href="advanced/batch-manager.html">The Batch Manager in TensorRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="advanced/inference-request.html">Inference Request</a></li>
<li class="toctree-l1"><a class="reference internal" href="advanced/inference-request.html#responses">Responses</a></li>
<li class="toctree-l1"><a class="reference internal" href="advanced/lora.html">Run gpt-2b + LoRA using GptManager / cpp runtime</a></li>
<li class="toctree-l1"><a class="reference internal" href="advanced/expert-parallelism.html">Expert Parallelism in TensorRT-LLM</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Performance</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="performance/perf-overview.html">Overview</a></li>
<li class="toctree-l1"><a class="reference internal" href="performance/perf-best-practices.html">Best Practices for Tuning the Performance of TensorRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="performance/perf-analysis.html">Performance Analysis</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Reference</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="reference/troubleshooting.html">Troubleshooting</a></li>
<li class="toctree-l1"><a class="reference internal" href="reference/support-matrix.html">Support Matrix</a></li>
<li class="toctree-l1"><a class="reference internal" href="reference/precision.html">Numerical Precision</a></li>
<li class="toctree-l1"><a class="reference internal" href="reference/memory.html">Memory Usage of TensorRT-LLM</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">C++ API</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="_cpp_gen/executor.html">Executor</a></li>
<li class="toctree-l1"><a class="reference internal" href="_cpp_gen/runtime.html">Runtime</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Python API</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="python-api/tensorrt_llm.layers.html">Layers</a></li>
<li class="toctree-l1"><a class="reference internal" href="python-api/tensorrt_llm.functional.html">Functionals</a></li>
<li class="toctree-l1"><a class="reference internal" href="python-api/tensorrt_llm.models.html">Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="python-api/tensorrt_llm.plugin.html">Plugin</a></li>
<li class="toctree-l1"><a class="reference internal" href="python-api/tensorrt_llm.quantization.html">Quantization</a></li>
<li class="toctree-l1"><a class="reference internal" href="python-api/tensorrt_llm.runtime.html">Runtime</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Blogs</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="blogs/H100vsA100.html">H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token</a></li>
<li class="toctree-l1"><a class="reference internal" href="blogs/H200launch.html">H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="blogs/Falcon180B-H200.html">Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100</a></li>
<li class="toctree-l1"><a class="reference internal" href="blogs/quantization-in-TRT-LLM.html">Speed up inference with SOTA quantization techniques in TRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="blogs/XQA-kernel.html">New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="index.html">tensorrt_llm</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="Page navigation">
<ul class="wy-breadcrumbs">
<li><a href="index.html" class="icon icon-home" aria-label="Home"></a></li>
<li class="breadcrumb-item active">Executor API</li>
<li class="wy-breadcrumbs-aside">
<a href="_sources/executor.md.txt" rel="nofollow"> View page source</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<section id="executor-api">
<span id="executor"></span><h1>Executor API<a class="headerlink" href="#executor-api" title="Link to this heading"></a></h1>
<p>TensorRT-LLM includes a high-level C++ API called the Executor API which allows you to execute requests
asynchronously, with in-flight batching, and without the need to define callbacks.</p>
<p>A software component (referred to as “the client” in the text that follows) can interact
with the executor using the API defined in the <a class="reference external" href="https://github.com/NVIDIA/TensorRT-LLM/tree/main/cpp/include/tensorrt_llm/executor/executor.h"><code class="docutils literal notranslate"><span class="pre">executor.h</span></code></a> file.
For details about the API, refer to the <a class="reference internal" href="_cpp_gen/executor.html">C++ Executor API reference</a>.</p>
<p>The following sections provide an overview of the main classes defined in the Executor API.</p>
<section id="the-executor-class">
<h2>The Executor Class<a class="headerlink" href="#the-executor-class" title="Link to this heading"></a></h2>
<p>The <code class="docutils literal notranslate"><span class="pre">Executor</span></code> class is responsible for receiving requests from the client and providing responses to those requests. The executor is constructed from a path to a directory containing the TensorRT-LLM engine, or from buffers containing the engine and the model JSON configuration. The client can create requests and enqueue them for execution using the <code class="docutils literal notranslate"><span class="pre">enqueueRequest</span></code> or <code class="docutils literal notranslate"><span class="pre">enqueueRequests</span></code> methods of the <code class="docutils literal notranslate"><span class="pre">Executor</span></code> class. Enqueued requests are scheduled for execution by the executor, and multiple independent requests can be batched together at every iteration of the main execution loop (a process often referred to as continuous batching or iteration-level batching). Responses for a particular request can be awaited by calling the <code class="docutils literal notranslate"><span class="pre">awaitResponses</span></code> method with the request id. Alternatively, responses for any request can be awaited by calling <code class="docutils literal notranslate"><span class="pre">awaitResponses</span></code> without a request id. The <code class="docutils literal notranslate"><span class="pre">Executor</span></code> class also allows requests to be cancelled using the <code class="docutils literal notranslate"><span class="pre">cancelRequest</span></code> method, and per-iteration and per-request statistics to be obtained using the <code class="docutils literal notranslate"><span class="pre">getLatestIterationStats</span></code> method.</p>
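<p>The enqueue-then-await call pattern can be sketched as follows. The <code class="docutils literal notranslate"><span class="pre">Executor</span></code>, <code class="docutils literal notranslate"><span class="pre">Request</span></code>, and <code class="docutils literal notranslate"><span class="pre">Response</span></code> types below are minimal stand-ins written only for this illustration; a real client would instead include <code class="docutils literal notranslate"><span class="pre">executor.h</span></code> and use the actual <code class="docutils literal notranslate"><span class="pre">tensorrt_llm::executor</span></code> classes. Only the method names and the call sequence mirror the documented API.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Minimal stand-in types mirroring the documented call pattern. These are
// NOT the real tensorrt_llm::executor classes.
using IdType = std::uint64_t;

struct Request
{
    std::vector<int> inputTokenIds;
    int maxTokens;
    bool streaming;
};

struct Response
{
    IdType requestId;
    std::vector<int> outputTokenIds;
};

class Executor
{
public:
    // Enqueue a request and obtain an id used to await its responses.
    IdType enqueueRequest(Request const& req)
    {
        IdType id = nextId++;
        // Toy "generation" for illustration: echo the prompt, truncated
        // to maxTokens. The real executor runs the TensorRT engine.
        Response resp{id, req.inputTokenIds};
        if (static_cast<int>(resp.outputTokenIds.size()) > req.maxTokens)
            resp.outputTokenIds.resize(req.maxTokens);
        pending.push_back(resp);
        return id;
    }

    // Await the responses associated with a given request id.
    std::vector<Response> awaitResponses(IdType id)
    {
        std::vector<Response> out;
        for (auto const& r : pending)
            if (r.requestId == id)
                out.push_back(r);
        return out;
    }

private:
    IdType nextId = 1;
    std::deque<Response> pending;
};

// Typical client flow: construct the executor, enqueue, await.
std::vector<int> runRequestDemo()
{
    Executor executor;
    Request req{{11, 22, 33}, /*maxTokens=*/2, /*streaming=*/false};
    IdType id = executor.enqueueRequest(req);
    auto responses = executor.awaitResponses(id);
    return responses.front().outputTokenIds;
}
```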
<section id="logits-post-processor-optional">
<h3>Logits Post-Processor (optional)<a class="headerlink" href="#logits-post-processor-optional" title="Link to this heading"></a></h3>
<p>Users can alter the logits produced by the network by providing a map of named callbacks of the form:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">std</span><span class="p">::</span><span class="n">unordered_map</span><span class="o">&lt;</span><span class="n">std</span><span class="p">::</span><span class="n">string</span><span class="p">,</span> <span class="n">std</span><span class="p">::</span><span class="n">function</span><span class="o">&lt;</span><span class="n">Tensor</span><span class="p">(</span><span class="n">IdType</span><span class="p">,</span> <span class="n">Tensor</span><span class="o">&amp;</span><span class="p">,</span> <span class="n">BeamTokens</span> <span class="n">const</span><span class="o">&amp;</span><span class="p">,</span> <span class="n">StreamPtr</span> <span class="n">const</span><span class="o">&amp;</span><span class="p">,</span> <span class="n">std</span><span class="p">::</span><span class="n">optional</span><span class="o">&lt;</span><span class="n">IdType</span><span class="o">&gt;</span><span class="p">)</span><span class="o">&gt;&gt;</span>
</pre></div>
</div>
<p>to the <code class="docutils literal notranslate"><span class="pre">ExecutorConfig</span></code>. The map key is the name associated with that logits post-processing callback. Each request can then specify the name of the logits post-processor to use for that particular request, if any.</p>
<p>The first argument to the callback is the request id, second is the logits tensor, third are the tokens produced by the request so far, fourth is the operation stream used by the logits tensor, and last one is an optional client id. The callback returns a modified tensor of logits.</p>
<p>Users <em>must</em> use the stream to access the logits tensor. For example, performing an addition with a bias tensor should be enqueued on that stream.
Alternatively, users may call <code class="docutils literal notranslate"><span class="pre">stream-&gt;synchronize()</span></code>; however, that will slow down the entire execution pipeline.</p>
<p>Multiple requests can share the same client id, and the callback can apply different logic based on that client id.</p>
<p>We also provide a batched version that allows altering the logits of multiple requests in a batch. This enables further optimizations and reduces the callback overhead.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">std</span><span class="p">::</span><span class="n">function</span><span class="o">&lt;</span><span class="n">void</span><span class="p">(</span><span class="n">std</span><span class="p">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">IdType</span><span class="o">&gt;</span> <span class="n">const</span><span class="o">&amp;</span><span class="p">,</span> <span class="n">std</span><span class="p">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">Tensor</span><span class="o">&gt;&amp;</span><span class="p">,</span> <span class="n">std</span><span class="p">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">std</span><span class="p">::</span><span class="n">reference_wrapper</span><span class="o">&lt;</span><span class="n">BeamTokens</span> <span class="n">const</span><span class="o">&gt;&gt;</span> <span class="n">const</span><span class="o">&amp;</span><span class="p">,</span> <span class="n">StreamPtr</span> <span class="n">const</span><span class="o">&amp;</span><span class="p">,</span> <span class="n">std</span><span class="p">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">std</span><span class="p">::</span><span class="n">optional</span><span class="o">&lt;</span><span class="n">IdType</span><span class="o">&gt;&gt;</span> <span class="n">const</span><span class="o">&amp;</span><span class="p">)</span><span class="o">&gt;</span>
</pre></div>
</div>
<p>A single batched callback can be specified in <code class="docutils literal notranslate"><span class="pre">ExecutorConfig</span></code>. Each request can opt to apply this callback by specifying the name of the logits
post-processor as <code class="docutils literal notranslate"><span class="pre">Request::kBatchedPostProcessorName</span></code>.</p>
<p>Note: Neither callback variant is currently supported with the <code class="docutils literal notranslate"><span class="pre">STATIC</span></code> batching type.</p>
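<p>The per-request callback map described above can be sketched as follows. The <code class="docutils literal notranslate"><span class="pre">Tensor</span></code>, <code class="docutils literal notranslate"><span class="pre">BeamTokens</span></code>, and <code class="docutils literal notranslate"><span class="pre">StreamPtr</span></code> types are stand-ins for this illustration (the real definitions live in the executor headers), and the callback name <code class="docutils literal notranslate"><span class="pre">add_bias</span></code> is a hypothetical name chosen for the example.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in types for illustration only; the real definitions live in the
// tensorrt_llm executor headers.
using IdType = std::uint64_t;
using Tensor = std::vector<float>;                // stand-in for a logits tensor
using BeamTokens = std::vector<std::vector<int>>; // tokens produced so far, per beam
struct Stream {};                                 // stand-in for a CUDA stream
using StreamPtr = std::shared_ptr<Stream>;

// Callback signature matching the documented form.
using LogitsPostProcessor
    = std::function<Tensor(IdType, Tensor&, BeamTokens const&, StreamPtr const&, std::optional<IdType>)>;

// Build a named map of callbacks, as would be passed to ExecutorConfig.
// Each request then selects a callback by name.
std::unordered_map<std::string, LogitsPostProcessor> makeProcessorMap()
{
    std::unordered_map<std::string, LogitsPostProcessor> processors;
    processors["add_bias"] = [](IdType /*requestId*/, Tensor& logits, BeamTokens const& /*tokens*/,
                                 StreamPtr const& /*stream*/, std::optional<IdType> /*clientId*/) -> Tensor
    {
        // For illustration we add a constant bias on the host. In real code
        // this operation MUST be enqueued on the provided stream.
        for (float& v : logits)
            v += 1.0f;
        return logits;
    };
    return processors;
}
```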
</section>
</section>
<section id="the-request-class">
<h2>The Request Class<a class="headerlink" href="#the-request-class" title="Link to this heading"></a></h2>
<p>The <code class="docutils literal notranslate"><span class="pre">Request</span></code> class is used to define the properties of a request, such as the input token ids and the maximum number of tokens to generate. The <code class="docutils literal notranslate"><span class="pre">streaming</span></code> parameter indicates whether the request should generate a response for each newly generated token (<code class="docutils literal notranslate"><span class="pre">streaming</span> <span class="pre">=</span> <span class="pre">true</span></code>) or only after all tokens have been generated (<code class="docutils literal notranslate"><span class="pre">streaming</span> <span class="pre">=</span> <span class="pre">false</span></code>). Other mandatory parameters of the request include the sampling configuration (defined by the <code class="docutils literal notranslate"><span class="pre">SamplingConfig</span></code> class), which contains parameters controlling the decoding process, and the output configuration (defined by the <code class="docutils literal notranslate"><span class="pre">OutputConfig</span></code> class), which controls what information should be included in the <code class="docutils literal notranslate"><span class="pre">Result</span></code> for a particular response.</p>
<p>Optional parameters can also be provided when constructing a request, such as a list of bad words, a list of stop words, a client id, or configuration objects for prompt tuning, LoRA, or speculative decoding.</p>
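<p>The split between mandatory and optional request parameters can be sketched as follows. The structs below are simplified stand-ins written for this illustration; the real <code class="docutils literal notranslate"><span class="pre">Request</span></code>, <code class="docutils literal notranslate"><span class="pre">SamplingConfig</span></code>, and <code class="docutils literal notranslate"><span class="pre">OutputConfig</span></code> classes in <code class="docutils literal notranslate"><span class="pre">executor.h</span></code> carry many more fields.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Simplified stand-ins for the documented configuration classes.
struct SamplingConfig
{
    int beamWidth = 1;        // greedy decoding by default
    float temperature = 1.0f;
};

struct OutputConfig
{
    bool returnLogProbs = false; // what to include in the Result
};

struct Request
{
    // Mandatory: input tokens and generation budget.
    std::vector<int> inputTokenIds;
    int maxTokens = 0;
    bool streaming = false;
    SamplingConfig samplingConfig{};
    OutputConfig outputConfig{};
    // Optional: provided only when needed.
    std::optional<std::vector<std::vector<int>>> stopWords;
    std::optional<std::uint64_t> clientId;
};

// Build a streaming request with greedy sampling defaults.
Request makeStreamingRequest(std::vector<int> promptIds, int maxTokens)
{
    Request req;
    req.inputTokenIds = std::move(promptIds);
    req.maxTokens = maxTokens;
    req.streaming = true; // one response per generated token
    return req;
}
```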
</section>
<section id="the-response-class">
<h2>The Response Class<a class="headerlink" href="#the-response-class" title="Link to this heading"></a></h2>
<p>The <code class="docutils literal notranslate"><span class="pre">awaitResponses</span></code> method of the <code class="docutils literal notranslate"><span class="pre">Executor</span></code> class returns a vector of responses. Each response contains the id of the associated request and either an error or a <code class="docutils literal notranslate"><span class="pre">Result</span></code>. Check whether the response holds an error using the <code class="docutils literal notranslate"><span class="pre">hasError</span></code> method before trying to obtain its <code class="docutils literal notranslate"><span class="pre">Result</span></code> using the <code class="docutils literal notranslate"><span class="pre">getResult</span></code> method.</p>
</section>
<section id="the-result-class">
<h2>The Result Class<a class="headerlink" href="#the-result-class" title="Link to this heading"></a></h2>
<p>The <code class="docutils literal notranslate"><span class="pre">Result</span></code> class holds the result for a given request. It contains a Boolean parameter called <code class="docutils literal notranslate"><span class="pre">isFinal</span></code> that indicates whether this is the last <code class="docutils literal notranslate"><span class="pre">Result</span></code> that will be returned for the given request id, and it also contains the generated tokens. If the request is configured with <code class="docutils literal notranslate"><span class="pre">streaming</span> <span class="pre">=</span> <span class="pre">false</span></code>, the <code class="docutils literal notranslate"><span class="pre">isFinal</span></code> Boolean will be set to <code class="docutils literal notranslate"><span class="pre">true</span></code> and all generated tokens will be included in <code class="docutils literal notranslate"><span class="pre">outputTokenIds</span></code>. If <code class="docutils literal notranslate"><span class="pre">streaming</span> <span class="pre">=</span> <span class="pre">true</span></code> is used, each <code class="docutils literal notranslate"><span class="pre">Result</span></code> will include a single token, and the <code class="docutils literal notranslate"><span class="pre">isFinal</span></code> flag will be set to <code class="docutils literal notranslate"><span class="pre">true</span></code> for the last result associated with the request.</p>
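<p>A streaming client typically accumulates tokens from successive results until <code class="docutils literal notranslate"><span class="pre">isFinal</span></code> is seen. The sketch below uses a stand-in <code class="docutils literal notranslate"><span class="pre">Result</span></code> struct with only the two fields discussed above; the helper function <code class="docutils literal notranslate"><span class="pre">collectStreamedTokens</span></code> is written for this illustration.</p>

```cpp
#include <cassert>
#include <vector>

// Stand-in Result with just the two fields described above.
struct Result
{
    bool isFinal;
    std::vector<int> outputTokenIds; // a single token per result when streaming
};

// Accumulate tokens from streamed results until the final one is seen.
std::vector<int> collectStreamedTokens(std::vector<Result> const& results)
{
    std::vector<int> tokens;
    for (Result const& r : results)
    {
        tokens.insert(tokens.end(), r.outputTokenIds.begin(), r.outputTokenIds.end());
        if (r.isFinal)
            break; // no further results will arrive for this request id
    }
    return tokens;
}
```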
</section>
<section id="c-executor-api-example">
<h2>C++ Executor API Example<a class="headerlink" href="#c-executor-api-example" title="Link to this heading"></a></h2>
<p>Two C++ examples that show how to use the Executor API are provided in the <a class="reference external" href="https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/cpp/executor/"><code class="docutils literal notranslate"><span class="pre">examples/cpp/executor</span></code></a> folder.</p>
</section>
<section id="python-bindings-for-the-executor-api">
<h2>Python Bindings for the Executor API<a class="headerlink" href="#python-bindings-for-the-executor-api" title="Link to this heading"></a></h2>
<p>Python bindings are also provided so that the Executor API can be used from Python. The bindings are defined in <a class="reference external" href="https://github.com/NVIDIA/TensorRT-LLM/tree/main/cpp/tensorrt_llm/pybind/executor/bindings.cpp">bindings.cpp</a> and, once built, are available in the <code class="docutils literal notranslate"><span class="pre">tensorrt_llm.bindings.executor</span></code> package. Running <code class="docutils literal notranslate"><span class="pre">help('tensorrt_llm.bindings.executor')</span></code> in a Python interpreter will provide an overview of the available classes.</p>
<p>In addition, three Python examples are provided to demonstrate how to use the Python bindings to the Executor API for single and multi-GPU models. They can be found in <a class="reference external" href="https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/bindings"><code class="docutils literal notranslate"><span class="pre">examples/bindings</span></code></a>.</p>
</section>
</section>
</div>
</div>
<footer>
<hr/>
<div role="contentinfo">
<div class="footer">
<p>
Copyright © 2024 NVIDIA Corporation
</p>
<p>
<a class="Link" href="https://www.nvidia.com/en-us/about-nvidia/privacy-policy/" target="_blank" rel="noopener"
data-cms-ai="0">Privacy Policy</a> |
<a class="Link" href="https://www.nvidia.com/en-us/about-nvidia/privacy-center/" target="_blank" rel="noopener"
data-cms-ai="0">Manage My Privacy</a> |
<a class="Link" href="https://www.nvidia.com/en-us/preferences/start/" target="_blank" rel="noopener"
data-cms-ai="0">Do Not Sell or Share My Data</a> |
<a class="Link" href="https://www.nvidia.com/en-us/about-nvidia/terms-of-service/" target="_blank"
rel="noopener" data-cms-ai="0">Terms of Service</a> |
<a class="Link" href="https://www.nvidia.com/en-us/about-nvidia/accessibility/" target="_blank" rel="noopener"
data-cms-ai="0">Accessibility</a> |
<a class="Link" href="https://www.nvidia.com/en-us/about-nvidia/company-policies/" target="_blank"
rel="noopener" data-cms-ai="0">Corporate Policies</a> |
<a class="Link" href="https://www.nvidia.com/en-us/product-security/" target="_blank" rel="noopener"
data-cms-ai="0">Product Security</a> |
<a class="Link" href="https://www.nvidia.com/en-us/contact/" target="_blank" rel="noopener"
data-cms-ai="0">Contact</a>
</p>
</div>
</div>
</footer>
</div>
</div>
</section>
</div>
<script>
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>