TensorRT-LLMs/advanced/inference-request.html
nv-guomingz 85f78df69c
Update gh-pages for windows part doc. (#1979)
Co-authored-by: Guoming Zhang <37257613+nv-guomingz@users.noreply.github.com>
2024-07-18 11:18:09 +08:00


<!DOCTYPE html>
<html class="writer-html5" lang="en" data-content_root="../">
<head>
<meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Inference Request &mdash; tensorrt_llm documentation</title>
<link rel="stylesheet" type="text/css" href="../_static/pygments.css?v=80d5e7a1" />
<link rel="stylesheet" type="text/css" href="../_static/css/theme.css?v=19f00094" />
<!--[if lt IE 9]>
<script src="../_static/js/html5shiv.min.js"></script>
<![endif]-->
<script src="../_static/jquery.js?v=5d32c60e"></script>
<script src="../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
<script src="../_static/documentation_options.js?v=5929fcd5"></script>
<script src="../_static/doctools.js?v=9a2dae69"></script>
<script src="../_static/sphinx_highlight.js?v=dc90522c"></script>
<script src="../_static/js/theme.js"></script>
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="Run gpt-2b + LoRA using GptManager / cpp runtime" href="lora.html" />
<link rel="prev" title="The Batch Manager in TensorRT-LLM" href="batch-manager.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="../index.html" class="icon icon-home">
tensorrt_llm
</a>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
<p class="caption" role="heading"><span class="caption-text">Getting Started</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../overview.html">Overview</a></li>
<li class="toctree-l1"><a class="reference internal" href="../quick-start-guide.html">Quick Start Guide</a></li>
<li class="toctree-l1"><a class="reference internal" href="../release-notes.html">Release Notes</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Installation</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../installation/linux.html">Installing on Linux</a></li>
<li class="toctree-l1"><a class="reference internal" href="../installation/build-from-source-linux.html">Building from Source Code on Linux</a></li>
<li class="toctree-l1"><a class="reference internal" href="../installation/windows.html">Installing on Windows</a></li>
<li class="toctree-l1"><a class="reference internal" href="../installation/build-from-source-windows.html">Building from Source Code on Windows</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Architecture</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../architecture/overview.html">TensorRT-LLM Architecture</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/core-concepts.html">Model Definition</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/core-concepts.html#compilation">Compilation</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/core-concepts.html#runtime">Runtime</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/core-concepts.html#multi-gpu-and-multi-node-support">Multi-GPU and Multi-Node Support</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/checkpoint.html">TensorRT-LLM Checkpoint</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/workflow.html">TensorRT-LLM Build Workflow</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/add-model.html">Adding a Model</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Advanced</span></p>
<ul class="current">
<li class="toctree-l1"><a class="reference internal" href="gpt-attention.html">Multi-Head, Multi-Query, and Group-Query Attention</a></li>
<li class="toctree-l1"><a class="reference internal" href="gpt-runtime.html">C++ GPT Runtime</a></li>
<li class="toctree-l1"><a class="reference internal" href="graph-rewriting.html">Graph Rewriting Module</a></li>
<li class="toctree-l1"><a class="reference internal" href="batch-manager.html">The Batch Manager in TensorRT-LLM</a></li>
<li class="toctree-l1 current"><a class="current reference internal" href="#">Inference Request</a></li>
<li class="toctree-l1"><a class="reference internal" href="#responses">Responses</a></li>
<li class="toctree-l1"><a class="reference internal" href="lora.html">Run gpt-2b + LoRA using GptManager / cpp runtime</a></li>
<li class="toctree-l1"><a class="reference internal" href="expert-parallelism.html">Expert Parallelism in TensorRT-LLM</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Performance</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../performance/perf-overview.html">Overview</a></li>
<li class="toctree-l1"><a class="reference internal" href="../performance/perf-best-practices.html">Best Practices for Tuning the Performance of TensorRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="../performance/perf-analysis.html">Performance Analysis</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Reference</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../reference/troubleshooting.html">Troubleshooting</a></li>
<li class="toctree-l1"><a class="reference internal" href="../reference/support-matrix.html">Support Matrix</a></li>
<li class="toctree-l1"><a class="reference internal" href="../reference/precision.html">Numerical Precision</a></li>
<li class="toctree-l1"><a class="reference internal" href="../reference/memory.html">Memory Usage of TensorRT-LLM</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">C++ API</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../_cpp_gen/executor.html">Executor</a></li>
<li class="toctree-l1"><a class="reference internal" href="../_cpp_gen/runtime.html">Runtime</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Blogs</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../blogs/H100vsA100.html">H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token</a></li>
<li class="toctree-l1"><a class="reference internal" href="../blogs/H200launch.html">H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="../blogs/Falcon180B-H200.html">Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100</a></li>
<li class="toctree-l1"><a class="reference internal" href="../blogs/quantization-in-TRT-LLM.html">Speed up inference with SOTA quantization techniques in TRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="../blogs/XQA-kernel.html">New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">tensorrt_llm</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="Page navigation">
<ul class="wy-breadcrumbs">
<li><a href="../index.html" class="icon icon-home" aria-label="Home"></a></li>
<li class="breadcrumb-item active">Inference Request</li>
<li class="wy-breadcrumbs-aside">
<a href="../_sources/advanced/inference-request.md.txt" rel="nofollow"> View page source</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<section id="inference-request">
<span id="id1"></span><h1>Inference Request<a class="headerlink" href="#inference-request" title="Link to this heading"></a></h1>
<p>The main class for describing requests to <code class="docutils literal notranslate"><span class="pre">GptManager</span></code> is <code class="docutils literal notranslate"><span class="pre">InferenceRequest</span></code>. It is structured as a map of tensors and a <code class="docutils literal notranslate"><span class="pre">uint64_t</span> <span class="pre">requestId</span></code>.
The input tensors required to create a valid <code class="docutils literal notranslate"><span class="pre">InferenceRequest</span></code> object are described below. Sampling config parameters are documented in the <a class="reference internal" href="gpt-runtime.html#gpt-runtime"><span class="std std-ref">C++ GPT Runtime</span></a> section, so their descriptions are omitted from the tables on this page.</p>
<table class="docutils align-default">
<thead>
<tr class="row-odd"><th class="head text-center"><p>Name</p></th>
<th class="head text-center"><p>Shape</p></th>
<th class="head text-center"><p>Type</p></th>
<th class="head text-center"><p>Description</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">request_output_len</span></code></p></td>
<td class="text-center"><p>[1,1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">int32_t</span></code></p></td>
<td class="text-center"><p>Max number of output tokens</p></td>
</tr>
<tr class="row-odd"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">input_ids</span></code></p></td>
<td class="text-center"><p>[1, num_input_tokens]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">int32_t</span></code></p></td>
<td class="text-center"><p>Tensor of input tokens</p></td>
</tr>
</tbody>
</table>
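<p>As a rough illustration of the "map of tensors plus request ID" structure, the sketch below builds a request that carries only the two mandatory tensors with the shapes listed above. <code class="docutils literal notranslate"><span class="pre">Int32Tensor</span></code>, <code class="docutils literal notranslate"><span class="pre">RequestSketch</span></code>, and <code class="docutils literal notranslate"><span class="pre">makeRequest</span></code> are hypothetical stand-ins invented for this example; they are not the TensorRT-LLM types or API.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a named int32_t tensor (not a TensorRT-LLM type).
struct Int32Tensor {
    std::vector<std::int64_t> shape;
    std::vector<std::int32_t> data;
};

// Hypothetical stand-in for InferenceRequest: a request ID plus a tensor map.
struct RequestSketch {
    std::uint64_t requestId;
    std::map<std::string, Int32Tensor> tensors;
};

// Build a request that holds only the two mandatory tensors.
RequestSketch makeRequest(std::uint64_t requestId,
                          const std::vector<std::int32_t>& inputIds,
                          std::int32_t maxOutputTokens) {
    assert(!inputIds.empty() && maxOutputTokens > 0);
    RequestSketch request{requestId, {}};
    // input_ids has shape [1, num_input_tokens].
    request.tensors["input_ids"] =
        Int32Tensor{{1, static_cast<std::int64_t>(inputIds.size())}, inputIds};
    // request_output_len has shape [1, 1].
    request.tensors["request_output_len"] = Int32Tensor{{1, 1}, {maxOutputTokens}};
    return request;
}
```

<p>For example, <code class="docutils literal notranslate"><span class="pre">makeRequest(42,</span> <span class="pre">{101,</span> <span class="pre">102,</span> <span class="pre">103},</span> <span class="pre">16)</span></code> would describe a three-token prompt with at most 16 output tokens.</p>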
<p>Optional tensors that can be supplied to <code class="docutils literal notranslate"><span class="pre">InferenceRequest</span></code> are shown below. Default values are specified where applicable:</p>
<table class="docutils align-default">
<thead>
<tr class="row-odd"><th class="head text-center"><p>Name</p></th>
<th class="head text-center"><p>Shape</p></th>
<th class="head text-center"><p>Type</p></th>
<th class="head text-center"><p>Description</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">streaming</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">bool</span></code></p></td>
<td class="text-center"><p>(Default=<code class="docutils literal notranslate"><span class="pre">false</span></code>). When <code class="docutils literal notranslate"><span class="pre">true</span></code>, stream out tokens as they are generated. When <code class="docutils literal notranslate"><span class="pre">false</span></code>, return only when the full generation has completed.</p></td>
</tr>
<tr class="row-odd"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">beam_width</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">int32_t</span></code></p></td>
<td class="text-center"><p>(Default=1) Beam width for this request; set to 1 for greedy sampling</p></td>
</tr>
<tr class="row-even"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">temperature</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">float</span></code></p></td>
<td class="text-center"><p>Sampling Config param: <code class="docutils literal notranslate"><span class="pre">temperature</span></code></p></td>
</tr>
<tr class="row-odd"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">runtime_top_k</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">int32_t</span></code></p></td>
<td class="text-center"><p>Sampling Config param: <code class="docutils literal notranslate"><span class="pre">topK</span></code></p></td>
</tr>
<tr class="row-even"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">runtime_top_p</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">float</span></code></p></td>
<td class="text-center"><p>Sampling Config param: <code class="docutils literal notranslate"><span class="pre">topP</span></code></p></td>
</tr>
<tr class="row-odd"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">len_penalty</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">float</span></code></p></td>
<td class="text-center"><p>Sampling Config param: <code class="docutils literal notranslate"><span class="pre">lengthPenalty</span></code></p></td>
</tr>
<tr class="row-even"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">early_stopping</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">int</span></code></p></td>
<td class="text-center"><p>Sampling Config param: <code class="docutils literal notranslate"><span class="pre">earlyStopping</span></code></p></td>
</tr>
<tr class="row-odd"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">repetition_penalty</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">float</span></code></p></td>
<td class="text-center"><p>Sampling Config param: <code class="docutils literal notranslate"><span class="pre">repetitionPenalty</span></code></p></td>
</tr>
<tr class="row-even"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">min_length</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">int32_t</span></code></p></td>
<td class="text-center"><p>Sampling Config param: <code class="docutils literal notranslate"><span class="pre">minLength</span></code></p></td>
</tr>
<tr class="row-odd"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">presence_penalty</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">float</span></code></p></td>
<td class="text-center"><p>Sampling Config param: <code class="docutils literal notranslate"><span class="pre">presencePenalty</span></code></p></td>
</tr>
<tr class="row-even"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">frequency_penalty</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">float</span></code></p></td>
<td class="text-center"><p>Sampling Config param: <code class="docutils literal notranslate"><span class="pre">frequencyPenalty</span></code></p></td>
</tr>
<tr class="row-odd"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">no_repeat_ngram_size</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">int32_t</span></code></p></td>
<td class="text-center"><p>Sampling Config param: <code class="docutils literal notranslate"><span class="pre">noRepeatNgramSize</span></code></p></td>
</tr>
<tr class="row-even"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">random_seed</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">uint64_t</span></code></p></td>
<td class="text-center"><p>Sampling Config param: <code class="docutils literal notranslate"><span class="pre">randomSeed</span></code></p></td>
</tr>
<tr class="row-odd"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">end_id</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">int32_t</span></code></p></td>
<td class="text-center"><p>End token Id. If not specified, defaults to -1</p></td>
</tr>
<tr class="row-even"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">pad_id</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">int32_t</span></code></p></td>
<td class="text-center"><p>Pad token Id</p></td>
</tr>
<tr class="row-odd"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">embedding_bias</span></code></p></td>
<td class="text-center"><p>[1, vocab_size]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">float</span></code></p></td>
<td class="text-center"><p>The bias is added to the logits for each token in the vocabulary before decoding occurs. Positive values in the bias encourage the sampling of tokens, while negative values discourage it. A value of <code class="docutils literal notranslate"><span class="pre">0.f</span></code> leaves the logit value unchanged.</p></td>
</tr>
<tr class="row-even"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">bad_words_list</span></code></p></td>
<td class="text-center"><p>[2, num_bad_words]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">int32_t</span></code></p></td>
<td class="text-center"><p>Bad words list</p></td>
</tr>
<tr class="row-odd"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">stop_words_list</span></code></p></td>
<td class="text-center"><p>[2, num_stop_words]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">int32_t</span></code></p></td>
<td class="text-center"><p>Stop words list</p></td>
</tr>
<tr class="row-even"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">prompt_embedding_table</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">float16</span></code></p></td>
<td class="text-center"><p>P-tuning prompt embedding table</p></td>
</tr>
<tr class="row-odd"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">prompt_vocab_size</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">int32_t</span></code></p></td>
<td class="text-center"><p>P-tuning prompt vocab size</p></td>
</tr>
<tr class="row-even"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">lora_task_id</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">uint64_t</span></code></p></td>
<td class="text-center"><p>Task ID for the given lora_weights. This ID is expected to be globally unique. To perform inference with a specific LoRA for the first time, <code class="docutils literal notranslate"><span class="pre">lora_task_id</span></code>, <code class="docutils literal notranslate"><span class="pre">lora_weights</span></code>, and <code class="docutils literal notranslate"><span class="pre">lora_config</span></code> must all be given. The LoRA will be cached, so that subsequent requests for the same task only require <code class="docutils literal notranslate"><span class="pre">lora_task_id</span></code>. If the cache is full, the oldest LoRA will be evicted to make space for new ones. An error is returned if a request supplies only <code class="docutils literal notranslate"><span class="pre">lora_task_id</span></code> and that task is not cached.</p></td>
</tr>
<tr class="row-odd"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">lora_weights</span></code></p></td>
<td class="text-center"><p>[num_lora_modules_layers, D x Hi + Ho x D]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">float</span></code> (model data type)</p></td>
<td class="text-center"><p>Weights for a LoRA adapter. Refer to <a class="reference internal" href="lora.html#lora"><span class="std std-ref">Run gpt-2b + LoRA using GptManager / cpp runtime</span></a> for more information.</p></td>
</tr>
<tr class="row-even"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">lora_config</span></code></p></td>
<td class="text-center"><p>[num_lora_modules_layers, 3]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">int32_t</span></code></p></td>
<td class="text-center"><p>LoRA configuration tensor. <code class="docutils literal notranslate"><span class="pre">[</span> <span class="pre">module_id,</span> <span class="pre">layer_idx,</span> <span class="pre">adapter_size</span> <span class="pre">(D</span> <span class="pre">aka</span> <span class="pre">R</span> <span class="pre">value)</span> <span class="pre">]</span></code> Refer to <a class="reference internal" href="lora.html#lora"><span class="std std-ref">Run gpt-2b + LoRA using GptManager / cpp runtime</span></a> for more information.</p></td>
</tr>
<tr class="row-odd"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">return_log_probs</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">bool</span></code></p></td>
<td class="text-center"><p>When <code class="docutils literal notranslate"><span class="pre">true</span></code>, include log probs in the output</p></td>
</tr>
<tr class="row-even"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">return_context_logits</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">bool</span></code></p></td>
<td class="text-center"><p>When <code class="docutils literal notranslate"><span class="pre">true</span></code>, include context logits in the output</p></td>
</tr>
<tr class="row-odd"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">return_generation_logits</span></code></p></td>
<td class="text-center"><p>[1]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">bool</span></code></p></td>
<td class="text-center"><p>When <code class="docutils literal notranslate"><span class="pre">true</span></code>, include generation logits in the output</p></td>
</tr>
<tr class="row-even"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">draft_input_ids</span></code></p></td>
<td class="text-center"><p>[num_draft_tokens]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">int32_t</span></code></p></td>
<td class="text-center"><p>Draft tokens to be leveraged in generation phase to potentially generate multiple output tokens in one inflight batching iteration</p></td>
</tr>
<tr class="row-odd"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">draft_logits</span></code></p></td>
<td class="text-center"><p>[num_draft_tokens, vocab_size]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">float</span></code></p></td>
<td class="text-center"><p>Draft logits associated with <code class="docutils literal notranslate"><span class="pre">draft_input_ids</span></code> to be leveraged in generation phase to potentially generate multiple output tokens in one inflight batching iteration</p></td>
</tr>
</tbody>
</table>
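<p>The effect of <code class="docutils literal notranslate"><span class="pre">embedding_bias</span></code> described in the table can be sketched as a per-token addition to one row of logits before a token is chosen. This is a minimal illustration, not the library's decoder: <code class="docutils literal notranslate"><span class="pre">applyEmbeddingBias</span></code> and <code class="docutils literal notranslate"><span class="pre">argmaxToken</span></code> are hypothetical helpers, and greedy selection is assumed only to make the effect visible.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Add a per-token bias to one row of logits; both hold vocab_size entries.
// A bias of 0.f leaves the corresponding logit unchanged.
std::vector<float> applyEmbeddingBias(std::vector<float> logits,
                                      const std::vector<float>& bias) {
    assert(logits.size() == bias.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        logits[i] += bias[i];
    }
    return logits;
}

// Greedy choice (assumed for illustration): index of the largest logit.
std::size_t argmaxToken(const std::vector<float>& logits) {
    assert(!logits.empty());
    return static_cast<std::size_t>(
        std::distance(logits.begin(),
                      std::max_element(logits.begin(), logits.end())));
}
```

<p>With logits <code class="docutils literal notranslate"><span class="pre">{1.0f,</span> <span class="pre">2.0f,</span> <span class="pre">0.5f}</span></code>, greedy selection picks token 1; adding a bias of <code class="docutils literal notranslate"><span class="pre">{0.f,</span> <span class="pre">-5.f,</span> <span class="pre">0.f}</span></code> discourages that token, and token 0 is picked instead.</p>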
</section>
<section id="responses">
<h1>Responses<a class="headerlink" href="#responses" title="Link to this heading"></a></h1>
<p>Responses from GptManager are formatted as a list of tensors. The table below shows the set of output tensors returned by <code class="docutils literal notranslate"><span class="pre">GptManager</span></code> (via the <code class="docutils literal notranslate"><span class="pre">SendResponseCallback</span></code>):</p>
<table class="docutils align-default">
<thead>
<tr class="row-odd"><th class="head text-center"><p>Name</p></th>
<th class="head text-center"><p>Shape</p></th>
<th class="head text-center"><p>Type</p></th>
<th class="head text-center"><p>Description</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">output_ids</span></code></p></td>
<td class="text-center"><p>[beam_width, num_output_tokens]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">int32_t</span></code></p></td>
<td class="text-center"><p>Tensor of output tokens. When <code class="docutils literal notranslate"><span class="pre">streaming</span></code> is enabled, this is a single token.</p></td>
</tr>
<tr class="row-odd"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">sequence_length</span></code></p></td>
<td class="text-center"><p>[beam_width]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">int32_t</span></code></p></td>
<td class="text-center"><p>Number of output tokens. When <code class="docutils literal notranslate"><span class="pre">streaming</span></code> is set, this will be 1.</p></td>
</tr>
<tr class="row-even"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">output_log_probs</span></code></p></td>
<td class="text-center"><p>[1, beam_width, num_output_tokens]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">float</span></code></p></td>
<td class="text-center"><p>Only if <code class="docutils literal notranslate"><span class="pre">return_log_probs</span></code> is set on input. Tensor of log probabilities of the output tokens.</p></td>
</tr>
<tr class="row-odd"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">cum_log_probs</span></code></p></td>
<td class="text-center"><p>[1, beam_width]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">float</span></code></p></td>
<td class="text-center"><p>Only if <code class="docutils literal notranslate"><span class="pre">return_log_probs</span></code> is set on input. Cumulative log probability of the sequence generated.</p></td>
</tr>
<tr class="row-even"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">context_logits</span></code></p></td>
<td class="text-center"><p>[1, num_input_tokens, vocab_size]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">float</span></code></p></td>
<td class="text-center"><p>Only if <code class="docutils literal notranslate"><span class="pre">return_context_logits</span></code> is set on input. Tensor of input token logits.</p></td>
</tr>
<tr class="row-odd"><td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">generation_logits</span></code></p></td>
<td class="text-center"><p>[1, beam_width, num_output_tokens, vocab_size]</p></td>
<td class="text-center"><p><code class="docutils literal notranslate"><span class="pre">float</span></code></p></td>
<td class="text-center"><p>Only if <code class="docutils literal notranslate"><span class="pre">return_generation_logits</span></code> is set on input. Tensor of output token logits.</p></td>
</tr>
</tbody>
</table>
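<p>Because <code class="docutils literal notranslate"><span class="pre">output_ids</span></code> holds a single token per response when <code class="docutils literal notranslate"><span class="pre">streaming</span></code> is enabled, a client typically has to accumulate tokens across responses. The sketch below shows one way to do that for a request with a beam width of 1. <code class="docutils literal notranslate"><span class="pre">ResponseSketch</span></code> and <code class="docutils literal notranslate"><span class="pre">TokenCollector</span></code> are hypothetical types for this example, and the <code class="docutils literal notranslate"><span class="pre">isFinal</span></code> flag is an assumption about how the last response is signalled; none of this is the <code class="docutils literal notranslate"><span class="pre">SendResponseCallback</span></code> signature.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of one response (beam_width == 1).
struct ResponseSketch {
    std::vector<std::int32_t> outputIds;  // contents of output_ids
    bool isFinal;                         // assumed end-of-request marker
};

// Collects the generated tokens of one request from its responses.
class TokenCollector {
public:
    explicit TokenCollector(bool streaming) : streaming_(streaming) {}

    void onResponse(const ResponseSketch& response) {
        if (streaming_) {
            // Streaming: each response carries exactly one new token.
            assert(response.outputIds.size() == 1);
            tokens_.push_back(response.outputIds.front());
        } else {
            // Non-streaming: a single response carries the full output.
            tokens_ = response.outputIds;
        }
        done_ = response.isFinal;
    }

    const std::vector<std::int32_t>& tokens() const { return tokens_; }
    bool done() const { return done_; }

private:
    bool streaming_;
    bool done_ = false;
    std::vector<std::int32_t> tokens_;
};
```

<p>Under these assumptions, three streamed responses carrying tokens 7, 8, and 9 yield the same collected sequence as one non-streamed response carrying all three tokens.</p>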
</section>
</div>
</div>
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
<a href="batch-manager.html" class="btn btn-neutral float-left" title="The Batch Manager in TensorRT-LLM" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
<a href="lora.html" class="btn btn-neutral float-right" title="Run gpt-2b + LoRA using GptManager / cpp runtime" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
</div>
<hr/>
<div role="contentinfo">
<div class="footer">
<p>
Copyright © 2024 NVIDIA Corporation
</p>
<p>
<a class="Link" href="https://www.nvidia.com/en-us/about-nvidia/privacy-policy/" target="_blank" rel="noopener"
data-cms-ai="0">Privacy Policy</a> |
<a class="Link" href="https://www.nvidia.com/en-us/about-nvidia/privacy-center/" target="_blank" rel="noopener"
data-cms-ai="0">Manage My Privacy</a> |
<a class="Link" href="https://www.nvidia.com/en-us/preferences/start/" target="_blank" rel="noopener"
data-cms-ai="0">Do Not Sell or Share My Data</a> |
<a class="Link" href="https://www.nvidia.com/en-us/about-nvidia/terms-of-service/" target="_blank"
rel="noopener" data-cms-ai="0">Terms of Service</a> |
<a class="Link" href="https://www.nvidia.com/en-us/about-nvidia/accessibility/" target="_blank" rel="noopener"
data-cms-ai="0">Accessibility</a> |
<a class="Link" href="https://www.nvidia.com/en-us/about-nvidia/company-policies/" target="_blank"
rel="noopener" data-cms-ai="0">Corporate Policies</a> |
<a class="Link" href="https://www.nvidia.com/en-us/product-security/" target="_blank" rel="noopener"
data-cms-ai="0">Product Security</a> |
<a class="Link" href="https://www.nvidia.com/en-us/contact/" target="_blank" rel="noopener"
data-cms-ai="0">Contact</a>
</p>
</div>
</div>
</footer>
</div>
</div>
</section>
</div>
<script>
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>