<section id="c-gpt-runtime">
<span id="gpt-runtime"></span><h1>C++ GPT Runtime<a class="headerlink" href="#c-gpt-runtime" title="Link to this heading"></a></h1>
TensorRT-LLM includes a C++ component to execute TensorRT engines built with the Python API, as described in the [TensorRT-LLM Architecture](../architecture/overview.html) section. That component is called the C++ runtime.

The API of the C++ runtime is composed of the classes declared in [`cpp/include/tensorrt_llm/runtime`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/cpp/include/tensorrt_llm/runtime) and implemented in [`cpp/tensorrt_llm/runtime`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/cpp/tensorrt_llm/runtime).

Although the components described in this document have GPT in their names, they are not restricted to that specific model. These classes can be used to implement auto-regressive models such as BLOOM, GPT-J, GPT-NeoX, or LLaMA.

Complete support for encoder-decoder models, like T5, will be added to TensorRT-LLM in a future release. An experimental, Python-only version can be found in the [`examples/enc_dec`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec) folder.
<section id="overview">
<h2>Overview<a class="headerlink" href="#overview" title="Link to this heading"></a></h2>
<p>Runtime models are described by an instance of the
<a class="reference external" href="https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/runtime//modelConfig.h"><code class="docutils literal notranslate"><span class="pre">ModelConfig</span></code></a>
class and a pointer to the TensorRT engine that must be
executed to perform the inference.
The environment is configured through the
<a class="reference external" href="https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/runtime/worldConfig.h"><code class="docutils literal notranslate"><span class="pre">WorldConfig</span></code></a>
(that name comes from
<a class="reference external" href="https://en.wikipedia.org/wiki/Message_Passing_Interface">MPI</a> and its “famous”
<code class="docutils literal notranslate"><span class="pre">MPI_COMM_WORLD</span></code> default communicator).
The <a class="reference external" href="https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/runtime/samplingConfig.h"><code class="docutils literal notranslate"><span class="pre">SamplingConfig</span></code></a>
class encapsulates parameters that control the
<a class="reference external" href="https://huggingface.co/blog/how-to-generate">generation</a> of new tokens.</p>
<section id="model-configuration">
<h3>Model Configuration<a class="headerlink" href="#model-configuration" title="Link to this heading"></a></h3>
<p>The model configuration is an instance of the
<a class="reference external" href="https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/runtime//modelConfig.h"><code class="docutils literal notranslate"><span class="pre">ModelConfig</span></code></a> class.
That class encapsulates the following parameters (they are declared as private
member variables and exposed through getters and setters):</p>
<ul class="simple">
<li><p><code class="docutils literal notranslate"><span class="pre">vocabSize</span></code>, the size of the vocabulary,</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">numLayers</span></code>, the number of layers in the model,</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">numHeads</span></code>, the number of heads in the attention block,</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">numKvHeads</span></code>, the number of heads for K and V in the attention component.
When the number of K/V heads is the same as the number of (Q) heads, the
model uses multi-head attention. When the number of K/V heads is 1, it uses
multi-query attention. Otherwise, it uses group-query attention. Refer to <a class="reference internal" href="gpt-attention.html#gpt-attention"><span class="std std-ref">Multi-Head, Multi-Query, and Group-Query Attention</span></a> for more information,</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">hiddenSize</span></code>, the size of the hidden dimension,</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">dataType</span></code>, the datatype that was used to build the TensorRT engine and that
must be used to run the model during inference,</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">useGptAttentionPlugin</span></code>, indicates if the <a class="reference internal" href="gpt-attention.html#gpt-attention"><span class="std std-ref">Multi-Head, Multi-Query, and Group-Query Attention</span></a> operator was compiled using the
<a class="reference external" href="https://github.com/NVIDIA/TensorRT-LLM/tree/main/cpp/tensorrt_llm/plugins/gptAttentionPlugin">GPT Attention plugin</a>,</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">inputPacked</span></code>, indicates that the input must be packed (or padded when set
to <code class="docutils literal notranslate"><span class="pre">false</span></code>). For performance reasons, it is recommended to always use packed,
even if its default is set to <code class="docutils literal notranslate"><span class="pre">false</span></code> (will be changed in a future release).
Refer to <a class="reference internal" href="gpt-attention.html#gpt-attention"><span class="std std-ref">Multi-Head, Multi-Query, and Group-Query Attention</span></a> for more information,</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">pagedKvCache</span></code>, indicates if the K/V cache uses paging.
Refer to <a class="reference internal" href="gpt-attention.html#gpt-attention"><span class="std std-ref">Multi-Head, Multi-Query, and Group-Query Attention</span></a> for more information,</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">tokensPerBlock</span></code>, is the number of tokens in each block of the K/V cache.
Its relevant when the paged K/V cache is enabled. By default, the value is
64. Refer to <a class="reference internal" href="gpt-attention.html#gpt-attention"><span class="std std-ref">Multi-Head, Multi-Query, and Group-Query Attention</span></a> for more information,</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">quantMode</span></code>, controls the quantization method. Refer to <a class="reference internal" href="../reference/precision.html#precision"><span class="std std-ref">Numerical Precision</span></a> for more information.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">maxBatchSize</span></code>, indicates the maximum batch size that the TensorRT engine
was built for,</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">maxInputLen</span></code>, the maximum size of the input sequences,</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">maxSequenceLen</span></code>, the maximum total size (input+output) of the sequences.</p></li>
</ul>
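The following is a minimal construction sketch. The constructor arguments and method names are assumptions based on the field list above (the class has evolved across releases); check `modelConfig.h` for the exact signatures.

```cpp
#include <tensorrt_llm/runtime/modelConfig.h>

using namespace tensorrt_llm::runtime;

// Assumed constructor: core shape parameters plus the engine data type.
ModelConfig modelConfig(/* vocabSize */ 50257, /* numLayers */ 24,
                        /* numHeads */ 16, /* hiddenSize */ 2048,
                        nvinfer1::DataType::kHALF);

// Optional features are toggled through setters (assumed naming).
modelConfig.useGptAttentionPlugin(true); // engine built with the GPT Attention plugin
modelConfig.usePackedInput(true);        // packed inputs are recommended
modelConfig.usePagedKvCache(true);       // enable the paged K/V cache
modelConfig.setTokensPerBlock(64);       // K/V cache block size (64 is the default)

// Engine limits recorded at build time.
modelConfig.setMaxBatchSize(8);
modelConfig.setMaxInputLen(512);
modelConfig.setMaxSequenceLen(1024);
```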
<section id="world-configuration">
<h3>World Configuration<a class="headerlink" href="#world-configuration" title="Link to this heading"></a></h3>
<p>Familiarity with
<a class="reference external" href="https://en.wikipedia.org/wiki/Message_Passing_Interface">MPI</a>, is not required
to utilize the TensorRT-LMM C++ runtime. There are two main things
you need to know:</p>
<ul class="simple">
<li><p>The C++ Runtime in TensorRT-LLM uses
<a class="reference external" href="https://en.wikipedia.org/wiki/Process_(computing)">processes</a> to execute
TensorRT engines on the different GPUs. Those GPUs can be located on a single
node as well as on different nodes in a cluster. Each process is called a
<em>rank</em> in MPI.</p></li>
<li><p>The ranks are grouped in communication groups. The
TensorRT-LLM C++ Runtime calls that group the <em>world</em>.</p></li>
</ul>
<p>The world configuration is an instance of the
<a class="reference external" href="https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/runtime/worldConfig.h"><code class="docutils literal notranslate"><span class="pre">WorldConfig</span></code></a>
class, which encapsulates the following parameters:</p>
<ul class="simple">
<li><p><code class="docutils literal notranslate"><span class="pre">tensorParallelism</span></code>, the number of ranks that collaborate together to
implement Tensor Parallelism (TP). With TP, each GPU performs computations for
all the layers of the model. Some of those computations are distributed
across the GPU. TP is more balanced than Pipeline Parallelism (PP), in most cases, but
requires higher bandwidth between the GPUs. It is the recommended setting in
the presence of NVLINK between GPUs,</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">pipelineParallelism</span></code>, the number of ranks that collaborate together to
implement Pipeline Parallelism (PP). With PP, each GPU works on a subset of
consecutive layers. Communications between the GPUs happen only at the
boundaries of the subsets of layers. It is harder to guarantee the full
utilization of the GPUs with PP but it requires less memory bandwidth. It
is the recommended setting in the absence of NVLINK between GPUs,</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">rank</span></code>, the unique identifier of the rank,</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">gpusPerNode</span></code>, indicates the number of GPUs on each node. Having that
information allows the C++ runtime to optimize communications between GPUs in
a node (like taking advantage of the
<a class="reference external" href="https://www.nvidia.com/en-us/data-center/nvlink/">NVLINK</a>
interconnect between GPUs of an A100
<a class="reference external" href="https://www.nvidia.com/en-us/data-center/dgx-platform/">DGX</a>
node).</p></li>
</ul>
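As a sketch, a world of eight GPUs split between tensor and pipeline parallelism might be configured as follows. The `WorldConfig::mpi` factory used here (which derives the rank of the current process from the MPI environment) reflects the header at the time of writing, but treat the exact signature as an assumption and check `worldConfig.h`.

```cpp
#include <tensorrt_llm/runtime/worldConfig.h>

using namespace tensorrt_llm::runtime;

// Assumed static factory: describe an 8-GPU world as TP=4 x PP=2 and let
// the runtime derive this process's rank from the MPI environment.
auto const worldConfig = WorldConfig::mpi(/* gpusPerNode */ 8,
                                          /* tensorParallelism */ 4,
                                          /* pipelineParallelism */ 2);

// Every rank executes the same program; the rank selects its model slice.
auto const rank = worldConfig.getRank();
```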
<section id="sampling-parameters">
<h3>Sampling Parameters<a class="headerlink" href="#sampling-parameters" title="Link to this heading"></a></h3>
<p>The <a class="reference external" href="https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/runtime/samplingConfig.h"><code class="docutils literal notranslate"><span class="pre">SamplingConfig</span></code></a>
class encapsulates parameters that control the
<a class="reference external" href="https://huggingface.co/blog/how-to-generate">generation</a> of new tokens.
A comparison of selecting decoding method is listed as the table below (<code class="docutils literal notranslate"><span class="pre">X</span></code> means it is not supported yet).
Except for the <code class="docutils literal notranslate"><span class="pre">beamWidth</span></code> parameter, all the fields are optional and the
runtime will use a default value if no values are provided by the user. For
vector fields, the TensorRT-LLM runtime supports one value per sequence (that is,
the vector contains <code class="docutils literal notranslate"><span class="pre">batchSize</span></code> values). If all the sequences use the same
value for a given parameter, the vector can be limited to a single element
(that is, <code class="docutils literal notranslate"><span class="pre">size()</span> <span class="pre">==</span> <span class="pre">1</span></code>).</p>
| Method name in HF | Condition in HF | Method name in TRT-LLM | Condition in TRT-LLM |
| :---: | :---: | :---: | :---: |
| assisted decoding | `assistant_model` or `prompt_lookup_num_tokens!=None` | X | |
| beam-search decoding | `num_beams>1` and `do_sample=False` | beam search | `beamWidth > 1` |
| beam-search multinomial sampling | `num_beams>1` and `do_sample=True` | X | |
| constrained beam-search decoding | `constraints!=None` or `force_words_ids!=None` | X | |
| contrastive search | `penalty_alpha>0` and `top_k>1` | X | |
| diverse beam-search decoding | `num_beams>1` and `num_beam_groups>1` | X | |
| greedy decoding | `num_beams=1` and `do_sample=False` | sampling | `beamWidth == 1` and `topK=0` and `topP=0.0f` |
| multinomial sampling | `num_beams=1` and `do_sample=True` | sampling | `beamWidth == 1` and (`topK>0` or `topP>0.0f`) |
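For illustration, the two TRT-LLM `sampling` rows map onto `SamplingConfig` as sketched below. The sketch assumes the class exposes its optional fields as public vectors with one value per sequence (or a single shared value), as declared in `samplingConfig.h` at the time of writing.

```cpp
#include <tensorrt_llm/runtime/samplingConfig.h>

#include <vector>

using namespace tensorrt_llm::runtime;

// Greedy decoding: beamWidth == 1, topK = 0 and topP = 0.0f.
SamplingConfig greedy{/* beamWidth */ 1};
greedy.topK = std::vector<SizeType>{0};
greedy.topP = std::vector<float>{0.0f};

// Multinomial sampling: beamWidth == 1 and topK > 0 (or topP > 0.0f).
SamplingConfig sampling{/* beamWidth */ 1};
sampling.topK = std::vector<SizeType>{50};
sampling.temperature = std::vector<float>{0.8f};
```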
***General***

| Name in TRT-LLM | Description | Data type | Range of value | Default value | Name in HF |
| :---: | :---: | :---: | :---: | :---: | :---: |
| `temperature` | modulation of logits in the sampling workflow | List[Float] | [0.0f, $+\infty$) | `1.0f` (no modulation) | `temperature` |
| `minLength` | lower bound on the number of tokens generated | List[Int] | [0, $+\infty$) | `0` (no effect; the first generated token can be EOS) | `min_length` |
| `repetitionPenalty` | penalizes repeated tokens <br> multiplicative, irrespective of the number of appearances | List[Float] | [0.0f, $+\infty$) <br> `< 1.0f` encourages repetition <br> `> 1.0f` discourages it | `1.0f` (no effect) | `repetition_penalty` |
| `presencePenalty` | penalizes tokens already present <br> additive, irrespective of the number of appearances | List[Float] | ($-\infty$, $+\infty$) <br> `< 0.0f` encourages repetition <br> `> 0.0f` discourages it | `0.0f` (no effect) | no |
| `frequencyPenalty` | penalizes tokens already present <br> additive, dependent on the number of appearances | List[Float] | ($-\infty$, $+\infty$) <br> `< 0.0f` encourages repetition <br> `> 0.0f` discourages it | `0.0f` (no effect) | no |
| `noRepeatNgramSize` | forbids repeating n-grams of the given size | List[Int] | [0, $+\infty$) <br> `> 0` all n-grams of that size can only occur once | `0` (no effect) | `no_repeat_ngram_size` |

- The tokens of the input prompt are included when applying `repetitionPenalty`, `presencePenalty`, and `frequencyPenalty` to the logits.
- The parameters `repetitionPenalty`, `presencePenalty`, and `frequencyPenalty` are not mutually exclusive; a sketch combining them follows.
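For instance, since the three penalties can be combined, a configuration could set all of them at once (same public-field assumption as above):

```cpp
#include <tensorrt_llm/runtime/samplingConfig.h>

#include <vector>

using namespace tensorrt_llm::runtime;

SamplingConfig config{/* beamWidth */ 1};
// One value per sequence, or a single shared value (size() == 1).
config.repetitionPenalty = std::vector<float>{1.1f}; // > 1.0f discourages repetition (multiplicative)
config.presencePenalty   = std::vector<float>{0.5f}; // > 0.0f discourages reuse (additive, once per token)
config.frequencyPenalty  = std::vector<float>{0.3f}; // > 0.0f discourages reuse (additive, per occurrence)
```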
***Sampling***

| Name in TRT-LLM | Description | Data type | Range of value | Default value | Name in HF |
| :---: | :---: | :---: | :---: | :---: | :---: |
| `randomSeed` | seed for the random number generator | UInt64 | [0, 2^64-1] | `0` | no |
| `topK` | the number of logits to sample from | List[Int] | [0, 1024] | `0` | `top_k` |
| `topP` | the top-P probability to sample from | List[Float] | [0.0f, 1.0f] | `0.0f` | `top_p` |
| `topPDecay` | the decay in the `topP` algorithm | List[Float] | (0.0f, 1.0f] | `1.0f` | no |
| `topPMin` | the lower bound of the decay in the `topP` algorithm | List[Float] | (0.0f, 1.0f] | `1.0e-6f` | no |
| `topPResetIds` | where to reset the decay in the `topP` algorithm | List[Int] | [-1, $+\infty$) | `-1` (no effect) | no |
<ul class="simple">
<li><p>If setting <code class="docutils literal notranslate"><span class="pre">topK</span> <span class="pre">=</span> <span class="pre">0</span></code> and <code class="docutils literal notranslate"><span class="pre">topP</span> <span class="pre">=</span> <span class="pre">0.0f</span></code>, greedy search is performed.</p></li>
<li><p>If setting <code class="docutils literal notranslate"><span class="pre">topK</span> <span class="pre">&gt;</span> <span class="pre">0</span></code> and <code class="docutils literal notranslate"><span class="pre">topP</span> <span class="pre">=</span> <span class="pre">0.0f</span></code>, <code class="docutils literal notranslate"><span class="pre">topK</span></code> tokens of highest probabilities will become the candidates of sampling (named <code class="docutils literal notranslate"><span class="pre">TopK</span> <span class="pre">sampling</span></code> in TRT-LLM).</p></li>
<li><p>If setting <code class="docutils literal notranslate"><span class="pre">topK</span> <span class="pre">=</span> <span class="pre">0</span></code> and <code class="docutils literal notranslate"><span class="pre">topP</span> <span class="pre">&gt;</span> <span class="pre">0.0f</span></code>, tokens will be sorted with probability descendly, then the tokens with highest probabilities which the accumulated probability larger than <code class="docutils literal notranslate"><span class="pre">topP</span></code> will become the candidates of sampling (named <code class="docutils literal notranslate"><span class="pre">TopP</span> <span class="pre">sampling</span></code> in TRT-LLM).</p></li>
<li><p>If setting <code class="docutils literal notranslate"><span class="pre">topK</span> <span class="pre">&gt;</span> <span class="pre">0</span></code> and <code class="docutils literal notranslate"><span class="pre">topP</span> <span class="pre">&gt;</span> <span class="pre">0.0f</span></code>, <code class="docutils literal notranslate"><span class="pre">topK</span></code> tokens of highest probabilities will be selected, then those selected tokens will be sorted with probability descendly and their probability will be normalized, then the tokens with highest normalized probabilities which the accumulated probability larger than <code class="docutils literal notranslate"><span class="pre">topP</span></code> will become the candidates of sampling (named <code class="docutils literal notranslate"><span class="pre">TopKTopP</span> <span class="pre">sampling</span></code> in TRT-LLM)</p></li>
<li><p>If different <code class="docutils literal notranslate"><span class="pre">topK</span></code> values are provided for the different sequences in the batch, the performance of the implementation will depend on the largest value. For efficiency reasons, we recommend to batch requests with similar <code class="docutils literal notranslate"><span class="pre">topK</span></code> values together.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">topPDecay</span></code>, <code class="docutils literal notranslate"><span class="pre">topPMin</span></code> and <code class="docutils literal notranslate"><span class="pre">topPResetIds</span></code> are explained in
<a class="reference external" href="https://arxiv.org/abs/2206.04624"><em>Factuality Enhanced Language Models for Open-Ended Text Generation</em></a>.
<code class="docutils literal notranslate"><span class="pre">topPDecay</span></code> is the decay, <code class="docutils literal notranslate"><span class="pre">topPMin</span></code> is the lower-bound and <code class="docutils literal notranslate"><span class="pre">topPResetIds</span></code> indicates where to reset the decay.</p></li>
</ul>
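A compact sketch of the four combinations, under the same assumption that `SamplingConfig` exposes public optional-vector fields:

```cpp
#include <tensorrt_llm/runtime/samplingConfig.h>

#include <vector>

using namespace tensorrt_llm::runtime;

SamplingConfig config{/* beamWidth */ 1};

// 1. Greedy search: topK = 0 and topP = 0.0f (both defaults).
// 2. TopK sampling: sample among the 40 most likely tokens.
config.topK = std::vector<SizeType>{40};
config.topP = std::vector<float>{0.0f};
// 3. TopP sampling: smallest set of tokens whose cumulative
//    probability exceeds 0.9.
config.topK = std::vector<SizeType>{0};
config.topP = std::vector<float>{0.9f};
// 4. TopKTopP sampling: keep the top 40 tokens, renormalize, then apply topP.
config.topK = std::vector<SizeType>{40};
config.topP = std::vector<float>{0.9f};
```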
***Beam-search***

| Name in TRT-LLM | Description | Data type | Range of value | Default value | Name in HF |
| :---: | :---: | :---: | :---: | :---: | :---: |
| `beamWidth` | width of the beam-search algorithm | Int | [0, 1024] | `0` (disable beam search) | `num_beams` |
| `beamSearchDiversityRate` | diversity of the generated tokens | List[Float] | [0, $+\infty$) | `0.0f` | `diversity_penalty` |
| `lengthPenalty` | penalizes longer sequences | List[Float] | [0, $+\infty$) | `0.0f` | `length_penalty` |
| `earlyStopping` | see description below | List[Int] | ($-\infty$, $+\infty$) | `0` | `early_stopping` |

- Beam-search algorithm: [beam search](https://en.wikipedia.org/wiki/Beam_search).
- The HF parameter `diversity_penalty` is only used for `diverse beam-search decoding` (also called `Group-Beam-Search`), which is not supported by TRT-LLM yet.
- If `earlyStopping = 1`, decoding stops once `beamWidth` finished sentences have been generated.
- If `earlyStopping = 0`, decoding keeps going until no better sentence (one with a better score) can be generated.
- If `earlyStopping` is set to any other value, decoding stops depending only on `lengthPenalty`.
- The `beamWidth` parameter is a scalar value. That means that in this release of TensorRT-LLM, it is not possible to specify a different width for each input sequence. This limitation is likely to be removed in a future release. A configuration sketch follows this list.
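For example, a beam-search setup with four beams might look like this (again assuming public optional-vector fields on `SamplingConfig`):

```cpp
#include <tensorrt_llm/runtime/samplingConfig.h>

#include <vector>

using namespace tensorrt_llm::runtime;

// beamWidth is a scalar: the same width applies to every input sequence.
SamplingConfig beamConfig{/* beamWidth */ 4};
beamConfig.lengthPenalty = std::vector<float>{1.0f};  // penalize longer beams
beamConfig.earlyStopping = std::vector<SizeType>{1};  // stop after 4 finished sentences
```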
<section id="the-session">
<h2>The Session<a class="headerlink" href="#the-session" title="Link to this heading"></a></h2>
<p><em>The runtime session is deprecated in favor of the <a class="reference internal" href="executor.html#executor"><span class="std std-ref">Executor API</span></a>.
It will be removed in a future release of TensorRT-LLM.</em></p>
<p>An example of how to use the <code class="docutils literal notranslate"><span class="pre">GptSession</span></code> to run a GPT-like auto-regressive model can be found in
<a class="reference external" href="https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tests/runtime/gptSessionTest.cpp"><code class="docutils literal notranslate"><span class="pre">cpp/tests/runtime/gptSessionTest.cpp</span></code></a>.</p>
<section id="internal-components">
<h3>Internal Components<a class="headerlink" href="#internal-components" title="Link to this heading"></a></h3>
<p>The <code class="docutils literal notranslate"><span class="pre">GptSession</span></code> class encapsulates two main components. The
<a class="reference external" href="https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/runtime/tllmRuntime.h"><code class="docutils literal notranslate"><span class="pre">TllmRuntime</span></code></a> is in charge of the
execution of the TensorRT engine. The
<a class="reference external" href="https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/runtime/gptDecoder.h"><code class="docutils literal notranslate"><span class="pre">GptDecoder</span></code></a>
does the generation of the tokens from the logits. The <code class="docutils literal notranslate"><span class="pre">TllmRuntime</span></code> class is
an internal component and you are not expected to use that class directly.
The <code class="docutils literal notranslate"><span class="pre">GptDecoder</span></code> can be used directly to implement custom generation loop
and for use cases that cannot be satisfied by the implementation in
<code class="docutils literal notranslate"><span class="pre">GptSession</span></code>.</p>
<section id="in-flight-batching-support">
<h2>In-flight Batching Support<a class="headerlink" href="#in-flight-batching-support" title="Link to this heading"></a></h2>
<p>In-flight batching is supported using separate decoders per
request. The biggest difference compared to using a single decoder is in how
the token generation from logits is managed. A batch is split into <code class="docutils literal notranslate"><span class="pre">batchSize</span></code>
individual requests and kernels are issued using separated CUDA streams.
This behavior may be revisited in a future release to maintain the structure
of the batch and improve efficiency.</p>
</section>
<section id="know-issues-and-future-changes">
<h2>Know Issues and Future Changes<a class="headerlink" href="#know-issues-and-future-changes" title="Link to this heading"></a></h2>
<ul class="simple">
<li><p>In the current release of TensorRT-LLM, the C++ and Python runtimes are two
separate software components and the C++ runtime is being more actively
developed (with features like in-flight batching). An objective, for a
future release, could be to rebuild the Python runtime on top of the C++
one.</p></li>
</ul>
</section>
</section>