<!DOCTYPE html>
<html class="writer-html5" lang="en" data-content_root="../">
<head>
<meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Best Practices for Tuning the Performance of TensorRT-LLM &mdash; tensorrt_llm documentation</title>
<link rel="stylesheet" type="text/css" href="../_static/pygments.css?v=80d5e7a1" />
<link rel="stylesheet" type="text/css" href="../_static/css/theme.css?v=e59714d7" />
<link rel="stylesheet" type="text/css" href="../_static/copybutton.css?v=76b2166b" />
<script src="../_static/jquery.js?v=5d32c60e"></script>
<script src="../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
<script src="../_static/documentation_options.js?v=5929fcd5"></script>
<script src="../_static/doctools.js?v=888ff710"></script>
<script src="../_static/sphinx_highlight.js?v=dc90522c"></script>
<script src="../_static/clipboard.min.js?v=a7894cd8"></script>
<script src="../_static/copybutton.js?v=65e89d2a"></script>
<script src="../_static/js/theme.js"></script>
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="Performance Analysis" href="perf-analysis.html" />
<link rel="prev" title="TensorRT-LLM Benchmarking" href="perf-benchmarking.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="../index.html" class="icon icon-home">
tensorrt_llm
</a>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
<p class="caption" role="heading"><span class="caption-text">Getting Started</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../overview.html">Overview</a></li>
<li class="toctree-l1"><a class="reference internal" href="../quick-start-guide.html">Quick Start Guide</a></li>
<li class="toctree-l1"><a class="reference internal" href="../key-features.html">Key Features</a></li>
<li class="toctree-l1"><a class="reference internal" href="../release-notes.html">Release Notes</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Installation</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../installation/linux.html">Installing on Linux</a></li>
<li class="toctree-l1"><a class="reference internal" href="../installation/build-from-source-linux.html">Building from Source Code on Linux</a></li>
<li class="toctree-l1"><a class="reference internal" href="../installation/windows.html">Installing on Windows</a></li>
<li class="toctree-l1"><a class="reference internal" href="../installation/build-from-source-windows.html">Building from Source Code on Windows</a></li>
<li class="toctree-l1"><a class="reference internal" href="../installation/grace-hopper.html">Installing on Grace Hopper</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">LLM API</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../llm-api/index.html">API Introduction</a></li>
<li class="toctree-l1"><a class="reference internal" href="../llm-api/reference.html">API Reference</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">LLM API Examples</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../llm-api-examples/index.html">LLM Examples Introduction</a></li>
<li class="toctree-l1"><a class="reference internal" href="../llm-api-examples/customization.html">Common Customizations</a></li>
<li class="toctree-l1"><a class="reference internal" href="../llm-api-examples/llm_api_examples.html">Examples</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Model Definition API</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../python-api/tensorrt_llm.layers.html">Layers</a></li>
<li class="toctree-l1"><a class="reference internal" href="../python-api/tensorrt_llm.functional.html">Functionals</a></li>
<li class="toctree-l1"><a class="reference internal" href="../python-api/tensorrt_llm.models.html">Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="../python-api/tensorrt_llm.plugin.html">Plugin</a></li>
<li class="toctree-l1"><a class="reference internal" href="../python-api/tensorrt_llm.quantization.html">Quantization</a></li>
<li class="toctree-l1"><a class="reference internal" href="../python-api/tensorrt_llm.runtime.html">Runtime</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">C++ API</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../_cpp_gen/executor.html">Executor</a></li>
<li class="toctree-l1"><a class="reference internal" href="../_cpp_gen/runtime.html">Runtime</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Command-Line Reference</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../commands/trtllm-build.html">trtllm-build</a></li>
<li class="toctree-l1"><a class="reference internal" href="../commands/trtllm-serve.html">trtllm-serve</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Architecture</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../architecture/overview.html">TensorRT-LLM Architecture</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/core-concepts.html">Model Definition</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/core-concepts.html#compilation">Compilation</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/core-concepts.html#runtime">Runtime</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/core-concepts.html#multi-gpu-and-multi-node-support">Multi-GPU and Multi-Node Support</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/checkpoint.html">TensorRT-LLM Checkpoint</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/workflow.html">TensorRT-LLM Build Workflow</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/add-model.html">Adding a Model</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Advanced</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../advanced/gpt-attention.html">Multi-Head, Multi-Query, and Group-Query Attention</a></li>
<li class="toctree-l1"><a class="reference internal" href="../advanced/gpt-runtime.html">C++ GPT Runtime</a></li>
<li class="toctree-l1"><a class="reference internal" href="../advanced/executor.html">Executor API</a></li>
<li class="toctree-l1"><a class="reference internal" href="../advanced/graph-rewriting.html">Graph Rewriting Module</a></li>
<li class="toctree-l1"><a class="reference internal" href="../advanced/inference-request.html">Inference Request</a></li>
<li class="toctree-l1"><a class="reference internal" href="../advanced/inference-request.html#responses">Responses</a></li>
<li class="toctree-l1"><a class="reference internal" href="../advanced/lora.html">Run gpt-2b + LoRA using GptManager / cpp runtime</a></li>
<li class="toctree-l1"><a class="reference internal" href="../advanced/expert-parallelism.html">Expert Parallelism in TensorRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="../advanced/kv-cache-reuse.html">KV cache reuse</a></li>
<li class="toctree-l1"><a class="reference internal" href="../advanced/speculative-decoding.html">Speculative Sampling</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Performance</span></p>
<ul class="current">
<li class="toctree-l1"><a class="reference internal" href="perf-overview.html">Overview</a></li>
<li class="toctree-l1"><a class="reference internal" href="perf-benchmarking.html">Benchmarking</a></li>
<li class="toctree-l1 current"><a class="current reference internal" href="#">Best Practices</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#how-to-measure-performance">How To Measure Performance?</a></li>
<li class="toctree-l2"><a class="reference internal" href="#build-options-to-optimize-the-performance-of-tensorrt-llm-models">Build Options to Optimize the Performance of TensorRT-LLM Models</a><ul>
<li class="toctree-l3"><a class="reference internal" href="#max-batch-size-max-seq-len-and-max-num-tokens"><code class="docutils literal notranslate"><span class="pre">max_batch_size</span></code>, <code class="docutils literal notranslate"><span class="pre">max_seq_len</span></code> and <code class="docutils literal notranslate"><span class="pre">max_num_tokens</span></code></a><ul>
<li class="toctree-l4"><a class="reference internal" href="#max-batch-size"><code class="docutils literal notranslate"><span class="pre">max_batch_size</span></code></a></li>
<li class="toctree-l4"><a class="reference internal" href="#max-seq-len"><code class="docutils literal notranslate"><span class="pre">max_seq_len</span></code></a></li>
<li class="toctree-l4"><a class="reference internal" href="#max-num-tokens"><code class="docutils literal notranslate"><span class="pre">max_num_tokens</span></code></a></li>
</ul>
</li>
<li class="toctree-l3"><a class="reference internal" href="#multiple-profiles">Multiple profiles</a><ul>
<li class="toctree-l4"><a class="reference internal" href="#fp8-context-fused-multi-head-attention">FP8 Context Fused Multi-Head Attention</a></li>
</ul>
</li>
<li class="toctree-l3"><a class="reference internal" href="#gpt-attention-plugin-and-context-fused-multi-head-attention">GPT Attention Plugin and Context Fused Multi-Head Attention</a></li>
<li class="toctree-l3"><a class="reference internal" href="#remove-input-padding">Remove Input Padding</a></li>
<li class="toctree-l3"><a class="reference internal" href="#paged-kv-cache">Paged KV Cache</a></li>
<li class="toctree-l3"><a class="reference internal" href="#reduce-norm-fusion">Reduce Norm Fusion</a></li>
<li class="toctree-l3"><a class="reference internal" href="#user-buffer">User Buffer</a></li>
<li class="toctree-l3"><a class="reference internal" href="#embedding-parallelism-embedding-sharing-and-look-up-plugin">Embedding Parallelism, Embedding Sharing, and Look-Up Plugin</a></li>
<li class="toctree-l3"><a class="reference internal" href="#horizontal-fusion-in-gated-mlp">Horizontal Fusion in Gated-MLP</a></li>
<li class="toctree-l3"><a class="reference internal" href="#gemm-plugin">GEMM Plugin</a><ul>
<li class="toctree-l4"><a class="reference internal" href="#fp8-gemm-plugin-for-small-batch-size-performance-optimization">FP8 GEMM Plugin for Small Batch Size Performance Optimization</a></li>
<li class="toctree-l4"><a class="reference internal" href="#gemm-swiglu-fusion-in-gated-mlp">GEMM + SwiGLU Fusion in Gated-MLP</a></li>
</ul>
</li>
<li class="toctree-l3"><a class="reference internal" href="#bert-attention-plugin-and-context-fused-multi-head-attention">BERT Attention Plugin and Context Fused Multi-Head Attention</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="#runtime-options-to-optimize-the-performance-of-tensorrt-llm-models">Runtime Options to Optimize the Performance of TensorRT-LLM Models</a><ul>
<li class="toctree-l3"><a class="reference internal" href="#capacity-scheduler-policy">Capacity Scheduler Policy</a></li>
<li class="toctree-l3"><a class="reference internal" href="#context-chunking-policy">Context Chunking Policy</a></li>
<li class="toctree-l3"><a class="reference internal" href="#batching-type">Batching Type</a></li>
<li class="toctree-l3"><a class="reference internal" href="#max-tokens-in-paged-kv-cache-and-kv-cache-free-gpu-memory-fraction">Max Tokens in Paged KV Cache and KV Cache Free GPU Memory Fraction</a></li>
<li class="toctree-l3"><a class="reference internal" href="#maximum-attention-window-size">Maximum Attention Window Size</a></li>
</ul>
</li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="perf-analysis.html">Performance Analysis</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Reference</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../reference/troubleshooting.html">Troubleshooting</a></li>
<li class="toctree-l1"><a class="reference internal" href="../reference/support-matrix.html">Support Matrix</a></li>
<li class="toctree-l1"><a class="reference internal" href="../reference/precision.html">Numerical Precision</a></li>
<li class="toctree-l1"><a class="reference internal" href="../reference/memory.html">Memory Usage of TensorRT-LLM</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Blogs</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../blogs/H100vsA100.html">H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token</a></li>
<li class="toctree-l1"><a class="reference internal" href="../blogs/H200launch.html">H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="../blogs/Falcon180B-H200.html">Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100</a></li>
<li class="toctree-l1"><a class="reference internal" href="../blogs/quantization-in-TRT-LLM.html">Speed up inference with SOTA quantization techniques in TRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="../blogs/XQA-kernel.html">New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">tensorrt_llm</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="Page navigation">
<ul class="wy-breadcrumbs">
<li><a href="../index.html" class="icon icon-home" aria-label="Home"></a></li>
<li class="breadcrumb-item active">Best Practices for Tuning the Performance of TensorRT-LLM</li>
<li class="wy-breadcrumbs-aside">
<a href="../_sources/performance/perf-best-practices.md.txt" rel="nofollow"> View page source</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<section id="best-practices-for-tuning-the-performance-of-tensorrt-llm">
<span id="perf-best-practice"></span><h1>Best Practices for Tuning the Performance of TensorRT-LLM<a class="headerlink" href="#best-practices-for-tuning-the-performance-of-tensorrt-llm" title="Link to this heading"></a></h1>
<p>This document provides some best practices for tuning the performance of TensorRT-LLM.</p>
<section id="how-to-measure-performance">
<h2>How To Measure Performance?<a class="headerlink" href="#how-to-measure-performance" title="Link to this heading"></a></h2>
<p>TensorRT-LLM can be benchmarked using the
<a class="reference external" href="https://github.com/NVIDIA/TensorRT-LLM/blob/main/benchmarks/cpp/README.md">C++</a> benchmarking tools. We are also actively developing the <code class="docutils literal notranslate"><span class="pre">trtllm-bench</span></code> command, which will become the recommended way to benchmark TensorRT-LLM.</p>
<p>For detailed performance data and
the steps to reproduce those results, see
this <a class="reference external" href="https://nvidia.github.io/TensorRT-LLM/performance/perf-overview.html">Document</a>.
The <a class="reference external" href="https://github.com/triton-inference-server/tensorrtllm_backend">TensorRT-LLM backend</a>
can also be used to measure the performance of TensorRT-LLM for online serving.</p>
</section>
<section id="build-options-to-optimize-the-performance-of-tensorrt-llm-models">
<h2>Build Options to Optimize the Performance of TensorRT-LLM Models<a class="headerlink" href="#build-options-to-optimize-the-performance-of-tensorrt-llm-models" title="Link to this heading"></a></h2>
<p>This section summarizes how to build engines for better runtime performance.
The following options have reasonable default values, but some of them may need
tuning to reach peak performance.</p>
<p><em><strong>Note that some of those features and how to enable them may change in the future.</strong></em></p>
<section id="max-batch-size-max-seq-len-and-max-num-tokens">
<h3><code class="docutils literal notranslate"><span class="pre">max_batch_size</span></code>, <code class="docutils literal notranslate"><span class="pre">max_seq_len</span></code> and <code class="docutils literal notranslate"><span class="pre">max_num_tokens</span></code><a class="headerlink" href="#max-batch-size-max-seq-len-and-max-num-tokens" title="Link to this heading"></a></h3>
<p align="center">
<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/media/max_bs_toks_len.svg?raw=true" alt="Explain `max_batch_size`, `max_seq_len` and `max_num_tokens`" width="30%" height="auto">
</p>
<p>Regarding the impacts of those three arguments to the GPU memory usage, please refer to <a class="reference external" href="https://nvidia.github.io/TensorRT-LLM/reference/memory.html">memory.md</a></p>
<section id="max-batch-size">
<h4><code class="docutils literal notranslate"><span class="pre">max_batch_size</span></code><a class="headerlink" href="#max-batch-size" title="Link to this heading"></a></h4>
<p><code class="docutils literal notranslate"><span class="pre">max_batch_size</span></code> defines the maximum number of requests that the engine can handle. It caps the number of requests that can be scheduled together at runtime.</p>
<p>Set <code class="docutils literal notranslate"><span class="pre">max_batch_size</span></code> high enough at build time that it does not become a throughput bottleneck, then use the runtime <code class="docutils literal notranslate"><span class="pre">max_batch_size</span></code> to tune it without rebuilding the engine when you want higher per-user throughput or lower latency.</p>
</section>
<section id="max-seq-len">
<h4><code class="docutils literal notranslate"><span class="pre">max_seq_len</span></code><a class="headerlink" href="#max-seq-len" title="Link to this heading"></a></h4>
<p><code class="docutils literal notranslate"><span class="pre">max_seq_len</span></code> defines the maximum sequence length of a single request.</p>
<p>Starting from TensorRT-LLM v0.11, when <code class="docutils literal notranslate"><span class="pre">--remove_input_padding</span></code> and <code class="docutils literal notranslate"><span class="pre">--context_fmha</span></code> are enabled, <code class="docutils literal notranslate"><span class="pre">max_seq_len</span></code> can replace <code class="docutils literal notranslate"><span class="pre">max_input_len</span></code> and <code class="docutils literal notranslate"><span class="pre">max_output_len</span></code>, and is set to <code class="docutils literal notranslate"><span class="pre">max_position_embeddings</span></code> by default.</p>
<p>Use the default <code class="docutils literal notranslate"><span class="pre">max_seq_len</span></code> (which is <code class="docutils literal notranslate"><span class="pre">max_position_embeddings</span></code>); there is no need to tune it unless you know the maximum sequence lengths of your workloads well. If GPU memory is so limited that it cannot serve even a single request at <code class="docutils literal notranslate"><span class="pre">max_seq_len</span></code>, you'll need to reduce it.</p>
</section>
<section id="max-num-tokens">
<h4><code class="docutils literal notranslate"><span class="pre">max_num_tokens</span></code><a class="headerlink" href="#max-num-tokens" title="Link to this heading"></a></h4>
<p><code class="docutils literal notranslate"><span class="pre">max_num_tokens</span></code> defines the maximum number of batched input tokens after padding is removed in each batch.</p>
<p><code class="docutils literal notranslate"><span class="pre">max_num_tokens</span></code> is set to 8192 by default starting from v0.11. You can tune it using the runtime <code class="docutils literal notranslate"><span class="pre">max_num_tokens</span></code> without re-building the engine, and tuning <code class="docutils literal notranslate"><span class="pre">--max_num_tokens</span></code> is recommended for better performance.</p>
<p>The maximum number of tokens has no effect when input padding is not
removed. When input padding is removed (see <a class="reference internal" href="#remove-input-padding">Remove Input
Padding</a>), tokens from different sequences are
packed together and the maximum number of tokens can be set to a different
(lower) value, which defaults to 8192.</p>
<p>There are two aspects to consider. First, some input sequences
will be shorter than the maximum input length. Second, when in-flight
sequence batching is enabled, requests in the context phase are executed together with
requests in the generation phase. The latter produce far fewer tokens
than <code class="docutils literal notranslate"><span class="pre">max_input_len</span></code> (at most <code class="docutils literal notranslate"><span class="pre">beam_width</span></code> tokens per request).</p>
<p>Using a more realistic value for <code class="docutils literal notranslate"><span class="pre">max_num_tokens</span></code> allows TensorRT-LLM to
allocate more memory for the KV cache and execute more requests together,
which leads to higher efficiency.</p>
<p>Increasing <code class="docutils literal notranslate"><span class="pre">max_num_tokens</span></code> appropriately benefits performance.
Beyond a certain value of <code class="docutils literal notranslate"><span class="pre">--max_num_tokens</span></code>, GPU utilization plateaus, and
going past that saturation point may hurt both first-token latency and
total end-to-end latency.</p>
<p>See also <a class="reference external" href="https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html#chunked-context">chunked context</a>.</p>
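<p>Putting the three limits together, a <code class="docutils literal notranslate"><span class="pre">trtllm-build</span></code> invocation might look like the following sketch (the checkpoint and output paths are placeholders):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre># Sketch: cap concurrent requests, per-request sequence length, and
# batched tokens per iteration at build time; the runtime values can be
# lowered later without rebuilding the engine.
trtllm-build --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_dir \
    --max_batch_size 256 \
    --max_seq_len 4096 \
    --max_num_tokens 8192
</pre></div>
</div>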
</section>
</section>
<section id="multiple-profiles">
<h3>Multiple profiles<a class="headerlink" href="#multiple-profiles" title="Link to this heading"></a></h3>
<p><code class="docutils literal notranslate"><span class="pre">--multiple_profiles</span></code> enables multiple TensorRT optimization profiles in the
built engines. It benefits performance, especially when the GEMM plugin is
disabled, because more optimization profiles give TensorRT more chances to
select better kernels.</p>
<p>Note: This feature increases engine build time but no other adverse effects are expected.</p>
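<p>For example, a build that relies on TensorRT kernel selection instead of the GEMM plugin might use (placeholder paths; <code class="docutils literal notranslate"><span class="pre">--gemm_plugin</span> <span class="pre">disable</span></code> is optional):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre># Sketch: build with multiple optimization profiles so TensorRT can pick
# better kernels per input shape, especially with the GEMM plugin off.
trtllm-build --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_dir \
    --gemm_plugin disable \
    --multiple_profiles enable
</pre></div>
</div>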
<section id="fp8-context-fused-multi-head-attention">
<h4>FP8 Context Fused Multi-Head Attention<a class="headerlink" href="#fp8-context-fused-multi-head-attention" title="Link to this heading"></a></h4>
<p><code class="docutils literal notranslate"><span class="pre">--use_fp8_context_fmha</span></code> enables FP8 context fused multi-head attention. We
recommend enabling this when FP8 quantization is used, to improve context-phase
attention performance. Note that only the NVIDIA Hopper architecture is supported.</p>
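<p>For instance, when building from an FP8-quantized checkpoint on Hopper (paths are placeholders):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre># Sketch: enable the FP8 context fused multi-head attention kernel for
# an FP8-quantized checkpoint (Hopper only).
trtllm-build --checkpoint_dir ./tllm_checkpoint_fp8 \
    --output_dir ./engine_dir \
    --use_fp8_context_fmha enable
</pre></div>
</div>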
</section>
</section>
<section id="gpt-attention-plugin-and-context-fused-multi-head-attention">
<h3>GPT Attention Plugin and Context Fused Multi-Head Attention<a class="headerlink" href="#gpt-attention-plugin-and-context-fused-multi-head-attention" title="Link to this heading"></a></h3>
<p>The GPT attention plugin and the fused multi-head attention kernel are enabled by
default. They are controlled by the <code class="docutils literal notranslate"><span class="pre">--gpt_attention_plugin</span></code>
and (for the context phase) <code class="docutils literal notranslate"><span class="pre">--context_fmha</span></code> arguments of <code class="docutils literal notranslate"><span class="pre">trtllm-build</span></code>.</p>
<p>The TensorRT-LLM GPT attention plugin uses efficient kernels and enables an
in-place update of the KV cache. It results in reduced memory consumption as
well as the removal of unneeded memory copy operations (compared with the
implementation that uses the <code class="docutils literal notranslate"><span class="pre">concat</span></code> operator to update the KV cache).</p>
<p>Enabling fused multi-head attention during the context phase triggers
a kernel that performs the MHA/MQA/GQA block as a single kernel. For more
details, see this <a class="reference external" href="https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html#context-phase">Document</a>.</p>
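<p>As an illustration, the defaults can be stated explicitly (or switched to <code class="docutils literal notranslate"><span class="pre">disable</span></code> for a comparison run):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre># Sketch: the default attention settings made explicit; "auto" lets the
# plugin follow the model dtype.
trtllm-build --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_dir \
    --gpt_attention_plugin auto \
    --context_fmha enable
</pre></div>
</div>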
</section>
<section id="remove-input-padding">
<h3>Remove Input Padding<a class="headerlink" href="#remove-input-padding" title="Link to this heading"></a></h3>
<p>The remove-input-padding feature is enabled by default; the <code class="docutils literal notranslate"><span class="pre">--remove_input_padding</span></code>
argument of <code class="docutils literal notranslate"><span class="pre">trtllm-build</span></code> controls it.</p>
<p>When input padding is removed, tokens from different sequences are packed together. This
reduces both the amount of computation and the memory consumption. For more details, see
this <a class="reference external" href="https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html#padded-and-packed-tensors">Document</a>.</p>
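<p>For example, to build with packed (padding-free) inputs explicitly, which is the default:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre># Sketch: keep input padding removal on (the default).
trtllm-build --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_dir \
    --remove_input_padding enable
</pre></div>
</div>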
</section>
<section id="paged-kv-cache">
<h3>Paged KV Cache<a class="headerlink" href="#paged-kv-cache" title="Link to this heading"></a></h3>
<p>The paged KV cache is enabled by default; the <code class="docutils literal notranslate"><span class="pre">--paged_kv_cache</span></code> argument of
<code class="docutils literal notranslate"><span class="pre">trtllm-build</span></code> controls it.</p>
<p>The paged KV cache helps manage memory for the KV cache more efficiently (see
this <a class="reference external" href="https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html#paged-kv-cache">Document</a>). It usually leads to a
larger batch size and improved efficiency.</p>
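<p>For example, the default can be stated explicitly, or the feature switched off to compare against a contiguous KV cache:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre># Sketch: paged KV cache is already on by default; pass "disable" to
# benchmark against a contiguous (non-paged) KV cache.
trtllm-build --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_dir \
    --paged_kv_cache enable
</pre></div>
</div>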
</section>
<section id="reduce-norm-fusion">
<h3>Reduce Norm Fusion<a class="headerlink" href="#reduce-norm-fusion" title="Link to this heading"></a></h3>
<p>There is an experimental feature called “Reduce Norm Fusion”
available to extend the custom AllReduce functionality. It can be enabled by
using the <code class="docutils literal notranslate"><span class="pre">--reduce_fusion</span> <span class="pre">enable</span></code> argument with <code class="docutils literal notranslate"><span class="pre">trtllm-build</span></code> when the
custom AllReduce is already enabled.</p>
<p>This feature aims to fuse the <code class="docutils literal notranslate"><span class="pre">ResidualAdd</span></code>
and <code class="docutils literal notranslate"><span class="pre">LayerNorm</span></code> kernels after <code class="docutils literal notranslate"><span class="pre">AllReduce</span></code> into a single kernel, resulting in
improved end-to-end performance.</p>
<p>Note that this feature is currently
only supported for the Llama model. It is recommended when the batch size is small and the generation-phase time is the dominant factor.</p>
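<p>For instance, a tensor-parallel Llama build could request the fusion like this (paths are placeholders):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre># Sketch: fuse ResidualAdd + LayerNorm into the custom AllReduce path
# (Llama-family models, most useful at small batch sizes).
trtllm-build --checkpoint_dir ./tllm_checkpoint_tp2 \
    --output_dir ./engine_dir \
    --reduce_fusion enable
</pre></div>
</div>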
</section>
<section id="user-buffer">
<h3>User Buffer<a class="headerlink" href="#user-buffer" title="Link to this heading"></a></h3>
<p>An experimental feature called “User Buffer” is available to enhance communication performance. It can be enabled by using the <code class="docutils literal notranslate"><span class="pre">--user_buffer</span> <span class="pre">enable</span></code> argument with <code class="docutils literal notranslate"><span class="pre">trtllm-build</span></code>.
This feature aims to eliminate extra copies from the local buffer to the shared buffer in the communication kernel, leading to improved end-to-end performance.
This feature must be enabled with <code class="docutils literal notranslate"><span class="pre">--reduce_fusion</span> <span class="pre">enable</span></code> and is only supported for the FP8 LLAMA model.</p>
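<p>Because it builds on the fused AllReduce path, the two flags are passed together; for example, for an FP8 Llama engine (paths are placeholders):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre># Sketch: user-buffer communication requires reduce fusion and is
# limited to FP8 Llama engines.
trtllm-build --checkpoint_dir ./tllm_checkpoint_fp8_tp2 \
    --output_dir ./engine_dir \
    --reduce_fusion enable \
    --user_buffer enable
</pre></div>
</div>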
</section>
<section id="embedding-parallelism-embedding-sharing-and-look-up-plugin">
<h3>Embedding Parallelism, Embedding Sharing, and Look-Up Plugin<a class="headerlink" href="#embedding-parallelism-embedding-sharing-and-look-up-plugin" title="Link to this heading"></a></h3>
<p>The embedding parallelism feature enables sharding of the embedding table
across multiple GPUs, which reduces memory usage and improves
throughput. The embedding sharing feature enables sharing of the
embedding table between the <code class="docutils literal notranslate"><span class="pre">look_up</span></code> and <code class="docutils literal notranslate"><span class="pre">lm_head</span></code> layers to reduce memory usage.</p>
<p>It is recommended to enable embedding parallelism to improve throughput with <code class="docutils literal notranslate"><span class="pre">--use_parallel_embedding</span></code> and <code class="docutils literal notranslate"><span class="pre">--embedding_sharding_dim</span></code> in <code class="docutils literal notranslate"><span class="pre">convert_checkpoint.py</span></code>.</p>
<p>Embedding sharing is enabled by default if the following conditions are met:</p>
<ol class="arabic simple">
<li><p><code class="docutils literal notranslate"><span class="pre">look_up</span></code> and <code class="docutils literal notranslate"><span class="pre">lm_head</span></code> layers have identical weights.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">--gemm_plugin</span></code> is not used when building the engine.</p></li>
<li><p>For tensor parallelism cases, <code class="docutils literal notranslate"><span class="pre">--embedding_sharding_dim</span> <span class="pre">0</span></code> must be set. In other words, embedding parallelism along the vocab dimension must be enabled.</p></li>
</ol>
<p>See those <a class="reference external" href="https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gpt#embedding-parallelism">Examples</a> for details.</p>
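<p>As a sketch, using the Llama example's <code class="docutils literal notranslate"><span class="pre">convert_checkpoint.py</span></code> (paths and sizes are placeholders):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre># Sketch: shard the embedding table along the vocab dimension (dim 0)
# across two tensor-parallel ranks so embedding sharing can also apply.
python3 examples/llama/convert_checkpoint.py \
    --model_dir ./llama-hf \
    --output_dir ./tllm_checkpoint_tp2 \
    --dtype float16 \
    --tp_size 2 \
    --use_parallel_embedding \
    --embedding_sharding_dim 0
</pre></div>
</div>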
</section>
<section id="horizontal-fusion-in-gated-mlp">
<h3>Horizontal Fusion in Gated-MLP<a class="headerlink" href="#horizontal-fusion-in-gated-mlp" title="Link to this heading"></a></h3>
<p>Horizontal fusion in Gated-MLP combines two Matmul operations into a single one
followed by a separate SwiGLU kernel. It can effectively reduce latency.
This feature is enabled by default.</p>
</section>
<section id="gemm-plugin">
<h3>GEMM Plugin<a class="headerlink" href="#gemm-plugin" title="Link to this heading"></a></h3>
<p>The GEMM plugin utilizes NVIDIA cuBLASLt to perform GEMM operations. For FP16 and
BF16 models, it is recommended to enable it for better performance and smaller GPU
memory usage. For FP8 models, it is recommended to disable it.</p>
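<p>For example, an FP16 build can enable the plugin explicitly; with <code class="docutils literal notranslate"><span class="pre">auto</span></code> the plugin follows the model dtype:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre># Sketch: use the cuBLASLt-based GEMM plugin for an FP16/BF16 engine;
# for FP8 builds, pass "disable" (or see the FP8 GEMM plugin below).
trtllm-build --checkpoint_dir ./tllm_checkpoint_fp16 \
    --output_dir ./engine_dir \
    --gemm_plugin auto
</pre></div>
</div>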
<section id="fp8-gemm-plugin-for-small-batch-size-performance-optimization">
<h4>FP8 GEMM Plugin for Small Batch Size Performance Optimization<a class="headerlink" href="#fp8-gemm-plugin-for-small-batch-size-performance-optimization" title="Link to this heading"></a></h4>
<p>The FP8 GEMM plugin is an experimental feature aimed at improving performance in
small-batch-size cases (e.g., BS &lt;= 4) and can be enabled with <code class="docutils literal notranslate"><span class="pre">--gemm_plugin</span> <span class="pre">fp8</span></code>
when building FP8 models. Although inputs with larger batch sizes are still computed
correctly, performance may decrease as the batch size grows. Therefore, this
feature is currently recommended only for latency reduction in small-batch-size
scenarios.</p>
</section>
<section id="gemm-swiglu-fusion-in-gated-mlp">
<h4>GEMM + SwiGLU Fusion in Gated-MLP<a class="headerlink" href="#gemm-swiglu-fusion-in-gated-mlp" title="Link to this heading"></a></h4>
<p>The GEMM + SwiGLU fusion in Gated-MLP combines two Matmul operations and one SwiGLU operation into a single kernel. Currently this is only supported for FP8 precision on Hopper. While this fusion improves performance, it can slightly reduce accuracy in FP8 PTQ because one quantization scaling factor is discarded.</p>
<p>We recommend enabling this feature for large models running on Hopper with FP8 precision. Use the following <code class="docutils literal notranslate"><span class="pre">trtllm-build</span></code> arguments to enable it:</p>
<ul class="simple">
<li><p>For large models: <code class="docutils literal notranslate"><span class="pre">--use_fused_mlp=enable</span> <span class="pre">--gemm_swiglu_plugin=fp8</span></code></p></li>
<li><p>For small batch sizes: <code class="docutils literal notranslate"><span class="pre">--use_fused_mlp=enable</span> <span class="pre">--low_latency_gemm_swiglu_plugin=fp8</span></code> to improve latency.</p></li>
</ul>
<p>We do not recommend enabling this feature for very small workloads or if the
accuracy loss is unacceptable.</p>
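<p>For example, a large FP8 engine on Hopper could be built with the throughput-oriented variant of the fusion (paths are placeholders):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre># Sketch: fuse GEMM + SwiGLU in the Gated-MLP for an FP8 Hopper engine.
# Swap in --low_latency_gemm_swiglu_plugin=fp8 for small-batch latency.
trtllm-build --checkpoint_dir ./tllm_checkpoint_fp8 \
    --output_dir ./engine_dir \
    --use_fused_mlp=enable \
    --gemm_swiglu_plugin=fp8
</pre></div>
</div>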
</section>
</section>
<section id="bert-attention-plugin-and-context-fused-multi-head-attention">
<h3>BERT Attention Plugin and Context Fused Multi-Head Attention<a class="headerlink" href="#bert-attention-plugin-and-context-fused-multi-head-attention" title="Link to this heading"></a></h3>
<p>BERT attention plugin and context fused multi-head attention are both
recommended for the BERT model. They are enabled by default using the
<code class="docutils literal notranslate"><span class="pre">--bert_attention_plugin</span></code> and <code class="docutils literal notranslate"><span class="pre">--context_fmha</span></code> arguments with
<code class="docutils literal notranslate"><span class="pre">trtllm-build</span></code>.</p>
</section>
</section>
<section id="runtime-options-to-optimize-the-performance-of-tensorrt-llm-models">
<h2>Runtime Options to Optimize the Performance of TensorRT-LLM Models<a class="headerlink" href="#runtime-options-to-optimize-the-performance-of-tensorrt-llm-models" title="Link to this heading"></a></h2>
<p>This part summarizes the runtime configuration knobs that can be tweaked to
enhance the performance of already built engines. Note that currently the
configurations can be modified using the
<a class="reference external" href="https://nvidia.github.io/TensorRT-LLM/advanced/executor.html#executor-api">Executor API</a>
as well as the
<a class="reference external" href="https://github.com/triton-inference-server/tensorrtllm_backend">TensorRT-LLM backend</a>.</p>
<section id="capacity-scheduler-policy">
<h3>Capacity Scheduler Policy<a class="headerlink" href="#capacity-scheduler-policy" title="Link to this heading"></a></h3>
<p>There are currently three batch scheduler policies: <code class="docutils literal notranslate"><span class="pre">GUARANTEED_NO_EVICT</span></code> (default),
<code class="docutils literal notranslate"><span class="pre">MAX_UTILIZATION</span></code>, and <code class="docutils literal notranslate"><span class="pre">STATIC_BATCH</span></code>.</p>
<p>The scheduling policy can be set to <code class="docutils literal notranslate"><span class="pre">MAX_UTILIZATION</span></code> to pack as many
requests as possible at each iteration of the forward loop, when in-flight
sequence batching is enabled. It maximizes the utilization of the GPUs by
aggressively scheduling requests at the risk of having to pause requests if the
KV cache size limit is reached.</p>
<p>For a more conservative approach with respect to the KV cache limitations in
terms of memory allocation, <code class="docutils literal notranslate"><span class="pre">CapacitySchedulerPolicy</span></code> should be set to
<code class="docutils literal notranslate"><span class="pre">GUARANTEED_NO_EVICT</span></code> to guarantee that a started request is never paused.</p>
<p>If the goal is to maximize throughput, users should try <code class="docutils literal notranslate"><span class="pre">MAX_UTILIZATION</span></code>.
However, keep in mind that it may have a negative impact on
latency if requests have to be paused.</p>
<p><code class="docutils literal notranslate"><span class="pre">STATIC_BATCH</span></code> is a legacy mode and is not recommended for production usage.</p>
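<p>When serving through the TensorRT-LLM backend, the policy is exposed as a model-configuration parameter. A hypothetical invocation of the backend's <code class="docutils literal notranslate"><span class="pre">fill_template.py</span></code> helper, assuming the parameter name used by that repository, could look like:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre># Sketch (assumed parameter name from the tensorrtllm_backend repo):
# switch the capacity scheduler to MAX_UTILIZATION for throughput runs.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "batch_scheduler_policy:max_utilization"
</pre></div>
</div>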
</section>
<section id="context-chunking-policy">
<h3>Context Chunking Policy<a class="headerlink" href="#context-chunking-policy" title="Link to this heading"></a></h3>
<p>Context chunking increases the chance of batching context-phase work with
generation-phase work, thereby balancing the amount of computation
in each iteration and increasing throughput.</p>
<p>There are currently two context chunking policies: <code class="docutils literal notranslate"><span class="pre">FIRST_COME_FIRST_SERVED</span></code> (default)
and <code class="docutils literal notranslate"><span class="pre">EQUAL_PROGRESS</span></code>.</p>
<p><code class="docutils literal notranslate"><span class="pre">FIRST_COME_FIRST_SERVED</span></code> should achieve better overall performance, while
<code class="docutils literal notranslate"><span class="pre">EQUAL_PROGRESS</span></code> can, in theory, help keep the time to first token (TTFT)
of most requests relatively similar.</p>
</section>
<section id="batching-type">
<h3>Batching Type<a class="headerlink" href="#batching-type" title="Link to this heading"></a></h3>
<p>The batching type can be set to <code class="docutils literal notranslate"><span class="pre">INFLIGHT</span></code> (default) or <code class="docutils literal notranslate"><span class="pre">STATIC</span></code>.
It is recommended to use <code class="docutils literal notranslate"><span class="pre">INFLIGHT</span></code> to increase throughput and reduce latency.</p>
</section>
<section id="max-tokens-in-paged-kv-cache-and-kv-cache-free-gpu-memory-fraction">
<h3>Max Tokens in Paged KV Cache and KV Cache Free GPU Memory Fraction<a class="headerlink" href="#max-tokens-in-paged-kv-cache-and-kv-cache-free-gpu-memory-fraction" title="Link to this heading"></a></h3>
<p>The <code class="docutils literal notranslate"><span class="pre">max_tokens_in_paged_kv_cache</span></code> and <code class="docutils literal notranslate"><span class="pre">kv_cache_free_gpu_mem_fraction</span></code>
parameters can be used to control the maximum number of tokens handled by the
KV cache manager. Setting them properly helps control the amount of
memory available to the KV cache manager during inference. Keep in mind
that increasing the amount of memory available to the KV cache manager tends to
translate into a higher achievable throughput.</p>
<p>The <code class="docutils literal notranslate"><span class="pre">max_tokens_in_paged_kv_cache</span></code> flag directly sets the maximum number of
tokens in the KV cache manager. When left unset, that value will be computed
based on the <code class="docutils literal notranslate"><span class="pre">kv_cache_free_gpu_mem_fraction</span></code> setting.</p>
<p>The <code class="docutils literal notranslate"><span class="pre">kv_cache_free_gpu_mem_fraction</span></code> is a floating-point number between <code class="docutils literal notranslate"><span class="pre">0.0</span></code>
and <code class="docutils literal notranslate"><span class="pre">1.0</span></code> that indicates the maximum fraction of GPU memory (after loading the
model) that will be used for the KV cache. The default value is <code class="docutils literal notranslate"><span class="pre">0.90</span></code>, which
means that 90% of the free GPU memory will be used to store tokens in the KV
cache. Based on that value, TensorRT-LLM can determine the maximum number of
tokens in the KV cache manager.</p>
<p>When both parameters are set, the maximum number of tokens in the KV cache
manager will be set to the smaller value between <code class="docutils literal notranslate"><span class="pre">max_tokens_in_paged_kv_cache</span></code>
and the value computed from the amount of memory available for the KV cache.</p>
<p>Unless users clearly know the maximum number of tokens in the KV cache needed
by the model, it is recommended to leave <code class="docutils literal notranslate"><span class="pre">max_tokens_in_paged_kv_cache</span></code> unset.
For <code class="docutils literal notranslate"><span class="pre">kv_cache_free_gpu_mem_fraction</span></code>, if no other programs are executed on the
same GPU, it is recommended to test with a value as high as <code class="docutils literal notranslate"><span class="pre">0.95</span></code> to target
high throughput. Note that the <code class="docutils literal notranslate"><span class="pre">kv_cache_free_gpu_mem_fraction</span></code> parameter
cannot be set to <code class="docutils literal notranslate"><span class="pre">1.0</span></code> because some memory has to be reserved for
inputs and outputs.</p>
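<p>Through the TensorRT-LLM backend, both knobs map to model-configuration parameters. A hypothetical example, with parameter names assumed from the tensorrtllm_backend repository, is:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre># Sketch (assumed parameter name): leave max_tokens_in_paged_kv_cache
# unset and give 95% of the free GPU memory to the KV cache.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "kv_cache_free_gpu_mem_fraction:0.95"
</pre></div>
</div>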
</section>
<section id="maximum-attention-window-size">
<h3>Maximum Attention Window Size<a class="headerlink" href="#maximum-attention-window-size" title="Link to this heading"></a></h3>
<p>The <code class="docutils literal notranslate"><span class="pre">max_attention_window_size</span></code> flag sets the maximum number of tokens that are
attended to in order to generate one token when techniques like sliding-window
attention are used. See this
<a class="reference external" href="https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html#sliding-window-attention-cyclic-rolling-buffer-kv-cache">Document</a>
for more details. It defaults to the maximum sequence length
(<code class="docutils literal notranslate"><span class="pre">max_seq_len</span></code> when building the engine), which means
that the feature is disabled by default.</p>
<p>When set to a value smaller than <code class="docutils literal notranslate"><span class="pre">max_seq_len</span></code> (during
engine build), only the KV cache of the last <code class="docutils literal notranslate"><span class="pre">max_attention_window_size</span></code> tokens
is stored. If the input sequence length at runtime exceeds the
<code class="docutils literal notranslate"><span class="pre">max_attention_window_size</span></code> value, accuracy may start to drop, but
runtime performance will improve (due to reduced
computation and GPU memory allocation). Users can lower that value to
increase runtime performance at the expense of reduced accuracy.</p>
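<p>For example, a hypothetical backend configuration that limits attention to the last 4096 tokens (parameter name assumed from the tensorrtllm_backend repository):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre># Sketch (assumed parameter name): keep only the KV cache of the most
# recent 4096 tokens, trading some accuracy for speed and memory.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "max_attention_window_size:4096"
</pre></div>
</div>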
</section>
</section>
</section>
</div>
</div>
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
<a href="perf-benchmarking.html" class="btn btn-neutral float-left" title="TensorRT-LLM Benchmarking" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
<a href="perf-analysis.html" class="btn btn-neutral float-right" title="Performance Analysis" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
</div>
<hr/>
<div role="contentinfo">
<div class="footer">
<p>
Copyright © 2024 NVIDIA Corporation
</p>
<p>
<a class="Link" href="https://www.nvidia.com/en-us/about-nvidia/privacy-policy/" target="_blank" rel="noopener"
data-cms-ai="0">Privacy Policy</a> |
<a class="Link" href="https://www.nvidia.com/en-us/about-nvidia/privacy-center/" target="_blank" rel="noopener"
data-cms-ai="0">Manage My Privacy</a> |
<a class="Link" href="https://www.nvidia.com/en-us/preferences/start/" target="_blank" rel="noopener"
data-cms-ai="0">Do Not Sell or Share My Data</a> |
<a class="Link" href="https://www.nvidia.com/en-us/about-nvidia/terms-of-service/" target="_blank"
rel="noopener" data-cms-ai="0">Terms of Service</a> |
<a class="Link" href="https://www.nvidia.com/en-us/about-nvidia/accessibility/" target="_blank" rel="noopener"
data-cms-ai="0">Accessibility</a> |
<a class="Link" href="https://www.nvidia.com/en-us/about-nvidia/company-policies/" target="_blank"
rel="noopener" data-cms-ai="0">Corporate Policies</a> |
<a class="Link" href="https://www.nvidia.com/en-us/product-security/" target="_blank" rel="noopener"
data-cms-ai="0">Product Security</a> |
<a class="Link" href="https://www.nvidia.com/en-us/contact/" target="_blank" rel="noopener"
data-cms-ai="0">Contact</a>
</p>
</div>
</div>
</footer>
</div>
</div>
</section>
</div>
<script>
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>