<!DOCTYPE html>
<html class="writer-html5" lang="en" data-content_root="../">
<head>
<meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget &mdash; tensorrt_llm documentation</title>
<link rel="stylesheet" type="text/css" href="../_static/pygments.css?v=80d5e7a1" />
<link rel="stylesheet" type="text/css" href="../_static/css/theme.css?v=19f00094" />
<!--[if lt IE 9]>
<script src="../_static/js/html5shiv.min.js"></script>
<![endif]-->
<script src="../_static/jquery.js?v=5d32c60e"></script>
<script src="../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
<script src="../_static/documentation_options.js?v=5929fcd5"></script>
<script src="../_static/doctools.js?v=888ff710"></script>
<script src="../_static/sphinx_highlight.js?v=dc90522c"></script>
<script src="../_static/js/theme.js"></script>
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="../index.html" class="icon icon-home">
tensorrt_llm
</a>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
<p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../architecture.html">TensorRT-LLM Architecture</a></li>
<li class="toctree-l1"><a class="reference internal" href="../gpt_runtime.html">C++ GPT Runtime</a></li>
<li class="toctree-l1"><a class="reference internal" href="../batch_manager.html">The Batch Manager in TensorRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="../inference_request.html">Inference Request</a></li>
<li class="toctree-l1"><a class="reference internal" href="../gpt_attention.html">Multi-head, Multi-query and Group-query Attention</a></li>
<li class="toctree-l1"><a class="reference internal" href="../precision.html">Numerical Precision</a></li>
<li class="toctree-l1"><a class="reference internal" href="../build_from_source.html">Build from Source</a></li>
<li class="toctree-l1"><a class="reference internal" href="../performance.html">Performance of TensorRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="../2023-05-19-how-to-debug.html">How to debug</a></li>
<li class="toctree-l1"><a class="reference internal" href="../2023-05-17-how-to-add-a-new-model.html">How to add a new model</a></li>
<li class="toctree-l1"><a class="reference internal" href="../graph-rewriting.html">Graph Rewriting Module</a></li>
<li class="toctree-l1"><a class="reference internal" href="../memory.html">Memory Usage of TensorRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="../new_workflow.html">New Workflow</a></li>
<li class="toctree-l1"><a class="reference internal" href="../lora.html">Run gpt-2b + LoRA using GptManager / cpp runtime</a></li>
<li class="toctree-l1"><a class="reference internal" href="../perf_best_practices.html">Best Practices for Tuning the Performance of TensorRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="../performance_analysis.html">Performance Analysis of TensorRT-LLM</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Python API</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../python-api/tensorrt_llm.layers.html">Layers</a></li>
<li class="toctree-l1"><a class="reference internal" href="../python-api/tensorrt_llm.functional.html">Functionals</a></li>
<li class="toctree-l1"><a class="reference internal" href="../python-api/tensorrt_llm.models.html">Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="../python-api/tensorrt_llm.plugin.html">Plugin</a></li>
<li class="toctree-l1"><a class="reference internal" href="../python-api/tensorrt_llm.quantization.html">Quantization</a></li>
<li class="toctree-l1"><a class="reference internal" href="../python-api/tensorrt_llm.runtime.html">Runtime</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">C++ API</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../_cpp_gen/runtime.html">Runtime</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Blogs</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="H100vsA100.html">H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token</a></li>
<li class="toctree-l1"><a class="reference internal" href="H200launch.html">H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="Falcon180B-H200.html">Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100</a></li>
<li class="toctree-l1"><a class="reference internal" href="quantization-in-TRT-LLM.html">Speed up inference with SOTA quantization techniques in TRT-LLM</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">tensorrt_llm</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="Page navigation">
<ul class="wy-breadcrumbs">
<li><a href="../index.html" class="icon icon-home" aria-label="Home"></a></li>
<li class="breadcrumb-item active">New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget</li>
<li class="wy-breadcrumbs-aside">
<a href="../_sources/blogs/XQA-kernel.md.txt" rel="nofollow"> View page source</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<section id="new-xqa-kernel-provides-2-4x-more-llama-70b-throughput-within-the-same-latency-budget">
<h1>New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget<a class="headerlink" href="#new-xqa-kernel-provides-2-4x-more-llama-70b-throughput-within-the-same-latency-budget" title="Link to this heading"></a></h1>
<p>The XQA kernel optimizes <a class="reference external" href="https://arxiv.org/abs/1911.02150">MQA</a> and <a class="reference external" href="https://arxiv.org/abs/2305.13245v3">GQA</a> during the generation phase, and it also accelerates beam search. By using Tensor Cores and reducing the amount of data that must be loaded and converted, it delivers increased throughput within the same latency budget, so a greater number of user requests can be served while providing the same experience.</p>
<p>The support matrix and usage flags are described in <a class="reference internal" href="../gpt_attention.html#xqa-optimization"><span class="xref myst">docs/source/gpt_attention</span></a>.</p>
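<p>For intuition about why GQA benefits from a specialized generation-phase kernel, the sketch below (a plain NumPy illustration, not TensorRT-LLM code) shows a single grouped-query-attention decode step: several query heads share one K/V head, so far less K/V-cache data has to be loaded per generated token, which is the access pattern XQA accelerates. The shapes are Llama-70B-like (64 query heads, 8 K/V heads, head size 128) and purely illustrative.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# Conceptual sketch of grouped-query attention for one generation step.
# Several query heads map onto one shared K/V head, so far less K/V-cache
# data has to be read per generated token than with full multi-head attention.
import numpy as np

def gqa_decode_step(q, k_cache, v_cache, n_kv_heads):
    """q: [n_q_heads, head_dim]; k_cache, v_cache: [n_kv_heads, seq_len, head_dim]."""
    n_q_heads, head_dim = q.shape
    group = n_q_heads // n_kv_heads            # query heads per shared K/V head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                        # query head h reads K/V head kv
        scores = k_cache[kv] @ q[h] / np.sqrt(head_dim)   # [seq_len]
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        out[h] = probs @ v_cache[kv]           # [head_dim]
    return out

# Llama-70B-like shapes: 64 query heads share 8 K/V heads, head size 128.
q = np.random.randn(64, 128).astype(np.float32)
k = np.random.randn(8, 512, 128).astype(np.float32)
v = np.random.randn(8, 512, 128).astype(np.float32)
print(gqa_decode_step(q, k, v, n_kv_heads=8).shape)       # (64, 128)
</pre></div></div>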
<p><strong>Increased Throughput:</strong>
The throughput-latency curves below show that enabling the XQA optimization increases throughput. Higher throughput equates to serving more users, and the time per output token (TPOT) on the Y-axis flattens out once XQA is enabled.</p>
<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/XQA_ThroughputvsLatency.png?raw=true" alt="XQA increased throughput within same latency budget" width="950" height="auto">
<p><sub>Preliminary measured performance, subject to change. TPOT: lower is better. FP8, 8x H100 GPUs, single engine, ISL/OSL: 512/2048, BS: 1-256, TensorRT-LLM v0.8a</sub></p>
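<p>As a rough mental model (illustrative numbers only, not measured data), per-GPU throughput during the generation phase scales with the number of in-flight requests divided by TPOT, so a curve whose TPOT stays flat as batch size grows translates directly into more tokens per second within the same latency budget:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# Back-of-envelope relation between TPOT and per-GPU throughput in the
# generation phase; the numbers below are illustrative, not measurements.
def tokens_per_sec_per_gpu(batch_size, tpot_ms, num_gpus):
    # Each in-flight request emits roughly one token every TPOT seconds.
    return batch_size / (tpot_ms / 1000.0) / num_gpus

# If TPOT stays roughly flat while batch size grows (what the XQA curve shows),
# throughput grows almost linearly with batch size at the same latency.
for bs in (32, 64, 128, 256):
    print(bs, round(tokens_per_sec_per_gpu(bs, tpot_ms=30.0, num_gpus=8), 1))
</pre></div></div>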
<section id="llama-70b-on-h200-up-to-2-4x-increased-throughput-with-xqa-within-same-latency-budget">
<h2>Llama-70B on H200: up to 2.4x increased throughput with XQA within the same latency budget<a class="headerlink" href="#llama-70b-on-h200-up-to-2-4x-increased-throughput-with-xqa-within-same-latency-budget" title="Link to this heading"></a></h2>
<p><strong>H200 2.4x with XQA</strong></p>
<table class="docutils align-default">
<thead>
<tr class="row-odd"><th class="head text-left"><p>Model</p></th>
<th class="head text-left"><p>GPUs</p></th>
<th class="head text-left"><p>Input Length</p></th>
<th class="head text-left"><p>Output Length</p></th>
<th class="head text-left"><p>Throughput w/o XQA (tok/s/GPU)</p></th>
<th class="head text-left"><p>Throughput w/ XQA (tok/s/GPU)</p></th>
<th class="head text-left"><p>Speedup</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td class="text-left"><p>Llama-70B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-left"><p>1,227</p></td>
<td class="text-left"><p>2,941</p></td>
<td class="text-left"><p>2.4x</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p></p></td>
<td class="text-left"><p>8</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-left"><p>13,232</p></td>
<td class="text-left"><p>25,300</p></td>
<td class="text-left"><p>1.9x</p></td>
</tr>
</tbody>
</table>
<section id="closing">
<h3>Closing<a class="headerlink" href="#closing" title="Link to this heading"></a></h3>
<p>These improvements will be published in the <code class="docutils literal notranslate"><span class="pre">main</span></code> branch soon and will be included in the v0.8 release.</p>
<p>For more information about H200, please see the <a class="reference internal" href="H200launch.html"><span class="std std-doc">H200 announcement blog</span></a>.</p>
<p>Throughput is calculated as output tokens per second per GPU:
<code class="docutils literal notranslate"><span class="pre">out_tps=output_seqlen*batch_size/total_latency/tp</span></code></p>
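<p>A worked example of that formula (the values below are illustrative, not the measurements reported in the table above):</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# out_tps = output_seqlen * batch_size / total_latency / tp
# Illustrative values only; they are not the measured results reported above.
output_seqlen = 2048       # OSL: tokens generated per request
batch_size    = 64         # concurrent requests
total_latency = 45.0       # end-to-end latency for the batch, in seconds
tp            = 8          # tensor-parallel GPU count

out_tps = output_seqlen * batch_size / total_latency / tp
print(f"{out_tps:.0f} output tokens/s/GPU")   # ~364
</pre></div></div>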
<p><sub><strong>Glossary:</strong>
DP = Data Parallel |
ISL = Input Sequence Length |
PP = Pipeline Parallel |
OSL = Output Sequence Length |
OOM = Out of Memory |
TP = Tensor Parallel</sub></p>
</section>
</section>
</section>
</div>
</div>
<footer>
<hr/>
<div role="contentinfo">
<p>&#169; Copyright 2023, NVIDIA.</p>
</div>
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script>
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>