<section id="h100-has-4-6x-a100-performance-in-tensorrt-llm-achieving-10-000-tok-s-at-100ms-to-first-token">
|
||
<h1>H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token<a class="headerlink" href="#h100-has-4-6x-a100-performance-in-tensorrt-llm-achieving-10-000-tok-s-at-100ms-to-first-token" title="Link to this heading"></a></h1>
TensorRT-LLM evaluated on both Hopper and Ampere shows that **H100 FP8 delivers up to 4.6x the max throughput and 4.4x faster first-token latency compared to A100**. H100 FP8 achieves over 10,000 output tok/s at [peak throughput](https://nvidia.github.io/TensorRT-LLM/performance.html#h100-gpus-fp8) for 64 concurrent requests, while maintaining a first-token latency of 100ms. For [min-latency](https://nvidia.github.io/TensorRT-LLM/performance.html#id1) applications, TRT-LLM on H100 achieves a first-token latency of under 10ms.
![max throughput](https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/TRT_LLM_v0-5-0_H100vA100_tps.png?raw=true)

![1st token latency](https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/TRT_LLM_v0-5-0_H100vA100_1st.png?raw=true)
<sub>TensorRT-LLM throughput & first-token latency on H100 & A100. H100 FP8, A100 FP16, SXM 80GB GPUs, ISL/OSL as listed, TP=1, BS=32/64 for max throughput, BS=1 for first-token latency. TensorRT-LLM v0.5.0, TensorRT 9.1. Max throughput measured by sweeping BS 1, 2, …, 64; the reported number is the throughput at the largest batch size that ran successfully.</sub>
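For readers who want to reproduce the sweep methodology, here is a minimal sketch of the logic described in the note above. The `run_benchmark` helper is hypothetical, standing in for whatever TensorRT-LLM engine invocation and timing code you use; only the sweep-and-keep-largest-successful logic mirrors the stated methodology.

```python
# Minimal sketch of the max-throughput sweep described above.
# `run_benchmark` is a hypothetical stand-in for an actual TensorRT-LLM
# benchmark invocation; it should return output tok/s or raise on failure.

def run_benchmark(batch_size: int, input_len: int, output_len: int) -> float:
    raise NotImplementedError  # plug in your engine build + timing code here

def max_throughput(input_len: int = 128, output_len: int = 128,
                   max_bs: int = 64) -> tuple[int, float]:
    best_bs, best_tps = 0, 0.0
    for bs in range(1, max_bs + 1):   # sweep BS 1, 2, ..., 64
        try:
            tps = run_benchmark(bs, input_len, output_len)
        except Exception:             # e.g. out-of-memory: stop the sweep
            break
        best_bs, best_tps = bs, tps   # keep the largest successful batch size
    return best_bs, best_tps
```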
**Max Throughput & Min Latency**
| Model | Batch Size | Input Length | Output Length | Throughput (out tok/s) | 1st Token Latency (ms) |
|:---|:---|:---|:---|---:|---:|
| **H100** | | | | | |
| GPT-J 6B | 64 | 128 | 128 | **10,907** | 102 |
| GPT-J 6B | 1 | 128 | - | 185 | **7.1** |
| **A100** | | | | | |
| GPT-J 6B | 64 | 128 | 128 | 3,679 | 481 |
| GPT-J 6B | 1 | 128 | - | 111 | 12.5 |
| **Speedup** | | | | | |
| GPT-J 6B | 64 | 128 | 128 | **3.0x** | **4.7x** |
| GPT-J 6B | 1 | 128 | - | **2.4x** | 1.7x |
<sub>FP8 H100, FP16 A100, SXM 80GB GPUs, TP=1, ISL/OSL as listed, TensorRT-LLM v0.5.0, TensorRT 9.1</sub>
The full data behind these charts and tables, including larger models at higher TP values, can be found in TensorRT-LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/performance.html#performance-of-tensorrt-llm).
Stay tuned for a highlight on Llama coming soon!
## MLPerf on H100 with FP8
In the most recent MLPerf results, NVIDIA demonstrated up to a 4.5x speedup in model inference performance on the NVIDIA H100 compared to previous results on the NVIDIA A100 Tensor Core GPU. Using the same data types, H100 showed a 2x increase over A100; switching to FP8 roughly doubled that again.
## What is H100 FP8?
H100 is NVIDIA's next-generation, highest-performing data center GPU. Based on the NVIDIA Hopper GPU architecture, H100 accelerates AI training and inference, HPC, and data analytics applications in cloud data centers, servers, edge systems, and workstations. With native support for FP8 data types, H100 can double performance and halve memory consumption compared to 16-bit floating-point formats on the same GPU.
The FP8 specification, introduced in the paper [FP8 Formats for Deep Learning](https://arxiv.org/abs/2209.05433), can be used to speed up training as well as inference via post-training quantization of models trained in 16-bit formats. The specification consists of two encodings: E4M3 (4-bit exponent, 3-bit mantissa) and E5M2 (5-bit exponent, 2-bit mantissa). The recommended use is E4M3 for weight and activation tensors and E5M2 for gradient tensors.
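To make the two encodings concrete, here is a small decoding sketch based on the bit layouts in that paper (E4M3: bias 7, no infinities, a single NaN pattern; E5M2: bias 15 with IEEE-style infinities and NaNs). It is illustrative only; in practice the hardware and the quantization toolchain handle conversion.

```python
def decode_fp8(byte: int, fmt: str = "e4m3") -> float:
    """Decode one raw FP8 byte. E4M3 = 1s/4e/3m, bias 7; E5M2 = 1s/5e/2m, bias 15."""
    exp_bits, man_bits, bias = (4, 3, 7) if fmt == "e4m3" else (5, 2, 15)
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    if fmt == "e4m3" and exp == 15 and man == 7:
        return float("nan")                  # E4M3 reserves only this NaN pattern
    if fmt == "e5m2" and exp == 31:          # E5M2 keeps IEEE-style specials
        return sign * float("inf") if man == 0 else float("nan")
    if exp == 0:                             # subnormal: no implicit leading 1
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

print(decode_fp8(0x7E))           # 448.0 -- the E4M3 maximum normal value
print(decode_fp8(0x7B, "e5m2"))   # 57344.0 -- the E5M2 maximum normal value
```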
In practice, FP8 can improve observed performance on H100 (FP8 vs. FP16) by more than 2x. FP8 is a W8A8 format: the weights are stored in 8 bits, and the activations, i.e. the compute, are 8-bit as well. 8-bit weights roughly halve GPU weight memory consumption and bandwidth, so a larger model, sequence length, or batch size can fit on the same GPU. This can enable new use cases, and a larger max batch size can push max throughput beyond 2x of FP16 H100.
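As a rough back-of-the-envelope illustration of the memory side (nominal parameter count, ignoring the KV cache, activations, and runtime overhead):

```python
# Approximate weight memory for a ~6B-parameter model such as GPT-J 6B.
params = 6e9
for name, bytes_per_param in [("FP16", 2), ("FP8", 1)]:
    print(f"{name} weights: ~{params * bytes_per_param / 2**30:.1f} GiB")
# FP16 weights: ~11.2 GiB
# FP8 weights: ~5.6 GiB
```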