<!DOCTYPE html>
<html class="writer-html5" lang="en" data-content_root="../">
<head>
<meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token &mdash; tensorrt_llm documentation</title>
<link rel="stylesheet" type="text/css" href="../_static/pygments.css?v=80d5e7a1" />
<link rel="stylesheet" type="text/css" href="../_static/css/theme.css?v=19f00094" />
<!--[if lt IE 9]>
<script src="../_static/js/html5shiv.min.js"></script>
<![endif]-->
<script src="../_static/jquery.js?v=5d32c60e"></script>
<script src="../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
<script src="../_static/documentation_options.js?v=5929fcd5"></script>
<script src="../_static/doctools.js?v=9a2dae69"></script>
<script src="../_static/sphinx_highlight.js?v=dc90522c"></script>
<script src="../_static/js/theme.js"></script>
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM" href="H200launch.html" />
<link rel="prev" title="Runtime" href="../_cpp_gen/runtime.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="../index.html" class="icon icon-home">
tensorrt_llm
</a>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
<p class="caption" role="heading"><span class="caption-text">Getting Started</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../overview.html">Overview</a></li>
<li class="toctree-l1"><a class="reference internal" href="../quick-start-guide.html">Quick Start Guide</a></li>
<li class="toctree-l1"><a class="reference internal" href="../release-notes.html">Release Notes</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Installation</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../installation/linux.html">Installing on Linux</a></li>
<li class="toctree-l1"><a class="reference internal" href="../installation/build-from-source-linux.html">Building from Source Code on Linux</a></li>
<li class="toctree-l1"><a class="reference internal" href="../installation/windows.html">Installing on Windows</a></li>
<li class="toctree-l1"><a class="reference internal" href="../installation/build-from-source-windows.html">Building from Source Code on Windows</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Architecture</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../architecture/overview.html">TensorRT-LLM Architecture</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/core-concepts.html">Model Definition</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/core-concepts.html#compilation">Compilation</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/core-concepts.html#runtime">Runtime</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/core-concepts.html#multi-gpu-and-multi-node-support">Multi-GPU and Multi-Node Support</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/checkpoint.html">TensorRT-LLM Checkpoint</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/workflow.html">TensorRT-LLM Build Workflow</a></li>
<li class="toctree-l1"><a class="reference internal" href="../architecture/add-model.html">Adding a Model</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Advanced</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../advanced/gpt-attention.html">Multi-Head, Multi-Query, and Group-Query Attention</a></li>
<li class="toctree-l1"><a class="reference internal" href="../advanced/gpt-runtime.html">C++ GPT Runtime</a></li>
<li class="toctree-l1"><a class="reference internal" href="../advanced/graph-rewriting.html">Graph Rewriting Module</a></li>
<li class="toctree-l1"><a class="reference internal" href="../advanced/batch-manager.html">The Batch Manager in TensorRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="../advanced/inference-request.html">Inference Request</a></li>
<li class="toctree-l1"><a class="reference internal" href="../advanced/inference-request.html#responses">Responses</a></li>
<li class="toctree-l1"><a class="reference internal" href="../advanced/lora.html">Run gpt-2b + LoRA using GptManager / cpp runtime</a></li>
<li class="toctree-l1"><a class="reference internal" href="../advanced/expert-parallelism.html">Expert Parallelism in TensorRT-LLM</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Performance</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../performance/perf-overview.html">Overview</a></li>
<li class="toctree-l1"><a class="reference internal" href="../performance/perf-best-practices.html">Best Practices for Tuning the Performance of TensorRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="../performance/perf-analysis.html">Performance Analysis</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Reference</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../reference/troubleshooting.html">Troubleshooting</a></li>
<li class="toctree-l1"><a class="reference internal" href="../reference/support-matrix.html">Support Matrix</a></li>
<li class="toctree-l1"><a class="reference internal" href="../reference/precision.html">Numerical Precision</a></li>
<li class="toctree-l1"><a class="reference internal" href="../reference/memory.html">Memory Usage of TensorRT-LLM</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">C++ API</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../_cpp_gen/executor.html">Executor</a></li>
<li class="toctree-l1"><a class="reference internal" href="../_cpp_gen/runtime.html">Runtime</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Blogs</span></p>
<ul class="current">
<li class="toctree-l1 current"><a class="current reference internal" href="#">H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#mlperf-on-h100-with-fp8">MLPerf on H100 with FP8</a></li>
<li class="toctree-l2"><a class="reference internal" href="#what-is-h100-fp8">What is H100 FP8?</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="H200launch.html">H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="Falcon180B-H200.html">Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100</a></li>
<li class="toctree-l1"><a class="reference internal" href="quantization-in-TRT-LLM.html">Speed up inference with SOTA quantization techniques in TRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="XQA-kernel.html">New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">tensorrt_llm</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="Page navigation">
<ul class="wy-breadcrumbs">
<li><a href="../index.html" class="icon icon-home" aria-label="Home"></a></li>
<li class="breadcrumb-item active">H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token</li>
<li class="wy-breadcrumbs-aside">
<a href="../_sources/blogs/H100vsA100.md.txt" rel="nofollow"> View page source</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<blockquote>
<div><p><em>The NVIDIA H200 has been announced and is optimized in TensorRT-LLM. Learn more about the H200, and how it compares with the H100, here:</em> <a class="reference internal" href="H200launch.html"><span class="std std-doc"><strong>H200</strong> achieves nearly <strong>12,000 tokens/sec on Llama2-13B</strong> with TensorRT-LLM</span></a></p>
</div></blockquote>
<section id="h100-has-4-6x-a100-performance-in-tensorrt-llm-achieving-10-000-tok-s-at-100ms-to-first-token">
<h1>H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token<a class="headerlink" href="#h100-has-4-6x-a100-performance-in-tensorrt-llm-achieving-10-000-tok-s-at-100ms-to-first-token" title="Link to this heading"></a></h1>
<p>TensorRT-LLM evaluated on both Hopper and Ampere shows that <strong>H100 FP8 delivers up to 4.6x the max throughput and 4.4x faster first-token latency of A100 FP16</strong>. H100 FP8 achieves over 10,000 output tok/s at <a class="reference external" href="https://nvidia.github.io/TensorRT-LLM/performance.html#h100-gpus-fp8">peak throughput</a> for 64 concurrent requests while maintaining a first-token latency of 100ms. For <a class="reference external" href="https://nvidia.github.io/TensorRT-LLM/performance.html#id1">min-latency</a> applications, TensorRT-LLM on H100 achieves a first-token latency below 10ms.</p>
<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/TRT_LLM_v0-5-0_H100vA100_tps.png?raw=true" alt="max throughput" width="500" height="auto">
<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/TRT_LLM_v0-5-0_H100vA100_1st.png?raw=true" alt="1st token latency" width="500" height="auto">
<p><sub>TensorRT-LLM throughput &amp; first-token latency on H100 &amp; A100. H100 FP8, A100 FP16, SXM 80GB GPUs, ISL/OSLs as listed, TP=1, BS=32/64 for max throughput, BS=1 for first-token latency. TensorRT-LLM v0.5.0, TensorRT 9.1. </sub>
<sub>Max throughput calculated by sweeping BS 1, 2, …, 64; throughput taken at the largest batch size that completed successfully.</sub></p>
<p><strong>Max Throughput &amp; Min Latency</strong></p>
<table class="docutils align-default">
<thead>
<tr class="row-odd"><th class="head text-left"><p>Model</p></th>
<th class="head text-left"><p>Batch Size</p></th>
<th class="head text-left"><p>Input Length</p></th>
<th class="head text-left"><p>Output Length</p></th>
<th class="head text-right"><p>Throughput (out tok/s)</p></th>
<th class="head text-right"><p>1st Token Latency (ms)</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td class="text-left"><p><strong>H100</strong></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-right"><p></p></td>
<td class="text-right"><p></p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p><strong>10,907</strong></p></td>
<td class="text-right"><p>102</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>-</p></td>
<td class="text-right"><p>185</p></td>
<td class="text-right"><p><strong>7.1</strong></p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p><strong>A100</strong></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-right"><p></p></td>
<td class="text-right"><p></p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>3,679</p></td>
<td class="text-right"><p>481</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>-</p></td>
<td class="text-right"><p>111</p></td>
<td class="text-right"><p>12.5</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p><strong>Speedup</strong></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-right"><p></p></td>
<td class="text-right"><p></p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p><strong>3.0x</strong></p></td>
<td class="text-right"><p><strong>4.7x</strong></p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>-</p></td>
<td class="text-right"><p><strong>2.4x</strong></p></td>
<td class="text-right"><p>1.7x</p></td>
</tr>
</tbody>
</table>
<p><sub>FP8 H100, FP16 A100, SXM 80GB GPUs, TP=1, ISL/OSLs as listed, TensorRT-LLM v0.5.0, TensorRT 9.1</sub></p>
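<p>As a rough illustration of the sweep described above (not the harness used to produce these numbers), the sketch below measures output tokens per second at increasing batch sizes and reports the figure at the largest batch size that completes. The <code>run_batch</code> callable is a hypothetical stand-in for a real TensorRT-LLM inference call.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>import time

def max_throughput(run_batch, batch_sizes=(1, 2, 4, 8, 16, 32, 64),
                   output_len=128):
    """Output tok/s at the largest batch size that completes."""
    tok_s = 0.0
    for bs in batch_sizes:
        try:
            start = time.perf_counter()
            run_batch(batch_size=bs, output_len=output_len)  # hypothetical call
            elapsed = time.perf_counter() - start
        except RuntimeError:
            break  # e.g. out of memory; keep the last successful result
        tok_s = bs * output_len / elapsed  # generated tokens per second
    return tok_s
</pre></div></div>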
<p>The full data behind these charts and tables, including larger models with higher TP values, can be found in TensorRT-LLM's <a class="reference external" href="https://nvidia.github.io/TensorRT-LLM/performance.html#performance-of-tensorrt-llm">Performance Documentation</a>.</p>
<p>Stay tuned for a highlight on Llama coming soon!</p>
<section id="mlperf-on-h100-with-fp8">
<h2>MLPerf on H100 with FP8<a class="headerlink" href="#mlperf-on-h100-with-fp8" title="Link to this heading"></a></h2>
<p>In the most recent MLPerf results, NVIDIA demonstrated up to a 4.5x speedup in model inference performance on the NVIDIA H100 compared to previous results on the NVIDIA A100 Tensor Core GPU. At the same precision, H100 showed a 2x increase over A100, and switching to FP8 delivered a further 2x; the two factors compound to the overall speedup.</p>
</section>
<section id="what-is-h100-fp8">
<h2>What is H100 FP8?<a class="headerlink" href="#what-is-h100-fp8" title="Link to this heading"></a></h2>
<p>H100 is NVIDIA's next-generation, highest-performing data center GPU. Based on the NVIDIA Hopper GPU architecture, H100 accelerates AI training and inference, HPC, and data analytics applications in cloud data centers, servers, systems at the edge, and workstations. With native support for FP8 data types, H100 can double performance and halve memory consumption compared to 16-bit floating-point options on the same GPU.</p>
<p>The FP8 specification, introduced in the paper <a class="reference external" href="https://arxiv.org/abs/2209.05433">FP8 Formats for Deep Learning</a>, can be used to speed up training, as well as inference with post-training quantization of models trained using 16-bit formats. The specification consists of two encodings: E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). The recommended use of the FP8 encodings is E4M3 for weight and activation tensors and E5M2 for gradient tensors.</p>
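<p>As a minimal sketch of the two encodings, assuming the bit layouts and biases from the paper and omitting the NaN/Inf special patterns, the following Python decodes an FP8 byte into a float:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>def fp8_to_float(byte, exp_bits, man_bits):
    """Decode one FP8 byte; E4M3 -&gt; (4, 3), E5M2 -&gt; (5, 2)."""
    bias = (1 &lt;&lt; (exp_bits - 1)) - 1          # 7 for E4M3, 15 for E5M2
    sign = -1.0 if (byte &gt;&gt; 7) &amp; 1 else 1.0
    exp = (byte &gt;&gt; man_bits) &amp; ((1 &lt;&lt; exp_bits) - 1)
    man = byte &amp; ((1 &lt;&lt; man_bits) - 1)
    if exp == 0:  # subnormal: no implicit leading 1
        return sign * (man / (1 &lt;&lt; man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 &lt;&lt; man_bits)) * 2.0 ** (exp - bias)

print(fp8_to_float(0b01111110, 4, 3))  # 448.0, the E4M3 max normal
print(fp8_to_float(0b01111011, 5, 2))  # 57344.0, the E5M2 max normal
</pre></div></div>
<p>The asymmetry is visible in the two maxima: E4M3 trades range for precision, which suits weights and activations, while E5M2's wider exponent suits the larger dynamic range of gradients.</p>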
<p>In practice, FP8 can improve perceived performance on H100 (FP8 vs. FP16) by more than 2x. FP8 is a W8A8 format, meaning both the weights and the activations (the compute) are stored in 8 bits. 8-bit weights cut GPU memory consumption and bandwidth, so a larger model, a longer sequence length, or a bigger batch size can fit on the same GPU. This can enable new use cases, and the larger max batch size can push max throughput beyond 2x of FP16 on H100, as the quick check below suggests.</p>
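<p>A back-of-the-envelope check of the weight-memory halving for the GPT-J 6B model used above (weights only; KV cache, activations, and runtime overheads are ignored):</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>params = 6e9  # approximate GPT-J 6B parameter count
for fmt, bytes_per_weight in [("FP16", 2), ("FP8", 1)]:
    gib = params * bytes_per_weight / 2**30
    print(f"{fmt}: {gib:.1f} GiB of weights")
# FP16: 11.2 GiB of weights
# FP8: 5.6 GiB of weights
</pre></div></div>
<p>The memory freed by FP8 weights is what allows the larger batch sizes, and hence the above-2x throughput gains, described in this section.</p>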
</section>
</section>
</div>
</div>
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
<a href="../_cpp_gen/runtime.html" class="btn btn-neutral float-left" title="Runtime" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
<a href="H200launch.html" class="btn btn-neutral float-right" title="H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
</div>
<hr/>
<div role="contentinfo">
<div class="footer">
<p>
Copyright © 2024 NVIDIA Corporation
</p>
<p>
<a class="Link" href="https://www.nvidia.com/en-us/about-nvidia/privacy-policy/" target="_blank" rel="noopener"
data-cms-ai="0">Privacy Policy</a> |
<a class="Link" href="https://www.nvidia.com/en-us/about-nvidia/privacy-center/" target="_blank" rel="noopener"
data-cms-ai="0">Manage My Privacy</a> |
<a class="Link" href="https://www.nvidia.com/en-us/preferences/start/" target="_blank" rel="noopener"
data-cms-ai="0">Do Not Sell or Share My Data</a> |
<a class="Link" href="https://www.nvidia.com/en-us/about-nvidia/terms-of-service/" target="_blank"
rel="noopener" data-cms-ai="0">Terms of Service</a> |
<a class="Link" href="https://www.nvidia.com/en-us/about-nvidia/accessibility/" target="_blank" rel="noopener"
data-cms-ai="0">Accessibility</a> |
<a class="Link" href="https://www.nvidia.com/en-us/about-nvidia/company-policies/" target="_blank"
rel="noopener" data-cms-ai="0">Corporate Policies</a> |
<a class="Link" href="https://www.nvidia.com/en-us/product-security/" target="_blank" rel="noopener"
data-cms-ai="0">Product Security</a> |
<a class="Link" href="https://www.nvidia.com/en-us/contact/" target="_blank" rel="noopener"
data-cms-ai="0">Contact</a>
</p>
</div>
</div>
</footer>
</div>
</div>
</section>
</div>
<script>
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>