Update latest news (#549)

commit e093e48459 (parent 71f60f6df0)

README.md (21 lines changed)
@@ -17,24 +17,29 @@ TensorRT-LLM

<div align="left">

## Latest News
* [2023/11/13] [**H200** achieves nearly **12,000 tok/sec on Llama2-13B**](./docs/source/blogs/H200launch.md)
* [2023/12/04] [**Falcon-180B** on a **single H200** GPU with INT4 AWQ, and **6.7x faster Llama-70B** over A100](./docs/source/blogs/H200launch.md)

<img src="./docs/source/blogs/media/H200launch_tps.png" alt="H200 TPS" width="500" height="auto">
<img src="./docs/source/blogs/media/Falcon180B-H200_H200vA100.png" alt="H200 TPS" width="400" height="auto">

H200 FP8 achieves 11,819 tok/s on Llama2-13B on a single GPU, and is up to 1.9x faster than H100.
H200 with INT4 AWQ runs Falcon-180B on a _single_ GPU.
H200 is now 2.4x faster on Llama-70B with recent improvements to TensorRT-LLM GQA; up to 6.7x faster than A100.

* [2023/11/03] [TensorRT-LLM is up to **4.6x faster on H100 than A100**, achieving **10,000 tok/s at 100ms to first token.**](./docs/source/blogs/H100vsA100.md)
* [2023/11/27] [SageMaker LMI now supports TensorRT-LLM - improves throughput by 60%, compared to previous version](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-llms-with-new-amazon-sagemaker-containers/)
* [2023/11/13] [H200 achieves nearly 12,000 tok/sec on Llama2-13B](./docs/source/blogs/H200launch.md)
* [2023/10/22] [🚀 RAG on Windows using TensorRT-LLM and LlamaIndex 🦙](https://github.com/NVIDIA/trt-llm-rag-windows#readme)
* [2023/10/19] Getting Started Guide - [Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available](https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/)
* [2023/10/17] [Large Language Models up to 4x Faster on RTX With TensorRT-LLM for Windows](https://blogs.nvidia.com/blog/2023/10/17/tensorrt-llm-windows-stable-diffusion-rtx/)
* [2023/9/9] [NVIDIA TensorRT-LLM Supercharges Large Language Model Inference on NVIDIA H100 GPUs](https://developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus/)

[2023/10/31 - Phind](https://www.phind.com/blog/phind-model-beats-gpt4-fast) ; [2023/10/12 - Databricks (MosaicML)](https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices) ;
[2023/10/4 - Perplexity](https://blog.perplexity.ai/blog/introducing-pplx-api) ;
[2023/9/27 - CloudFlare](https://www.cloudflare.com/press-releases/2023/cloudflare-powers-hyper-local-ai-inference-with-nvidia/);

[2023/11/27 - Amazon Sagemaker](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-llms-with-new-amazon-sagemaker-containers/)
[2023/11/17 - Perplexity](https://blog.perplexity.ai/blog/turbocharging-llama-2-70b-with-nvidia-h100) ;
[2023/10/31 - Phind](https://www.phind.com/blog/phind-model-beats-gpt4-fast) ;
[2023/10/12 - Databricks (MosaicML)](https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices) ;
[2023/10/04 - Perplexity](https://blog.perplexity.ai/blog/introducing-pplx-api) ;
[2023/09/27 - CloudFlare](https://www.cloudflare.com/press-releases/2023/cloudflare-powers-hyper-local-ai-inference-with-nvidia/);

## Table of Contents

docs/source/blogs/Falcon180B-H200.md (new file, 133 lines)

@@ -0,0 +1,133 @@

# Falcon-180B on a *single* H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100

H200's large capacity and high memory bandwidth, paired with TensorRT-LLM's optimizations, maximize inference performance.

## Falcon-180B on a single H200 with INT4 AWQ

[Falcon-180B](https://huggingface.co/tiiuae/falcon-180B), one of the largest and most accurate open-source models available, can run on a *single* H200 GPU.

The 141 GB of memory on H200, paired with TensorRT-LLM running INT4 AWQ with FP8, allows the entire large language model to fit on a single GPU, where previously eight A100s were required. Falcon-180B on H200 provides up to **800** tok/s while retaining high accuracy.
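
As a rough back-of-the-envelope sketch of why this fits (weights only; KV cache, activations, and runtime buffers are ignored, and the parameter count is approximate):

```python
# Approximate weights-only memory footprint of Falcon-180B.
params = 180e9                        # ~180B parameters (approximate)
h200_hbm_gb = 141                     # H200 HBM capacity

int4_gb = params * 0.5 / 1e9          # 4-bit weights: ~90 GB -> fits on one H200
fp16_gb = params * 2.0 / 1e9          # FP16 weights: ~360 GB -> needs several GPUs

print(f"INT4: ~{int4_gb:.0f} GB, FP16: ~{fp16_gb:.0f} GB, H200 HBM: {h200_hbm_gb} GB")
```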

**Model Performance:**
H200's large capacity and high memory bandwidth, combined with INT4 AWQ to reduce the model's memory footprint, deliver strong Falcon-180B performance on a single GPU.

<img src="./media/Falcon180B-H200_tps.png" alt="Falcon-180B performance comparison" width="450" height="auto">

<sup>Preliminary measured performance, subject to change. TP1 does not represent peak performance on H200.</sup>
<sup>TensorRT-LLM v0.7a | Falcon-180B | 1xH200 TP1 | INT4 AWQ | BS: (in order) 256, 128</sup>

**Model Accuracy:**
Quantization can often degrade model accuracy; however, TensorRT-LLM's AWQ reduces the model's memory footprint by **4x** while maintaining high accuracy.

<img src="./media/Falcon180B-H200_acc.png" alt="Falcon-180B accuracy comparison" width="600" height="auto">

<sup>Preliminary measured accuracy, subject to change.</sup>
<sup>TensorRT-LLM v0.7a | Falcon-180B | 1xH200 TP1 | INT4 AWQ</sup>

[**INT4 Activation-aware Weight Quantization (AWQ)**](https://arxiv.org/abs/2306.00978) (Lin et al., 2023) is a quantization technique that compresses the weights of an LLM down to 4 bits based on their relative importance and performs computation in FP16. This allows AWQ to retain higher accuracy than other 4-bit methods and to reduce memory usage, but it requires special kernels capable of handling the change in precision performantly.

TensorRT-LLM has implemented custom kernels for AWQ, and takes the technique a step further by performing FP8 computation on Hopper GPUs instead of the standard FP16.

Similar examples running Falcon-180B with quantization in TensorRT-LLM are available in [examples/falcon](/examples/falcon).
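
To make the mechanism concrete, here is a toy sketch of weight-only 4-bit group-wise quantization in plain PyTorch. It is illustrative only: real AWQ additionally applies activation-aware per-channel scaling before quantizing, packs two 4-bit values per byte, and TensorRT-LLM runs the matmuls in custom FP16/FP8 kernels.

```python
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 128):
    """Toy symmetric 4-bit group-wise quantization of a [out, in] weight matrix."""
    out_f, in_f = w.shape
    w_g = w.reshape(out_f, in_f // group_size, group_size)
    scale = w_g.abs().amax(dim=-1, keepdim=True) / 7.0      # int4 range [-8, 7]
    q = torch.clamp(torch.round(w_g / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize_int4_groupwise(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reconstruct an approximate floating-point weight matrix from 4-bit values."""
    out_f, groups, group_size = q.shape
    return (q.float() * scale).reshape(out_f, groups * group_size)

w = torch.randn(4096, 4096)
q, scale = quantize_int4_groupwise(w)
w_hat = dequantize_int4_groupwise(q, scale)
print((w - w_hat).abs().mean())   # small reconstruction error from 4-bit rounding
```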

## Llama-70B on H200 up to 6.7x A100

TensorRT-LLM has improved its Group Query Attention (GQA) kernels in the generation phase, providing up to a 2.4x improvement on Llama-70B over TensorRT-LLM v0.5, achieving over **3,800** tok/s/GPU and running up to **6.7x** faster than A100.

**H200 6.7x A100**

<img src="./media/Falcon180B-H200_H200vA100.png" alt="Llama-70B H200 vs A100 comparison" width="600" height="auto">

| Model     | GPUs | Input Length | Output Length | Throughput (out tok/s/GPU) |
| :-------- | :--- | :----------- | :------------ | :------------------------- |
| Llama-70B | 1    | 128          | 128           | 3,803 |
|           | 8    |              |               | 3,803 |
|           | 1    |              | 2048          | 2,941 |
|           | 8    |              |               | 3,163 |
|           | 1    |              | 4096          | 1,946 |
|           | 8    |              |               | 2,263 |

<sup>Preliminary measured performance, subject to change.</sup>
<sup>TensorRT-LLM v0.7a | Llama2-70B | 1xH200 = TP1, 8xH200 = max TP/PP/DP config | FP8 | BS: (in order) 960, 960, 192, 560, 96, 640</sup>

**TensorRT-LLM GQA now 2.4x faster on H200**

<img src="./media/Falcon180B-H200_DecvOct.png" alt="Llama-70B H200 December vs Oct." width="400" height="auto">

<sup>Preliminary measured performance, subject to change.</sup>
<sup>TensorRT-LLM v0.7a vs TensorRT-LLM v0.6a | Llama2-70B | 1xH200 TP1 | FP8 | BS 192</sup>

[**Grouped Query Attention (GQA)**](https://arxiv.org/abs/2305.13245v2) (Ainslie et al., 2023), used in Llama-70B, is a variant of Multihead Attention (MHA) which groups key-value (KV) heads together, resulting in fewer KV heads than query (Q) heads. TensorRT-LLM has a custom implementation of MHA which supports GQA, multi-query attention (MQA) and standard MHA. It leverages Tensor Cores, including in the generation phase, and delivers great performance on NVIDIA GPUs.
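
To make the head grouping concrete, here is a minimal single-step GQA sketch in plain PyTorch (no masking, no KV-cache management, and not TensorRT-LLM's fused kernel); each KV head is shared by a group of query heads, so the KV cache shrinks by the ratio of query heads to KV heads:

```python
import torch

def gqa_attention(q, k, v, num_kv_heads):
    """q: [B, num_q_heads, q_len, d]; k, v: [B, num_kv_heads, kv_len, d]."""
    group_size = q.shape[1] // num_kv_heads
    # Broadcast each KV head to its group of query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# Llama-70B-like shapes: 64 query heads, 8 KV heads, head_dim 128,
# one new token attending to a 512-token KV cache (generation phase).
q = torch.randn(1, 64, 1, 128)
k = torch.randn(1, 8, 512, 128)
v = torch.randn(1, 8, 512, 128)
print(gqa_attention(q, k, v, num_kv_heads=8).shape)   # torch.Size([1, 64, 1, 128])
```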

###### Closing

These improvements will be published in the `main` branch soon, and will be included in the v0.7 & v0.8 releases.

Similar examples running Llama-70B in TensorRT-LLM are published in [examples/llama](/examples/llama).

For more information about H200, please see the [H200 announcement blog](./H200launch.md).

Throughput is calculated as output tokens per second per GPU:
`out_tps=output_seqlen*batch_size/total_latency/tp`
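
The same calculation as a small Python helper; the numbers in the example are purely illustrative, not measured results:

```python
def out_tps(output_seqlen: int, batch_size: int, total_latency_s: float, tp: int) -> float:
    """Output tokens per second per GPU: output_seqlen * batch_size / total_latency / tp."""
    return output_seqlen * batch_size / total_latency_s / tp

# Hypothetical example: 2048 output tokens, batch size 64,
# 90 s end-to-end latency, tensor parallelism over 4 GPUs.
print(round(out_tps(2048, 64, 90.0, 4)))   # ~364 tok/s/GPU
```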

<sub>**Glossary:** DP = Data Parallel | ISL = Input Sequence Length | PP = Pipeline Parallel | OSL = Output Sequence Length | OOM = Out of Memory | TP = Tensor Parallel</sub>

docs/source/blogs/H200launch.md

@@ -1,3 +1,5 @@

:loudspeaker: Note: The data below uses TensorRT-LLM v0.5. There have been significant improvements in v0.6 and later. Please see updated Llama performance [here](./Falcon180B-H200.md).

# H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM

TensorRT-LLM evaluation of the [new H200 GPU](https://nvidianews.nvidia.com/news/nvidia-supercharges-hopper-the-worlds-leading-ai-computing-platform) achieves **11,819 tokens/s on Llama2-13B** on a single GPU. H200 is up to **1.9x faster** than H100. This performance is enabled by H200's larger, faster [HBM3e memory](#latest-hbm-memory).

docs/source/blogs/media/Falcon180B-H200_DecvOct.png (new binary file, 33 KiB)
docs/source/blogs/media/Falcon180B-H200_H200vA100.png (new binary file, 49 KiB)
docs/source/blogs/media/Falcon180B-H200_acc.png (new binary file, 50 KiB)
docs/source/blogs/media/Falcon180B-H200_tps.png (new binary file, 30 KiB)

@@ -12,35 +12,41 @@ performance that can be delivered by TensorRT-LLM.

The different performance numbers below were collected using the methodology described in the benchmarks [folder](../../benchmarks/).

## High Throughput
## Peak Throughput

The below tables provide reference data at large batch sizes, representing
high throughput tasks.
high throughput offline tasks.

This data has been updated for v0.6.1, unless specified.

### H100 GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :---- | :--------- | :----- | :----------- | :------------ | ---------------------: |
| GPT-J 6B | 64 | 1 | 128 | 128 | 10,907 |
| GPT-J 6B | 64 | 1 | 128 | 2048 | 6,179 |
| GPT-J 6B | 64 | 1 | 2048 | 128 | 2,229 |
| GPT-J 6B | 64 | 1 | 2048 | 2048 | 2,980 |
| | | | | | |
| LLaMA 7B | 64 | 1 | 128 | 128 | 9,193 |
| LLaMA 7B | 64 | 1 | 128 | 2048 | 5,367 |
| LLaMA 7B | 64 | 1 | 2048 | 128 | 2,058 |
| LLaMA 7B | 32 | 1 | 2048 | 2048 | 2,230 |
| | | | | | |
| LLaMA 70B | 64 | 4 | 128 | 128 | 3,317 |
| LLaMA 70B | 64 | 4 | 128 | 2048 | 2,616 |
| LLaMA 70B | 64 | 4 | 2048 | 128 | 843 |
| LLaMA 70B | 64 | 4 | 2048 | 2048 | 1,583 |
| | | | | | |
| Falcon 180B | 96 | 8 | 128 | 128 | 2,686 |
| Falcon 180B | 96 | 8 | 128 | 2048 | 2,073 |
| Falcon 180B | 64 | 8 | 2048 | 128 | 465 |

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s/GPU) |
| :---- | :--------- | :----- | :----------- | :------------ | -------------------------: |
| GPT-J 6B | 1024 | 1 | 128 | 128 | 26,150 |
| GPT-J 6B | 120 | 1 | 128 | 2048 | 8,011 |
| GPT-J 6B | 64 | 1 | 2048 | 128 | 2,551 |
| GPT-J 6B | 64 | 1 | 2048 | 2048 | 3,327 |
| | | | | | |
| LLaMA 7B | 768 | 1 | 128 | 128 | 19,694 |
| LLaMA 7B | 112 | 1 | 128 | 2048 | 6,818 |
| LLaMA 7B | 80 | 1 | 2048 | 128 | 2,244 |
| LLaMA 7B | 48 | 1 | 2048 | 2048 | 2,740 |
| | | | | | |
| LLaMA 70B | 1024 | 2 | 128 | 128 | 2,657 |
| LLaMA 70B | 480 | 4 | 128 | 2048 | 1,486 |
| LLaMA 70B | 96 | 2 | 2048 | 128 | 306 |
| LLaMA 70B | 64 | 2 | 2048 | 2048 | 547 |
| | | | | | |
| Falcon 180B | 1024 | 4 | 128 | 128 | 987 |
| Falcon 180B | 1024 | 8 | 128 | 2048 | 724 |
| Falcon 180B | 64 | 4 | 2048 | 128 | 112 |
| Falcon 180B | 64 | 4 | 2048 | 2048 | 264 |

### L40S GPUs (FP8)<sup>*</sup>

<sup> * The following data is from TensorRT-LLM v0.5. </sup>

### L40S GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :---- | :--------- | :----- | :----------- | :------------ | ---------------------: |

@@ -59,28 +65,28 @@ high throughput tasks.

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :---- | :--------- | :----- | :----------- | :------------ | ---------------------: |
| GPT-J 6B | 64 | 1 | 128 | 128 | 3,679 |
| GPT-J 6B | 32 | 1 | 128 | 2048 | 1,558 |
| GPT-J 6B | 32 | 1 | 2048 | 128 | 526 |
| GPT-J 6B | 16 | 1 | 2048 | 2048 | 650 |
| GPT-J 6B | 512 | 1 | 128 | 128 | 6,374 |
| GPT-J 6B | 120 | 2 | 128 | 2048 | 2,192 |
| GPT-J 6B | 60 | 1 | 2048 | 128 | 670 |
| GPT-J 6B | 64 | 2 | 2048 | 2048 | 903 |
| | | | | | |
| LLaMA 7B | 64 | 1 | 128 | 128 | 3,486 |
| LLaMA 7B | 32 | 1 | 128 | 2048 | 1,459 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 529 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 592 |
| LLaMA 7B | 384 | 1 | 128 | 128 | 5,586 |
| LLaMA 7B | 60 | 1 | 128 | 2048 | 1,928 |
| LLaMA 7B | 52 | 1 | 2048 | 128 | 591 |
| LLaMA 7B | 64 | 2 | 2048 | 2048 | 782 |
| | | | | | |
| LLaMA 70B | 64 | 4 | 128 | 128 | 1,237 |
| LLaMA 70B | 64 | 4 | 128 | 2048 | 1,181 |
| LLaMA 70B | 64 | 4 | 2048 | 128 | 272 |
| LLaMA 70B | 64 | 4 | 2048 | 2048 | 738 |
| LLaMA 70B | 1280 | 4 | 128 | 128 | 670 |
| LLaMA 70B | 240 | 4 | 128 | 2048 | 525 |
| LLaMA 70B | 120 | 4 | 2048 | 128 | 79 |
| | | | | | |
| Falcon 180B | 64 | 8 | 128 | 128 | 929 |
| Falcon 180B | 64 | 8 | 128 | 2048 | 923 |
| Falcon 180B | 64 | 8 | 2048 | 128 | 202 |
| Falcon 180B | 1024 | 8 | 128 | 128 | 232 |
| Falcon 180B | 128 | 8 | 128 | 2048 | 180 |

(1) TP stands for Tensor Parallelism.

## Low Latency
## Low Latency<sup>**</sup>

<sup> ** The following data is from TensorRT-LLM v0.5. Low latency numbers will soon be updated to reflect real-time latency with in-flight batching.</sup>

The below tables provide reference data at batch size 1 for first token
latency, representing end-user's perceived latency for online streaming