Update latest news (#549)

commit e093e48459 (parent 71f60f6df0)

README.md (21 lines changed)
@@ -17,24 +17,29 @@ TensorRT-LLM

<div align="left">

## Latest News
* [2023/11/13] [**H200** achieves nearly **12,000 tok/sec on Llama2-13B**](./docs/source/blogs/H200launch.md)
* [2023/12/04] [**Falcon-180B** on a **single H200** GPU with INT4 AWQ, and **6.7x faster Llama-70B** over A100](./docs/source/blogs/H200launch.md)

<img src="./docs/source/blogs/media/H200launch_tps.png" alt="H200 TPS" width="500" height="auto">
<img src="./docs/source/blogs/media/Falcon180B-H200_H200vA100.png" alt="H200 TPS" width="400" height="auto">

H200 FP8 achieves 11,819 tok/s on Llama2-13B on a single GPU, and is up to 1.9x faster than H100.
H200 with INT4 AWQ runs Falcon-180B on a _single_ GPU.
H200 is now 2.4x faster on Llama-70B with recent improvements to TensorRT-LLM GQA; up to 6.7x faster than A100.

* [2023/11/03] [TensorRT-LLM is up to **4.6x faster on H100 than A100**, achieving **10,000 tok/s at 100ms to first token.**](./docs/source/blogs/H100vsA100.md)
* [2023/11/27] [SageMaker LMI now supports TensorRT-LLM - improves throughput by 60%, compared to previous version](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-llms-with-new-amazon-sagemaker-containers/)
* [2023/11/13] [H200 achieves nearly 12,000 tok/sec on Llama2-13B](./docs/source/blogs/H200launch.md)
* [2023/10/22] [🚀 RAG on Windows using TensorRT-LLM and LlamaIndex 🦙](https://github.com/NVIDIA/trt-llm-rag-windows#readme)
* [2023/10/19] Getting Started Guide - [Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available](https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/)
* [2023/10/17] [Large Language Models up to 4x Faster on RTX With TensorRT-LLM for Windows](https://blogs.nvidia.com/blog/2023/10/17/tensorrt-llm-windows-stable-diffusion-rtx/)
* [2023/9/9] [NVIDIA TensorRT-LLM Supercharges Large Language Model Inference on NVIDIA H100 GPUs](https://developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus/)

[2023/10/31 - Phind](https://www.phind.com/blog/phind-model-beats-gpt4-fast) ; [2023/10/12 - Databricks (MosaicML)](https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices) ;
[2023/10/4 - Perplexity](https://blog.perplexity.ai/blog/introducing-pplx-api) ;
[2023/9/27 - CloudFlare](https://www.cloudflare.com/press-releases/2023/cloudflare-powers-hyper-local-ai-inference-with-nvidia/);

[2023/11/27 - Amazon Sagemaker](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-llms-with-new-amazon-sagemaker-containers/)
[2023/11/17 - Perplexity](https://blog.perplexity.ai/blog/turbocharging-llama-2-70b-with-nvidia-h100) ;
[2023/10/31 - Phind](https://www.phind.com/blog/phind-model-beats-gpt4-fast) ;
[2023/10/12 - Databricks (MosaicML)](https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices) ;
[2023/10/04 - Perplexity](https://blog.perplexity.ai/blog/introducing-pplx-api) ;
[2023/09/27 - CloudFlare](https://www.cloudflare.com/press-releases/2023/cloudflare-powers-hyper-local-ai-inference-with-nvidia/);

## Table of Contents

docs/source/blogs/Falcon180B-H200.md (new file, 133 lines)

@@ -0,0 +1,133 @@

# Falcon-180B on a *single* H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100

H200's large capacity and high memory bandwidth, paired with TensorRT-LLM's optimizations, maximize inference performance.

## Falcon-180B on a single H200 with INT4 AWQ

[Falcon-180B](https://huggingface.co/tiiuae/falcon-180B), one of the largest and most accurate open-source models available, can run on a *single* H200 GPU.

The 141 GB of memory on H200, paired with TensorRT-LLM running INT4 AWQ with FP8, allows the entire large language model to fit on a single GPU, where previously eight A100s were required. Falcon-180B on H200 provides up to **800** tok/s while retaining high accuracy.
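
As a rough back-of-the-envelope sketch of why this fits (weights only; KV cache, activations, and runtime buffers are ignored, and the parameter count is approximate):

```python
# Approximate weights-only memory footprint of Falcon-180B.
params = 180e9                        # ~180B parameters (approximate)
h200_hbm_gb = 141                     # H200 HBM capacity

int4_gb = params * 0.5 / 1e9          # 4-bit weights: ~90 GB -> fits on one H200
fp16_gb = params * 2.0 / 1e9          # FP16 weights: ~360 GB -> needs several GPUs

print(f"INT4: ~{int4_gb:.0f} GB, FP16: ~{fp16_gb:.0f} GB, H200 HBM: {h200_hbm_gb} GB")
```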

**Model Performance:**
H200's large capacity and high memory bandwidth, combined with INT4 AWQ to reduce the model's memory footprint, deliver strong Falcon-180B performance on a single GPU.

<img src="./media/Falcon180B-H200_tps.png" alt="Falcon-180B performance comparison" width="450" height="auto">

<sup>Preliminary measured performance, subject to change. TP1 does not represent peak performance on H200.</sup>
<sup>TensorRT-LLM v0.7a | Falcon-180B | 1xH200 TP1 | INT4 AWQ | BS: (in order) 256, 128</sup>

**Model Accuracy:**
Quantization can often degrade model accuracy; however, TensorRT-LLM's AWQ reduces the model's memory footprint by **4x** while maintaining high accuracy.

<img src="./media/Falcon180B-H200_acc.png" alt="Falcon-180B accuracy comparison" width="600" height="auto">

<sup>Preliminary measured accuracy, subject to change.</sup>
<sup>TensorRT-LLM v0.7a | Falcon-180B | 1xH200 TP1 | INT4 AWQ</sup>

[**INT4 Activation-aware Weight Quantization (AWQ)**](https://arxiv.org/abs/2306.00978) (Lin et al., 2023) is a quantization technique that compresses the weights of an LLM down to 4 bits based on their relative importance and performs computation in FP16. This allows AWQ to retain higher accuracy than other 4-bit methods and to reduce memory usage, but it requires special kernels capable of handling the change in precision performantly.

TensorRT-LLM has implemented custom kernels for AWQ, and takes the technique a step further by performing FP8 computation on Hopper GPUs instead of the standard FP16.

Similar examples running Falcon-180B with quantization in TensorRT-LLM are available in [examples/falcon](/examples/falcon).
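
To make the mechanism concrete, here is a toy sketch of weight-only 4-bit group-wise quantization in plain PyTorch. It is illustrative only: real AWQ additionally applies activation-aware per-channel scaling before quantizing, packs two 4-bit values per byte, and TensorRT-LLM runs the matmuls in custom FP16/FP8 kernels.

```python
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 128):
    """Toy symmetric 4-bit group-wise quantization of a [out, in] weight matrix."""
    out_f, in_f = w.shape
    w_g = w.reshape(out_f, in_f // group_size, group_size)
    scale = w_g.abs().amax(dim=-1, keepdim=True) / 7.0      # int4 range [-8, 7]
    q = torch.clamp(torch.round(w_g / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize_int4_groupwise(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reconstruct an approximate floating-point weight matrix from 4-bit values."""
    out_f, groups, group_size = q.shape
    return (q.float() * scale).reshape(out_f, groups * group_size)

w = torch.randn(4096, 4096)
q, scale = quantize_int4_groupwise(w)
w_hat = dequantize_int4_groupwise(q, scale)
print((w - w_hat).abs().mean())   # small reconstruction error from 4-bit rounding
```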

## Llama-70B on H200 up to 6.7x A100

TensorRT-LLM has improved its Group Query Attention (GQA) kernels in the generation phase, providing up to a 2.4x improvement on Llama-70B over TensorRT-LLM v0.5, achieving over **3,800** tok/s/GPU and running up to **6.7x** faster than A100.

**H200 6.7x A100**

<img src="./media/Falcon180B-H200_H200vA100.png" alt="Llama-70B H200 vs A100 comparison" width="600" height="auto">

| Model     | GPUs | Input Length | Output Length | Throughput (out tok/s/GPU) |
| :-------- | :--- | :----------- | :------------ | :------------------------- |
| Llama-70B | 1    | 128          | 128           | 3,803 |
|           | 8    |              |               | 3,803 |
|           | 1    |              | 2048          | 2,941 |
|           | 8    |              |               | 3,163 |
|           | 1    |              | 4096          | 1,946 |
|           | 8    |              |               | 2,263 |

<sup>Preliminary measured performance, subject to change.</sup>
<sup>TensorRT-LLM v0.7a | Llama2-70B | 1xH200 = TP1, 8xH200 = max TP/PP/DP config | FP8 | BS: (in order) 960, 960, 192, 560, 96, 640</sup>

**TensorRT-LLM GQA now 2.4x faster on H200**

<img src="./media/Falcon180B-H200_DecvOct.png" alt="Llama-70B H200 December vs Oct." width="400" height="auto">

<sup>Preliminary measured performance, subject to change.</sup>
<sup>TensorRT-LLM v0.7a vs TensorRT-LLM v0.6a | Llama2-70B | 1xH200 TP1 | FP8 | BS 192</sup>

[**Grouped Query Attention (GQA)**](https://arxiv.org/abs/2305.13245v2) (Ainslie et al., 2023), used in Llama-70B, is a variant of Multihead Attention (MHA) which groups key-value (KV) heads together, resulting in fewer KV heads than query (Q) heads. TensorRT-LLM has a custom implementation of MHA which supports GQA, multi-query attention (MQA) and standard MHA. It leverages Tensor Cores, including in the generation phase, and delivers great performance on NVIDIA GPUs.
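
To make the head grouping concrete, here is a minimal single-step GQA sketch in plain PyTorch (no masking, no KV-cache management, and not TensorRT-LLM's fused kernel); each KV head is shared by a group of query heads, so the KV cache shrinks by the ratio of query heads to KV heads:

```python
import torch

def gqa_attention(q, k, v, num_kv_heads):
    """q: [B, num_q_heads, q_len, d]; k, v: [B, num_kv_heads, kv_len, d]."""
    group_size = q.shape[1] // num_kv_heads
    # Broadcast each KV head to its group of query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# Llama-70B-like shapes: 64 query heads, 8 KV heads, head_dim 128,
# one new token attending to a 512-token KV cache (generation phase).
q = torch.randn(1, 64, 1, 128)
k = torch.randn(1, 8, 512, 128)
v = torch.randn(1, 8, 512, 128)
print(gqa_attention(q, k, v, num_kv_heads=8).shape)   # torch.Size([1, 64, 1, 128])
```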

###### Closing

These improvements will be published in the `main` branch soon, and will be included in the v0.7 & v0.8 releases.

Similar examples running Llama-70B in TensorRT-LLM are published in [examples/llama](/examples/llama).

For more information about H200, please see the [H200 announcement blog](./H200launch.md).

Throughput is calculated as output tokens per second per GPU:
`out_tps=output_seqlen*batch_size/total_latency/tp`
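
The same calculation as a small Python helper; the numbers in the example are purely illustrative, not measured results:

```python
def out_tps(output_seqlen: int, batch_size: int, total_latency_s: float, tp: int) -> float:
    """Output tokens per second per GPU: output_seqlen * batch_size / total_latency / tp."""
    return output_seqlen * batch_size / total_latency_s / tp

# Hypothetical example: 2048 output tokens, batch size 64,
# 90 s end-to-end latency, tensor parallelism over 4 GPUs.
print(round(out_tps(2048, 64, 90.0, 4)))   # ~364 tok/s/GPU
```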

<sub>**Glossary:** DP = Data Parallel | ISL = Input Sequence Length | PP = Pipeline Parallel | OSL = Output Sequence Length | OOM = Out of Memory | TP = Tensor Parallel</sub>

docs/source/blogs/H200launch.md

@@ -1,3 +1,5 @@

:loudspeaker: Note: The data below uses TensorRT-LLM v0.5. There have been significant improvements in v0.6 and later. Please see updated Llama performance [here](./Falcon180B-H200.md).

# H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM

TensorRT-LLM evaluation of the [new H200 GPU](https://nvidianews.nvidia.com/news/nvidia-supercharges-hopper-the-worlds-leading-ai-computing-platform) achieves **11,819 tokens/s on Llama2-13B** on a single GPU. H200 is up to **1.9x faster** than H100. This performance is enabled by H200's larger, faster [HBM3e memory](#latest-hbm-memory).

docs/source/blogs/media/Falcon180B-H200_DecvOct.png (new binary file, 33 KiB)
docs/source/blogs/media/Falcon180B-H200_H200vA100.png (new binary file, 49 KiB)
docs/source/blogs/media/Falcon180B-H200_acc.png (new binary file, 50 KiB)
docs/source/blogs/media/Falcon180B-H200_tps.png (new binary file, 30 KiB)

@@ -12,35 +12,41 @@ performance that can be delivered by TensorRT-LLM.

The different performance numbers below were collected using the methodology described in the benchmarks [folder](../../benchmarks/).

## High Throughput
## Peak Throughput

The below tables provide reference data at large batch sizes, representing
high throughput tasks.
high throughput offline tasks.

This data has been updated for v0.6.1, unless specified.

### H100 GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :---- | :--------- | :----- | :----------- | :------------ | ---------------------: |
| GPT-J 6B | 64 | 1 | 128 | 128 | 10,907 |
| GPT-J 6B | 64 | 1 | 128 | 2048 | 6,179 |
| GPT-J 6B | 64 | 1 | 2048 | 128 | 2,229 |
| GPT-J 6B | 64 | 1 | 2048 | 2048 | 2,980 |
| | | | | | |
| LLaMA 7B | 64 | 1 | 128 | 128 | 9,193 |
| LLaMA 7B | 64 | 1 | 128 | 2048 | 5,367 |
| LLaMA 7B | 64 | 1 | 2048 | 128 | 2,058 |
| LLaMA 7B | 32 | 1 | 2048 | 2048 | 2,230 |
| | | | | | |
| LLaMA 70B | 64 | 4 | 128 | 128 | 3,317 |
| LLaMA 70B | 64 | 4 | 128 | 2048 | 2,616 |
| LLaMA 70B | 64 | 4 | 2048 | 128 | 843 |
| LLaMA 70B | 64 | 4 | 2048 | 2048 | 1,583 |
| | | | | | |
| Falcon 180B | 96 | 8 | 128 | 128 | 2,686 |
| Falcon 180B | 96 | 8 | 128 | 2048 | 2,073 |
| Falcon 180B | 64 | 8 | 2048 | 128 | 465 |

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s/GPU) |
| :---- | :--------- | :----- | :----------- | :------------ | -------------------------: |
| GPT-J 6B | 1024 | 1 | 128 | 128 | 26,150 |
| GPT-J 6B | 120 | 1 | 128 | 2048 | 8,011 |
| GPT-J 6B | 64 | 1 | 2048 | 128 | 2,551 |
| GPT-J 6B | 64 | 1 | 2048 | 2048 | 3,327 |
| | | | | | |
| LLaMA 7B | 768 | 1 | 128 | 128 | 19,694 |
| LLaMA 7B | 112 | 1 | 128 | 2048 | 6,818 |
| LLaMA 7B | 80 | 1 | 2048 | 128 | 2,244 |
| LLaMA 7B | 48 | 1 | 2048 | 2048 | 2,740 |
| | | | | | |
| LLaMA 70B | 1024 | 2 | 128 | 128 | 2,657 |
| LLaMA 70B | 480 | 4 | 128 | 2048 | 1,486 |
| LLaMA 70B | 96 | 2 | 2048 | 128 | 306 |
| LLaMA 70B | 64 | 2 | 2048 | 2048 | 547 |
| | | | | | |
| Falcon 180B | 1024 | 4 | 128 | 128 | 987 |
| Falcon 180B | 1024 | 8 | 128 | 2048 | 724 |
| Falcon 180B | 64 | 4 | 2048 | 128 | 112 |
| Falcon 180B | 64 | 4 | 2048 | 2048 | 264 |

### L40S GPUs (FP8)<sup>*</sup>

<sup> * The following data is from TensorRT-LLM v0.5. </sup>

### L40S GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :---- | :--------- | :----- | :----------- | :------------ | ---------------------: |

@@ -59,28 +65,28 @@ high throughput tasks.

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :---- | :--------- | :----- | :----------- | :------------ | ---------------------: |
| GPT-J 6B | 64 | 1 | 128 | 128 | 3,679 |
| GPT-J 6B | 32 | 1 | 128 | 2048 | 1,558 |
| GPT-J 6B | 32 | 1 | 2048 | 128 | 526 |
| GPT-J 6B | 16 | 1 | 2048 | 2048 | 650 |
| GPT-J 6B | 512 | 1 | 128 | 128 | 6,374 |
| GPT-J 6B | 120 | 2 | 128 | 2048 | 2,192 |
| GPT-J 6B | 60 | 1 | 2048 | 128 | 670 |
| GPT-J 6B | 64 | 2 | 2048 | 2048 | 903 |
| | | | | | |
| LLaMA 7B | 64 | 1 | 128 | 128 | 3,486 |
| LLaMA 7B | 32 | 1 | 128 | 2048 | 1,459 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 529 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 592 |
| LLaMA 7B | 384 | 1 | 128 | 128 | 5,586 |
| LLaMA 7B | 60 | 1 | 128 | 2048 | 1,928 |
| LLaMA 7B | 52 | 1 | 2048 | 128 | 591 |
| LLaMA 7B | 64 | 2 | 2048 | 2048 | 782 |
| | | | | | |
| LLaMA 70B | 64 | 4 | 128 | 128 | 1,237 |
| LLaMA 70B | 64 | 4 | 128 | 2048 | 1,181 |
| LLaMA 70B | 64 | 4 | 2048 | 128 | 272 |
| LLaMA 70B | 64 | 4 | 2048 | 2048 | 738 |
| LLaMA 70B | 1280 | 4 | 128 | 128 | 670 |
| LLaMA 70B | 240 | 4 | 128 | 2048 | 525 |
| LLaMA 70B | 120 | 4 | 2048 | 128 | 79 |
| | | | | | |
| Falcon 180B | 64 | 8 | 128 | 128 | 929 |
| Falcon 180B | 64 | 8 | 128 | 2048 | 923 |
| Falcon 180B | 64 | 8 | 2048 | 128 | 202 |
| Falcon 180B | 1024 | 8 | 128 | 128 | 232 |
| Falcon 180B | 128 | 8 | 128 | 2048 | 180 |

(1) TP stands for Tensor Parallelism.

## Low Latency
## Low Latency<sup>**</sup>

<sup> ** The following data is from TensorRT-LLM v0.5. Low latency numbers will soon be updated to reflect real-time latency with in-flight batching.</sup>

The below tables provide reference data at batch size 1 for first token
latency, representing end-user's perceived latency for online streaming