# Performance of TensorRT-LLM

This document summarizes performance measurements of TensorRT-LLM on H100
(Hopper), L40S (Ada), and A100 (Ampere) GPUs for a few key models.

The data in the following tables is provided as a reference point to help users
validate observed performance. It should not be considered the peak performance
that can be delivered by TensorRT-LLM.

## Methodology

The performance numbers below were collected using the methodology described in
the benchmarks [folder](../../benchmarks/).

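For orientation, the sketch below shows the general shape of such a measurement:
warm-up runs followed by timed runs over fixed-shape requests, with output
throughput derived from wall-clock time. It is a minimal illustration rather
than the actual benchmark harness; `generate_batch` is a hypothetical stand-in
for the TensorRT-LLM runtime call, and the real scripts in the benchmarks
folder additionally handle engine building, padding, and synchronization.

```python
import time

def measure_throughput(generate_batch, batch_size, input_len, output_len,
                       warmup=2, iters=10):
    """Time fixed-shape generation runs and report output tokens per second.

    `generate_batch` is a hypothetical callable that generates `output_len`
    tokens for each of `batch_size` requests of length `input_len`.
    """
    for _ in range(warmup):
        generate_batch(batch_size, input_len, output_len)   # discard warm-up runs

    start = time.perf_counter()
    for _ in range(iters):
        generate_batch(batch_size, input_len, output_len)
    elapsed = time.perf_counter() - start

    generated_tokens = iters * batch_size * output_len      # total output tokens
    return generated_tokens / elapsed                       # out tok/s
```
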
## High Throughput

The tables below provide reference data at large batch sizes, representing
high-throughput tasks.

### H100 GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :---- | :--------- | :----- | :----------- | :------------ | ---------------------: |
| GPT-J 6B | 64 | 1 | 128 | 128 | 10,907 |
| GPT-J 6B | 64 | 1 | 128 | 2048 | 6,179 |
| GPT-J 6B | 64 | 1 | 2048 | 128 | 2,229 |
| GPT-J 6B | 64 | 1 | 2048 | 2048 | 2,980 |
| | | | | | |
| LLaMA 7B | 64 | 1 | 128 | 128 | 9,193 |
| LLaMA 7B | 64 | 1 | 128 | 2048 | 5,367 |
| LLaMA 7B | 64 | 1 | 2048 | 128 | 2,058 |
| LLaMA 7B | 32 | 1 | 2048 | 2048 | 2,230 |
| | | | | | |
| LLaMA 70B | 64 | 4 | 128 | 128 | 3,317 |
| LLaMA 70B | 64 | 4 | 128 | 2048 | 2,616 |
| LLaMA 70B | 64 | 4 | 2048 | 128 | 843 |
| LLaMA 70B | 64 | 4 | 2048 | 2048 | 1,583 |
| | | | | | |
| Falcon 180B | 96 | 8 | 128 | 128 | 2,686 |
| Falcon 180B | 96 | 8 | 128 | 2048 | 2,073 |
| Falcon 180B | 64 | 8 | 2048 | 128 | 465 |

### L40S GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :---- | :--------- | :----- | :----------- | :------------ | ---------------------: |
| GPT-J 6B | 64 | 1 | 128 | 128 | 3,630 |
| GPT-J 6B | 64 | 1 | 128 | 2048 | 1,859 |
| GPT-J 6B | 32 | 1 | 2048 | 128 | 616 |
| GPT-J 6B | 32 | 1 | 2048 | 2048 | 757 |
| | | | | | |
| LLaMA 7B | 64 | 1 | 128 | 128 | 3,240 |
| LLaMA 7B | 64 | 1 | 128 | 2048 | 1,622 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 581 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 531 |

### A100 GPUs (FP16)

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :---- | :--------- | :----- | :----------- | :------------ | ---------------------: |
| GPT-J 6B | 64 | 1 | 128 | 128 | 3,679 |
| GPT-J 6B | 32 | 1 | 128 | 2048 | 1,558 |
| GPT-J 6B | 32 | 1 | 2048 | 128 | 526 |
| GPT-J 6B | 16 | 1 | 2048 | 2048 | 650 |
| | | | | | |
| LLaMA 7B | 64 | 1 | 128 | 128 | 3,486 |
| LLaMA 7B | 32 | 1 | 128 | 2048 | 1,459 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 529 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 592 |
| | | | | | |
| LLaMA 70B | 64 | 4 | 128 | 128 | 1,237 |
| LLaMA 70B | 64 | 4 | 128 | 2048 | 1,181 |
| LLaMA 70B | 64 | 4 | 2048 | 128 | 272 |
| LLaMA 70B | 64 | 4 | 2048 | 2048 | 738 |
| | | | | | |
| Falcon 180B | 64 | 8 | 128 | 128 | 929 |
| Falcon 180B | 64 | 8 | 128 | 2048 | 923 |
| Falcon 180B | 64 | 8 | 2048 | 128 | 202 |

(1) TP stands for Tensor Parallelism.
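
As a sanity check when comparing your own runs against these tables, the
throughput column can be read back into an approximate per-batch generation
time. This assumes throughput is reported as total generated tokens divided by
end-to-end generation time, which is an interpretation of the tables rather
than part of the published methodology:

```python
# Rough reading of the first H100 row (GPT-J 6B, batch 64, output length 128).
batch_size, output_len = 64, 128
throughput = 10_907                                # out tok/s, from the table
tokens_per_batch = batch_size * output_len         # 8,192 output tokens
approx_batch_time = tokens_per_batch / throughput  # roughly 0.75 s per batch
print(f"~{approx_batch_time:.2f} s per batch")
```
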
## Low Latency

The tables below provide reference data at batch size 1 for first-token
latency, representing the latency perceived by an end user in online streaming
tasks.

### H100 GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :---- | :--------- | :----- | :----------- | ---------------------: |
| GPT-J 6B | 1 | 1 | 128 | 7 |
| GPT-J 6B | 1 | 1 | 2048 | 29 |
| | | | | |
| LLaMA 7B | 1 | 1 | 128 | 7 |
| LLaMA 7B | 1 | 1 | 2048 | 36 |
| | | | | |
| LLaMA 70B | 1 | 4 | 128 | 26 |
| LLaMA 70B | 1 | 4 | 2048 | 109 |
| | | | | |
| Falcon 180B | 1 | 8 | 128 | 27 |
| Falcon 180B | 1 | 8 | 2048 | 205 |

### L40S GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :---- | :--------- | :----- | :----------- | ---------------------: |
| GPT-J 6B | 1 | 1 | 128 | 12 |
| GPT-J 6B | 1 | 1 | 2048 | 71 |
| | | | | |
| LLaMA 7B | 1 | 1 | 128 | 14 |
| LLaMA 7B | 1 | 1 | 2048 | 73 |

### A100 GPUs (FP16)

| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :---- | :--------- | :----- | :----------- | ---------------------: |
| GPT-J 6B | 1 | 1 | 128 | 12 |
| GPT-J 6B | 1 | 1 | 2048 | 129 |
| | | | | |
| LLaMA 7B | 1 | 1 | 128 | 16 |
| LLaMA 7B | 1 | 1 | 2048 | 133 |
| | | | | |
| LLaMA 70B | 1 | 4 | 128 | 47 |
| LLaMA 70B | 1 | 4 | 2048 | 377 |
| | | | | |
| Falcon 180B | 1 | 8 | 128 | 61 |
| Falcon 180B | 1 | 8 | 2048 | 509 |

(1) TP stands for Tensor Parallelism.
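
The first-token numbers above can be reproduced in spirit by timing a streaming
request from submission until the first output token arrives. The sketch below
is a minimal illustration under that assumption; `stream_tokens` is a
hypothetical generator that yields output tokens as the runtime produces them,
not an actual TensorRT-LLM API.

```python
import time

def first_token_latency_ms(stream_tokens, prompt):
    """Return the time, in milliseconds, from request submission to the
    arrival of the first streamed output token."""
    start = time.perf_counter()
    next(iter(stream_tokens(prompt)))   # block until the first token arrives
    return (time.perf_counter() - start) * 1000.0
```
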
## Known Issues

The following issues are being addressed to improve the efficiency of TensorRT-LLM.

### Fused Matmul + Gated-SiLU (LLaMA)

There are different possible implementations of Matmul followed by Gated-SiLU.
The simplest one uses two Matmul operations and combines the results in a
separate CUDA kernel; that is the current implementation in TensorRT-LLM. The
next release will include a more efficient implementation that runs a single
Matmul.
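
For readers unfamiliar with the pattern, the sketch below contrasts the two
approaches in NumPy: the current path runs two separate Matmuls and combines
their results with a SiLU gate in a separate elementwise step, while the fused
path runs a single Matmul over horizontally concatenated weights and splits the
result afterwards. This is an illustration of the math only, not TensorRT-LLM
code, and the weight names are hypothetical.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))  # SiLU (a.k.a. swish) activation

def gated_silu_two_matmuls(x, w_gate, w_up):
    # Current approach: two Matmuls, then an elementwise combine
    # (the combine corresponds to the separate CUDA kernel).
    return silu(x @ w_gate) * (x @ w_up)

def gated_silu_single_matmul(x, w_gate, w_up):
    # Fused approach: one Matmul over concatenated weights, then split.
    fused = x @ np.concatenate([w_gate, w_up], axis=1)
    gate, up = np.split(fused, 2, axis=1)
    return silu(gate) * up

# Both variants compute the same result (up to floating-point error).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w_gate = rng.standard_normal((8, 16))
w_up = rng.standard_normal((8, 16))
assert np.allclose(gated_silu_two_matmuls(x, w_gate, w_up),
                   gated_silu_single_matmul(x, w_gate, w_up))
```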