[IB-1920][doc] Update Perf_Overview.md with Benchmarking Results for Release 1.1 (#9723)
Signed-off-by: Zachary Patel <22306219+zbpatel@users.noreply.github.com>
This commit is contained in: parent 16a7d18c4e, commit cfaa13a98a
This document summarizes performance measurements of TensorRT-LLM on a number of GPUs across a set of key models.

The data in the following tables is provided as a reference point to help users validate observed performance.
It should *not* be considered as the peak performance that can be delivered by TensorRT-LLM.

Not all configurations were tested for all GPUs.

We attempted to keep commands as simple as possible to ease reproducibility and left many options at their default settings.
Tuning batch sizes, parallelism configurations, and other options may lead to improved performance depending on your situation.

For DeepSeek R1 performance, please check out our [performance guide](../blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md).
For more information on benchmarking with `trtllm-bench`, see this NVIDIA blog post on LLM inference benchmarking with TensorRT-LLM.

## Throughput Measurements

The table below shows performance data where a local inference client is fed requests at a high rate (no delay between messages), representing the throughput scenario under maximum load. The reported metric is `Output Throughput per GPU (tokens/sec/GPU)`.

The performance numbers below were collected using the steps described in this document.

Testing was performed on models with weights quantized using [ModelOpt](https://nvidia.github.io/TensorRT-Model-Optimizer/#) and published by NVIDIA on the [Model Optimizer HuggingFace Collection](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4).
Other hardware variants may have different TDP, memory bandwidth, core count, or other specifications, which can lead to different performance results.

### FP4 Models

```text
nvidia/DeepSeek-R1-0528-NVFP4-v2
nvidia/Qwen3-235B-A22B-FP4
nvidia/Qwen3-30B-A3B-FP4
nvidia/Llama-3.3-70B-Instruct-FP4
nvidia/Llama-4-Maverick-17B-128E-Instruct-NVFP4
```

### FP8 Models

```text
deepseek-ai/DeepSeek-R1-0528
nvidia/Qwen3-235B-A22B-FP8
nvidia/Llama-3.3-70B-Instruct-FP8
nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8
```
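
These quantized checkpoints can be fetched directly from the Hugging Face Hub before benchmarking. A minimal sketch, assuming the Hugging Face CLI is installed (the model ID can be any entry from the lists above):

```shell
# Illustrative: download one of the quantized checkpoints listed above
huggingface-cli download nvidia/Llama-3.3-70B-Instruct-FP4
```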

# Performance Summary - All Networks

## Units

All performance values are measured in `output tokens per second per GPU`, where `output tokens` includes the first and all subsequent generated tokens (input tokens are not included).

Data in these tables is taken from the `Per GPU Output Throughput (tps/gpu)` metric reported by `trtllm-bench`.
The calculations for metrics reported by `trtllm-bench` can be found in the dataclasses [reporting.py](../../../tensorrt_llm/bench/dataclasses/reporting.py#L570) and [statistics.py](../../../tensorrt_llm/bench/dataclasses/statistics.py#L188).
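As a worked illustration with hypothetical numbers: a run that sustains 48,000 total output tokens/sec across 8 GPUs would be reported in these tables as 6,000 tokens/sec/GPU.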

## Table of Contents

- [Deepseek R1 0528](#deepseek-r1-0528)
- [GPT-OSS 120B](#gpt-oss-120b)
- [GPT-OSS 20B](#gpt-oss-20b)
- [LLaMA v3.3 70B](#llama-v33-70b)
- [LLaMA v3.3 70B - RTX 6000 Pro Blackwell Server Edition](#llama-v33-70b-rtx-configurations)
- [LLaMA v4 Maverick](#llama-v4-maverick)
- [Qwen3 235B A22B](#qwen3-235b-a22b)
- [Qwen3 235B A22B - RTX 6000 Pro Blackwell Server Edition](#qwen3-235b-a22b-rtx-configurations)
- [Qwen3 30B A3B](#qwen3-30b-a3b)
- [Qwen3 30B A3B - RTX 6000 Pro Blackwell Server Edition](#qwen3-30b-a3b-rtx-configurations)

---

<a id="deepseek-r1-0528"></a>

# Deepseek R1 0528

| Sequence Length (ISL/OSL) | B200<br/>DEP4 (FP4) | GB200<br/>DEP4 (FP4) | H200<br/>DEP8 (FP8) |
|---|---|---|---|
| 1000/1000 | 6,463 | 6,939 | 1,627 |
| 1024/1024 | 6,430 | 6,924 | 1,620 |
| 1024/8192 | 3,862 | 4,379 | 1,218 |
| 1024/32768 | 1,451 | 1,465 | 438 |
| 8192/1024 | 1,168 | 1,192 | |

unit: `output tokens per second per GPU`

---

<a id="gpt-oss-120b"></a>

# GPT-OSS 120B

| Sequence Length (ISL/OSL) | B200<br/>DEP2 (FP4) | GB200<br/>TP1 (FP4) | H200<br/>TP1 (FP8) | H100<br/>DEP4 (FP8) |
|---|---|---|---|---|
| 1000/1000 | 25,943 | 27,198 | 6,868 | 4,685 |
| 1024/1024 | 25,870 | 26,609 | 6,798 | 4,715 |
| 1024/8192 | 17,289 | 14,800 | 3,543 | |
| 1024/32768 | 6,279 | 5,556 | | 1,177 |
| 8192/1024 | 6,111 | 6,835 | 1,828 | 1,169 |
| 32768/1024 | 1,392 | 1,645 | 519 | 333 |

unit: `output tokens per second per GPU`

---

<a id="gpt-oss-20b"></a>

# GPT-OSS 20B

| Sequence Length (ISL/OSL) | B200<br/>TP1 (FP4) | GB200<br/>TP1 (FP4) | H200<br/>TP1 (FP8) | H100<br/>TP1 (FP8) |
|---|---|---|---|---|
| 1000/1000 | 53,812 | 55,823 | 13,858 | 11,557 |
| 1024/1024 | 53,491 | 56,528 | 13,890 | 11,403 |
| 1024/8192 | 34,702 | 38,100 | 12,743 | 8,617 |
| 1024/32768 | 14,589 | 16,463 | | |
| 8192/1024 | 11,904 | 12,941 | 4,015 | 3,366 |
| 32768/1024 | 2,645 | 2,905 | 915 | 785 |

unit: `output tokens per second per GPU`

---

<a id="llama-v33-70b"></a>

# LLaMA v3.3 70B

| Sequence Length (ISL/OSL) | B200<br/>TP1 (FP4) | GB200<br/>TP1 (FP4) | H200<br/>TP2 (FP8) | H100<br/>TP2 (FP8) |
|---|---|---|---|---|
| 1000/1000 | 6,920 | 7,769 | 2,587 | 2,209 |
| 1024/1024 | 6,842 | 7,751 | 2,582 | |
| 1024/8192 | 3,242 | 3,805 | 2,009 | |
| 8192/1024 | 1,362 | 1,491 | 537 | 398 |
| 32768/1024 | 274 | 302 | 120 | |

unit: `output tokens per second per GPU`

---

<a id="llama-v33-70b-rtx-configurations"></a>

# LLaMA v3.3 70B - RTX 6000 Pro Blackwell Server Edition

*Shows Tensor Parallel (TP) and Pipeline Parallel (PP) configurations*

| Sequence Length (ISL/OSL) | **1 GPU**<br/>TP1,PP1 (FP4) | **2 GPUs**<br/>TP1,PP2 (FP4) |
|---|---|---|
| 1000/1000 | 1,724 | 1,901 |
| 1024/1024 | 1,708 | 1,887 |
| 8192/1024 | 296 | 327 |
| 32768/1024 | | 67 |

unit: `output tokens per second per GPU`

---

<a id="llama-v4-maverick"></a>

# LLaMA v4 Maverick

| Sequence Length (ISL/OSL) | B200<br/>DEP4 (FP4) | GB200<br/>DEP4 (FP4) | H200<br/>DEP8 (FP8) |
|---|---|---|---|
| 1000/1000 | 11,337 | 11,828 | 4,146 |
| 1024/1024 | 11,227 | 11,905 | 4,180 |
| 1024/8192 | 5,174 | 5,508 | 1,157 |
| 1024/32768 | 2,204 | 2,300 | 679 |
| 8192/1024 | 3,279 | 3,444 | 1,276 |
| 32768/1024 | 859 | 963 | |

unit: `output tokens per second per GPU`

---

<a id="qwen3-235b-a22b"></a>

# Qwen3 235B A22B

| Sequence Length (ISL/OSL) | B200<br/>DEP4 (FP4) | GB200<br/>DEP4 (FP4) | H200<br/>DEP4 (FP8) | H100<br/>DEP8 (FP8) |
|---|---|---|---|---|
| 1000/1000 | 5,764 | 6,172 | 3,288 | 1,932 |
| 1024/1024 | 5,756 | 5,862 | 3,268 | 1,935 |
| 1024/8192 | 3,389 | 3,423 | 1,417 | 873 |
| 1024/32768 | 1,255 | | | |
| 8192/1024 | 1,410 | 1,464 | 627 | |
| 32768/1024 | 319 | 333 | 134 | |

unit: `output tokens per second per GPU`

---

<a id="qwen3-235b-a22b-rtx-configurations"></a>

# Qwen3 235B A22B - RTX 6000 Pro Blackwell Server Edition

*Shows Tensor Parallel (TP) and Pipeline Parallel (PP) configurations*

| Sequence Length (ISL/OSL) | **4 GPUs**<br/>DEP2,PP2 (FP4) | **8 GPUs**<br/>DEP8,PP1 (FP4) |
|---|---|---|
| 1000/1000 | 1,731 | 969 |
| 1024/1024 | 1,732 | 963 |
| 1024/8192 | 644 | 711 |
| 32768/1024 | 70 | |

unit: `output tokens per second per GPU`

---

<a id="qwen3-30b-a3b"></a>

# Qwen3 30B A3B

| Sequence Length (ISL/OSL) | B200<br/>TP1 (FP4) | GB200<br/>TP1 (FP4) |
|---|---|---|
| 1000/1000 | 26,971 | 22,856 |
| 1024/1024 | 26,611 | 22,201 |
| 1024/8192 | 13,497 | 14,272 |
| 1024/32768 | 4,494 | 4,925 |
| 8192/1024 | 5,735 | 6,201 |
| 32768/1024 | 1,265 | 1,380 |

unit: `output tokens per second per GPU`

---

<a id="qwen3-30b-a3b-rtx-configurations"></a>

# Qwen3 30B A3B - RTX 6000 Pro Blackwell Server Edition

*Shows Tensor Parallel (TP) and Pipeline Parallel (PP) configurations*

| Sequence Length (ISL/OSL) | **2 GPUs**<br/>DEP2,PP1 (FP4) | **4 GPUs**<br/>DEP2,PP2 (FP4) | **8 GPUs**<br/>DEP8,PP1 (FP4) | **1 GPU**<br/>TP1,PP1 (FP4) |
|---|---|---|---|---|
| 1000/1000 | 8,409 | 7,059 | 3,985 | 9,938 |
| 1024/1024 | | 7,019 | | 9,755 |
| 1024/8192 | 3,577 | | 2,406 | 3,621 |
| 8192/1024 | | 1,416 | | 1,914 |
| 32768/1024 | | | 180 | 374 |

unit: `output tokens per second per GPU`

---
## Reproducing Benchmarked Results

The following tables are references for commands that are used as part of the benchmarking process.
### Command Overview

Testing was performed using the PyTorch backend; this workflow does not require an engine to be built.

| Stage | Description | Command |
| :- | - | - |

For shorter sequences, more requests are used because requests enter and exit the system at a much faster rate. For longer input and output lengths, requests remain in the system longer and therefore require fewer requests to achieve steady state.

| Input Length | Output Length | Number of Requests |
|--------------|---------------|---------------------|
| 1024         | 1024          | 3000                |
| 8192         | 1024          | 1500                |
| 1024         | 8192          | 1500                |
| 32768        | 1024          | 1000                |
| 1024         | 32768         | 1000                |
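
As a sketch of how these datasets are generated: the `prepare_dataset.py` script in the TensorRT-LLM repository can synthesize a dataset with fixed input/output lengths (flags shown here reflect common usage and may differ across versions; `$isl`, `$osl`, and `$num_requests` come from the table above):

```shell
# Synthesize a fixed-length benchmark dataset (token-norm-dist with zero stdev)
python benchmarks/cpp/prepare_dataset.py --stdout --tokenizer $model_name \
  token-norm-dist --input-mean $isl --output-mean $osl \
  --input-stdev 0 --output-stdev 0 --num-requests $num_requests > $dataset_file
```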
### Running the Benchmark

`trtllm-bench` runs an offline maximum throughput scenario in which all requests are queued in rapid succession. You need to provide a model name (HuggingFace reference or path to a local model), a [generated dataset](#preparing-a-dataset), and a file containing any desired extra options to the LLM APIs (details in [tensorrt_llm/llmapi/llm_args.py:LlmArgs](source:tensorrt_llm/llmapi/llm_args.py)).
For dense / non-MoE models:

```shell
trtllm-bench --tp $tp_size --pp $pp_size --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options
```
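
For example, a single-GPU Llama 3.3 70B FP4 run could look like this (illustrative values; the dataset and options filenames are placeholders):

```shell
trtllm-bench --tp 1 --pp 1 --model nvidia/Llama-3.3-70B-Instruct-FP4 \
  throughput --dataset dataset.json --backend pytorch \
  --extra_llm_api_options llm_options.yml
```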

Llama 3.3:

`llm_options.yml`
```yaml
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 384, 512, 1024, 2048, 4096, 8192]
```

For MoE models:

```shell
trtllm-bench --tp $tp_size --pp $pp_size --ep $ep_size --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options
```
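
As an illustrative sketch only (not a prescribed configuration), a DEP4-style DeepSeek R1 run would pair `--tp 4 --ep 4` on the command line with `enable_attention_dp: true` in the options file:

```shell
trtllm-bench --tp 4 --pp 1 --ep 4 --model nvidia/DeepSeek-R1-0528-NVFP4-v2 \
  throughput --dataset dataset.json --backend pytorch \
  --extra_llm_api_options llm_options.yml
```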

GPT-OSS:

`llm_options.yml`
```yaml
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 384, 512, 1024, 2048, 4096, 8192]
enable_attention_dp: true
kv_cache_config:
  dtype: fp8  # Hopper: use auto
moe_config:
  backend: CUTLASS  # Hopper: use TRITON
```

DeepSeek R1:

`llm_options.yml`
```yaml
attention_dp_config:
  batching_wait_iters: 0
  enable_balance: true
  timeout_iters: 60
enable_attention_dp: true
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 384, 512, 1024, 2048, 4096, 8192]
moe_config:
  backend: CUTLASS
kv_cache_config:
  dtype: fp8
```

Qwen3 MoE, Llama4 Maverick:

`llm_options.yml`
```yaml
enable_attention_dp: true
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 384, 512, 1024, 2048, 4096, 8192]
```
In many cases, we also use a higher KV cache percentage by setting `--kv_cache_free_gpu_mem_fraction 0.95` in the benchmark command. This allows us to obtain better performance than the default setting of `0.90`. We fall back to `0.90` or lower if out-of-memory errors are encountered.
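
For example, appended to the throughput command shown earlier (a sketch; reduce the fraction if out-of-memory errors occur):

```shell
trtllm-bench --tp $tp_size --model $model_name throughput --dataset $dataset_file \
  --backend pytorch --extra_llm_api_options $llm_options \
  --kv_cache_free_gpu_mem_fraction 0.95
```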