
(perf-overview)=

# Overview

This document summarizes performance measurements of TensorRT-LLM on a number of GPUs across a set of key models.

The data in the following tables is provided as a reference point to help users validate observed performance. It should not be considered the peak performance that TensorRT-LLM can deliver.

We attempted to keep commands as simple as possible to ease reproducibility and left many options at their default settings. Tuning batch sizes, parallelism configurations, and other options may lead to improved performance depending on your situation.

For DeepSeek R1 performance, please check out our performance guide.

For more information on benchmarking with `trtllm-bench`, see this NVIDIA blog post.

## Throughput Measurements

The tables below show performance data for a scenario in which a local inference client is fed requests at an infinite rate (no delay between messages), i.e. throughput under maximum load. The reported metric is Total Output Throughput (tokens/sec).

The performance numbers below were collected using the steps described in this document.

Testing was performed on models with weights quantized using ModelOpt and published by NVIDIA on the Model Optimizer HuggingFace Collection.

**(NEW for v1.0) RTX 6000 Pro Blackwell Server Edition Benchmarks:**

RTX 6000 Pro Blackwell Server Edition data is now included in the perf overview. RTX 6000 systems can benefit from enabling pipeline parallelism (PP) in LLM workloads, so we included several new benchmarks for this GPU at various TP x PP combinations. That data is presented in a separate table for each network.

### Hardware

The following GPU variants were used for testing:

- H100 SXM 80GB (DGX H100)
- H200 SXM 141GB (DGX H200)
- B200 180GB (DGX B200)
- GB200 192GB (GB200 NVL72)
- RTX 6000 Pro Blackwell Server Edition

Other hardware variants may have different TDP, memory bandwidth, core count, or other features leading to performance differences on these workloads.

### FP4 Models

- nvidia/Llama-3.3-70B-Instruct-FP4
- nvidia/Llama-3.1-405B-Instruct-FP4
- nvidia/Qwen3-235B-A22B-FP4
- nvidia/Qwen3-30B-A3B-FP4
- nvidia/DeepSeek-R1-0528-FP4

### FP8 Models

- nvidia/Llama-3.1-8B-Instruct-FP8
- nvidia/Llama-3.3-70B-Instruct-FP8
- nvidia/Llama-3.1-405B-Instruct-FP8
- nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8
- nvidia/Qwen3-235B-A22B-FP8

### Llama 4 Scout

| Sequence Length (ISL/OSL) | B200 TP1 (FP4) | GB200 TP1 (FP4) | H200 TP4 (FP8) | H100 TP4 (FP8) |
|---|---|---|---|---|
| 128/2048 | 14,699 | 15,238 | 34,316 | 15,130 |
| 128/4096 | 8,932 | 9,556 | 21,332 | 8,603 |
| 500/2000 | 11,977 | 11,795 | 24,630 | 12,399 |
| 1000/1000 | 10,591 | 7,738 | 21,636 | 12,129 |
| 1000/2000 | 9,356 | 8,581 | 18,499 | 9,838 |
| 2048/128 | 3,137 | 3,295 | 3,699 | 3,253 |
| 2048/2048 | 7,152 | 7,464 | 14,949 | 7,972 |
| 5000/500 | 2,937 | 3,107 | 4,605 | 3,342 |
| 20000/2000 | 1,644 | 1,767 | 2,105 | |

#### RTX 6000 Pro Blackwell Server Edition

| Sequence Length (ISL/OSL) | 4 GPUs TP2,PP2 (FP4) | 8 GPUs TP4,PP2 (FP4) |
|---|---|---|
| 128/2048 | 12,321 | 21,035 |
| 128/4096 | 7,643 | 13,421 |
| 1000/1000 | 9,476 | 15,781 |
| 1000/2000 | 8,919 | 16,434 |
| 2048/128 | 2,615 | 2,941 |
| 2048/2048 | 6,208 | 10,410 |
| 5000/500 | 2,662 | |

### Llama 3.3 70B

| Sequence Length (ISL/OSL) | B200 TP1 (FP4) | GB200 TP1 (FP4) | H200 TP1 (FP8) | H100 TP2 (FP8) |
|---|---|---|---|---|
| 128/2048 | 9,922 | 11,309 | 4,336 | 6,651 |
| 128/4096 | 6,831 | 7,849 | 2,872 | 4,199 |
| 500/2000 | 7,762 | 9,028 | 3,666 | 5,222 |
| 1000/1000 | 7,007 | 7,326 | 2,909 | 4,205 |
| 1000/2000 | 6,271 | 6,513 | 2,994 | 4,146 |
| 2048/128 | 1,339 | 1,450 | 442 | 762 |
| 2048/2048 | 4,783 | 5,646 | 2,003 | 3,082 |
| 5000/500 | 1,459 | 1,602 | 566 | 898 |
| 20000/2000 | 665 | 755 | 283 | 437 |

#### RTX 6000 Pro Blackwell Server Edition

| Sequence Length (ISL/OSL) | 1 GPU TP1,PP1 (FP4) | 2 GPUs TP1,PP2 (FP4) | 4 GPUs TP1,PP4 (FP4) | 8 GPUs TP1,PP8 (FP4) |
|---|---|---|---|---|
| 128/2048 | 2,422 | 4,993 | 7,922 | 9,833 |
| 128/4096 | 1,349 | 2,893 | 4,978 | 7,352 |
| 500/2000 | 1,856 | 4,114 | 6,939 | 9,435 |
| 1000/1000 | 1,787 | 3,707 | 5,961 | 8,166 |
| 1000/2000 | 1,594 | 2,993 | 5,274 | 6,943 |
| 2048/128 | 393 | 813 | 1,511 | 2,495 |
| 2048/2048 | 1,074 | 2,336 | 3,870 | 6,078 |
| 5000/500 | 401 | 812 | 1,511 | 2,491 |
| 20000/2000 | 142 | 319 | 630 | 1,148 |

### Qwen3-235B-A22B

| Sequence Length (ISL/OSL) | B200 TP8 (FP4) | H200 TP8 (FP8) | H100 TP8 (FP8) |
|---|---|---|---|
| 128/2048 | 66,057 | 42,821 | 19,658 |
| 128/4096 | 39,496 | 26,852 | 12,447 |
| 500/2000 | 57,117 | 28,026 | 18,351 |
| 1000/1000 | 42,391 | 23,789 | 14,898 |
| 1000/2000 | 34,105 | 22,061 | 15,136 |
| 2048/128 | 7,329 | 3,331 | |
| 2048/2048 | 26,854 | 16,672 | 9,924 |
| 5000/500 | 8,190 | 3,623 | 3,225 |
| 20000/2000 | 4,453 | 1,876 | |

#### RTX 6000 Pro Blackwell Server Edition

| Sequence Length (ISL/OSL) | 8 GPUs TP2,PP4 (FP4) |
|---|---|
| 128/2048 | 12,494 |
| 128/4096 | 7,715 |
| 500/2000 | 11,157 |
| 1000/1000 | 10,697 |
| 1000/2000 | 10,109 |
| 2048/128 | 3,181 |
| 2048/2048 | 6,712 |
| 5000/500 | 3,173 |

### Qwen3-30B-A3B

| Sequence Length (ISL/OSL) | B200 TP1 (FP4) |
|---|---|
| 128/2048 | 37,844 |
| 128/4096 | 24,953 |
| 500/2000 | 27,817 |
| 1000/1000 | 25,828 |
| 1000/2000 | 22,051 |
| 2048/128 | 6,251 |
| 2048/2048 | 17,554 |
| 5000/500 | 6,142 |
| 20000/2000 | 2,944 |

#### RTX 6000 Pro Blackwell Server Edition

| Sequence Length (ISL/OSL) | 1 GPU TP1,PP1 (FP4) | 2 GPUs TP2,PP1 (FP4) | 4 GPUs TP4,PP1 (FP4) | 8 GPUs TP8,PP1 (FP4) |
|---|---|---|---|---|
| 128/2048 | 12,540 | 22,744 | 35,715 | 52,676 |
| 128/4096 | 7,491 | 15,049 | 28,139 | 33,895 |
| 500/2000 | 10,695 | 17,266 | 26,175 | 44,088 |
| 1000/1000 | 9,910 | 16,431 | 24,046 | 31,785 |
| 1000/2000 | 8,378 | 13,323 | 25,131 | 28,881 |
| 2048/128 | 3,257 | 3,785 | 4,311 | 4,798 |
| 2048/2048 | 5,908 | 10,679 | 18,134 | 22,391 |
| 5000/500 | 2,530 | 3,799 | 5,212 | 5,965 |
| 20000/2000 | 871 | 1,558 | 2,551 | |

### DeepSeek R1

| Sequence Length (ISL/OSL) | B200 TP8 (FP4) |
|---|---|
| 128/2048 | 62,599 |
| 128/4096 | 44,046 |
| 1000/1000 | 37,634 |
| 1000/2000 | 40,538 |
| 2048/128 | 5,026 |
| 2048/2048 | 28,852 |

### Llama 4 Maverick

| Sequence Length (ISL/OSL) | B200 TP8 (FP4) | H200 TP8 (FP8) | H100 TP8 (FP8) |
|---|---|---|---|
| 128/2048 | 112,676 | 40,572 | 10,829 |
| 128/4096 | 68,170 | 24,616 | 6,744 |
| 500/2000 | | 37,835 | 10,108 |
| 1000/1000 | 79,617 | 31,782 | 9,677 |
| 1000/2000 | 63,766 | 34,734 | 9,151 |
| 2048/128 | 18,088 | 7,307 | |
| 2048/2048 | 52,195 | 20,957 | 6,916 |
| 5000/500 | | 8,456 | 3,457 |
| 20000/2000 | 12,678 | 4,106 | |

#### RTX 6000 Pro Blackwell Server Edition

| Sequence Length (ISL/OSL) | 8 GPUs TP4,PP2 (FP4) |
|---|---|
| 128/2048 | 19,146 |
| 128/4096 | 12,165 |
| 500/2000 | 17,870 |
| 1000/1000 | 15,954 |
| 1000/2000 | 12,456 |
| 2048/128 | 4,463 |
| 2048/2048 | 10,727 |
| 5000/500 | 4,613 |

### Llama 3.1 405B

| Sequence Length (ISL/OSL) | B200 TP4 (FP4) | GB200 TP4 (FP4) | H200 TP8 (FP8) | H100 TP8 (FP8) |
|---|---|---|---|---|
| 128/2048 | 8,020 | 8,151 | 5,348 | 4,340 |
| 128/4096 | 6,345 | 6,608 | 4,741 | 3,116 |
| 500/2000 | 6,244 | 6,540 | 4,724 | 3,994 |
| 1000/1000 | 5,209 | 5,389 | 3,330 | 2,919 |
| 1000/2000 | 4,933 | 5,135 | 3,722 | 2,895 |
| 2048/128 | 749 | 797 | 456 | 453 |
| 2048/2048 | 4,212 | 4,407 | 2,948 | 2,296 |
| 5000/500 | 1,048 | 1,112 | 650 | 610 |
| 20000/2000 | 672 | 739 | 505 | 345 |

#### RTX 6000 Pro Blackwell Server Edition

| Sequence Length (ISL/OSL) | 8 GPUs TP1,PP8 (FP4) |
|---|---|
| 128/2048 | 2,981 |
| 1000/1000 | 2,369 |
| 1000/2000 | 1,931 |
| 2048/128 | 579 |
| 2048/2048 | 1,442 |

### Llama 3.1 8B

| Sequence Length (ISL/OSL) | H200 TP1 (FP8) | H100 TP1 (FP8) |
|---|---|---|
| 128/2048 | 26,221 | 22,714 |
| 128/4096 | 18,027 | 14,325 |
| 500/2000 | 20,770 | 17,660 |
| 1000/1000 | 17,744 | 15,220 |
| 1000/2000 | 16,828 | 13,899 |
| 2048/128 | 3,538 | 3,450 |
| 2048/2048 | 12,194 | 9,305 |
| 5000/500 | 3,902 | 3,459 |
| 20000/2000 | 1,804 | 1,351 |

## Reproducing Benchmarked Results

> [!NOTE]
> The only models supported in this workflow are those listed in the tables above.

The following tables are references for commands that are used as part of the benchmarking process. For a more detailed description of this benchmarking workflow, see the benchmarking suite documentation.

### Command Overview

Starting with v0.19, testing was performed using the PyTorch backend - this workflow does not require an engine to be built.

| Stage | Description | Command |
|---|---|---|
| Dataset | Create a synthetic dataset | `python benchmarks/cpp/prepare_dataset.py --tokenizer=$model_name --stdout token-norm-dist --num-requests=$num_requests --input-mean=$isl --output-mean=$osl --input-stdev=0 --output-stdev=0 > $dataset_file` |
| Run | Run a benchmark with a dataset | `trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options` |

### Variables

| Name | Description |
|---|---|
| `$isl` | Benchmark input sequence length. |
| `$osl` | Benchmark output sequence length. |
| `$tp_size` | Tensor parallel mapping degree to run the benchmark with. |
| `$pp_size` | Pipeline parallel mapping degree to run the benchmark with. |
| `$ep_size` | Expert parallel mapping degree to run the benchmark with. |
| `$model_name` | HuggingFace model name, e.g. `meta-llama/Llama-2-7b-hf`, or the path to a local weights directory. |
| `$dataset_file` | Location of the dataset file generated by `prepare_dataset.py`. |
| `$num_requests` | The number of requests to generate for the dataset. |
| `$seq_len` | A sequence length of ISL + OSL. |
| `$llm_options` | (Optional) A YAML file containing additional options for the LLM API. |
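
For example, a hypothetical setup for a Llama 3.3 70B FP8 run at ISL/OSL = 2048/2048 might set the variables as follows; the model, parallelism, and paths here are illustrative placeholders, not tuned recommendations:

```shell
# Illustrative values only; substitute your own model, lengths, and paths.
model_name=nvidia/Llama-3.3-70B-Instruct-FP8   # HuggingFace reference or local path
isl=2048
osl=2048
num_requests=1500                              # see the dataset-size table below
tp_size=1
pp_size=1
dataset_file=/tmp/dataset_2048_2048.txt
llm_options=llm_options.yml
```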

### Preparing a Dataset

To generate a synthetic dataset, use the provided script and run the following command:

```shell
python benchmarks/cpp/prepare_dataset.py --tokenizer=$model_name --stdout token-norm-dist --num-requests=$num_requests --input-mean=$isl --output-mean=$osl --input-stdev=0 --output-stdev=0 > $dataset_file
```

The command generates a text file at the path specified by `$dataset_file`, in which all requests share the same input/output sequence length combination. The script uses the tokenizer to retrieve the vocabulary size and randomly samples token IDs from it to create entirely random sequences. In the command above, all requests are uniform because the standard deviations for both input and output sequences are set to 0.

For each input and output sequence length combination, the table below details the `$num_requests` value that was used. For shorter input and output lengths, a larger number of requests was used to guarantee that the system reached a steady state, because requests enter and exit the system at a much faster rate. For longer input/output sequence lengths, requests remain in the system longer, so fewer requests are required to reach steady state.

| Input Length | Output Length | `$seq_len` | `$num_requests` |
|---|---|---|---|
| 128 | 128 | 256 | 30000 |
| 128 | 2048 | 2176 | 3000 |
| 128 | 4096 | 4224 | 1500 |
| 1000 | 2000 | 3000 | 1500 |
| 2048 | 128 | 2176 | 3000 |
| 2048 | 2048 | 4096 | 1500 |
| 5000 | 500 | 5500 | 1500 |
| 1000 | 1000 | 2000 | 3000 |
| 500 | 2000 | 2500 | 3000 |
| 20000 | 2000 | 22000 | 1000 |
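
As an illustration, the last row of this table maps onto the dataset command like so (with `$model_name` set as in the Variables section; the output file name is just an example):

```shell
# 20000 input / 2000 output, 1000 requests (last row of the table above)
python benchmarks/cpp/prepare_dataset.py --tokenizer=$model_name --stdout token-norm-dist \
  --num-requests=1000 --input-mean=20000 --output-mean=2000 \
  --input-stdev=0 --output-stdev=0 > dataset_20000_2000.txt
```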

### Running the Benchmark

To run the benchmark with the generated dataset, use the `trtllm-bench throughput` subcommand. The benchmarker runs an offline maximum-throughput scenario in which all requests are queued in rapid succession. You simply need to provide a model name (HuggingFace reference or path to a local model), a generated dataset, and a file containing any desired extra options for the LLM API (see `tensorrt_llm/llmapi/llm_args.py:LlmArgs` for details).

For dense / non-MoE models:

```shell
trtllm-bench --tp $tp_size --pp $pp_size --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options
```

`llm_options.yml`:

```yaml
cuda_graph_config:
  enable_padding: true
  batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256
    - 384
    - 512
    - 1024
    - 2048
    - 4096
    - 8192
```
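
As a concrete sketch, reusing the example dataset and options file from the earlier sections (illustrative values, not tuned recommendations), a dense Llama 3.3 70B FP8 run on a single GPU could look like:

```shell
# Illustrative single-GPU dense-model run; adjust model, parallelism, and paths as needed.
trtllm-bench --tp 1 --pp 1 \
  --model nvidia/Llama-3.3-70B-Instruct-FP8 \
  throughput \
  --dataset /tmp/dataset_2048_2048.txt \
  --backend pytorch \
  --extra_llm_api_options llm_options.yml
```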

For MoE models:

```shell
trtllm-bench --tp $tp_size --pp $pp_size --ep $ep_size --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options
```

`llm_options.yml`:

```yaml
enable_attention_dp: true
cuda_graph_config:
  enable_padding: true
  batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256
    - 384
    - 512
    - 1024
    - 2048
    - 4096
    - 8192
```
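
Similarly, a sketch of an MoE run such as Qwen3-235B-A22B FP4 at TP8 could look like the following; the EP size and file paths here are assumptions chosen for the example, not values taken from the tables above:

```shell
# Illustrative MoE run; --ep 8 and the paths are example values, tune them for your setup.
trtllm-bench --tp 8 --pp 1 --ep 8 \
  --model nvidia/Qwen3-235B-A22B-FP4 \
  throughput \
  --dataset /tmp/dataset_2048_2048.txt \
  --backend pytorch \
  --extra_llm_api_options llm_options.yml
```

Note that the MoE options file above additionally sets `enable_attention_dp: true`, which enables attention data parallelism for these models.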

In many cases, we also use a higher KV cache percentage by setting `--kv_cache_free_gpu_mem_fraction 0.95` in the benchmark command. This allows us to obtain better performance than the default setting of 0.90. We fall back to 0.90 or lower if out-of-memory errors are encountered.
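
For example, the flag can be appended to the dense-model command from above; the placement shown here assumes it is accepted as a `throughput` subcommand option:

```shell
trtllm-bench --tp $tp_size --pp $pp_size --model $model_name \
  throughput \
  --dataset $dataset_file \
  --backend pytorch \
  --extra_llm_api_options $llm_options \
  --kv_cache_free_gpu_mem_fraction 0.95
```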

The results will be printed to the terminal upon benchmark completion. For example,

```
===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec):                     43.2089
Total Output Throughput (tokens/sec):             5530.7382
Per User Output Throughput (tokens/sec/user):     2.0563
Per GPU Output Throughput (tokens/sec/gpu):       5530.7382
Total Token Throughput (tokens/sec):              94022.5497
Total Latency (ms):                               115716.9214
Average request latency (ms):                     75903.4456
Per User Output Speed [1/TPOT] (tokens/sec/user): 5.4656
Average time-to-first-token [TTFT] (ms):          52667.0339
Average time-per-output-token [TPOT] (ms):        182.9639

-- Per-Request Time-per-Output-Token [TPOT] Breakdown (ms)

[TPOT] MINIMUM: 32.8005
[TPOT] MAXIMUM: 208.4667
[TPOT] AVERAGE: 182.9639
[TPOT] P50    : 204.0463
[TPOT] P90    : 206.3863
[TPOT] P95    : 206.5064
[TPOT] P99    : 206.5821

-- Per-Request Time-to-First-Token [TTFT] Breakdown (ms)

[TTFT] MINIMUM: 3914.7621
[TTFT] MAXIMUM: 107501.2487
[TTFT] AVERAGE: 52667.0339
[TTFT] P50    : 52269.7072
[TTFT] P90    : 96583.7187
[TTFT] P95    : 101978.4566
[TTFT] P99    : 106563.4497

-- Request Latency Breakdown (ms) -----------------------

[Latency] P50    : 78509.2102
[Latency] P90    : 110804.0017
[Latency] P95    : 111302.9101
[Latency] P99    : 111618.2158
[Latency] MINIMUM: 24189.0838
[Latency] MAXIMUM: 111668.0964
[Latency] AVERAGE: 75903.4456
```
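
As a quick sanity check on such a report: because the synthetic dataset uses uniform sequence lengths (standard deviation 0), Total Output Throughput should be approximately Request Throughput × OSL, and Total Token Throughput approximately Request Throughput × (ISL + OSL). In the example above, 43.2089 req/sec × 128 ≈ 5,531 tokens/sec and 43.2089 × 2,176 ≈ 94,023 tokens/sec, consistent with a 2048/128 dataset.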

> [!WARNING]
> In some cases, the benchmarker may not print anything at all. This behavior usually means that the benchmark has hit an out-of-memory issue. Try reducing the KV cache percentage using the `--kv_cache_free_gpu_mem_fraction` option to lower the fraction of memory used.