
(perf-overview)=

# Overview

This document summarizes performance measurements of TensorRT-LLM on a number of GPUs across a set of key models.

The data in the following tables is provided as a reference point to help users validate observed performance. It should not be considered the peak performance achievable by TensorRT-LLM.

Not all configurations were tested for all GPUs.

We attempted to keep commands as simple as possible to ease reproducibility and left many options at their default settings. Tuning batch sizes, parallelism configurations, and other options may lead to improved performance depending on your situation.

For DeepSeek R1 performance, please check out our performance guide.

For more information on benchmarking with `trtllm-bench`, see this NVIDIA blog post.

## Throughput Measurements

The tables below show performance data for a scenario in which a local inference client submits requests at a high rate (no delay between requests), placing the system under maximum load. The reported metric is Output Throughput per GPU (tokens/sec/GPU).

The performance numbers below were collected using the steps described in this document.

Testing was performed on models with weights quantized using ModelOpt and published by NVIDIA on the Model Optimizer HuggingFace Collection.

RTX 6000 Pro Blackwell Server Edition data is now included in the perf overview. RTX 6000 systems can benefit from enabling pipeline parallelism (PP) in LLM workloads, so we included several new benchmarks for this GPU at various TP x PP combinations. That data is presented in a separate table for each network.

### Hardware

The following GPU variants were used for testing:

- H100 SXM 80GB (DGX H100)
- H200 SXM 141GB (DGX H200)
- B200 180GB (DGX B200)
- GB200 192GB (GB200 NVL72)
- RTX 6000 Pro Blackwell Server Edition

Other hardware variants may have different TDP, memory bandwidth, core count, or other features leading to performance differences on these workloads.

### FP4 Models

- nvidia/DeepSeek-R1-0528-NVFP4-v2
- nvidia/Qwen3-235B-A22B-FP4
- nvidia/Qwen3-30B-A3B-FP4
- nvidia/Llama-3.3-70B-Instruct-FP4
- nvidia/Llama-4-Maverick-17B-128E-Instruct-NVFP4

### FP8 Models

- deepseek-ai/DeepSeek-R1-0528
- nvidia/Qwen3-235B-A22B-FP8
- nvidia/Llama-3.3-70B-Instruct-FP8
- nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8

## Performance Summary - All Networks

### Units

All performance values are measured in output tokens per second per GPU, where the output token count includes the first and all subsequent generated tokens (input tokens are not included).

Data in these tables is taken from the Per GPU Output Throughput (tps/gpu) metric reported by trtllm-bench. The calculations for the metrics reported by trtllm-bench can be found in the dataclasses defined in reporting.py and statistics.py.
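As a rough sanity check on the units, the per-GPU throughput can be read as the total number of generated tokens divided by the end-to-end benchmark duration and by the number of GPUs in the parallel mapping. This is a simplified reading of the metric, not the exact implementation; refer to the files above for the authoritative calculation.

$$
\text{tokens/sec/GPU} \approx \frac{\text{total output tokens}}{\text{total latency (s)} \times \text{number of GPUs}}
$$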



### Deepseek R1 0528

| Sequence Length (ISL/OSL) | B200<br/>DEP4 (FP4) | GB200<br/>DEP4 (FP4) | H200<br/>DEP8 (FP8) |
|:--------------------------|--------------------:|---------------------:|--------------------:|
| 1000/1000  | 6,463 | 6,939 | 1,627 |
| 1024/1024  | 6,430 | 6,924 | 1,620 |
| 1024/8192  | 3,862 | 4,379 | 1,218 |
| 1024/32768 | 1,451 | 1,465 | 438   |
| 8192/1024  | 1,168 | 1,192 |       |

unit: output tokens per second per GPU


### GPT-OSS 120B

| Sequence Length (ISL/OSL) | B200<br/>DEP2 (FP4) | GB200<br/>TP1 (FP4) | H200<br/>TP1 (FP8) | H100<br/>DEP4 (FP8) |
|:--------------------------|--------------------:|--------------------:|-------------------:|--------------------:|
| 1000/1000  | 25,943 | 27,198 | 6,868 | 4,685 |
| 1024/1024  | 25,870 | 26,609 | 6,798 | 4,715 |
| 1024/8192  | 17,289 | 14,800 | 3,543 |       |
| 1024/32768 | 6,279  | 5,556  | 1,177 |       |
| 8192/1024  | 6,111  | 6,835  | 1,828 | 1,169 |
| 32768/1024 | 1,392  | 1,645  | 519   | 333   |

unit: output tokens per second per GPU


### GPT-OSS 20B

| Sequence Length (ISL/OSL) | B200<br/>TP1 (FP4) | GB200<br/>TP1 (FP4) | H200<br/>TP1 (FP8) | H100<br/>TP1 (FP8) |
|:--------------------------|-------------------:|--------------------:|-------------------:|-------------------:|
| 1000/1000  | 53,812 | 55,823 | 13,858 | 11,557 |
| 1024/1024  | 53,491 | 56,528 | 13,890 | 11,403 |
| 1024/8192  | 34,702 | 38,100 | 12,743 | 8,617  |
| 1024/32768 | 14,589 | 16,463 |        |        |
| 8192/1024  | 11,904 | 12,941 | 4,015  | 3,366  |
| 32768/1024 | 2,645  | 2,905  | 915    | 785    |

unit: output tokens per second per GPU


### LLaMA v3.3 70B

| Sequence Length (ISL/OSL) | B200<br/>TP1 (FP4) | GB200<br/>TP1 (FP4) | H200<br/>TP2 (FP8) | H100<br/>TP2 (FP8) |
|:--------------------------|-------------------:|--------------------:|-------------------:|-------------------:|
| 1000/1000  | 6,920 | 7,769 | 2,587 | 2,209 |
| 1024/1024  | 6,842 | 7,751 | 2,582 |       |
| 1024/8192  | 3,242 | 3,805 | 2,009 |       |
| 8192/1024  | 1,362 | 1,491 | 537   | 398   |
| 32768/1024 | 274   | 302   | 120   |       |

unit: output tokens per second per GPU


### LLaMA v3.3 70B - RTX 6000 Pro Blackwell Server Edition

Shows Tensor Parallel (TP) and Pipeline Parallel (PP) configurations.

| Sequence Length (ISL/OSL) | 1 GPU<br/>TP1,PP1 (FP4) | 2 GPUs<br/>TP1,PP2 (FP4) |
|:--------------------------|------------------------:|-------------------------:|
| 1000/1000  | 1,724 | 1,901 |
| 1024/1024  | 1,708 | 1,887 |
| 8192/1024  | 296   | 327   |
| 32768/1024 | 67    |       |

unit: output tokens per second per GPU


### LLaMA v4 Maverick

| Sequence Length (ISL/OSL) | B200<br/>DEP4 (FP4) | GB200<br/>DEP4 (FP4) | H200<br/>DEP8 (FP8) |
|:--------------------------|--------------------:|---------------------:|--------------------:|
| 1000/1000  | 11,337 | 11,828 | 4,146 |
| 1024/1024  | 11,227 | 11,905 | 4,180 |
| 1024/8192  | 5,174  | 5,508  | 1,157 |
| 1024/32768 | 2,204  | 2,300  | 679   |
| 8192/1024  | 3,279  | 3,444  | 1,276 |
| 32768/1024 | 859    | 963    |       |

unit: output tokens per second per GPU


### Qwen3 235B A22B

| Sequence Length (ISL/OSL) | B200<br/>DEP4 (FP4) | GB200<br/>DEP4 (FP4) | H200<br/>DEP4 (FP8) | H100<br/>DEP8 (FP8) |
|:--------------------------|--------------------:|---------------------:|--------------------:|--------------------:|
| 1000/1000  | 5,764 | 6,172 | 3,288 | 1,932 |
| 1024/1024  | 5,756 | 5,862 | 3,268 | 1,935 |
| 1024/8192  | 3,389 | 3,423 | 1,417 | 873   |
| 1024/32768 | 1,255 |       |       |       |
| 8192/1024  | 1,410 | 1,464 | 627   |       |
| 32768/1024 | 319   | 333   | 134   |       |

unit: output tokens per second per GPU


### Qwen3 235B A22B - RTX 6000 Pro Blackwell Server Edition

Shows Tensor Parallel (TP) and Pipeline Parallel (PP) configurations.

| Sequence Length (ISL/OSL) | 4 GPUs<br/>DEP2,PP2 (FP4) | 8 GPUs<br/>DEP8,PP1 (FP4) |
|:--------------------------|--------------------------:|--------------------------:|
| 1000/1000  | 1,731 | 969 |
| 1024/1024  | 1,732 | 963 |
| 1024/8192  | 644   | 711 |
| 32768/1024 | 70    |     |

unit: output tokens per second per GPU


### Qwen3 30B A3B

| Sequence Length (ISL/OSL) | B200<br/>TP1 (FP4) | GB200<br/>TP1 (FP4) |
|:--------------------------|-------------------:|--------------------:|
| 1000/1000  | 26,971 | 22,856 |
| 1024/1024  | 26,611 | 22,201 |
| 1024/8192  | 13,497 | 14,272 |
| 1024/32768 | 4,494  | 4,925  |
| 8192/1024  | 5,735  | 6,201  |
| 32768/1024 | 1,265  | 1,380  |

unit: output tokens per second per GPU


### Qwen3 30B A3B - RTX 6000 Pro Blackwell Server Edition

Shows Tensor Parallel (TP) and Pipeline Parallel (PP) configurations.

| Sequence Length (ISL/OSL) | 2 GPUs<br/>DEP2,PP1 (FP4) | 4 GPUs<br/>DEP2,PP2 (FP4) | 8 GPUs<br/>DEP8,PP1 (FP4) | 1 GPU<br/>TP1,PP1 (FP4) |
|:--------------------------|--------------------------:|--------------------------:|--------------------------:|------------------------:|
| 1000/1000  | 8,409 | 7,059 | 3,985 | 9,938 |
| 1024/1024  |       | 7,019 |       | 9,755 |
| 1024/8192  | 3,577 | 2,406 | 3,621 |       |
| 8192/1024  |       | 1,416 |       | 1,914 |
| 32768/1024 |       | 180   |       | 374   |

unit: output tokens per second per GPU


## Reproducing Benchmarked Results

Only the models shown in the tables above are supported by this workflow.

The following tables are references for commands that are used as part of the benchmarking process. For a more detailed description of this benchmarking workflow, see the benchmarking suite documentation.

### Command Overview

Testing was performed using the PyTorch backend - this workflow does not require an engine to be built.

| Stage | Description | Command |
| :- | :- | :- |
| Dataset | Create a synthetic dataset | `python benchmarks/cpp/prepare_dataset.py --tokenizer=$model_name --stdout token-norm-dist --num-requests=$num_requests --input-mean=$isl --output-mean=$osl --input-stdev=0 --output-stdev=0 > $dataset_file` |
| Run | Run a benchmark with a dataset | `trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options` |

### Variables

| Name | Description |
| :- | :- |
| `$isl` | Benchmark input sequence length. |
| `$osl` | Benchmark output sequence length. |
| `$tp_size` | Tensor parallel mapping degree to run the benchmark with. |
| `$pp_size` | Pipeline parallel mapping degree to run the benchmark with. |
| `$ep_size` | Expert parallel mapping degree to run the benchmark with. |
| `$model_name` | HuggingFace model name, e.g. meta-llama/Llama-2-7b-hf, or the path to a local weights directory. |
| `$dataset_file` | Location of the dataset file generated by `prepare_dataset.py`. |
| `$num_requests` | The number of requests to generate for dataset generation. |
| `$seq_len` | A sequence length of ISL + OSL. |
| `$llm_options` | (optional) A YAML file containing additional options for the LLM API. |
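As a concrete illustration, the values below would target the Llama 3.3 70B FP8 entry benchmarked at TP2 in the tables above with a 1024/1024 dataset. The paths are placeholders, and the request count follows the table in the next section.

```shell
# Illustrative settings only; substitute your own model, sequence lengths, and paths.
model_name=nvidia/Llama-3.3-70B-Instruct-FP8
isl=1024
osl=1024
num_requests=3000                         # per the request-count table below
tp_size=2
pp_size=1
dataset_file=/tmp/dataset_1024_1024.txt   # placeholder path
llm_options=llm_options.yml
```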

### Preparing a Dataset

To prepare a synthetic dataset, run the provided script:

```shell
python benchmarks/cpp/prepare_dataset.py --tokenizer=$model_name --stdout token-norm-dist --num-requests=$num_requests --input-mean=$isl --output-mean=$osl --input-stdev=0 --output-stdev=0 > $dataset_file
```

The command generates a text file at the path specified by $dataset_file in which all requests have the same input/output sequence length combination. The script uses the tokenizer to retrieve the vocabulary size and randomly samples token IDs from it to create entirely random sequences. In the command above, all requests are uniform because the standard deviations for both input and output sequences are set to 0.

For each input and output sequence length combination, the table below details the $num_requests value that was used. For shorter input and output lengths, a larger number of requests was used to guarantee that the system reaches a steady state, because requests enter and exit the system at a much faster rate. For longer input/output sequence lengths, requests remain in the system longer and therefore fewer requests are needed to reach steady state.

| Input Length | Output Length | Number of Requests |
|-------------:|--------------:|-------------------:|
| 1024  | 1024  | 3000 |
| 8192  | 1024  | 1500 |
| 1024  | 8192  | 1500 |
| 32768 | 1024  | 1000 |
| 1024  | 32768 | 1000 |
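For instance, generating the 1024/8192 dataset used above (1,500 requests per the table) with the Llama 3.3 70B FP8 tokenizer would look like the following; the output path is arbitrary.

```shell
python benchmarks/cpp/prepare_dataset.py --tokenizer=nvidia/Llama-3.3-70B-Instruct-FP8 --stdout token-norm-dist \
    --num-requests=1500 --input-mean=1024 --output-mean=8192 --input-stdev=0 --output-stdev=0 > dataset_1024_8192.txt
```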

### Running the Benchmark

To run the benchmark with the generated dataset, use the `trtllm-bench throughput` subcommand. The benchmarker runs an offline maximum-throughput scenario in which all requests are queued in rapid succession. You need to provide a model name (HuggingFace reference or path to a local model), a generated dataset, and a file containing any desired extra options for the LLM API (details in tensorrt_llm/llmapi/llm_args.py:LlmArgs).

For dense / non-MoE models:

```shell
trtllm-bench --tp $tp_size --pp $pp_size --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options
```

Llama 3.3:

`llm_options.yml`

```yaml
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 384, 512, 1024, 2048, 4096, 8192]
```
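Putting the pieces together, a dense-model run matching the H200 TP2 (FP8) Llama 3.3 entry above could be launched as follows; the dataset file is the one produced in the previous step and `llm_options.yml` contains the options shown above.

```shell
trtllm-bench --tp 2 --pp 1 --model nvidia/Llama-3.3-70B-Instruct-FP8 \
    throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options llm_options.yml
```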

For MoE models:

```shell
trtllm-bench --tp $tp_size --pp $pp_size --ep $ep_size --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options
```
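As one concrete instance, the DeepSeek R1 H200 DEP8 (FP8) entry above maps to roughly the following invocation, under the interpretation that DEP*n* denotes attention data parallelism plus expert parallelism across *n* GPUs (with `enable_attention_dp: true` set in the options file, as in the DeepSeek R1 example below):

```shell
trtllm-bench --tp 8 --pp 1 --ep 8 --model deepseek-ai/DeepSeek-R1-0528 \
    throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options llm_options.yml
```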

GPT-OSS:

`llm_options.yml`

```yaml
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 384, 512, 1024, 2048, 4096, 8192]
enable_attention_dp: true
kv_cache_config:
  dtype: fp8
  # Hopper: use auto
moe_config:
  backend: CUTLASS
  # Hopper: use TRITON
```

DeepSeek R1:

`llm_options.yml`

```yaml
attention_dp_config:
  batching_wait_iters: 0
  enable_balance: true
  timeout_iters: 60
enable_attention_dp: true
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 384, 512, 1024, 2048, 4096, 8192]
moe_config:
  backend: CUTLASS
kv_cache_config:
  dtype: fp8
```

Qwen3 MoE, Llama4 Maverick:

`llm_options.yml`

```yaml
enable_attention_dp: true
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 384, 512, 1024, 2048, 4096, 8192]
```

In many cases, we also use a higher KV cache percentage by setting --kv_cache_free_gpu_mem_fraction 0.95 in the benchmark command. This allows us to obtain better performance than the default setting of 0.90. We fall back to 0.90 or lower if out-of-memory errors are encountered.
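For example, with the generic throughput invocation from above, the flag is simply added to the command:

```shell
trtllm-bench --tp $tp_size --pp $pp_size --model $model_name \
    throughput --dataset $dataset_file --backend pytorch \
    --extra_llm_api_options $llm_options --kv_cache_free_gpu_mem_fraction 0.95
```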

The results will be printed to the terminal upon benchmark completion. For example,

```text
===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec):                     43.2089
Total Output Throughput (tokens/sec):             5530.7382
Per User Output Throughput (tokens/sec/user):     2.0563
Per GPU Output Throughput (tokens/sec/gpu):       5530.7382
Total Token Throughput (tokens/sec):              94022.5497
Total Latency (ms):                               115716.9214
Average request latency (ms):                     75903.4456
Per User Output Speed [1/TPOT] (tokens/sec/user): 5.4656
Average time-to-first-token [TTFT] (ms):          52667.0339
Average time-per-output-token [TPOT] (ms):        182.9639

-- Per-Request Time-per-Output-Token [TPOT] Breakdown (ms)

[TPOT] MINIMUM: 32.8005
[TPOT] MAXIMUM: 208.4667
[TPOT] AVERAGE: 182.9639
[TPOT] P50    : 204.0463
[TPOT] P90    : 206.3863
[TPOT] P95    : 206.5064
[TPOT] P99    : 206.5821

-- Per-Request Time-to-First-Token [TTFT] Breakdown (ms)

[TTFT] MINIMUM: 3914.7621
[TTFT] MAXIMUM: 107501.2487
[TTFT] AVERAGE: 52667.0339
[TTFT] P50    : 52269.7072
[TTFT] P90    : 96583.7187
[TTFT] P95    : 101978.4566
[TTFT] P99    : 106563.4497

-- Request Latency Breakdown (ms) -----------------------

[Latency] P50    : 78509.2102
[Latency] P90    : 110804.0017
[Latency] P95    : 111302.9101
[Latency] P99    : 111618.2158
[Latency] MINIMUM: 24189.0838
[Latency] MAXIMUM: 111668.0964
[Latency] AVERAGE: 75903.4456
```

> [!WARNING]
> In some cases, the benchmarker may not print anything at all. This behavior usually means that the benchmark has hit an out-of-memory issue. Try lowering the KV cache fraction with the `--kv_cache_free_gpu_mem_fraction` option to reduce memory usage.