# Performance of TensorRT-LLM

This document summarizes performance measurements of TensorRT-LLM on H100 (Hopper), L40S (Ada) and A100 (Ampere) GPUs for a few key models.

The data in the following tables is provided as a reference point to help users validate observed performance. It should not be considered the peak performance that TensorRT-LLM can deliver.

## Methodology

The performance numbers below were collected using the methodology described in the benchmarks folder.

## High Throughput

The tables below provide reference data at large batch sizes, representative of high-throughput tasks.
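
As a reading aid, the throughput column reports generated tokens per second: batch size × output length divided by the measured end-to-end generation time. A minimal sketch of that calculation, where `run_fn` is a hypothetical placeholder for a timed engine run (not a TensorRT-LLM API):

```python
import time

def output_token_throughput(batch_size: int, output_length: int, run_fn) -> float:
    """Tokens generated per second for one timed batch.

    `run_fn` is a hypothetical callable that generates `output_length`
    tokens for every sequence in the batch and blocks until completion.
    """
    start = time.perf_counter()
    run_fn()  # in a real benchmark this would be the engine generating tokens
    elapsed = time.perf_counter() - start
    return batch_size * output_length / elapsed

# Example with a dummy run_fn; real measurements time the inference engine instead.
tput = output_token_throughput(64, 128, lambda: time.sleep(0.75))
print(f"{tput:,.0f} output tokens/s")
```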

### H100 GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :--- | ---: | ---: | ---: | ---: | ---: |
| GPT-J 6B | 64 | 1 | 128 | 128 | 10,907 |
| GPT-J 6B | 64 | 1 | 128 | 2048 | 6,179 |
| GPT-J 6B | 64 | 1 | 2048 | 128 | 2,229 |
| GPT-J 6B | 64 | 1 | 2048 | 2048 | 2,980 |
| LLaMA 7B | 64 | 1 | 128 | 128 | 9,193 |
| LLaMA 7B | 64 | 1 | 128 | 2048 | 5,367 |
| LLaMA 7B | 64 | 1 | 2048 | 128 | 2,058 |
| LLaMA 7B | 32 | 1 | 2048 | 2048 | 2,230 |
| LLaMA 70B | 64 | 4 | 128 | 128 | 3,317 |
| LLaMA 70B | 64 | 4 | 128 | 2048 | 2,616 |
| LLaMA 70B | 64 | 4 | 2048 | 128 | 843 |
| LLaMA 70B | 64 | 4 | 2048 | 2048 | 1,583 |
| Falcon 180B | 96 | 8 | 128 | 128 | 2,686 |
| Falcon 180B | 96 | 8 | 128 | 2048 | 2,073 |
| Falcon 180B | 64 | 8 | 2048 | 128 | 465 |

### L40S GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :--- | ---: | ---: | ---: | ---: | ---: |
| GPT-J 6B | 64 | 1 | 128 | 128 | 3,630 |
| GPT-J 6B | 64 | 1 | 128 | 2048 | 1,859 |
| GPT-J 6B | 32 | 1 | 2048 | 128 | 616 |
| GPT-J 6B | 32 | 1 | 2048 | 2048 | 757 |
| LLaMA 7B | 64 | 1 | 128 | 128 | 3,240 |
| LLaMA 7B | 64 | 1 | 128 | 2048 | 1,622 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 581 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 531 |

### A100 GPUs (FP16)

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :--- | ---: | ---: | ---: | ---: | ---: |
| GPT-J 6B | 64 | 1 | 128 | 128 | 3,679 |
| GPT-J 6B | 32 | 1 | 128 | 2048 | 1,558 |
| GPT-J 6B | 32 | 1 | 2048 | 128 | 526 |
| GPT-J 6B | 16 | 1 | 2048 | 2048 | 650 |
| LLaMA 7B | 64 | 1 | 128 | 128 | 3,486 |
| LLaMA 7B | 32 | 1 | 128 | 2048 | 1,459 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 529 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 592 |
| LLaMA 70B | 64 | 4 | 128 | 128 | 1,237 |
| LLaMA 70B | 64 | 4 | 128 | 2048 | 1,181 |
| LLaMA 70B | 64 | 4 | 2048 | 128 | 272 |
| LLaMA 70B | 64 | 4 | 2048 | 2048 | 738 |
| Falcon 180B | 64 | 8 | 128 | 128 | 929 |
| Falcon 180B | 64 | 8 | 128 | 2048 | 923 |
| Falcon 180B | 64 | 8 | 2048 | 128 | 202 |

(1) TP stands for Tensor Parallelism.
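
For intuition on what the TP column controls: tensor parallelism shards each layer's weight matrices across GPUs so that every rank computes only part of the result. The NumPy sketch below simulates the ranks sequentially in a single process; it illustrates the arithmetic only and is not how TensorRT-LLM implements tensor parallelism.

```python
import numpy as np

def column_parallel_matmul(x: np.ndarray, w: np.ndarray, tp: int) -> np.ndarray:
    """Simulate a column-parallel linear layer sharded across `tp` ranks.

    Each rank holds a slice of the weight columns and computes a partial
    output; the partial outputs are concatenated (an all-gather on real GPUs).
    """
    shards = np.split(w, tp, axis=1)            # one weight shard per rank
    partials = [x @ shard for shard in shards]  # each rank's local Matmul
    return np.concatenate(partials, axis=1)

x = np.random.randn(2, 8)
w = np.random.randn(8, 16)
assert np.allclose(column_parallel_matmul(x, w, tp=4), x @ w)
```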

## Low Latency

The tables below provide reference data at batch size 1 for first-token latency, representative of the latency an end user perceives in online streaming tasks.
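
First-token latency here is the wall-clock time from submitting a request until the first generated token is available. A minimal sketch of that measurement, assuming a hypothetical streaming generator (`stream_tokens` below is a stand-in, not a TensorRT-LLM API):

```python
import time
from typing import Iterator

def first_token_latency_ms(stream: Iterator[str]) -> float:
    """Wall-clock time, in milliseconds, until the stream yields its first token."""
    start = time.perf_counter()
    next(stream)  # blocks through prefill until the first generated token arrives
    return (time.perf_counter() - start) * 1000.0

# Dummy generator standing in for a streaming inference call.
def stream_tokens() -> Iterator[str]:
    time.sleep(0.007)  # pretend the prefill phase takes ~7 ms
    yield "Hello"
    yield " world"

print(f"{first_token_latency_ms(stream_tokens()):.1f} ms to first token")
```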

### H100 GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--- | ---: | ---: | ---: | ---: |
| GPT-J 6B | 1 | 1 | 128 | 7 |
| GPT-J 6B | 1 | 1 | 2048 | 29 |
| LLaMA 7B | 1 | 1 | 128 | 7 |
| LLaMA 7B | 1 | 1 | 2048 | 36 |
| LLaMA 70B | 1 | 4 | 128 | 26 |
| LLaMA 70B | 1 | 4 | 2048 | 109 |
| Falcon 180B | 1 | 8 | 128 | 27 |
| Falcon 180B | 1 | 8 | 2048 | 205 |

### L40S GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--- | ---: | ---: | ---: | ---: |
| GPT-J 6B | 1 | 1 | 128 | 12 |
| GPT-J 6B | 1 | 1 | 2048 | 71 |
| LLaMA 7B | 1 | 1 | 128 | 14 |
| LLaMA 7B | 1 | 1 | 2048 | 73 |

### A100 GPUs (FP16)

| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--- | ---: | ---: | ---: | ---: |
| GPT-J 6B | 1 | 1 | 128 | 12 |
| GPT-J 6B | 1 | 1 | 2048 | 129 |
| LLaMA 7B | 1 | 1 | 128 | 16 |
| LLaMA 7B | 1 | 1 | 2048 | 133 |
| LLaMA 70B | 1 | 4 | 128 | 47 |
| LLaMA 70B | 1 | 4 | 2048 | 377 |
| Falcon 180B | 1 | 8 | 128 | 61 |
| Falcon 180B | 1 | 8 | 2048 | 509 |

(1) TP stands for Tensor Parallelism.

## Known Issues

The following issues are being addressed to improve the efficiency of TensorRT-LLM.

### Fused Matmul + Gated-SiLU (LLaMA)

There are different possible implementations for Matmul followed by Gated-SiLU. The simplest implementation uses two Matmul operations and combines the results in a separate CUDA kernel. That’s the current implementation in TensorRT-LLM. The next release will include a more efficient implementation that runs a single Matmul.
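
To make the equivalence concrete, the NumPy sketch below compares the two-Matmul formulation with a single Matmul over concatenated weights. It is a functional illustration of the math only, not the TensorRT-LLM kernels.

```python
import numpy as np

def silu(x: np.ndarray) -> np.ndarray:
    return x / (1.0 + np.exp(-x))

def gated_silu_two_matmuls(x, w_gate, w_up):
    # Current approach: two separate Matmuls, combined in an extra kernel.
    return silu(x @ w_gate) * (x @ w_up)

def gated_silu_single_matmul(x, w_gate, w_up):
    # Fused approach: one Matmul over concatenated weights, then split and combine.
    fused = x @ np.concatenate([w_gate, w_up], axis=1)
    gate, up = np.split(fused, 2, axis=1)
    return silu(gate) * up

x = np.random.randn(4, 32)
w_gate = np.random.randn(32, 64)
w_up = np.random.randn(32, 64)
assert np.allclose(gated_silu_two_matmuls(x, w_gate, w_up),
                   gated_silu_single_matmul(x, w_gate, w_up))
```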