# Performance of TensorRT-LLM

This document summarizes performance measurements of TensorRT-LLM on H100 (Hopper), L40S (Ada) and A100 (Ampere) GPUs for a few key models.

The data in the following tables is provided as a reference point to help users validate observed performance. It should not be considered the peak performance that TensorRT-LLM can deliver.

## Methodology

The performance numbers below were collected using the methodology described in the benchmarks folder.

## High Throughput

The tables below provide reference data at large batch sizes, representative of high-throughput tasks.
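
As a reading aid, the throughput column reports generated tokens per second: batch size × output length divided by the measured end-to-end generation time. A minimal sketch of that calculation, where `run_fn` is a hypothetical placeholder for a timed engine run (not a TensorRT-LLM API):

```python
import time

def output_token_throughput(batch_size: int, output_length: int, run_fn) -> float:
    """Tokens generated per second for one timed batch.

    `run_fn` is a hypothetical callable that generates `output_length`
    tokens for every sequence in the batch and blocks until completion.
    """
    start = time.perf_counter()
    run_fn()  # in a real benchmark this would be the engine generating tokens
    elapsed = time.perf_counter() - start
    return batch_size * output_length / elapsed

# Example with a dummy run_fn; real measurements time the inference engine instead.
tput = output_token_throughput(64, 128, lambda: time.sleep(0.75))
print(f"{tput:,.0f} output tokens/s")
```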

### H100 GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :--- | ---: | ---: | ---: | ---: | ---: |
| GPT-J 6B | 64 | 1 | 128 | 128 | 10,907 |
| GPT-J 6B | 64 | 1 | 128 | 2048 | 6,179 |
| GPT-J 6B | 64 | 1 | 2048 | 128 | 2,229 |
| GPT-J 6B | 64 | 1 | 2048 | 2048 | 2,980 |
| LLaMA 7B | 64 | 1 | 128 | 128 | 9,193 |
| LLaMA 7B | 64 | 1 | 128 | 2048 | 5,367 |
| LLaMA 7B | 64 | 1 | 2048 | 128 | 2,058 |
| LLaMA 7B | 32 | 1 | 2048 | 2048 | 2,230 |
| LLaMA 70B | 64 | 4 | 128 | 128 | 3,317 |
| LLaMA 70B | 64 | 4 | 128 | 2048 | 2,616 |
| LLaMA 70B | 64 | 4 | 2048 | 128 | 843 |
| LLaMA 70B | 64 | 4 | 2048 | 2048 | 1,583 |
| Falcon 180B | 96 | 8 | 128 | 128 | 2,686 |
| Falcon 180B | 96 | 8 | 128 | 2048 | 2,073 |
| Falcon 180B | 64 | 8 | 2048 | 128 | 465 |

### L40S GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :--- | ---: | ---: | ---: | ---: | ---: |
| GPT-J 6B | 64 | 1 | 128 | 128 | 3,630 |
| GPT-J 6B | 64 | 1 | 128 | 2048 | 1,859 |
| GPT-J 6B | 32 | 1 | 2048 | 128 | 616 |
| GPT-J 6B | 32 | 1 | 2048 | 2048 | 757 |
| LLaMA 7B | 64 | 1 | 128 | 128 | 3,240 |
| LLaMA 7B | 64 | 1 | 128 | 2048 | 1,622 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 581 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 531 |

### A100 GPUs (FP16)

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :--- | ---: | ---: | ---: | ---: | ---: |
| GPT-J 6B | 64 | 1 | 128 | 128 | 3,679 |
| GPT-J 6B | 32 | 1 | 128 | 2048 | 1,558 |
| GPT-J 6B | 32 | 1 | 2048 | 128 | 526 |
| GPT-J 6B | 16 | 1 | 2048 | 2048 | 650 |
| LLaMA 7B | 64 | 1 | 128 | 128 | 3,486 |
| LLaMA 7B | 32 | 1 | 128 | 2048 | 1,459 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 529 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 592 |
| LLaMA 70B | 64 | 4 | 128 | 128 | 1,237 |
| LLaMA 70B | 64 | 4 | 128 | 2048 | 1,181 |
| LLaMA 70B | 64 | 4 | 2048 | 128 | 272 |
| LLaMA 70B | 64 | 4 | 2048 | 2048 | 738 |
| Falcon 180B | 64 | 8 | 128 | 128 | 929 |
| Falcon 180B | 64 | 8 | 128 | 2048 | 923 |
| Falcon 180B | 64 | 8 | 2048 | 128 | 202 |

(1) TP stands for Tensor Parallelism.
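
For intuition on what the TP column controls: tensor parallelism shards each layer's weight matrices across GPUs so that every rank computes only part of the result. The NumPy sketch below simulates the ranks sequentially in a single process; it illustrates the arithmetic only and is not how TensorRT-LLM implements tensor parallelism.

```python
import numpy as np

def column_parallel_matmul(x: np.ndarray, w: np.ndarray, tp: int) -> np.ndarray:
    """Simulate a column-parallel linear layer sharded across `tp` ranks.

    Each rank holds a slice of the weight columns and computes a partial
    output; the partial outputs are concatenated (an all-gather on real GPUs).
    """
    shards = np.split(w, tp, axis=1)            # one weight shard per rank
    partials = [x @ shard for shard in shards]  # each rank's local Matmul
    return np.concatenate(partials, axis=1)

x = np.random.randn(2, 8)
w = np.random.randn(8, 16)
assert np.allclose(column_parallel_matmul(x, w, tp=4), x @ w)
```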

## Low Latency

The tables below provide reference data at batch size 1 for first-token latency, representative of the latency an end user perceives in online streaming tasks.
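
First-token latency here is the wall-clock time from submitting a request until the first generated token is available. A minimal sketch of that measurement, assuming a hypothetical streaming generator (`stream_tokens` below is a stand-in, not a TensorRT-LLM API):

```python
import time
from typing import Iterator

def first_token_latency_ms(stream: Iterator[str]) -> float:
    """Wall-clock time, in milliseconds, until the stream yields its first token."""
    start = time.perf_counter()
    next(stream)  # blocks through prefill until the first generated token arrives
    return (time.perf_counter() - start) * 1000.0

# Dummy generator standing in for a streaming inference call.
def stream_tokens() -> Iterator[str]:
    time.sleep(0.007)  # pretend the prefill phase takes ~7 ms
    yield "Hello"
    yield " world"

print(f"{first_token_latency_ms(stream_tokens()):.1f} ms to first token")
```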

### H100 GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--- | ---: | ---: | ---: | ---: |
| GPT-J 6B | 1 | 1 | 128 | 7 |
| GPT-J 6B | 1 | 1 | 2048 | 29 |
| LLaMA 7B | 1 | 1 | 128 | 7 |
| LLaMA 7B | 1 | 1 | 2048 | 36 |
| LLaMA 70B | 1 | 4 | 128 | 26 |
| LLaMA 70B | 1 | 4 | 2048 | 109 |
| Falcon 180B | 1 | 8 | 128 | 27 |
| Falcon 180B | 1 | 8 | 2048 | 205 |

### L40S GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--- | ---: | ---: | ---: | ---: |
| GPT-J 6B | 1 | 1 | 128 | 12 |
| GPT-J 6B | 1 | 1 | 2048 | 71 |
| LLaMA 7B | 1 | 1 | 128 | 14 |
| LLaMA 7B | 1 | 1 | 2048 | 73 |

### A100 GPUs (FP16)

| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--- | ---: | ---: | ---: | ---: |
| GPT-J 6B | 1 | 1 | 128 | 12 |
| GPT-J 6B | 1 | 1 | 2048 | 129 |
| LLaMA 7B | 1 | 1 | 128 | 16 |
| LLaMA 7B | 1 | 1 | 2048 | 133 |
| LLaMA 70B | 1 | 4 | 128 | 47 |
| LLaMA 70B | 1 | 4 | 2048 | 377 |
| Falcon 180B | 1 | 8 | 128 | 61 |
| Falcon 180B | 1 | 8 | 2048 | 509 |

(1) TP stands for Tensor Parallelism.

## Known Issues

The following issues are being addressed to improve the efficiency of TensorRT-LLM.

### Fused Matmul + Gated-SiLU (LLaMA)

There are different possible implementations for Matmul followed by Gated-SiLU. The simplest implementation uses two Matmul operations and combines the results in a separate CUDA kernel. That’s the current implementation in TensorRT-LLM. The next release will include a more efficient implementation that runs a single Matmul.
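
To make the equivalence concrete, the NumPy sketch below compares the two-Matmul formulation with a single Matmul over concatenated weights. It is a functional illustration of the math only, not the TensorRT-LLM kernels.

```python
import numpy as np

def silu(x: np.ndarray) -> np.ndarray:
    return x / (1.0 + np.exp(-x))

def gated_silu_two_matmuls(x, w_gate, w_up):
    # Current approach: two separate Matmuls, combined in an extra kernel.
    return silu(x @ w_gate) * (x @ w_up)

def gated_silu_single_matmul(x, w_gate, w_up):
    # Fused approach: one Matmul over concatenated weights, then split and combine.
    fused = x @ np.concatenate([w_gate, w_up], axis=1)
    gate, up = np.split(fused, 2, axis=1)
    return silu(gate) * up

x = np.random.randn(4, 32)
w_gate = np.random.randn(32, 64)
w_up = np.random.randn(32, 64)
assert np.allclose(gated_silu_two_matmuls(x, w_gate, w_up),
                   gated_silu_single_matmul(x, w_gate, w_up))
```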