# Performance of TensorRT-LLM
This document summarizes performance measurements of TensorRT-LLM on H100 (Hopper), L40S (Ada) and A100 (Ampere) GPUs for a few key models.
The data in the following tables is provided as a reference point to help users validate observed performance. It should not be considered the peak performance achievable with TensorRT-LLM.
## Methodology
The performance numbers below were collected using the methodology described in the benchmarks folder.
## High Throughput
The tables below provide reference data at large batch sizes, representative of high-throughput tasks.
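For orientation, the output-token throughput reported in these tables can be read as total generated tokens divided by end-to-end batch latency. A minimal Python sketch of that relationship, assuming this definition of the metric:

```python
def output_throughput(batch_size: int, output_len: int, e2e_latency_s: float) -> float:
    # Output tokens per second for one batched generation run:
    # every sequence in the batch produces `output_len` tokens.
    return batch_size * output_len / e2e_latency_s

# Example: the GPT-J 6B row below (batch 64, output 128, 10,907 tok/s)
# implies an end-to-end latency of about 64 * 128 / 10907 ≈ 0.75 s per batch.
```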
### H100 GPUs (FP8)
| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
|---|---|---|---|---|---|
| GPT-J 6B | 64 | 1 | 128 | 128 | 10,907 |
| GPT-J 6B | 64 | 1 | 128 | 2048 | 6,179 |
| GPT-J 6B | 64 | 1 | 2048 | 128 | 2,229 |
| GPT-J 6B | 64 | 1 | 2048 | 2048 | 2,980 |
| LLaMA 7B | 64 | 1 | 128 | 128 | 9,193 |
| LLaMA 7B | 64 | 1 | 128 | 2048 | 5,367 |
| LLaMA 7B | 64 | 1 | 2048 | 128 | 2,058 |
| LLaMA 7B | 32 | 1 | 2048 | 2048 | 2,230 |
| LLaMA 70B | 64 | 4 | 128 | 128 | 3,317 |
| LLaMA 70B | 64 | 4 | 128 | 2048 | 2,616 |
| LLaMA 70B | 64 | 4 | 2048 | 128 | 843 |
| LLaMA 70B | 64 | 4 | 2048 | 2048 | 1,583 |
| Falcon 180B | 96 | 8 | 128 | 128 | 2,686 |
| Falcon 180B | 96 | 8 | 128 | 2048 | 2,073 |
| Falcon 180B | 64 | 8 | 2048 | 128 | 465 |
### L40S GPUs (FP8)
| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
|---|---|---|---|---|---|
| GPT-J 6B | 64 | 1 | 128 | 128 | 3,630 |
| GPT-J 6B | 64 | 1 | 128 | 2048 | 1,859 |
| GPT-J 6B | 32 | 1 | 2048 | 128 | 616 |
| GPT-J 6B | 32 | 1 | 2048 | 2048 | 757 |
| LLaMA 7B | 64 | 1 | 128 | 128 | 3,240 |
| LLaMA 7B | 64 | 1 | 128 | 2048 | 1,622 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 581 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 531 |
### A100 GPUs (FP16)
| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
|---|---|---|---|---|---|
| GPT-J 6B | 64 | 1 | 128 | 128 | 3,679 |
| GPT-J 6B | 32 | 1 | 128 | 2048 | 1,558 |
| GPT-J 6B | 32 | 1 | 2048 | 128 | 526 |
| GPT-J 6B | 16 | 1 | 2048 | 2048 | 650 |
| LLaMA 7B | 64 | 1 | 128 | 128 | 3,486 |
| LLaMA 7B | 32 | 1 | 128 | 2048 | 1,459 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 529 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 592 |
| LLaMA 70B | 64 | 4 | 128 | 128 | 1,237 |
| LLaMA 70B | 64 | 4 | 128 | 2048 | 1,181 |
| LLaMA 70B | 64 | 4 | 2048 | 128 | 272 |
| LLaMA 70B | 64 | 4 | 2048 | 2048 | 738 |
| Falcon 180B | 64 | 8 | 128 | 128 | 929 |
| Falcon 180B | 64 | 8 | 128 | 2048 | 923 |
| Falcon 180B | 64 | 8 | 2048 | 128 | 202 |
(1) TP stands for Tensor Parallelism.
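Because the TP degree varies across rows, normalizing throughput per GPU can help when comparing configurations. A small illustrative helper (not part of TensorRT-LLM):

```python
def per_gpu_throughput(total_out_tok_s: float, tp: int) -> float:
    # Divide aggregate throughput by the number of GPUs in the
    # tensor-parallel group to get tokens/s contributed per GPU.
    return total_out_tok_s / tp

# Example with the H100 rows above: Falcon 180B at TP=8 delivers
# 2686 / 8 ≈ 336 out tok/s per GPU, while LLaMA 70B at TP=4 delivers
# 3317 / 4 ≈ 829 out tok/s per GPU.
```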
## Low Latency
The tables below provide reference data at batch size 1 for first-token latency, representative of the latency an end user perceives in online streaming tasks.
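First-token latency is the time from submitting a request to receiving the first streamed token, and is dominated by the prefill of the input prompt. A sketch of how such a measurement could be taken, where `generate_stream` is a hypothetical streaming-generation callable, not a TensorRT-LLM API:

```python
import time

def first_token_latency_ms(generate_stream, prompt: str) -> float:
    # `generate_stream` is a placeholder for any API that yields
    # tokens as they are produced; only the time until the first
    # yielded token is measured here.
    start = time.perf_counter()
    next(iter(generate_stream(prompt)))
    return (time.perf_counter() - start) * 1000.0
```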
### H100 GPUs (FP8)
| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
|---|---|---|---|---|
| GPT-J 6B | 1 | 1 | 128 | 7 |
| GPT-J 6B | 1 | 1 | 2048 | 29 |
| LLaMA 7B | 1 | 1 | 128 | 7 |
| LLaMA 7B | 1 | 1 | 2048 | 36 |
| LLaMA 70B | 1 | 4 | 128 | 26 |
| LLaMA 70B | 1 | 4 | 2048 | 109 |
| Falcon 180B | 1 | 8 | 128 | 27 |
| Falcon 180B | 1 | 8 | 2048 | 205 |
### L40S GPUs (FP8)
| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
|---|---|---|---|---|
| GPT-J 6B | 1 | 1 | 128 | 12 |
| GPT-J 6B | 1 | 1 | 2048 | 71 |
| LLaMA 7B | 1 | 1 | 128 | 14 |
| LLaMA 7B | 1 | 1 | 2048 | 73 |
### A100 GPUs (FP16)
| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
|---|---|---|---|---|
| GPT-J 6B | 1 | 1 | 128 | 12 |
| GPT-J 6B | 1 | 1 | 2048 | 129 |
| LLaMA 7B | 1 | 1 | 128 | 16 |
| LLaMA 7B | 1 | 1 | 2048 | 133 |
| LLaMA 70B | 1 | 4 | 128 | 47 |
| LLaMA 70B | 1 | 4 | 2048 | 377 |
| Falcon 180B | 1 | 8 | 128 | 61 |
| Falcon 180B | 1 | 8 | 2048 | 509 |
(1) TP stands for Tensor Parallelism.
## Known Issues
The following issues are being addressed to improve the efficiency of TensorRT-LLM.
### Fused Matmul + Gated-SiLU (LLaMA)
There are several possible implementations of a Matmul followed by Gated-SiLU. The simplest one uses two Matmul operations and combines their results in a separate CUDA kernel; that is the current implementation in TensorRT-LLM. The next release will include a more efficient implementation that runs a single Matmul.
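To make the difference concrete, here is a minimal PyTorch sketch of the two variants; the function and weight names are illustrative, not TensorRT-LLM internals:

```python
import torch
import torch.nn.functional as F

def gated_silu_two_matmuls(x, w_gate, w_up):
    # Current approach: two separate Matmuls, combined afterwards in
    # an elementwise gating kernel.
    return F.silu(x @ w_gate) * (x @ w_up)

def gated_silu_single_matmul(x, w_fused):
    # Fused approach: the gate and up projections are concatenated
    # into one weight matrix, so a single Matmul produces both halves.
    gate, up = (x @ w_fused).chunk(2, dim=-1)
    return F.silu(gate) * up

# The two are numerically equivalent when
#     w_fused = torch.cat((w_gate, w_up), dim=-1);
# the fused form launches one GEMM instead of two and reads the
# activations `x` from memory only once.
```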