mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Benchmark for Python Runtime

This document explains how to benchmark the models supported by TensorRT-LLM on a single GPU, on a single node with multiple GPUs, or on multiple nodes with multiple GPUs.

Overview

The benchmark implementation and entrypoint are in benchmarks/python/benchmark.py. The same directory contains several supporting scripts:

benchmarks/python/allowed_configs.py defines the configuration for each supported model.
benchmarks/python/base_benchmark.py implements the base class for benchmarks.
benchmarks/python/gpt_benchmark.py implements the benchmark for GPT and GPT-like (LLaMA, OPT, GPT-J, SmoothQuant-GPT) models.
benchmarks/python/bert_benchmark.py implements the benchmark for BERT models.

Usage

Use the help option to see detailed usage:

python benchmark.py -h

1. Single GPU benchmark

Take GPT-350M as an example:

python benchmark.py \
    -m gpt_350m \
    --mode plugin \
    --batch_size "1;8;64" \
    --input_output_len "60,20;128,20"

Expected output:

[BENCHMARK] model_name gpt_350m world_size 1 num_heads 16 num_layers 24 hidden_size 1024 vocab_size 51200 precision float16 batch_size 1 input_length 60 output_length 20 build_time(s) 89.8 tokens_per_sec 378.12 percentile95(ms) 53.284 percentile99(ms) 53.284 latency(ms) 52.893
[BENCHMARK] model_name gpt_350m world_size 1 num_heads 16 num_layers 24 hidden_size 1024 vocab_size 51200 precision float16 batch_size 8 input_length 60 output_length 20 build_time(s) 89.8 tokens_per_sec 361.06 percentile95(ms) 55.739 percentile99(ms) 55.739 latency(ms) 55.392
[BENCHMARK] model_name gpt_350m world_size 1 num_heads 16 num_layers 24 hidden_size 1024 vocab_size 51200 precision float16 batch_size 64 input_length 60 output_length 20 build_time(s) 89.8 tokens_per_sec 246.03 percentile95(ms) 81.533 percentile99(ms) 81.533 latency(ms) 81.29
...

Note that the output above is for reference only; actual performance numbers depend on the GPU you use.
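Each `[BENCHMARK]` line above is a flat sequence of alternating keys and values, so it is easy to load into a dictionary for further analysis. The following is a minimal sketch, not part of the repository: `parse_benchmark_line` and `_convert` are hypothetical helper names. It also checks an observation drawn only from the sample rows shown here, namely that `tokens_per_sec` matches `output_length` divided by `latency(ms)` in seconds (within rounding), without a batch-size factor. Treat that relationship as an inference from the sample data, not a documented definition.

```python
def _convert(raw: str):
    """Turn a raw token into int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def parse_benchmark_line(line: str) -> dict:
    """Parse one '[BENCHMARK] key value key value ...' line into a dict."""
    prefix = "[BENCHMARK]"
    if not line.startswith(prefix):
        raise ValueError("not a benchmark line")
    tokens = line[len(prefix):].split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating key/value tokens")
    # Even positions are keys, odd positions are the matching values.
    return {key: _convert(raw) for key, raw in zip(tokens[0::2], tokens[1::2])}


# First sample line from the expected output above.
sample = (
    "[BENCHMARK] model_name gpt_350m world_size 1 num_heads 16 num_layers 24 "
    "hidden_size 1024 vocab_size 51200 precision float16 batch_size 1 "
    "input_length 60 output_length 20 build_time(s) 89.8 tokens_per_sec 378.12 "
    "percentile95(ms) 53.284 percentile99(ms) 53.284 latency(ms) 52.893"
)

record = parse_benchmark_line(sample)

# Assumption inferred from the sample rows: rate = output tokens / latency in seconds.
derived_rate = record["output_length"] / (record["latency(ms)"] / 1000.0)
print(record["model_name"], record["batch_size"], record["tokens_per_sec"], derived_rate)
```

Keys keep their unit suffixes exactly as printed (for example `latency(ms)`), so the dictionary can be written to CSV or compared across runs without renaming.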

2. Multi-GPU benchmark

Take GPT-175B as an example:

mpirun -n 8 python benchmark.py \
    -m gpt_175b \
    --mode plugin \
    --batch_size "1;8;64" \
    --input_output_len "60,20;128,20"
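Both commands above encode their sweeps the same way: batch sizes are separated by semicolons, and each input/output length pair is written as `input,output` with pairs separated by semicolons. The sketch below assembles such a command from Python lists, and prepends `mpirun -n <N>` when more than one process is requested. `build_benchmark_command` is a hypothetical helper, not part of the repository, and it uses only the flags that appear in the examples above.

```python
import shlex


def build_benchmark_command(model, batch_sizes, io_lens, mode="plugin", world_size=1):
    """Return an argv list for benchmark.py (hypothetical helper, not in the repo)."""
    # "1;8;64": batch sizes joined with semicolons.
    batch_arg = ";".join(str(b) for b in batch_sizes)
    # "60,20;128,20": each pair is "input,output", pairs joined with semicolons.
    io_arg = ";".join(f"{inp},{out}" for inp, out in io_lens)
    cmd = [
        "python", "benchmark.py",
        "-m", model,
        "--mode", mode,
        "--batch_size", batch_arg,
        "--input_output_len", io_arg,
    ]
    if world_size > 1:
        # Multi-GPU runs are launched through MPI, one process per rank.
        cmd = ["mpirun", "-n", str(world_size)] + cmd
    return cmd


cmd = build_benchmark_command(
    "gpt_175b", [1, 8, 64], [(60, 20), (128, 20)], world_size=8
)
# shlex.join quotes the semicolon-separated values so a shell does not split them.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` avoids shell quoting altogether; the printed form is only for copying into a terminal.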