# TensorRT-LLM Benchmarking

> [!WARNING]
> WORK IN PROGRESS

This package is the official benchmarking suite for TensorRT-LLM. It will be updated as development of TensorRT-LLM continues.

## Installation

From this folder, run `pip install -r requirements.txt` to install the extra dependencies required for this tool.
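A minimal sketch of that step, assuming this folder sits at `benchmarks/suite` within the TensorRT-LLM repository:

```shell
# From the repository root; benchmarks/suite is this folder's location.
cd benchmarks/suite
pip install -r requirements.txt
```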

## Available Model Options

The following options are available for configuring the model to benchmark.

| Option | Required | Default | Description |
| :- | :-: | :-: | :- |
| `--model` | Y | - | The name of the model to benchmark. |
| `--dtype` | N | `float16` | The datatype of the weights. |
| `--kv-dtype` | N | `float16` | The datatype to store the KV Cache in. |
| `--quantization` | N | `None` | The quantization algorithm to be used when benchmarking. See the documentation for more information. |
| `--workspace` | N | `/tmp` | The directory to store benchmarking intermediate files. |
| `--tensor-parallel-size` | N | `1` | Number of tensor parallel shards to run the benchmark with. |
| `--pipeline-parallel-size` | N | `1` | Number of pipeline parallel shards to run the benchmark with. |
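As an illustration, these options combine with the `static` subcommand documented later in this README. The sketch below is not a recommended configuration; the parallelism, workspace path, and sequence lengths are arbitrary placeholders:

```shell
# Illustrative sketch: benchmark falcon-7b sharded across two tensor-parallel
# ranks, keeping intermediate engine files in a custom workspace.
# The static subcommand and its flags are documented below.
python benchmark.py \
    --model tiiuae/falcon-7b \
    --dtype float16 \
    --tensor-parallel-size 2 \
    --workspace /tmp/trtllm-bench \
    static --isl 128 --osl 128 --batch 1
```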

## Supported Networks for Benchmarking

## Supported Quantization Modes

TensorRT-LLM supports a number of quantization modes. For more information about quantization, see the documentation. A short example of selecting one of these modes follows the list below.

- `None` (no quantization applied)
- `W8A16`
- `W4A16`
- `W4A16_AWQ`
- `W4A8_AWQ`
- `W4A16_GPTQ`
- `FP8`
- `INT8`
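A mode from this list is selected with the `--quantization` option described above. A hedged sketch, reusing the `benchmark.py` invocation from the next section with an arbitrarily chosen mode:

```shell
# Sketch: run the same static benchmark with W4A16_AWQ quantization.
# Model and sequence-length values are placeholders for illustration.
python benchmark.py \
    --model tiiuae/falcon-7b \
    --quantization W4A16_AWQ \
    static --isl 128 --osl 128 --batch 1
```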

> [!NOTE]
> Please see the supported quantization methods for each network here.

## Static Benchmarking a Network

To benchmark a static batch for a network, run a command like the following:

```shell
cd tensorrt_llm_bench/
python benchmark.py --model tiiuae/falcon-7b static --isl 128 --osl 128 --batch 1
```

This command line builds a unique engine for the given configuration and runs the benchmark using the `gptSessionBenchmark` binary. That binary is compiled only when the TensorRT-LLM wheel is built with the `--benchmarks` flag:

```shell
python3 ./scripts/build_wheel.py --benchmarks <other options>
```

The complete list of arguments is given below:

| Option | Required | Default | Description |
| :- | :-: | :-: | :- |
| `--batch` | Y | - | The batch size to benchmark. |
| `--isl` | Y | - | The input sequence length to pass in during benchmark. |
| `--osl` | Y | - | The output sequence length to generate in the benchmark. |
| `--gpt-session-path` | N | `../../cpp/build/benchmarks/gptSessionBenchmark` | The path to the built gptSessionBenchmark binary. |
| `--max-tokens-in-kv-cache` | N | `None` | The maximum number of tokens to store in the KV Cache during benchmarking. |
| `--kv-cache-mem-percent` | N | `0.9` | The percentage of free memory that the KV cache is allowed to occupy. |
| `--warm-up-runs` | N | `2` | The number of warm-up runs to perform before collecting benchmark results. |
| `--num-runs` | N | `10` | The number of runs to generate benchmarking results from. |
| `--duration` | N | `60` | The minimum iteration time, in seconds, to measure. |
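Putting these together, a fuller static run might look like the following sketch. Every flag value below is an arbitrary illustration rather than a recommended setting:

```shell
# Sketch: static benchmark with explicit warm-up/measurement settings and a
# tighter KV cache memory budget. All values are illustrative placeholders.
python benchmark.py --model tiiuae/falcon-7b \
    static \
    --isl 128 --osl 128 --batch 8 \
    --warm-up-runs 2 \
    --num-runs 10 \
    --kv-cache-mem-percent 0.8
```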

> [!WARNING]
> `gptSession` will be deprecated in the 1.0 release of TensorRT-LLM. This command line will change to match, and benchmarks will be updated accordingly.