(perf-overview)=

> [!IMPORTANT]
> As of TensorRT-LLM v0.10, these performance benchmarks have changed methodology to utilize in-flight batching and
> no longer utilize static benchmarking. These numbers are initial measurements and are expected to improve in future
> releases.

# Overview

This document summarizes performance measurements of TensorRT-LLM on H100
(Hopper), L40S (Ada) and A100 (Ampere) GPUs for a few key models.

The data in the following tables is provided as a reference point to help users
validate observed performance. It should not be considered as the peak
performance that can be delivered by TensorRT-LLM.

## Throughput Measurements

The table below shows performance data for a scenario where a local inference client is fed requests at an infinite rate (no delay between messages),
and reflects the throughput of a client-server setup under maximum load.

The performance numbers below were collected using the steps described in this document.

**All data in the table below was generated using version 0.10.0 and presents token throughput in tokens/second.**

| | | | | | | | | |
| ------------ | ------------------------ | ------------- | --------------- | -------------- | -------------- | -------------- | ------- | ------- |
| | | **GPU** | H200 141GB HBM3 | H100 80GB HBM3 | H100 80GB HBM3 | A100-SXM4-80GB | L40S | L40S |
| | | **Precision** | FP8 | FP8 | FP16 | FP16 | FP8 | FP16 |
| **Model** | **Input/Output Lengths** | **TP** | | | | | | |
| GPTJ 6B | 128/128 | 1 | 40633.96 | 34955.29 | 11206.68 | 5966.69 | 6997.91 | 3448.53 |
| | 2048/128 | 1 | 2937.91 | 2800.37 | 1354.56 | 682.27 | 747.43 | 352.4 |
| | 128/2048 | 1 | 9039.72 | 54939.48 | 3896.8 | 2225.09 | 2041.52 | 896.04 |
| | 2048/2048 | 1 | 5437.97 | 3663.26 | 1498.04 | 882.61 | | |
| LLaMA v2 7B | 128/128 | 1 | 18229.3 | 16985.6 | 10725.31 | 5303.5 | 6121.1 | 3139.62 |
| | 2048/128 | 1 | 2496.92 | 2355.47 | 1235.4 | 585.6 | 642.24 | 311.82 |
| | 128/2048 | 1 | 7612.25 | 6679.36 | 3399.43 | 1903.4 | 1749.4 | |
| | 2048/2048 | 1 | 3259.74 | 2805.32 | 1335.51 | | | |
| LLaMA v3 8B | 128/128 | 1 | 16708.84 | 16708.53 | 12085.78 | 5853.96 | 8273.8 | 5207.01 |
| | 2048/128 | 1 | 2478.94 | 2427.09 | 1604.7 | 737.81 | 1021.64 | 622.15 |
| | 128/2048 | 1 | 8367.88 | 8013.55 | 6208.23 | 3385.71 | 4568.17 | 2134.72 |
| | 2048/2048 | 1 | 3674.33 | 3500.48 | 2776.31 | 1514.04 | 1546.84 | 899.2 |
| Mixtral 8x7B | 128/128 | 2 | 16959.49 | 16051.88 | 12376.52 | 5120.41 | | |
| | | 4 | | | | | | 5271.48 |
| | 2048/128 | 2 | 2423.99 | 2276.6 | 1717.37 | 636.5 | | |
| | | 4 | | | | | | 654.36 |
| | 128/2048 | 2 | 12944.52 | 11997.24 | 7864.88 | 3946.92 | | |
| | | 4 | | | | | | 4650.16 |
| | 2048/2048 | 2 | 6208.97 | 5498.33 | 3722.56 | 1834.36 | | |
| | | 4 | | | | | | 2262.57 |
| LLaMA v2 70B | 128/128 | 1 | 4055.97 | 2134.52 | | | | |
| | | 2 | 6299.21 | 6035.36 | | 963.14 | 980.31 | |
| | | 4 | 8758.45 | 8148.67 | 5454.76 | 2394.12 | 1450.61 | 838.03 |
| | | 8 | 10261.44 | 9385.26 | 7491.94 | 3683.42 | 1387.91 | 1204.32 |
| | 2048/128 | 1 | 493.87 | 222.16 | | | | |
| | | 2 | 784.47 | 757.55 | | 114.9 | 111.24 | |
| | | 4 | 1164.15 | 1083.25 | 695.33 | 292.77 | 171.68 | 102.49 |
| | | 8 | 1441.26 | 1346.9 | 1016.58 | 456.46 | 163.76 | 145.41 |
| | 128/2048 | 1 | 3199.9 | 635.32 | | | | |
| | | 2 | 6747 | 4710.45 | | | | |
| | | 4 | 10960.72 | 8485.56 | 3686.63 | 2047.67 | 1368.09 | |
| | | 8 | 17250.73 | 12333.24 | 7927.16 | 4166.36 | 1667.57 | 1186.38 |
| | 2048/2048 | 1 | 1734.58 | | | | | |
| | | 2 | 3455.34 | 2267.45 | | | | |
| | | 4 | 6141.39 | 4019.31 | 1814.78 | 1046 | | |
| | | 8 | 9271.77 | 7061.32 | 3658.42 | 2210.84 | 771.23 | 614.74 |
| LLaMA v3 70B | 128/128 | 1 | 3988.96 | | | | | |
| | | 2 | 6155.26 | 5835.57 | | | | |
| | | 4 | 8454.74 | 7945.64 | 5210.19 | 2405.44 | 1280.9 | |
| | | 8 | 9893.18 | 9308.51 | 7126.51 | 3621.25 | 1367.56 | 1164.88 |
| | 2048/128 | 1 | 491.79 | | | | | |
| | | 2 | 783.26 | 751.14 | | | | |
| | | 4 | 1154.66 | 1074.31 | 691.99 | 295.87 | 171.16 | |
| | | 8 | 1434.86 | 1337.36 | 1010.5 | 455.18 | 165.06 | 143.92 |
| | 128/2048 | 1 | 3015.16 | | | | | |
| | | 2 | 6758.32 | 4130.4 | | | | |
| | | 4 | 10532.1 | 7730.54 | 3246.34 | 1974.04 | 1232.53 | |
| | | 8 | 16467.79 | 11680.94 | 7205.34 | 4091.45 | 1514.93 | 1034.07 |
| | 2048/2048 | 1 | 1654.25 | | | | | |
| | | 2 | 3271.6 | 1976.76 | | | | |
| | | 4 | 6113.93 | 3685.74 | 1612.11 | 992.74 | | |
| | | 8 | 8986.3 | 6443.85 | 3523.17 | 2118.89 | 691.62 | |
| Falcon 180B | 128/128 | 4 | 3810.55 | 3698.71 | | | | |
| | | 8 | 5946.89 | 5608.59 | 3954.58 | 1754.14 | 1243.33 | |
| | 2048/128 | 4 | 525.6 | 510.85 | | | | |
| | | 8 | 848.4 | 813.95 | 535.41 | 221.39 | 145.35 | |
| | 128/2048 | 4 | 2883.67 | 2495.62 | | | | |
| | | 8 | 5388.34 | 4796.47 | 3051.89 | 1684.6 | 1359.42 | |
| | 2048/2048 | 4 | 1376.61 | 952.25 | | | | |
| | | 8 | 2495.66 | 2421.77 | 896.28 | | 609.65 | |

*TP stands for Tensor Parallelism*

## Reproducing Benchmarked Results

### Building the TensorRT-LLM Container

---

In order to benchmark TensorRT-LLM, you will need to follow the [Quick Start](../../README.md#quick-start)
build process to create a baseline container for building a wheel. Additionally, the development
container needs a copy of the source code to build the wheel and the benchmarking script. To create the
right build environment, use the following:

```shell
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs install
git lfs pull
make -C docker build
make -C docker run LOCAL_USER=1
```

> [!WARNING]
> If you have elevated privileges on your system, then skip the `make -C docker run LOCAL_USER=1`
> command above, as it may prevent you from accessing some required system libraries within the
> container because the build forces your UID and GID to match those that are set for your non-elevated
> user. There are cases where the container will be booted as root (i.e. on some SLURM systems with
> the pyxis plugin) which will cause libraries to be missing.

If you are benchmarking in a shared environment, you need to specify the GPU indices that you would
like the container to use, otherwise the Makefile defaults to loading the container with all GPUs on
the system. For example, to restrict the container to the first four GPUs on your system, you can
configure it using the following example:

```shell
NV_GPU=0,1,2,3
make -C docker run LOCAL_USER=1 GPU_OPTS='--gpus \"device=${NV_GPU}\"'
```

Additionally, if you'd like to mount external storage to access persistent data or previously
built engines, you can mount directories as follows (simply replace `source` and `destination` with
the appropriate paths):

```shell
make -C docker run LOCAL_USER=1 DOCKER_RUN_ARGS="-v /source:/destination"
```

Once the container starts, you'll need to build the wheel and the benchmarking scripts. From the
code root (the default directory when the container is loaded), the following commands will build
the TensorRT-LLM wheel, install dependencies, and build the benchmark scripts:

```shell
python3 ./scripts/build_wheel.py --benchmarks --trt_root /usr/local/tensorrt
pip install ./build/tensorrt_llm*.whl
```

## Methodology

The following tables are references for commands that are used as part of the benchmarking process.

### Commands

| Stage | Description | Command |
| :- | - | - |
| [Build](#engine-building) | Build a TensorRT-LLM engine | `trtllm-build --model_config $model_cfg --strongly_typed --output_dir $engine_dir --max_batch_size 2048 --max_input_len 2048 --max_output_len 4096 --workers $tp_size --max_num_tokens 2048 --use_paged_context_fmha enable --multiple_profiles enable` |
| [Dataset](#preparing-a-dataset) | Create a synthetic dataset | `benchmarks/cpp/prepare_dataset.py --output=$dataset_file --tokenizer=$model_name token-norm-dist --num-requests=2000 --input-mean=$isl --output-mean=$osl --input-stdev=0 --output-stdev=0` |
| [Run](#running-the-benchmark) | Run a benchmark with a dataset | `mpirun -n $tp_size --allow-run-as-root --oversubscribe cpp/build/benchmarks/gptManagerBenchmark --engine_dir $engine_dir --type IFB --dataset $dataset_file --scheduler_policy max_utilization --kv_cache_free_gpu_mem_fraction 0.9 --output_csv $results_csv --request_rate -1.0 --enable_chunked_context --streaming --warm_up 0` |

### Variables

| Name | Description |
| :- | - |
| `$isl` | Benchmark input sequence length. |
| `$osl` | Benchmark output sequence length. |
| `$tp_size` | Number of GPUs to run the benchmark with. |
| `$engine_dir` | Location to store the built engine file (can be deleted after running benchmarks). |
| `$model_cfg` | Name of the model configuration JSON file to use for building. |
| `$model_name` | Hugging Face model name (e.g. meta-llama/Llama-2-7b-hf) or the path to a local weights directory. |
| `$dataset_file` | Location of the dataset file generated by `prepare_dataset.py`. |
| `$results_csv` | Path to store the end results to. |
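
For reference, the following is a minimal sketch of how these variables might be set in a shell session before running the commands above. The model, file names, paths, and sequence lengths are illustrative assumptions, not required values:

```shell
# Illustrative values only; adjust for your model, GPUs, and storage layout.
export isl=128                                  # input sequence length
export osl=2048                                 # output sequence length
export tp_size=1                                # number of GPUs (tensor parallelism)
export model_name="meta-llama/Llama-2-7b-hf"    # Hugging Face name or local weights path
export model_cfg="llama_2_7b_fp8.json"          # hypothetical configuration file name
export engine_dir="/tmp/engines/llama-2-7b"     # where trtllm-build writes the engine
export dataset_file="/tmp/datasets/llama-2-7b_${isl}_${osl}.json"
export results_csv="/tmp/results/llama-2-7b_${isl}_${osl}.csv"
```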

### Engine Building

All benchmarks were run using a single engine with a configuration that is capable of handling the
maximum sequence lengths encountered during benchmarking. Regardless of the input/output
sequence lengths of a given benchmark, the same engine can be reused to run all tests. Each engine is built with a paged
KV cache and in-flight batching enabled. For more information see the
[documentation about in-flight batching](../overview.md#in-flight-batching-and-paged-attention).

In order to build an engine, run the following command, specifying a configuration file
for the model that you would like to build (see [below](#network-configuration-files)). The general build
command is as follows:

```shell
trtllm-build --model_config $model_cfg --strongly_typed --output_dir $engine_dir --max_batch_size 2048 --max_input_len 2048 --max_output_len 4096 --workers $tp_size --max_num_tokens 2048 --use_paged_context_fmha enable --multiple_profiles enable
```

Some notes about the command:
- `--workers` affects the number of threads that build the engine file and does not necessarily need to match
the TP size. Make sure to set the tensor parallelism in the `$model_cfg` JSON file. See [below](#network-configuration-files).
- You can run benchmarks for datasets that fit within the bounds of the `max_input_len` and `max_output_len` parameters.
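
Putting the pieces together, here is a minimal sketch of a single build invocation, assuming a TP=1 LLaMA v2 7B configuration has been saved as `llama_2_7b_fp8.json` (the file and directory names are hypothetical; the flags mirror the general command above):

```shell
# Hypothetical file and directory names; flags mirror the general command above.
trtllm-build --model_config llama_2_7b_fp8.json \
  --strongly_typed \
  --output_dir /tmp/engines/llama-2-7b \
  --max_batch_size 2048 --max_input_len 2048 --max_output_len 4096 \
  --workers 1 --max_num_tokens 2048 \
  --use_paged_context_fmha enable --multiple_profiles enable
```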

### Engine Configuration Files

In order to configure the TensorRT-LLM build process for benchmarking, you need to provide
`trtllm-build` a configuration file that specifies the network configuration, parallelism
mapping, and quantization options.

Below we document how to benchmark each model on an H100-HBM3-80GB system and reproduce the throughput
numbers we document in the [Throughput Measurements](#throughput-measurements) section above.

> [!IMPORTANT]
> In order to change the parallelism for a build, you need to modify the `mapping` dictionary in your configuration file. The settings
> must conform to the following condition: `world_size == tp_size * pp_size`.
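
For example, a minimal sketch of switching an existing configuration file to two-way tensor parallelism using `jq` (assuming `jq` is available in the container; editing the JSON by hand works just as well, and the file names are hypothetical):

```shell
# Overwrite the mapping so that world_size == tp_size * pp_size still holds (2 == 2 * 1).
jq '.mapping = {"world_size": 2, "tp_size": 2, "pp_size": 1}' \
  llama_2_7b_fp8.json > llama_2_7b_fp8_tp2.json
```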

> [!NOTE]
> All configurations below are set to run utilizing FP8 by default. If you would like to run on an A100 system, see our notes about [disabling FP8 quantization](#running-on-a100).

### Network Configuration Files

Each network has its own configuration file. All networks are configured to run using FP8 quantization
by default.

<table>
<tr>
<td> Model </td> <td> Configuration File (FP8) </td>
</tr>
<tr>
<td> EleutherAI/gpt-j-6b </td>
<td>

```json
{
    "architecture": "GPTJForCausalLM",
    "dtype": "float16",
    "num_hidden_layers": 28,
    "num_attention_heads": 16,
    "hidden_size": 4096,
    "norm_epsilon": 1e-05,
    "vocab_size": 50400,
    "position_embedding_type": "rope_gptj",
    "max_position_embeddings": 2048,
    "hidden_act": "gelu_new",
    "quantization": {
        "quant_algo": "FP8",
        "kv_cache_quant_algo": "FP8"
    },
    "rotary_dim": 64,
    "kv_dtype": "float16"
}
```

</td>
</tr>
<tr>
<td> tiiuae/falcon-180B </td>
<td>

```json
{
    "architecture": "FalconForCausalLM",
    "dtype": "bfloat16",
    "num_hidden_layers": 80,
    "num_attention_heads": 232,
    "num_key_value_heads": 8,
    "hidden_size": 14848,
    "norm_epsilon": 1e-05,
    "vocab_size": 65024,
    "position_embedding_type": "rope_gpt_neox",
    "max_position_embeddings": 2048,
    "hidden_act": "gelu",
    "use_parallel_embedding": false,
    "embedding_sharding_dim": 0,
    "share_embedding_table": false,
    "quantization": {
        "quant_algo": "FP8",
        "kv_cache_quant_algo": "FP8"
    },
    "mapping": {
        "world_size": 8,
        "tp_size": 8,
        "pp_size": 1
    },
    "bias": false,
    "parallel_attention": true,
    "new_decoder_architecture": true,
    "kv_dtype": "float16"
}
```

</td>
</tr>
<tr>
<td> meta-llama/Llama-2-7b-hf </td>
<td>

```json
{
    "architecture": "LlamaForCausalLM",
    "dtype": "float16",
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "hidden_size": 4096,
    "intermediate_size": 11008,
    "num_key_value_heads": 32,
    "vocab_size": 32000,
    "position_embedding_type": "rope_gpt_neox",
    "max_position_embeddings": 4096,
    "hidden_act": "silu",
    "rotary_base": 10000.0,
    "rotary_scaling": null,
    "norm_epsilon": 1e-05,
    "quantization": {
        "quant_algo": "FP8",
        "kv_cache_quant_algo": "FP8"
    },
    "kv_dtype": "float16"
}
```

</td>
</tr>
<tr>
<td> meta-llama/Llama-2-70b-hf </td>
<td>

```json
{
    "architecture": "LlamaForCausalLM",
    "dtype": "float16",
    "num_hidden_layers": 80,
    "num_attention_heads": 64,
    "hidden_size": 8192,
    "intermediate_size": 28672,
    "num_key_value_heads": 8,
    "vocab_size": 32000,
    "position_embedding_type": "rope_gpt_neox",
    "max_position_embeddings": 4096,
    "hidden_act": "silu",
    "rotary_base": 10000.0,
    "rotary_scaling": null,
    "norm_epsilon": 1e-05,
    "quantization": {
        "quant_algo": "FP8",
        "kv_cache_quant_algo": "FP8"
    },
    "mapping": {
        "world_size": 4,
        "tp_size": 4,
        "pp_size": 1
    },
    "kv_dtype": "float16"
}
```

</td>
</tr>
<tr>
<td> meta-llama/Meta-Llama-3-8B </td>
<td>

```json
{
    "architecture": "LlamaForCausalLM",
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "hidden_size": 4096,
    "vocab_size": 128256,
    "max_position_embeddings": 8192,
    "hidden_act": "silu",
    "norm_epsilon": 1e-05,
    "dtype": "float16",
    "position_embedding_type": "rope_gpt_neox",
    "intermediate_size": 14336,
    "rotary_base": 500000.0,
    "rope_theta": 500000.0,
    "rotary_scaling": null,
    "mapping": {
        "world_size": 1,
        "tp_size": 1,
        "pp_size": 1
    },
    "quantization": {
        "quant_algo": "FP8",
        "kv_cache_quant_algo": "FP8"
    },
    "kv_dtype": "float16"
}
```

</td>
</tr>
<tr>
<td> meta-llama/Meta-Llama-3-70B </td>
<td>

```json
{
    "architecture": "LlamaForCausalLM",
    "num_hidden_layers": 80,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "hidden_size": 8192,
    "vocab_size": 128256,
    "max_position_embeddings": 8192,
    "hidden_act": "silu",
    "dtype": "float16",
    "norm_epsilon": 1e-05,
    "position_embedding_type": "rope_gpt_neox",
    "intermediate_size": 28672,
    "rotary_base": 500000.0,
    "rope_theta": 500000.0,
    "rotary_scaling": null,
    "mapping": {
        "world_size": 4,
        "tp_size": 4,
        "pp_size": 1
    },
    "quantization": {
        "quant_algo": "FP8",
        "kv_cache_quant_algo": "FP8"
    },
    "kv_dtype": "float16"
}
```

</td>
</tr>
<tr>
<td> mistralai/Mixtral-8x7B-v0.1 </td>
<td>

```json
{
    "architecture": "MixtralForCausalLM",
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "hidden_size": 4096,
    "norm_epsilon": 1e-05,
    "vocab_size": 32000,
    "max_position_embeddings": 32768,
    "head_size": 128,
    "hidden_act": "swiglu",
    "dtype": "float16",
    "position_embedding_type": "rope_gpt_neox",
    "intermediate_size": 14336,
    "moe_num_experts": 8,
    "moe_top_k": 2,
    "rotary_base": 1000000.0,
    "rope_theta": 1000000.0,
    "mapping": {
        "world_size": 1,
        "tp_size": 1,
        "pp_size": 1
    },
    "quantization": {
        "quant_algo": "FP8",
        "kv_cache_quant_algo": "FP8"
    },
    "kv_dtype": "float16"
}
```

</td>
</tr>
</table>

### Running on A100

To run the benchmarks on A100, you will need to undefine or remove the following
quantization fields from each config JSON file, because FP8 computation is a feature of H100 and newer GPUs.

```json
"quantization": {
    "quant_algo": null,
    "kv_cache_quant_algo": null
}
```
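
One possible way to apply this change is with `jq` (assuming `jq` is installed; editing the JSON by hand works just as well, and the file names below are illustrative):

```shell
# Null out both quantization algorithms so the engine is built without FP8.
jq '.quantization.quant_algo = null | .quantization.kv_cache_quant_algo = null' \
  llama_2_7b_fp8.json > llama_2_7b_a100.json
```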

## Preparing a Dataset

In order to prepare a dataset, you can use the provided [script](../../../benchmarks/cpp/prepare_dataset.py).
To generate a synthetic dataset, run the following command:

```shell
benchmarks/cpp/prepare_dataset.py --output=$dataset_file --tokenizer=$model_name token-norm-dist --num-requests=2000 --input-mean=$isl --output-mean=$osl --input-stdev=0 --output-stdev=0
```

The command will generate a JSON file located at the path specified by `$dataset_file` in which all requests have the same
input/output sequence length combination. The script works by using the tokenizer to retrieve the vocabulary size and
randomly sampling token IDs from it to create entirely random sequences. In the command above, all requests will be uniform
because the standard deviations for both input and output sequences are set to 0.
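
For example, a concrete invocation that generates 2,000 uniform 128-input/2048-output requests for LLaMA v2 7B might look like the following (the tokenizer name and output path are illustrative):

```shell
# Generate a synthetic dataset of identical-length requests (paths/model are examples only).
python3 benchmarks/cpp/prepare_dataset.py \
  --output=/tmp/datasets/llama-2-7b_128_2048.json \
  --tokenizer=meta-llama/Llama-2-7b-hf \
  token-norm-dist --num-requests=2000 \
  --input-mean=128 --output-mean=2048 --input-stdev=0 --output-stdev=0
```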

## Running the Benchmark

To run the benchmark with the generated dataset, simply run the following command from the root of the
TensorRT-LLM repository. See the [variables](#variables) section for reference on variable values.

```shell
mpirun -n $tp_size --allow-run-as-root --oversubscribe cpp/build/benchmarks/gptManagerBenchmark --engine_dir $engine_dir --type IFB --dataset $dataset_file --scheduler_policy max_utilization --kv_cache_free_gpu_mem_fraction 0.9 --output_csv $results_csv --request_rate -1.0 --enable_chunked_context --streaming --warm_up 0
```

The command will run the `gptManagerBenchmark` binary, which reports throughput and other metrics as part of its output
that can be compared with the table in the [Throughput Measurements](#throughput-measurements) section of this document.
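
For completeness, a hedged end-to-end example using the illustrative engine, dataset, and result paths from the sketches above (adjust `-n` to match your `$tp_size`):

```shell
# Single-GPU run; all paths are the hypothetical examples used earlier in this document.
mkdir -p /tmp/results
mpirun -n 1 --allow-run-as-root --oversubscribe \
  cpp/build/benchmarks/gptManagerBenchmark \
  --engine_dir /tmp/engines/llama-2-7b --type IFB \
  --dataset /tmp/datasets/llama-2-7b_128_2048.json \
  --scheduler_policy max_utilization --kv_cache_free_gpu_mem_fraction 0.9 \
  --output_csv /tmp/results/llama-2-7b_128_2048.csv \
  --request_rate -1.0 --enable_chunked_context --streaming --warm_up 0
```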