mirror of
https://github.com/NVIDIA/TensorRT-LLM.git
synced 2026-01-14 06:27:45 +08:00
TensorRT-LLM Benchmark Test System
Benchmarking scripts for TensorRT-LLM serving performance tests with configuration-driven test cases and CSV report generation.
Overview
- Run performance benchmarks across multiple model configurations
- Manage test cases through YAML configuration files
- Support selective execution of specific test cases
Scripts Overview
1. benchmark_config.yaml - Test Case Configuration
Purpose: Defines all benchmark test cases in a structured YAML format.
Structure:
server_configs:
  - name: "r1_fp4_dep4"
    model_name: "deepseek_r1_0528_fp4"
    tp: 4
    ep: 4
    pp: 1
    attention_backend: "TRTLLM"
    moe_backend: "CUTLASS"
    moe_max_num_tokens: ""
    enable_attention_dp: true
    enable_chunked_prefill: false
    max_num_tokens: 2176
    disable_overlap_scheduler: false
    kv_cache_dtype: "fp8"
    enable_block_reuse: false
    free_gpu_memory_fraction: 0.8
    max_batch_size: 256
    enable_padding: true
    client_configs:
      - name: "con1_iter1_1024_1024"
        concurrency: 1
        iterations: 1
        isl: 1024
        osl: 1024
        random_range_ratio: 0.0
      - name: "con8_iter1_1024_1024"
        concurrency: 8
        iterations: 1
        isl: 1024
        osl: 1024
        random_range_ratio: 0.0
  - name: "r1_fp4_tep4"
    model_name: "deepseek_r1_0528_fp4"
    tp: 4
    ep: 4
    pp: 1
    attention_backend: "TRTLLM"
    moe_backend: "CUTLASS"
    moe_max_num_tokens: ""
    enable_attention_dp: false
    enable_chunked_prefill: false
    max_num_tokens: 2176
    disable_overlap_scheduler: false
    kv_cache_dtype: "fp8"
    enable_block_reuse: false
    free_gpu_memory_fraction: 0.8
    max_batch_size: 256
    enable_padding: true
    client_configs:
      - name: "con1_iter1_1024_1024"
        concurrency: 1
        iterations: 1
        isl: 1024
        osl: 1024
        random_range_ratio: 0.0
      - name: "con8_iter1_1024_1024"
        concurrency: 8
        iterations: 1
        isl: 1024
        osl: 1024
        random_range_ratio: 0.0
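Each server config carries its own list of client configs, so one test case corresponds to one (server, client) pair. The following is a minimal sketch of how such a configuration could be expanded into test case names. It assumes the YAML has already been parsed into a Python dictionary (trimmed here to a few fields), and `list_test_cases` is a hypothetical helper, not a function from the repository scripts.

```python
# The configuration as it might look after parsing the YAML file.
# Only a few fields are kept; the real file has more keys per entry.
config = {
    "server_configs": [
        {
            "name": "r1_fp4_dep4",
            "enable_attention_dp": True,
            "client_configs": [
                {"name": "con1_iter1_1024_1024", "concurrency": 1},
                {"name": "con8_iter1_1024_1024", "concurrency": 8},
            ],
        },
        {
            "name": "r1_fp4_tep4",
            "enable_attention_dp": False,
            "client_configs": [
                {"name": "con1_iter1_1024_1024", "concurrency": 1},
                {"name": "con8_iter1_1024_1024", "concurrency": 8},
            ],
        },
    ]
}


def list_test_cases(config):
    """Return one "server:client" identifier per (server, client) pair."""
    cases = []
    for server in config.get("server_configs", []):
        for client in server.get("client_configs", []):
            cases.append(f"{server['name']}:{client['name']}")
    return cases


test_cases = list_test_cases(config)
for case in test_cases:
    print(case)
```

With the two server configs shown above, this yields four test cases, two per server.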
2. run_benchmark_serve.py - Main Benchmark Runner
Purpose: Executes performance benchmarks based on YAML configuration files.
Usage:
python run_benchmark_serve.py --log_folder <log_folder> --config_file <config_file> [--select <select_pattern>] [--timeout <seconds>]
Arguments:
- --log_folder: Directory to store benchmark logs (required)
- --config_file: Path to YAML configuration file (required)
- --select: Comma-separated list of server_name:client_name pairs to run (optional, default: all test cases)
- --timeout: Timeout for server setup, in seconds (optional, default: 3600)
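For illustration, the documented command-line interface could be declared with `argparse` roughly as follows. This is a sketch built only from the argument descriptions above; `build_parser` is a hypothetical name, and the actual parser in run_benchmark_serve.py may differ (for example in help texts or in how an absent `--select` is represented).

```python
import argparse


def build_parser():
    """Declare the four documented arguments (a sketch, not the real script)."""
    parser = argparse.ArgumentParser(
        description="Run serving benchmarks from a YAML configuration file.")
    parser.add_argument("--log_folder", required=True,
                        help="Directory to store benchmark logs")
    parser.add_argument("--config_file", required=True,
                        help="Path to YAML configuration file")
    # Assumption: None stands for "run all test cases".
    parser.add_argument("--select", default=None,
                        help="Comma-separated server_name:client_name pairs")
    parser.add_argument("--timeout", type=int, default=3600,
                        help="Timeout for server setup, in seconds")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["--log_folder", "./results", "--config_file", "benchmark_config.yaml"])
print(args.select, args.timeout)  # None 3600
```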
Examples:
# Run only the selected server:client combinations
python run_benchmark_serve.py --log_folder ./results --config_file benchmark_config.yaml --select "r1_fp4_dep4:con8_iter1_1024_1024,r1_fp4_tep4:con1_iter1_1024_1024"
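The select pattern in this example is a comma-separated list of `server_name:client_name` pairs. A minimal sketch of how such a pattern could be parsed and applied is shown below. `parse_select_pattern` and `is_selected` are hypothetical helpers; whether the real script accepts other forms (such as a server name without a client name) is not covered here.

```python
def parse_select_pattern(pattern):
    """Split "server:client,server:client" into a set of (server, client) tuples."""
    selected = set()
    for item in pattern.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas
        server, sep, client = item.partition(":")
        if not sep or not server or not client:
            raise ValueError(f"expected 'server:client', got {item!r}")
        selected.add((server, client))
    return selected


def is_selected(server, client, selected):
    """Assumption: an empty selection means every test case runs."""
    return not selected or (server, client) in selected


selected = parse_select_pattern(
    "r1_fp4_dep4:con8_iter1_1024_1024,r1_fp4_tep4:con1_iter1_1024_1024")
print(is_selected("r1_fp4_dep4", "con8_iter1_1024_1024", selected))  # True
print(is_selected("r1_fp4_dep4", "con1_iter1_1024_1024", selected))  # False
```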
3. parse_benchmark_results.py - Results Parser
Purpose: Parses the benchmark logs and prints the performance results.
Arguments:
--log_folder: Directory containing the benchmark logs (required)
Usage:
python parse_benchmark_results.py --log_folder <log_folder>
4. benchmark-serve.sh - SLURM Job Script
Usage:
sbatch benchmark-serve.sh [IMAGE] [bench_dir] [log_folder] [select_pattern]
Parameters:
- IMAGE: Docker image (default: tensorrt-llm-staging/release:main-x86_64)
- bench_dir: Directory containing config file and benchmark scripts (default: current directory)
- log_folder: Directory for output logs and CSV files (default: current directory)
- select_pattern: Select pattern (default: default, which runs all test cases)
Examples:
bench_dir="/path/to/benchmark/scripts"
log_folder="/path/to/store/output/files"
sbatch --reservation=RES--COM-3970 --qos=reservation -D ${log_folder} ${bench_dir}/benchmark-serve.sh urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm-staging/release:main-x86_64 ${bench_dir} ${log_folder} "r1_fp4_dep4:con8_iter1_1024_1024,r1_fp4_tep4:con1_iter1_1024_1024"
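When jobs are submitted from a Python driver rather than by hand, the same command line can be assembled as an argument list. The sketch below only builds and prints the command; it does not submit anything. `build_sbatch_command` is a hypothetical helper, and the reservation and QOS options are simply copied from the example above.

```python
import shlex


def build_sbatch_command(image, bench_dir, log_folder,
                         select_pattern="default", sbatch_options=()):
    """Assemble the sbatch argv in the positional order the script expects."""
    script = f"{bench_dir}/benchmark-serve.sh"
    return ["sbatch", *sbatch_options, "-D", log_folder,
            script, image, bench_dir, log_folder, select_pattern]


command = build_sbatch_command(
    image="urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm-staging/release:main-x86_64",
    bench_dir="/path/to/benchmark/scripts",
    log_folder="/path/to/store/output/files",
    select_pattern="r1_fp4_dep4:con8_iter1_1024_1024,r1_fp4_tep4:con1_iter1_1024_1024",
    sbatch_options=["--reservation=RES--COM-3970", "--qos=reservation"],
)
# Print a shell-quoted version for inspection before running it.
print(shlex.join(command))
```

Passing the list to `subprocess.run` avoids shell quoting issues with the select pattern.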