TensorRT-LLM Benchmark Test System

Benchmarking scripts for TensorRT-LLM serving performance tests with configuration-driven test cases and CSV report generation.

Overview

  • Run performance benchmarks across multiple model configurations
  • Manage test cases through YAML configuration files
  • Generate comprehensive CSV reports with complete test case coverage
  • Support selective execution of specific test cases

Scripts Overview

1. benchmark_config.yaml - Test Case Configuration

Purpose: Defines all benchmark test cases in a structured YAML format.

Structure:

test_cases:
  - id: 1
    model: "70B-FP8"
    gpus: 1
    tp: 1
    ep: 1
    attn_backend: "TRTLLM"
    moe_backend: ""
    enable_attention_dp: false
    free_gpu_mem_fraction: 0.9
    max_batch_size: 512
    isl: 1024
    osl: 1024
    max_num_tokens: 16384
    moe_max_num_tokens: ""
    concurrency_iterations:
      - [1, 10]
      - [8, 10]
      - [64, 5]
      - [512, 2]

Configuration Fields:

  • id: Unique identifier for the test case
  • model: Model name (e.g., "70B-FP8", "Scout-FP4")
  • gpus: Number of GPUs to use
  • tp: Tensor parallelism size
  • ep: Expert parallelism size
  • attn_backend: Attention backend ("TRTLLM", "FLASHINFER")
  • moe_backend: MoE backend ("DEEPGEMM", "TRTLLM", "CUTLASS", "")
  • enable_attention_dp: Enable attention data parallelism
  • free_gpu_mem_fraction: Fraction of free GPU memory the server may use
  • max_batch_size: Maximum batch size
  • isl: Input sequence length
  • osl: Output sequence length
  • max_num_tokens: Maximum number of tokens
  • moe_max_num_tokens: Maximum number of tokens for MoE
  • concurrency_iterations: List of [concurrency, iterations] pairs; each pair benchmarks the given concurrency for that many iterations (see the loading sketch below)
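
For orientation, here is a minimal sketch of how these test cases might be loaded with PyYAML. The field names follow the structure above; the load_test_cases helper is illustrative and not part of the shipped scripts.

import yaml

def load_test_cases(config_path):
    """Read a benchmark config file and return its list of test cases."""
    with open(config_path) as f:
        return yaml.safe_load(f)["test_cases"]

for case in load_test_cases("benchmark_config.yaml"):
    for concurrency, iterations in case["concurrency_iterations"]:
        print(f"case {case['id']} ({case['model']}): "
              f"concurrency={concurrency}, iterations={iterations}")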

2. run_benchmark_serve.py - Main Benchmark Runner

Purpose: Executes performance benchmarks based on YAML configuration files.

Usage:

python run_benchmark_serve.py --output_folder <output_folder> --config_file <config_file> [--skip <skip_pattern>] [--select <select_pattern>]

Arguments:

  • --output_folder: Directory to store benchmark results (required)
  • --config_file: Path to YAML configuration file (required)
  • --skip: Skip pattern for specific test cases/concurrencies (optional, default: no skipping)
  • --select: Select pattern for specific test cases/concurrencies (optional, default: all test cases)

Examples:

# Run all test cases
python run_benchmark_serve.py --output_folder results --config_file benchmark_config.yaml --skip default --select default

# Skip specific test cases
python run_benchmark_serve.py --output_folder results --config_file benchmark_config.yaml --skip "2-1,4"

# Run specific concurrencies from specific test cases
python run_benchmark_serve.py --output_folder results --config_file benchmark_config.yaml --select "1,2-3"

Skip Pattern format: "test_case1,test_case2,test_case3" or "test_case1-concurrency1,test_case2-concurrency3"

  • "2,4": Skip test cases 2 and 4 entirely
  • "2-1,4-2": Skip test case 2's 1st concurrency and test case 4's 2nd concurrency
  • "default" or empty: No skipping (default)

Select Pattern format: "test_case1,test_case2,test_case3" or "test_case1-concurrency1,test_case2-concurrency3" (a parsing sketch follows the examples below)

  • "1,3,5": Run only test cases 1, 3, and 5 (all concurrencies)
  • "1-1,2-3": Run test case 1's 1st concurrency and test case 2's 3rd concurrency
  • "default" or empty: Run all test cases (default)

3. parse_benchmark_results.py - Results Parser

Purpose: Parses benchmark log files and generates comprehensive CSV reports with all test cases from the configuration file.

Usage:

python parse_benchmark_results.py --input_folder <input_folder> --output_csv <output_csv> --config_file <config_file>

Arguments:

  • --input_folder: Folder containing benchmark log files (serve.*.log) (required)
  • --output_csv: Output CSV filename for the results table (required)
  • --config_file: Path to the benchmark_config.yaml file (required)

Example:

python parse_benchmark_results.py --input_folder ./benchmark_logs --output_csv results.csv --config_file ./benchmark_config.yaml
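
Because the report includes every test case from the configuration file, cases that never produced a log still appear as rows with empty metrics. A rough sketch of that join, assuming a hypothetical convention where the test-case ID appears in the log filename (the real matching and metric extraction live in parse_benchmark_results.py):

import csv
import glob
import os
import yaml

def write_report(input_folder, output_csv, config_file):
    """Emit one CSV row per configured test case, noting which have logs."""
    with open(config_file) as f:
        test_cases = yaml.safe_load(f)["test_cases"]
    log_names = [os.path.basename(p)
                 for p in glob.glob(os.path.join(input_folder, "serve.*.log"))]
    with open(output_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "model", "isl", "osl", "log_found"])
        for case in test_cases:
            # Hypothetical: match logs to cases by ID embedded in the name.
            found = any(f".{case['id']}." in name for name in log_names)
            writer.writerow([case["id"], case["model"], case["isl"],
                             case["osl"], found])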

4. benchmark-serve.sh - SLURM Job Script

Purpose: Submits a SLURM job that runs the benchmark suite inside the given Docker image, forwarding the select and skip patterns to run_benchmark_serve.py.

Usage:

sbatch benchmark-serve.sh [IMAGE] [bench_dir] [output_dir] [select_pattern] [skip_pattern]

Parameters:

  • IMAGE: Docker image (default: tensorrt-llm-staging/release:main-x86_64)
  • bench_dir: Directory containing config file and benchmark scripts (default: current directory)
  • output_dir: Directory where output logs and the CSV report are written (default: current directory)
  • select_pattern: Select pattern passed to the benchmark runner (default: "default", i.e., run all test cases)
  • skip_pattern: Skip pattern passed to the benchmark runner (default: "default", i.e., no skipping)

Example:

bench_dir="/path/to/benchmark/scripts"
output_dir="/path/to/store/output/files"
sbatch --reservation=RES--COM-3970 --qos=reservation -D ${output_dir} ${bench_dir}/benchmark-serve.sh urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm-staging/release:main-x86_64 ${bench_dir} ${output_dir} "1-1" ""