# TensorRT-LLM Wide-EP Benchmark Scripts

This directory contains scripts for benchmarking TensorRT-LLM wide expert parallelism (Wide-EP) performance using the SLURM job scheduler.
## ⚠️ DISCLAIMER

These scripts are currently not QA'ed and are provided for demonstration purposes only.
Please note that:
- These scripts have not undergone formal quality assurance testing
- They are intended for demonstration and educational purposes
- Use at your own risk in production environments
- Always review and test scripts thoroughly before running in your specific environment
## Scripts Overview

### Core Scripts

- `submit.sh` - Main entry point for submitting benchmark jobs
- `disaggr_torch.slurm` - SLURM job script orchestrating the entire benchmark
- `gen_yaml.py` - Generates configuration files for the serving setup
- `start_server.sh` - Starts the inference server
- `start_worker.sh` - Starts the worker processes
- `run_benchmark.sh` - Executes the benchmark workload
- `process_gen_iterlog.py` - Processes benchmark results and generates reports
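These scripts form a pipeline; conceptually the flow looks like this (a sketch only, the exact arguments live in the scripts themselves):

```bash
# Conceptual pipeline only; see each script for its actual arguments.
./submit.sh                      # submits disaggr_torch.slurm jobs via sbatch
# Each SLURM job then, inside the container:
#   - runs gen_yaml.py to generate the serving config
#   - launches start_server.sh and start_worker.sh
#   - runs run_benchmark.sh against the server
#   - post-processes logs with process_gen_iterlog.py
```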
## Usage

### Prerequisites
Before running the scripts, ensure you have:
- Access to a SLURM cluster
- Container image with TensorRT-LLM installed
- Model files accessible on the cluster
- Required environment variables set
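A quick pre-flight check can catch missing pieces before a job is queued. This is an illustrative sketch, not part of the scripts; the variable names follow the configuration section below:

```bash
# Illustrative pre-flight check; variable names match the configuration
# section below. Uses bash indirect expansion to test each variable.
for v in container_image mount_dir model_dir; do
    [ -n "${!v:-}" ] || echo "Warning: $v is not set"
done
# Verify the SLURM controller is reachable
sinfo -s > /dev/null 2>&1 || echo "Warning: cannot reach the SLURM controller"
```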
### Configuration

Edit the following variables in `submit.sh` and `disaggr_torch.slurm`:

```bash
# In disaggr_torch.slurm
container_image=${container_image}  # Your container image
mount_dir=${mount_dir}              # Mount directory path
model_dir=${model_dir}              # Model directory path
```
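For example (the values below are placeholders for illustration, not shipped defaults):

```bash
# Placeholder values; substitute paths valid on your cluster
container_image=/shared/containers/tensorrt_llm.sqsh
mount_dir=/shared/workspace
model_dir=/shared/workspace/models/DeepSeek-R1
```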
### Running Benchmarks

1. Submit benchmark jobs:

   ```bash
   ./submit.sh
   ```

2. Monitor job progress:

   ```bash
   squeue -u $USER
   ```

3. View results: results are saved in the `bm_20250703_deepseek-r1-{isl}-{osl}/` directory.
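While a job is running, its output can be followed directly. The log file name below assumes SLURM's default `slurm-<jobid>.out` naming, which the scripts may override:

```bash
# Follow a running job's output (default SLURM log naming assumed)
tail -f slurm-<jobid>.out

# After completion, list the generated result directories
ls -d bm_20250703_deepseek-r1-*
```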
## Script Details

### `submit.sh`
Main entry script that submits multiple SLURM jobs with different configurations:
- DEP8: 8-way parallelism for decode servers
- DEP16: 16-way parallelism with different EPLB slot configurations
- DEP32: 32-way parallelism for high-throughput scenarios
Parameters tested:
- Concurrency levels: 1x, 64x, 1024x multipliers
- EPLB slots: 0, 256, 288
- Different parallelism sizes
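A hedged sketch of how such a parameter sweep can be structured; the loop variables and the arguments passed to `sbatch` are assumptions, and the authoritative list is in `submit.sh` itself:

```bash
# Hypothetical sweep structure; submit.sh defines the real arguments.
for eplb_slots in 0 256 288; do
    for concurrency_mult in 1 64 1024; do
        # Argument order here is assumed for illustration
        sbatch disaggr_torch.slurm "$eplb_slots" "$concurrency_mult"
    done
done
```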
### `disaggr_torch.slurm`
SLURM job script that:
- Sets up container environment
- Generates configuration files
- Starts server and workers
- Executes benchmarks
- Cleans up processes
Key parameters:
- `num_ctx_servers`: Number of context servers
- `ctx_tp_size`: Tensor parallel size for context servers
- `num_gen_servers`: Number of generation servers
- `gen_tp_size`: Tensor parallel size for generation servers
- `concurrency`: Number of concurrent requests
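Tying the parameters above to a submission might look like this. The argument order is an assumption for illustration; the real order and full parameter list are defined at the top of `disaggr_torch.slurm`:

```bash
# Hypothetical invocation; argument order is assumed, check the script header.
num_ctx_servers=1
ctx_tp_size=4
num_gen_servers=1
gen_tp_size=8
concurrency=64
sbatch disaggr_torch.slurm \
    "$num_ctx_servers" "$ctx_tp_size" \
    "$num_gen_servers" "$gen_tp_size" \
    "$concurrency"
```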
### `gen_yaml.py`
Generates YAML configuration files with:
- Server topology and resource allocation
- Network configuration (hostnames, ports)
- Memory and batch size settings
- Optimization parameters (CUDA graphs, KV cache)
Key features:
- Automatic node and task allocation
- Support for attention data parallelism
- MoE load balancing configuration
- Speculative decoding (MTP) support
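A hypothetical invocation; the flag names below are assumptions, and the script's own argument parser is the authoritative interface:

```bash
# Flag names are assumptions for illustration; inspect gen_yaml.py
# (or run it with --help, assuming argparse) for the real options.
python3 gen_yaml.py \
    --config /path/to/output/config.yaml \
    --model_path "${model_dir}"
```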
### `start_server.sh` & `start_worker.sh`
- **Server**: Starts the main inference server with the API endpoint
- **Workers**: Starts MPI workers for distributed processing
- Support for profiling with NSight Systems
- Environment variable configuration for optimizations
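As a standalone illustration of NSight Systems profiling, a worker launch can be wrapped with `nsys` as below; the output name and trace selection are examples, and how the scripts actually wire up profiling may differ:

```bash
# Example nsys wrapper; the scripts may enable profiling differently.
nsys profile --output worker_profile --trace cuda,nvtx ./start_worker.sh
```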
### `run_benchmark.sh`

Executes benchmarking using TensorRT-LLM's `benchmark_serving` tool:
- Downloads ShareGPT dataset for realistic workloads
- Waits for server health checks (see the polling sketch after this list)
- Runs load testing with specified concurrency
- Collects performance metrics
- Gracefully shuts down services
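The health-check wait mentioned above can be pictured as a simple polling loop. The endpoint path and port here are illustrative assumptions; the real values come from the generated configuration:

```bash
# Illustrative polling loop; endpoint and port are assumptions.
until curl -sf "http://localhost:8000/health" > /dev/null; do
    echo "Waiting for server to become healthy..."
    sleep 10
done
```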
Metrics collected:
- Throughput (tokens/second)
- Latency (request completion time)
- Separate statistics for context and generation phases
### `process_gen_iterlog.py`
Post-processes benchmark results:
- Parses iteration logs from workers
- Calculates throughput metrics
- Generates CSV reports
- Supports MTP (Multi-Token Prediction) analysis
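A hypothetical invocation; the option names are assumptions, so check the script's own argument parsing for the real interface:

```bash
# Option names are assumptions for illustration.
python3 process_gen_iterlog.py \
    --log_dir bm_20250703_deepseek-r1-{isl}-{osl}/ \
    --output summary.csv
```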