# TensorRT-LLM Wide-EP Benchmark Scripts

This directory contains scripts for benchmarking TensorRT-LLM wide expert parallelism (Wide-EP) performance using the SLURM job scheduler.
## ⚠️ DISCLAIMER

These scripts are currently not QA'ed and are provided for demonstration purposes only.
Please note that:
- These scripts have not undergone formal quality assurance testing
- They are intended for demonstration and educational purposes
- Use at your own risk in production environments
- Always review and test scripts thoroughly before running in your specific environment
## Scripts Overview

### Core Scripts

- `submit.sh` - Main entry point for submitting benchmark jobs
- `disaggr_torch.slurm` - SLURM job script orchestrating the entire benchmark
- `gen_yaml.py` - Generates configuration files for the serving setup
- `start_server.sh` - Starts the inference server
- `start_worker.sh` - Starts the worker processes
- `run_benchmark.sh` - Executes the benchmark workload
- `process_gen_iterlog.py` - Processes benchmark results and generates reports
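These scripts form a pipeline; conceptually the flow looks like this (a sketch only, the exact arguments live in the scripts themselves):

```bash
# Conceptual pipeline only; see each script for its actual arguments.
./submit.sh                      # submits disaggr_torch.slurm jobs via sbatch
# Each SLURM job then, inside the container:
#   - runs gen_yaml.py to generate the serving config
#   - launches start_server.sh and start_worker.sh
#   - runs run_benchmark.sh against the server
#   - post-processes logs with process_gen_iterlog.py
```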
## Usage

### Prerequisites
Before running the scripts, ensure you have:
- Access to a SLURM cluster
- Container image with TensorRT-LLM installed
- Model files accessible on the cluster
- Required environment variables set
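A quick pre-flight check can catch missing pieces before a job is queued. This is an illustrative sketch, not part of the scripts; the variable names follow the configuration section below:

```bash
# Illustrative pre-flight check; variable names match the configuration
# section below. Uses bash indirect expansion to test each variable.
for v in container_image mount_dir model_dir; do
    [ -n "${!v:-}" ] || echo "Warning: $v is not set"
done
# Verify the SLURM controller is reachable
sinfo -s > /dev/null 2>&1 || echo "Warning: cannot reach the SLURM controller"
```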
### Configuration

Edit the following variables in `submit.sh` and `disaggr_torch.slurm`:

```bash
# In disaggr_torch.slurm
container_image=${container_image}  # Your container image
mount_dir=${mount_dir}              # Mount directory path
model_dir=${model_dir}              # Model directory path
```
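For example (the values below are placeholders for illustration, not shipped defaults):

```bash
# Placeholder values; substitute paths valid on your cluster
container_image=/shared/containers/tensorrt_llm.sqsh
mount_dir=/shared/workspace
model_dir=/shared/workspace/models/DeepSeek-R1
```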
### Running Benchmarks

1. Submit benchmark jobs:

   ```bash
   ./submit.sh
   ```

2. Monitor job progress:

   ```bash
   squeue -u $USER
   ```

3. View results: results are saved in the `bm_20250703_deepseek-r1-{isl}-{osl}/` directory.
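While a job is running, its output can be followed directly. The log file name below assumes SLURM's default `slurm-<jobid>.out` naming, which the scripts may override:

```bash
# Follow a running job's output (default SLURM log naming assumed)
tail -f slurm-<jobid>.out

# After completion, list the generated result directories
ls -d bm_20250703_deepseek-r1-*
```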
## Script Details

### `submit.sh`
Main entry script that submits multiple SLURM jobs with different configurations:
- DEP8: 8-way parallelism for decode servers
- DEP16: 16-way parallelism with different EPLB slot configurations
- DEP32: 32-way parallelism for high-throughput scenarios
Parameters tested:
- Concurrency levels: 1x, 64x, 1024x multipliers
- EPLB slots: 0, 256, 288
- Different parallelism sizes
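A hedged sketch of how such a parameter sweep can be structured; the loop variables and the arguments passed to `sbatch` are assumptions, and the authoritative list is in `submit.sh` itself:

```bash
# Hypothetical sweep structure; submit.sh defines the real arguments.
for eplb_slots in 0 256 288; do
    for concurrency_mult in 1 64 1024; do
        # Argument order here is assumed for illustration
        sbatch disaggr_torch.slurm "$eplb_slots" "$concurrency_mult"
    done
done
```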
### `disaggr_torch.slurm`
SLURM job script that:
- Sets up container environment
- Generates configuration files
- Starts server and workers
- Executes benchmarks
- Cleans up processes
Key parameters:
- `num_ctx_servers`: Number of context servers
- `ctx_tp_size`: Tensor parallel size for context servers
- `num_gen_servers`: Number of generation servers
- `gen_tp_size`: Tensor parallel size for generation servers
- `concurrency`: Number of concurrent requests
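Tying the parameters above to a submission might look like this. The argument order is an assumption for illustration; the real order and full parameter list are defined at the top of `disaggr_torch.slurm`:

```bash
# Hypothetical invocation; argument order is assumed, check the script header.
num_ctx_servers=1
ctx_tp_size=4
num_gen_servers=1
gen_tp_size=8
concurrency=64
sbatch disaggr_torch.slurm \
    "$num_ctx_servers" "$ctx_tp_size" \
    "$num_gen_servers" "$gen_tp_size" \
    "$concurrency"
```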
### `gen_yaml.py`
Generates YAML configuration files with:
- Server topology and resource allocation
- Network configuration (hostnames, ports)
- Memory and batch size settings
- Optimization parameters (CUDA graphs, KV cache)
Key features:
- Automatic node and task allocation
- Support for attention data parallelism
- MoE load balancing configuration
- Speculative decoding (MTP) support
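A hypothetical invocation; the flag names below are assumptions, and the script's own argument parser is the authoritative interface:

```bash
# Flag names are assumptions for illustration; inspect gen_yaml.py
# (or run it with --help, assuming argparse) for the real options.
python3 gen_yaml.py \
    --config /path/to/output/config.yaml \
    --model_path "${model_dir}"
```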
### `start_server.sh` & `start_worker.sh`
- **Server**: Starts the main inference server with the API endpoint
- **Workers**: Starts MPI workers for distributed processing
- Support for profiling with NSight Systems
- Environment variable configuration for optimizations
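As a standalone illustration of NSight Systems profiling, a worker launch can be wrapped with `nsys` as below; the output name and trace selection are examples, and how the scripts actually wire up profiling may differ:

```bash
# Example nsys wrapper; the scripts may enable profiling differently.
nsys profile --output worker_profile --trace cuda,nvtx ./start_worker.sh
```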
### `run_benchmark.sh`

Executes benchmarking using TensorRT-LLM's `benchmark_serving` tool:
- Downloads ShareGPT dataset for realistic workloads
- Waits for server health checks (see the polling sketch after this list)
- Runs load testing with specified concurrency
- Collects performance metrics
- Gracefully shuts down services
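The health-check wait mentioned above can be pictured as a simple polling loop. The endpoint path and port here are illustrative assumptions; the real values come from the generated configuration:

```bash
# Illustrative polling loop; endpoint and port are assumptions.
until curl -sf "http://localhost:8000/health" > /dev/null; do
    echo "Waiting for server to become healthy..."
    sleep 10
done
```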
Metrics collected:
- Throughput (tokens/second)
- Latency (request completion time)
- Separate statistics for context and generation phases
### `process_gen_iterlog.py`
Post-processes benchmark results:
- Parses iteration logs from workers
- Calculates throughput metrics
- Generates CSV reports
- Supports MTP (Multi-Token Prediction) analysis
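A hypothetical invocation; the option names are assumptions, so check the script's own argument parsing for the real interface:

```bash
# Option names are assumptions for illustration.
python3 process_gen_iterlog.py \
    --log_dir bm_20250703_deepseek-r1-{isl}-{osl}/ \
    --output summary.csv
```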