# TensorRT-LLM Wide-EP Benchmark Scripts
This directory contains scripts for benchmarking TensorRT-LLM wide expert parallelism (wide-EP) performance using the SLURM job scheduler.
## ⚠️ DISCLAIMER
**These scripts are currently not QA'ed and are provided for demonstration purposes only.**
Please note that:
- These scripts have not undergone formal quality assurance testing
- They are intended for demonstration and educational purposes
- Use at your own risk in production environments
- Always review and test scripts thoroughly before running in your specific environment
## Scripts Overview
### Core Scripts
1. **`submit.sh`** - Main entry point for submitting benchmark jobs
2. **`disaggr_torch.slurm`** - SLURM job script orchestrating the entire benchmark
3. **`gen_yaml.py`** - Generates configuration files for serving setup
4. **`start_server.sh`** - Starts the inference server
5. **`start_worker.sh`** - Starts the worker processes
6. **`run_benchmark.sh`** - Executes the benchmark workload
7. **`process_gen_iterlog.py`** - Processes benchmark results and generates reports
## Usage
### Prerequisites
Before running the scripts, ensure you have:
- Access to a SLURM cluster
- Container image with TensorRT-LLM installed
- Model files accessible on the cluster
- Required environment variables set
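A quick sanity check that you can reach the scheduler before submitting anything:
```bash
sinfo             # list available partitions and node states
squeue -u $USER   # confirm you can query your own jobs
```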
### Configuration
Edit the following variables in `submit.sh` and `disaggr_torch.slurm`:
```bash
# In disaggr_torch.slurm
container_image=${container_image} # Your container image
mount_dir=${mount_dir} # Mount directory path
model_dir=${model_dir} # Model directory path
```
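For example, a filled-in configuration might look like the following; the image reference and paths are placeholders for illustration, not defaults shipped with the scripts:
```bash
# In disaggr_torch.slurm -- example values only; adjust for your cluster.
container_image=/shared/containers/tensorrt_llm.sqsh   # placeholder container image
mount_dir=/shared/trtllm-bench                         # placeholder mount directory
model_dir=/shared/models/DeepSeek-R1                   # placeholder model directory
```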
### Running Benchmarks
1. **Submit benchmark jobs**:
```bash
./submit.sh
```
2. **Monitor job progress**:
```bash
squeue -u $USER
```
3. **View results**:
Results are saved in the `bm_20250703_deepseek-r1-{isl}-{osl}/` directory, where `{isl}` and `{osl}` are the input and output sequence lengths used for the run.
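For example, to list what a run produced (directory names depend on the ISL/OSL you benchmarked):
```bash
ls -R bm_20250703_deepseek-r1-*/
```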
## Script Details
### `submit.sh`
Main entry script that submits multiple SLURM jobs with different configurations:
- **DEP8**: 8-way parallelism for decode servers
- **DEP16**: 16-way parallelism with different EPLB slot configurations
- **DEP32**: 32-way parallelism for high-throughput scenarios
Parameters tested:
- Concurrency multipliers: 1x, 64x, 1024x
- EPLB slots: 0, 256, 288
- Different parallelism sizes
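A minimal sketch of how such a submission loop could be structured, assuming `disaggr_torch.slurm` takes its settings as positional `sbatch` arguments (the real `submit.sh` may name and order its parameters differently):
```bash
#!/bin/bash
# Illustrative only: argument names, order, and values are assumptions,
# not the actual interface of disaggr_torch.slurm.
for eplb_slots in 0 256 288; do
    for concurrency_multiplier in 1 64 1024; do
        # DEP16 example: 16-way generation parallelism
        sbatch disaggr_torch.slurm 16 "${eplb_slots}" "${concurrency_multiplier}"
    done
done
```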
### `disaggr_torch.slurm`
SLURM job script that:
1. Sets up container environment
2. Generates configuration files
3. Starts server and workers
4. Executes benchmarks
5. Cleans up processes
**Key parameters**:
- `num_ctx_servers`: Number of context servers
- `ctx_tp_size`: Tensor parallel size for context servers
- `num_gen_servers`: Number of generation servers
- `gen_tp_size`: Tensor parallel size for generation servers
- `concurrency`: Number of concurrent requests
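For orientation, a disaggregated benchmark job of this shape typically follows the skeleton below covering the five steps above; the actual `disaggr_torch.slurm` differs in detail, and the `srun`/pyxis flags, node counts, and backgrounding shown here are assumptions:
```bash
#!/bin/bash
#SBATCH --nodes=4                # illustrative node count
#SBATCH --ntasks-per-node=8      # illustrative task count

# Run everything inside the container (pyxis/enroot assumed).
RUN="srun --container-image=${container_image} --container-mounts=${mount_dir}:${mount_dir}"

${RUN} python3 gen_yaml.py        # 1. generate the serving configuration
${RUN} bash start_worker.sh &     # 2. launch MPI workers in the background
${RUN} bash start_server.sh &     # 3. launch the API server
${RUN} bash run_benchmark.sh      # 4. run the benchmark against the server
scancel "${SLURM_JOB_ID}"         # 5. tear down any remaining steps
```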
### `gen_yaml.py`
Generates YAML configuration files with:
- Server topology and resource allocation
- Network configuration (hostnames, ports)
- Memory and batch size settings
- Optimization parameters (CUDA graphs, KV cache)
**Key features**:
- Automatic node and task allocation
- Support for attention data parallelism
- MoE load balancing configuration
- Speculative decoding (MTP) support
### `start_server.sh` & `start_worker.sh`
- **Server**: Starts the main inference server with API endpoint
- **Workers**: Starts MPI workers for distributed processing
- Support for profiling with NVIDIA Nsight Systems (see the sketch after this list)
- Environment variable configuration for optimizations
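For the profiling path, the worker launch is typically wrapped in an Nsight Systems command along these lines (output path and trace options are illustrative, not the exact flags used by `start_worker.sh`):
```bash
# Illustrative nsys wrapper; the real scripts may use different options.
nsys profile --trace=cuda,nvtx -o /tmp/worker_trace_${SLURM_PROCID:-0} \
    bash start_worker.sh
```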
### `run_benchmark.sh`
Executes benchmarking using TensorRT-LLM's `benchmark_serving` tool:
- Downloads ShareGPT dataset for realistic workloads
- Waits for the server health check to pass (see the sketch at the end of this section)
- Runs load testing with specified concurrency
- Collects performance metrics
- Gracefully shuts down services
**Metrics collected**:
- Throughput (tokens/second)
- Latency (request completion time)
- Context-only vs. generation-only statistics
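The health-check wait referenced above usually amounts to polling the server until it responds, for example (hostname, port, and endpoint path are assumptions):
```bash
# Poll the API server until it reports healthy; values are illustrative.
until curl -sf "http://${server_host:-localhost}:${server_port:-8000}/health" > /dev/null; do
    echo "Waiting for the server to become ready..."
    sleep 10
done
```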
### `process_gen_iterlog.py`
Post-processes benchmark results:
- Parses iteration logs from workers
- Calculates throughput metrics
- Generates CSV reports
- Supports MTP (Multi-Token Prediction) analysis