# TensorRT-LLM Wide-EP Benchmark Scripts

This directory contains scripts for benchmarking TensorRT-LLM wide expert parallelism (wide-EP) performance using the SLURM job scheduler.

## ⚠️ DISCLAIMER

**These scripts are currently not QA'ed and are provided for demonstration purposes only.**

Please note that:

- These scripts have not undergone formal quality assurance testing
- They are intended for demonstration and educational purposes
- Use them at your own risk in production environments
- Always review and test the scripts thoroughly before running them in your specific environment
## Scripts Overview

### Core Scripts

1. **`submit.sh`** - Main entry point for submitting benchmark jobs
2. **`disaggr_torch.slurm`** - SLURM job script orchestrating the entire benchmark
3. **`gen_yaml.py`** - Generates configuration files for the serving setup
4. **`start_server.sh`** - Starts the inference server
5. **`start_worker.sh`** - Starts the worker processes
6. **`run_benchmark.sh`** - Executes the benchmark workload
7. **`process_gen_iterlog.py`** - Processes benchmark results and generates reports
## Usage

### Prerequisites

Before running the scripts, ensure you have:

- Access to a SLURM cluster
- A container image with TensorRT-LLM installed
- Model files accessible on the cluster
- The required environment variables set
### Configuration

Edit the following variables in `submit.sh` and `disaggr_torch.slurm`:

```bash
# In disaggr_torch.slurm
container_image=${container_image}   # Your container image
mount_dir=${mount_dir}               # Mount directory path
model_dir=${model_dir}               # Model directory path
```
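For example, the variables can be supplied through the environment before submitting; the paths below are placeholders, not real defaults:

```bash
# Hypothetical values -- substitute paths valid on your cluster.
export container_image=/containers/tensorrt_llm.sqsh
export mount_dir=/lustre/shared
export model_dir=/lustre/shared/models/deepseek-r1
# then submit with ./submit.sh
```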
### Running Benchmarks

1. **Submit benchmark jobs**:

   ```bash
   ./submit.sh
   ```

2. **Monitor job progress**:

   ```bash
   squeue -u $USER
   ```

3. **View results**:

   Results are saved in the `bm_20250703_deepseek-r1-{isl}-{osl}/` directory, where `{isl}` and `{osl}` are the input and output sequence lengths.
## Script Details

### `submit.sh`

Main entry script that submits multiple SLURM jobs with different configurations:

- **DEP8**: 8-way parallelism for decode servers
- **DEP16**: 16-way parallelism with different EPLB slot configurations
- **DEP32**: 32-way parallelism for high-throughput scenarios

Parameters tested:

- Concurrency levels: 1x, 64x, and 1024x multipliers
- EPLB slots: 0, 256, 288
- Different parallelism sizes
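The sweep above can be sketched as a nested loop; passing the values as positional arguments to `disaggr_torch.slurm` is an assumption for illustration, and `echo` stands in for a real submission:

```bash
#!/usr/bin/env bash
# Hypothetical parameter sweep over the tested grid. The argument
# order is an assumption, and echo replaces the actual sbatch call.
set -euo pipefail

for dep_size in 8 16 32; do            # DEP8 / DEP16 / DEP32
  for eplb_slots in 0 256 288; do      # EPLB slot configurations
    for multiplier in 1 64 1024; do    # concurrency multipliers
      concurrency=$((dep_size * multiplier))
      echo sbatch disaggr_torch.slurm "$dep_size" "$eplb_slots" "$concurrency"
    done
  done
done
```

This prints 27 candidate submissions, one per combination.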
### `disaggr_torch.slurm`

SLURM job script that:

1. Sets up the container environment
2. Generates configuration files
3. Starts the server and workers
4. Executes the benchmarks
5. Cleans up processes

**Key parameters**:

- `num_ctx_servers`: Number of context servers
- `ctx_tp_size`: Tensor parallel size for context servers
- `num_gen_servers`: Number of generation servers
- `gen_tp_size`: Tensor parallel size for generation servers
- `concurrency`: Number of concurrent requests
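The five numbered phases can be outlined as a shell function; every flag, path, and argument here is an assumption rather than the real implementation (the `--container-image`/`--container-mounts` flags assume the pyxis SLURM plugin):

```bash
# Hypothetical outline of the job's phases; all flags and arguments
# are illustrative assumptions.
run_disaggregated_benchmark() {
  # 1. Launch inside the container environment
  srun --container-image="${container_image}" \
       --container-mounts="${mount_dir}:${mount_dir}" \
       bash -c '
    python3 gen_yaml.py            # 2. generate configuration files
    bash start_server.sh &         # 3. start the server ...
    bash start_worker.sh &         #    ... and the workers
    bash run_benchmark.sh          # 4. execute the benchmarks
    pkill -f start_worker || true  # 5. clean up leftover processes
  '
}
```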
### `gen_yaml.py`

Generates YAML configuration files with:

- Server topology and resource allocation
- Network configuration (hostnames, ports)
- Memory and batch size settings
- Optimization parameters (CUDA graphs, KV cache)

**Key features**:

- Automatic node and task allocation
- Support for attention data parallelism
- MoE load-balancing configuration
- Speculative decoding (MTP) support
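As a rough illustration, the generated configuration might have the following shape; the field names and layout are assumptions, not the script's real schema:

```bash
# Write a hypothetical example of the generated YAML to a temp file.
# Field names and values are illustrative assumptions.
cat > /tmp/example_serving_config.yaml <<'EOF'
context_servers:
  num_instances: 1
  tensor_parallel_size: 4
  urls:
    - "node-01:8001"
generation_servers:
  num_instances: 2
  tensor_parallel_size: 8
  urls:
    - "node-02:8002"
    - "node-03:8003"
EOF
```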
### `start_server.sh` & `start_worker.sh`

- **Server**: starts the main inference server with its API endpoint
- **Workers**: start MPI workers for distributed processing
- Both support profiling with NVIDIA Nsight Systems
- Environment variables configure performance optimizations
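An optional profiling toggle could look like this; the `PROFILE_ENABLED` variable and output naming are assumptions, with `nsys profile` as a minimal Nsight Systems invocation:

```bash
# Hypothetical wrapper: run a command under Nsight Systems when
# PROFILE_ENABLED=1, otherwise run it directly.
launch_with_optional_profile() {
  if [ "${PROFILE_ENABLED:-0}" = "1" ]; then
    nsys profile -o "worker_${SLURM_PROCID:-0}" "$@"
  else
    "$@"
  fi
}
```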
### `run_benchmark.sh`

Executes benchmarking using TensorRT-LLM's `benchmark_serving` tool:

- Downloads the ShareGPT dataset for realistic workloads
- Waits for server health checks to pass
- Runs load testing at the specified concurrency
- Collects performance metrics
- Gracefully shuts down services

**Metrics collected**:

- Throughput (tokens/second)
- Latency (request completion time)
- Context-only vs. generation-only statistics
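The health-check wait can be sketched as a retry loop; the `/health` endpoint path and the retry parameters are assumptions:

```bash
# Hypothetical readiness poll: retry GET /health until it succeeds
# or the retry budget is exhausted.
wait_for_server() {
  local host=$1 port=$2 retries=${3:-120} delay=${4:-5}
  local url="http://${host}:${port}/health" i
  for ((i = 0; i < retries; i++)); do
    curl -sf "$url" > /dev/null && return 0
    sleep "$delay"
  done
  return 1
}
```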
### `process_gen_iterlog.py`

Post-processes benchmark results:

- Parses iteration logs from workers
- Calculates throughput metrics
- Generates CSV reports
- Supports MTP (Multi-Token Prediction) analysis
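The core transformation can be sketched with awk; the iteration-log format shown is an assumption, not the real worker output:

```bash
# Hypothetical post-processing: turn per-iteration log lines into a
# CSV of token throughput. The input format is an assumption.
printf '%s\n' \
  'iter=0 num_tokens=512 elapsed_ms=250.0' \
  'iter=1 num_tokens=1024 elapsed_ms=500.0' |
awk -F'[= ]' '
  BEGIN { print "iter,tokens,tokens_per_sec" }
  { printf "%s,%s,%.1f\n", $2, $4, $4 / ($6 / 1000.0) }'
```

This prints a CSV header followed by `0,512,2048.0` and `1,1024,2048.0`.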