# TensorRT-LLM Wide-EP Benchmark Scripts

This directory contains scripts for benchmarking TensorRT-LLM wide expert parallelism (Wide-EP) performance using the SLURM job scheduler.

## ⚠️ DISCLAIMER

**These scripts are currently not QA'ed and are provided for demonstration purposes only.**

Please note that:

- These scripts have not undergone formal quality assurance testing
- They are intended for demonstration and educational purposes
- Use at your own risk in production environments
- Always review and test scripts thoroughly before running in your specific environment

## Scripts Overview

### Core Scripts

1. **`submit.sh`** - Main entry point for submitting benchmark jobs
2. **`disaggr_torch.slurm`** - SLURM job script orchestrating the entire benchmark
3. **`gen_yaml.py`** - Generates configuration files for the serving setup
4. **`start_server.sh`** - Starts the inference server
5. **`start_worker.sh`** - Starts the worker processes
6. **`run_benchmark.sh`** - Executes the benchmark workload
7. **`process_gen_iterlog.py`** - Processes benchmark results and generates reports

## Usage

### Prerequisites

Before running the scripts, ensure you have:

- Access to a SLURM cluster
- A container image with TensorRT-LLM installed
- Model files accessible on the cluster
- Required environment variables set

### Configuration

Edit the following variables in `submit.sh` and `disaggr_torch.slurm`:

```bash
# In disaggr_torch.slurm
container_image=${container_image}  # Your container image
mount_dir=${mount_dir}              # Mount directory path
model_dir=${model_dir}              # Model directory path
```

### Running Benchmarks

1. **Submit benchmark jobs**:

   ```bash
   ./submit.sh
   ```

2. **Monitor job progress**:

   ```bash
   squeue -u $USER
   ```

3. **View results**: Results are saved in the `bm_20250703_deepseek-r1-{isl}-{osl}/` directory, where `{isl}` and `{osl}` are the input and output sequence lengths.

## Script Details

### `submit.sh`

Main entry script that submits multiple SLURM jobs with different configurations:

- **DEP8**: 8-way parallelism for decode servers
- **DEP16**: 16-way parallelism with different EPLB slot configurations
- **DEP32**: 32-way parallelism for high-throughput scenarios

Parameters tested:

- Concurrency levels: 1x, 64x, 1024x multipliers
- EPLB slots: 0, 256, 288
- Different parallelism sizes
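A sweep like this typically reduces to nested loops around `sbatch`. The sketch below is a minimal illustration, assuming `disaggr_torch.slurm` accepts the parallelism size, EPLB slot count, and concurrency as positional arguments; the actual argument list and sweep values are defined in `submit.sh` itself.

```bash
#!/bin/bash
# Minimal sketch of a submit loop (illustrative only).
# Assumes disaggr_torch.slurm takes <parallelism> <eplb_slots> <concurrency>
# as positional arguments; check submit.sh for the authoritative CLI.
for eplb_slots in 0 256 288; do
    for concurrency_mult in 1 64 1024; do
        sbatch disaggr_torch.slurm 16 "${eplb_slots}" "${concurrency_mult}"
    done
done
```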
### `disaggr_torch.slurm`

SLURM job script that:

1. Sets up the container environment
2. Generates configuration files
3. Starts the server and workers
4. Executes benchmarks
5. Cleans up processes

**Key parameters**:

- `num_ctx_servers`: Number of context servers
- `ctx_tp_size`: Tensor parallel size for context servers
- `num_gen_servers`: Number of generation servers
- `gen_tp_size`: Tensor parallel size for generation servers
- `concurrency`: Number of concurrent requests

### `gen_yaml.py`

Generates YAML configuration files with:

- Server topology and resource allocation
- Network configuration (hostnames, ports)
- Memory and batch size settings
- Optimization parameters (CUDA graphs, KV cache)

**Key features**:

- Automatic node and task allocation
- Support for attention data parallelism
- MoE load balancing configuration
- Speculative decoding (MTP) support

### `start_server.sh` & `start_worker.sh`

- **Server**: Starts the main inference server with the API endpoint
- **Workers**: Start the MPI workers for distributed processing
- Support profiling with NVIDIA Nsight Systems
- Configure environment variables for optimizations

### `run_benchmark.sh`

Executes benchmarking using TensorRT-LLM's `benchmark_serving` tool:

- Downloads the ShareGPT dataset for realistic workloads
- Waits for server health checks
- Runs load testing at the specified concurrency
- Collects performance metrics
- Gracefully shuts down services

**Metrics collected**:

- Throughput (tokens/second)
- Latency (request completion time)
- Context-only vs. generation-only statistics

### `process_gen_iterlog.py`

Post-processes benchmark results:

- Parses iteration logs from the workers
- Calculates throughput metrics
- Generates CSV reports
- Supports MTP (Multi-Token Prediction) analysis
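The health-check wait described under `run_benchmark.sh` above is, at its core, a polling loop. A minimal sketch, assuming the server exposes a `/health` endpoint and that the host and port match the generated configuration:

```bash
# Poll until the server reports healthy before starting load generation.
# The endpoint path, host, and port are assumptions; match them to the
# values in your generated configuration.
until curl -sf "http://${SERVER_HOST:-localhost}:${SERVER_PORT:-8000}/health" > /dev/null; do
    echo "Waiting for server to become healthy..."
    sleep 10
done
echo "Server is healthy; starting benchmark."
```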
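Post-processing is then a direct invocation of `process_gen_iterlog.py` over the job's output directory. The flags below are hypothetical and shown only to illustrate the workflow; check the script's argument parser for its actual options:

```bash
# Hypothetical invocation -- the flag names are illustrative, not the
# script's real CLI; see process_gen_iterlog.py for its actual options.
python3 process_gen_iterlog.py \
    --log_dir "bm_20250703_deepseek-r1-${isl}-${osl}" \
    --output "bm_20250703_deepseek-r1-${isl}-${osl}/throughput.csv"
```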