Xianjie Qiao b1976c2add
Add wide-ep benchmarking scripts (#5760)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
Signed-off-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-07-05 19:29:39 +08:00

TensorRT-LLM Wide-EP Benchmark Scripts

This directory contains scripts for benchmarking TensorRT-LLM wide-EP (wide expert parallelism) performance with the SLURM job scheduler.

⚠️ DISCLAIMER

These scripts are currently not QA'ed and are provided for demonstration purposes only.

Please note that:

  • These scripts have not undergone formal quality assurance testing
  • They are intended for demonstration and educational purposes
  • Use at your own risk in production environments
  • Always review and test scripts thoroughly before running in your specific environment

Scripts Overview

Core Scripts

  1. submit.sh - Main entry point for submitting benchmark jobs
  2. disaggr_torch.slurm - SLURM job script orchestrating the entire benchmark
  3. gen_yaml.py - Generates configuration files for serving setup
  4. start_server.sh - Starts the inference server
  5. start_worker.sh - Starts the worker processes
  6. run_benchmark.sh - Executes the benchmark workload
  7. process_gen_iterlog.py - Processes benchmark results and generates reports

Usage

Prerequisites

Before running the scripts, ensure you have:

  • Access to a SLURM cluster
  • Container image with TensorRT-LLM installed
  • Model files accessible on the cluster
  • Required environment variables set

Configuration

Edit the following variables in submit.sh and disaggr_torch.slurm:

# In disaggr_torch.slurm
container_image=${container_image}     # Your container image
mount_dir=${mount_dir}                 # Mount directory path
model_dir=${model_dir}                 # Model directory path
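For reference, a filled-in version might look like this. All three paths are hypothetical examples; substitute locations valid on your cluster:

```shell
# Example values only; substitute paths valid on your cluster.
# These are the same variables shown above from disaggr_torch.slurm.
container_image=/shared/images/tensorrt-llm.sqsh   # hypothetical container image path
mount_dir=/shared/workspace                        # hypothetical mount directory
model_dir=/shared/models/deepseek-r1               # hypothetical model directory
```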

Running Benchmarks

  1. Submit benchmark jobs:

    ./submit.sh
    
  2. Monitor job progress:

    squeue -u $USER
    
  3. View results: Results are saved in the bm_20250703_deepseek-r1-{isl}-{osl}/ directory, where {isl} and {osl} are the input and output sequence lengths

Script Details

submit.sh

Main entry script that submits multiple SLURM jobs with different configurations:

  • DEP8: 8-way parallelism for decode servers
  • DEP16: 16-way parallelism with different EPLB slot configurations
  • DEP32: 32-way parallelism for high-throughput scenarios

Parameters tested:

  • Concurrency multipliers: 1x, 64x, 1024x
  • EPLB slots: 0, 256, 288
  • Different parallelism sizes
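The sweep above can be sketched as a nested loop over the tested combinations. The sbatch flag names here are assumptions for illustration, not submit.sh's real interface, and echo keeps this a dry run:

```shell
# Hypothetical sketch of the kind of sweep submit.sh performs: one SLURM
# job per (parallelism, EPLB-slot, concurrency) combination. The sbatch
# flags are invented for illustration; echo keeps this a dry run.
submitted=0
for dep in 8 16 32; do              # DEP8 / DEP16 / DEP32
  for eplb in 0 256 288; do        # EPLB slot counts tested
    for mult in 1 64 1024; do      # concurrency multipliers
      echo "sbatch disaggr_torch.slurm --gen-tp $dep --eplb-slots $eplb --concurrency $((dep * mult))"
      submitted=$((submitted + 1))
    done
  done
done
echo "submitted $submitted configurations"
```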

disaggr_torch.slurm

SLURM job script that:

  1. Sets up container environment
  2. Generates configuration files
  3. Starts server and workers
  4. Executes benchmarks
  5. Cleans up processes

Key parameters:

  • num_ctx_servers: Number of context servers
  • ctx_tp_size: Tensor parallel size for context servers
  • num_gen_servers: Number of generation servers
  • gen_tp_size: Tensor parallel size for generation servers
  • concurrency: Number of concurrent requests
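A dry-run sketch of how these parameters might be passed to the job script follows; the positional argument order is an assumption, not the script's actual interface:

```shell
# Hypothetical dry run: passing the key parameters listed above to
# disaggr_torch.slurm. The positional order is an assumption.
num_ctx_servers=1
ctx_tp_size=4
num_gen_servers=1
gen_tp_size=8
concurrency=64
cmd="sbatch disaggr_torch.slurm $num_ctx_servers $ctx_tp_size $num_gen_servers $gen_tp_size $concurrency"
echo "$cmd"   # print instead of submitting
```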

gen_yaml.py

Generates YAML configuration files with:

  • Server topology and resource allocation
  • Network configuration (hostnames, ports)
  • Memory and batch size settings
  • Optimization parameters (CUDA graphs, KV cache)

Key features:

  • Automatic node and task allocation
  • Support for attention data parallelism
  • MoE load balancing configuration
  • Speculative decoding (MTP) support
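To make the feature list concrete, here is a sketch of the shape of file gen_yaml.py might emit. Every key name below is an assumption based on the features listed above, not the script's real schema:

```shell
# Hypothetical sketch of a generated config; key names are assumptions,
# not gen_yaml.py's real output schema.
cat > wide_ep_config.yaml <<'EOF'
context_servers:
  num_instances: 1
  tensor_parallel_size: 4
generation_servers:
  num_instances: 1
  tensor_parallel_size: 8
  moe_load_balancer:
    num_slots: 288       # EPLB slots
  speculative_config:
    decoding_type: MTP   # speculative decoding
EOF
```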

start_server.sh & start_worker.sh

  • Server: Starts the main inference server with an API endpoint
  • Workers: Starts MPI workers for distributed processing
  • Support for profiling with NSight Systems
  • Environment variable configuration for optimizations

run_benchmark.sh

Executes benchmarking using TensorRT-LLM's benchmark_serving tool:

  • Downloads ShareGPT dataset for realistic workloads
  • Waits for server health checks
  • Runs load testing with specified concurrency
  • Collects performance metrics
  • Gracefully shuts down services
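The "waits for server health checks" step above boils down to a retry loop. Here is a generic sketch; run_benchmark.sh's actual implementation may differ, and the health-check URL in the comment is a hypothetical example:

```shell
# Generic retry helper illustrating the health-check wait step; the real
# script's implementation may differ. In practice the probe would be
# something like: curl -sf http://host:port/health
wait_ready() {
  tries=$1; shift
  i=1
  while [ "$i" -le "$tries" ]; do
    if "$@"; then return 0; fi   # probe succeeded: server is up
    i=$((i + 1))
    sleep 1                      # back off before the next attempt
  done
  return 1                       # gave up after $tries attempts
}
```

For example, `wait_ready 30 curl -sf http://localhost:8000/health` (hypothetical endpoint) polls for up to 30 seconds before giving up.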

Metrics collected:

  • Throughput (tokens/second)
  • Latency (request completion time)
  • Context-only vs. generation-only statistics

process_gen_iterlog.py

Post-processes benchmark results:

  • Parses iteration logs from workers
  • Calculates throughput metrics
  • Generates CSV reports
  • Supports MTP (Multi-Token Prediction) analysis
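The core throughput arithmetic can be sketched in one pipeline. The CSV layout below is invented for illustration; the real iteration-log format parsed by process_gen_iterlog.py differs:

```shell
# Hypothetical sketch of the throughput calculation: sum generated tokens
# over elapsed time from per-iteration records. The CSV layout is invented.
throughput=$(printf 'iter,tokens,seconds\n1,4096,0.5\n2,4096,0.5\n' \
  | awk -F, 'NR > 1 { tok += $2; sec += $3 } END { printf "%.0f", tok / sec }')
echo "$throughput tokens/second"
```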