Wide-EP SLURM Benchmark Scripts

This directory contains configuration files and utilities for benchmarking TensorRT-LLM Wide Expert Parallelism (Wide-EP) performance on SLURM-managed clusters.

Overview

The Wide-EP benchmarking infrastructure leverages the disaggregated serving benchmark framework to evaluate MoE model performance with expert parallelism at scale. This directory provides:

  • Configuration templates for Wide-EP deployments (config.yaml)
  • Post-processing utilities for benchmark analysis (process_gen_iterlog.py)

Core Implementation

The core SLURM submission and execution logic is implemented in examples/disaggregated/slurm/benchmark/. The scripts in that directory handle:

  • Job submission to SLURM clusters
  • Multi-node distributed execution
  • Worker initialization and coordination
  • Benchmark execution and result collection

Files in This Directory

config.yaml

Example configuration file for Wide-EP benchmarks. Key sections include:

  • SLURM Configuration: Cluster-specific settings (partition, account, job parameters)
  • Benchmark Mode: Testing parameters (concurrency, sequence lengths, streaming mode)
  • Hardware Configuration: GPU topology and server counts
  • Environment: Container images, model paths, and environment variables
  • Worker Configuration: Detailed settings for generation and context workers, including:
    • Parallelism settings (TP, EP, PP)
    • MoE configuration with load balancer settings
    • CUDA graph and KV cache configurations
    • Speculative decoding parameters

See the inline comments in config.yaml for detailed parameter descriptions.
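
For orientation, a worker section in this style of config might look something like the sketch below. This is a minimal illustration only: the grouping under a worker_config key and the specific values are assumptions, and the engine option names are modeled on the TensorRT-LLM LLM API YAML options rather than copied from this config.yaml, whose inline comments remain the authoritative reference.

# Illustrative sketch only -- grouping and key names are assumptions, not the exact config.yaml schema.
worker_config:
  gen:                                # generation workers
    tensor_parallel_size: 8           # TP
    moe_expert_parallel_size: 8       # EP: Wide-EP spreads experts across many GPUs
    pipeline_parallel_size: 1         # PP
    moe_config:
      load_balancer: moe_load_balancer.yaml   # hypothetical path to load balancer settings
    cuda_graph_config:
      enable_padding: true
    kv_cache_config:
      free_gpu_memory_fraction: 0.7
    speculative_config:
      decoding_type: MTP              # e.g. multi-token prediction
  ctx:                                # context workers follow the same shape,
    tensor_parallel_size: 4           # typically with different parallelism and cache settings
    kv_cache_config:
      free_gpu_memory_fraction: 0.75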

process_gen_iterlog.py

Post-processing script that analyzes benchmark iteration logs to generate performance reports. This script:

  • Parses generation worker iteration logs
  • Computes throughput and latency statistics
  • Generates summary reports for benchmark results

Usage

Prerequisites

Before running benchmarks, ensure you have:

  1. SLURM Cluster Access: Valid account and partition allocation
  2. Container Environment:
    • NVIDIA Container Toolkit configured
    • Required device mappings (e.g., /dev/nvidia-caps-imex-channels for GB200/GB300 NVL72, /dev/gdrdrv for GDRCopy)
  3. Model Files: Checkpoint files accessible from all cluster nodes
  4. Configuration: Updated config.yaml with your cluster-specific settings

Configuration Setup

  1. Copy and customize the example configuration:
cp config.yaml my_benchmark_config.yaml
  2. Update the following required fields in my_benchmark_config.yaml (an illustrative example follows this list):

    • slurm.partition: Your SLURM partition name
    • slurm.account: Your SLURM account
    • environment.container_image: Path to your TensorRT-LLM container
    • environment.model_path: Path to your model checkpoint
    • environment.work_dir: Working directory for benchmark outputs
    • environment.container_mount: Mount paths for the container
  3. Adjust hardware configuration to match your setup:

    • hardware.gpus_per_node: GPUs available per node
    • hardware.num_ctx_servers: Number of context processing servers
    • hardware.num_gen_servers: Number of generation servers
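
As an illustration, the cluster-specific fields from steps 2 and 3 might end up looking like the snippet below. Every value is a placeholder to replace with your own cluster details; the nesting follows the dotted field names listed above.

slurm:
  partition: batch                                    # your SLURM partition
  account: my_account                                 # your SLURM account
environment:
  container_image: /shared/containers/trtllm.sqsh    # placeholder container image path/URI
  model_path: /shared/models/my_model                # checkpoint visible from all nodes
  work_dir: /shared/benchmarks/wide_ep_run1          # benchmark outputs are written here
  container_mount: /shared:/shared                   # host:container mount path(s)
hardware:
  gpus_per_node: 4
  num_ctx_servers: 1
  num_gen_servers: 1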

Running Benchmarks

Submit a benchmark job using the submit.py script from the disaggregated benchmark directory:

# Navigate to the benchmark submission directory
cd ../../disaggregated/slurm/benchmark/

# Submit the job with your configuration
python3 submit.py -c ../../../wide_ep/slurm_scripts/my_benchmark_config.yaml

The script will:

  1. Validate your configuration
  2. Submit a SLURM job with the specified parameters
  3. Launch distributed workers across the allocated nodes
  4. Execute the benchmark workload
  5. Collect results in the specified working directory

Monitoring and Results

After submission, monitor your job:

# Check job status
squeue -u $USER

# View job output (replace <job_id> with your SLURM job ID)
tail -f slurm-<job_id>.out

# Check worker logs in the working directory
ls <work_dir>/logs/

Benchmark results will be saved in your configured work_dir, including:

  • Iteration logs from generation and context workers
  • Performance metrics and throughput statistics
  • System logs and error reports

Post-Processing Results

Process generation iteration logs to extract performance metrics:

python3 process_gen_iterlog.py <path_to_gen_iter_log>