TensorRT-LLMs/tests/scripts/allreduce_perf
Yukun He d272f1a9bc
[TRTLLM-8821][feat] Apply AutoTuner to AllReduce Op for strategy tuning. (#8531)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2026-01-05 15:44:37 +08:00

AllReduce Performance Offline Autotuning and Visualization Tools

This directory contains tools for benchmarking, analyzing, and visualizing AllReduce performance in TensorRT-LLM. The toolkit consists of two main components:

  1. allreduce_heuristic_code_gen.py - Generates optimized C++ lookup tables for AllReduce strategy selection
  2. allreduce_perf_viz.py - Creates comprehensive visualizations of AllReduce performance data

Overview

The AllReduce performance analysis workflow involves:

  1. Benchmarking: Run performance tests across different configurations
  2. Analysis: Generate optimal strategy lookup tables
  3. Visualization: Create heatmaps and performance comparison charts

Prerequisites

  • TensorRT-LLM environment with MPI support
  • Python packages: pandas, numpy, matplotlib, seaborn, scipy
  • CUDA-capable GPU(s) for benchmarking

Tool 1: allreduce_heuristic_code_gen.py

Purpose

Generates C++ lookup tables that contain the optimal AllReduce strategy for different parameter combinations (tensor parallel size, fusion operations, hidden sizes, and token counts).

Usage

Basic Usage (Auto-benchmark and generate)

python allreduce_heuristic_code_gen.py

Advanced Usage

python allreduce_heuristic_code_gen.py \
    --data_dir /path/to/benchmark/data \
    --sm_version 89 \
    --save_csv_dir /path/to/save/csv \
    --enable_auto

Parameters

  • --data_dir: Directory containing existing benchmark CSV files (optional)
  • --sm_version: CUDA SM version (auto-detected if not specified)
  • --save_csv_dir: Directory to save benchmark CSV files
  • --enable_auto: Enable AUTO strategy in benchmarking
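
As a rough illustration of how these flags fit together, the sketch below declares them with argparse. Only the flag names come from the list above; the types, defaults, help strings, and the function name build_parser are assumptions, and the real script may declare them differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (types/defaults assumed)."""
    parser = argparse.ArgumentParser(
        description="Generate AllReduce strategy lookup tables (sketch).")
    # Optional: reuse existing benchmark CSV files instead of re-benchmarking.
    parser.add_argument("--data_dir", type=str, default=None,
                        help="Directory containing existing benchmark CSV files")
    # Assumed to be an integer such as 89; None stands for auto-detection here.
    parser.add_argument("--sm_version", type=int, default=None,
                        help="CUDA SM version (auto-detected if omitted)")
    parser.add_argument("--save_csv_dir", type=str, default=None,
                        help="Directory to save benchmark CSV files")
    # Boolean switch: present means the AUTO strategy is benchmarked too.
    parser.add_argument("--enable_auto", action="store_true",
                        help="Enable AUTO strategy in benchmarking")
    return parser


args = build_parser().parse_args(["--sm_version", "89", "--enable_auto"])
print(args)
```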

Workflow

  1. Benchmark Generation (if --data_dir is not provided):

    • Automatically runs all_reduce.py microbenchmark using MPI
    • Tests the configured tensor parallel sizes (currently only TP=2)
    • Generates CSV files: benchmark.tp{size}.sm{version}.csv
  2. Strategy Analysis:

    • Loads benchmark data from CSV files
    • Filters data based on predefined thresholds
    • Finds optimal strategy for each parameter combination
    • Creates 4D lookup table: [tp_size][fusion][hidden_size][num_tokens]
  3. Code Generation:

    • Converts lookup table to C++ array format
    • Outputs to gen_heuristic_code/generated_lookup_table.cpp
    • Produces output that is ready for integration into the TensorRT-LLM codebase
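
The strategy analysis step above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the script's actual code: the record layout, the function name best_strategy_table, and the timing values are made up for the example. Only the strategy IDs (NCCL=0, ONESHOT=4, TWOSHOT=5) come from this document.

```python
# Strategy IDs as listed under "Supported Strategies".
STRATEGY_ENUM = {"NCCL": 0, "ONESHOT": 4, "TWOSHOT": 5}

# Hypothetical benchmark records (illustrative timings, not measurements):
# (tp_size, fusion, hidden_size, num_tokens, strategy, time_us)
records = [
    (2, "NONE", 128, 1, "NCCL", 9.0),
    (2, "NONE", 128, 1, "ONESHOT", 11.0),
    (2, "NONE", 128, 1, "TWOSHOT", 14.0),
    (2, "NONE", 128, 4, "NCCL", 12.0),
    (2, "NONE", 128, 4, "ONESHOT", 10.0),
    (2, "NONE", 128, 4, "TWOSHOT", 13.0),
    (2, "NONE", 128, 16, "NCCL", 20.0),
    (2, "NONE", 128, 16, "ONESHOT", 18.0),
    (2, "NONE", 128, 16, "TWOSHOT", 15.0),
]


def best_strategy_table(records):
    """Map (tp, fusion, hidden, tokens) to the ID of the fastest strategy."""
    best = {}  # key -> (strategy_name, time_us) of the fastest entry so far
    for tp, fusion, hidden, tokens, strategy, time_us in records:
        key = (tp, fusion, hidden, tokens)
        if key not in best or time_us < best[key][1]:
            best[key] = (strategy, time_us)
    return {key: STRATEGY_ENUM[name] for key, (name, _) in best.items()}


table = best_strategy_table(records)
for key in sorted(table):
    print(key, "->", table[key])
```

Keying the result by the four parameters keeps the sketch independent of how the real script orders its array dimensions.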

Output Example

// AllReduce lookup: [tp][fusion][hidden][tokens] = strategy
// TP:[2, 4, 8] Fusion:['NONE', 'RESIDUAL_RMS_NORM', ...]
inline AllReduceBestStrategyTableType AllReduceBestStrategyTableSM89 = {
    {
        // TP=2
        { // Fusion=NONE
            {0,0,4,4,5,5,5,5,5,5,5,5,5,5,5}, // hidden_size=128
            {0,4,4,4,5,5,5,5,5,5,5,5,5,5,5}, // hidden_size=256
            // ... more rows
        },
        // ... more fusion types
    },
    // ... more TP sizes
};
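
To make the array layout concrete, here is a small sketch that renders one fusion block in the same brace-and-comment style as the example above. The helper names format_row and format_fusion_block are hypothetical, and the actual generator may format its output differently.

```python
def format_row(values, hidden_size):
    """Render one hidden-size row as a C++ initializer with a trailing comment."""
    body = ",".join(str(v) for v in values)
    return "{" + body + "}, // hidden_size=" + str(hidden_size)


def format_fusion_block(rows_by_hidden, fusion, indent="        "):
    """Render all rows of one fusion type, ordered by hidden size."""
    lines = [f"{indent}{{ // Fusion={fusion}"]
    for hidden_size in sorted(rows_by_hidden):
        row = format_row(rows_by_hidden[hidden_size], hidden_size)
        lines.append(f"{indent}    {row}")
    lines.append(f"{indent}}},")
    return "\n".join(lines)


# The two rows shown in the output example (15 token-count columns each).
rows_by_hidden = {
    128: [0, 0, 4, 4] + [5] * 11,
    256: [0, 4, 4, 4] + [5] * 11,
}
block = format_fusion_block(rows_by_hidden, "NONE")
print(block)
```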

Tool 2: allreduce_perf_viz.py

Purpose

Creates comprehensive visualizations of AllReduce performance data including performance heatmaps, strategy comparison charts, and difference analysis.

Usage

Basic Usage

python allreduce_perf_viz.py --data_dir /path/to/benchmark/data

Parameters

  • --data_dir: Directory containing benchmark CSV files (default: 'data')

Generated Visualizations

The tool generates three types of visualizations for each configuration:

1. Performance Heatmaps (*_heatmap.png)

  • Purpose: Show raw performance times for each strategy
  • Layout: Side-by-side heatmaps for each AllReduce strategy
  • Axes: X=num_tokens, Y=hidden_size
  • Colors: Performance time in microseconds (μs)
  • Features: Shared colorbar, logarithmic scaling so that both small and large timings remain distinguishable

2. Best Strategy Maps (*_best_strategy.png)

  • Purpose: Show optimal strategy for each parameter combination
  • Layout: Single heatmap with categorical colors
  • Axes: X=num_tokens, Y=hidden_size
  • Colors: Different strategies (NCCL, ONESHOT, TWOSHOT, etc.)
  • Features: Custom colorbar with strategy labels, distribution statistics

3. Strategy Difference Heatmaps (*_strategy_difference_heatmap.png)

  • Purpose: Show performance difference from optimal strategy
  • Layout: Side-by-side heatmaps for each strategy
  • Axes: X=num_tokens, Y=hidden_size
  • Colors: Percentage difference from best strategy (white=optimal, red=slower)
  • Features: Annotated cells with exact difference values
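
The percentage shown in each cell can be understood through the sketch below. It assumes the difference is computed relative to the fastest strategy for that cell, which matches the description above but is not confirmed against the script. The function name and the timings are illustrative.

```python
def percent_slower_than_best(times_us):
    """Return each strategy's slowdown relative to the fastest one, in percent."""
    best = min(times_us.values())
    return {name: (t - best) / best * 100.0 for name, t in times_us.items()}


# Illustrative timings (in microseconds) for one (hidden_size, num_tokens) cell.
cell = {"NCCL": 12.0, "ONESHOT": 8.0, "TWOSHOT": 10.0}
diffs = percent_slower_than_best(cell)
print(diffs)  # the optimal strategy maps to 0.0 (drawn white in the heatmap)
```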

Visualization Functions

The script provides three main visualization functions that can be used programmatically:

# 1. Performance heatmaps
visualize_2d_heatmap(df, fusion_op='NONE', save_path='heatmap.png')

# 2. Best strategy visualization
visualize_2d_best_strategy(df, fusion_op='NONE', save_path='best_strategy.png')

# 3. Strategy difference analysis
visualize_strategy_difference_heatmaps(df, fusion_op='NONE', save_path='diff.png')

Output Structure

data/
├── viz/
│   ├── NONE/
│   │   ├── benchmark.tp2.sm89_heatmap.png
│   │   ├── benchmark.tp2.sm89_best_strategy.png
│   │   └── benchmark.tp2.sm89_strategy_difference_heatmap.png
│   ├── RESIDUAL_RMS_NORM/
│   │   └── ... (similar files)
│   └── ... (other fusion operations)
└── benchmark.tp2.sm89.csv
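
The layout above can be derived from a CSV file name alone. The sketch below shows one way to do so with pathlib. The helper names parse_benchmark_name and viz_paths are hypothetical and not part of the script.

```python
import re
from pathlib import Path

# File name pattern from the workflow section: benchmark.tp{size}.sm{version}.csv
CSV_PATTERN = re.compile(r"^benchmark\.tp(\d+)\.sm(\d+)\.csv$")
PLOT_KINDS = ("heatmap", "best_strategy", "strategy_difference_heatmap")


def parse_benchmark_name(csv_path):
    """Extract (tp_size, sm_version) from a benchmark CSV file name."""
    match = CSV_PATTERN.match(Path(csv_path).name)
    if match is None:
        raise ValueError(f"unexpected benchmark file name: {csv_path}")
    return int(match.group(1)), int(match.group(2))


def viz_paths(csv_path, fusion_op):
    """Build the per-fusion PNG paths that sit next to the CSV under viz/."""
    csv_path = Path(csv_path)
    out_dir = csv_path.parent / "viz" / fusion_op
    # Path.stem drops only the final ".csv" suffix.
    return {kind: out_dir / f"{csv_path.stem}_{kind}.png" for kind in PLOT_KINDS}


tp_size, sm_version = parse_benchmark_name("data/benchmark.tp2.sm89.csv")
paths = viz_paths("data/benchmark.tp2.sm89.csv", "NONE")
for kind, path in paths.items():
    print(kind, "->", path.as_posix())
```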

Configuration Details

Supported Strategies

  • NCCL (0): Standard NCCL AllReduce
  • ONESHOT (4): Custom single-phase AllReduce
  • TWOSHOT (5): Custom two-phase AllReduce
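
These numeric IDs are the values that appear in the generated lookup tables. As a small sketch, the snippet below decodes the first row of the output example back into strategy names. The dictionary names are illustrative, and only the three IDs listed above are included.

```python
# IDs taken from the list above; other strategies are omitted here.
STRATEGY_ENUM = {"NCCL": 0, "ONESHOT": 4, "TWOSHOT": 5}
ENUM_TO_NAME = {value: name for name, value in STRATEGY_ENUM.items()}

# hidden_size=128 row from the output example (one entry per token count).
row = [0, 0, 4, 4] + [5] * 11
names = [ENUM_TO_NAME[value] for value in row]
print(names[:5])
```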

Supported Fusion Operations

  • NONE: No fusion
  • RESIDUAL_RMS_NORM: Residual + RMS normalization
  • RESIDUAL_RMS_NORM_QUANT_FP8: RESIDUAL_RMS_NORM + FP8 quantization
  • RESIDUAL_RMS_NORM_QUANT_NVFP4: RESIDUAL_RMS_NORM + NVFP4 quantization

Parameter Ranges

  • Tensor Parallel Sizes: 2, 4, 8 (the automatic benchmark step currently runs only TP=2)
  • Hidden Sizes: 128 to 8192 (powers of 2)
  • Token Counts: 1 to 16384 (powers of 2)
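
For a sense of the size of this parameter space, the sketch below enumerates the grid implied by the ranges above. The helper powers_of_two and the constant names are illustrative and not taken from the script.

```python
from itertools import product


def powers_of_two(lo, hi):
    """Return lo, 2*lo, 4*lo, ... up to and including hi."""
    values = []
    value = lo
    while value <= hi:
        values.append(value)
        value *= 2
    return values


TP_SIZES = [2, 4, 8]
HIDDEN_SIZES = powers_of_two(128, 8192)   # 7 values
TOKEN_COUNTS = powers_of_two(1, 16384)    # 15 values

# One entry per (tp_size, hidden_size, num_tokens) for each fusion operation.
grid = list(product(TP_SIZES, HIDDEN_SIZES, TOKEN_COUNTS))
print(len(HIDDEN_SIZES), len(TOKEN_COUNTS), len(grid))
```

The 15 token counts correspond to the 15 columns in each row of the output example.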

Performance Tips

  • Run benchmarks on target hardware for accurate results
  • Use multiple runs and average results for stability
  • Benchmark the fusion operations your workload actually uses, since the lookup table is indexed per fusion operation
  • Monitor GPU memory usage during benchmarking
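
The advice to average several runs can be sketched as follows. The sample layout and the timings are made up for illustration, and the benchmark itself may already aggregate repeated measurements internally.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples from three runs: ((strategy, num_tokens), time_us).
samples = [
    (("ONESHOT", 4), 10.0),
    (("ONESHOT", 4), 12.0),
    (("ONESHOT", 4), 11.0),
    (("TWOSHOT", 4), 13.0),
    (("TWOSHOT", 4), 15.0),
    (("TWOSHOT", 4), 14.0),
]


def average_runs(samples):
    """Group timings by configuration and return the mean of each group."""
    grouped = defaultdict(list)
    for key, time_us in samples:
        grouped[key].append(time_us)
    return {key: mean(times) for key, times in grouped.items()}


averaged = average_runs(samples)
print(averaged)
```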

Integration with TensorRT-LLM

The generated lookup tables can be integrated into TensorRT-LLM's AllReduce implementation to automatically select optimal strategies based on runtime parameters. The C++ arrays follow the format expected by the TensorRT-LLM AllReduce subsystem.
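
As an illustration of selecting a strategy from runtime parameters, the sketch below indexes the two rows from the output example. It assumes hidden sizes start at 128, token counts start at 1, and both are exact powers of two. The actual C++ lookup may round or clamp inputs differently, and the names pow2_index and lookup_strategy are hypothetical.

```python
def pow2_index(value, base):
    """Index of value in the sequence base, 2*base, 4*base, ... (exact match only)."""
    ratio, remainder = divmod(value, base)
    if value < base or remainder or ratio & (ratio - 1):
        raise ValueError(f"{value} is not base {base} times a power of two")
    return ratio.bit_length() - 1


# Rows for hidden_size=128 and hidden_size=256 (TP=2, Fusion=NONE) from the example.
ROWS = [
    [0, 0, 4, 4] + [5] * 11,
    [0, 4, 4, 4] + [5] * 11,
]


def lookup_strategy(hidden_size, num_tokens):
    """Return the strategy ID stored for the given hidden size and token count."""
    return ROWS[pow2_index(hidden_size, 128)][pow2_index(num_tokens, 1)]


print(lookup_strategy(128, 1), lookup_strategy(256, 2), lookup_strategy(128, 16))
```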

Contributing

When adding new strategies or fusion operations:

  1. Update Configuration: Modify the Constants class in allreduce_heuristic_code_gen.py
  2. Add Strategy Mapping: Update the strategy_name_to_enum dictionary with new strategy entries
  3. Generate New Lookup Tables: Run allreduce_heuristic_code_gen.py to create updated lookup tables for optimal AllReduce strategies
  4. Integrate into Codebase: Copy the generated C++ array into the appropriate lookup table in cpp/tensorrt_llm/common/customAllReduceUtils.h
  5. Update Visualizations: Modify color schemes in allreduce_perf_viz.py if needed for new strategies
  6. Validate: Test with representative workloads to ensure performance improvements