mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Yukun He fd4311e6a3 2025-10-16 14:15:25 +08:00
[TRTLLM-8129][feat] Allreduce tuning and benchmark script revising (#7870)

Because we have encountered some perf regression due to using a one-shot kernel instead of NCCL on A100/H100, it will be beneficial if we can have a solid benchmarking of allreduce Op and analyze the data collected from it.

Implemented new AllreduceOp heuristic:
- Added Linear programming-based heuristic implementation.
- Added LUT-based heuristic implementation and corresponding code generation script.

AllreduceOp minor fixing:
- Fixed a minor issue in AllreduceOp, that the strategy can not be overridden when ONESHOT or TWOSHOT is set.
- Fixed a minor TWOSHOT kernel perf issue.
- Cleaned up Dispatching code in AllReduceOp.

This PR will fix the perf gaps reported in: https://nvbugspro.nvidia.com/bug/5517023

For Deepseek-R1, it shows a performance gain of about 3-4% in concurrency levels of 256 and 512.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
README.md

AllReduce Performance Offline Autotuning and Visualization Tools

This directory contains tools for benchmarking, analyzing, and visualizing AllReduce performance in TensorRT-LLM. The toolkit consists of two main components:

allreduce_heuristic_code_gen.py - Generates optimized C++ lookup tables for AllReduce strategy selection
allreduce_perf_viz.py - Creates comprehensive visualizations of AllReduce performance data

Overview

The AllReduce performance analysis workflow involves:

Benchmarking: Run performance tests across different configurations
Analysis: Generate optimal strategy lookup tables
Visualization: Create heatmaps and performance comparison charts

Prerequisites

TensorRT-LLM environment with MPI support
Python packages: pandas, numpy, matplotlib, seaborn, scipy
CUDA-capable GPU(s) for benchmarking

Tool 1: allreduce_heuristic_code_gen.py

Purpose

Generates C++ lookup tables that contain the optimal AllReduce strategy for different parameter combinations (tensor parallel size, fusion operations, hidden sizes, and token counts).

Usage

Basic Usage (Auto-benchmark and generate)

python allreduce_heuristic_code_gen.py

Advanced Usage

python allreduce_heuristic_code_gen.py \
    --data_dir /path/to/benchmark/data \
    --sm_version 89 \
    --save_csv_dir /path/to/save/csv \
    --enable_auto

Parameters

--data_dir: Directory containing existing benchmark CSV files (optional)
--sm_version: CUDA SM version (auto-detected if not specified)
--save_csv_dir: Directory to save benchmark CSV files
--enable_auto: Enable AUTO strategy in benchmarking

Workflow

Benchmark Generation (only if --data_dir is not provided):
- Automatically runs the all_reduce.py microbenchmark using MPI
- Runs once per configured tensor parallel size (currently only TP=2)
- Generates CSV files: benchmark.tp{size}.sm{version}.csv
Strategy Analysis:
- Loads benchmark data from CSV files
- Filters data based on predefined thresholds
- Finds optimal strategy for each parameter combination
- Creates 4D lookup table: [tp_size][fusion][hidden_size][num_tokens]
Code Generation:
- Converts lookup table to C++ array format
- Outputs to gen_heuristic_code/generated_lookup_table.cpp
- Ready for integration into TensorRT-LLM codebase
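
The Strategy Analysis step above can be sketched as follows. This is a minimal illustration using only the standard library, not the script's actual implementation: the row fields (`fusion`, `hidden_size`, `num_tokens`, `strategy`, `time_us`) and the sample timings are hypothetical, and the threshold-based filtering is omitted.

```python
# Hypothetical benchmark rows. The field names and timings are assumptions
# made for illustration; they are not the script's actual CSV schema.
rows = [
    {"fusion": "NONE", "hidden_size": 128, "num_tokens": 1, "strategy": "NCCL", "time_us": 9.0},
    {"fusion": "NONE", "hidden_size": 128, "num_tokens": 1, "strategy": "ONESHOT", "time_us": 4.0},
    {"fusion": "NONE", "hidden_size": 128, "num_tokens": 1, "strategy": "TWOSHOT", "time_us": 6.0},
    {"fusion": "NONE", "hidden_size": 128, "num_tokens": 1024, "strategy": "NCCL", "time_us": 30.0},
    {"fusion": "NONE", "hidden_size": 128, "num_tokens": 1024, "strategy": "ONESHOT", "time_us": 42.0},
    {"fusion": "NONE", "hidden_size": 128, "num_tokens": 1024, "strategy": "TWOSHOT", "time_us": 25.0},
]


def best_strategy_per_config(rows):
    """Return {(fusion, hidden_size, num_tokens): fastest strategy name}."""
    best = {}
    for row in rows:
        key = (row["fusion"], row["hidden_size"], row["num_tokens"])
        # Keep the strategy with the lowest measured time for this key.
        if key not in best or row["time_us"] < best[key][1]:
            best[key] = (row["strategy"], row["time_us"])
    return {key: name for key, (name, _) in best.items()}


best = best_strategy_per_config(rows)
for key, name in sorted(best.items()):
    print(key, "->", name)
```

Each winning strategy name would then be mapped to its numeric ID and written into the corresponding cell of the lookup table.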

Output Example

// AllReduce lookup: [tp][fusion][hidden][tokens] = strategy
// TP:[2, 4, 8] Fusion:['NONE', 'RESIDUAL_RMS_NORM', ...]
inline AllReduceBestStrategyTableType AllReduceBestStrategyTableSM89 = {
    {
        // TP=2
        { // Fusion=NONE
            {0,0,4,4,5,5,5,5,5,5,5,5,5,5,5}, // hidden_size=128
            {0,4,4,4,5,5,5,5,5,5,5,5,5,5,5}, // hidden_size=256
            // ... more rows
        },
        // ... more fusion types
    },
    // ... more TP sizes
};
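
As a rough illustration of how one row of this table could be rendered from Python, the sketch below formats a list of strategy IDs as a C++ brace initializer with a trailing comment. `format_row` is a hypothetical helper, not a function from allreduce_heuristic_code_gen.py.

```python
def format_row(strategy_ids, hidden_size):
    """Render one row of strategy IDs as a C++ brace-initializer line.

    Hypothetical helper for illustration; the generator script may build
    its output differently.
    """
    values = ",".join(str(s) for s in strategy_ids)
    return f"{{{values}}}, // hidden_size={hidden_size}"


# One entry per token count (1 to 16384 in powers of 2 gives 15 entries).
row = [0, 0, 4, 4] + [5] * 11
line = format_row(row, 128)
print(line)
```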

Tool 2: allreduce_perf_viz.py

Purpose

Creates comprehensive visualizations of AllReduce performance data including performance heatmaps, strategy comparison charts, and difference analysis.

Usage

Basic Usage

python allreduce_perf_viz.py --data_dir /path/to/benchmark/data

Parameters

--data_dir: Directory containing benchmark CSV files (default: 'data')

Generated Visualizations

The tool generates three types of visualizations for each configuration:

1. Performance Heatmaps (`*_heatmap.png`)

Purpose: Show the raw measured latency of each strategy
Layout: Side-by-side heatmaps, one per AllReduce strategy
Axes: X=num_tokens, Y=hidden_size
Colors: Latency in microseconds (μs)
Features: Shared colorbar, logarithmic scaling for better visualization

2. Best Strategy Maps (`*_best_strategy.png`)

Purpose: Show optimal strategy for each parameter combination
Layout: Single heatmap with categorical colors
Axes: X=num_tokens, Y=hidden_size
Colors: Different strategies (NCCL, ONESHOT, TWOSHOT, etc.)
Features: Custom colorbar with strategy labels, distribution statistics
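
The distribution statistics mentioned above amount to counting how often each strategy wins. A minimal sketch, assuming the best-strategy IDs for one hidden size are already available as a flat list (the sample values reuse the hidden_size=128 row from the output example):

```python
from collections import Counter

# Best-strategy IDs for one hidden size, one entry per token count.
best_ids = [0, 0, 4, 4] + [5] * 11

counts = Counter(best_ids)
# Share of cells won by each strategy, as a percentage.
share = {sid: 100.0 * n / len(best_ids) for sid, n in counts.items()}

for sid in sorted(counts):
    print(f"strategy {sid}: {counts[sid]} cells ({share[sid]:.1f}%)")
```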

3. Strategy Difference Heatmaps (`*_strategy_difference_heatmap.png`)

Purpose: Show how much slower each strategy is than the optimal one for each parameter combination
Layout: Side-by-side heatmaps for each strategy
Axes: X=num_tokens, Y=hidden_size
Colors: Percentage difference from best strategy (white=optimal, red=slower)
Features: Annotated cells with exact difference values
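
The percentage difference shown in these heatmaps can be computed per cell as (time - best_time) / best_time * 100. The sketch below assumes this definition, which is inferred from the description above rather than taken from the script; the timings are made up.

```python
def percent_slower_than_best(times_us):
    """Map each strategy to its slowdown relative to the fastest, in percent.

    times_us: {strategy name: measured time in microseconds} for one
    (hidden_size, num_tokens) cell. The fastest strategy maps to 0.0.
    """
    best = min(times_us.values())
    return {name: (t - best) / best * 100.0 for name, t in times_us.items()}


# Hypothetical timings for a single cell.
diff = percent_slower_than_best({"NCCL": 12.0, "ONESHOT": 8.0, "TWOSHOT": 10.0})
print(diff)
```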

Visualization Functions

The script provides three main visualization functions that can be used programmatically:

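# Assumes the script's directory is on the import path and that df is a
# benchmark CSV loaded with pandas (e.g. data/benchmark.tp2.sm89.csv).
import pandas as pd
from allreduce_perf_viz import (visualize_2d_best_strategy, visualize_2d_heatmap,
                                visualize_strategy_difference_heatmaps)

df = pd.read_csv('data/benchmark.tp2.sm89.csv')

# 1. Performance heatmaps
visualize_2d_heatmap(df, fusion_op='NONE', save_path='heatmap.png')

# 2. Best strategy visualization
visualize_2d_best_strategy(df, fusion_op='NONE', save_path='best_strategy.png')

# 3. Strategy difference analysis
visualize_strategy_difference_heatmaps(df, fusion_op='NONE', save_path='diff.png')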

Output Structure

data/
├── viz/
│   ├── NONE/
│   │   ├── benchmark.tp2.sm89_heatmap.png
│   │   ├── benchmark.tp2.sm89_best_strategy.png
│   │   └── benchmark.tp2.sm89_strategy_difference_heatmap.png
│   ├── RESIDUAL_RMS_NORM/
│   │   └── ... (similar files)
│   └── ... (other fusion operations)
└── benchmark.tp2.sm89.csv
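
The naming scheme in this tree can be reproduced with a few lines of Python. This is a sketch based only on the layout shown above; `parse_benchmark_name` and `viz_paths` are hypothetical helpers, not functions from the scripts.

```python
import re
from pathlib import Path

# Matches names of the form benchmark.tp{size}.sm{version}.csv
FILENAME_PATTERN = re.compile(r"benchmark\.tp(\d+)\.sm(\d+)\.csv")


def parse_benchmark_name(csv_path):
    """Extract (tp_size, sm_version) from a benchmark CSV file name."""
    match = FILENAME_PATTERN.fullmatch(Path(csv_path).name)
    if match is None:
        raise ValueError(f"unexpected benchmark file name: {csv_path}")
    return int(match.group(1)), int(match.group(2))


def viz_paths(csv_path, fusion_op):
    """List the three image paths expected for one CSV and fusion operation."""
    csv_path = Path(csv_path)
    out_dir = csv_path.parent / "viz" / fusion_op
    suffixes = ("heatmap", "best_strategy", "strategy_difference_heatmap")
    return [out_dir / f"{csv_path.stem}_{suffix}.png" for suffix in suffixes]


tp_size, sm_version = parse_benchmark_name("data/benchmark.tp2.sm89.csv")
paths = viz_paths("data/benchmark.tp2.sm89.csv", "NONE")
for p in paths:
    print(p.as_posix())
```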

Configuration Details

Supported Strategies

NCCL (0): Standard NCCL AllReduce
ONESHOT (4): Custom single-phase AllReduce
TWOSHOT (5): Custom two-phase AllReduce
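
For reference, the IDs above can be expressed in Python as an enum with a name-to-ID mapping. The class name `AllReduceStrategy` is an assumption for this sketch, and the actual `strategy_name_to_enum` dictionary in the script may be defined differently; only the three names and values listed above are taken from this document.

```python
from enum import IntEnum


class AllReduceStrategy(IntEnum):
    """Strategy IDs as listed above (illustrative class, not the script's)."""
    NCCL = 0
    ONESHOT = 4
    TWOSHOT = 5


# Name-to-ID mapping in the spirit of strategy_name_to_enum.
strategy_name_to_enum = {s.name: int(s) for s in AllReduceStrategy}
print(strategy_name_to_enum)
```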

Supported Fusion Operations

NONE: No fusion
RESIDUAL_RMS_NORM: Residual + RMS normalization
RESIDUAL_RMS_NORM_QUANT_FP8: RESIDUAL_RMS_NORM + FP8 quantization
RESIDUAL_RMS_NORM_QUANT_NVFP4: RESIDUAL_RMS_NORM + NVFP4 quantization

Parameter Ranges

Tensor Parallel Sizes: 2, 4, 8
Hidden Sizes: 128 to 8192 (powers of 2)
Token Counts: 1 to 16384 (powers of 2)
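
These ranges determine the shape of the 4D lookup table. The sketch below enumerates the grid and maps a parameter combination to table indices. It assumes each axis is stored in ascending order and that only exact grid values are looked up; how off-grid values are handled at runtime is not covered here.

```python
TP_SIZES = [2, 4, 8]
FUSION_OPS = [
    "NONE",
    "RESIDUAL_RMS_NORM",
    "RESIDUAL_RMS_NORM_QUANT_FP8",
    "RESIDUAL_RMS_NORM_QUANT_NVFP4",
]
HIDDEN_SIZES = [2 ** e for e in range(7, 14)]   # 128 to 8192
TOKEN_COUNTS = [2 ** e for e in range(0, 15)]   # 1 to 16384


def table_indices(tp_size, fusion_op, hidden_size, num_tokens):
    """Return (tp, fusion, hidden, tokens) indices for an exact grid point."""
    return (
        TP_SIZES.index(tp_size),
        FUSION_OPS.index(fusion_op),
        HIDDEN_SIZES.index(hidden_size),
        TOKEN_COUNTS.index(num_tokens),
    )


total_entries = (
    len(TP_SIZES) * len(FUSION_OPS) * len(HIDDEN_SIZES) * len(TOKEN_COUNTS)
)
print(len(HIDDEN_SIZES), len(TOKEN_COUNTS), total_entries)
print(table_indices(2, "NONE", 256, 4))
```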

Performance Tips

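Run benchmarks on the target hardware; lookup tables are generated per SM version, so measurements from one GPU architecture should not be reused for another
Repeat each benchmark several times and average the results to reduce run-to-run noise
Benchmark the fusion operations your workload actually uses, since the best strategy is tabulated per fusion operation
Monitor GPU memory usage during benchmarking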

Integration with TensorRT-LLM

The generated lookup tables can be integrated into TensorRT-LLM's AllReduce implementation to automatically select optimal strategies based on runtime parameters. The C++ arrays follow the format expected by the TensorRT-LLM AllReduce subsystem.

Contributing

When adding new strategies or fusion operations:

Update Configuration: Modify the Constants class in allreduce_heuristic_code_gen.py
Add Strategy Mapping: Update strategy_name_to_enum dictionary with new strategy entries
Generate New Lookup Tables: Run allreduce_heuristic_code_gen.py to create updated lookup tables for optimal AllReduce strategies
Integrate into Codebase: Copy the generated C++ array into the appropriate lookup table in cpp/tensorrt_llm/common/customAllReduceUtils.h
Update Visualizations: Modify color schemes in allreduce_perf_viz.py if needed for new strategies
Validate: Test with representative workloads to ensure performance improvements
