Because we encountered a performance regression on A100/H100 from using the one-shot kernel instead of NCCL, it is worth benchmarking the AllReduce op thoroughly and analyzing the collected data.
Implemented new AllreduceOp heuristic:
- Added a linear programming-based heuristic implementation.
- Added a LUT-based heuristic implementation and the corresponding code generation script.
AllreduceOp minor fixes:
- Fixed a minor issue in AllreduceOp where the strategy could not be overridden when ONESHOT or TWOSHOT is set.
- Fixed a minor TWOSHOT kernel perf issue.
- Cleaned up the dispatching code in AllReduceOp.
This PR fixes the perf gaps reported in: https://nvbugspro.nvidia.com/bug/5517023
For DeepSeek-R1, it shows a performance gain of about 3-4% at concurrency levels of 256 and 512.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
AllReduce Performance Offline Autotuning and Visualization Tools
This directory contains tools for benchmarking, analyzing, and visualizing AllReduce performance in TensorRT-LLM. The toolkit consists of two main components:
- allreduce_heuristic_code_gen.py: Generates optimized C++ lookup tables for AllReduce strategy selection
- allreduce_perf_viz.py: Creates comprehensive visualizations of AllReduce performance data
Overview
The AllReduce performance analysis workflow involves:
- Benchmarking: Run performance tests across different configurations
- Analysis: Generate optimal strategy lookup tables
- Visualization: Create heatmaps and performance comparison charts
Prerequisites
- TensorRT-LLM environment with MPI support
- Python packages: pandas, numpy, matplotlib, seaborn, scipy
- CUDA-capable GPU(s) for benchmarking
Tool 1: allreduce_heuristic_code_gen.py
Purpose
Generates C++ lookup tables that contain the optimal AllReduce strategy for different parameter combinations (tensor parallel size, fusion operations, hidden sizes, and token counts).
Usage
Basic Usage (Auto-benchmark and generate)
python allreduce_heuristic_code_gen.py
Advanced Usage
python allreduce_heuristic_code_gen.py \
--data_dir /path/to/benchmark/data \
--sm_version 89 \
--save_csv_dir /path/to/save/csv \
--enable_auto
Parameters
- --data_dir: Directory containing existing benchmark CSV files (optional)
- --sm_version: CUDA SM version (auto-detected if not specified)
- --save_csv_dir: Directory to save benchmark CSV files
- --enable_auto: Enable AUTO strategy in benchmarking
Workflow
- Benchmark Generation (if --data_dir not provided):
  - Automatically runs the all_reduce.py microbenchmark using MPI
  - Tests multiple tensor parallel sizes (currently TP=2)
  - Generates CSV files: benchmark.tp{size}.sm{version}.csv
- Strategy Analysis:
  - Loads benchmark data from CSV files
  - Filters data based on predefined thresholds
  - Finds optimal strategy for each parameter combination
  - Creates 4D lookup table: [tp_size][fusion][hidden_size][num_tokens]
- Code Generation:
  - Converts lookup table to C++ array format
  - Outputs to gen_heuristic_code/generated_lookup_table.cpp
  - Ready for integration into TensorRT-LLM codebase
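The strategy analysis step of the workflow can be sketched roughly as follows. This is a minimal illustration rather than the script's actual code: the column names (tp_size, fusion, hidden_size, num_tokens, strategy, time_us), the sample timings, and the helper best_strategy_per_config are all assumptions made for the example, and the real CSV schema may differ.

```python
import csv
import io

# Hypothetical CSV schema and made-up timings; the real benchmark files may differ.
SAMPLE_CSV = """tp_size,fusion,hidden_size,num_tokens,strategy,time_us
2,NONE,128,1,NCCL,9.5
2,NONE,128,1,ONESHOT,4.2
2,NONE,128,1,TWOSHOT,6.8
2,NONE,128,4096,NCCL,40.0
2,NONE,128,4096,ONESHOT,55.0
2,NONE,128,4096,TWOSHOT,31.5
"""


def best_strategy_per_config(rows):
    """Return {(tp, fusion, hidden, tokens): (strategy, time_us)}, keeping the fastest entry."""
    best = {}
    for row in rows:
        key = (int(row["tp_size"]), row["fusion"],
               int(row["hidden_size"]), int(row["num_tokens"]))
        time_us = float(row["time_us"])
        # Keep this row only if it beats the fastest time seen so far for the same key.
        if key not in best or time_us < best[key][1]:
            best[key] = (row["strategy"], time_us)
    return best


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
best = best_strategy_per_config(rows)
for key, (strategy, time_us) in sorted(best.items()):
    print(key, strategy, time_us)
```

Each key of the resulting dictionary corresponds to one cell of the [tp_size][fusion][hidden_size][num_tokens] table. The threshold-based filtering mentioned in the workflow is left out here.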
Output Example
// AllReduce lookup: [tp][fusion][hidden][tokens] = strategy
// TP:[2, 4, 8] Fusion:['NONE', 'RESIDUAL_RMS_NORM', ...]
inline AllReduceBestStrategyTableType AllReduceBestStrategyTableSM89 = {
{
// TP=2
{ // Fusion=NONE
{0,0,4,4,5,5,5,5,5,5,5,5,5,5,5}, // hidden_size=128
{0,4,4,4,5,5,5,5,5,5,5,5,5,5,5}, // hidden_size=256
// ... more rows
},
// ... more fusion types
},
// ... more TP sizes
};
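To show how rows like the ones above could be produced, here is a small sketch that renders a nested Python structure as C++ brace initializers. The helpers format_row and format_fusion_block are hypothetical names for this example; the generator script may structure its output code differently.

```python
def format_row(strategies, hidden_size):
    """Render one lookup-table row as a C++ brace initializer with a trailing comment."""
    values = ",".join(str(s) for s in strategies)
    return f"{{{values}}}, // hidden_size={hidden_size}"


def format_fusion_block(rows_by_hidden, fusion_name, indent="        "):
    """Render all rows for one fusion op; rows_by_hidden maps hidden_size -> strategy ids."""
    lines = [f"{indent}{{ // Fusion={fusion_name}"]
    for hidden_size in sorted(rows_by_hidden):
        lines.append(f"{indent}    " + format_row(rows_by_hidden[hidden_size], hidden_size))
    lines.append(f"{indent}}},")
    return "\n".join(lines)


# Toy data using the strategy ids 0 (NCCL), 4 (ONESHOT), 5 (TWOSHOT).
table = {128: [0, 0, 4, 4, 5], 256: [0, 4, 4, 4, 5]}
print(format_fusion_block(table, "NONE"))
```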
Tool 2: allreduce_perf_viz.py
Purpose
Creates comprehensive visualizations of AllReduce performance data including performance heatmaps, strategy comparison charts, and difference analysis.
Usage
Basic Usage
python allreduce_perf_viz.py --data_dir /path/to/benchmark/data
Parameters
--data_dir: Directory containing benchmark CSV files (default: 'data')
Generated Visualizations
The tool generates three types of visualizations for each configuration:
1. Performance Heatmaps (*_heatmap.png)
- Purpose: Show raw performance times for each strategy
- Layout: Side-by-side heatmaps for each AllReduce strategy
- Axes: X=num_tokens, Y=hidden_size
- Colors: Performance time in microseconds (μs)
- Features: Shared colorbar, logarithmic scaling for better visualization
2. Best Strategy Maps (*_best_strategy.png)
- Purpose: Show optimal strategy for each parameter combination
- Layout: Single heatmap with categorical colors
- Axes: X=num_tokens, Y=hidden_size
- Colors: Different strategies (NCCL, ONESHOT, TWOSHOT, etc.)
- Features: Custom colorbar with strategy labels, distribution statistics
3. Strategy Difference Heatmaps (*_strategy_difference_heatmap.png)
- Purpose: Show performance difference from optimal strategy
- Layout: Side-by-side heatmaps for each strategy
- Axes: X=num_tokens, Y=hidden_size
- Colors: Percentage difference from best strategy (white=optimal, red=slower)
- Features: Annotated cells with exact difference values
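As a rough illustration of the quantity shown in the difference heatmaps, the sketch below computes each strategy's slowdown relative to the fastest strategy in one cell. It assumes the percentage is defined as (time - best) / best * 100; the function name percent_slower_than_best and the timings are made up for the example, and the script may compute the value differently.

```python
def percent_slower_than_best(times_us):
    """Map strategy -> percentage slowdown relative to the fastest strategy in one cell."""
    best = min(times_us.values())
    return {name: (t - best) / best * 100.0 for name, t in times_us.items()}


# Hypothetical timings (microseconds) for one (hidden_size, num_tokens) cell.
cell = {"NCCL": 12.0, "ONESHOT": 8.0, "TWOSHOT": 10.0}
diff = percent_slower_than_best(cell)
print(diff)
```

Under this definition the optimal strategy always maps to 0, which matches the "white=optimal" end of the color scale.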
Visualization Functions
The script provides three main visualization functions that can be used programmatically:
# 1. Performance heatmaps
visualize_2d_heatmap(df, fusion_op='NONE', save_path='heatmap.png')
# 2. Best strategy visualization
visualize_2d_best_strategy(df, fusion_op='NONE', save_path='best_strategy.png')
# 3. Strategy difference analysis
visualize_strategy_difference_heatmaps(df, fusion_op='NONE', save_path='diff.png')
Output Structure
data/
├── viz/
│ ├── NONE/
│ │ ├── benchmark.tp2.sm89_heatmap.png
│ │ ├── benchmark.tp2.sm89_best_strategy.png
│ │ └── benchmark.tp2.sm89_strategy_difference_heatmap.png
│ ├── RESIDUAL_RMS_NORM/
│ │ └── ... (similar files)
│ └── ... (other fusion operations)
└── benchmark.tp2.sm89.csv
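The naming scheme in the tree above can be expressed with a short sketch. The helpers parse_benchmark_name and viz_paths are hypothetical and only mirror the layout shown here; they are not functions from allreduce_perf_viz.py.

```python
import re
from pathlib import Path

# File-name pattern taken from this README: benchmark.tp{size}.sm{version}.csv
CSV_PATTERN = re.compile(r"benchmark\.tp(\d+)\.sm(\d+)\.csv$")
SUFFIXES = ("heatmap", "best_strategy", "strategy_difference_heatmap")


def parse_benchmark_name(csv_path):
    """Extract (tp_size, sm_version) from a benchmark CSV file name."""
    match = CSV_PATTERN.search(Path(csv_path).name)
    if match is None:
        raise ValueError(f"unexpected file name: {csv_path}")
    return int(match.group(1)), int(match.group(2))


def viz_paths(csv_path, fusion_op):
    """List the three PNG paths expected under <data_dir>/viz/<fusion_op>/."""
    csv_path = Path(csv_path)
    out_dir = csv_path.parent / "viz" / fusion_op
    return [out_dir / f"{csv_path.stem}_{suffix}.png" for suffix in SUFFIXES]


print(parse_benchmark_name("data/benchmark.tp2.sm89.csv"))
for path in viz_paths("data/benchmark.tp2.sm89.csv", "NONE"):
    print(path.as_posix())
```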
Configuration Details
Supported Strategies
- NCCL (0): Standard NCCL AllReduce
- ONESHOT (4): Custom single-phase AllReduce
- TWOSHOT (5): Custom two-phase AllReduce
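The numeric ids listed above are the values that appear in the generated lookup table. One way to keep names and ids together is sketched below. The Strategy class is an illustrative assumption, and the dictionary is only a stand-in for the script's strategy_name_to_enum, whose actual contents may include more entries.

```python
from enum import IntEnum


class Strategy(IntEnum):
    """Strategy ids as listed in this README; the C++ enum may define other ids too."""
    NCCL = 0
    ONESHOT = 4
    TWOSHOT = 5


# Hypothetical stand-in for the script's name-to-enum dictionary.
strategy_name_to_enum = {s.name: int(s) for s in Strategy}

print(strategy_name_to_enum)
print(Strategy(4).name)
```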
Supported Fusion Operations
- NONE: No fusion
- RESIDUAL_RMS_NORM: Residual + RMS normalization
- RESIDUAL_RMS_NORM_QUANT_FP8: RESIDUAL_RMS_NORM + FP8 quantization
- RESIDUAL_RMS_NORM_QUANT_NVFP4: RESIDUAL_RMS_NORM + NVFP4 quantization
Parameter Ranges
- Tensor Parallel Sizes: 2, 4, 8
- Hidden Sizes: 128 to 8192 (powers of 2)
- Token Counts: 1 to 16384 (powers of 2)
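The power-of-two grids above give 7 hidden sizes and 15 token counts, which agrees with the 15 entries per row in the output example. The sketch below builds the grids and maps a grid value to its table index. The helper table_index is hypothetical, and how the C++ side handles values that are not exact grid points is not covered here.

```python
HIDDEN_SIZES = [2 ** e for e in range(7, 14)]   # 128 ... 8192
TOKEN_COUNTS = [2 ** e for e in range(0, 15)]   # 1 ... 16384


def table_index(value, smallest):
    """Index of a power-of-two `value` in a grid starting at power-of-two `smallest`.

    Only the lower bound is checked; the caller must respect the grid's upper bound.
    """
    if value < smallest or (value & (value - 1)) != 0:
        raise ValueError(f"{value} is not a grid point")
    return value.bit_length() - smallest.bit_length()


print(len(HIDDEN_SIZES), len(TOKEN_COUNTS))
print(table_index(256, HIDDEN_SIZES[0]), table_index(16384, TOKEN_COUNTS[0]))
```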
Performance Tips
- Run benchmarks on target hardware for accurate results
- Use multiple runs and average results for stability
- Consider different fusion operations based on your use case
- Monitor GPU memory usage during benchmarking
Integration with TensorRT-LLM
The generated lookup tables can be integrated into TensorRT-LLM's AllReduce implementation to automatically select optimal strategies based on runtime parameters. The C++ arrays follow the format expected by the TensorRT-LLM AllReduce subsystem.
Contributing
When adding new strategies or fusion operations:
- Update Configuration: Modify the Constants class in allreduce_heuristic_code_gen.py
- Add Strategy Mapping: Update the strategy_name_to_enum dictionary with new strategy entries
- Generate New Lookup Tables: Run allreduce_heuristic_code_gen.py to create updated lookup tables for optimal AllReduce strategies
- Integrate into Codebase: Copy the generated C++ array into the appropriate lookup table in cpp/tensorrt_llm/common/customAllReduceUtils.h
- Update Visualizations: Modify color schemes in allreduce_perf_viz.py if needed for new strategies
- Validate: Test with representative workloads to ensure performance improvements