# AllReduce Performance Offline Autotuning and Visualization Tools

This directory contains tools for benchmarking, analyzing, and visualizing AllReduce performance in TensorRT-LLM. The toolkit consists of two main components:

1. **`allreduce_heuristic_code_gen.py`** - Generates optimized C++ lookup tables for AllReduce strategy selection
2. **`allreduce_perf_viz.py`** - Creates comprehensive visualizations of AllReduce performance data

## Overview

The AllReduce performance analysis workflow involves:

1. **Benchmarking**: Run performance tests across different configurations
2. **Analysis**: Generate optimal strategy lookup tables
3. **Visualization**: Create heatmaps and performance comparison charts

## Prerequisites

- TensorRT-LLM environment with MPI support
- Python packages: `pandas`, `numpy`, `matplotlib`, `seaborn`, `scipy` (a quick import check is shown below)
- CUDA-capable GPU(s) for benchmarking

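To confirm the Python dependencies are available before benchmarking, a quick import check such as the following can be run; it only verifies the packages listed above and nothing TensorRT-LLM-specific.

```python
import importlib

# Sanity check: verify the analysis/visualization dependencies listed above.
for pkg in ("pandas", "numpy", "matplotlib", "seaborn", "scipy"):
    importlib.import_module(pkg)
print("All analysis dependencies are importable.")
```
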
## Tool 1: allreduce_heuristic_code_gen.py

### Purpose

Generates C++ lookup tables that contain the optimal AllReduce strategy for different parameter combinations (tensor parallel size, fusion operation, hidden size, and token count).

### Usage

#### Basic Usage (Auto-benchmark and generate)

```bash
python allreduce_heuristic_code_gen.py
```

#### Advanced Usage

```bash
python allreduce_heuristic_code_gen.py \
    --data_dir /path/to/benchmark/data \
    --sm_version 89 \
    --save_csv_dir /path/to/save/csv \
    --enable_auto
```

### Parameters

- `--data_dir`: Directory containing existing benchmark CSV files (optional)
- `--sm_version`: CUDA SM version (auto-detected if not specified)
- `--save_csv_dir`: Directory to save benchmark CSV files
- `--enable_auto`: Enable the AUTO strategy in benchmarking

### Workflow

1. **Benchmark Generation** (if `--data_dir` is not provided):
   - Automatically runs the `all_reduce.py` microbenchmark using MPI
   - Tests multiple tensor parallel sizes (currently TP=2)
   - Generates CSV files: `benchmark.tp{size}.sm{version}.csv`

2. **Strategy Analysis**:
   - Loads benchmark data from CSV files
   - Filters data based on predefined thresholds
   - Finds the optimal strategy for each parameter combination (see the sketch after this list)
   - Creates a 4D lookup table: `[tp_size][fusion][hidden_size][num_tokens]`

3. **Code Generation**:
   - Converts the lookup table to C++ array format
   - Outputs to `gen_heuristic_code/generated_lookup_table.cpp`
   - Ready for integration into the TensorRT-LLM codebase

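At its core, the strategy-analysis step picks, for every (hidden_size, num_tokens) cell, the strategy with the lowest measured latency. The pandas sketch below illustrates that selection; the column names (`fusion_op`, `hidden_size`, `num_tokens`, `strategy`, `time_us`) are assumptions about the benchmark CSV layout rather than the script's actual schema.

```python
import pandas as pd

# Minimal sketch: pick the fastest strategy per (fusion_op, hidden_size, num_tokens)
# cell. Column names are assumed; the real benchmark CSV schema may differ.
df = pd.read_csv("benchmark.tp2.sm89.csv")

best_rows = df.loc[df.groupby(["fusion_op", "hidden_size", "num_tokens"])["time_us"].idxmin()]
best = best_rows[["fusion_op", "hidden_size", "num_tokens", "strategy"]]
print(best.sort_values(["fusion_op", "hidden_size", "num_tokens"]).head())
```
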
### Output Example

```cpp
// AllReduce lookup: [tp][fusion][hidden][tokens] = strategy
// TP:[2, 4, 8] Fusion:['NONE', 'RESIDUAL_RMS_NORM', ...]
inline AllReduceBestStrategyTableType AllReduceBestStrategyTableSM89 = {
    {
        // TP=2
        { // Fusion=NONE
            {0,0,4,4,5,5,5,5,5,5,5,5,5,5,5}, // hidden_size=128
            {0,4,4,4,5,5,5,5,5,5,5,5,5,5,5}, // hidden_size=256
            // ... more rows
        },
        // ... more fusion types
    },
    // ... more TP sizes
};
```

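For orientation, the code-generation step is essentially a pretty-printer that turns the nested lookup table into a C++ brace initializer like the one above. The sketch below is illustrative only; the real generator adds the per-row comments and uses its own formatting.

```python
# Illustrative only: serialize a nested lookup table
# [tp][fusion][hidden][tokens] -> strategy into a C++ brace initializer.
def to_cpp_initializer(table, indent=0):
    pad = "    " * indent
    if table and isinstance(table[0], list):
        inner = ",\n".join(to_cpp_initializer(t, indent + 1) for t in table)
        return f"{pad}{{\n{inner}\n{pad}}}"
    return pad + "{" + ",".join(str(v) for v in table) + "}"

toy_table = [[[[0, 0, 4, 5], [0, 4, 5, 5]]]]  # tiny [tp][fusion][hidden][tokens] example
print("inline AllReduceBestStrategyTableType ToyTable =\n"
      + to_cpp_initializer(toy_table) + ";")
```
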
## Tool 2: allreduce_perf_viz.py

### Purpose

Creates comprehensive visualizations of AllReduce performance data, including performance heatmaps, strategy comparison charts, and difference analysis.

### Usage

#### Basic Usage

```bash
python allreduce_perf_viz.py --data_dir /path/to/benchmark/data
```

### Parameters

- `--data_dir`: Directory containing benchmark CSV files (default: `data`)

### Generated Visualizations

The tool generates three types of visualizations for each configuration:

#### 1. Performance Heatmaps (`*_heatmap.png`)

- **Purpose**: Show raw performance times for each strategy
- **Layout**: Side-by-side heatmaps, one per AllReduce strategy
- **Axes**: X=num_tokens, Y=hidden_size
- **Colors**: Performance time in microseconds (μs)
- **Features**: Shared colorbar, logarithmic scaling for better visualization

#### 2. Best Strategy Maps (`*_best_strategy.png`)

- **Purpose**: Show the optimal strategy for each parameter combination
- **Layout**: Single heatmap with categorical colors
- **Axes**: X=num_tokens, Y=hidden_size
- **Colors**: Different strategies (NCCL, ONESHOT, TWOSHOT, etc.)
- **Features**: Custom colorbar with strategy labels, distribution statistics

#### 3. Strategy Difference Heatmaps (`*_strategy_difference_heatmap.png`)

- **Purpose**: Show each strategy's performance difference from the optimal strategy
- **Layout**: Side-by-side heatmaps, one per strategy
- **Axes**: X=num_tokens, Y=hidden_size
- **Colors**: Percentage difference from the best strategy (white=optimal, red=slower)
- **Features**: Annotated cells with exact difference values

### Visualization Functions

The script provides three main visualization functions that can be used programmatically:

```python
# 1. Performance heatmaps
visualize_2d_heatmap(df, fusion_op='NONE', save_path='heatmap.png')

# 2. Best strategy visualization
visualize_2d_best_strategy(df, fusion_op='NONE', save_path='best_strategy.png')

# 3. Strategy difference analysis
visualize_strategy_difference_heatmaps(df, fusion_op='NONE', save_path='diff.png')
```

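A typical programmatic session first loads one benchmark CSV into a DataFrame and then calls the functions above. The snippet below sketches that flow; importing the functions from `allreduce_perf_viz` and the exact CSV path are assumptions about how the script is packaged and where the data lives.

```python
import pandas as pd

# Assumes the functions are importable from the script; otherwise run this
# code inside allreduce_perf_viz.py itself.
from allreduce_perf_viz import (
    visualize_2d_best_strategy,
    visualize_2d_heatmap,
    visualize_strategy_difference_heatmaps,
)

# Load one benchmark CSV (naming pattern: benchmark.tp{size}.sm{version}.csv).
df = pd.read_csv("data/benchmark.tp2.sm89.csv")

visualize_2d_heatmap(df, fusion_op="NONE", save_path="heatmap.png")
visualize_2d_best_strategy(df, fusion_op="NONE", save_path="best_strategy.png")
visualize_strategy_difference_heatmaps(df, fusion_op="NONE", save_path="diff.png")
```
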
### Output Structure

```
data/
├── viz/
│   ├── NONE/
│   │   ├── benchmark.tp2.sm89_heatmap.png
│   │   ├── benchmark.tp2.sm89_best_strategy.png
│   │   └── benchmark.tp2.sm89_strategy_difference_heatmap.png
│   ├── RESIDUAL_RMS_NORM/
│   │   └── ... (similar files)
│   └── ... (other fusion operations)
└── benchmark.tp2.sm89.csv
```

## Configuration Details

### Supported Strategies

- **NCCL** (0): Standard NCCL AllReduce
- **ONESHOT** (4): Custom single-phase AllReduce
- **TWOSHOT** (5): Custom two-phase AllReduce

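The numbers in parentheses are the enum codes stored in the generated lookup tables and correspond to the `strategy_name_to_enum` mapping mentioned under Contributing. As a rough illustration only (the authoritative mapping lives in `allreduce_heuristic_code_gen.py`):

```python
# Illustrative name -> enum mapping based on the codes listed above;
# the authoritative dictionary lives in allreduce_heuristic_code_gen.py.
strategy_name_to_enum = {
    "NCCL": 0,
    "ONESHOT": 4,
    "TWOSHOT": 5,
}
```
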
### Supported Fusion Operations

- `NONE`: No fusion
- `RESIDUAL_RMS_NORM`: Residual + RMS normalization
- `RESIDUAL_RMS_NORM_QUANT_FP8`: RESIDUAL_RMS_NORM + FP8 quantization
- `RESIDUAL_RMS_NORM_QUANT_NVFP4`: RESIDUAL_RMS_NORM + NVFP4 quantization

### Parameter Ranges

- **Tensor Parallel Sizes**: 2, 4, 8
- **Hidden Sizes**: 128 to 8192 (powers of 2)
- **Token Counts**: 1 to 16384 (powers of 2)

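Together, these ranges define the benchmark sweep grid (7 hidden sizes and 15 token counts, matching the 15-entry rows in the output example above). The sketch below simply enumerates that grid; how the script iterates it, and which combinations it skips, is not shown here.

```python
from itertools import product

# Enumerate the parameter grid implied by the ranges above.
tp_sizes = [2, 4, 8]
fusion_ops = [
    "NONE",
    "RESIDUAL_RMS_NORM",
    "RESIDUAL_RMS_NORM_QUANT_FP8",
    "RESIDUAL_RMS_NORM_QUANT_NVFP4",
]
hidden_sizes = [2**i for i in range(7, 14)]   # 128 .. 8192
token_counts = [2**i for i in range(0, 15)]   # 1 .. 16384

grid = list(product(tp_sizes, fusion_ops, hidden_sizes, token_counts))
print(f"{len(grid)} (tp, fusion, hidden_size, num_tokens) configurations")
```
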
## Performance Tips

- Run benchmarks on target hardware for accurate results
- Use multiple runs and average results for stability
- Consider different fusion operations based on your use case
- Monitor GPU memory usage during benchmarking

## Integration with TensorRT-LLM

The generated lookup tables can be integrated into TensorRT-LLM's AllReduce implementation to automatically select optimal strategies based on runtime parameters. The C++ arrays follow the format expected by the TensorRT-LLM AllReduce subsystem.

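Conceptually, the runtime lookup clamps `hidden_size` and `num_tokens` to the benchmarked power-of-two grid and indexes the 4D table. The Python sketch below mirrors that idea only; the actual C++ accessor in `cpp/tensorrt_llm/common/customAllReduceUtils.h` may map parameters to indices differently.

```python
import math

# Conceptual lookup semantics for table[tp_index][fusion_index][hidden_index][token_index].
# This mirrors the table layout described above; the real C++ accessor may differ.
def bucket_index(value, min_value, max_value):
    clamped = min(max(value, min_value), max_value)
    return int(math.log2(clamped // min_value))

def select_strategy(table, tp_index, fusion_index, hidden_size, num_tokens):
    h = bucket_index(hidden_size, 128, 8192)   # hidden-size rows: 128 .. 8192
    t = bucket_index(num_tokens, 1, 16384)     # token-count columns: 1 .. 16384
    return table[tp_index][fusion_index][h][t]
```
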
## Contributing

When adding new strategies or fusion operations:

1. **Update Configuration**: Modify the `Constants` class in `allreduce_heuristic_code_gen.py` (see the sketch after this list)
2. **Add Strategy Mapping**: Update the `strategy_name_to_enum` dictionary with the new strategy entries
3. **Generate New Lookup Tables**: Rerun `allreduce_heuristic_code_gen.py` to produce updated lookup tables
4. **Integrate into Codebase**: Copy the generated C++ array into the appropriate lookup table in `cpp/tensorrt_llm/common/customAllReduceUtils.h`
5. **Update Visualizations**: Modify the color schemes in `allreduce_perf_viz.py` if needed for new strategies
6. **Validate**: Test with representative workloads to confirm the performance improvements
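
As a rough illustration of steps 1-2, the sketch below registers a hypothetical fusion operation and strategy. The attribute names on `Constants` and the new enum value are assumptions for illustration; the real class in `allreduce_heuristic_code_gen.py` may organize these lists differently.

```python
# Hypothetical sketch of Contributing steps 1-2; attribute names and the new
# enum value are assumed, not taken from the real script.
class Constants:
    TP_SIZES = [2, 4, 8]
    FUSION_OPS = [
        "NONE",
        "RESIDUAL_RMS_NORM",
        "RESIDUAL_RMS_NORM_QUANT_FP8",
        "RESIDUAL_RMS_NORM_QUANT_NVFP4",
        "MY_NEW_FUSION",  # step 1: register the new fusion operation
    ]

strategy_name_to_enum = {
    "NCCL": 0,
    "ONESHOT": 4,
    "TWOSHOT": 5,
    "MY_NEW_STRATEGY": 6,  # step 2: map the new strategy to its (hypothetical) enum value
}
```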