# AllReduce Performance Offline Autotuning and Visualization Tools This directory contains tools for benchmarking, analyzing, and visualizing AllReduce performance in TensorRT-LLM. The toolkit consists of two main components: 1. **`allreduce_heuristic_code_gen.py`** - Generates optimized C++ lookup tables for AllReduce strategy selection 2. **`allreduce_perf_viz.py`** - Creates comprehensive visualizations of AllReduce performance data ## Overview The AllReduce performance analysis workflow involves: 1. **Benchmarking**: Run performance tests across different configurations 2. **Analysis**: Generate optimal strategy lookup tables 3. **Visualization**: Create heatmaps and performance comparison charts ## Prerequisites - TensorRT-LLM environment with MPI support - Python packages: `pandas`, `numpy`, `matplotlib`, `seaborn`, `scipy` - CUDA-capable GPU(s) for benchmarking ## Tool 1: allreduce_heuristic_code_gen.py ### Purpose Generates C++ lookup tables that contain the optimal AllReduce strategy for different parameter combinations (tensor parallel size, fusion operations, hidden sizes, and token counts). ### Usage #### Basic Usage (Auto-benchmark and generate) ```bash python allreduce_heuristic_code_gen.py ``` #### Advanced Usage ```bash python allreduce_heuristic_code_gen.py \ --data_dir /path/to/benchmark/data \ --sm_version 89 \ --save_csv_dir /path/to/save/csv \ --enable_auto ``` ### Parameters - `--data_dir`: Directory containing existing benchmark CSV files (optional) - `--sm_version`: CUDA SM version (auto-detected if not specified) - `--save_csv_dir`: Directory to save benchmark CSV files - `--enable_auto`: Enable AUTO strategy in benchmarking ### Workflow 1. **Benchmark Generation** (if `--data_dir` not provided): - Automatically runs `all_reduce.py` microbenchmark using MPI - Tests multiple tensor parallel sizes (currently TP=2) - Generates CSV files: `benchmark.tp{size}.sm{version}.csv` 2. **Strategy Analysis**: - Loads benchmark data from CSV files - Filters data based on predefined thresholds - Finds optimal strategy for each parameter combination - Creates 4D lookup table: `[tp_size][fusion][hidden_size][num_tokens]` 3. **Code Generation**: - Converts lookup table to C++ array format - Outputs to `gen_heuristic_code/generated_lookup_table.cpp` - Ready for integration into TensorRT-LLM codebase ### Output Example ```cpp // AllReduce lookup: [tp][fusion][hidden][tokens] = strategy // TP:[2, 4, 8] Fusion:['NONE', 'RESIDUAL_RMS_NORM', ...] inline AllReduceBestStrategyTableType AllReduceBestStrategyTableSM89 = { { // TP=2 { // Fusion=NONE {0,0,4,4,5,5,5,5,5,5,5,5,5,5,5}, // hidden_size=128 {0,4,4,4,5,5,5,5,5,5,5,5,5,5,5}, // hidden_size=256 // ... more rows }, // ... more fusion types }, // ... more TP sizes }; ``` ## Tool 2: allreduce_perf_viz.py ### Purpose Creates comprehensive visualizations of AllReduce performance data including performance heatmaps, strategy comparison charts, and difference analysis. ### Usage #### Basic Usage ```bash python allreduce_perf_viz.py --data_dir /path/to/benchmark/data ``` ### Parameters - `--data_dir`: Directory containing benchmark CSV files (default: 'data') ### Generated Visualizations The tool generates three types of visualizations for each configuration: #### 1. Performance Heatmaps (`*_heatmap.png`) - **Purpose**: Show raw performance times for each strategy - **Layout**: Side-by-side heatmaps for each AllReduce strategy - **Axes**: X=num_tokens, Y=hidden_size - **Colors**: Performance time in microseconds (μs) - **Features**: Shared colorbar, logarithmic scaling for better visualization #### 2. Best Strategy Maps (`*_best_strategy.png`) - **Purpose**: Show optimal strategy for each parameter combination - **Layout**: Single heatmap with categorical colors - **Axes**: X=num_tokens, Y=hidden_size - **Colors**: Different strategies (NCCL, ONESHOT, TWOSHOT, etc.) - **Features**: Custom colorbar with strategy labels, distribution statistics #### 3. Strategy Difference Heatmaps (`*_strategy_difference_heatmap.png`) - **Purpose**: Show performance difference from optimal strategy - **Layout**: Side-by-side heatmaps for each strategy - **Axes**: X=num_tokens, Y=hidden_size - **Colors**: Percentage difference from best strategy (white=optimal, red=slower) - **Features**: Annotated cells with exact difference values ### Visualization Functions The script provides three main visualization functions that can be used programmatically: ```python # 1. Performance heatmaps visualize_2d_heatmap(df, fusion_op='NONE', save_path='heatmap.png') # 2. Best strategy visualization visualize_2d_best_strategy(df, fusion_op='NONE', save_path='best_strategy.png') # 3. Strategy difference analysis visualize_strategy_difference_heatmaps(df, fusion_op='NONE', save_path='diff.png') ``` ### Output Structure ``` data/ ├── viz/ │ ├── NONE/ │ │ ├── benchmark.tp2.sm89_heatmap.png │ │ ├── benchmark.tp2.sm89_best_strategy.png │ │ └── benchmark.tp2.sm89_strategy_difference_heatmap.png │ ├── RESIDUAL_RMS_NORM/ │ │ └── ... (similar files) │ └── ... (other fusion operations) └── benchmark.tp2.sm89.csv ``` ## Configuration Details ### Supported Strategies - **NCCL** (0): Standard NCCL AllReduce - **ONESHOT** (4): Custom single-phase AllReduce - **TWOSHOT** (5): Custom two-phase AllReduce ### Supported Fusion Operations - `NONE`: No fusion - `RESIDUAL_RMS_NORM`: Residual + RMS normalization - `RESIDUAL_RMS_NORM_QUANT_FP8`: RESIDUAL_RMS_NORM + FP8 quantization - `RESIDUAL_RMS_NORM_QUANT_NVFP4`: RESIDUAL_RMS_NORM + NVFP4 quantization ### Parameter Ranges - **Tensor Parallel Sizes**: 2, 4, 8 - **Hidden Sizes**: 128 to 8192 (powers of 2) - **Token Counts**: 1 to 16384 (powers of 2) ## Performance Tips - Run benchmarks on target hardware for accurate results - Use multiple runs and average results for stability - Consider different fusion operations based on your use case - Monitor GPU memory usage during benchmarking ## Integration with TensorRT-LLM The generated lookup tables can be integrated into TensorRT-LLM's AllReduce implementation to automatically select optimal strategies based on runtime parameters. The C++ arrays follow the format expected by the TensorRT-LLM AllReduce subsystem. ## Contributing When adding new strategies or fusion operations: 1. **Update Configuration**: Modify the `Constants` class in `allreduce_heuristic_code_gen.py` 2. **Add Strategy Mapping**: Update `strategy_name_to_enum` dictionary with new strategy entries 3. **Generate New Lookup Tables**: Run `allreduce_heuristic_code_gen.py` to create updated lookup tables for optimal AllReduce strategies 4. **Integrate into Codebase**: Copy the generated C++ array into the appropriate lookup table in `cpp/tensorrt_llm/common/customAllReduceUtils.h` 5. **Update Visualizations**: Modify color schemes in `allreduce_perf_viz.py` if needed for new strategies 6. **Validate**: Test with representative workloads to ensure performance improvements