# Low-Precision-AllReduce
```{note}
This feature is optimized for PCIe-based GPU topologies and may affect model accuracy. Please evaluate the precision impact for your specific workload.
```
TRT-LLM supports `low-precision-allreduce`, a communication optimization that accelerates AllReduce operations in PCIe-based GPU environments. This feature quantizes FP16/BF16 data to FP8 during network transmission, reducing communication volume and improving performance.
## Algorithm
The Low-Precision-AllReduce algorithm works by:
1. Quantizing input FP16/BF16 tensors to FP8 format before network transmission.

   **Quantization details**: A "per-warp" quantization scheme is used, in which each CUDA warp (32 threads) processes one batch of data. Within a warp, 31 threads quantize FP16/BF16 values to FP8 e4m3 format (16 bytes per thread), while the last thread transmits a scalar value. Each warp therefore quantizes 496 elements plus one scalar at a time (see the sketch below).
2. Transmitting the quantized data over the network.
3. Dequantizing the received data back to the original precision.
4. Performing the reduction operation.

In 8-GPU scenarios, this approach shifts the communication bottleneck from the cross-NUMA QPI link to the PCIe switch, resulting in better overall performance.
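For illustration, the per-warp layout can be pictured as a standalone CUDA kernel. This is a minimal sketch, not the TensorRT-LLM kernel: the function name, the placeholder scale handling, and the exact treatment of the trailing scalar are assumptions made only to visualize the 31-payload-lanes-plus-one-scalar layout.

```cuda
// Simplified illustration of the per-warp layout described above; NOT the
// actual TensorRT-LLM kernel. Assumes CUDA 11.8+ and an FP8-capable GPU.
#include <cuda_fp16.h>
#include <cuda_fp8.h>

// Each warp handles 31 * 16 = 496 FP16/BF16 elements plus one scalar slot.
constexpr int kElemsPerLane = 16;
constexpr int kPayloadLanes = 31;
constexpr int kElemsPerWarp = kPayloadLanes * kElemsPerLane; // 496

__global__ void quantizePerWarp(const __half* in, __nv_fp8_e4m3* out,
                                float* warpScalars, int numWarps)
{
    int globalTid = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = globalTid / 32;
    int lane = globalTid % 32;
    if (warpId >= numWarps)
        return;

    const __half* warpIn = in + static_cast<size_t>(warpId) * kElemsPerWarp;
    __nv_fp8_e4m3* warpOut = out + static_cast<size_t>(warpId) * kElemsPerWarp;

    if (lane < kPayloadLanes)
    {
        // 31 lanes each convert 16 FP16 values to FP8 e4m3, i.e. 16 bytes of
        // payload per thread. A real kernel would apply a quantization scale.
        for (int i = 0; i < kElemsPerLane; ++i)
        {
            float v = __half2float(warpIn[lane * kElemsPerLane + i]);
            warpOut[lane * kElemsPerLane + i] = __nv_fp8_e4m3(v);
        }
    }
    else
    {
        // The last lane transmits the per-warp scalar (e.g. a quantization
        // scale). A placeholder value stands in for the real computation here.
        warpScalars[warpId] = 1.0f;
    }
}
```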
## Topology Requirements
![8x L20/L40s Node Architecture](images/8x_l20_L40S_node_architecture.png)

Low-Precision-AllReduce is specifically designed for the topology shown above, where:
- Each node contains 2 NUMA domains
- Each NUMA domain has 4 GPUs connected via PCIe switch
- GPUs within the same NUMA node communicate via the PCIe switch

**Important:** This optimization will not improve performance in other topologies (for example, where each GPU sits in its own NUMA domain).
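To check whether a node matches this layout, the GPU interconnect matrix can be inspected with standard NVIDIA tooling (the exact labels depend on the system):

```bash
# Print the GPU interconnect matrix. GPUs behind the same PCIe switch typically
# report PIX or PXB connectivity, while GPUs in different NUMA domains report
# SYS (traffic crosses the SMP interconnect, e.g. QPI).
nvidia-smi topo -m
```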
## Usage
The Low-Precision-AllReduce algorithm can be enabled in two ways:
1. **Direct specification** in your code:
   ```python
   allreduce = AllReduce(mapping=mapping, strategy=AllReduceStrategy.LOWPRECISION)
   ```
2. **Environment variable control** with AUTO strategy:
   ```python
   # In your code
   allreduce = AllReduce(mapping=mapping, strategy=AllReduceStrategy.AUTO)
   ```
   ```bash
   # Set the environment variable before running
   export FORCE_LOW_PRECISION_ALL_REDUCE_STRATEGY=1
   ```
## Performance and Accuracy Considerations
Low-Precision-AllReduce reduces communication volume by using FP8 data format for transmission. This optimization:
- Improves performance for large message sizes in PCIe-based topologies
- May slightly reduce numerical precision
- Automatically falls back to other strategies when no performance benefit is expected (e.g., with NVLink or small messages)

Users should evaluate the precision impact on their specific models and workloads.
## Environment Variables
- `FORCE_LOW_PRECISION_ALL_REDUCE_STRATEGY`: When set to `1`, forces the use of the low-precision algorithm with the AUTO strategy. If the algorithm determines it cannot provide a performance benefit, it automatically falls back to other strategies.

**Note**: If TensorRT-LLM is compiled without the `ENABLE_FP8` option, enabling Low-Precision-AllReduce will not take effect.
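As a rough illustration of this build-time requirement (the exact build steps depend on your TensorRT-LLM version and build scripts; `ENABLE_FP8` is the compile option referenced in the note above):

```bash
# Illustrative sketch only; exact build steps vary across TensorRT-LLM versions.
# 1) The FP8 code path must be compiled in, e.g. by passing the CMake option:
#      -DENABLE_FP8=ON
# 2) At runtime, the low-precision path can then be forced under the AUTO strategy:
export FORCE_LOW_PRECISION_ALL_REDUCE_STRATEGY=1
```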