# Low-Precision-AllReduce

```{note}
This feature is optimized for PCIe-based GPU topologies and may affect model accuracy. Please evaluate precision impact for your specific workload.
```

TRT-LLM supports `low-precision-allreduce`, a communication optimization that accelerates AllReduce operations in PCIe-based GPU environments. This feature quantizes FP16/BF16 data to FP8 during network transmission, reducing communication volume and improving performance.

## Algorithm

The Low-Precision-AllReduce algorithm works by:

1. Quantizing input FP16/BF16 tensors to FP8 format before network transmission

   **Quantization details**: We use a "per-warp" quantization approach where each CUDA warp (32 threads) processes a batch of data. In each warp, 31 threads quantize FP16/BF16 values to FP8 e4m3 format (16 bytes per thread), while the last thread transmits a scalar value. This results in each warp collectively quantizing 496 elements plus one scalar at a time (see the illustrative sketch at the end of this section).

2. Transmitting the quantized data through the network

3. Dequantizing received data back to the original precision

4. Performing the reduction operation

In 8-GPU scenarios, this approach shifts the communication bottleneck from cross-NUMA QPI to the PCIe switch, resulting in better overall performance.
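
For illustration only, the sketch below shows one way the per-warp packing from step 1 could be written in CUDA. It is not the TensorRT-LLM kernel: the kernel name, the per-warp FP32 dequantization scale, and the use of 448 (the FP8 e4m3 maximum) as the scaling target are assumptions made for this example.

```
// Illustrative sketch of per-warp FP16 -> FP8 (e4m3) packing (not the TRT-LLM kernel).
// One warp handles 496 FP16 elements: lanes 0..30 each quantize 16 values,
// lane 31 contributes the per-warp scalar instead of payload data.
#include <cuda_fp16.h>
#include <cuda_fp8.h>

constexpr int kElemsPerLane = 16;                            // 16 FP8 bytes per payload lane
constexpr int kPayloadLanes = 31;                            // lanes that carry quantized data
constexpr int kElemsPerWarp = kElemsPerLane * kPayloadLanes; // 496 elements per warp

__global__ void quantize_per_warp(const __half* in, __nv_fp8_e4m3* out_payload,
                                  float* out_scale, int num_warps)
{
    const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    const int lane    = threadIdx.x % 32;
    if (warp_id >= num_warps) return;

    const __half* src = in + warp_id * kElemsPerWarp;

    // 1) Each payload lane finds the absolute maximum of its 16 elements.
    float amax = 0.f;
    if (lane < kPayloadLanes) {
        for (int i = 0; i < kElemsPerLane; ++i)
            amax = fmaxf(amax, fabsf(__half2float(src[lane * kElemsPerLane + i])));
    }
    // 2) Reduce the maximum across the warp so every lane sees the same value.
    for (int offset = 16; offset > 0; offset /= 2)
        amax = fmaxf(amax, __shfl_xor_sync(0xffffffff, amax, offset));

    // 3) Scale so the largest magnitude maps near the FP8 e4m3 maximum (448).
    const float scale = amax > 0.f ? 448.f / amax : 1.f;

    if (lane < kPayloadLanes) {
        // Lanes 0..30: quantize 16 FP16 values each into FP8 e4m3.
        for (int i = 0; i < kElemsPerLane; ++i) {
            const float v = __half2float(src[lane * kElemsPerLane + i]) * scale;
            out_payload[warp_id * kElemsPerWarp + lane * kElemsPerLane + i] = __nv_fp8_e4m3(v);
        }
    } else {
        // Lane 31: transmit the scalar (here, the dequantization scale) for this warp.
        out_scale[warp_id] = 1.f / scale;
    }
}
```

Whatever the exact scalar encoding, the payload per element shrinks from 2 bytes to 1 byte before it crosses PCIe; with a shared per-warp scale, the accuracy impact depends on the local dynamic range of the data, which is why the note above recommends evaluating precision per workload.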

## Topology Requirements



Low-Precision-AllReduce is specifically designed for the topology shown above, where:

- Each node contains 2 NUMA domains
- Each NUMA domain has 4 GPUs connected via a PCIe switch
- GPUs within the same NUMA node communicate via the PCIe switch

**Important:** This optimization will not accelerate performance in different topologies (e.g., where each GPU is in a separate NUMA domain).
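
To check whether a system matches this layout, `nvidia-smi topo -m` prints the GPU interconnect matrix and NUMA/CPU affinity, which can be used to confirm the topology before enabling this feature.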

## Usage

The Low-Precision-AllReduce algorithm can be enabled in two ways:

1. **Direct specification** in your code:

   ```
   AllReduce allreduce(mapping=mapping, strategy=AllReduceStrategy.LOWPRECISION);
   ```

2. **Environment variable control** with AUTO strategy:

   ```
   // In your code
   AllReduce allreduce(mapping=mapping, strategy=AllReduceStrategy.AUTO);

   // Set environment variable before running
   export FORCE_LOW_PRECISION_ALL_REDUCE_STRATEGY=1
   ```


## Performance and Accuracy Considerations

Low-Precision-AllReduce reduces communication volume by using FP8 data format for transmission. This optimization:

- Improves performance for large message sizes in PCIe-based topologies
- May slightly reduce numerical precision
- Automatically falls back to other strategies when no performance benefit is expected (e.g., with NVLink or small messages)
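
As a rough sizing example based on the per-warp layout above (and assuming the transmitted scalar occupies one extra 16-byte lane): 496 FP16/BF16 elements take 992 bytes before quantization, while the FP8 payload plus the scalar take about 512 bytes on the wire, so the transmitted volume drops to roughly half.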

Users should evaluate the precision impact on their specific models and workloads.

## Environment Variables

- `FORCE_LOW_PRECISION_ALL_REDUCE_STRATEGY`: When set to `1`, forces the use of the low-precision algorithm with the AUTO strategy. If the algorithm determines it cannot provide performance benefits, it will automatically fall back to other strategies.

**Note**: When TensorRT-LLM is compiled without the `ENABLE_FP8` option, enabling Low-Precision-AllReduce has no effect.
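
For source builds, this usually corresponds to the `ENABLE_FP8` CMake option (for example, `-DENABLE_FP8=ON` at configure time); check how your build script forwards CMake definitions, as the exact flag it exposes may differ.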