# Low-Precision-AllReduce

> **Note:** This feature is optimized for PCIe-based GPU topologies and may affect model accuracy. Please evaluate the precision impact for your specific workload.

TensorRT-LLM supports Low-Precision-AllReduce, a communication optimization that accelerates AllReduce operations in PCIe-based GPU environments. The feature quantizes FP16/BF16 data to FP8 for network transmission, reducing communication volume and improving performance.

## Algorithm

The Low-Precision-AllReduce algorithm works by:

1. Quantizing input FP16/BF16 tensors to FP8 format before network transmission.

   Quantization details: we use a "per-warp" quantization approach, where each CUDA warp (32 threads) processes one chunk of data. Within a warp, 31 threads quantize FP16/BF16 values to FP8 e4m3 format (16 bytes per thread), while the last thread transmits a scalar value. Each warp therefore collectively quantizes 496 elements plus one scalar at a time (see the sketch at the end of this section).

2. Transmitting the quantized data through the network.

3. Dequantizing the received data back to the original precision.

4. Performing the reduction operation.

In 8-GPU scenarios, this approach shifts the communication bottleneck from cross-NUMA QPI to the PCIe switch, resulting in better overall performance.
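
The per-warp layout can be illustrated with a small numeric sketch. The snippet below is not the actual CUDA kernel; it only emulates the data layout in NumPy. Treating the scalar as a per-warp scale and using a max-based scaling rule are assumptions made for illustration, and e4m3 rounding is approximated by clipping to the representable range.

```python
import numpy as np

# Per-warp layout from the steps above: 31 of a warp's 32 threads each quantize
# 16 FP16/BF16 values into 16 FP8 (e4m3) bytes; the last thread carries one
# scalar (assumed here to be the per-warp quantization scale).
E4M3_MAX = 448.0                                  # largest finite e4m3 magnitude
ELEMS_PER_THREAD = 16
QUANT_THREADS = 31
WARP_ELEMS = QUANT_THREADS * ELEMS_PER_THREAD     # 496 elements per warp

def quantize_warp_chunk(x: np.ndarray):
    """496 high-precision values -> 496 values in the e4m3 range + 1 scale.

    True e4m3 rounding is only approximated here by clipping to the
    representable range; the real kernel stores genuine FP8 bytes.
    """
    assert x.size == WARP_ELEMS
    scale = np.abs(x).max() / E4M3_MAX            # assumed per-warp scaling rule
    q = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)
    return q, np.float32(scale)

def dequantize_warp_chunk(q: np.ndarray, scale: np.float32) -> np.ndarray:
    return q * scale

x = np.random.randn(WARP_ELEMS).astype(np.float32)
q, s = quantize_warp_chunk(x)
x_hat = dequantize_warp_chunk(q, s)
print("round-trip of first 4 values:", x[:4], "->", x_hat[:4])
```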

## Topology Requirements

### 8x L20/L40S Node Architecture

Low-Precision-AllReduce is specifically designed for this node topology, where:

- Each node contains 2 NUMA domains
- Each NUMA domain has 4 GPUs connected via PCIe switch
- GPUs within the same NUMA node communicate via the PCIe switch

**Important:** This optimization will not improve performance in other topologies (e.g., where each GPU is in a separate NUMA domain).

## Usage

The Low-Precision-AllReduce algorithm can be enabled in two ways:

1. Direct specification in your code:

   ```cpp
   AllReduce allreduce(mapping=mapping, strategy=AllReduceStrategy.LOWPRECISION);
   ```

2. Environment variable control with the AUTO strategy:

   ```cpp
   // In your code
   AllReduce allreduce(mapping=mapping, strategy=AllReduceStrategy.AUTO);
   ```

   ```bash
   # Set the environment variable before running
   export FORCE_LOW_PRECISION_ALL_REDUCE_STRATEGY=1
   ```
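
If you prefer to control the fallback from the launching script rather than the shell, the same variable can be set with the standard `os.environ` pattern; this is plain Python, not a TensorRT-LLM-specific API, and assumes the variable is set before the runtime selects an allreduce strategy.

```python
import os

# Should be set before the runtime picks an allreduce strategy,
# i.e. before the model/engine is created with AllReduceStrategy.AUTO.
os.environ["FORCE_LOW_PRECISION_ALL_REDUCE_STRATEGY"] = "1"
```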

## Performance and Accuracy Considerations

Low-Precision-AllReduce reduces communication volume by using FP8 data format for transmission. This optimization:

- Improves performance for large message sizes in PCIe-based topologies
- May slightly reduce numerical precision
- Automatically falls back to other strategies when no performance benefit is expected (e.g., with NVLink or small messages)

Users should evaluate the precision impact on their specific models and workloads.
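
As a rough illustration of the volume reduction (ignoring protocol overhead), the sketch below compares the bytes one warp's chunk puts on the wire in FP16 versus the FP8 path, using the per-warp layout from the Algorithm section; treating the scalar as occupying a full 16-byte thread slot is an assumption.

```python
# Back-of-the-envelope wire volume for one warp's chunk (496 elements).
ELEMS = 496
BYTES_PER_FP16 = 2
BYTES_PER_FP8 = 1
SCALE_SLOT_BYTES = 16  # assumption: the 32nd thread's 16-byte slot carries the scalar

baseline_bytes = ELEMS * BYTES_PER_FP16                         # 992 bytes
low_precision_bytes = ELEMS * BYTES_PER_FP8 + SCALE_SLOT_BYTES  # 512 bytes
print(f"FP16: {baseline_bytes} B, FP8 path: {low_precision_bytes} B, "
      f"reduction: {baseline_bytes / low_precision_bytes:.2f}x")
```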

## Environment Variables

- `FORCE_LOW_PRECISION_ALL_REDUCE_STRATEGY`: When set to 1, forces the use of the low-precision algorithm with the AUTO strategy. If the algorithm determines it cannot provide a performance benefit, it automatically falls back to other strategies.

> **Note:** If TensorRT-LLM is compiled without the `ENABLE_FP8` option, enabling Low-Precision-AllReduce will not take effect.