# Low-Precision-AllReduce

```{note}
This feature is optimized for PCIe-based GPU topologies and may affect model accuracy. Please evaluate the precision impact for your specific workload.
```

TRT-LLM supports `Low-Precision-AllReduce`, a communication optimization that accelerates AllReduce operations in PCIe-based GPU environments. This feature quantizes FP16/BF16 data to FP8 during network transmission, reducing communication volume and improving performance.

## Algorithm

The Low-Precision-AllReduce algorithm works as follows:

1. Quantize the input FP16/BF16 tensors to FP8 format before network transmission.

   **Quantization details**: We use a per-warp quantization approach in which each CUDA warp (32 threads) processes a batch of data. Within each warp, 31 threads quantize FP16/BF16 values to FP8 E4M3 format (16 bytes per thread), while the last thread transmits a scalar value. Each warp therefore collectively quantizes 496 elements plus one scalar at a time.
2. Transmit the quantized data over the network.
3. Dequantize the received data back to the original precision.
4. Perform the reduction operation.

In 8-GPU scenarios, this approach shifts the communication bottleneck from the cross-NUMA QPI link to the PCIe switch, resulting in better overall performance.

## Topology Requirements

![8x L20/L40s Node Architecture](images/8x_l20_L40S_node_architecture.png)

Low-Precision-AllReduce is designed specifically for the topology shown above, where:

- Each node contains 2 NUMA domains
- Each NUMA domain has 4 GPUs connected via a PCIe switch
- GPUs within the same NUMA domain communicate through the PCIe switch

**Important:** This optimization will not improve performance on other topologies (e.g., where each GPU sits in a separate NUMA domain).

## Usage

The Low-Precision-AllReduce algorithm can be enabled in two ways:

1. **Direct specification** in your code:

   ```
   AllReduce allreduce(mapping=mapping, strategy=AllReduceStrategy.LOWPRECISION);
   ```

2. **Environment variable control** with the `AUTO` strategy:
   ```
   // In your code
   AllReduce allreduce(mapping=mapping, strategy=AllReduceStrategy.AUTO);
   ```

   ```
   # Set the environment variable before running
   export FORCE_LOW_PRECISION_ALL_REDUCE_STRATEGY=1
   ```

## Performance and Accuracy Considerations

Low-Precision-AllReduce reduces communication volume by using the FP8 data format for transmission. This optimization:

- Improves performance for large message sizes in PCIe-based topologies
- May slightly reduce numerical precision
- Automatically falls back to other strategies when no performance benefit is expected (e.g., with NVLink or small message sizes)

Users should evaluate the precision impact on their specific models and workloads.

## Environment Variables

- `FORCE_LOW_PRECISION_ALL_REDUCE_STRATEGY`: When set to `1`, forces the use of the low-precision algorithm with the `AUTO` strategy. If the algorithm determines that it cannot provide a performance benefit, it automatically falls back to other strategies.

**Note**: When TensorRT-LLM is compiled without the `ENABLE_FP8` option, enabling Low-Precision-AllReduce has no effect.
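To build intuition for the accuracy impact, the per-warp quantize/dequantize/reduce flow described in the Algorithm section can be simulated numerically. The sketch below is an illustration only, not the CUDA implementation: `round_e4m3`, `quantize_warp`, and `low_precision_allreduce` are hypothetical helpers, the per-warp max-based scale is an assumption inferred from the "496 elements plus one scalar" layout above, and FP8 subnormals are ignored for simplicity.

```python
import numpy as np

E4M3_MAX = 448.0    # largest finite FP8 E4M3 value
WARP_ELEMS = 496    # elements quantized per warp (31 threads x 16 bytes)

def round_e4m3(x):
    """Round values to the FP8 E4M3 grid (4-bit exponent, 3-bit mantissa).
    Subnormals are ignored for simplicity; values are clamped to +-448."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    m, e = np.frexp(x)                 # x = m * 2**e with |m| in [0.5, 1)
    m = np.round(m * 16.0) / 16.0      # keep 1 + 3 significant bits
    return np.ldexp(m, e)

def quantize_warp(x):
    """Quantize one warp's chunk; returns FP8-grid data plus the scalar scale.
    (Assumed scheme: scale chosen so the chunk's max maps to E4M3_MAX.)"""
    scale = np.abs(x).max() / E4M3_MAX
    return round_e4m3(x / scale), scale

def low_precision_allreduce(tensors):
    """Simulated AllReduce: each rank's data is quantized per warp before
    'transmission', dequantized on receipt, then summed."""
    out = np.zeros_like(tensors[0])
    for t in tensors:
        for start in range(0, t.size, WARP_ELEMS):
            q, scale = quantize_warp(t[start:start + WARP_ELEMS])
            out[start:start + WARP_ELEMS] += q * scale  # dequantize + reduce
    return out

# 8 ranks, each contributing 4 warps' worth of FP32 data
rng = np.random.default_rng(0)
ranks = [rng.standard_normal(WARP_ELEMS * 4).astype(np.float32) for _ in range(8)]
exact = np.sum(ranks, axis=0)
approx = low_precision_allreduce(ranks)
rel_err = np.abs(approx - exact).max() / np.abs(exact).max()
print(f"max relative error: {rel_err:.4f}")
```

Under the layout described above, each warp sends 512 bytes (32 threads x 16 bytes) in place of 992 bytes of FP16 payload (496 x 2 bytes), roughly halving traffic; the quantization error visible in the simulation comes from the 3-bit E4M3 mantissa and is what the accuracy caveat in this page refers to.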