# Parallelism in TensorRT LLM
Parallelism across multiple GPUs becomes necessary when either
* the model cannot fit in a single GPU's memory, or
* a single GPU cannot deliver the desired performance.
TensorRT LLM supports multiple parallelism strategies for deployment on both single and multiple nodes (a configuration sketch combining two of them follows this list):
* **Tensor Parallel (TP)** - Shards model weights across GPUs
* **Pipeline Parallel (PP)** - Distributes model layers across GPUs
* **Data Parallel (DP)** - Replicates model across GPUs for different requests
* **Expert Parallel (EP)** - Distributes experts across GPUs for MoE models
* **Context Parallel (CP)** - Distributes context processing across GPUs
* **Wide Expert Parallel (Wide-EP)** - Advanced EP with load balancing for large-scale MoE models
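These strategies can be combined by setting the corresponding sizes in a YAML configuration file, the same mechanism used throughout this guide. The following is a minimal sketch, assuming a two-node deployment with 8 GPUs per node; the sizes are illustrative, and the attention- and MoE-specific keys are introduced in the sections below:
```bash
cat <<EOF > parallel_config.yaml
# Illustrative combination: TP-8 inside each node, PP-2 across the two nodes
# (8 × 2 = 16 GPUs in total).
tensor_parallel_size: 8
pipeline_parallel_size: 2
EOF
```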
## Overview of Parallelism Strategies
### Tensor Parallelism (TP)
Tensor parallelism splits the model weights across multiple GPUs. Each GPU holds a portion of the weights and processes the same input tokens, with results combined through communication.
**Best for:** Small batch sizes, memory-constrained scenarios
### Pipeline Parallelism (PP)
Pipeline parallelism distributes different layers of the model across multiple GPUs. Each GPU processes a subset of layers, with activations passed between GPUs.
**Best for:** Large models that don't fit in a single GPU's memory
### Data Parallelism (DP)
Data parallelism replicates the entire model across multiple GPUs. Each GPU processes different requests independently.
**Best for:** Large batch sizes, high throughput scenarios
### Expert Parallelism (EP)
Expert parallelism is specifically designed for Mixture of Experts (MoE) models, where different experts are distributed across GPUs.
**Best for:** MoE models with high expert count
### Context Parallelism (CP)
Context parallelism distributes the processing of long sequences across multiple GPUs.
**Best for:** Long context scenarios
### Wide Expert Parallelism (Wide-EP)
Wide-EP is an advanced form of expert parallelism that addresses the inherent workload imbalance in large-scale MoE models through intelligent load balancing and expert replication.
**Best for:** Large-scale MoE models like DeepSeek-V3/R1, LLaMA4, Qwen3
## Module-level Parallelism Guide
### Attention Module
TensorRT LLM supports two strategies for attention modules:
- **Tensor Parallelism (TP)** — best for small batch sizes
- **Data Parallelism (DP)** — best for large batch sizes
#### Tensor Parallelism (TP)
* The GEMM weights before and after the attention kernel are evenly sharded across GPUs, as are the attention `num_heads`.
* Exceptions:
1. **DeepSeek-R1**: the `fused_A` GEMM is *not* sharded.
2. **GQA / MQA / MLA**: if `num_heads < tensor_parallel_size`, the KV-cache is replicated on every GPU.
#### Data Parallelism (DP)
* All GEMM weights are **replicated** on every GPU.
* The KV-cache is **partitioned**, because different user requests are routed to different DP ranks.
#### How to Enable Attention Parallelism
To deploy a model with either strategy using `trtllm-serve`, or to benchmark it with `trtllm-bench`, create a YAML configuration file named `parallel_config.yaml` containing one of the following configurations:
```bash
# Pick ONE of the two configurations; each heredoc writes a complete parallel_config.yaml.

# TP-8
cat <<EOF > parallel_config.yaml
tensor_parallel_size: 8
enable_attention_dp: false  # default
EOF

# DP-8
cat <<EOF > parallel_config.yaml
tensor_parallel_size: 8
enable_attention_dp: true
EOF
```
Then pass `--config parallel_config.yaml` to `trtllm-serve` or `trtllm-bench`.
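For example, a server can be launched with the file created above. This is a minimal sketch: `<model>` is a placeholder for a Hugging Face model ID or a local checkpoint path, and all other serving options are left at their defaults:
```bash
# Launch an OpenAI-compatible server using the attention parallel strategy
# selected in parallel_config.yaml; <model> is a placeholder.
trtllm-serve <model> --config parallel_config.yaml
```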
### FFN Module
#### Dense Models
Tensor Parallelism is supported for the FFN layers of dense models.
#### Mixture of Experts (MoE)
MoE replaces a single FFN with multiple experts. A router selects the top-k experts for each token and dispatches the corresponding hidden states.
TensorRT LLM supports three execution patterns for MoE:
* **TP** - Every expert's weight matrix is sliced across all GPUs. Each GPU sees all tokens.
* **EP** - Full weights of each expert reside on a single GPU. Each GPU only sees tokens routed to its local experts.
* **Hybrid ETP** - Each GPU stores a subset of experts (EP) and shards those weights further (TP), balancing workload and kernel efficiency.
#### How to Enable MoE Parallelism
To deploy a model with one of these MoE parallel strategies using `trtllm-serve`, or to benchmark it with `trtllm-bench`, create a YAML configuration file named `parallel_config.yaml` containing one of the following configurations:
```bash
# Pick ONE of the three configurations; each heredoc writes a complete parallel_config.yaml.

# TP only
cat <<EOF > parallel_config.yaml
tensor_parallel_size: 8
moe_tensor_parallel_size: 8
EOF

# EP only
cat <<EOF > parallel_config.yaml
tensor_parallel_size: 8
moe_expert_parallel_size: 8
EOF

# Hybrid (TP-4 × EP-2)
cat <<EOF > parallel_config.yaml
tensor_parallel_size: 8  # 4 × 2
moe_tensor_parallel_size: 4
moe_expert_parallel_size: 2
EOF
```
```{note}
The product of `moe_tensor_parallel_size` and `moe_expert_parallel_size` must equal `tensor_parallel_size`.
```
## Wide Expert Parallelism (Wide-EP)
Wide Expert Parallelism (Wide-EP) is TensorRT LLM's advanced solution for large-scale MoE model inference. It addresses the challenges of traditional expert parallelism through intelligent load balancing and expert replication strategies.
### Motivation for Wide-EP
Large-scale MoE models like DeepSeek-V3/R1, LLaMA4, and Qwen3 use fine-grained expert designs that introduce new challenges:
- **High memory demands** for expert weights
- **Inherent expert-level workload imbalance** due to sparse execution patterns
- **Communication overhead** in distributed expert parallelism
- **Hot expert problem** where certain experts receive significantly more tokens than others
### Key Features of Wide-EP
#### 1. Expert Replication and Load Balancing
Wide-EP introduces the concept of **expert slots** that are decoupled from specific experts. This allows:
- Multiple replicas of hot experts across different GPUs
- Dynamic expert placement based on workload patterns
- Both offline and online load balancing strategies
#### 2. Custom EP Communication Kernels
- Optimized for NVIDIA GB200 Multi-Node NVLink (MNNVL)
- Efficient all-to-all communication for expert dispatch and combine
- Reduced communication overhead compared to traditional EP
#### 3. Expert Parallelism Load Balancer (EPLB)
- **Offline EPLB**: Pre-computed expert placement based on historical workload statistics
- **Online EPLB**: Dynamic expert placement that adapts to real-time traffic patterns (see the configuration sketch after this list)
- Layer-wise weight redistribution to minimize inference disruption
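The sketch below shows one way to describe the load balancer as a standalone YAML file, following the pattern used by the tools in `examples/wide_ep/ep_load_balancer/`. The key names `num_slots` and `layer_updates_per_iter`, as well as the values, are assumptions based on those examples and may differ between releases:
```bash
cat <<EOF > moe_load_balancer.yaml
# Total expert slots per MoE layer; choosing more slots than the model has
# experts leaves room to replicate hot experts (illustrative value).
num_slots: 288
# Number of layers whose expert weights are redistributed per iteration;
# a non-zero value enables online (dynamic) rebalancing.
layer_updates_per_iter: 2
EOF
```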
### Architecture Overview
Wide-EP separates the concepts of **experts** and **slots**:
- **Expert**: The concept from the model's perspective (e.g., Expert 0, Expert 1, etc.)
- **Slot**: The concept from the model engine's perspective (e.g., Slot 0, Slot 1, etc.)
The system maintains a routing table that maps Expert IDs to Slot IDs, which can be updated by the load balancing policy. For example, a hot expert may be mapped to several slots on different GPUs so that its traffic is spread out, while a rarely selected expert occupies a single slot.
### Best Practices
1. **Start with offline EPLB** for production deployments with known workload patterns
2. **Use online EPLB** for dynamic workloads or when traffic patterns change frequently
3. **Monitor expert statistics** to understand workload distribution
4. **Tune max_num_tokens** based on your memory constraints and EP size (see the deployment sketch after this list)
5. **Test with representative datasets** to validate load balancing effectiveness
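Putting these practices together, the following is a minimal deployment sketch rather than a definitive configuration: the `moe_config` block with its `backend` and `load_balancer` keys is an assumption based on the technical blog and the `examples/wide_ep/` scripts linked below, and all sizes are illustrative only:
```bash
cat <<EOF > wide_ep_config.yaml
# Attention uses data parallelism; MoE layers use expert parallelism across 8 GPUs.
tensor_parallel_size: 8
moe_expert_parallel_size: 8
enable_attention_dp: true
# Cap on tokens processed per iteration; tune against memory budget and EP size.
max_num_tokens: 8192
moe_config:
  backend: WIDEEP                         # assumed backend name for Wide-EP
  load_balancer: moe_load_balancer.yaml   # EPLB file from the sketch above
EOF
```
As with the earlier examples, the file would be passed to `trtllm-serve` or `trtllm-bench` via `--config`.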
### References
- [Technical Blog: Scaling Expert Parallelism in TensorRT LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md)
- [DeepSeek-V3 Paper](https://arxiv.org/abs/2412.19437)
- [EPLB Implementation](https://github.com/deepseek-ai/EPLB)
For detailed implementation examples and advanced usage, see:
- [`examples/wide_ep/`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/): Complete Wide-EP examples
- [`examples/wide_ep/ep_load_balancer/`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/ep_load_balancer/): Load balancing tools
- [`examples/wide_ep/slurm_scripts/`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/slurm_scripts/): Cluster deployment scripts