# Parallelism in TensorRT LLM

Parallelism across multiple GPUs becomes necessary when either

* the model cannot fit in a single GPU’s memory, or
* a single GPU cannot deliver the desired performance.

TensorRT LLM supports multiple parallelism strategies for deployment on both single and multiple nodes:

* **Tensor Parallel (TP)** - Shards model weights across GPUs
* **Pipeline Parallel (PP)** - Distributes model layers across GPUs
* **Data Parallel (DP)** - Replicates the model across GPUs for different requests
* **Expert Parallel (EP)** - Distributes experts across GPUs for MoE models
* **Context Parallel (CP)** - Distributes context processing across GPUs
* **Wide Expert Parallel (Wide-EP)** - Advanced EP with load balancing for large-scale MoE models

## Overview of Parallelism Strategies

### Tensor Parallelism (TP)

Tensor parallelism splits the model weights across multiple GPUs. Each GPU holds a portion of the weights and processes the same input tokens, with results combined through communication.

**Best for:** Small batch sizes, memory-constrained scenarios
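
As a minimal illustration of the mechanism (plain NumPy, not TensorRT LLM code; all names are ours), a column-parallel linear layer gives each rank a slice of the weight matrix, every rank processes the same input, and the output slices are combined:

```python
import numpy as np

tp_size = 2
x = np.random.randn(4, 8)    # [tokens, hidden]: every rank sees the same input
w = np.random.randn(8, 16)   # full weight matrix [hidden, out]

# Column-parallel sharding: each rank stores out / tp_size columns of w.
shards = np.split(w, tp_size, axis=1)
partials = [x @ w_i for w_i in shards]   # each rank computes its output slice

# Combine the slices (an all-gather across ranks in a real deployment).
y = np.concatenate(partials, axis=1)
assert np.allclose(y, x @ w)
```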

### Pipeline Parallelism (PP)

Pipeline parallelism distributes different layers of the model across multiple GPUs. Each GPU processes a subset of layers, with activations passed between GPUs.

**Best for:** Large models that don't fit in single GPU memory
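
A toy sketch of the idea (plain Python with stand-in layers, not TensorRT LLM code): each rank owns a contiguous slice of the layer stack and hands its activations to the next rank:

```python
# 8 stand-in layers spread over 4 pipeline stages (hypothetical example).
layers = [lambda x, i=i: x + i for i in range(8)]
pp_size = 4
per_stage = len(layers) // pp_size

def forward(x):
    for rank in range(pp_size):
        for layer in layers[rank * per_stage:(rank + 1) * per_stage]:
            x = layer(x)
        # In a real deployment, the activations x are sent to rank + 1 here.
    return x

print(forward(0))   # 0 + 1 + ... + 7 = 28
```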

### Data Parallelism (DP)

Data parallelism replicates the entire model across multiple GPUs. Each GPU processes different requests independently.

**Best for:** Large batch sizes, high throughput scenarios

### Expert Parallelism (EP)

Expert parallelism is specifically designed for Mixture of Experts (MoE) models, where different experts are distributed across GPUs.

**Best for:** MoE models with high expert count

### Context Parallelism (CP)

Context parallelism distributes the processing of long sequences across multiple GPUs.

**Best for:** Long context scenarios

### Wide Expert Parallelism (Wide-EP)

Wide-EP is an advanced form of expert parallelism that addresses the inherent workload imbalance in large-scale MoE models through intelligent load balancing and expert replication.

**Best for:** Large-scale MoE models like DeepSeek-V3/R1, LLaMA4, Qwen3

## Module-level Parallelism Guide

### Attention Module

TensorRT LLM supports two strategies for attention modules:

- **Tensor Parallelism (TP)** — best for small batch sizes
- **Data Parallelism (DP)** — best for large batch sizes

#### Tensor Parallelism (TP)

* The GEMM weights before and after the attention kernel are evenly sharded across GPUs, as are the attention `num_heads` (see the sketch below).
* Exceptions:
  1. **DeepSeek-R1**: the `fused_A` GEMM is *not* sharded.
  2. **GQA / MQA / MLA**: if `num_heads < tensor_parallel_size`, the KV-cache is replicated on every GPU.
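
The head-sharding arithmetic behind these rules can be sketched as follows (our illustration, not library code; we read the `num_heads` in the GQA/MQA/MLA exception as the number of KV heads):

```python
def shard_attention(num_q_heads: int, num_kv_heads: int, tp_size: int):
    """Illustrative only: how attention heads might divide across TP ranks."""
    assert num_q_heads % tp_size == 0, "query heads must divide evenly"
    q_heads_per_rank = num_q_heads // tp_size
    if num_kv_heads < tp_size:
        # Too few KV heads to shard: every rank keeps a full copy of the KV-cache.
        return q_heads_per_rank, num_kv_heads, True
    return q_heads_per_rank, num_kv_heads // tp_size, False

# GQA example: 32 query heads, 8 KV heads, TP-16.
print(shard_attention(32, 8, 16))   # (2, 8, True): KV-cache replicated
```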

#### Data Parallelism (DP)

* All GEMM weights are **replicated** on every GPU.
* The KV-cache is **partitioned**, because different user requests are routed to different DP ranks.

#### How to Enable Attention Parallelism

To deploy a model with either strategy using `trtllm-serve`, or to run benchmarks with `trtllm-bench`, create a YAML configuration file named `parallel_config.yaml`:
```bash
# TP-8 (default)
cat <<EOF > parallel_config.yaml
tensor_parallel_size: 8
enable_attention_dp: false  # default

# For DP-8, keep tensor_parallel_size: 8 and set instead:
# enable_attention_dp: true
EOF
```

Then pass `--config parallel_config.yaml` to `trtllm-serve` or `trtllm-bench`.
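
The same settings can also be applied programmatically. The sketch below assumes the Python `LLM` API accepts these YAML keys as keyword arguments; the model name is purely illustrative:

```python
from tensorrt_llm import LLM

# DP-8 attention (assumption: YAML keys map 1:1 onto LLM keyword arguments).
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # illustrative model name
    tensor_parallel_size=8,
    enable_attention_dp=True,
)
```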

### FFN Module

#### Dense Models

Tensor Parallelism is supported for the FFN layers of dense models.

#### Mixture of Experts (MoE)

MoE replaces a single FFN with multiple experts. A router selects the top-k experts for each token and dispatches the corresponding hidden states, as sketched below.
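
The following NumPy sketch (our illustration, not TensorRT LLM code) shows the routing step: the router scores all experts per token, keeps the top-k, and groups token indices by expert for dispatch:

```python
import numpy as np

num_tokens, num_experts, top_k = 6, 8, 2
logits = np.random.randn(num_tokens, num_experts)   # router scores

# Top-k expert indices per token.
topk_ids = np.argsort(-logits, axis=1)[:, :top_k]

# Group token indices by expert: this is what dispatch sends to each expert.
dispatch = {e: np.where((topk_ids == e).any(axis=1))[0] for e in range(num_experts)}
for expert, tokens in dispatch.items():
    print(f"expert {expert} receives tokens {tokens.tolist()}")
```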

TensorRT LLM supports three execution patterns for MoE:

* **TP** - Every expert's weight matrix is sliced across all GPUs. Each GPU sees all tokens.
* **EP** - The full weights of each expert reside on a single GPU. Each GPU sees only the tokens routed to its local experts.
* **Hybrid ETP** - Each GPU stores a subset of experts (EP) and shards those weights further (TP), balancing workload and kernel efficiency (see the worked example below).
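
As a worked example of the hybrid layout (numbers are ours): with 8 GPUs arranged as EP-2 × TP-4 and a model with 64 experts, each EP group owns half the experts, and each local expert's weights are sliced four ways:

```python
num_gpus, moe_ep, moe_tp, num_experts = 8, 2, 4, 64
assert moe_ep * moe_tp == num_gpus   # the product must equal tensor_parallel_size

experts_per_ep_group = num_experts // moe_ep   # 32 experts held per EP group
weight_fraction_per_gpu = 1 / moe_tp           # each GPU stores 1/4 of each local expert
print(experts_per_ep_group, weight_fraction_per_gpu)   # 32 0.25
```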

#### How to Enable MoE Parallelism

To deploy a model with the above parallel strategies using `trtllm-serve`, or to run benchmarks with `trtllm-bench`, create a YAML configuration file named `parallel_config.yaml`, keeping exactly one of the three layouts uncommented:

```bash
cat <<EOF > parallel_config.yaml
# TP only
tensor_parallel_size: 8
moe_tensor_parallel_size: 8

# EP only
# tensor_parallel_size: 8
# moe_expert_parallel_size: 8

# Hybrid (TP-4 × EP-2)
# tensor_parallel_size: 8  # 4 × 2
# moe_tensor_parallel_size: 4
# moe_expert_parallel_size: 2
EOF
```

```{note}
The product of `moe_tensor_parallel_size` and `moe_expert_parallel_size` must equal `tensor_parallel_size`.
```
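
Programmatically, the same layouts can be expressed through the Python `LLM` API; the sketch below assumes these YAML keys map one-to-one onto `LLM` keyword arguments (the model name is illustrative):

```python
from tensorrt_llm import LLM

# Hybrid TP-4 × EP-2 on 8 GPUs (assumption: YAML keys map 1:1 onto LLM kwargs).
llm = LLM(
    model="Qwen/Qwen3-235B-A22B",   # illustrative MoE model name
    tensor_parallel_size=8,          # must equal 4 × 2 below
    moe_tensor_parallel_size=4,
    moe_expert_parallel_size=2,
)
```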

## Wide Expert Parallelism (Wide-EP)

Wide Expert Parallelism (Wide-EP) is TensorRT LLM's advanced solution for large-scale MoE model inference. It addresses the challenges of traditional expert parallelism through intelligent load balancing and expert replication strategies.

### Motivation for Wide-EP

Large-scale MoE models like DeepSeek-V3/R1, LLaMA4, and Qwen3 use fine-grained expert designs that introduce new challenges:

- **High memory demands** for expert weights
- **Inherent expert-level workload imbalance** due to sparse execution patterns
- **Communication overhead** in distributed expert parallelism
- **Hot expert problem** where certain experts receive significantly more tokens than others

### Key Features of Wide-EP

#### 1. Expert Replication and Load Balancing

Wide-EP introduces the concept of **expert slots** that are decoupled from specific experts. This allows:

- Multiple replicas of hot experts across different GPUs
- Dynamic expert placement based on workload patterns
- Both offline and online load balancing strategies

#### 2. Custom EP Communication Kernels

- Optimized for NVIDIA GB200 Multi-Node NVLink (MNNVL)
- Efficient all-to-all communication for expert dispatch and combine
- Reduced communication overhead compared to traditional EP

#### 3. Expert Parallelism Load Balancer (EPLB)

- **Offline EPLB**: Pre-computed expert placement based on historical workload statistics
- **Online EPLB**: Dynamic expert placement that adapts to real-time traffic patterns
- Layer-wise weight redistribution to minimize inference disruption
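
To make the offline variant concrete, here is a minimal greedy placement sketch (our illustration, not the actual EPLB algorithm): given historical per-expert loads, the expert with the highest per-replica load is replicated until all slots are filled:

```python
from collections import Counter

def place_experts(loads: list[float], num_slots: int) -> list[int]:
    """Greedy replication sketch: fill spare slots with replicas of hot experts."""
    assert num_slots >= len(loads)
    replicas = Counter(range(len(loads)))   # start with one replica per expert
    for _ in range(num_slots - len(loads)):
        hottest = max(replicas, key=lambda e: loads[e] / replicas[e])
        replicas[hottest] += 1
    # Expand into a slot list; hot experts appear multiple times.
    return [e for e, n in replicas.items() for _ in range(n)]

# 4 experts with expert 2 clearly hot; 6 slots leave room for 2 extra replicas.
print(place_experts([10.0, 20.0, 90.0, 30.0], 6))   # [0, 1, 2, 2, 2, 3]
```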

### Architecture Overview

Wide-EP separates the concepts of **experts** and **slots**:

- **Expert**: The concept from the model's perspective (e.g., Expert 0, Expert 1, etc.)
- **Slot**: The concept from the model engine's perspective (e.g., Slot 0, Slot 1, etc.)

The system maintains a routing table that maps Expert IDs to Slot IDs, which can be updated by the load balancing policy.
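
A small sketch of that mapping (our numbers, for illustration): if the balancer has replicated hot expert 2 into two slots, dispatch can spread expert-2 traffic across both replicas, for example round-robin:

```python
import itertools

# Routing table produced by the load balancer: expert ID -> slot IDs.
routing_table = {0: [0], 1: [1], 2: [2, 4], 3: [3]}   # expert 2 owns two slots
cursors = {e: itertools.cycle(slots) for e, slots in routing_table.items()}

def slot_for(expert_id: int) -> int:
    """Pick the next slot for a token routed to this expert (round-robin)."""
    return next(cursors[expert_id])

print([slot_for(2) for _ in range(4)])   # [2, 4, 2, 4]
```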

### Best Practices

1. **Start with offline EPLB** for production deployments with known workload patterns.
2. **Use online EPLB** for dynamic workloads or when traffic patterns change frequently.
3. **Monitor expert statistics** to understand workload distribution.
4. **Tune `max_num_tokens`** based on your memory constraints and EP size.
5. **Test with representative datasets** to validate load balancing effectiveness.

### References

- [Technical Blog: Scaling Expert Parallelism in TensorRT LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md)
- [DeepSeek-V3 Paper](https://arxiv.org/abs/2412.19437)
- [EPLB Implementation](https://github.com/deepseek-ai/EPLB)

For detailed implementation examples and advanced usage, see:

- [`examples/wide_ep/`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/): Complete Wide-EP examples
- [`examples/wide_ep/ep_load_balancer/`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/ep_load_balancer/): Load balancing tools
- [`examples/wide_ep/slurm_scripts/`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/slurm_scripts/): Cluster deployment scripts