# Wide Expert Parallelism (Wide-EP) in TensorRT-LLM
TensorRT-LLM's Wide Expert Parallelism (Wide-EP) feature enables efficient inference of large-scale Mixture-of-Experts (MoE) models by scaling expert parallelism beyond traditional limits. This feature addresses the inherent workload imbalance challenges in large-scale MoE models and provides both offline and online load balancing capabilities.
## Overview
Large-scale MoE models like DeepSeek-V3/R1, LLaMA4, and Qwen3 use fine-grained expert designs that introduce new challenges for inference systems:
- **High memory demands** for expert weights
- **Inherent expert-level workload imbalance** due to sparse execution patterns
- **Communication overhead** in distributed expert parallelism
Wide-EP solves these challenges through:
- **Custom EP communication kernels** optimized for NVIDIA GB200 Multi-Node NVLink (MNNVL)
- **Expert Parallelism Load Balancer (EPLB)** with both offline and online modes
- **Dynamic expert placement and replication** strategies
- **Layer-wise weight redistribution** to minimize inference disruption
## Quick Start
### 1. Configurations
An example YAML file to enable Wide-EP:
```yaml
moe_config:
  backend: WIDEEP
  max_num_tokens: 9216
  load_balancer: moe_load_balancer.yaml # (optional) enable load balancer
```
| Parameter | Description | Default | Notes |
|-----------|-------------|---------|-------|
| `backend` | MoE backend type | `CUTLASS` | Set to `WIDEEP` to enable Wide-EP |
| `max_num_tokens` | If set, at most `max_num_tokens` tokens are sent to `torch.ops.trtllm.fused_moe` at the same time. | `None` | If the number of tokens exceeds `max_num_tokens`, the input tensors are split into chunks that are processed in a loop. |
| `load_balancer` | Configuration for MoE load balancing | `None` | Set to the path of the load balancer YAML file |
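A minimal launch sketch, assuming a `trtllm-serve` deployment (the model name, parallelism sizes, and the `wide_ep_config.yaml` file name below are illustrative placeholders, not part of this example):

```bash
# Hypothetical launch; adjust the model, parallelism sizes, and paths to your setup.
trtllm-serve deepseek-ai/DeepSeek-R1 \
  --backend pytorch \
  --tp_size 8 \
  --ep_size 8 \
  --extra_llm_api_options wide_ep_config.yaml
```

In practice Wide-EP targets large EP sizes on GB200 MNNVL systems; the values above only illustrate how the YAML file is passed in.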
#### Load Balancer Configuration
An example `moe_load_balancer.yaml` file to configure the online EP load balancer:
```yaml
num_slots: 288
layer_updates_per_iter: 1
```
| Parameter | Description | Default | Notes |
|-----------|-------------|---------|-------|
| `num_slots` | Total number of expert slots | `None` | Must be ≥ the total number of experts |
| `layer_updates_per_iter` | Number of layers updated per iteration | `0` | `0` = offline, `>0` = online |
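As an illustrative sketch of how these two settings interact (the figure of 256 routed experts is an assumption about the served model, in the style of DeepSeek-V3/R1, not a requirement):

```yaml
# Assuming a model with 256 routed experts per MoE layer:
# 288 slots - 256 experts = 32 redundant slots that the load balancer can fill
# with replicas of heavily loaded experts.
num_slots: 288
# 0  -> offline balancing: a precomputed static placement is used at runtime.
# >0 -> online balancing: expert weights of this many layers are redistributed
#       per iteration, spreading the update cost across steps.
layer_updates_per_iter: 1
```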
Refer to the [ep_load_balancer](./ep_load_balancer/) directory for more details on the EP load balancer.
### 2. Execute Wide-EP on SLURM Clusters
Refer to the [slurm_scripts](./slurm_scripts/) directory, which reuses [disaggregated slurm scripts](../disaggregated/slurm/) to automatically generate configuration files and submit jobs to SLURM clusters.
## Troubleshooting
### Transparent HugePages failure
If you see the exception `madvise(MADV_HUGEPAGE) failed.`, check whether Transparent HugePages is enabled:
```bash
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise [madvise] never
```
If `never` is the selected value (shown in brackets), enable Transparent HugePages with the following command.
```bash
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
```
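If the `defrag` policy checked above also shows `[never]`, it can be set to `madvise` in the same way (an optional extra step; the `enabled` setting is what resolves the `madvise(MADV_HUGEPAGE)` failure):

```bash
# Optional: also allow defragmentation for madvise'd regions.
echo madvise > /sys/kernel/mm/transparent_hugepage/defrag
```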
### Disaggregated serving related issues
Refer to the [Troubleshooting and FAQ](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/disaggregated-service.md#troubleshooting-and-faq) section of Disaggregated-Service.
## References
- [Technical Blog: Scaling Expert Parallelism in TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md)
For detailed implementation examples and advanced usage, see the subdirectories:
- [`ep_load_balancer/`](ep_load_balancer/): Load balancing tools and examples
- [`slurm_scripts/`](slurm_scripts/): Cluster deployment scripts