# Wide Expert Parallelism (Wide-EP) in TensorRT-LLM
TensorRT-LLM's Wide Expert Parallelism (Wide-EP) feature enables efficient inference of large-scale Mixture-of-Experts (MoE) models by scaling expert parallelism beyond traditional limits. This feature addresses the inherent workload imbalance challenges in large-scale MoE models and provides both offline and online load balancing capabilities.
## Overview
Large-scale MoE models like DeepSeek-V3/R1, LLaMA4, and Qwen3 use fine-grained expert designs that introduce new challenges for inference systems:
- High memory demands for expert weights
- Inherent expert-level workload imbalance due to sparse execution patterns
- Communication overhead in distributed expert parallelism
Wide-EP solves these challenges through:
- Custom EP communication kernels optimized for NVIDIA GB200 Multi-Node NVLink (MNNVL)
- Expert Parallelism Load Balancer (EPLB) with both offline and online modes
- Dynamic expert placement and replication strategies
- Layer-wise weight redistribution to minimize inference disruption
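To make the imbalance problem concrete, here is a small illustrative sketch (not TensorRT-LLM code) that measures expert-level load skew under a synthetic, deliberately skewed token-to-expert routing distribution. All numbers are made up for demonstration:

```python
# Illustrative only: quantify expert workload imbalance as max load / mean load.
from collections import Counter
import random

random.seed(0)
num_experts = 8
# Skewed routing weights: a few "hot" experts receive most tokens.
weights = [8, 4, 2, 1, 1, 1, 1, 1]
tokens = random.choices(range(num_experts), weights=weights, k=10_000)
loads = Counter(tokens)

mean_load = len(tokens) / num_experts
imbalance = max(loads.values()) / mean_load
print(f"per-expert loads: {sorted(loads.values(), reverse=True)}")
print(f"imbalance (max/mean): {imbalance:.2f}")
```

An imbalance ratio well above 1.0 means the hottest expert's rank becomes the straggler in expert-parallel execution, which is exactly what the load balancer below mitigates.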
## Quick Start
### 1. Configurations
An example YAML file to enable Wide-EP:

```yaml
moe_config:
  backend: WIDEEP
  max_num_tokens: 9216
  load_balancer: moe_load_balancer.yaml  # (optional) enable load balancer
```
| Parameter | Description | Default | Notes |
|---|---|---|---|
| `backend` | MoE backend type | `CUTLASS` | Set to `WIDEEP` to enable Wide-EP |
| `max_num_tokens` | If set, at most `max_num_tokens` tokens will be sent to `torch.ops.trtllm.fused_moe` at the same time | `None` | If the number of tokens exceeds `max_num_tokens`, the input tensors will be split into chunks and processed in a loop |
| `load_balancer` | Configuration for MoE load balancing | `None` | Set to the path of the load balancer YAML file |
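The chunking behavior described for `max_num_tokens` can be sketched as follows. This is a simplified illustration, not the real `torch.ops.trtllm.fused_moe` signature; `run_moe_chunked` and `moe_fn` are hypothetical names:

```python
# Sketch of max_num_tokens chunking: if the batch exceeds the limit,
# process it in fixed-size chunks with a loop instead of one call.
def run_moe_chunked(tokens, max_num_tokens, moe_fn):
    """Apply moe_fn to `tokens` in chunks of at most max_num_tokens tokens."""
    if max_num_tokens is None or len(tokens) <= max_num_tokens:
        return moe_fn(tokens)
    out = []
    for start in range(0, len(tokens), max_num_tokens):
        out.extend(moe_fn(tokens[start:start + max_num_tokens]))
    return out

# Toy usage: the "MoE" just doubles each value; 10 tokens -> chunks of 4, 4, 2.
result = run_moe_chunked(list(range(10)), max_num_tokens=4,
                         moe_fn=lambda chunk: [2 * x for x in chunk])
print(result)
```

Chunking trades some kernel-launch overhead for a bounded activation memory footprint at large batch sizes.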
### Load Balancer Configuration
An example `moe_load_balancer.yaml` file to configure the online EP load balancer:

```yaml
num_slots: 288
layer_updates_per_iter: 1
```
| Parameter | Description | Default | Notes |
|---|---|---|---|
| `num_slots` | Total number of expert slots | `None` | Must be ≥ the total number of experts |
| `layer_updates_per_iter` | Number of layers updated per iteration | `0` | `0` = offline, `> 0` = online |
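Since `num_slots` may exceed the number of experts, the spare slots can host replicas of heavily loaded experts. The greedy assignment below is a hedged sketch of that idea, not TensorRT-LLM's actual EPLB algorithm; `assign_slots` and the toy load numbers are illustrative:

```python
# Illustrative slot assignment: every expert gets one slot, then each spare
# slot goes to the expert with the highest remaining per-replica load.
def assign_slots(expert_loads, num_slots):
    num_experts = len(expert_loads)
    assert num_slots >= num_experts, "num_slots must be >= total experts"
    replicas = [1] * num_experts  # each expert needs at least one slot
    for _ in range(num_slots - num_experts):
        hottest = max(range(num_experts),
                      key=lambda e: expert_loads[e] / replicas[e])
        replicas[hottest] += 1
    return replicas

loads = [900, 300, 200, 100]  # toy per-expert token counts
print(assign_slots(loads, num_slots=6))  # hot expert 0 gets both spare slots
```

Replicating hot experts lets their traffic be split across ranks, flattening the per-rank load distribution.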
Refer to the `ep_load_balancer` directory for more details on the EP load balancer.
### 2. Execute Wide-EP on SLURM Clusters
Refer to the `slurm_scripts` directory, which reuses the disaggregated SLURM scripts to automatically generate configuration files and submit jobs to SLURM clusters.
## Troubleshooting
### Transparent HugePages failure
If you encounter the exception `madvise(MADV_HUGEPAGE) failed.`, check whether Transparent HugePages is enabled:

```bash
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise [madvise] never
```
If `never` is highlighted, enable Transparent HugePages with the following command:

```bash
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
```
### Disaggregated serving related issues
Refer to the Troubleshooting and FAQ section of Disaggregated-Service.
## References
For detailed implementation examples and advanced usage, see the subdirectories:
- `ep_load_balancer/`: Load balancing tools and examples
- `slurm_scripts/`: Cluster deployment scripts