
# Wide Expert Parallelism (Wide-EP) in TensorRT-LLM

TensorRT-LLM's Wide Expert Parallelism (Wide-EP) feature enables efficient inference of large-scale Mixture-of-Experts (MoE) models by scaling expert parallelism beyond traditional limits. This feature addresses the inherent workload imbalance challenges in large-scale MoE models and provides both offline and online load balancing capabilities.

## Overview

Large-scale MoE models like DeepSeek-V3/R1, LLaMA4, and Qwen3 use fine-grained expert designs that introduce new challenges for inference systems:

- High memory demands for expert weights
- Inherent expert-level workload imbalance due to sparse execution patterns
- Communication overhead in distributed expert parallelism

Wide-EP solves these challenges through:

- Custom EP communication kernels optimized for NVIDIA GB200 Multi-Node NVLink (MNNVL)
- Expert Parallelism Load Balancer (EPLB) with both offline and online modes
- Dynamic expert placement and replication strategies
- Layer-wise weight redistribution to minimize inference disruption

## Quick Start

### 1. Configurations

An example YAML file to enable Wide-EP:

```yaml
moe_config:
    backend: WIDEEP
    max_num_tokens: 9216
    load_balancer: moe_load_balancer.yaml # (optional) enable load balancer
```

| Parameter | Description | Default | Notes |
|---|---|---|---|
| `backend` | MoE backend type | `CUTLASS` | Set to `WIDEEP` to enable Wide-EP |
| `max_num_tokens` | If set, at most `max_num_tokens` tokens are sent to `torch.ops.trtllm.fused_moe` at a time | `None` | If the number of tokens exceeds `max_num_tokens`, the input tensors are split into chunks and processed in a loop |
| `load_balancer` | Configuration for MoE load balancing | `None` | Set to the path of the load balancer YAML file |
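
The YAML above is passed to the runtime as extra LLM API options. A minimal launch sketch is shown below; the model name, parallelism sizes, and exact flag spellings are assumptions that may differ across TensorRT-LLM versions, so check `trtllm-serve --help` for your installation.

```bash
# Sketch: launch an OpenAI-compatible server with Wide-EP enabled through the
# moe_config YAML shown above (saved here as wide_ep_config.yaml).
# Model name and parallelism sizes are illustrative assumptions.
trtllm-serve deepseek-ai/DeepSeek-R1 \
    --tp_size 8 \
    --ep_size 8 \
    --extra_llm_api_options wide_ep_config.yaml
```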

#### Load Balancer Configuration

An example `moe_load_balancer.yaml` file to configure the online EP load balancer:

```yaml
num_slots: 288
layer_updates_per_iter: 1
```

| Parameter | Description | Default | Notes |
|---|---|---|---|
| `num_slots` | Total number of expert slots | `None` | Must be ≥ the total number of experts |
| `layer_updates_per_iter` | Number of layers updated per iteration | `0` | `0` = offline load balancing, `> 0` = online load balancing |
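
As a worked example of the slot arithmetic: DeepSeek-R1 has 256 routed experts per MoE layer, so `num_slots: 288` provides 32 redundant expert slots per layer; on a 36-GPU expert-parallel group (an illustrative assumption) that is 288 / 36 = 8 slots per rank, and `layer_updates_per_iter: 1` redistributes the weights of one MoE layer per iteration to keep the per-step disruption small. A sketch that writes such a config:

```bash
# Sketch: generate an online EPLB config for a model with 256 routed experts
# per MoE layer (e.g., DeepSeek-R1). 288 slots = 256 experts + 32 redundant
# slots; with 36 EP ranks (illustrative), each rank hosts 288 / 36 = 8 slots.
cat > moe_load_balancer.yaml <<'EOF'
num_slots: 288
layer_updates_per_iter: 1
EOF
```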

Refer to the [ep_load_balancer](./ep_load_balancer/) directory for more details on the EP load balancer.

### 2. Execute Wide-EP on SLURM Clusters

Refer to the [slurm_scripts](./slurm_scripts/) directory, which reuses the disaggregated serving SLURM scripts to automatically generate configuration files and submit jobs to SLURM clusters.
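
A minimal submission sketch follows; the script name, node geometry, and partition are hypothetical placeholders, and the actual entry points and required environment variables are defined by the scripts in `slurm_scripts/`.

```bash
# Hypothetical SLURM submission sketch; the real scripts in slurm_scripts/
# generate the Wide-EP configuration files and launch the jobs for you.
sbatch --nodes=8 \
       --ntasks-per-node=4 \
       --partition=<your_partition> \
       submit_wide_ep.sh   # placeholder script name
```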

## Troubleshooting

### Transparent HugePages failure

If you hit the exception `madvise(MADV_HUGEPAGE) failed.`, check whether Transparent HugePages is enabled:

```bash
>$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
>$ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise [madvise] never
```

If `never` is shown in brackets (i.e., it is the active setting), enable Transparent HugePages with the following command (requires root):

```bash
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
```
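
Since this sysfs setting typically does not survive a reboot, a small guard such as the following sketch (assuming root access via `sudo`) can be added to a container entrypoint or SLURM prolog:

```bash
# Switch THP to madvise mode only when it is currently disabled.
if grep -q '\[never\]' /sys/kernel/mm/transparent_hugepage/enabled; then
    echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
fi
```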

Refer to the Troubleshooting and FAQ section of the Disaggregated-Service documentation for related issues.

## References

For detailed implementation examples and advanced usage, see the subdirectories:

- [ep_load_balancer](./ep_load_balancer/): offline and online EP load balancer examples
- [slurm_scripts](./slurm_scripts/): scripts that generate configuration files and submit Wide-EP jobs to SLURM clusters