
# Wide Expert Parallelism (Wide-EP) in TensorRT-LLM

TensorRT-LLM's Wide Expert Parallelism (Wide-EP) feature enables efficient inference of large-scale Mixture-of-Experts (MoE) models by scaling expert parallelism beyond traditional limits. This feature addresses the inherent workload imbalance challenges in large-scale MoE models and provides both offline and online load balancing capabilities.

## Overview

Large-scale MoE models like DeepSeek-V3/R1, LLaMA4, and Qwen3 use fine-grained expert designs that introduce new challenges for inference systems:

- High memory demands for expert weights
- Inherent expert-level workload imbalance due to sparse execution patterns
- Communication overhead in distributed expert parallelism

Wide-EP solves these challenges through:

- Custom EP communication kernels optimized for NVIDIA GB200 Multi-Node NVLink (MNNVL)
- Expert Parallelism Load Balancer (EPLB) with both offline and online modes
- Dynamic expert placement and replication strategies
- Layer-wise weight redistribution to minimize inference disruption

## Quick Start

### 1. Configurations

An example YAML file to enable Wide-EP:

```yaml
moe_config:
    backend: WIDEEP
    max_num_tokens: 9216
    load_balancer: moe_load_balancer.yaml # (optional) enable load balancer
```

| Parameter | Description | Default | Notes |
|---|---|---|---|
| `backend` | MoE backend type | `CUTLASS` | Set to `WIDEEP` to enable Wide-EP |
| `max_num_tokens` | If set, at most `max_num_tokens` tokens will be sent to `torch.ops.trtllm.fused_moe` at the same time. | `None` | If the number of tokens exceeds `max_num_tokens`, the input tensors will be split into chunks and processed in a for loop. |
| `load_balancer` | Configuration for MoE load balancing | `None` | Set to the path of the load balancer YAML file |
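
This YAML is passed to the serving or benchmarking entry point as extra LLM API options. The command below is only a minimal sketch, not taken from this README: the config file name `wide_ep_config.yaml`, the model, and the `tp_size`/`ep_size` values are placeholders, assuming the `moe_config` snippet above has been saved to that file.

```bash
# Minimal sketch: launch trtllm-serve with the Wide-EP options above.
# Assumes the moe_config snippet is saved as wide_ep_config.yaml (placeholder name);
# the model and parallelism sizes are illustrative values only.
trtllm-serve deepseek-ai/DeepSeek-R1 \
    --tp_size 8 \
    --ep_size 8 \
    --extra_llm_api_options wide_ep_config.yaml
```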

#### Load Balancer Configuration

An example `moe_load_balancer.yaml` file to configure the online EP load balancer:

```yaml
num_slots: 288
layer_updates_per_iter: 1
```

| Parameter | Description | Default | Notes |
|---|---|---|---|
| `num_slots` | Total number of expert slots | `None` | Must be ≥ total number of experts |
| `layer_updates_per_iter` | Number of layers updated per iteration | `0` | `0` = offline, `>0` = online |
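
As a worked example (an assumed deployment, not taken from this README): DeepSeek-R1 routes over 256 experts per MoE layer, so `num_slots: 288` keeps one slot per original expert and leaves 32 slots for replicas of heavily loaded experts; spread over 36 EP ranks, that is 8 slots per rank.

```yaml
# Worked example (assumed deployment): DeepSeek-R1 with 256 routed experts per
# MoE layer on 36 EP ranks -> 288 slots total = 8 slots per rank,
# i.e. 256 original experts + 32 replica slots for hot experts.
num_slots: 288
layer_updates_per_iter: 1   # online EPLB: redistribute one MoE layer's weights per iteration
```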

Refer to the `ep_load_balancer` directory for more details on the EP load balancer.

### 2. Execute Wide-EP on SLURM Clusters

Refer to the `slurm_scripts` directory, which reuses the disaggregated SLURM scripts to automatically generate configuration files and submit jobs to SLURM clusters.
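
A submission typically amounts to filling in the cluster-specific variables in those scripts and calling `sbatch`. The sketch below is illustrative only: the script name, partition, account, and node count are placeholders, and the real entry points live in `slurm_scripts/`.

```bash
# Illustrative only: script name, partition, account, and node count are placeholders.
# See slurm_scripts/ for the actual entry points and required variables.
sbatch --nodes=8 --partition=<your_partition> --account=<your_account> \
    slurm_scripts/<your_generated_submit_script>.sh
```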

## Troubleshooting

### Transparent HugePages failure

If you get the exception `madvise(MADV_HUGEPAGE) failed.`, check whether Transparent HugePages is enabled:

```bash
>$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
>$ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise [madvise] never
```

If `never` is the active setting (shown in square brackets), enable Transparent HugePages with the following command (root privileges are required):

```bash
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
```
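
Note that writing to this sysfs file requires root, and a plain shell redirection does not survive `sudo`; a `tee`-based variant (a sketch, not from this README) avoids that and lets you verify the change:

```bash
# The redirection above needs a root shell; with sudo, use tee so the write
# happens inside the privileged process.
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
# Verify: 'madvise' should now appear in brackets.
cat /sys/kernel/mm/transparent_hugepage/enabled
```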

Refer to the Troubleshooting and FAQ section of Disaggregated-Service.

## References

For detailed implementation examples and advanced usage, see the subdirectories:

- `ep_load_balancer/`: tools and examples for the EP load balancer
- `slurm_scripts/`: scripts for generating configurations and submitting Wide-EP jobs to SLURM clusters