
# Wide Expert Parallelism (Wide-EP) in TensorRT-LLM

TensorRT-LLM's Wide Expert Parallelism (Wide-EP) feature enables efficient inference of large-scale Mixture-of-Experts (MoE) models by scaling expert parallelism beyond traditional limits. This feature addresses the inherent workload imbalance challenges in large-scale MoE models and provides both offline and online load balancing capabilities.

## Overview

Large-scale MoE models like DeepSeek-V3/R1, LLaMA4, and Qwen3 use fine-grained expert designs that introduce new challenges for inference systems:

- High memory demands for expert weights
- Inherent expert-level workload imbalance due to sparse execution patterns
- Communication overhead in distributed expert parallelism

Wide-EP solves these challenges through:

- Custom EP communication kernels optimized for NVIDIA GB200 Multi-Node NVLink (MNNVL)
- Expert Parallelism Load Balancer (EPLB) with both offline and online modes
- Dynamic expert placement and replication strategies
- Layer-wise weight redistribution to minimize inference disruption

## Quick Start

### 1. Configurations

An example YAML file to enable Wide-EP:

```yaml
moe_config:
    backend: WIDEEP
    max_num_tokens: 9216
    load_balancer: moe_load_balancer.yaml # (optional) enable load balancer
```

| Parameter | Description | Default | Notes |
|---|---|---|---|
| `backend` | MoE backend type | `CUTLASS` | Set to `WIDEEP` to enable Wide-EP |
| `max_num_tokens` | If set, at most `max_num_tokens` tokens will be sent to `torch.ops.trtllm.fused_moe` at the same time. | `None` | If the number of tokens exceeds `max_num_tokens`, the input tensors will be split into chunks and processed in a for loop. |
| `load_balancer` | Configuration for MoE load balancing | `None` | Set to the path of the load balancer YAML file |
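
This YAML is passed to the serving or benchmarking entry point as extra LLM API options. The command below is only a minimal sketch, not taken from this README: the config file name `wide_ep_config.yaml`, the model, and the `tp_size`/`ep_size` values are placeholders, assuming the `moe_config` snippet above has been saved to that file.

```bash
# Minimal sketch: launch trtllm-serve with the Wide-EP options above.
# Assumes the moe_config snippet is saved as wide_ep_config.yaml (placeholder name);
# the model and parallelism sizes are illustrative values only.
trtllm-serve deepseek-ai/DeepSeek-R1 \
    --tp_size 8 \
    --ep_size 8 \
    --extra_llm_api_options wide_ep_config.yaml
```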

#### Load Balancer Configuration

An example `moe_load_balancer.yaml` file to configure the online EP load balancer:

```yaml
num_slots: 288
layer_updates_per_iter: 1
```

| Parameter | Description | Default | Notes |
|---|---|---|---|
| `num_slots` | Total number of expert slots | `None` | Must be ≥ total number of experts |
| `layer_updates_per_iter` | Number of layers updated per iteration | `0` | `0` = offline, `>0` = online |
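
As a worked example (an assumed deployment, not taken from this README): DeepSeek-R1 routes over 256 experts per MoE layer, so `num_slots: 288` keeps one slot per original expert and leaves 32 slots for replicas of heavily loaded experts; spread over 36 EP ranks, that is 8 slots per rank.

```yaml
# Worked example (assumed deployment): DeepSeek-R1 with 256 routed experts per
# MoE layer on 36 EP ranks -> 288 slots total = 8 slots per rank,
# i.e. 256 original experts + 32 replica slots for hot experts.
num_slots: 288
layer_updates_per_iter: 1   # online EPLB: redistribute one MoE layer's weights per iteration
```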

Refer to the `ep_load_balancer` directory for more details on the EP load balancer.

### 2. Execute Wide-EP on SLURM Clusters

Refer to the `slurm_scripts` directory, which reuses the disaggregated SLURM scripts to automatically generate configuration files and submit jobs to SLURM clusters.
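
A submission typically amounts to filling in the cluster-specific variables in those scripts and calling `sbatch`. The sketch below is illustrative only: the script name, partition, account, and node count are placeholders, and the real entry points live in `slurm_scripts/`.

```bash
# Illustrative only: script name, partition, account, and node count are placeholders.
# See slurm_scripts/ for the actual entry points and required variables.
sbatch --nodes=8 --partition=<your_partition> --account=<your_account> \
    slurm_scripts/<your_generated_submit_script>.sh
```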

## Troubleshooting

### Transparent HugePages failure

If you get the exception `madvise(MADV_HUGEPAGE) failed.`, check whether Transparent HugePages is enabled:

```bash
>$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
>$ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise [madvise] never
```

If `never` is the active setting (shown in square brackets), enable Transparent HugePages with the following command (root privileges are required):

```bash
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
```
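
Note that writing to this sysfs file requires root, and a plain shell redirection does not survive `sudo`; a `tee`-based variant (a sketch, not from this README) avoids that and lets you verify the change:

```bash
# The redirection above needs a root shell; with sudo, use tee so the write
# happens inside the privileged process.
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
# Verify: 'madvise' should now appear in brackets.
cat /sys/kernel/mm/transparent_hugepage/enabled
```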

Refer to the Troubleshooting and FAQ section of Disaggregated-Service.

## References

For detailed implementation examples and advanced usage, see the subdirectories:

- `ep_load_balancer/`: tools and examples for the EP load balancer
- `slurm_scripts/`: scripts for generating configurations and submitting Wide-EP jobs to SLURM clusters