Parallelism in TensorRT LLM
Parallelism across multiple GPUs becomes necessary when either
- the model cannot fit in a single GPU’s memory, or
- a single GPU cannot deliver the desired performance.
TensorRT LLM supports multiple parallelism strategies for deployment on both single and multiple nodes:
- Tensor Parallel (TP) - Shards model weights across GPUs
- Pipeline Parallel (PP) - Distributes model layers across GPUs
- Data Parallel (DP) - Replicates model across GPUs for different requests
- Expert Parallel (EP) - Distributes experts across GPUs for MoE models
- Context Parallel (CP) - Distributes context processing across GPUs
- Wide Expert Parallel (Wide-EP) - Advanced EP with load balancing for large-scale MoE models
Overview of Parallelism Strategies
Tensor Parallelism (TP)
Tensor parallelism splits the model weights across multiple GPUs. Each GPU holds a portion of the weights and processes the same input tokens, with results combined through communication.
Best for: Small batch sizes, memory-constrained scenarios
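As a rough illustration of the idea (a NumPy sketch with made-up dimensions, not TensorRT LLM code), a single linear layer's weight matrix can be column-sharded across two GPUs; each GPU multiplies the same input by its shard, and a communication step reassembles the full output:
```python
import numpy as np

# Illustrative only: one linear layer, weights column-sharded across 2 "GPUs".
x = np.random.randn(4, 8)            # the same input tokens are seen by every rank
w = np.random.randn(8, 6)            # full weight matrix of the layer
w_shards = np.split(w, 2, axis=1)    # each rank stores half of the output columns

partials = [x @ w_rank for w_rank in w_shards]   # per-rank partial results
y = np.concatenate(partials, axis=1)             # communication step (all-gather)

assert np.allclose(y, x @ w)         # matches the unsharded computation
```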
Pipeline Parallelism (PP)
Pipeline parallelism distributes different layers of the model across multiple GPUs. Each GPU processes a subset of layers, with activations passed between GPUs.
Best for: Large models that don't fit in single GPU memory
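A toy sketch of the idea in plain Python (illustrative only; the layers and stage count are made up): contiguous blocks of layers are assigned to pipeline stages, and each stage's output activations become the next stage's input:
```python
# Illustrative only: 8 "layers" split across 4 pipeline stages (GPUs).
layers = [lambda h, i=i: h + i for i in range(8)]      # stand-ins for transformer layers
num_stages = 4
per_stage = len(layers) // num_stages
stages = [layers[s * per_stage:(s + 1) * per_stage] for s in range(num_stages)]

hidden = 0.0
for stage in stages:            # in a real pipeline each stage runs on its own GPU,
    for layer in stage:         # and the activations are sent to the next GPU here
        hidden = layer(hidden)

assert hidden == sum(range(8))  # same result as running all layers on one device
```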
Data Parallelism (DP)
Data parallelism replicates the entire model across multiple GPUs. Each GPU processes different requests independently.
Best for: Large batch sizes, high throughput scenarios
Expert Parallelism (EP)
Expert parallelism is specifically designed for Mixture of Experts (MoE) models, where different experts are distributed across GPUs.
Best for: MoE models with high expert count
Context Parallelism (CP)
Context parallelism distributes the processing of long sequences across multiple GPUs.
Best for: Long context scenarios
Wide Expert Parallelism (Wide-EP)
Wide-EP is an advanced form of expert parallelism that addresses the inherent workload imbalance in large-scale MoE models through intelligent load balancing and expert replication.
Best for: Large-scale MoE models like DeepSeek-V3/R1, LLaMA4, Qwen3
Module-level Parallelism Guide
Attention Module
TensorRT LLM supports two strategies for attention modules:
- Tensor Parallelism (TP) — best for small batch sizes
- Data Parallelism (DP) — best for large batch sizes
Tensor Parallelism (TP)
- The GEMM weights before and after the attention kernel are evenly sharded across GPUs, as are the attention `num_heads`.
- Exceptions:
  - DeepSeek-R1: the `fused_A` GEMM is not sharded.
  - GQA / MQA / MLA: if `num_heads < tensor_parallel_size`, the KV-cache is replicated on every GPU.
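A small arithmetic sketch of the rules above (plain Python, not TensorRT LLM code), assuming `num_heads` in the GQA/MQA exception refers to the model's KV-head count, and using a hypothetical GQA model with 64 query heads and 8 KV heads:
```python
def attention_tp_layout(num_q_heads, num_kv_heads, tp_size):
    """Per-GPU head counts under attention TP (illustrative arithmetic only)."""
    q_heads_per_gpu = num_q_heads // tp_size
    # Assumed reading of the rule above: when the KV-head count is smaller than
    # tensor_parallel_size, the KV cache is replicated instead of sharded.
    kv_replicated = num_kv_heads < tp_size
    kv_heads_per_gpu = num_kv_heads if kv_replicated else num_kv_heads // tp_size
    return q_heads_per_gpu, kv_heads_per_gpu, kv_replicated

# Hypothetical GQA model: 64 query heads, 8 KV heads.
for tp in (4, 8, 16):
    print(f"TP-{tp}:", attention_tp_layout(64, 8, tp))
# TP-16 -> 8 KV heads < 16 ranks: every GPU keeps a full copy of the KV cache.
```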
Data Parallelism (DP)
- All GEMM weights are replicated on every GPU.
- The KV-cache is partitioned, because different user requests are routed to different DP ranks.
How to Enable Attention Parallelism
To deploy a model with one of the above parallel strategies using trtllm-serve, or to run benchmarking with trtllm-bench, create a YAML configuration file named parallel_config.yaml. The TP-8 and DP-8 settings below are alternatives, so keep only the block you need:
```bash
cat <<EOF > parallel_config.yaml
# TP-8
tensor_parallel_size: 8
enable_attention_dp: false  # default

# DP-8
tensor_parallel_size: 8
enable_attention_dp: true
EOF
```
Then pass --config parallel_config.yaml to trtllm-serve or trtllm-bench.
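The same two settings can also be passed programmatically; the sketch below assumes the Python LLM API accepts `tensor_parallel_size` and `enable_attention_dp` keyword arguments with the same meaning as the YAML fields above:
```python
from tensorrt_llm import LLM

# TP-8 attention (enable_attention_dp defaults to False).
llm_tp = LLM(model="/path/to/model", tensor_parallel_size=8)

# DP-8 attention: attention weights replicated, KV cache partitioned by request.
llm_dp = LLM(model="/path/to/model",
             tensor_parallel_size=8,
             enable_attention_dp=True)
```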
FFN Module
Dense Models
Tensor Parallelism is supported for the FFN layers of dense models.
Mixture of Experts (MoE)
MoE replaces a single FFN with multiple experts. A router selects the top-k experts for each token and dispatches the corresponding hidden states.
TensorRT LLM supports three execution patterns for MoE:
- TP - Every expert's weight matrix is sliced across all GPUs. Each GPU sees all tokens.
- EP - Full weights of each expert reside on a single GPU. Each GPU only sees tokens routed to its local experts.
- Hybrid ETP - Each GPU stores a subset of experts (EP) and shards those weights further (TP), balancing workload and kernel efficiency.
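A toy sketch (plain Python with hypothetical expert counts and FFN dimensions, not TensorRT LLM internals) of how the three patterns divide the expert weights of a 64-expert layer across 8 GPUs:
```python
NUM_EXPERTS, NUM_GPUS = 64, 8
HIDDEN, INTER = 4096, 14336          # hypothetical FFN dimensions

def moe_layout(moe_tp, moe_ep):
    """Experts stored per GPU and the per-GPU shard shape of one expert's weight."""
    assert moe_tp * moe_ep == NUM_GPUS, "moe_tp * moe_ep must equal the GPU count"
    experts_per_gpu = NUM_EXPERTS // moe_ep     # EP dimension: split the expert set
    shard_shape = (HIDDEN, INTER // moe_tp)     # TP dimension: slice each expert's weight
    return experts_per_gpu, shard_shape

print("TP only   :", moe_layout(moe_tp=8, moe_ep=1))   # all 64 experts, 1/8 of each weight
print("EP only   :", moe_layout(moe_tp=1, moe_ep=8))   # 8 experts per GPU, full weights
print("Hybrid ETP:", moe_layout(moe_tp=4, moe_ep=2))   # 32 experts per GPU, 1/4 of each weight
```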
How to Enable MoE Parallelism
To deploy a model with one of the above parallel strategies using trtllm-serve, or to run benchmarking with trtllm-bench, create a YAML configuration file named parallel_config.yaml. The three configurations below are alternatives, so keep only the block you need:
```bash
cat <<EOF > parallel_config.yaml
# TP only
tensor_parallel_size: 8
moe_tensor_parallel_size: 8

# EP only
tensor_parallel_size: 8
moe_expert_parallel_size: 8

# Hybrid (TP-4 × EP-2)
tensor_parallel_size: 8  # 4 × 2
moe_tensor_parallel_size: 4
moe_expert_parallel_size: 2
EOF
```
The product of `moe_tensor_parallel_size` and `moe_expert_parallel_size` must equal `tensor_parallel_size`.
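A small helper (illustrative Python, not part of TensorRT LLM) that enumerates the combinations satisfying this constraint for a given tensor_parallel_size:
```python
def valid_moe_splits(tensor_parallel_size):
    """All (moe_tensor_parallel_size, moe_expert_parallel_size) pairs whose
    product equals tensor_parallel_size."""
    return [(tp, tensor_parallel_size // tp)
            for tp in range(1, tensor_parallel_size + 1)
            if tensor_parallel_size % tp == 0]

print(valid_moe_splits(8))   # [(1, 8), (2, 4), (4, 2), (8, 1)]
```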
Wide Expert Parallelism (Wide-EP)
Wide Expert Parallelism (Wide-EP) is TensorRT LLM's advanced solution for large-scale MoE model inference. It addresses the challenges of traditional expert parallelism through intelligent load balancing and expert replication strategies.
Motivation for Wide-EP
Large-scale MoE models like DeepSeek-V3/R1, LLaMA4, and Qwen3 use fine-grained expert designs that introduce new challenges:
- High memory demands for expert weights
- Inherent expert-level workload imbalance due to sparse execution patterns
- Communication overhead in distributed expert parallelism
- Hot expert problem where certain experts receive significantly more tokens than others
Key Features of Wide-EP
1. Expert Replication and Load Balancing
Wide-EP introduces the concept of expert slots that are decoupled from specific experts. This allows:
- Multiple replicas of hot experts across different GPUs
- Dynamic expert placement based on workload patterns
- Both offline and online load balancing strategies
2. Custom EP Communication Kernels
- Optimized for NVIDIA GB200 Multi-Node NVLink (MNNVL)
- Efficient all-to-all communication for expert dispatch and combine
- Reduced communication overhead compared to traditional EP
3. Expert Parallelism Load Balancer (EPLB)
- Offline EPLB: Pre-computed expert placement based on historical workload statistics
- Online EPLB: Dynamic expert placement that adapts to real-time traffic patterns
- Layer-wise weight redistribution to minimize inference disruption
Architecture Overview
Wide-EP separates the concepts of experts and slots:
- Expert: The concept from the model's perspective (e.g., Expert 0, Expert 1, etc.)
- Slot: The concept from the model engine's perspective (e.g., Slot 0, Slot 1, etc.)
The system maintains a routing table that maps Expert IDs to Slot IDs, which can be updated by the load balancing policy.
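A minimal sketch of this idea (illustrative Python with a hypothetical routing table; it does not reflect TensorRT LLM's internal data structures), showing how decoupling experts from slots lets a hot expert be replicated on two GPUs:
```python
# 4 experts, 6 slots spread over 2 GPUs (3 slots each).
# Expert 2 is "hot", so the load balancer assigns it two slots on different GPUs.
expert_to_slots = {0: [0], 1: [1], 2: [2, 3], 3: [4]}   # slot 5 left unassigned here
slot_to_gpu = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

def dispatch(expert_id, token_idx):
    """Pick one replica of the expert, e.g. by spreading tokens across its slots."""
    slots = expert_to_slots[expert_id]
    slot = slots[token_idx % len(slots)]
    return slot, slot_to_gpu[slot]

# Tokens routed to hot expert 2 alternate between its replicas on GPU 0 and GPU 1.
print([dispatch(2, t) for t in range(4)])   # [(2, 0), (3, 1), (2, 0), (3, 1)]
```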
Best Practices
- Start with offline EPLB for production deployments with known workload patterns
- Use online EPLB for dynamic workloads or when traffic patterns change frequently
- Monitor expert statistics to understand workload distribution
- Tune max_num_tokens based on your memory constraints and EP size
- Test with representative datasets to validate load balancing effectiveness
References
For detailed implementation examples and advanced usage, see:
- examples/wide_ep/: Complete Wide-EP examples
- examples/wide_ep/ep_load_balancer/: Load balancing tools
- examples/wide_ep/slurm_scripts/: Cluster deployment scripts