# Layer-wise Benchmarks

## Generate profiles

### Run with OpenMPI
Step 1: Start a container using Docker, Enroot, or another container runtime. Please refer to `../../jenkins/current_image_tags.properties` for the Docker image URI.
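For example, with Docker (a minimal sketch; the mount path and image URI are placeholders, not taken from this repository):

```bash
# Replace <docker-image-uri> with the URI from ../../jenkins/current_image_tags.properties,
# and <path-to-TensorRT-LLM> with your local checkout of the repository.
docker run --rm -it --gpus all --ipc=host \
    -v <path-to-TensorRT-LLM>:/workspace/TensorRT-LLM \
    -w /workspace/TensorRT-LLM \
    <docker-image-uri>
```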
Step 2: In the container, install `tensorrt_llm`:

```bash
pip install -e ../..
```
Step 3: In the container, run benchmarks and generate profiles:

```bash
# Run DeepSeek-R1 NVFP4
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml
# Run DeepSeek-V3.2-Exp
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --model deepseek-ai/DeepSeek-V3.2-Exp --tokens-per-block 64 --moe-backend DEEPGEMM
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --model deepseek-ai/DeepSeek-V3.2-Exp --tokens-per-block 64 --moe-backend DEEPGEMM
# Run DeepSeek-V3.2-Exp with 32k context length
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --model deepseek-ai/DeepSeek-V3.2-Exp --tokens-per-block 64 --max-seq-len $((32768 + 1024 + 4)) --moe-backend DEEPGEMM --batch-size 1 --seq-len-q 32769
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --model deepseek-ai/DeepSeek-V3.2-Exp --tokens-per-block 64 --max-seq-len $((32768 + 1024 + 4)) --moe-backend DEEPGEMM --seq-len-kv-cache 32769
# Run with attention TP
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --no-enable-attention-dp
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --no-enable-attention-dp
# Run with attention TP and TRTLLMGen
NP=4 ./mpi_launch.sh -x TRTLLM_ENABLE_PDL=1 ./run.sh config_ctx.yaml --no-enable-attention-dp --moe-backend TRTLLM --balance-method NotModified
NP=4 ./mpi_launch.sh -x TRTLLM_ENABLE_PDL=1 ./run.sh config_gen.yaml --no-enable-attention-dp --moe-backend TRTLLM --balance-method NotModified
# Run with MTP3
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --batch-size 32 --seq-len-q 4
# Run 4 layers
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --layer-indices 5,6,7,8
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --layer-indices 5,6,7,8
# Scale DEP=16 to 4 GPUs: reduces the number of experts; uses MNNVL A2A if applicable
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --scaled-from 16 --moe-backend WIDEEP
# Scale TEP=16 to 4 GPUs: reduces the number of attention heads and experts
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --scaled-from 16 --no-enable-attention-dp
# Run Qwen3-Next (balanced routing is not implemented)
NP=2 ./mpi_launch.sh ./run.sh config_ctx.yaml --model Qwen/Qwen3-Next-80B-A3B-Instruct --layer-indices 6,7 --no-enable-attention-dp --batch-size 4
NP=2 ./mpi_launch.sh ./run.sh config_gen.yaml --model Qwen/Qwen3-Next-80B-A3B-Instruct --layer-indices 6,7 --no-enable-attention-dp --batch-size 512
# Run with DeepEP A2A
NP=4 ./mpi_launch.sh -x TRTLLM_FORCE_ALLTOALL_METHOD=DeepEP ./run.sh config_ctx.yaml --moe-backend WIDEEP
NP=4 ./mpi_launch.sh -x TRTLLM_FORCE_ALLTOALL_METHOD=DeepEP ./run.sh config_gen.yaml --moe-backend WIDEEP
# Run with imbalanced ranks: aside from activating all experts, a fraction of the tokens (given by --balance-ratio) is sent to the 1st rank
# Note: if the balance ratio is 0, activating all experts is skipped
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --balance-method ImbalancedRanks --balance-ratio 0.5
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --balance-method ImbalancedRanks --balance-ratio 0.5
# Run with imbalanced experts and balanced ranks: aside from activating all experts, a fraction of the tokens (given by --balance-ratio) is sent to the front experts on each rank
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --balance-method ImbalancedExperts --balance-ratio 0.5
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --balance-method ImbalancedExperts --balance-ratio 0.5
```
### Run with Slurm

Tips:

- If you have a running Slurm job, please skip step 1 and go straight to steps 2 and 3.
- Further, if you have already installed `tensorrt_llm` in the Slurm job, you can also skip step 2 and run step 3 with `export CONTAINER_NAME=aaa` specified. If you don't know the container name, run `export CONTAINER_NAME=$(SLURM_JOB_ID=$SLURM_JOB_ID ./slurm_query_container_name.sh)` to get it.
Step 1: On the controller node, allocate one or multiple nodes and record the `SLURM_JOB_ID`:

```bash
SLURM_JOB_ID=$(NODES=4 TIME=02:00:00 ./slurm_alloc.sh)
```

Please fill in the variables in `./slurm_alloc.sh`.
Step 2: Start a container and install `tensorrt_llm`. Run the following command on the controller node:

```bash
SLURM_JOB_ID=$SLURM_JOB_ID ./slurm_init_containers.sh
```

It uses the image recorded in `../../jenkins/current_image_tags.properties`. The image is downloaded to `../../enroot/` only once.
Tips: If you want to change the image, there is no need to reallocate the Slurm job. Just start another container by running step 2 with `export CONTAINER_NAME=aaa`, and step 3 will run in the container specified by the `CONTAINER_NAME` environment variable, as sketched below.
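A minimal sketch of that workflow, assuming `aaa` stands in for your chosen container name and a single-node 4-GPU run follows:

```bash
# Start a second container named "aaa" inside the already-allocated Slurm job
export CONTAINER_NAME=aaa
SLURM_JOB_ID=$SLURM_JOB_ID ./slurm_init_containers.sh

# Subsequent launches (step 3) run inside the container named by CONTAINER_NAME
SLURM_JOB_ID=$SLURM_JOB_ID NODES=1 NP=4 ./slurm_launch.sh ./run.sh config_ctx.yaml
```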
Step 3: Run benchmarks to generate profiles. Run the following command on the controller node, where `NODES` ≤ the number of allocated nodes:
```bash
# Run DeepSeek-R1 NVFP4 with wide EP: uses MNNVL A2A if applicable
SLURM_JOB_ID=$SLURM_JOB_ID NODES=4 NP=16 ./slurm_launch.sh ./run.sh config_gen.yaml --moe-backend WIDEEP
# Run with TRTLLMGen
SLURM_JOB_ID=$SLURM_JOB_ID NODES=4 NP=16 TRTLLM_ENABLE_PDL=1 ./slurm_launch.sh ./run.sh config_gen.yaml --moe-backend TRTLLM
# Run with DeepEPLowLatency
SLURM_JOB_ID=$SLURM_JOB_ID NODES=4 NP=16 TRTLLM_FORCE_ALLTOALL_METHOD=DeepEPLowLatency ./slurm_launch.sh ./run.sh config_gen.yaml --moe-backend WIDEEP
# You can run 4-GPU and 8-GPU tasks without reallocating the Slurm job
SLURM_JOB_ID=$SLURM_JOB_ID NODES=1 NP=4 ./slurm_launch.sh ./run.sh config_ctx.yaml
SLURM_JOB_ID=$SLURM_JOB_ID NODES=2 NP=8 ./slurm_launch.sh ./run.sh config_ctx.yaml
```
### Batched run

By specifying a list for `--batch-size` on the command line (or `batch_size` in the YAML file), the script runs multiple configurations in a single process. This significantly reduces the total runtime because it avoids repeated library and model initialization.

Supported list arguments:

- `--batch-size` (or `batch_size` in YAML)
- `--seq-len-q` (or `seq_len_q` in YAML)
- `--seq-len-kv-cache` (or `seq_len_kv_cache` in YAML)
- `--balance-ratio` (or `balance_ratio` in YAML)

Command-line arguments are comma separated, for example, `--batch-size 1,2,4`. Configs in the YAML file are lists, for example, `batch_size: [1, 2, 4]`.
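A YAML config using list values might then contain something like the following (a hypothetical excerpt; only the list-typed keys named above are documented here, the remaining contents of `config_gen.yaml` are not shown):

```yaml
# Hypothetical excerpt: each combination of list entries becomes one benchmarked configuration
batch_size: [32, 64, 128, 256, 512]
seq_len_q: [1, 2, 3, 4]
seq_len_kv_cache: [4096]
balance_ratio: [0.5, 1.0]
```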
Run with OpenMPI:

```bash
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --batch-size 1,2,4 --seq-len-q 1024,8192
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --scaled-from 16 --moe-backend WIDEEP --batch-size 32,64,128,256,512 --seq-len-q 1,2,3,4
```
## Parse profiles

Run the following command in the container:

```bash
python3 parse.py --world-size 4
# Specify the location of the .nsys-rep file
python3 parse.py --profile-dir ./profiles --world-size 4 --rank 0
# Parse a specific module. The module must appear exactly once in each run.
python3 parse.py --world-size 4 --module MoE
```
You will receive three reports, each containing kernel timing statistics grouped by module:

- A printed report on stdout
- A CSV report at `profiles/report_np4_rank0.csv`
- An HTML report at `profiles/report_np4_rank0.html`
## Troubleshooting

- Error `fp8 blockscale gemm only support Hopper` on Blackwell: the default MoE backend "CUTLASS" does not support FP8 weights. Please choose the same MoE backend as your end-to-end config. A typical choice is adding the `--moe-backend DEEPGEMM`, `--moe-backend TRTLLM`, or `--moe-backend WIDEEP` option.
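For example, rerunning the context benchmark with one of the suggested backends (this simply combines the commands above with the suggested option):

```bash
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --moe-backend DEEPGEMM
```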