# Layer-wise Benchmarks
This tool profiles individual layers of LLMs to help understand the performance characteristics of each layer and to compare layer-wise benchmarks with end-to-end profiling results.
## Generate profiles

### Run with OpenMPI
Step 1: Start a container using Docker, Enroot, or other container runtimes. Please refer to `../../jenkins/current_image_tags.properties` for the Docker image URI.

Step 2: In the container, install `tensorrt_llm`:

```bash
pip install -e ../..
```

Step 3: In the container, run benchmarks and generate profiles:
```bash
# Run DeepSeek-R1 NVFP4
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml
# Run with weights loaded (requires a local model directory)
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --model "$LLM_MODELS_ROOT/DeepSeek-R1/DeepSeek-R1-0528-FP4-v2" --load-format AUTO
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --model "$LLM_MODELS_ROOT/DeepSeek-R1/DeepSeek-R1-0528-FP4-v2" --load-format AUTO
# Run DeepSeek-V3.2-Exp
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --model deepseek-ai/DeepSeek-V3.2-Exp --tokens-per-block 64 --moe-backend DEEPGEMM
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --model deepseek-ai/DeepSeek-V3.2-Exp --tokens-per-block 64 --moe-backend DEEPGEMM --moe-backend-for-prefill DEEPGEMM
# Run DeepSeek-V3.2-Exp with 32k context length
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --model deepseek-ai/DeepSeek-V3.2-Exp --tokens-per-block 64 --moe-backend DEEPGEMM --batch-size 1 --seq-len-q 32769
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --model deepseek-ai/DeepSeek-V3.2-Exp --tokens-per-block 64 --moe-backend DEEPGEMM --moe-backend-for-prefill DEEPGEMM --seq-len-kv-cache 32769
# Run with attention TP
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --no-enable-attention-dp
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --no-enable-attention-dp
# Run with attention TP and TRTLLMGen
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --no-enable-attention-dp --moe-backend TRTLLM
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --no-enable-attention-dp --moe-backend TRTLLM
# Run with MTP3
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --batch-size 32 --seq-len-q 4
# Run 4 layers
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --layer-indices 5,6,7,8
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --layer-indices 5,6,7,8
# Scale DEP=16 to 4 GPUs: reduces the number of experts; uses MNNVL A2A if applicable
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --scaled-from 16 --moe-backend WIDEEP
# Scale TEP=16 to 4 GPUs: reduces the number of attention heads and experts
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --scaled-from 16 --no-enable-attention-dp
# Run Nemotron-3-Nano
NP=1 ./mpi_launch.sh ./run.sh config_ctx.yaml --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 --layer-indices 4,5,6 --mamba-ssm-cache-dtype float16
NP=1 ./mpi_launch.sh ./run.sh config_gen.yaml --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 --layer-indices 4,5,6 --mamba-ssm-cache-dtype float16
# Run Qwen3-Next
NP=2 ./mpi_launch.sh ./run.sh config_ctx.yaml --model Qwen/Qwen3-Next-80B-A3B-Instruct --layer-indices 6,7 --no-enable-attention-dp --mamba-ssm-cache-dtype float16 --batch-size 4
NP=2 ./mpi_launch.sh ./run.sh config_gen.yaml --model Qwen/Qwen3-Next-80B-A3B-Instruct --layer-indices 6,7 --no-enable-attention-dp --mamba-ssm-cache-dtype float16 --batch-size 512
# Run with DeepEP A2A
NP=4 ./mpi_launch.sh -x TRTLLM_FORCE_ALLTOALL_METHOD=DeepEP ./run.sh config_ctx.yaml --moe-backend WIDEEP
NP=4 ./mpi_launch.sh -x TRTLLM_FORCE_ALLTOALL_METHOD=DeepEP ./run.sh config_gen.yaml --moe-backend WIDEEP
# Run with imbalanced ranks: in addition to activating all experts, the specified ratio of tokens is sent to rank 0
# Note: if balance ratio is 0, the "activate all experts" behavior is not applied
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --balance-method ImbalancedRanks --balance-ratio 0.5
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --balance-method ImbalancedRanks --balance-ratio 0.5
# Run with imbalanced experts and balanced ranks: in addition to activating all experts, the specified ratio of tokens is sent to the front experts on each rank
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --balance-method ImbalancedExperts --balance-ratio 0.5
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --balance-method ImbalancedExperts --balance-ratio 0.5
```
### Run with Slurm
Tips:
- If you have a running Slurm job, you can set the environment variable by running `export SLURM_JOB_ID=<job_id>` and skip Step 1.
- Further, if you have already installed `tensorrt_llm` in the Slurm job, you can also skip Step 2. Just run Step 3 with `export CONTAINER_NAME=<name>` specified. If you don't know the container name, run `export CONTAINER_NAME=$(./slurm_query_container_name.sh)` to get it.
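For example, reusing an existing allocation and its container could look like the following sketch (the job ID is a placeholder, and it assumes `tensorrt_llm` is already installed in that container):

```bash
# Reuse an existing Slurm job and its container; <job_id> is a placeholder
export SLURM_JOB_ID=<job_id>
export CONTAINER_NAME=$(./slurm_query_container_name.sh)
# With both variables set, skip Steps 1 and 2 and go directly to Step 3
```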
Step 1: On the controller node, allocate one or more nodes and export `SLURM_JOB_ID`:

```bash
export SLURM_JOB_ID=$(NODES=4 TIME=02:00:00 ./slurm_alloc.sh)
```

Please set the variables in `./slurm_alloc.sh` before running.
Step 2: Start a container and install `tensorrt_llm`. Run the following command on the controller node:

```bash
./slurm_init_containers.sh
```

This script uses the image recorded in `../../jenkins/current_image_tags.properties`. The image is downloaded to `../../enroot/` only once.

Tip: If you want to change the image, there is no need to reallocate the Slurm job. Just start another container by running Step 2 with `export CONTAINER_NAME=<new_name>` set; Step 3 will then run in the container specified by the `CONTAINER_NAME` environment variable.
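As a sketch (the container name below is illustrative, and how the new image is selected depends on your `slurm_init_containers.sh` settings), switching containers without reallocating might look like:

```bash
# Start another container under the same Slurm job; the name is illustrative
export CONTAINER_NAME=trtllm_new_image
./slurm_init_containers.sh
# Subsequent Step 3 launches now run inside the container named above
```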
(Optional) Get an interactive shell
```bash
NODES=1 NP=1 ./slurm_launch.sh --overlap --pty middleware/exclude_slurm_envs bash
```

The `--overlap` option allows this shell to share the node with other jobs. The middleware enables nested MPI process spawning from within Slurm jobs.

You may compile C++ extensions in the interactive shell:

```bash
cd ../..
export CCACHE_DIR=$(realpath cpp/.ccache)
python3 scripts/build_wheel.py --cuda_architectures native --no-venv --skip_building_wheel -G Ninja --use_ccache --clean
```
Step 3: Run benchmarks to generate profiles. Run the following command on the controller node, where `NODES` ≤ the number of allocated nodes:

```bash
# Run DeepSeek-R1 NVFP4 with wide EP; uses MNNVL A2A if applicable
NODES=4 NP=16 ./slurm_launch.sh ./run.sh config_gen.yaml --moe-backend WIDEEP
# Run with TRTLLMGen
NODES=4 NP=16 ./slurm_launch.sh ./run.sh config_gen.yaml --moe-backend TRTLLM
# Run with DeepEPLowLatency
NODES=4 NP=16 TRTLLM_FORCE_ALLTOALL_METHOD=DeepEPLowLatency ./slurm_launch.sh ./run.sh config_gen.yaml --moe-backend WIDEEP
# You can run 4-GPU and 8-GPU tasks without reallocating the Slurm job
NODES=1 NP=4 ./slurm_launch.sh ./run.sh config_ctx.yaml
NODES=2 NP=8 ./slurm_launch.sh ./run.sh config_gen.yaml
```
### Batched run

By specifying a list for `--batch-size` on the command line (or `batch_size` in the YAML file), the script runs multiple configurations in a single process. This significantly reduces the total runtime because it avoids repeated library and model initialization.

Supported list arguments:

- `--batch-size` (or `batch_size` in YAML)
- `--seq-len-q` (or `seq_len_q` in YAML)
- `--seq-len-kv-cache` (or `seq_len_kv_cache` in YAML)
- `--balance-ratio` (or `balance_ratio` in YAML)

Command-line arguments are comma-separated, for example, `--batch-size 1,2,4`. Values in the YAML file are lists, for example, `batch_size: [1, 2, 4]`.
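For instance, a sweep can be written directly in the YAML config. The excerpt below only shows the list-capable keys named above, with illustrative values, and assumes the rest of the file stays as shipped:

```yaml
# Illustrative excerpt: list values for a batched run (other keys omitted)
batch_size: [1, 2, 4]
seq_len_q: [1024, 8192]
```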
Run with OpenMPI:

```bash
NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --batch-size 1,2,4 --seq-len-q 1024,8192
NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml --scaled-from 16 --moe-backend WIDEEP --batch-size 32,64,128,256,512 --seq-len-q 1,2,3,4
```
## Parse profiles

Run the following command in the container:

```bash
# Parse the profile at the default directory
python3 parse.py --world-size 4
# Specify the file path
python3 parse.py --file-path profiles/report_np4_rank0.nsys-rep
python3 parse.py --profile-dir ./profiles --world-size 4 --rank 0
# Parse a specific module. The module must appear exactly once in each run.
python3 parse.py --world-size 4 --module MoE
```
You will receive four reports, each containing kernel timing statistics grouped by module:

- A printed report on stdout
- A CSV report at `profiles/report_np4_rank0.csv`
- An HTML report at `profiles/report_np4_rank0.html`
- A JSON report at `profiles/report_np4_rank0.json` (for correlation analysis)
## Performance alignment between end-to-end profiling and layer-wise benchmarks

A complete example can be found in `sample_performance_alignment.sh`. Below is an overview of the main steps.
1. Run end-to-end serving in COLLECT mode and capture nsys profiles. This step generates a calibration file.

   Requirements:

   - Add the following fields to `config.yaml`:

     ```yaml
     layer_wise_benchmarks_config:
         calibration_mode: COLLECT
         calibration_file_path: profiles/calibration_data.json
     ```

   - Set `TLLM_PROFILE_START_STOP` to a range that captures some iterations (typically tens of iterations) of the GEN phase. Ensure that every iteration has the same batch size. Capture 5 extra iterations at the beginning, because the first 5 iterations are treated as warm-ups and will be dropped by the parser by default.

   - Capture per-rank nsys profiles; each rank should produce a separate file.

     Place `nsys profile` after `mpirun` or `srun`. To minimize profiling overhead and file size, there is no need to capture samples or GPU metrics.

     If you use `trtllm-serve` or `trtllm-bench`, use the following command order. If you use `examples/disaggregated/slurm/benchmark/submit.py`, setting `gen_profile_range` is sufficient.

     ```bash
     NP=$NP ./mpi_launch.sh middleware/mpi_env_from_ompi \
         nsys profile \
         -t cuda,nvtx \
         --cpuctxsw none --cuda-event-trace false \
         --cuda-graph-trace node \
         -c cudaProfilerApi --capture-range-end stop \
         -o profiles/report_e2e_collect_rank%q{RANK}.nsys-rep \
         --force-overwrite true \
         trtllm-llmapi-launch \
         trtllm-bench \
         --model ...
     ```

   - For more accurate results, set the same `TLLM_AUTOTUNER_CACHE_PATH` for all steps. The autotuner cache file should be generated in Step 1 and reused in Steps 2 and 3.
2. If the end-to-end serving uses CUDA Graphs, run Step 1 again in MARK mode without CUDA Graphs and capture nsys profiles.

   The differences from Step 1 are as follows:

   - Add the following fields to `config.yaml`:

     ```yaml
     cuda_graph_config: null
     layer_wise_benchmarks_config:
         calibration_mode: MARK
     ```

   - Change the paths of the profiles. The recommended argument is `-o profiles/report_e2e_mark_rank%q{RANK}.nsys-rep`.
3. Run layer-wise benchmarks with the calibration file obtained in Step 1.

   ```bash
   NP=4 ./mpi_launch.sh ./run.sh config_gen.yaml \
       --model "$LLM_MODELS_ROOT/DeepSeek-R1/DeepSeek-R1-0528-FP4-v2" \
       --load-format AUTO \
       --layer-indices 5,6,7 \
       --batch-size 32 \
       --seq-len-q 1 \
       --seq-len-kv-cache 2090 \
       --balance-method NotModified \
       --replay-file-path profiles/calibration_data.json \
       --replay-start 47 \
       --replay-stop 67
   ```

   Argument explanations:

   | Argument/Parameter | Explanation |
   |---|---|
   | `NP=4` | Should match the end-to-end run. |
   | `--load-format AUTO` | Instructs the benchmark to load model weights instead of using random weights. |
   | `--layer-indices 5,6,7` | A list of contiguous layers to calibrate. |
   | `--batch-size 32` | Should match the end-to-end run. |
   | `--seq-len-q 1` | Should match (1 + MTP) of the end-to-end run. |
   | `--seq-len-kv-cache 2090` | An estimate of the average context length for the captured iterations. The first 5 iterations should be excluded from this estimate because they will be dropped by the parser. |
   | `--replay-file-path` | The calibration file obtained from Step 1. |
   | `--replay-start` and `--replay-stop` | Should match the end-to-end `TLLM_PROFILE_START_STOP`. Do not replay the first 5 iterations because they will be dropped by the parser. |
4. Parse end-to-end profiles with `parse_e2e.py`, and parse layer-wise benchmark profiles with `parse.py`.

   ```bash
   seq 0 $((NP - 1)) | xargs -I% python3 parse_e2e.py \
       --eager-trace profiles/report_e2e_mark_rank%.nsys-rep \
       --graph-trace profiles/report_e2e_collect_rank%.nsys-rep \
       --layer-indices 5,6,7 \
       --warmup-times 5 \
       -o profiles/report_e2e_collect_rank%.json

   seq 0 $((NP - 1)) | xargs -I% python3 parse.py \
       --world-size $NP \
       --rank %
   ```
5. Run `correlation.py` to generate the correlation report.

   ```bash
   python3 correlation.py \
       --reference profiles/report_e2e_collect_rank0.json \
       $(seq 1 $((NP - 1)) | xargs -I% echo "--target profiles/report_e2e_collect_rank%.json") \
       $(seq 0 $((NP - 1)) | xargs -I% echo "--target profiles/report_np${NP}_rank%.json") \
       -o profiles/correlation.html
   ```

   The report can be found at `profiles/correlation.html`.
Limitations:
- Pipeline parallelism is not supported.
- Only the CUTLASS and WIDEEP MoE backends are supported.
- Only tested with the GEN phase and attention DP.
## Developer utilities

- Reduce startup time when debugging a model
  - Set the autotuner cache or disable the autotuner
    - Set the autotuner cache: set the `TLLM_AUTOTUNER_CACHE_PATH=autotuner_cache/cache` environment variable. Use this at your own risk; you may need to delete the cache if `NP` changes or the code changes.
    - Disable the autotuner: add the `--no-enable-autotuner` option.
  - Disable nsys profiling: set the `PROFILE=0` environment variable.
- Capture more information
  - Enable GPU metrics: set the `GPU_METRICS=1` environment variable.
  - Enable backtrace: set the `BACKTRACE=1` environment variable.
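As an illustration, a quick debug run might combine several of these switches. The combination below is only an example; depending on how `mpi_launch.sh` forwards environment variables, you may need to pass them with `-x` as in the DeepEP example above:

```bash
# Example debug run: skip nsys profiling and autotuning to shorten startup
PROFILE=0 NP=4 ./mpi_launch.sh ./run.sh config_ctx.yaml --no-enable-autotuner
```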
## Troubleshooting

- Error `fp8 blockscale gemm only support Hopper` on Blackwell.

  The default MoE backend "CUTLASS" does not support FP8 weights. Please choose the same MoE backend as your end-to-end config. A typical solution is to add the `--moe-backend DEEPGEMM` (or `TRTLLM`, `WIDEEP`) and `--moe-backend-for-prefill DEEPGEMM` (or `WIDEEP`) options.

- Error `huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/nvidia/DeepSeek-R1-0528-FP4-v2/resolve/main/config.json`.

  Please use a local model through the `--model` option, or follow Hugging Face's instructions: "We had to rate limit your IP. To continue using our service, create a HF account or login to your existing account, and make sure you pass a HF_TOKEN if you're using the API."