
(perf-analysis)=
# Performance Analysis
NVIDIA Nsight Systems reports at the application level are highly informative. Metric sampling capabilities have increased over generations and provide a clean middle-ground between timing analysis and kernel-level deep dives with NVIDIA Nsight Compute.
Given the potentially long runtimes of Large Language Models (LLMs) and the diversity of workloads a model may experience during a single inference pass or binary execution, we have added features to TensorRT-LLM to get the most out of Nsight Systems capabilities. This document outlines those features and provides examples of how best to use them to understand your application.
## Feature Descriptions
The main functionality:
* Relies on toggling the CUDA profiler runtime API on and off.
* (PyTorch workflow only) Relies on toggling the PyTorch profiler on and off.
* Provides a means to understand which regions a user may want to focus on.
Toggling the CUDA profiler runtime API on and off:
* Allows users to know specifically what the profiled region corresponds to.
* Results in smaller files to post-process (for metric extraction or similar).
(PyTorch workflow only) Toggling the PyTorch profiler on and off:
* Helps users analyze the performance breakdown within the model.
* Results in smaller files to post-process (for metric extraction or similar).
## Coordinating with NVIDIA Nsight Systems Launch
Consult the Nsight Systems User Guide for a full overview of options.
In the PyTorch workflow, basic NVTX markers are provided by default. In the C++/TensorRT workflow, append `--nvtx` when calling the `scripts/build_wheel.py` script and clean-build the code.
### Only collect specific iterations
To reduce the Nsight Systems profile size and ensure that only specific iterations are collected, set the environment variable `TLLM_PROFILE_START_STOP=A-B`, and append `-c cudaProfilerApi` to the `nsys profile` command.
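For example, a minimal invocation might look like the following sketch, where `my_benchmark.py` stands in for your own workload script:

```shell
# Sketch only: capture just iterations 100-150. With -c cudaProfilerApi,
# nsys defers collection until the runtime calls cudaProfilerStart, which
# TensorRT-LLM triggers at the iteration range given below.
TLLM_PROFILE_START_STOP=100-150 nsys profile \
  -o trace -c cudaProfilerApi \
  python3 my_benchmark.py
```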
### Enable more NVTX markers for debugging
Set environment variable `TLLM_NVTX_DEBUG=1`.
### Enable garbage collection (GC) NVTX markers
Set environment variable `TLLM_PROFILE_RECORD_GC=1`.
### Enable GIL information in NVTX markers
Append `python-gil` to the Nsys `-t` option.
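The debug-oriented toggles above can be combined in a single invocation; this is a sketch, with `my_script.py` standing in for your workload:

```shell
# Sketch: enable debugging NVTX markers, GC markers, and GIL tracing at once.
TLLM_NVTX_DEBUG=1 TLLM_PROFILE_RECORD_GC=1 nsys profile \
  -t cuda,nvtx,python-gil \
  python3 my_script.py
```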
## Coordinating with PyTorch profiler (PyTorch workflow only)
### Collect PyTorch profiler results
1. Set environment variable `TLLM_PROFILE_START_STOP=A-B` to specify the range of the iterations to be collected.
2. Set environment variable `TLLM_TORCH_PROFILE_TRACE=<path>`, and the results will be saved to `<path>`.
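The two steps above can be sketched as a single command line, with `run_inference.py` as an illustrative stand-in for your workload and `/tmp/trace.json` as an example output path:

```shell
# Collect PyTorch profiler results for iterations 100-150 and write
# the trace to /tmp/trace.json (script name and path are illustrative).
TLLM_PROFILE_START_STOP=100-150 \
TLLM_TORCH_PROFILE_TRACE=/tmp/trace.json \
python3 run_inference.py
```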
### Visualize the PyTorch profiler results
Use [chrome://tracing/](chrome://tracing/) to inspect the saved profile.
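For a quick programmatic sanity check before opening the trace, note that the saved file is standard Chrome trace JSON; a small hypothetical helper (not part of TensorRT-LLM) can count events by their `ph` (phase) field:

```python
import json

def count_trace_phases(path):
    """Count Chrome-trace events by their 'ph' (phase) field.

    Handles both trace layouts: a dict with a 'traceEvents' key
    or a bare list of event objects.
    """
    with open(path) as f:
        data = json.load(f)
    events = data["traceEvents"] if isinstance(data, dict) else data
    counts = {}
    for ev in events:
        ph = ev.get("ph", "?")
        counts[ph] = counts.get(ph, 0) + 1
    return counts
```

For example, `count_trace_phases("trace.json")` returns a mapping from phase letters (such as `"X"` for complete events) to event counts, which is a cheap way to confirm the trace is non-empty and well-formed.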
## Examples
Consult the Nsight Systems User Guide for a full overview of MPI-related options.
### Profiling specific iterations on a trtllm-bench/trtllm-serve run
Suppose we want to profile iterations 100 to 150 of a trtllm-bench/trtllm-serve run and collect as much information as possible for debugging, such as GIL activity and debugging NVTX markers:
```bash
#!/bin/bash
# Prepare dataset for the benchmark
python3 benchmarks/cpp/prepare_dataset.py \
--tokenizer=${MODEL_PATH} \
--stdout token-norm-dist --num-requests=${NUM_SAMPLES} \
--input-mean=1000 --output-mean=1000 --input-stdev=0 --output-stdev=0 > /tmp/dataset.txt
# Benchmark and profile. Replace trtllm-bench below with the
# corresponding trtllm-serve command if serving.
TLLM_PROFILE_START_STOP=100-150 nsys profile \
-o trace -f true \
-t 'cuda,nvtx,python-gil' -c cudaProfilerApi \
--cuda-graph-trace node \
-e TLLM_PROFILE_RECORD_GC=1,TLLM_LLMAPI_ENABLE_NVTX=1,TLLM_TORCH_PROFILE_TRACE=trace.json \
--trace-fork-before-exec=true \
trtllm-bench \
--model deepseek-ai/DeepSeek-V3 \
--model_path ${MODEL_PATH} \
throughput \
--dataset /tmp/dataset.txt --warmup 0 \
--backend pytorch \
--streaming
```
The Nsight Systems report will be saved to `trace.nsys-rep`. Use the NVIDIA Nsight Systems application to open it.
The PyTorch profiler results will be saved to `trace.json`. Use [chrome://tracing/](chrome://tracing/) to inspect the saved profile.