mirror of
https://github.com/NVIDIA/TensorRT-LLM.git
synced 2026-01-14 06:27:45 +08:00
93 lines
4.0 KiB
Markdown
(perf-analysis)=

# Performance Analysis

NVIDIA Nsight Systems reports at the application level are highly informative. Metric sampling capabilities have increased over generations and provide a clean middle ground between timing analysis and kernel-level deep dives with NVIDIA Nsight Compute.

Given the potentially long runtimes of Large Language Models (LLMs) and the diversity of workloads a model may experience during a single inference pass or binary execution, we have added features to TensorRT-LLM to get the most out of Nsight Systems. This document outlines those features and provides examples of how to use them to understand your application.

## Feature Descriptions

The main functionality here:
* Relies on toggling the CUDA profiler runtime API on and off.
* (PyTorch workflow only) Relies on toggling the PyTorch profiler on and off.
* Provides a means to understand which regions a user may want to focus on.

Toggling the CUDA profiler runtime API on and off:
* Allows users to know specifically what the profiled region corresponds to.
* Results in smaller files to post-process (for metric extraction or similar).

(PyTorch workflow only) Toggling the PyTorch profiler on and off:
* Helps users analyze the performance breakdown in the model.
* Results in smaller files to post-process (for metric extraction or similar).

## Coordinating with NVIDIA Nsight Systems Launch

Consult the Nsight Systems User Guide for a full overview of options.

On the PyTorch workflow, basic NVTX markers are provided by default. On the C++/TensorRT workflow, append `--nvtx` when calling the `scripts/build_wheel.py` script, and do a clean build of the code.

### Only collect specific iterations

To reduce the Nsight Systems profile size and ensure that only specific iterations are collected, set the environment variable `TLLM_PROFILE_START_STOP=A-B`, where `A-B` is the range of iterations to collect, and append `-c cudaProfilerApi` to the `nsys profile` command.

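
As a minimal sketch of the setup described above, the following script assembles an `nsys profile` invocation for a given iteration range and prints it instead of running it. The range `10-20`, the output name `my_profile`, and the workload `python3 my_inference.py` are hypothetical placeholders chosen for illustration; only `TLLM_PROFILE_START_STOP` and `-c cudaProfilerApi` come from this document.

```bash
#!/bin/bash
# Hypothetical values for illustration; adjust them to your run.
START_ITER=10
STOP_ITER=20
OUTPUT_NAME=my_profile

# Iteration range in the A-B form expected by TLLM_PROFILE_START_STOP.
PROFILE_RANGE="${START_ITER}-${STOP_ITER}"

# Store the nsys invocation in the positional parameters so quoting is kept.
# "python3 my_inference.py" is a placeholder for the real workload.
set -- nsys profile -o "${OUTPUT_NAME}" -c cudaProfilerApi python3 my_inference.py

# Dry run: print the command instead of executing it. To actually profile, run:
#   TLLM_PROFILE_START_STOP="${PROFILE_RANGE}" "$@"
PROFILE_CMD="TLLM_PROFILE_START_STOP=${PROFILE_RANGE} $*"
echo "${PROFILE_CMD}"
```

Printing the command first makes it easy to confirm the iteration range and capture option before committing to a long profiling run.
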
### Enable more NVTX markers for debugging
Set the environment variable `TLLM_NVTX_DEBUG=1`.

### Enable garbage collection (GC) NVTX markers
Set the environment variable `TLLM_PROFILE_RECORD_GC=1`.

### Enable GIL information in NVTX markers
Append `python-gil` to the `nsys` `-t` option.

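
The three marker options above can be combined in one launch. The sketch below is illustrative only: it exports the two environment variables, appends `python-gil` to an assumed base trace list of `cuda,nvtx`, and prints the resulting command rather than running it. The workload `python3 my_inference.py` is a hypothetical placeholder.

```bash
#!/bin/bash
# Turn on the additional debugging and GC NVTX markers.
export TLLM_NVTX_DEBUG=1
export TLLM_PROFILE_RECORD_GC=1

# Assumed base trace list for illustration; python-gil is appended to it.
TRACE_OPTS="cuda,nvtx"
TRACE_OPTS="${TRACE_OPTS},python-gil"

# Keep the command in the positional parameters so quoting is preserved.
# "python3 my_inference.py" is a placeholder for the real workload.
set -- nsys profile -t "${TRACE_OPTS}" python3 my_inference.py

# Dry run: print the command. To actually profile, run: "$@"
MARKER_CMD="$*"
echo "${MARKER_CMD}"
```

Because the variables are exported, the profiled process inherits them when the command is run from the same shell.
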
## Coordinating with PyTorch profiler (PyTorch workflow only)

### Collect PyTorch profiler results
1. Set the environment variable `TLLM_PROFILE_START_STOP=A-B` to specify the range of iterations to collect.
2. Set the environment variable `TLLM_TORCH_PROFILE_TRACE=<path>`; the results will be saved to `<path>`.

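
The two steps above might look as follows in a launch script. This is a sketch under stated assumptions: the iteration range `10-20` and the trace path `/tmp/torch_trace.json` are hypothetical, and the workload command is left as a commented placeholder.

```bash
#!/bin/bash
# Step 1: range of iterations to collect (hypothetical values).
export TLLM_PROFILE_START_STOP=10-20

# Step 2: where the PyTorch profiler trace is saved (hypothetical path).
export TLLM_TORCH_PROFILE_TRACE=/tmp/torch_trace.json

# Make sure the directory that will hold the trace exists before launching.
TRACE_DIR="$(dirname "${TLLM_TORCH_PROFILE_TRACE}")"
mkdir -p "${TRACE_DIR}"

echo "PyTorch profiler will cover iterations ${TLLM_PROFILE_START_STOP}"
echo "Trace will be written to ${TLLM_TORCH_PROFILE_TRACE}"

# Launch the PyTorch-workflow command here (placeholder), for example:
#   python3 my_inference.py
```

Creating the output directory up front is a defensive choice in this sketch; this document does not state whether the profiler creates missing directories itself.
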
### Visualize the PyTorch profiler results
Use [chrome://tracing/](chrome://tracing/) to inspect the saved profile.

## Examples
Consult the Nsight Systems User Guide for a full overview of MPI-related options.

### Profiling specific iterations on a trtllm-bench/trtllm-serve run

Suppose we want to profile iterations 100 to 150 of a trtllm-bench/trtllm-serve run and collect as much debugging information as possible, such as GIL activity and debugging NVTX markers:

```bash
#!/bin/bash

# Prepare dataset for the benchmark
python3 benchmarks/cpp/prepare_dataset.py \
    --tokenizer=${MODEL_PATH} \
    --stdout token-norm-dist --num-requests=${NUM_SAMPLES} \
    --input-mean=1000 --output-mean=1000 --input-stdev=0 --output-stdev=0 > /tmp/dataset.txt

# Benchmark and profile.
# The profiled command is trtllm-bench; a trtllm-serve command can be used instead.
# Note: a comment must not follow a trailing backslash, or the line continuation breaks.
TLLM_PROFILE_START_STOP=100-150 nsys profile \
    -o trace -f true \
    -t 'cuda,nvtx,python-gil' -c cudaProfilerApi \
    --cuda-graph-trace node \
    -e TLLM_PROFILE_RECORD_GC=1,TLLM_LLMAPI_ENABLE_NVTX=1,TLLM_TORCH_PROFILE_TRACE=trace.json \
    --trace-fork-before-exec=true \
    trtllm-bench \
        --model deepseek-ai/DeepSeek-V3 \
        --model_path ${MODEL_PATH} \
        throughput \
        --dataset /tmp/dataset.txt --warmup 0 \
        --backend pytorch \
        --streaming
```

The Nsight Systems report will be saved to `trace.nsys-rep`. Use the NVIDIA Nsight Systems application to open it.

The PyTorch profiler results will be saved to `trace.json`. Use [chrome://tracing/](chrome://tracing/) to inspect the saved profile.