mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Yukun He 2225745782 [TRTLLM-8129][feat] Allreduce tuning and benchmark script revising (#7870 )	2025-11-04 16:42:31 +08:00

Because we have encountered some perf regression due to using a one-shot kernel instead of NCCL on A100/H100, it will be beneficial if we can have a solid benchmarking of allreduce Op and analyze the data collected from it.

Implemented new AllreduceOp heuristic:
- Added Linear programming-based heuristic implementation.
- Added LUT-based heuristic implementation and corresponding code generation script.

AllreduceOp minor fixing:
- Fixed a minor issue in AllreduceOp, that the strategy can not be overridden when ONESHOT or TWOSHOT is set.
- Fixed a minor TWOSHOT kernel perf issue.
- Cleaned up Dispatching code in AllReduceOp.

This PR will fix the perf gaps reported in: https://nvbugspro.nvidia.com/bug/5517023

For Deepseek-R1, it shows a performance gain of about 3-4% in concurrency levels of 256 and 512.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
all_reduce.py	[TRTLLM-8129][feat] Allreduce tuning and benchmark script revising (#7870 )	2025-11-04 16:42:31 +08:00
build_time_benchmark.py	[None][fix] Migrate to new cuda binding package name (#6700 )	2025-08-07 16:29:55 -04:00
build_time_dashboard.py	Update (#2978 )	2025-03-23 16:39:35 +08:00
README.md	Update (#2978 )	2025-03-23 16:39:35 +08:00

README.md

!!! WARNING: This is not intended for external users to benchmark the performance numbers of the TRT-LLM product. !!! This folder contains the benchmark scripts used internally to assist TRT-LLM development.

build_time_benchmark

# example 1: offline benchmark for all the built-in models, see --help for all the options
python ./build_time_benchmark.py --model ALL

# By default, the benchmark doesn't load the weights, to save benchmark time; pass --load to also test the TRT-LLM load and convert time
# WARNING: this can take a very long time if the model is large, or if you use an online HF model id, since the weights may have to be downloaded
python ./build_time_benchmark.py --model ALL --load

# example 2: benchmark an HF model w/o downloading the model locally in advance
python ./build_time_benchmark.py --model "TinyLlama/TinyLlama_v1.1" # no weights loading
python ./build_time_benchmark.py --model "openai-community/gpt2" --load # with weights loading

# example 3: benchmark a locally downloaded HF model
python ./build_time_benchmark.py --model /home/scratch.trt_llm_data/llm-models/falcon-rw-1b/

# example 4: benchmark one model with managed weights option, with verbose option
python ./build_time_benchmark.py --model llama2-70b.TP4 --managed_weights -v

# example 5: see help
python ./build_time_benchmark.py --help
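
The examples above can also be driven from a small wrapper when several models need to be benchmarked in one go. The following is a minimal sketch, not part of this folder: the helper name `build_benchmark_cmd` is hypothetical, and it only assembles command lines from the flags shown above (`--model`, `--load`, `--managed_weights`, `-v`). It assumes it is run from the directory containing `build_time_benchmark.py`.

```python
import shlex
import sys


def build_benchmark_cmd(model, load=False, managed_weights=False, verbose=False,
                        script="./build_time_benchmark.py"):
    """Assemble the argv list for one build_time_benchmark.py run.

    Hypothetical helper: it only uses the flags documented in the examples
    above and does not validate the model name.
    """
    cmd = [sys.executable, script, "--model", model]
    if load:
        cmd.append("--load")  # also measure weight load and convert time
    if managed_weights:
        cmd.append("--managed_weights")
    if verbose:
        cmd.append("-v")
    return cmd


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for model in ["TinyLlama/TinyLlama_v1.1", "openai-community/gpt2"]:
        print(shlex.join(build_benchmark_cmd(model)))
```

To actually launch a run, the returned list can be passed to `subprocess.run(cmd, check=True)`. Building the command as a list avoids shell quoting problems with model ids or local paths.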