Stable Diffusion XL

This document shows how to build and run the Stable Diffusion XL (SDXL) model on multiple GPUs using TensorRT-LLM. The community-contributed SDXL example in TRT-LLM is intended solely to demonstrate distributed inference for high-resolution use cases. For an optimized single-GPU Stable Diffusion setup, please refer to the TensorRT demoDiffusion example.

The distributed parallel inference design comes from the CVPR 2024 paper DistriFusion from MIT HAN Lab. To simplify the implementation, all communication in this example is handled synchronously.
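To make that pattern concrete, the toy sketch below splits the latent across ranks, denoises each slice locally, and re-assembles the full latent with a blocking allgather after every step. It is only an illustration of the synchronous patch-parallel idea, not the TRT-LLM implementation; the latent shape, step count, and denoise_step stand-in are assumptions.

```python
# Illustrative toy (not the TRT-LLM implementation): synchronous patch-parallel
# denoising in the spirit of DistriFusion. Run with e.g. `mpirun -n 2 python toy.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, world = comm.Get_rank(), comm.Get_size()

np.random.seed(0)                                          # same initial latent on every rank
latent = np.random.randn(4, 128, 128).astype(np.float32)   # stand-in latent tensor
assert latent.shape[1] % world == 0
rows = latent.shape[1] // world                             # rows owned by each rank
local = latent[:, rank * rows:(rank + 1) * rows, :]

def denoise_step(x: np.ndarray) -> np.ndarray:
    # Stand-in for one UNet denoising step on this rank's patch.
    return x * 0.99

for _ in range(30):                        # diffusion steps
    local = denoise_step(local)
    gathered = comm.allgather(local)       # blocking: every rank waits here
    latent = np.concatenate(gathered, axis=1)
    local = latent[:, rank * rows:(rank + 1) * rows, :]

if rank == 0:
    print("final latent shape:", latent.shape)
```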

Usage

1. Build TensorRT Engine

# 1 gpu
python build_sdxl_unet.py --size 1024

# 2 gpus
mpirun -n 2 --allow-run-as-root python build_sdxl_unet.py --size 1024

2. Generate Images Using the Engine

# 1 gpu
python run_sdxl.py --size 1024 --prompt "flowers, rabbit"

# 2 gpus
mpirun -n 2 --allow-run-as-root python run_sdxl.py --size 1024 --prompt "flowers, rabbit"

Latency Benchmark

These benchmarks are provided as reference points and should not be taken as the peak inference speed that TensorRT-LLM can deliver.

| Framework | Resolution | n_gpu | A100 latency (s) | A100 speedup | H100 latency (s) | H100 speedup |
|-----------|------------|-------|------------------|--------------|------------------|--------------|
| Torch     | 1024x1024  | 1     | 6.280            | 1x           | 5.820            | 1x           |
| TRT-LLM   | 1024x1024  | 2     | 2.803            | 2.24x        | 1.719            | 3.39x        |
| TRT-LLM   | 1024x1024  | 4     | 2.962            | 2.12x        | 2.592            | 2.25x        |
| Torch     | 2048x2048  | 1     | 27.865           | 1x           | 18.330           | 1x           |
| TRT-LLM   | 2048x2048  | 2     | 13.152           | 2.12x        | 7.943            | 2.31x        |
| TRT-LLM   | 2048x2048  | 4     | 9.781            | 2.85x        | 7.596            | 2.41x        |

Measured with torch v2.5.0 and TRT-LLM v0.15.0.dev2024102900, using --num-warmup-runs=5 and --avg-runs=20. All communication is synchronous.
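As a rough guide to how such numbers are collected, the sketch below times the single-GPU Torch baseline at 1024x1024 with the same warmup-then-average scheme implied by the flags above. It is only a sketch: the Hugging Face diffusers pipeline, the model ID, and the run_once helper are assumptions, not the script used to produce the table.

```python
# Hedged sketch: timing the single-GPU Torch SDXL baseline with warmup and
# averaged runs, mirroring --num-warmup-runs=5 and --avg-runs=20 above.
import time

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
prompt = "flowers, rabbit"

def run_once() -> float:
    # One full 1024x1024 generation, synchronized so wall-clock time is accurate.
    torch.cuda.synchronize()
    start = time.perf_counter()
    pipe(prompt, height=1024, width=1024)
    torch.cuda.synchronize()
    return time.perf_counter() - start

for _ in range(5):                                # warmup runs
    run_once()
latencies = [run_once() for _ in range(20)]       # averaged runs
print(f"mean latency: {sum(latencies) / len(latencies):.3f} s")
```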