# Stable Diffusion XL
This document shows how to build and run the Stable Diffusion XL (SDXL) model on multiple GPUs using TensorRT-LLM. This community-contributed SDXL example is intended solely to showcase distributed inference for high-resolution use cases; for optimized single-GPU Stable Diffusion inference, please refer to the TensorRT demoDiffusion example.

The distributed parallel inference design follows DistriFusion, a CVPR 2024 paper from MIT HAN Lab. To keep the implementation simple, all communication in this example is synchronous.
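Conceptually, each rank denoises only its own patch of the latent at every step and then exchanges results so that all ranks end the step with the full latent. The toy sketch below illustrates this patch-split-and-allgather pattern using `mpi4py` and NumPy. It is only an illustration of the synchronous communication pattern, not the example's actual implementation; the `denoise` function is a hypothetical stand-in for the UNet.

```python
# Toy illustration of synchronous patch-parallel denoising (not the real UNet).
# Launch with: mpirun -n 2 python patch_parallel_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, world = comm.Get_rank(), comm.Get_size()

H = W = 128  # toy latent spatial size; assumed divisible by the number of ranks
latent = np.random.default_rng(0).standard_normal((4, H, W)).astype(np.float32)


def denoise(patch, step):
    # Placeholder for one UNet denoising step on this rank's patch.
    return patch * 0.99


rows = H // world  # split the latent along the height axis, one slab per rank
for step in range(4):  # a few toy diffusion steps
    patch = latent[:, rank * rows:(rank + 1) * rows, :]
    patch = denoise(patch, step)
    # Synchronous exchange: every rank gathers all patches at every step,
    # mirroring this example's fully synchronous communication.
    gathered = comm.allgather(patch)
    latent = np.concatenate(gathered, axis=1)

if rank == 0:
    print("final latent shape:", latent.shape)
```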
## Usage

### 1. Build TensorRT Engine

```bash
# 1 GPU
python build_sdxl_unet.py --size 1024

# 2 GPUs
mpirun -n 2 --allow-run-as-root python build_sdxl_unet.py --size 1024
```
### 2. Generate Images Using the Engine

```bash
# 1 GPU
python run_sdxl.py --size 1024 --prompt "flowers, rabbit"

# 2 GPUs
mpirun -n 2 --allow-run-as-root python run_sdxl.py --size 1024 --prompt "flowers, rabbit"
```
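If the multi-GPU commands hang or fail to launch, it can help to first confirm that `mpirun` spawns the expected ranks. Below is a minimal check, assuming `mpi4py` is installed; this script is not part of the example.

```python
# check_mpi.py -- launch with: mpirun -n 2 --allow-run-as-root python check_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} of {comm.Get_size()} is alive")
```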
## Latency Benchmark
These numbers are provided as reference points and should not be considered the peak inference speed that TensorRT-LLM can deliver.
| Framework | Resolution | n_gpu | A100 latency (s) | A100 speedup | H100 latency (s) | H100 speedup |
|---|---|---|---|---|---|---|
| Torch | 1024x1024 | 1 | 6.280 | 1 | 5.820 | 1 |
| TRT-LLM | 1024x1024 | 2 | 2.803 | 2.24x | 1.719 | 3.39x |
| TRT-LLM | 1024x1024 | 4 | 2.962 | 2.12x | 2.592 | 2.25x |
| Torch | 2048x2048 | 1 | 27.865 | 1 | 18.330 | 1 |
| TRT-LLM | 2048x2048 | 2 | 13.152 | 2.12x | 7.943 | 2.31x |
| TRT-LLM | 2048x2048 | 4 | 9.781 | 2.85x | 7.596 | 2.41x |
Measured with Torch v2.5.0 and TRT-LLM v0.15.0.dev2024102900, using `--num-warmup-runs=5` and `--avg-runs=20`. All communications are synchronous.
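The warmup-then-average methodology implied by those flags corresponds to a loop along these lines (a sketch only; `run_pipeline` is a hypothetical stand-in for one end-to-end engine invocation):

```python
import time


def benchmark(run_pipeline, num_warmup_runs=5, avg_runs=20):
    """Return the average end-to-end latency in seconds, excluding warmup."""
    for _ in range(num_warmup_runs):
        run_pipeline()  # warm up: CUDA context, allocator, autotuning caches
    start = time.perf_counter()
    for _ in range(avg_runs):
        run_pipeline()
    return (time.perf_counter() - start) / avg_runs
```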