# Stable Diffusion XL
This document shows how to build and run the [Stable Diffusion XL (SDXL)](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) model on multiple GPUs using TensorRT-LLM. This community-contributed example is intended solely to demonstrate distributed inference for high-resolution use cases; for optimized single-GPU Stable Diffusion inference, please refer to the [TensorRT DemoDiffusion example](https://github.com/NVIDIA/TensorRT/tree/main/demo/Diffusion).
The distributed parallel inference design comes from the CVPR 2024 paper [DistriFusion](https://github.com/mit-han-lab/distrifuser) by the [MIT HAN Lab](https://hanlab.mit.edu/). To keep the implementation simple, all communication in this example is handled synchronously.
## Usage
### 1. Build TensorRT Engine
```bash
# 1 GPU
python build_sdxl_unet.py --size 1024
# 2 GPUs
mpirun -n 2 --allow-run-as-root python build_sdxl_unet.py --size 1024
```
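The engine is built for a specific resolution (`--size`) and, when launched under `mpirun`, for a specific number of ranks, so the build invocation should presumably match the configuration you plan to run. As a sketch of a larger setup (assuming `--size` and the MPI rank count behave the same way at this scale), the 4-GPU 2048x2048 configuration from the benchmark below would be built with:
```bash
# 4 GPUs, 2048x2048 resolution (the largest configuration in the benchmark below)
mpirun -n 4 --allow-run-as-root python build_sdxl_unet.py --size 2048
```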
### 2. Generate Images Using the Engine
```bash
# 1 GPU
python run_sdxl.py --size 1024 --prompt "flowers, rabbit"
# 2 GPUs
mpirun -n 2 --allow-run-as-root python run_sdxl.py --size 1024 --prompt "flowers, rabbit"
```
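The run command mirrors the build. A hedged example for the 4-GPU 2048x2048 configuration, assuming a matching engine was built in step 1:
```bash
# 4 GPUs, 2048x2048, using the engine built above
mpirun -n 4 --allow-run-as-root python run_sdxl.py --size 2048 --prompt "flowers, rabbit"
```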
## Latency Benchmark
These numbers are provided as reference points only and should not be taken as the peak inference speed that TensorRT-LLM can deliver.
| Framework | Resolution | n_gpu | A100 latency (s) | A100 speedup | H100 latency (s) | H100 speedup |
|:---------:|:----------:|:-----:|:---------------:|:-------------:|:---------------:|:-------------:|
| Torch | 1024x1024 | 1 | 6.280 | 1 | 5.820 | 1 |
| TRT-LLM | 1024x1024 | 2 | 2.803 | **2.24x** | 1.719 | **3.39x** |
| TRT-LLM | 1024x1024 | 4 | 2.962 | **2.12x** | 2.592 | **2.25x** |
| Torch | 2048x2048 | 1 | 27.865 | 1 | 18.330 | 1 |
| TRT-LLM | 2048x2048 | 2 | 13.152 | **2.12x** | 7.943 | **2.31x** |
| TRT-LLM | 2048x2048 | 4 | 9.781 | **2.85x** | 7.596 | **2.41x** |
Measured with torch v2.5.0 and TRT-LLM v0.15.0.dev2024102900, using `--num-warmup-runs=5 --avg-runs=20`. All communications are synchronous.
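To reproduce a row of the table, the warmup and averaging flags quoted above can be added to the run command. This is a sketch: the flag spellings are taken from the measurement note and are assumed to be accepted by `run_sdxl.py`.
```bash
# Reproduce the 2-GPU 1024x1024 row: 5 warmup runs, latency averaged over 20 timed runs
mpirun -n 2 --allow-run-as-root python run_sdxl.py --size 1024 --prompt "flowers, rabbit" \
    --num-warmup-runs=5 --avg-runs=20
```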