mirror of https://github.com/NVIDIA/TensorRT-LLM.git, synced 2026-02-06 03:01:50 +08:00
# Stable Diffusion XL
This document shows how to build and run the [Stable Diffusion XL (SDXL)](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) model on multiple GPUs using TensorRT-LLM. The community-contributed SDXL example in TRT-LLM is intended solely to demonstrate distributed inference for high-resolution use cases. For optimized single-GPU Stable Diffusion inference, please refer to the [TensorRT DemoDiffusion example](https://github.com/NVIDIA/TensorRT/tree/main/demo/Diffusion).
The design of distributed parallel inference comes from the CVPR 2024 paper [DistriFusion](https://github.com/mit-han-lab/distrifuser) from [MIT HAN Lab](https://hanlab.mit.edu/). To simplify the implementation, all communications in this example are handled synchronously.
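The patch-parallel idea behind DistriFusion can be illustrated with a toy sketch. This is not the example's actual code: plain NumPy stands in for MPI ranks, and `denoise_patch` is a hypothetical placeholder for the real UNet forward pass.

```python
import numpy as np

# Toy sketch of DistriFusion-style patch parallelism: the latent is split
# along the height axis, each rank denoises its own patch, and a
# synchronous gather stitches the full latent back together every step.
# A Python loop stands in for the MPI ranks.

def denoise_patch(patch):
    # Placeholder for a UNet forward pass on one latent patch.
    return patch * 0.9

def parallel_denoise_step(latent, n_ranks):
    patches = np.array_split(latent, n_ranks, axis=1)   # scatter along H
    outputs = [denoise_patch(p) for p in patches]       # one per "rank"
    return np.concatenate(outputs, axis=1)              # synchronous gather

latent = np.ones((4, 128, 128), dtype=np.float32)       # (C, H, W)
out = parallel_denoise_step(latent, n_ranks=2)
print(out.shape)  # (4, 128, 128): same shape as the input latent
```

In the real example the gather is a blocking collective across GPUs, which is why the README notes that all communications are synchronous.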
## Usage
### 1. Build TensorRT Engine
```bash
# 1 GPU
python build_sdxl_unet.py --size 1024

# 2 GPUs
mpirun -n 2 --allow-run-as-root python build_sdxl_unet.py --size 1024
```
### 2. Generate Images Using the Engine
```bash
# 1 GPU
python run_sdxl.py --size 1024 --prompt "flowers, rabbit"

# 2 GPUs
mpirun -n 2 --allow-run-as-root python run_sdxl.py --size 1024 --prompt "flowers, rabbit"
```
## Latency Benchmark
These benchmarks are provided as reference points and should not be taken as the peak inference speed that TensorRT-LLM can deliver.
| Framework | Resolution | n_gpu | A100 latency (s) | A100 speedup | H100 latency (s) | H100 speedup |
|:---------:|:----------:|:-----:|:----------------:|:------------:|:----------------:|:------------:|
|   Torch   | 1024x1024  |   1   |      6.280       |      1       |      5.820       |      1       |
|  TRT-LLM  | 1024x1024  |   2   |      2.803       |  **2.24x**   |      1.719       |  **3.39x**   |
|  TRT-LLM  | 1024x1024  |   4   |      2.962       |  **2.12x**   |      2.592       |  **2.25x**   |
|   Torch   | 2048x2048  |   1   |      27.865      |      1       |      18.330      |      1       |
|  TRT-LLM  | 2048x2048  |   2   |      13.152      |  **2.12x**   |      7.943       |  **2.31x**   |
|  TRT-LLM  | 2048x2048  |   4   |      9.781       |  **2.85x**   |      7.596       |  **2.41x**   |

Measured with torch v2.5.0 and TRT-LLM v0.15.0.dev2024102900, using `--num-warmup-runs=5` and `--avg-runs=20`. All communications are synchronous.
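As a sanity check on the table, each speedup figure is simply the single-GPU Torch latency divided by the TRT-LLM latency at the same resolution on the same GPU, rounded to two decimals:

```python
# Speedup = Torch single-GPU latency / TRT-LLM latency, for the same GPU
# and resolution, rounded to two decimal places as in the table above.
def speedup(torch_s: float, trtllm_s: float) -> float:
    return round(torch_s / trtllm_s, 2)

print(speedup(6.280, 2.803))   # 2.24  (A100, 1024x1024, 2 GPUs)
print(speedup(5.820, 1.719))   # 3.39  (H100, 1024x1024, 2 GPUs)
print(speedup(27.865, 9.781))  # 2.85  (A100, 2048x2048, 4 GPUs)
```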