# Stable Diffusion XL

This document describes how to build and run the [Stable Diffusion XL (SDXL)](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) model on multiple GPUs using TensorRT-LLM.

The community-contributed SDXL example in TRT-LLM is intended solely to showcase distributed inference for high-resolution use cases. For an optimized single-GPU setup for Stable Diffusion inference, please refer to the [TensorRT DemoDiffusion example](https://github.com/NVIDIA/TensorRT/tree/main/demo/Diffusion).

The distributed parallel inference design follows the CVPR 2024 paper [DistriFusion](https://github.com/mit-han-lab/distrifuser) from [MIT HAN Lab](https://hanlab.mit.edu/). To simplify the implementation, all communications in this example are handled synchronously; a conceptual sketch of this communication pattern is given at the end of this document.

## Usage

### 1. Build TensorRT Engine

```bash
# 1 GPU
python build_sdxl_unet.py --size 1024

# 2 GPUs
mpirun -n 2 --allow-run-as-root python build_sdxl_unet.py --size 1024
```

### 2. Generate images using the engine

```bash
# 1 GPU
python run_sdxl.py --size 1024 --prompt "flowers, rabbit"

# 2 GPUs
mpirun -n 2 --allow-run-as-root python run_sdxl.py --size 1024 --prompt "flowers, rabbit"
```

## Latency Benchmark

These numbers are provided as reference points only and should not be considered the peak inference speed that TensorRT-LLM can deliver.

| Framework | Resolution | n_gpu | A100 latency (s) | A100 speedup | H100 latency (s) | H100 speedup |
|:---------:|:----------:|:-----:|:----------------:|:------------:|:----------------:|:------------:|
| Torch     | 1024x1024  | 1     | 6.280            | 1x           | 5.820            | 1x           |
| TRT-LLM   | 1024x1024  | 2     | 2.803            | **2.24x**    | 1.719            | **3.39x**    |
| TRT-LLM   | 1024x1024  | 4     | 2.962            | **2.12x**    | 2.592            | **2.25x**    |
| Torch     | 2048x2048  | 1     | 27.865           | 1x           | 18.330           | 1x           |
| TRT-LLM   | 2048x2048  | 2     | 13.152           | **2.12x**    | 7.943            | **2.31x**    |
| TRT-LLM   | 2048x2048  | 4     | 9.781            | **2.85x**    | 7.596            | **2.41x**    |

Measured with torch v2.5.0 and TRT-LLM v0.15.0.dev2024102900, using `--num-warmup-runs=5` and `--avg-runs=20`. All communications are synchronous.
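The latency numbers above follow the usual warmup-then-average pattern implied by `--num-warmup-runs=5` and `--avg-runs=20`. The snippet below is a minimal, illustrative timing harness showing that methodology; it is not the example's benchmarking code, and `generate` stands in for any callable that runs one end-to-end image generation.

```python
# Illustrative latency measurement sketch (assumption: `generate` is a
# user-supplied callable that performs one full image generation on the GPU).
import time
import torch

def measure_latency(generate, num_warmup_runs=5, avg_runs=20):
    """Return the average end-to-end latency, discarding warmup iterations."""
    # Warmup runs populate kernel caches and allocator pools and are not timed.
    for _ in range(num_warmup_runs):
        generate()
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(avg_runs):
        generate()
    # Block until all queued GPU work has finished before stopping the clock.
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / avg_runs
```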
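For intuition, the sketch below illustrates the DistriFusion-style idea of patch parallelism with synchronous (blocking) communication: each rank denoises its own slice of the latent, and the slices are reassembled with an all-gather before the next step. This is a conceptual illustration written with `torch.distributed`, not the actual TRT-LLM implementation; the `denoiser` stand-in and the slicing scheme are assumptions made for brevity.

```python
# Conceptual sketch of synchronous patch-parallel denoising (not the example's
# actual code). Launch with e.g. `torchrun --nproc_per_node=2 this_script.py`.
import torch
import torch.distributed as dist

def synchronous_patch_step(denoiser, latent, rank, world_size):
    """One denoising step where each rank processes a horizontal slice of the
    latent and slices are exchanged with a blocking all-gather."""
    # Split the latent along the height dimension; each rank owns one patch.
    patches = torch.chunk(latent, world_size, dim=2)
    local_out = denoiser(patches[rank])

    # Synchronous all-gather: every rank waits until all patches have been
    # exchanged before the full latent is reassembled.
    gathered = [torch.empty_like(local_out) for _ in range(world_size)]
    dist.all_gather(gathered, local_out.contiguous())
    return torch.cat(gathered, dim=2)

if __name__ == "__main__":
    dist.init_process_group("nccl")
    rank, world_size = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank)

    # Stand-in for the SDXL UNet: any module mapping a latent patch to a patch.
    denoiser = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1).cuda()
    latent = torch.randn(1, 4, 128, 128, device="cuda")

    out = synchronous_patch_step(denoiser, latent, rank, world_size)
    if rank == 0:
        print("reassembled latent shape:", tuple(out.shape))
    dist.destroy_process_group()
```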