
STDiT in OpenSora

This document shows how to build and run the STDiT model from OpenSora with TensorRT-LLM.

Overview

The TensorRT-LLM implementation of STDiT can be found in tensorrt_llm/models/stdit/model.py. The TensorRT-LLM STDiT (OpenSora) example code is located in examples/stdit. The main files for building and running STDiT with TensorRT-LLM are:

  • convert_checkpoint.py to convert the STDiT model into the TensorRT-LLM checkpoint format.
  • sample.py to run the pipeline with the TensorRT engine(s) and generate videos.

Support Matrix

  • TP (tensor parallelism)
  • CP (context parallelism)
  • FP8

Usage

The TensorRT-LLM STDiT example code is located in examples/stdit. It takes a HuggingFace checkpoint as input and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.

Requirements

Please install required packages first:

pip install -r requirements.txt
# ColossalAI is also needed for the text encoder.
pip install colossalai --no-deps
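
As a quick, optional sanity check that the installation succeeded (a minimal sketch; the exact version string will vary by install, and colossalai was installed with --no-deps, so a failed import here points to a missing extra):

# sanity_check.py -- verify that the core packages import cleanly
import tensorrt_llm
import colossalai  # installed with --no-deps above; import errors reveal missing extras

print(tensorrt_llm.__version__)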

Build STDiT TensorRT engine(s)

The pretrained checkpoint can be downloaded from here. convert_checkpoint.py converts it to the TensorRT-LLM checkpoint format, and TensorRT engine(s) can then be built from the converted checkpoint.

# Convert to TRT-LLM
python convert_checkpoint.py --timm_ckpt=<pretrained_checkpoint>
# Build engine
trtllm-build --checkpoint_dir=tllm_checkpoint/ \
             --output_dir=./engine_output \
             --max_batch_size=2 \
             --gemm_plugin=float16 \
             --kv_cache_type=disabled \
             --remove_input_padding=enable \
             --gpt_attention_plugin=auto \
             --bert_attention_plugin=auto \
             --context_fmha=enable
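
Before building, it can be helpful to inspect the converted checkpoint. The sketch below assumes convert_checkpoint.py wrote its output to tllm_checkpoint/, the directory passed to trtllm-build above; a TensorRT-LLM checkpoint contains a config.json plus per-rank safetensors weight shards:

# inspect_checkpoint.py -- print key fields of the converted checkpoint
import json
from pathlib import Path

ckpt_dir = Path("tllm_checkpoint")
config = json.loads((ckpt_dir / "config.json").read_text())
print("architecture:", config.get("architecture"))
print("dtype:", config.get("dtype"))
print("mapping:", config.get("mapping"))  # parallelism layout (tp_size, etc.)
for shard in sorted(ckpt_dir.glob("rank*.safetensors")):
    print("weight shard:", shard.name)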

After the build completes, the engines are written to the ./engine_output directory, and the STDiT model is ready to run with TensorRT-LLM.

Generate videos

The sample.py script is provided to generate videos with the optimized TensorRT engine(s).

python sample.py "a beautiful waterfall"

A video named sample_outputs/sample_0000.mp4 will be generated.
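
To verify the output programmatically, here is a minimal sketch; it assumes the imageio package (with a video backend such as pyav or imageio-ffmpeg) is available in the environment, which is not part of requirements.txt:

# check_output.py -- count the decoded frames of the generated clip
import imageio.v3 as iio

num_frames = 0
for frame in iio.imiter("sample_outputs/sample_0000.mp4"):
    num_frames += 1
print(f"decoded {num_frames} frames of {frame.shape[1]}x{frame.shape[0]} pixels")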

Tensor Parallel

We can leverage tensor parallelism to further reduce latency and per-GPU memory consumption.

# Convert to TRT-LLM
python convert_checkpoint.py --tp_size=2 --timm_ckpt=<pretrained_checkpoint>
# Build engines
trtllm-build --checkpoint_dir=tllm_checkpoint/ \
             --output_dir=./engine_output \
             --max_batch_size=2 \
             --gemm_plugin=float16 \
             --kv_cache_type=disabled \
             --remove_input_padding=enable \
             --gpt_attention_plugin=auto \
             --bert_attention_plugin=auto \
             --context_fmha=enable
# Run example
mpirun -n 2 --allow-run-as-root python sample.py "a beautiful waterfall"
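
As a conceptual illustration of why tensor parallelism cuts per-GPU memory (a standalone sketch, not the TensorRT-LLM implementation): each linear layer's weight is split across the ranks, every GPU multiplies by only its own shard, and the partial results are combined to reproduce the full layer.

# tp_sketch.py -- column-parallel linear layer across tp_size ranks
import numpy as np

tp_size = 2
x = np.random.randn(4, 8).astype(np.float32)   # activations, replicated on every rank
w = np.random.randn(8, 16).astype(np.float32)  # full weight, never stored on one GPU

shards = np.split(w, tp_size, axis=1)          # each rank holds 1/tp_size of the columns
partials = [x @ shard for shard in shards]     # per-rank matmul on the local shard
y = np.concatenate(partials, axis=1)           # stands in for the NCCL all-gather

assert np.allclose(y, x @ w)                   # identical to the unsplit layer

In the real engine the combination step is a collective communication across GPUs; the concatenation above only shows that the sharded math matches the full layer.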

Context Parallel

Not supported yet.