🔥🚀 AutoDeploy Examples

This folder contains runnable examples for AutoDeploy. For general AutoDeploy documentation, motivation, support matrix, and feature overview, please see the official docs.


Quick Start

AutoDeploy is included with the TRT-LLM installation.

sudo apt-get -y install libopenmpi-dev && pip3 install --upgrade pip setuptools && pip3 install tensorrt_llm

You can refer to the TRT-LLM installation guide for more information.
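
To confirm the package is importable after installation, you can run a quick version check (a minimal sanity check, not part of the official installation guide):

python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"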

Run a simple example with a Hugging Face model:

cd examples/auto_deploy
python build_and_run_ad.py --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

Example Run Script (build_and_run_ad.py)

This script demonstrates end-to-end deployment of Hugging Face checkpoints using AutoDeploy's graph-transformation pipeline.

You can configure your experiment with various options. Use the -h/--help flag to see the full list:

python build_and_run_ad.py --help

Below is a non-exhaustive list of common configuration options:

Configuration Key             Description
--model                       The HF model card or path to a HF checkpoint folder
--args.model-factory          The model factory implementation to use ("AutoModelForCausalLM", ...)
--args.skip-loading-weights   Only load the architecture, not the weights
--args.model-kwargs           Extra kwargs passed to the model initializer in the model factory
--args.tokenizer-kwargs       Extra kwargs passed to the tokenizer initializer in the model factory
--args.world-size             The number of GPUs used for auto-sharding the model
--args.runtime                The type of Engine to use during runtime ("demollm" or "trtllm")
--args.compile-backend        How the graph is compiled at the end of the pipeline
--args.attn-backend           The kernel implementation used for attention
--args.mla-backend            The implementation used for multi-head latent attention
--args.max-seq-len            Maximum sequence length for inference/cache
--args.max-batch-size         Maximum dimension for the statically allocated KV cache
--args.attn-page-size         Page size for attention
--prompt.batch-size           Number of queries to generate
--benchmark.enabled           Whether to run the built-in benchmark (true/false)

For default values and additional configuration options, refer to the ExperimentConfig class in build_and_run_ad.py.

The following is a more complete example of using the script:

cd examples/auto_deploy
python build_and_run_ad.py \
--model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
--args.world-size 2 \
--args.runtime "demollm" \
--args.compile-backend "torch-compile" \
--args.attn-backend "flashinfer" \
--benchmark.enabled True
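
Building on the run above, the keys from the configuration table can also be combined for a quick, weights-free smoke test. The invocation below is a hypothetical sketch that relies only on the flags listed above (--args.skip-loading-weights, --args.max-seq-len, --args.max-batch-size); check --help or ExperimentConfig for the exact accepted values:

cd examples/auto_deploy
python build_and_run_ad.py \
--model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
--args.skip-loading-weights true \
--args.max-seq-len 1024 \
--args.max-batch-size 8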

Advanced Configuration

The script supports flexible configuration:

  • CLI dot notation for nested fields
  • YAML configs with deep merge
  • Precedence: CLI > YAML > defaults

Please refer to the Expert Configuration of LLM API documentation for details.
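
To illustrate these mechanics, the sketch below writes a small YAML file and then overrides one of its values from the CLI; under the stated precedence, the CLI flag wins. The --yaml-configs flag name and the exact YAML key layout are assumptions here, not taken verbatim from the docs, so refer to the Expert Configuration of LLM API documentation for the supported syntax:

cat > my_config.yaml <<'EOF'
# Hypothetical layout: keys are assumed to mirror the nested CLI options above.
args:
  world_size: 2
  runtime: demollm
  compile_backend: torch-compile
benchmark:
  enabled: true
EOF

# CLI > YAML > defaults: the flag below overrides the runtime set in the YAML file,
# while the other YAML values still apply.
python build_and_run_ad.py \
--model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
--yaml-configs my_config.yaml \
--args.runtime "trtllm"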

Disclaimer

This project is under active development and is currently in a prototype stage. The code is experimental, subject to change, and may include backward-incompatible updates. While we strive for correctness, there are no guarantees regarding functionality, stability, or reliability.