# 🔥🚀⚡ AutoDeploy Examples
This folder contains runnable examples for AutoDeploy. For general AutoDeploy documentation, motivation, support matrix, and feature overview, please see the official docs.
## Quick Start
AutoDeploy is included with the TRT-LLM installation.
```bash
sudo apt-get -y install libopenmpi-dev && pip3 install --upgrade pip setuptools && pip3 install tensorrt_llm
```
Refer to the TRT-LLM installation guide for more information.
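To sanity-check the installation, you can print the installed package version (a minimal check that only verifies the wheel imports; it does not exercise a GPU):

```bash
# Verify that the tensorrt_llm package is importable and print its version
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```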
Run a simple example with a Hugging Face model:
```bash
cd examples/auto_deploy
python build_and_run_ad.py --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```
## Example Run Script (`build_and_run_ad.py`)
This script demonstrates end-to-end deployment of HuggingFace checkpoints using AutoDeploy’s graph-transformation pipeline.
You can configure your experiment with various options. Use the `-h`/`--help` flag to see the full list:
```bash
python build_and_run_ad.py --help
```
Below is a non-exhaustive list of common configuration options:
| Configuration Key | Description |
|---|---|
| `--model` | The HF model card or path to a HF checkpoint folder |
| `--args.model-factory` | Choose the model factory implementation (`"AutoModelForCausalLM"`, ...) |
| `--args.skip-loading-weights` | Only load the architecture, not the weights |
| `--args.model-kwargs` | Extra kwargs passed to the model initializer in the model factory |
| `--args.tokenizer-kwargs` | Extra kwargs passed to the tokenizer initializer in the model factory |
| `--args.world-size` | The number of GPUs used for auto-sharding the model |
| `--args.runtime` | Specifies which type of Engine to use during runtime (`"demollm"` or `"trtllm"`) |
| `--args.compile-backend` | Specifies how to compile the graph at the end |
| `--args.attn-backend` | Specifies the kernel implementation for attention |
| `--args.mla-backend` | Specifies the implementation for multi-head latent attention |
| `--args.max-seq-len` | Maximum sequence length for inference/cache |
| `--args.max-batch-size` | Maximum dimension for the statically allocated KV cache |
| `--args.attn-page-size` | Page size for attention |
| `--prompt.batch-size` | Number of queries to generate |
| `--benchmark.enabled` | Whether to run the built-in benchmark (true/false) |
For default values and additional configuration options, refer to the `ExperimentConfig` class in `build_and_run_ad.py`.
The following is a more complete example of using the script:
```bash
cd examples/auto_deploy
python build_and_run_ad.py \
  --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
  --args.world-size 2 \
  --args.runtime "demollm" \
  --args.compile-backend "torch-compile" \
  --args.attn-backend "flashinfer" \
  --benchmark.enabled True
```
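For quick smoke tests, it can be useful to load only the model architecture and shrink it before running. The sketch below combines `--args.skip-loading-weights` with a nested `model-kwargs` override; the dot-notation syntax for nested kwargs is an assumption here, so check `--help` for the exact form your version supports:

```bash
# Hedged sketch: fast smoke test that skips weight loading and truncates the model.
# The nested dot-notation form of --args.model-kwargs is an assumption; verify with --help.
cd examples/auto_deploy
python build_and_run_ad.py \
  --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
  --args.skip-loading-weights true \
  --args.model-kwargs.num-hidden-layers 2
```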
## Advanced Configuration
The script supports flexible configuration (a hedged sketch follows below):

- CLI dot notation for nested fields
- YAML config files with deep merge
- Precedence: CLI > YAML > defaults

Please refer to Expert Configuration of LLM API for details.
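The sketch below writes a YAML file and then overrides one of its values from the CLI. It assumes the script accepts YAML files via a `--yaml-configs` flag and that YAML keys mirror the nested CLI structure (`args`, `prompt`, `benchmark`); both are assumptions, so confirm the exact flag name and schema with `--help`:

```bash
# Hedged sketch: YAML config plus CLI overrides; the flag name and schema are assumptions.
cat > my_config.yaml << 'EOF'
args:
  world_size: 2
  runtime: demollm
  compile_backend: torch-compile
benchmark:
  enabled: true
EOF

# CLI takes precedence over YAML, so this run uses world-size 4
# even though the YAML file sets it to 2.
python build_and_run_ad.py \
  --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
  --yaml-configs my_config.yaml \
  --args.world-size 4
```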
## Disclaimer
This project is under active development and is currently in a prototype stage. The code is experimental, subject to change, and may include backward-incompatible updates. While we strive for correctness, there are no guarantees regarding functionality, stability, or reliability.