## Example Run Script
To build and run the AutoDeploy example, use the `examples/auto_deploy/build_and_run_ad.py` script:

```bash
cd examples/auto_deploy
python build_and_run_ad.py --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```
You can configure your experiment with various options. Use the `-h`/`--help` flag to see what is available:

```bash
python build_and_run_ad.py --help
```
The following is a non-exhaustive list of common configuration options:
| Configuration Key | Description |
|---|---|
| `--model` | The HF model card or path to a HF checkpoint folder |
| `--args.model-factory` | The model factory implementation to use (`"AutoModelForCausalLM"`, ...) |
| `--args.skip-loading-weights` | Only load the architecture, not the weights |
| `--args.model-kwargs` | Extra kwargs passed to the model initializer in the model factory |
| `--args.tokenizer-kwargs` | Extra kwargs passed to the tokenizer initializer in the model factory |
| `--args.world-size` | The number of GPUs used for auto-sharding the model |
| `--args.runtime` | Which type of engine to use during runtime (`"demollm"` or `"trtllm"`) |
| `--args.compile-backend` | How to compile the graph at the end |
| `--args.attn-backend` | The kernel implementation for attention |
| `--args.mla-backend` | The implementation for multi-head latent attention |
| `--args.max-seq-len` | Maximum sequence length for inference/cache |
| `--args.max-batch-size` | Maximum dimension for the statically allocated KV cache |
| `--args.attn-page-size` | Page size for attention |
| `--prompt.batch-size` | Number of queries to generate |
| `--benchmark.enabled` | Whether to run the built-in benchmark (`true`/`false`) |
For default values and additional configuration options, refer to the `ExperimentConfig` class in `examples/auto_deploy/build_and_run_ad.py`.
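For instance, the weight-loading and prompt options from the table can be combined for a quick architecture-only dry run. This is a minimal sketch using only the flags listed above; the flag values are illustrative, and generated text will be meaningless because the weights stay randomly initialized:

```bash
# Dry run: load only the model architecture and generate for a small batch.
# Values are illustrative; booleans follow the True/False convention used
# elsewhere in this document.
cd examples/auto_deploy
python build_and_run_ad.py \
    --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
    --args.skip-loading-weights True \
    --prompt.batch-size 2
```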
The following is a more complete example of using the script:
```bash
cd examples/auto_deploy
python build_and_run_ad.py \
    --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
    --args.world-size 2 \
    --args.runtime "demollm" \
    --args.compile-backend "torch-compile" \
    --args.attn-backend "flashinfer" \
    --benchmark.enabled True
```
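Assuming the same flag conventions, switching to the `trtllm` runtime listed in the table is a one-flag change. The variant below is an illustrative sketch, not a verified configuration:

```bash
# Same model sharded across 2 GPUs, but using the "trtllm" runtime
# instead of "demollm" (both values come from the table above).
cd examples/auto_deploy
python build_and_run_ad.py \
    --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
    --args.world-size 2 \
    --args.runtime "trtllm"
```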