TensorRT-LLMs/workflow.md at ff3a494f5ca86f8bc332c82b8d6d574e20ef8169

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-02-05 02:31:33 +08:00

Lucas Liebenwein ff3a494f5c

[#10013 ][feat] AutoDeploy: native cache manager integration (#10635 )

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

2026-01-27 11:23:22 -05:00

Incorporating `auto_deploy` into your own workflow

AutoDeploy integrates into existing workflows through TRT-LLM's high-level LLM API. This section shows how to configure and invoke AutoDeploy from a custom application.

The following example demonstrates how to build an LLM object with AutoDeploy integration:

from tensorrt_llm._torch.auto_deploy import LLM


# Construct the LLM high-level interface object with autodeploy as backend
llm = LLM(
    model=<HF_MODEL_CARD_OR_DIR>,
    world_size=<DESIRED_WORLD_SIZE>,
    compile_backend="torch-compile",
    model_kwargs={"num_hidden_layers": 2}, # test with smaller model configuration
    attn_backend="flashinfer", # choose between "triton" and "flashinfer"
    skip_loading_weights=False,
    model_factory="AutoModelForCausalLM", # choose appropriate model factory
    max_seq_len=<MAX_SEQ_LEN>,
    max_batch_size=<MAX_BATCH_SIZE>,
)
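Once the object is built, it can be driven like any other LLM API object. The helper below is a minimal sketch of that step, not part of AutoDeploy itself: `generate_texts` is a hypothetical name, and it assumes that `llm.generate` returns one result per prompt and that each result exposes `.outputs[0].text`, as the TRT-LLM LLM API does for text generation.

```python
from typing import Any, List, Optional, Sequence


def generate_texts(
    llm: Any,
    prompts: Sequence[str],
    sampling_params: Optional[Any] = None,
) -> List[str]:
    """Run a batch of prompts through `llm` and return the generated strings.

    Assumption: `llm.generate(inputs, sampling_params=...)` returns one result
    per prompt, and each result carries its completions in `.outputs`.
    """
    results = llm.generate(list(prompts), sampling_params=sampling_params)
    # Keep only the first completion of each prompt.
    return [result.outputs[0].text for result in results]
```

With the `llm` object constructed above, a call such as `generate_texts(llm, ["Hello, my name is"])` would return one string per prompt. Sampling options can be passed as a `tensorrt_llm.SamplingParams` instance, for example `SamplingParams(max_tokens=32)`.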

For more information about configuring AutoDeploy via the LLM API using `**kwargs`, see the AutoDeploy LLM API in `tensorrt_llm._torch.auto_deploy.llm` and the `AutoDeployConfig` class in `tensorrt_llm._torch.auto_deploy.llm_args`.
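Because the configuration is passed as keyword arguments, it can be convenient to assemble it as a plain dictionary first and construct the LLM object in a separate step. The sketch below illustrates that pattern under stated assumptions: `build_autodeploy_kwargs` and `create_llm` are hypothetical helpers, the default values for `max_seq_len` and `max_batch_size` are placeholders chosen for illustration, and the keyword names are the ones used in the example above.

```python
from typing import Any, Dict, Optional


def build_autodeploy_kwargs(
    model: str,
    world_size: int = 1,
    max_seq_len: int = 512,      # illustrative default, not a recommendation
    max_batch_size: int = 8,     # illustrative default, not a recommendation
    num_hidden_layers: Optional[int] = None,
) -> Dict[str, Any]:
    """Collect the keyword arguments from the example above into a dict."""
    kwargs: Dict[str, Any] = {
        "model": model,
        "world_size": world_size,
        "compile_backend": "torch-compile",
        "attn_backend": "flashinfer",
        "skip_loading_weights": False,
        "model_factory": "AutoModelForCausalLM",
        "max_seq_len": max_seq_len,
        "max_batch_size": max_batch_size,
    }
    if num_hidden_layers is not None:
        # Shrink the model for quick tests, as in the example above.
        kwargs["model_kwargs"] = {"num_hidden_layers": num_hidden_layers}
    return kwargs


def create_llm(**kwargs: Any) -> Any:
    """Build the AutoDeploy LLM object; imported lazily so the config
    helper can be used on machines without TRT-LLM installed."""
    from tensorrt_llm._torch.auto_deploy import LLM

    return LLM(**kwargs)
```

A call such as `create_llm(**build_autodeploy_kwargs("<HF_MODEL_CARD_OR_DIR>", num_hidden_layers=2))` would then be equivalent to the direct construction shown earlier. Keeping the dictionary separate makes it easy to log or override individual settings before the model is built.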
