# TensorRT-LLM/examples/auto_deploy/nemotron_flash.yaml
# AutoDeploy example config for Nemotron Flash: CUDA-graph compilation,
# chunked prefill, and a 2M-token context window.

# Compile with the Torch backend and capture decode steps as CUDA graphs.
compile_backend: torch-cudagraph
# Serve up to 384 concurrent sequences.
max_batch_size: 384
# Support contexts up to 2,097,152 (2M) tokens.
max_seq_len: 2097152
# Cap tokens per forward pass; long prefills are split into chunks of this size.
max_num_tokens: 8192
enable_chunked_prefill: true
# AutoDeploy model factory for Nemotron Flash checkpoints.
model_factory: NemotronFlashForCausalLM
# Batch sizes to pre-capture as CUDA graphs; at runtime a batch is padded up
# to the nearest captured size.
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 24, 32, 64, 96, 128, 256, 320, 384]
transforms:
  gather_logits_before_lm_head:
    # Gather only the positions that need logits before running the LM head.
    # TODO: fix https://github.com/NVIDIA/TensorRT-LLM/issues/9878 to enable by default
    enabled: true
  fuse_mamba_a_log:
    # Fuse the Mamba A_log term during the post-weight-load fusion stage.
    stage: post_load_fusion
    enabled: true
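
For reference, the sketch below shows one way to load this file and sanity-check a few of its internal constraints with PyYAML before handing it to the AutoDeploy runner. It is a minimal illustration, not part of the example; it assumes nothing beyond the keys shown in the config above and a local copy named `nemotron_flash.yaml`.

```python
import yaml

# Load the AutoDeploy config shown above.
with open("nemotron_flash.yaml") as f:
    cfg = yaml.safe_load(f)

# Every pre-captured CUDA graph batch size must fit within max_batch_size,
# since batches are padded up to the nearest captured size.
assert max(cfg["cuda_graph_batch_sizes"]) <= cfg["max_batch_size"]

# With chunked prefill enabled, a 2M-token prompt is processed in
# max_num_tokens-sized chunks rather than one giant forward pass.
assert cfg["enable_chunked_prefill"]
assert cfg["max_num_tokens"] <= cfg["max_seq_len"]

print(f"{len(cfg['cuda_graph_batch_sizes'])} CUDA graph sizes, "
      f"context up to {cfg['max_seq_len']:,} tokens")
```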