
AutoDeploy Model Registry

The AutoDeploy model registry provides a comprehensive, maintainable list of supported models for testing and coverage tracking.

Format

Version: 2.0 (Flat format with composable configurations)

Structure

version: '2.0'
description: AutoDeploy Model Registry - Flat format with composable configs
models:
- name: meta-llama/Llama-3.1-8B-Instruct
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml]

- name: meta-llama/Llama-3.3-70B-Instruct
  yaml_extra: [dashboard_default.yaml, world_size_4.yaml, llama-3.3-70b.yaml]

- name: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml, demollm_triton.yaml]

Key Concepts

  • Flat list: Models are in a single flat list (not grouped)
  • Composable configs: Each model references YAML config files via yaml_extra
  • Deep merging: Config files are merged in order (later files override earlier ones)
  • No inline args: All configuration is in YAML files for reusability

Configuration Files

Config files are stored in the configs/ subdirectory and define runtime parameters:

Core Configs

  • dashboard_default.yaml: Baseline settings for all models; always listed first in yaml_extra
  • world_size_N.yaml: GPU count (N = 1, 2, 4, or 8); defines tensor_parallel_size
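
The exact contents live in configs/; as a rough, hypothetical sketch of the kinds of keys involved (values inferred from the merging example further down, not copied from the real files):

# dashboard_default.yaml (hypothetical sketch; see configs/ for the real file)
runtime: trtllm
benchmark_enabled: true

# world_size_2.yaml (hypothetical sketch)
tensor_parallel_size: 2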

Runtime Configs

  • multimodal.yaml: Vision + text models
  • demollm_triton.yaml: DemoLLM runtime with the Triton backend
  • simple_shard_only.yaml: Large models requiring simple sharding

Model-Specific Configs

  • llama-3.3-70b.yaml: Optimized settings for Llama 3.3 70B
  • nano_v3.yaml: Settings for Nemotron Nano V3
  • llama-4-scout.yaml: Settings for Llama 4 Scout
  • openelm.yaml: Apple OpenELM (custom tokenizer)
  • gemma3_1b.yaml: Gemma 3 1B (sequence length)
  • deepseek_v3_lite.yaml: DeepSeek V3/R1 (reduced layers)
  • llama4_maverick_lite.yaml: Llama 4 Maverick (reduced layers)
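
Model-specific configs go last in yaml_extra so their overrides take effect. A hypothetical entry using openelm.yaml (the model path is invented for this sketch; check the actual registry for real entries):

- name: apple/OpenELM-1_1B-Instruct   # hypothetical model path, for illustration only
  yaml_extra: [dashboard_default.yaml, world_size_1.yaml, openelm.yaml]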

Adding a New Model

Simple Model (Standard Config)

- name: organization/my-new-model-7b
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml]

Model with Special Requirements

- name: organization/my-multimodal-model
  yaml_extra: [dashboard_default.yaml, world_size_4.yaml, multimodal.yaml]

Model with Custom Config

  1. Create configs/my_model.yaml:
# Custom settings for my model
max_batch_size: 2048
kv_cache_free_gpu_memory_fraction: 0.95
cuda_graph_config:
  enable_padding: true
  2. Reference it in models.yaml:
- name: organization/my-custom-model
  yaml_extra: [dashboard_default.yaml, world_size_8.yaml, my_model.yaml]

Config Merging

Configs are merged in order. Example:

yaml_extra:
  - dashboard_default.yaml    # baseline: runtime=trtllm, benchmark_enabled=true
  - world_size_2.yaml         # adds: tensor_parallel_size=2
  - openelm.yaml              # overrides: tokenizer=llama-2, benchmark_enabled=false

Result: runtime=trtllm, tensor_parallel_size=2, tokenizer=llama-2, benchmark_enabled=false
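
Deep merging also applies to nested mappings: assuming standard deep-merge semantics, keys from a later file are folded into an existing nested section rather than replacing it wholesale. A hypothetical illustration (file roles and the batch_sizes key are invented for this sketch):

# earlier file (hypothetical)
cuda_graph_config:
  enable_padding: true

# later file (hypothetical)
cuda_graph_config:
  batch_sizes: [1, 2, 4]    # hypothetical key, for illustration only

# merged result: the two nested sections are combined key by key
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4]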

World Size Guidelines

  • World size 1: < 2B params (e.g., TinyLlama, Qwen 0.5B, Phi-4-mini)
  • World size 2: 2-15B params (e.g., Llama 3.1 8B, Qwen 7B, Mistral 7B)
  • World size 4: 20-80B params (e.g., Llama 3.3 70B, QwQ 32B, Gemma 27B)
  • World size 8: 80B+ params (e.g., DeepSeek V3, Llama 405B, Nemotron Ultra)
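
For example, a hypothetical 32B model falls in the 20-80B band, so its entry would pair dashboard_default.yaml with world_size_4.yaml (the model name below is invented for this sketch):

- name: organization/my-32b-model     # hypothetical model name
  yaml_extra: [dashboard_default.yaml, world_size_4.yaml]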

Model Coverage

The registry contains models distributed across different GPU configurations (world sizes 1, 2, 4, and 8), including both text-only and multimodal models.

To verify current model counts and coverage:

cd /path/to/autodeploy-dashboard
python3 scripts/prepare_model_coverage_v2.py \
    --source local \
    --local-path /path/to/TensorRT-LLM \
    --output /tmp/model_coverage.yaml

# Count model entries in the registry
grep -c "^- name:" /path/to/TensorRT-LLM/examples/auto_deploy/model_registry/models.yaml

When adding or removing models, use prepare_model_coverage_v2.py to validate the registry structure and coverage.

Best Practices

  1. Always include dashboard_default.yaml first - it provides baseline settings
  2. Always include a world_size_N.yaml - defines GPU count
  3. Add special configs after world_size - they override defaults
  4. Create reusable configs - if 3+ models need the same settings, make a shared config file (see the sketch after this list)
  5. Use model-specific configs sparingly - only for unique requirements
  6. Test before committing - verify with prepare_model_coverage_v2.py
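
As a sketch of practice 4, a shared config referenced by several entries keeps the settings in one place (the file name, key, and model names below are all invented for illustration):

# configs/long_context.yaml (hypothetical reusable config)
max_batch_size: 512

# models.yaml entries reusing it (hypothetical model names)
- name: organization/model-a
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml, long_context.yaml]
- name: organization/model-b
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml, long_context.yaml]
- name: organization/model-c
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml, long_context.yaml]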

Testing Changes

# Generate workload from local changes
cd /path/to/autodeploy-dashboard
python3 scripts/prepare_model_coverage_v2.py \
    --source local \
    --local-path /path/to/TensorRT-LLM \
    --output /tmp/test_workload.yaml

# Verify output
cat /tmp/test_workload.yaml