# AutoDeploy Model Registry
The AutoDeploy model registry provides a comprehensive, maintainable list of supported models for testing and coverage tracking.
## Format

**Version:** 2.0 (Flat format with composable configurations)
### Structure

```yaml
version: '2.0'
description: AutoDeploy Model Registry - Flat format with composable configs
models:
- name: meta-llama/Llama-3.1-8B-Instruct
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml]
- name: meta-llama/Llama-3.3-70B-Instruct
  yaml_extra: [dashboard_default.yaml, world_size_4.yaml, llama-3.3-70b.yaml]
- name: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml, demollm_triton.yaml]
```
### Key Concepts

- Flat list: Models are in a single flat list (not grouped)
- Composable configs: Each model references YAML config files via `yaml_extra`
- Deep merging: Config files are merged in order (later files override earlier ones); see the sketch after this list
- No inline args: All configuration is in YAML files for reusability
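The deep-merge rule can be pictured with a short Python sketch. This is illustrative only; the registry's actual loader may differ, and the `deep_merge` helper and sample dicts below are assumptions:

```python
from copy import deepcopy

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; values from override win."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into nested dicts
        else:
            merged[key] = value  # scalars and lists are replaced wholesale
    return merged

# Hypothetical config contents; the file listed later in yaml_extra is merged last.
base = {"max_batch_size": 1024, "cuda_graph_config": {"enable_padding": False}}
later = {"cuda_graph_config": {"enable_padding": True}}
print(deep_merge(base, later))
# {'max_batch_size': 1024, 'cuda_graph_config': {'enable_padding': True}}
```

Note that under these assumed semantics only nested mappings merge; a list value in a later file replaces the earlier list outright.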
## Configuration Files

Config files are stored in the `configs/` subdirectory and define runtime parameters:
### Core Configs

| File | Purpose | Example Use |
|---|---|---|
| `dashboard_default.yaml` | Baseline settings for all models | Always first in `yaml_extra` |
| `world_size_N.yaml` | GPU count (1, 2, 4, 8) | Defines `tensor_parallel_size` |
### Runtime Configs

| File | Purpose |
|---|---|
| `multimodal.yaml` | Vision + text models |
| `demollm_triton.yaml` | DemoLLM runtime with Triton backend |
| `simple_shard_only.yaml` | Large models requiring simple sharding |
### Model-Specific Configs

| File | Purpose |
|---|---|
| `llama-3.3-70b.yaml` | Optimized settings for Llama 3.3 70B |
| `nano_v3.yaml` | Settings for Nemotron Nano V3 |
| `llama-4-scout.yaml` | Settings for Llama 4 Scout |
| `openelm.yaml` | Apple OpenELM (custom tokenizer) |
| `gemma3_1b.yaml` | Gemma 3 1B (sequence length) |
| `deepseek_v3_lite.yaml` | DeepSeek V3/R1 (reduced layers) |
| `llama4_maverick_lite.yaml` | Llama 4 Maverick (reduced layers) |
## Adding a New Model

### Simple Model (Standard Config)
```yaml
- name: organization/my-new-model-7b
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml]
```
### Model with Special Requirements
```yaml
- name: organization/my-multimodal-model
  yaml_extra: [dashboard_default.yaml, world_size_4.yaml, multimodal.yaml]
```
### Model with Custom Config

1. Create `configs/my_model.yaml`:

   ```yaml
   # Custom settings for my model
   max_batch_size: 2048
   kv_cache_free_gpu_memory_fraction: 0.95
   cuda_graph_config:
     enable_padding: true
   ```

2. Reference it in `models.yaml`:

   ```yaml
   - name: organization/my-custom-model
     yaml_extra: [dashboard_default.yaml, world_size_8.yaml, my_model.yaml]
   ```
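When adding entries by hand, a quick sanity check that every referenced config file exists can save a CI round-trip. A minimal sketch with PyYAML, assuming the registry path used elsewhere in this README (the script itself is hypothetical, not part of the repo):

```python
from pathlib import Path

import yaml  # PyYAML

# Hypothetical checkout location; adjust to your tree.
registry = Path("/path/to/TensorRT-LLM/examples/auto_deploy/model_registry")
data = yaml.safe_load((registry / "models.yaml").read_text())

# Flag any yaml_extra entry with no matching file under configs/.
for model in data["models"]:
    for extra in model.get("yaml_extra", []):
        if not (registry / "configs" / extra).is_file():
            print(f"{model['name']}: missing configs/{extra}")
```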
## Config Merging

Configs are merged in order. Example:

```yaml
yaml_extra:
- dashboard_default.yaml # baseline: runtime=trtllm, benchmark_enabled=true
- world_size_2.yaml      # adds: tensor_parallel_size=2
- openelm.yaml           # overrides: tokenizer=llama-2, benchmark_enabled=false
```

Result: `runtime=trtllm, tensor_parallel_size=2, tokenizer=llama-2, benchmark_enabled=false`
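For the flat keys in this example, a plain left-to-right dict merge in Python reproduces the documented result (the dict contents below are taken from the comments above, not read from the real files):

```python
# Illustrative flattened contents of the three files, from the comments above.
dashboard_default = {"runtime": "trtllm", "benchmark_enabled": True}
world_size_2 = {"tensor_parallel_size": 2}
openelm = {"tokenizer": "llama-2", "benchmark_enabled": False}

# Later files win: openelm's benchmark_enabled=False overrides the baseline.
merged = {**dashboard_default, **world_size_2, **openelm}
print(merged)
# {'runtime': 'trtllm', 'benchmark_enabled': False,
#  'tensor_parallel_size': 2, 'tokenizer': 'llama-2'}
```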
## World Size Guidelines
| World Size | Model Size Range | Example Models |
|---|---|---|
| 1 | < 2B params | TinyLlama, Qwen 0.5B, Phi-4-mini |
| 2 | 2-15B params | Llama 3.1 8B, Qwen 7B, Mistral 7B |
| 4 | 20-80B params | Llama 3.3 70B, QwQ 32B, Gemma 27B |
| 8 | 80B+ params | DeepSeek V3, Llama 405B, Nemotron Ultra |
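When scripting model additions, the table's bands can be encoded as a small helper. The cutoffs below follow the table; the unlisted 15-20B gap is folded into the 4-GPU band as an assumption:

```python
def suggested_world_size(params_billion: float) -> int:
    """World size suggested by the table above for a given parameter count."""
    if params_billion < 2:
        return 1
    if params_billion <= 15:
        return 2
    if params_billion <= 80:  # assumption: the 15-20B gap rounds up to 4 GPUs
        return 4
    return 8

print(suggested_world_size(8))    # Llama 3.1 8B  -> 2
print(suggested_world_size(70))   # Llama 3.3 70B -> 4
print(suggested_world_size(405))  # Llama 405B    -> 8
```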
## Model Coverage
The registry contains models distributed across different GPU configurations (world sizes 1, 2, 4, and 8), including both text-only and multimodal models.
To verify current model counts and coverage:
```bash
cd /path/to/autodeploy-dashboard
python3 scripts/prepare_model_coverage_v2.py \
    --source local \
    --local-path /path/to/TensorRT-LLM \
    --output /tmp/model_coverage.yaml

# View summary: count registry entries (one "- name:" line per model)
grep -c "^- name:" /path/to/TensorRT-LLM/examples/auto_deploy/model_registry/models.yaml
```

When adding or removing models, use `prepare_model_coverage_v2.py` to validate the registry structure and coverage.
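For a quick world-size distribution without the dashboard tooling, a short PyYAML sketch can tally the `world_size_N.yaml` references directly (paths are placeholders, as above):

```python
from collections import Counter
from pathlib import Path

import yaml  # PyYAML

# Hypothetical checkout location, matching the commands above.
models_yaml = Path("/path/to/TensorRT-LLM/examples/auto_deploy/model_registry/models.yaml")
models = yaml.safe_load(models_yaml.read_text())["models"]

# Tally models per world_size_N.yaml reference.
counts = Counter(
    extra
    for model in models
    for extra in model.get("yaml_extra", [])
    if extra.startswith("world_size_")
)
print(f"total models: {len(models)}")
for config, count in sorted(counts.items()):
    print(f"{config}: {count}")
```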
## Best Practices

- Always include `dashboard_default.yaml` first - it provides baseline settings
- Always include a `world_size_N.yaml` - it defines the GPU count
- Add special configs after world_size - they override defaults
- Create reusable configs - if 3+ models need the same settings, make a config file
- Use model-specific configs sparingly - only for unique requirements
- Test before committing - verify with `prepare_model_coverage_v2.py`
Testing Changes
# Generate workload from local changes
cd /path/to/autodeploy-dashboard
python3 scripts/prepare_model_coverage_v2.py \
--source local \
--local-path /path/to/TensorRT-LLM \
--output /tmp/test_workload.yaml
# Verify output
cat /tmp/test_workload.yaml