# AutoDeploy Model Registry
The AutoDeploy model registry provides a comprehensive, maintainable list of supported models for testing and coverage tracking.
## Format

**Version:** 2.0 (Flat format with composable configurations)
### Structure

```yaml
version: '2.0'
description: AutoDeploy Model Registry - Flat format with composable configs
models:
- name: meta-llama/Llama-3.1-8B-Instruct
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml]
- name: meta-llama/Llama-3.3-70B-Instruct
  yaml_extra: [dashboard_default.yaml, world_size_4.yaml, llama-3.3-70b.yaml]
- name: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml, demollm_triton.yaml]
```
### Key Concepts

- Flat list: Models are in a single flat list (not grouped)
- Composable configs: Each model references YAML config files via `yaml_extra`
- Deep merging: Config files are merged in order (later files override earlier ones); see the sketch after this list
- No inline args: All configuration is in YAML files for reusability
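The deep-merge rule can be pictured with a short Python sketch. This is illustrative only; the registry's actual loader may differ, and the `deep_merge` helper and sample dicts below are assumptions:

```python
from copy import deepcopy

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; values from override win."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into nested dicts
        else:
            merged[key] = value  # scalars and lists are replaced wholesale
    return merged

# Hypothetical config contents; the file listed later in yaml_extra is merged last.
base = {"max_batch_size": 1024, "cuda_graph_config": {"enable_padding": False}}
later = {"cuda_graph_config": {"enable_padding": True}}
print(deep_merge(base, later))
# {'max_batch_size': 1024, 'cuda_graph_config': {'enable_padding': True}}
```

Note that under these assumed semantics only nested mappings merge; a list value in a later file replaces the earlier list outright.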
## Configuration Files

Config files are stored in the `configs/` subdirectory and define runtime parameters:
### Core Configs

| File | Purpose | Example Use |
|---|---|---|
| `dashboard_default.yaml` | Baseline settings for all models | Always first in `yaml_extra` |
| `world_size_N.yaml` | GPU count (1, 2, 4, 8) | Defines `tensor_parallel_size` |
### Runtime Configs

| File | Purpose |
|---|---|
| `multimodal.yaml` | Vision + text models |
| `demollm_triton.yaml` | DemoLLM runtime with Triton backend |
| `simple_shard_only.yaml` | Large models requiring simple sharding |
### Model-Specific Configs

| File | Purpose |
|---|---|
| `llama-3.3-70b.yaml` | Optimized settings for Llama 3.3 70B |
| `nano_v3.yaml` | Settings for Nemotron Nano V3 |
| `llama-4-scout.yaml` | Settings for Llama 4 Scout |
| `openelm.yaml` | Apple OpenELM (custom tokenizer) |
| `gemma3_1b.yaml` | Gemma 3 1B (sequence length) |
| `deepseek_v3_lite.yaml` | DeepSeek V3/R1 (reduced layers) |
| `llama4_maverick_lite.yaml` | Llama 4 Maverick (reduced layers) |
## Adding a New Model

### Simple Model (Standard Config)
```yaml
- name: organization/my-new-model-7b
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml]
```
### Model with Special Requirements
```yaml
- name: organization/my-multimodal-model
  yaml_extra: [dashboard_default.yaml, world_size_4.yaml, multimodal.yaml]
```
### Model with Custom Config

1. Create `configs/my_model.yaml`:

   ```yaml
   # Custom settings for my model
   max_batch_size: 2048
   kv_cache_free_gpu_memory_fraction: 0.95
   cuda_graph_config:
     enable_padding: true
   ```

2. Reference it in `models.yaml`:

   ```yaml
   - name: organization/my-custom-model
     yaml_extra: [dashboard_default.yaml, world_size_8.yaml, my_model.yaml]
   ```
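When adding entries by hand, a quick sanity check that every referenced config file exists can save a CI round-trip. A minimal sketch with PyYAML, assuming the registry path used elsewhere in this README (the script itself is hypothetical, not part of the repo):

```python
from pathlib import Path

import yaml  # PyYAML

# Hypothetical checkout location; adjust to your tree.
registry = Path("/path/to/TensorRT-LLM/examples/auto_deploy/model_registry")
data = yaml.safe_load((registry / "models.yaml").read_text())

# Flag any yaml_extra entry with no matching file under configs/.
for model in data["models"]:
    for extra in model.get("yaml_extra", []):
        if not (registry / "configs" / extra).is_file():
            print(f"{model['name']}: missing configs/{extra}")
```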
## Config Merging

Configs are merged in order. Example:

```yaml
yaml_extra:
- dashboard_default.yaml # baseline: runtime=trtllm, benchmark_enabled=true
- world_size_2.yaml      # adds: tensor_parallel_size=2
- openelm.yaml           # overrides: tokenizer=llama-2, benchmark_enabled=false
```

Result: `runtime=trtllm, tensor_parallel_size=2, tokenizer=llama-2, benchmark_enabled=false`
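For the flat keys in this example, a plain left-to-right dict merge in Python reproduces the documented result (the dict contents below are taken from the comments above, not read from the real files):

```python
# Illustrative flattened contents of the three files, from the comments above.
dashboard_default = {"runtime": "trtllm", "benchmark_enabled": True}
world_size_2 = {"tensor_parallel_size": 2}
openelm = {"tokenizer": "llama-2", "benchmark_enabled": False}

# Later files win: openelm's benchmark_enabled=False overrides the baseline.
merged = {**dashboard_default, **world_size_2, **openelm}
print(merged)
# {'runtime': 'trtllm', 'benchmark_enabled': False,
#  'tensor_parallel_size': 2, 'tokenizer': 'llama-2'}
```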
## World Size Guidelines
| World Size | Model Size Range | Example Models |
|---|---|---|
| 1 | < 2B params | TinyLlama, Qwen 0.5B, Phi-4-mini |
| 2 | 2-15B params | Llama 3.1 8B, Qwen 7B, Mistral 7B |
| 4 | 20-80B params | Llama 3.3 70B, QwQ 32B, Gemma 27B |
| 8 | 80B+ params | DeepSeek V3, Llama 405B, Nemotron Ultra |
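When scripting model additions, the table's bands can be encoded as a small helper. The cutoffs below follow the table; the unlisted 15-20B gap is folded into the 4-GPU band as an assumption:

```python
def suggested_world_size(params_billion: float) -> int:
    """World size suggested by the table above for a given parameter count."""
    if params_billion < 2:
        return 1
    if params_billion <= 15:
        return 2
    if params_billion <= 80:  # assumption: the 15-20B gap rounds up to 4 GPUs
        return 4
    return 8

print(suggested_world_size(8))    # Llama 3.1 8B  -> 2
print(suggested_world_size(70))   # Llama 3.3 70B -> 4
print(suggested_world_size(405))  # Llama 405B    -> 8
```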
## Model Coverage
The registry contains models distributed across different GPU configurations (world sizes 1, 2, 4, and 8), including both text-only and multimodal models.
To verify current model counts and coverage:
```bash
cd /path/to/autodeploy-dashboard
python3 scripts/prepare_model_coverage_v2.py \
    --source local \
    --local-path /path/to/TensorRT-LLM \
    --output /tmp/model_coverage.yaml

# View summary: count registry entries (one "- name:" line per model)
grep -c "^- name:" /path/to/TensorRT-LLM/examples/auto_deploy/model_registry/models.yaml
```

When adding or removing models, use `prepare_model_coverage_v2.py` to validate the registry structure and coverage.
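For a quick world-size distribution without the dashboard tooling, a short PyYAML sketch can tally the `world_size_N.yaml` references directly (paths are placeholders, as above):

```python
from collections import Counter
from pathlib import Path

import yaml  # PyYAML

# Hypothetical checkout location, matching the commands above.
models_yaml = Path("/path/to/TensorRT-LLM/examples/auto_deploy/model_registry/models.yaml")
models = yaml.safe_load(models_yaml.read_text())["models"]

# Tally models per world_size_N.yaml reference.
counts = Counter(
    extra
    for model in models
    for extra in model.get("yaml_extra", [])
    if extra.startswith("world_size_")
)
print(f"total models: {len(models)}")
for config, count in sorted(counts.items()):
    print(f"{config}: {count}")
```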
## Best Practices

- Always include `dashboard_default.yaml` first - it provides baseline settings
- Always include a `world_size_N.yaml` - it defines the GPU count
- Add special configs after world_size - they override defaults
- Create reusable configs - if 3+ models need the same settings, make a config file
- Use model-specific configs sparingly - only for unique requirements
- Test before committing - verify with `prepare_model_coverage_v2.py`
Testing Changes
# Generate workload from local changes
cd /path/to/autodeploy-dashboard
python3 scripts/prepare_model_coverage_v2.py \
--source local \
--local-path /path/to/TensorRT-LLM \
--output /tmp/test_workload.yaml
# Verify output
cat /tmp/test_workload.yaml