# AutoDeploy Model Registry
The AutoDeploy model registry provides a comprehensive, maintainable list of supported models for testing and coverage tracking.
## Format
**Version: 2.0** (Flat format with composable configurations)
### Structure
```yaml
version: '2.0'
description: AutoDeploy Model Registry - Flat format with composable configs
models:
- name: meta-llama/Llama-3.1-8B-Instruct
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml]
- name: meta-llama/Llama-3.3-70B-Instruct
  yaml_extra: [dashboard_default.yaml, world_size_4.yaml, llama-3.3-70b.yaml]
- name: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml, demollm_triton.yaml]
```
### Key Concepts
- **Flat list**: Models are in a single flat list (not grouped)
- **Composable configs**: Each model references YAML config files via `yaml_extra` (see the sketch after this list)
- **Deep merging**: Config files are merged in order (later files override earlier ones)
- **No inline args**: All configuration lives in YAML files for reusability
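To make the structure concrete, here is a minimal sketch (not part of the registry tooling) that loads `models.yaml` with PyYAML and prints each model's config stack; the registry path follows the repository layout referenced later in this document:

```python
# Minimal sketch: walk the flat model list and show each config stack.
# Assumes PyYAML is installed and the registry path below.
import yaml

with open("examples/auto_deploy/model_registry/models.yaml") as f:
    registry = yaml.safe_load(f)

assert registry["version"] == "2.0"
for model in registry["models"]:
    # Each entry is just a name plus an ordered list of config files.
    print(f"{model['name']}: {model['yaml_extra']}")
```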
## Configuration Files
Config files are stored in the `configs/` subdirectory and define runtime parameters:
### Core Configs
| File | Purpose | Example Use |
|------|---------|-------------|
| `dashboard_default.yaml` | Baseline settings for all models | Always first in `yaml_extra` |
| `world_size_N.yaml` | GPU count (1, 2, 4, 8) | Sets `tensor_parallel_size` |
### Runtime Configs
| File | Purpose |
|------|---------|
| `multimodal.yaml` | Vision + text models |
| `demollm_triton.yaml` | DemoLLM runtime with Triton backend |
| `simple_shard_only.yaml` | Large models requiring simple sharding |
### Model-Specific Configs
| File | Purpose |
|------|---------|
| `llama-3.3-70b.yaml` | Optimized settings for Llama 3.3 70B |
| `nano_v3.yaml` | Settings for Nemotron Nano V3 |
| `llama-4-scout.yaml` | Settings for Llama 4 Scout |
| `openelm.yaml` | Apple OpenELM (custom tokenizer) |
| `gemma3_1b.yaml` | Gemma 3 1B (sequence-length settings) |
| `deepseek_v3_lite.yaml` | DeepSeek V3/R1 (reduced layers) |
| `llama4_maverick_lite.yaml` | Llama 4 Maverick (reduced layers) |
## Adding a New Model
### Simple Model (Standard Config)
```yaml
- name: organization/my-new-model-7b
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml]
```
### Model with Special Requirements
```yaml
- name: organization/my-multimodal-model
  yaml_extra: [dashboard_default.yaml, world_size_4.yaml, multimodal.yaml]
```
### Model with Custom Config
1. Create `configs/my_model.yaml`:
```yaml
# Custom settings for my model
max_batch_size: 2048
kv_cache_free_gpu_memory_fraction: 0.95
cuda_graph_config:
  enable_padding: true
```
2. Reference it in `models.yaml`:
```yaml
- name: organization/my-custom-model
  yaml_extra: [dashboard_default.yaml, world_size_8.yaml, my_model.yaml]
```
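When wiring up a new model, it is easy to typo a config filename. A small, hypothetical validation script (not shipped with the repository) can catch dangling `yaml_extra` references before you run the full coverage tooling:

```python
# Hypothetical helper: flag yaml_extra entries that do not resolve to a
# file under configs/. Assumes the registry layout described above.
from pathlib import Path

import yaml

registry_dir = Path("examples/auto_deploy/model_registry")
registry = yaml.safe_load((registry_dir / "models.yaml").read_text())

ok = True
for model in registry["models"]:
    for cfg in model.get("yaml_extra", []):
        if not (registry_dir / "configs" / cfg).is_file():
            print(f"{model['name']}: missing configs/{cfg}")
            ok = False
print("registry OK" if ok else "registry has dangling references")
```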
## Config Merging
Configs are merged in order. Example:
```yaml
yaml_extra:
  - dashboard_default.yaml  # baseline: runtime=trtllm, benchmark_enabled=true
  - world_size_2.yaml       # adds: tensor_parallel_size=2
  - openelm.yaml            # overrides: tokenizer=llama-2, benchmark_enabled=false
```
**Result**: `runtime=trtllm, tensor_parallel_size=2, tokenizer=llama-2, benchmark_enabled=false`
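The merge itself is a recursive dictionary merge. The sketch below reproduces the example above with plain Python dicts standing in for the three config files; the dict contents mirror the values shown in the comments, not the real files:

```python
# Recursive merge: later configs win; nested dicts merge key by key.
def deep_merge(base: dict, override: dict) -> dict:
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Stand-ins for the three config files in the example above.
dashboard_default = {"runtime": "trtllm", "benchmark_enabled": True}
world_size_2 = {"tensor_parallel_size": 2}
openelm = {"tokenizer": "llama-2", "benchmark_enabled": False}

merged = {}
for cfg in (dashboard_default, world_size_2, openelm):
    merged = deep_merge(merged, cfg)
print(merged)
# {'runtime': 'trtllm', 'benchmark_enabled': False,
#  'tensor_parallel_size': 2, 'tokenizer': 'llama-2'}
```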
## World Size Guidelines
| World Size | Model Size Range | Example Models |
|------------|------------------|----------------|
| 1 | \< 2B params | TinyLlama, Qwen 0.5B, Phi-4-mini |
| 2 | 2-15B params | Llama 3.1 8B, Qwen 7B, Mistral 7B |
| 4 | 20-80B params | Llama 3.3 70B, QwQ 32B, Gemma 27B |
| 8 | 80B+ params | DeepSeek V3, Llama 405B, Nemotron Ultra |
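For scripting, the table translates into a simple threshold function. This is purely illustrative: the thresholds come from the table above, and the 15-20B gap is resolved toward world size 4 here as an assumption.

```python
# Illustrative mapping from parameter count (billions) to the suggested
# world size. Thresholds follow the table above; models between 15B and
# 20B are assigned world size 4 here, which is an assumption.
def suggested_world_size(params_b: float) -> int:
    if params_b < 2:
        return 1
    if params_b <= 15:
        return 2
    if params_b <= 80:
        return 4
    return 8

print(suggested_world_size(8))    # e.g. Llama 3.1 8B -> 2
print(suggested_world_size(70))   # e.g. Llama 3.3 70B -> 4
print(suggested_world_size(405))  # e.g. Llama 405B -> 8
```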
## Model Coverage
The registry contains models distributed across different GPU configurations (world sizes 1, 2, 4, and 8), including both text-only and multimodal models.
**To verify current model counts and coverage:**
```bash
cd /path/to/autodeploy-dashboard
python3 scripts/prepare_model_coverage_v2.py \
--source local \
--local-path /path/to/TensorRT-LLM \
--output /tmp/model_coverage.yaml
# Count model entries
grep -c "^- name:" /path/to/TensorRT-LLM/examples/auto_deploy/model_registry/models.yaml
```
When adding or removing models, use `prepare_model_coverage_v2.py` to validate the registry structure and coverage.
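If you want a quick per-world-size breakdown without the full tooling, a short script can tally the `world_size_N.yaml` references directly (a rough heuristic, assuming every model lists exactly one such file):

```python
# Rough coverage tally: count models per world_size_N.yaml reference.
# Assumes each model references exactly one world_size_*.yaml file.
import re
from collections import Counter

import yaml

with open("examples/auto_deploy/model_registry/models.yaml") as f:
    registry = yaml.safe_load(f)

counts: Counter = Counter()
for model in registry["models"]:
    for cfg in model.get("yaml_extra", []):
        match = re.fullmatch(r"world_size_(\d+)\.yaml", cfg)
        if match:
            counts[int(match.group(1))] += 1

for world_size in sorted(counts):
    print(f"world_size={world_size}: {counts[world_size]} models")
```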
## Best Practices
1. **Always include `dashboard_default.yaml` first** - it provides baseline settings
1. **Always include a `world_size_N.yaml`** - defines GPU count
1. **Add special configs after world_size** - they override defaults
1. **Create reusable configs** - if 3+ models need the same settings, make a shared config file
1. **Use model-specific configs sparingly** - only for unique requirements
1. **Test before committing** - verify with `prepare_model_coverage_v2.py`
## Testing Changes
```bash
# Generate workload from local changes
cd /path/to/autodeploy-dashboard
python3 scripts/prepare_model_coverage_v2.py \
--source local \
--local-path /path/to/TensorRT-LLM \
--output /tmp/test_workload.yaml
# Verify output
cat /tmp/test_workload.yaml
```