# AutoDeploy Model Registry

The AutoDeploy model registry provides a comprehensive, maintainable list of supported models for testing and coverage tracking.

## Format

**Version: 2.0** (Flat format with composable configurations)

### Structure

```yaml
version: '2.0'
description: AutoDeploy Model Registry - Flat format with composable configs
models:
- name: meta-llama/Llama-3.1-8B-Instruct
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml]

- name: meta-llama/Llama-3.3-70B-Instruct
  yaml_extra: [dashboard_default.yaml, world_size_4.yaml, llama-3.3-70b.yaml]

- name: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml, demollm_triton.yaml]
```

### Key Concepts

- **Flat list**: Models are in a single flat list (not grouped)
- **Composable configs**: Each model references YAML config files via `yaml_extra`
- **Deep merging**: Config files are merged in order (later files override earlier ones)
- **No inline args**: All configuration is in YAML files for reusability

## Configuration Files

Config files are stored in the `configs/` subdirectory and define runtime parameters:

### Core Configs

| File | Purpose | Example Use |
|------|---------|-------------|
| `dashboard_default.yaml` | Baseline settings for all models | Always first in `yaml_extra` |
| `world_size_N.yaml` | GPU count (1, 2, 4, 8) | Defines `tensor_parallel_size` |

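As a rough illustration, the two core configs might look like the sketch below. The key names (`runtime`, `benchmark_enabled`, `tensor_parallel_size`) are inferred from the merge example later in this document; the actual files may contain additional or differently named settings.

```yaml
# configs/dashboard_default.yaml (illustrative sketch, not the actual file contents)
runtime: trtllm
benchmark_enabled: true

# configs/world_size_2.yaml (illustrative sketch)
tensor_parallel_size: 2
```
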
### Runtime Configs

| File | Purpose |
|------|---------|
| `multimodal.yaml` | Vision + text models |
| `demollm_triton.yaml` | DemoLLM runtime with Triton backend |
| `simple_shard_only.yaml` | Large models requiring simple sharding |

### Model-Specific Configs

| File | Purpose |
|------|---------|
| `llama-3.3-70b.yaml` | Optimized settings for Llama 3.3 70B |
| `nano_v3.yaml` | Settings for Nemotron Nano V3 |
| `llama-4-scout.yaml` | Settings for Llama 4 Scout |
| `openelm.yaml` | Apple OpenELM (custom tokenizer) |
| `gemma3_1b.yaml` | Gemma 3 1B (sequence length) |
| `deepseek_v3_lite.yaml` | DeepSeek V3/R1 (reduced layers) |
| `llama4_maverick_lite.yaml` | Llama 4 Maverick (reduced layers) |

## Adding a New Model

### Simple Model (Standard Config)

```yaml
- name: organization/my-new-model-7b
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml]
```

### Model with Special Requirements

```yaml
- name: organization/my-multimodal-model
  yaml_extra: [dashboard_default.yaml, world_size_4.yaml, multimodal.yaml]
```

### Model with Custom Config

1. Create `configs/my_model.yaml`:

```yaml
# Custom settings for my model
max_batch_size: 2048
kv_cache_free_gpu_memory_fraction: 0.95
cuda_graph_config:
  enable_padding: true
```

2. Reference it in `models.yaml`:

```yaml
- name: organization/my-custom-model
  yaml_extra: [dashboard_default.yaml, world_size_8.yaml, my_model.yaml]
```

## Config Merging

Configs are merged in order. Example:

```yaml
yaml_extra:
  - dashboard_default.yaml  # baseline: runtime=trtllm, benchmark_enabled=true
  - world_size_2.yaml       # adds: tensor_parallel_size=2
  - openelm.yaml            # overrides: tokenizer=llama-2, benchmark_enabled=false
```

**Result**: `runtime=trtllm, tensor_parallel_size=2, tokenizer=llama-2, benchmark_enabled=false`
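To make the override behavior concrete, the effective configuration produced by merging the three files above would look roughly like the sketch below. Key names follow the comments in the example; the real files may define additional settings.

```yaml
# Effective merged configuration (illustrative sketch)
runtime: trtllm            # from dashboard_default.yaml
tensor_parallel_size: 2    # added by world_size_2.yaml
tokenizer: llama-2         # added by openelm.yaml
benchmark_enabled: false   # openelm.yaml overrides the baseline value of true
```
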
## World Size Guidelines

| World Size | Model Size Range | Example Models |
|------------|------------------|----------------|
| 1 | < 2B params | TinyLlama, Qwen 0.5B, Phi-4-mini |
| 2 | 2-15B params | Llama 3.1 8B, Qwen 7B, Mistral 7B |
| 4 | 20-80B params | Llama 3.3 70B, QwQ 32B, Gemma 27B |
| 8 | 80B+ params | DeepSeek V3, Llama 405B, Nemotron Ultra |

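For example, the registry entries shown earlier follow this guideline: Llama 3.1 8B pairs with `world_size_2.yaml`, while Llama 3.3 70B pairs with `world_size_4.yaml`.
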
## Model Coverage

The registry contains models distributed across different GPU configurations (world sizes 1, 2, 4, and 8), including both text-only and multimodal models.

**To verify current model counts and coverage:**

```bash
cd /path/to/autodeploy-dashboard
python3 scripts/prepare_model_coverage_v2.py \
  --source local \
  --local-path /path/to/TensorRT-LLM \
  --output /tmp/model_coverage.yaml

# View summary
grep -E "^- name:|yaml_extra:" /path/to/TensorRT-LLM/examples/auto_deploy/model_registry/models.yaml | wc -l
```

When adding or removing models, use `prepare_model_coverage_v2.py` to validate the registry structure and coverage.
## Best Practices

1. **Always include `dashboard_default.yaml` first** - it provides baseline settings
1. **Always include a `world_size_N.yaml`** - defines GPU count
1. **Add special configs after world_size** - they override defaults
1. **Create reusable configs** - if 3+ models need the same settings, make a config file (see the sketch after this list)
1. **Use model-specific configs sparingly** - only for unique requirements
1. **Test before committing** - verify with `prepare_model_coverage_v2.py`

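As a rough sketch of the reuse pattern, a single shared config can be referenced by several models. The file name `shared_batch.yaml`, the model names, and the chosen value below are hypothetical, shown only to illustrate the structure:

```yaml
# configs/shared_batch.yaml (hypothetical shared config)
max_batch_size: 1024

# models.yaml entries reusing it (hypothetical model names)
- name: organization/model-a-7b
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml, shared_batch.yaml]
- name: organization/model-b-8b
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml, shared_batch.yaml]
- name: organization/model-c-7b
  yaml_extra: [dashboard_default.yaml, world_size_2.yaml, shared_batch.yaml]
```
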
## Testing Changes

```bash
# Generate workload from local changes
cd /path/to/autodeploy-dashboard
python3 scripts/prepare_model_coverage_v2.py \
  --source local \
  --local-path /path/to/TensorRT-LLM \
  --output /tmp/test_workload.yaml

# Verify output
cat /tmp/test_workload.yaml
```