# TensorRT-LLM Disaggregated Benchmark Framework

A YAML-based testing framework for TensorRT-LLM disaggregated serving performance and accuracy benchmarks.

## Overview

This framework provides a simple, maintainable approach to benchmark testing using YAML configuration files. Each test configuration is defined in a separate YAML file, with automatic test discovery and execution through pytest.

## Key Features

- **YAML Configuration**: Each test has its own independent YAML configuration file
- **Automatic Test Discovery**: Tests are automatically discovered from the config directory structure
- **Default Metrics**: Built-in default metrics configuration for common test scenarios
- **GPU Filtering**: Automatically filters tests based on hardware compatibility
- **Flexible Override**: Override default configurations as needed for special cases
- **Test Categories**: Support for both performance (perf) and accuracy tests
- **Multiple Test Types**: Support for disagg (disaggregated) and wideep architectures

## Directory Structure

```
test_configs/
├── disagg/          # Disaggregated serving tests
│   ├── perf/        # Performance tests
│   └── accuracy/    # Accuracy tests (optional)
└── wideep/          # Wide-deep tests
    ├── perf/
    └── accuracy/
```

## YAML Configuration

### Minimal Configuration Example

```yaml
metadata:
  model_name: "deepseek-r1-fp4"
  precision: "fp4"
  supported_gpus: ["GB200"]

slurm:
  partition: "<partition>"
  account: "<account>"
  job_time: "02:00:00"

benchmark:
  mode: "e2e"
  streaming: true
  concurrency_list: "1 2 4 8 16 36"
  input_length: 1024
  output_length: 1024

hardware:
  gpus_per_node: 4
  num_ctx_servers: 1
  num_gen_servers: 4

environment:
  container_mount: "<container_mount>"
  container_image: "<container_image>"
  model_path: "<model_path>"

worker_config:
  gen:
    tensor_parallel_size: 8
    moe_expert_parallel_size: 8
    max_batch_size: 32
    max_num_tokens: 32
    max_seq_len: 2251
    # ... other gen worker configs

  ctx:
    tensor_parallel_size: 4
    moe_expert_parallel_size: 4
    max_batch_size: 4
    max_num_tokens: 4608
    max_seq_len: 2251
    # ... other ctx worker configs
```


### Custom Metrics (Optional)

Most tests use default metrics. To customize:

```yaml
benchmark:
  metrics:
    log_file: "custom_benchmark.log"
    extractor_pattern: 'Custom Pattern:\s+([0-9.]+)'
    metric_names: ["CUSTOM_METRIC"]
```

Note the single quotes around `extractor_pattern`: in a double-quoted YAML scalar, `\s` is an invalid escape sequence and strict parsers will reject it.
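
For reference, `extractor_pattern` is a regular expression whose capture groups map onto `metric_names` in order. A minimal sketch of how such a pattern could be applied to a benchmark log (the function below is illustrative, not the framework's actual API):

```python
import re

# Mirrors the custom metrics block above; names here are illustrative only.
PATTERN = re.compile(r"Custom Pattern:\s+([0-9.]+)")
METRIC_NAMES = ["CUSTOM_METRIC"]

def extract_metrics(log_path: str) -> dict[str, list[float]]:
    """Collect every value the pattern captures, keyed by metric name."""
    results: dict[str, list[float]] = {name: [] for name in METRIC_NAMES}
    with open(log_path) as log:
        for line in log:
            match = PATTERN.search(line)
            if match:
                # One capture group per metric name, in order.
                for name, value in zip(METRIC_NAMES, match.groups()):
                    results[name].append(float(value))
    return results
```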

## GPU Support

Currently supports **OCI GB200** only. The framework is designed to support additional GPU types in the future.

All configurations must specify:

```yaml
metadata:
  supported_gpus: ["GB200"]
```

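Filtering works by comparing each config's `supported_gpus` list against the current `GPU_TYPE` environment variable and skipping incompatible tests. A hedged sketch of that check (the helper name is hypothetical; the framework's real filtering hook may differ):

```python
import os

import pytest

def skip_if_gpu_unsupported(config: dict) -> None:
    """Skip a test whose config does not list the current GPU type.

    Hypothetical helper: shown only to illustrate the filtering rule.
    """
    gpu_type = os.environ.get("GPU_TYPE", "GB200")  # default per the docs
    supported = config.get("metadata", {}).get("supported_gpus", [])
    if gpu_type not in supported:
        pytest.skip(f"{gpu_type} not in supported_gpus {supported}")
```
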
## Configuration Validation

The framework validates configurations before execution (a sketch of these checks follows the list):

1. **gen_max_tokens**: Must equal `gen_max_batch_size * (mtp_size + 1)` when MTP is enabled
2. **streaming**: Must be `true`
3. **max_seq_len**: For both ctx and gen workers, must be greater than `input_length + output_length`

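A minimal sketch of those three rules in Python, assuming the parsed YAML is available as a plain dict (function and key names are assumptions, not the framework's actual API):

```python
# Illustrative only: validates a parsed config dict against the three rules.
def validate_config(cfg: dict) -> None:
    bench = cfg["benchmark"]
    gen = cfg["worker_config"]["gen"]
    ctx = cfg["worker_config"]["ctx"]

    # Rule 1: with MTP enabled, each sequence in a gen batch needs room for
    # one token per speculative layer plus the regular token.
    mtp_size = gen.get("speculative_config", {}).get("mtp_size", 0)  # key name assumed
    if mtp_size > 0:
        expected = gen["max_batch_size"] * (mtp_size + 1)
        if gen["max_num_tokens"] != expected:
            raise ValueError(f"gen max_num_tokens must be {expected}")

    # Rule 2: streaming is mandatory.
    if bench["streaming"] is not True:
        raise ValueError("streaming must be true")

    # Rule 3: both workers must fit the full request.
    total = bench["input_length"] + bench["output_length"]
    for role, worker in (("ctx", ctx), ("gen", gen)):
        if worker["max_seq_len"] <= total:
            raise ValueError(f"{role} max_seq_len must exceed {total}")
```
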
## Running Tests

### Run all tests

```bash
poetry run pytest --disagg test_disagg.py -s -vv
```

### Run from test list

```bash
poetry run pytest --disagg test_disagg.py -s -vv --disagg-test-list=./testlist/disagg.txt
```

### Run specific tests

```bash
# Run only performance tests
poetry run pytest --disagg test_disagg.py -s -vv -m perf

# Run only accuracy tests
poetry run pytest --disagg test_disagg.py -s -vv -m accuracy

# Run a specific test by ID
poetry run pytest --disagg test_disagg.py -s -vv -k "deepseek-r1-fp4_1k1k"
```

## Test Naming Convention

Tests are automatically named using the format:

```
{test_type}_{category}_{config_filename}
```

Example: `disagg_perf_deepseek-r1-fp4_1k1k_ctx1_gen4_tep8_bs32_eplb0_mtp0_ccb-NIXL`

## File Naming Convention

Configuration files should follow this format:

```
{model}_{benchmark_type}_{config_details}.yaml
```

Examples:

- `deepseek-r1-fp4_1k1k_ctx1_gen1_dep32_bs32_eplb0_mtp0_ccb-NIXL.yaml`
- `deepseek-r1-fp4_8k1k_ctx1_gen3_tep8_bs32_eplb0_mtp0_ccb-UCX.yaml`

Where (the sketch after this list shows one way to parse these fields):

- `1k1k`, `8k1k`: Input/output lengths (1024/1024, 8192/1024)
- `ctx1_gen1`: Context and generation server counts
- `dep32` or `tep8`: Data parallel (dep) or tensor parallel (tep) configuration
- `bs32`: Batch size
- `eplb0`: Expert parallel load balancing slots
- `mtp0`: Multi-token prediction layers
- `ccb-NIXL` or `ccb-UCX`: Communication backend

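A minimal, illustrative parser for this convention (not part of the framework; shown only to make the field layout concrete):

```python
import re

# Matches IDs like "deepseek-r1-fp4_1k1k_ctx1_gen1_dep32_bs32_eplb0_mtp0_ccb-NIXL".
NAME_RE = re.compile(
    r"(?P<model>.+)_(?P<isl>\d+k)(?P<osl>\d+k)"
    r"_ctx(?P<ctx>\d+)_gen(?P<gen>\d+)"
    r"_(?P<parallelism>dep|tep)(?P<degree>\d+)"
    r"_bs(?P<bs>\d+)_eplb(?P<eplb>\d+)_mtp(?P<mtp>\d+)"
    r"_ccb-(?P<backend>\w+)"
)

fields = NAME_RE.match(
    "deepseek-r1-fp4_1k1k_ctx1_gen1_dep32_bs32_eplb0_mtp0_ccb-NIXL"
).groupdict()
# fields["model"] == "deepseek-r1-fp4", fields["backend"] == "NIXL", ...
```
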
## Key Configuration Fields

### Metadata

- `model_name`: Model identifier
- `precision`: Model precision (fp4, fp8, etc.)
- `supported_gpus`: List of compatible GPU types

### Benchmark

- `mode`: Benchmark mode (e2e, gen_only, ctx_only)
- `streaming`: Enable streaming (must be `true`)
- `input_length`, `output_length`: Sequence lengths
- `concurrency_list`: Concurrency levels to test

### Worker Config

- `tensor_parallel_size`: Tensor parallelism degree
- `moe_expert_parallel_size`: MoE expert parallelism degree
- `max_batch_size`: Maximum batch size
- `max_num_tokens`: Maximum tokens per batch
- `max_seq_len`: Maximum sequence length
- `speculative_config`: Multi-token prediction settings (optional)

## Test Output

Test results are saved to:

- Performance metrics: `{OUTPUT_PATH}/perf_script_test_results.csv`
- Test logs: `{OUTPUT_PATH}/disagg_benchmark_{timestamp}.log`

## Environment Variables

- `GPU_TYPE`: Current GPU type (default: `GB200`)
- `OUTPUT_PATH`: Directory for test results and logs
- `WORK_DIR`: Working directory for benchmark execution
- `DEBUG_MODE`: Enable debug mode (set to `"1"` to skip job submission)
- `DEBUG_JOB_ID`: Job ID to use in debug mode

## Debug Mode

For local testing without SLURM submission:

```bash
export DEBUG_MODE=1
export DEBUG_JOB_ID=12345
poetry run pytest --disagg test_disagg.py -s -vv
```

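In other words, debug mode replaces job submission with a pre-existing job ID whose logs are parsed as usual. A hedged sketch of that contract (names are illustrative; the framework's JobManager may implement this differently):

```python
import os

def submit_or_reuse(submit_fn, config: dict) -> str:
    """Return a SLURM job ID, skipping submission in debug mode.

    Illustrative only: shows the behavior implied by DEBUG_MODE/DEBUG_JOB_ID.
    """
    if os.environ.get("DEBUG_MODE") == "1":
        # Reuse an existing job's output instead of submitting a new run.
        return os.environ["DEBUG_JOB_ID"]
    return submit_fn(config)
```
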
## Architecture

The framework consists of the following components (a sketch of how they compose follows the list):

1. **ConfigLoader**: Scans and loads YAML configurations
2. **ConfigValidator**: Validates configuration correctness
3. **JobManager**: Handles SLURM job submission and monitoring
4. **LogParser**: Extracts metrics from benchmark logs
5. **TestCaseTracker**: Tracks test execution timing
6. **ResultSaver**: Saves results to CSV

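One way these components might fit together for a single test, written with the collaborators passed in so the sketch stays self-contained (method names and signatures are assumptions, not the framework's actual API):

```python
from typing import Any

def run_benchmark_test(
    config_path: str,
    loader: Any,      # ConfigLoader
    validator: Any,   # ConfigValidator
    jobs: Any,        # JobManager
    parser: Any,      # LogParser
    tracker: Any,     # TestCaseTracker
    saver: Any,       # ResultSaver
) -> None:
    """Illustrative composition of the six components; real flow may differ."""
    config = loader.load(config_path)   # 1. read the YAML file
    validator.validate(config)          # 2. fail fast on a bad config
    tracker.start(config_path)          # 5. time the execution
    job = jobs.submit(config)           # 3. sbatch, then poll to completion
    metrics = parser.extract(job)       # 4. regex metrics out of the log
    tracker.stop(config_path)
    saver.save(metrics)                 # 6. append a row to the results CSV
```
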
## Benefits

- **Simple**: YAML-based configuration, no code changes needed
- **Maintainable**: Each test is a separate file
- **Flexible**: Override defaults only when needed
- **Scalable**: Easy to add new tests and models
- **Reliable**: Automatic validation before execution
- **Traceable**: Comprehensive logging and result tracking

## Adding New Tests

1. Create a new YAML file in `test_configs/{test_type}/{category}/`
2. Configure the test parameters
3. Run pytest; the test will be discovered automatically

No code changes required!

---
For detailed configuration options and advanced usage, refer to the inline comments in the YAML configuration files.