TensorRT-LLM Disaggregated Benchmark Framework
A YAML-based testing framework for TensorRT-LLM disaggregated serving performance and accuracy benchmarks.
Overview
This framework provides a simple, maintainable approach to benchmark testing using YAML configuration files. Each test configuration is defined in a separate YAML file, with automatic test discovery and execution through pytest.
Key Features
- YAML Configuration: Each test has its own independent YAML configuration file
- Automatic Test Discovery: Tests are automatically discovered from the config directory structure
- Default Metrics: Built-in default metrics configuration for common test scenarios
- GPU Filtering: Automatically filters tests based on hardware compatibility
- Flexible Override: Override default configurations as needed for special cases
- Test Categories: Support for both performance (perf) and accuracy tests
- Multiple Test Types: Support for disagg (disaggregated) and wideep architectures
Directory Structure
test_configs/
├── disagg/ # Disaggregated serving tests
│ ├── perf/ # Performance tests
│ └── accuracy/ # Accuracy tests (optional)
└── wideep/ # Wide-deep tests
├── perf/
└── accuracy/
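As an illustration, discovery can be as simple as walking this tree and turning every YAML file into a pytest parameter. The sketch below is a minimal, hypothetical version of what the framework's ConfigLoader does; the function and variable names are assumptions, not the actual API.

# Hypothetical sketch of YAML-based test discovery (not the framework's actual loader).
from pathlib import Path

import yaml

def discover_configs(root: str = "test_configs") -> list:
    """Walk test_configs/{test_type}/{category}/*.yaml and load each config."""
    configs = []
    for path in sorted(Path(root).glob("*/*/*.yaml")):
        test_type, category = path.parts[-3], path.parts[-2]  # e.g. "disagg", "perf"
        with open(path) as f:
            cfg = yaml.safe_load(f)
        cfg["_test_id"] = f"{test_type}_{category}_{path.stem}"  # matches the naming convention below
        configs.append(cfg)
    return configs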
YAML Configuration
Minimal Configuration Example
metadata:
model_name: "deepseek-r1-fp4"
precision: "fp4"
supported_gpus: ["GB200"]
slurm:
partition: "<partition>"
account: "<account>"
job_time: "02:00:00"
benchmark:
mode: "e2e"
streaming: true
concurrency_list: "1 2 4 8 16 36"
input_length: 1024
output_length: 1024
hardware:
gpus_per_node: 4
num_ctx_servers: 1
num_gen_servers: 4
environment:
container_mount: "<container_mount>"
container_image: "<container_image>"
model_path: "<model_path>"
worker_config:
gen:
tensor_parallel_size: 8
moe_expert_parallel_size: 8
max_batch_size: 32
max_num_tokens: 32
max_seq_len: 2251
# ... other gen worker configs
ctx:
tensor_parallel_size: 4
moe_expert_parallel_size: 4
max_batch_size: 4
max_num_tokens: 4608
max_seq_len: 2251
# ... other ctx worker configs
Custom Metrics (Optional)
Most tests use default metrics. To customize:
benchmark:
metrics:
log_file: "custom_benchmark.log"
extractor_pattern: "Custom Pattern:\s+([0-9.]+)"
metric_names: ["CUSTOM_METRIC"]
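The extractor_pattern is a regular expression applied line by line to the benchmark log, and metric_names label its captured values. A minimal sketch of how such a pattern could be applied, assuming the log is plain text (the framework's LogParser is the authoritative implementation):

# Minimal sketch: apply an extractor pattern to a benchmark log (illustrative only).
import re

def extract_metrics(log_file: str, pattern: str, metric_names: list) -> dict:
    values = []
    with open(log_file) as f:
        for line in f:
            match = re.search(pattern, line)
            if match:
                values.extend(float(v) for v in match.groups())
    # Keep the last len(metric_names) captured values and pair them with their names.
    return dict(zip(metric_names, values[-len(metric_names):]))

# extract_metrics("custom_benchmark.log", r"Custom Pattern:\s+([0-9.]+)", ["CUSTOM_METRIC"])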
GPU Support
Currently supports OCI GB200 only. The framework is designed to support additional GPU types in the future.
All configurations must specify:
metadata:
supported_gpus: ["GB200"]
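During collection, the hardware filter compares the current GPU type against this list and skips anything incompatible. A hedged sketch, assuming the GPU_TYPE environment variable described under Environment Variables is the source of truth:

# Illustrative GPU compatibility check; the real filtering happens at collection time.
import os

import pytest

def skip_if_unsupported(config: dict) -> None:
    gpu_type = os.environ.get("GPU_TYPE", "GB200")
    supported = config.get("metadata", {}).get("supported_gpus", [])
    if gpu_type not in supported:
        pytest.skip(f"{config['metadata'].get('model_name', 'config')} not supported on {gpu_type}")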
Configuration Validation
The framework validates configurations before execution:
- gen_max_tokens: Must equal gen_max_batch_size * (mtp_size + 1) when MTP is enabled
- streaming: Must be true
- max_seq_len: Both ctx and gen must be > (input_length + output_length)
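A minimal sketch of these checks, using the field names from the example configuration above (the speculative_config field used for the MTP size is an assumption; the framework's ConfigValidator is the authoritative implementation):

# Illustrative validation of the rules above; field names follow the example config.
def validate(config: dict) -> list:
    errors = []
    bench = config["benchmark"]
    gen = config["worker_config"]["gen"]
    ctx = config["worker_config"]["ctx"]
    mtp_size = gen.get("speculative_config", {}).get("num_nextn_predict_layers", 0)  # assumed field name
    if mtp_size and gen["max_num_tokens"] != gen["max_batch_size"] * (mtp_size + 1):
        errors.append("gen max_num_tokens must equal max_batch_size * (mtp_size + 1)")
    if not bench.get("streaming", False):
        errors.append("streaming must be true")
    required_seq_len = bench["input_length"] + bench["output_length"]
    for name, worker in (("ctx", ctx), ("gen", gen)):
        if worker["max_seq_len"] <= required_seq_len:
            errors.append(f"{name} max_seq_len must exceed input_length + output_length")
    return errors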
Running Tests
Run all tests
poetry run pytest --disagg test_disagg.py -s -vv
Run from test list
poetry run pytest --disagg test_disagg.py -s -vv --disagg-test-list=./testlist/disagg.txt
Run specific tests
# Run only performance tests
poetry run pytest --disagg test_disagg.py -s -vv -m perf
# Run only accuracy tests
poetry run pytest --disagg test_disagg.py -s -vv -m accuracy
# Run specific test by ID
poetry run pytest --disagg test_disagg.py -s -vv -k "deepseek-r1-fp4_1k1k"
Batch Job Submission
The framework supports automatic batch job submission to maximize parallelism on SLURM clusters. Instead of submitting jobs one by one, it groups test cases into batches and submits each batch as a whole when its first test runs.
Quick Start
Default batch size (5 jobs per batch):
# Run all tests with default batching
poetry run pytest --disagg test_disagg.py -s -vv
# Run with test list
poetry run pytest --disagg test_disagg.py -s -vv --disagg-test-list=./testlist/all.txt
Custom batch size:
# Set batch size via command line
poetry run pytest --disagg test_disagg.py -s -vv --disagg-batch-size=10
# Set batch size via environment variable
export DISAGG_BATCH_SIZE=20
poetry run pytest --disagg test_disagg.py -s -vv
# Submit all jobs at once (unlimited batch)
poetry run pytest --disagg test_disagg.py -s -vv --disagg-batch-size=0
How Batch Submission Works
Pytest Collection Phase:
- Collects all test cases (e.g., 100 tests)
- BatchManager splits them into batches (e.g., 20 batches of 5)
Pytest Execution Phase:
Test 0 runs:
-> Triggers submission of Batch 0 (jobs 0-4)
-> Waits for job 0 to complete
Tests 1-4 run:
-> Batch 0 already submitted, directly wait for completion
Test 5 runs:
-> Triggers submission of Batch 1 (jobs 5-9)
-> Waits for job 5 to complete
... and so on
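A minimal sketch of this lazy, batch-at-a-time behaviour, assuming a submit_job callable that wraps sbatch (class and method names here are illustrative, not the framework's actual BatchManager API):

# Illustrative lazy batch submission: a batch is submitted only when one of its tests runs.
class BatchSubmitter:
    def __init__(self, test_ids: list, batch_size: int = 5):
        self.batch_size = batch_size or len(test_ids)  # 0 means "submit everything at once"
        self.test_ids = test_ids
        self.job_ids = {}              # test_id -> SLURM job id
        self.submitted_batches = set()

    def ensure_submitted(self, test_id: str, submit_job) -> str:
        batch = self.test_ids.index(test_id) // self.batch_size
        if batch not in self.submitted_batches:
            start = batch * self.batch_size
            for tid in self.test_ids[start:start + self.batch_size]:
                self.job_ids[tid] = submit_job(tid)  # e.g. wraps `sbatch`
            self.submitted_batches.add(batch)
        return self.job_ids[test_id]   # the caller then waits for this job to finish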
Key Benefits
- Parallel Execution: All jobs in a batch run simultaneously on the SLURM cluster
- Reduced Wait Time: Total time ≈ MAX(job time) instead of SUM(job times)
- Automatic Management: No need to manually split test lists
- Lazy Loading: Only submits batches when needed
Configuration Options
Priority: Command line option > Environment variable > Default (5)
Examples:
# Small batch for quick testing
poetry run pytest --disagg test_disagg.py -s -vv --disagg-batch-size=3 \
--disagg-test-list=./testlist/debug.txt
# Large batch for production
poetry run pytest --disagg test_disagg.py -s -vv --disagg-batch-size=50 \
--disagg-test-list=./testlist/all.txt
# Submit all at once
poetry run pytest --disagg test_disagg.py -s -vv --disagg-batch-size=0
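The priority chain above can be resolved in a few lines. A hypothetical sketch of how the --disagg-batch-size option and the DISAGG_BATCH_SIZE variable might be combined inside conftest.py:

# Illustrative resolution of the batch size: CLI option > DISAGG_BATCH_SIZE > default (5).
import os

def resolve_batch_size(pytest_config) -> int:
    cli_value = pytest_config.getoption("--disagg-batch-size", default=None)
    if cli_value is not None:
        return int(cli_value)
    return int(os.environ.get("DISAGG_BATCH_SIZE", 5))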
Timeout Configuration
The default timeout for waiting for job completion is 10 hours (36000 seconds), which accounts for:
- SLURM queue wait time
- Job execution time
- Buffer for delays
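A minimal sketch of a wait loop with this timeout, assuming squeue is available where pytest runs (the framework's JobManager performs the real monitoring):

# Illustrative wait-with-timeout loop; a job that leaves the queue is treated as finished.
import subprocess
import time

def wait_for_job(job_id: str, timeout_s: int = 36000, poll_s: int = 60) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        out = subprocess.run(["squeue", "-h", "-j", job_id],
                             capture_output=True, text=True).stdout
        if not out.strip():  # job no longer queued or running
            return
        time.sleep(poll_s)
    raise TimeoutError(f"Job {job_id} did not finish within {timeout_s} seconds")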
Performance Comparison
Before (Sequential Submission):
Case 1: submit + wait (1.5h) = 1.5h
Case 2: submit + wait (1.5h) = 1.5h
Case 3: submit + wait (1.5h) = 1.5h
...
Total: 50 × 1.5h = 75 hours
After (Batch Submission, batch_size=50):
Batch 0 (50 jobs): submitted in parallel
Case 1: wait (1.5h)
Cases 2-50: wait (0s, already done)
Total: ~1.5 hours
Speedup: 50x
Troubleshooting
Check BatchManager initialization:
======================================================================
Batch Manager Initialized
Batch size: 5 jobs per batch
======================================================================
Total test configs: 20
Total batches: 4
Monitor batch submission:
======================================================================
Submitting Batch 0
Range: [0:5] (5 jobs)
======================================================================
[ 1/5] Job 1234 <- test_config_id_1
[ 2/5] Job 1235 <- test_config_id_2
...
If jobs timeout frequently:
- Check SLURM queue status
- Consider reducing batch size to avoid resource contention
- Verify that timeout (36000s) is sufficient for your workload
Test Naming Convention
Tests are automatically named using the format:
{test_type}_{category}_{config_filename}
Example: disagg_perf_deepseek-r1-fp4_1k1k_ctx1_gen4_tep8_bs32_eplb0_mtp0_ccb-NIXL
File Naming Convention
Configuration files should follow this format:
{model}_{benchmark_type}_{config_details}.yaml
Examples:
- deepseek-r1-fp4_1k1k_ctx1_gen1_dep32_bs32_eplb0_mtp0_ccb-NIXL.yaml
- deepseek-r1-fp4_8k1k_ctx1_gen3_tep8_bs32_eplb0_mtp0_ccb-UCX.yaml
Where:
- 1k1k, 8k1k: Input/output lengths (1024/1024, 8192/1024)
- ctx1_gen1: Context and generation server counts
- dep32 or tep8: Data parallel (dep) or tensor parallel (tep) configuration
- bs32: Batch size
- eplb0: Expert parallel load balancing slots
- mtp0: Multi-token prediction layers
- ccb-NIXL or ccb-UCX: Communication backend
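As an illustration, these components can be recovered mechanically from a file name. A hypothetical helper, not part of the framework:

# Illustrative parsing of a config file name into its components (hypothetical helper).
import re

def parse_config_name(filename: str) -> dict:
    stem = filename.removesuffix(".yaml")
    model, lengths, rest = stem.split("_", 2)
    fields = dict(re.findall(r"(ctx|gen|dep|tep|bs|eplb|mtp)(\d+)", rest))
    backend = re.search(r"ccb-(\w+)", rest)
    return {"model": model, "lengths": lengths, **fields,
            "comm_backend": backend.group(1) if backend else ""}

# parse_config_name("deepseek-r1-fp4_1k1k_ctx1_gen1_dep32_bs32_eplb0_mtp0_ccb-NIXL.yaml")
# -> {'model': 'deepseek-r1-fp4', 'lengths': '1k1k', 'ctx': '1', 'gen': '1', 'dep': '32', ...}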
Key Configuration Fields
Metadata
- model_name: Model identifier
- precision: Model precision (fp4, fp8, etc.)
- supported_gpus: List of compatible GPU types
Benchmark
- mode: Benchmark mode (e2e, gen_only, ctx_only)
- streaming: Enable streaming (must be true)
- input_length, output_length: Sequence lengths
- concurrency_list: Concurrency levels to test
Worker Config
- tensor_parallel_size: Tensor parallelism degree
- moe_expert_parallel_size: MoE expert parallelism
- max_batch_size: Maximum batch size
- max_num_tokens: Maximum tokens per batch
- max_seq_len: Maximum sequence length
- speculative_config: Multi-token prediction settings (optional)
Test Output
Test results are saved to:
- Performance metrics: {OUTPUT_PATH}/perf_script_test_results.csv
- Test logs: {OUTPUT_PATH}/disagg_benchmark_{timestamp}.log
Environment Variables
- GPU_TYPE: Current GPU type (default: GB200)
- OUTPUT_PATH: Directory for test results and logs
- WORK_DIR: Working directory for benchmark execution
- DISAGG_BATCH_SIZE: Default batch size for job submission (default: 5)
- DEBUG_MODE: Enable debug mode (set to "1" to skip job submission)
- DEBUG_JOB_ID: Job ID to use in debug mode
Debug Mode
For local testing without SLURM submission:
export DEBUG_MODE=1
export DEBUG_JOB_ID=12345
poetry run pytest --disagg test_disagg.py -s -vv
Architecture
The framework consists of:
- ConfigLoader: Scans and loads YAML configurations
- ConfigValidator: Validates configuration correctness
- BatchManager: Manages batch job submission for parallel execution
- JobManager: Handles SLURM job submission and monitoring
- LogParser: Extracts metrics from benchmark logs
- TestCaseTracker: Tracks test execution timing
- ResultSaver: Saves results to CSV
Benefits
- Simple: YAML-based configuration, no code changes needed
- Maintainable: Each test is a separate file
- Flexible: Override defaults only when needed
- Scalable: Easy to add new tests and models
- Reliable: Automatic validation before execution
- Traceable: Comprehensive logging and result tracking
Adding New Tests
- Create a new YAML file in test_configs/{test_type}/{category}/
- Configure the test parameters
- Run pytest - the test will be automatically discovered
No code changes required!
For detailed configuration options and advanced usage, refer to the inline comments in the YAML configuration files.