TensorRT-LLM Disaggregated Benchmark Framework
A YAML-based testing framework for TensorRT-LLM disaggregated serving performance and accuracy benchmarks.
Overview
This framework provides a simple, maintainable approach to benchmark testing using YAML configuration files. Each test configuration is defined in a separate YAML file, with automatic test discovery and execution through pytest.
Key Features
- YAML Configuration: Each test has its own independent YAML configuration file
- Automatic Test Discovery: Tests are automatically discovered from the config directory structure
- Default Metrics: Built-in default metrics configuration for common test scenarios
- GPU Filtering: Automatically filters tests based on hardware compatibility
- Flexible Override: Override default configurations as needed for special cases
- Test Categories: Support for both performance (perf) and accuracy tests
- Multiple Test Types: Support for disagg (disaggregated) and wideep architectures
Directory Structure
test_configs/
├── disagg/ # Disaggregated serving tests
│ ├── perf/ # Performance tests
│ └── accuracy/ # Accuracy tests (optional)
└── wideep/ # Wide-deep tests
├── perf/
└── accuracy/
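As an illustration, discovery can be as simple as walking this tree and turning every YAML file into a pytest parameter. The sketch below is a minimal, hypothetical version of what the framework's ConfigLoader does; the function and variable names are assumptions, not the actual API.

# Hypothetical sketch of YAML-based test discovery (not the framework's actual loader).
from pathlib import Path

import yaml

def discover_configs(root: str = "test_configs") -> list:
    """Walk test_configs/{test_type}/{category}/*.yaml and load each config."""
    configs = []
    for path in sorted(Path(root).glob("*/*/*.yaml")):
        test_type, category = path.parts[-3], path.parts[-2]  # e.g. "disagg", "perf"
        with open(path) as f:
            cfg = yaml.safe_load(f)
        cfg["_test_id"] = f"{test_type}_{category}_{path.stem}"  # matches the naming convention below
        configs.append(cfg)
    return configs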
YAML Configuration
Minimal Configuration Example
metadata:
model_name: "deepseek-r1-fp4"
precision: "fp4"
supported_gpus: ["GB200"]
slurm:
partition: "<partition>"
account: "<account>"
job_time: "02:00:00"
benchmark:
mode: "e2e"
streaming: true
concurrency_list: "1 2 4 8 16 36"
input_length: 1024
output_length: 1024
hardware:
gpus_per_node: 4
num_ctx_servers: 1
num_gen_servers: 4
environment:
container_mount: "<container_mount>"
container_image: "<container_image>"
model_path: "<model_path>"
worker_config:
gen:
tensor_parallel_size: 8
moe_expert_parallel_size: 8
max_batch_size: 32
max_num_tokens: 32
max_seq_len: 2251
# ... other gen worker configs
ctx:
tensor_parallel_size: 4
moe_expert_parallel_size: 4
max_batch_size: 4
max_num_tokens: 4608
max_seq_len: 2251
# ... other ctx worker configs
Custom Metrics (Optional)
Most tests use default metrics. To customize:
benchmark:
metrics:
log_file: "custom_benchmark.log"
extractor_pattern: "Custom Pattern:\s+([0-9.]+)"
metric_names: ["CUSTOM_METRIC"]
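The extractor_pattern is a regular expression applied line by line to the benchmark log, and metric_names label its captured values. A minimal sketch of how such a pattern could be applied, assuming the log is plain text (the framework's LogParser is the authoritative implementation):

# Minimal sketch: apply an extractor pattern to a benchmark log (illustrative only).
import re

def extract_metrics(log_file: str, pattern: str, metric_names: list) -> dict:
    values = []
    with open(log_file) as f:
        for line in f:
            match = re.search(pattern, line)
            if match:
                values.extend(float(v) for v in match.groups())
    # Keep the last len(metric_names) captured values and pair them with their names.
    return dict(zip(metric_names, values[-len(metric_names):]))

# extract_metrics("custom_benchmark.log", r"Custom Pattern:\s+([0-9.]+)", ["CUSTOM_METRIC"])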
GPU Support
Currently supports OCI GB200 only. The framework is designed to support additional GPU types in the future.
All configurations must specify:
metadata:
supported_gpus: ["GB200"]
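During collection, the hardware filter compares the current GPU type against this list and skips anything incompatible. A hedged sketch, assuming the GPU_TYPE environment variable described under Environment Variables is the source of truth:

# Illustrative GPU compatibility check; the real filtering happens at collection time.
import os

import pytest

def skip_if_unsupported(config: dict) -> None:
    gpu_type = os.environ.get("GPU_TYPE", "GB200")
    supported = config.get("metadata", {}).get("supported_gpus", [])
    if gpu_type not in supported:
        pytest.skip(f"{config['metadata'].get('model_name', 'config')} not supported on {gpu_type}")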
Configuration Validation
The framework validates configurations before execution:
- gen_max_tokens: Must equal gen_max_batch_size * (mtp_size + 1) when MTP is enabled
- streaming: Must be true
- max_seq_len: Both ctx and gen must be > (input_length + output_length)
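A minimal sketch of these checks, using the field names from the example configuration above (the speculative_config field used for the MTP size is an assumption; the framework's ConfigValidator is the authoritative implementation):

# Illustrative validation of the rules above; field names follow the example config.
def validate(config: dict) -> list:
    errors = []
    bench = config["benchmark"]
    gen = config["worker_config"]["gen"]
    ctx = config["worker_config"]["ctx"]
    mtp_size = gen.get("speculative_config", {}).get("num_nextn_predict_layers", 0)  # assumed field name
    if mtp_size and gen["max_num_tokens"] != gen["max_batch_size"] * (mtp_size + 1):
        errors.append("gen max_num_tokens must equal max_batch_size * (mtp_size + 1)")
    if not bench.get("streaming", False):
        errors.append("streaming must be true")
    required_seq_len = bench["input_length"] + bench["output_length"]
    for name, worker in (("ctx", ctx), ("gen", gen)):
        if worker["max_seq_len"] <= required_seq_len:
            errors.append(f"{name} max_seq_len must exceed input_length + output_length")
    return errors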
Running Tests
Run all tests
poetry run pytest --disagg test_disagg.py -s -vv
Run from test list
poetry run pytest --disagg test_disagg.py -s -vv --disagg-test-list=./testlist/disagg.txt
Run specific tests
# Run only performance tests
poetry run pytest --disagg test_disagg.py -s -vv -m perf
# Run only accuracy tests
poetry run pytest --disagg test_disagg.py -s -vv -m accuracy
# Run specific test by ID
poetry run pytest --disagg test_disagg.py -s -vv -k "deepseek-r1-fp4_1k1k"
Batch Job Submission
The framework supports automatic batch job submission to maximize parallelism on SLURM clusters. Instead of submitting jobs one by one, it groups test cases into batches and submits each batch as a whole when its first test runs.
Quick Start
Default batch size (5 jobs per batch):
# Run all tests with default batching
poetry run pytest --disagg test_disagg.py -s -vv
# Run with test list
poetry run pytest --disagg test_disagg.py -s -vv --disagg-test-list=./testlist/all.txt
Custom batch size:
# Set batch size via command line
poetry run pytest --disagg test_disagg.py -s -vv --disagg-batch-size=10
# Set batch size via environment variable
export DISAGG_BATCH_SIZE=20
poetry run pytest --disagg test_disagg.py -s -vv
# Submit all jobs at once (unlimited batch)
poetry run pytest --disagg test_disagg.py -s -vv --disagg-batch-size=0
How Batch Submission Works
Pytest Collection Phase:
- Collects all test cases (e.g., 100 tests)
- BatchManager splits them into batches (e.g., 20 batches of 5)
Pytest Execution Phase:
Test 0 runs:
-> Triggers submission of Batch 0 (jobs 0-4)
-> Waits for job 0 to complete
Tests 1-4 run:
-> Batch 0 already submitted, directly wait for completion
Test 5 runs:
-> Triggers submission of Batch 1 (jobs 5-9)
-> Waits for job 5 to complete
... and so on
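A minimal sketch of this lazy, batch-at-a-time behaviour, assuming a submit_job callable that wraps sbatch (class and method names here are illustrative, not the framework's actual BatchManager API):

# Illustrative lazy batch submission: a batch is submitted only when one of its tests runs.
class BatchSubmitter:
    def __init__(self, test_ids: list, batch_size: int = 5):
        self.batch_size = batch_size or len(test_ids)  # 0 means "submit everything at once"
        self.test_ids = test_ids
        self.job_ids = {}              # test_id -> SLURM job id
        self.submitted_batches = set()

    def ensure_submitted(self, test_id: str, submit_job) -> str:
        batch = self.test_ids.index(test_id) // self.batch_size
        if batch not in self.submitted_batches:
            start = batch * self.batch_size
            for tid in self.test_ids[start:start + self.batch_size]:
                self.job_ids[tid] = submit_job(tid)  # e.g. wraps `sbatch`
            self.submitted_batches.add(batch)
        return self.job_ids[test_id]   # the caller then waits for this job to finish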
Key Benefits
- Parallel Execution: All jobs in a batch run simultaneously on the SLURM cluster
- Reduced Wait Time: Total time ≈ MAX(job time) instead of SUM(job times)
- Automatic Management: No need to manually split test lists
- Lazy Loading: Only submits batches when needed
Configuration Options
Priority: Command line option > Environment variable > Default (5)
Examples:
# Small batch for quick testing
poetry run pytest --disagg test_disagg.py -s -vv --disagg-batch-size=3 \
--disagg-test-list=./testlist/debug.txt
# Large batch for production
poetry run pytest --disagg test_disagg.py -s -vv --disagg-batch-size=50 \
--disagg-test-list=./testlist/all.txt
# Submit all at once
poetry run pytest --disagg test_disagg.py -s -vv --disagg-batch-size=0
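The priority chain above can be resolved in a few lines. A hypothetical sketch of how the --disagg-batch-size option and the DISAGG_BATCH_SIZE variable might be combined inside conftest.py:

# Illustrative resolution of the batch size: CLI option > DISAGG_BATCH_SIZE > default (5).
import os

def resolve_batch_size(pytest_config) -> int:
    cli_value = pytest_config.getoption("--disagg-batch-size", default=None)
    if cli_value is not None:
        return int(cli_value)
    return int(os.environ.get("DISAGG_BATCH_SIZE", 5))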
Timeout Configuration
The default timeout for waiting for job completion is 10 hours (36000 seconds), which accounts for:
- SLURM queue wait time
- Job execution time
- Buffer for delays
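A minimal sketch of a wait loop with this timeout, assuming squeue is available where pytest runs (the framework's JobManager performs the real monitoring):

# Illustrative wait-with-timeout loop; a job that leaves the queue is treated as finished.
import subprocess
import time

def wait_for_job(job_id: str, timeout_s: int = 36000, poll_s: int = 60) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        out = subprocess.run(["squeue", "-h", "-j", job_id],
                             capture_output=True, text=True).stdout
        if not out.strip():  # job no longer queued or running
            return
        time.sleep(poll_s)
    raise TimeoutError(f"Job {job_id} did not finish within {timeout_s} seconds")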
Performance Comparison
Before (Sequential Submission):
Case 1: submit + wait (1.5h) = 1.5h
Case 2: submit + wait (1.5h) = 1.5h
Case 3: submit + wait (1.5h) = 1.5h
...
Total: 50 × 1.5h = 75 hours
After (Batch Submission, batch_size=50):
Batch 0 (50 jobs): submitted in parallel
Case 1: wait (1.5h)
Cases 2-50: wait (0s, already done)
Total: ~1.5 hours
Speedup: 50x
Troubleshooting
Check BatchManager initialization:
======================================================================
Batch Manager Initialized
Batch size: 5 jobs per batch
======================================================================
Total test configs: 20
Total batches: 4
Monitor batch submission:
======================================================================
Submitting Batch 0
Range: [0:5] (5 jobs)
======================================================================
[ 1/5] Job 1234 <- test_config_id_1
[ 2/5] Job 1235 <- test_config_id_2
...
If jobs timeout frequently:
- Check SLURM queue status
- Consider reducing batch size to avoid resource contention
- Verify that timeout (36000s) is sufficient for your workload
Test Naming Convention
Tests are automatically named using the format:
{test_type}_{category}_{config_filename}
Example: disagg_perf_deepseek-r1-fp4_1k1k_ctx1_gen4_tep8_bs32_eplb0_mtp0_ccb-NIXL
File Naming Convention
Configuration files should follow this format:
{model}_{benchmark_type}_{config_details}.yaml
Examples:
- deepseek-r1-fp4_1k1k_ctx1_gen1_dep32_bs32_eplb0_mtp0_ccb-NIXL.yaml
- deepseek-r1-fp4_8k1k_ctx1_gen3_tep8_bs32_eplb0_mtp0_ccb-UCX.yaml
Where:
- 1k1k, 8k1k: Input/output lengths (1024/1024, 8192/1024)
- ctx1_gen1: Context and generation server counts
- dep32 or tep8: Data parallel (dep) or tensor parallel (tep) configuration
- bs32: Batch size
- eplb0: Expert parallel load balancing slots
- mtp0: Multi-token prediction layers
- ccb-NIXL or ccb-UCX: Communication backend
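As an illustration, these components can be recovered mechanically from a file name. A hypothetical helper, not part of the framework:

# Illustrative parsing of a config file name into its components (hypothetical helper).
import re

def parse_config_name(filename: str) -> dict:
    stem = filename.removesuffix(".yaml")
    model, lengths, rest = stem.split("_", 2)
    fields = dict(re.findall(r"(ctx|gen|dep|tep|bs|eplb|mtp)(\d+)", rest))
    backend = re.search(r"ccb-(\w+)", rest)
    return {"model": model, "lengths": lengths, **fields,
            "comm_backend": backend.group(1) if backend else ""}

# parse_config_name("deepseek-r1-fp4_1k1k_ctx1_gen1_dep32_bs32_eplb0_mtp0_ccb-NIXL.yaml")
# -> {'model': 'deepseek-r1-fp4', 'lengths': '1k1k', 'ctx': '1', 'gen': '1', 'dep': '32', ...}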
Key Configuration Fields
Metadata
- model_name: Model identifier
- precision: Model precision (fp4, fp8, etc.)
- supported_gpus: List of compatible GPU types
Benchmark
- mode: Benchmark mode (e2e, gen_only, ctx_only)
- streaming: Enable streaming (must be true)
- input_length, output_length: Sequence lengths
- concurrency_list: Concurrency levels to test
Worker Config
- tensor_parallel_size: Tensor parallelism degree
- moe_expert_parallel_size: MoE expert parallelism
- max_batch_size: Maximum batch size
- max_num_tokens: Maximum tokens per batch
- max_seq_len: Maximum sequence length
- speculative_config: Multi-token prediction settings (optional)
Test Output
Test results are saved to:
- Performance metrics: {OUTPUT_PATH}/perf_script_test_results.csv
- Test logs: {OUTPUT_PATH}/disagg_benchmark_{timestamp}.log
Environment Variables
- GPU_TYPE: Current GPU type (default: GB200)
- OUTPUT_PATH: Directory for test results and logs
- WORK_DIR: Working directory for benchmark execution
- DISAGG_BATCH_SIZE: Default batch size for job submission (default: 5)
- DEBUG_MODE: Enable debug mode (set to "1" to skip job submission)
- DEBUG_JOB_ID: Job ID to use in debug mode
Debug Mode
For local testing without SLURM submission:
export DEBUG_MODE=1
export DEBUG_JOB_ID=12345
poetry run pytest --disagg test_disagg.py -s -vv
Architecture
The framework consists of:
- ConfigLoader: Scans and loads YAML configurations
- ConfigValidator: Validates configuration correctness
- BatchManager: Manages batch job submission for parallel execution
- JobManager: Handles SLURM job submission and monitoring
- LogParser: Extracts metrics from benchmark logs
- TestCaseTracker: Tracks test execution timing
- ResultSaver: Saves results to CSV
Benefits
- Simple: YAML-based configuration, no code changes needed
- Maintainable: Each test is a separate file
- Flexible: Override defaults only when needed
- Scalable: Easy to add new tests and models
- Reliable: Automatic validation before execution
- Traceable: Comprehensive logging and result tracking
Adding New Tests
- Create a new YAML file in test_configs/{test_type}/{category}/
- Configure the test parameters
- Run pytest - the test will be automatically discovered
No code changes required!
For detailed configuration options and advanced usage, refer to the inline comments in the YAML configuration files.