TensorRT-LLM Perf Sanity Test System

Performance sanity testing scripts for TensorRT-LLM with configuration-driven test cases supporting single-node, multi-node aggregated, and multi-node disaggregated architectures.

Overview

  • Run performance sanity benchmarks across multiple model configurations
  • Support three deployment architectures: single-node, multi-node aggregated, and multi-node disaggregated
  • Manage test cases through YAML configuration files
  • Automate resource calculation and job submission via SLURM

Configuration File Types

There are three types of YAML configuration files for different deployment architectures:

1. Single-Node Aggregated Test Configuration

File Example: l0_dgx_b200.yaml

Use Case: Performance tests that run on one server with multiple GPUs.

Structure:

server_configs:
  - name: "r1_fp8_dep8_mtp1_1k1k"
    model_name: "deepseek_r1_0528_fp8"
    gpus: 8
    tensor_parallel_size: 8
    moe_expert_parallel_size: 8
    pipeline_parallel_size: 1
    max_batch_size: 512
    max_num_tokens: 8192
    attention_backend: "TRTLLM"
    enable_attention_dp: true
    attention_dp_config:
      batching_wait_iters: 0
      enable_balance: true
      timeout_iters: 60
    moe_config:
      backend: 'DEEPGEMM'
    cuda_graph_config:
      enable_padding: true
      max_batch_size: 512
    kv_cache_config:
      dtype: 'fp8'
      enable_block_reuse: false
      free_gpu_memory_fraction: 0.8
    speculative_config:
      decoding_type: 'MTP'
      num_nextn_predict_layers: 1
    client_configs:
      - name: "con4096_iter10_1k1k"
        concurrency: 4096
        iterations: 10
        isl: 1024
        osl: 1024
        random_range_ratio: 0.8
        backend: "openai"
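
The nesting above (each entry in server_configs carries its own client_configs list) suggests that one benchmark run corresponds to one server/client pair. The following is a minimal Python sketch of that reading, not the logic of run_benchmark_serve.py. The config is a trimmed copy of the example written as a dict to avoid a YAML-parser dependency, and the helper names iter_test_cases and check_gpu_count are hypothetical. The check that tensor_parallel_size times pipeline_parallel_size equals gpus is an assumption that happens to hold for this example.

```python
from typing import Iterator

# Trimmed copy of the single-node example above, written as a Python dict
# so the sketch needs no YAML parser.
config = {
    "server_configs": [
        {
            "name": "r1_fp8_dep8_mtp1_1k1k",
            "gpus": 8,
            "tensor_parallel_size": 8,
            "pipeline_parallel_size": 1,
            "client_configs": [
                {
                    "name": "con4096_iter10_1k1k",
                    "concurrency": 4096,
                    "iterations": 10,
                    "isl": 1024,
                    "osl": 1024,
                },
            ],
        },
    ],
}


def iter_test_cases(cfg: dict) -> Iterator[tuple[str, str]]:
    """Yield one (server name, client name) pair per benchmark run."""
    for server in cfg["server_configs"]:
        for client in server.get("client_configs", []):
            yield server["name"], client["name"]


def check_gpu_count(server: dict) -> bool:
    """Hypothetical sanity check: TP x PP should match the GPU count."""
    world_size = (
        server["tensor_parallel_size"] * server["pipeline_parallel_size"]
    )
    return world_size == server["gpus"]


cases = list(iter_test_cases(config))
print(cases)
print(all(check_gpu_count(s) for s in config["server_configs"]))
```

Adding a second entry under client_configs would yield a second pair for the same server, which is how one server definition can be reused across several load levels.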

2. Multi-Node Aggregated Test Configuration

File Example: l0_gb200_multi_nodes.yaml

Use Case: Multi-node aggregated architecture, where a single model instance spans multiple nodes and runs with unified execution.

Structure:

# Hardware Config
hardware:
  gpus_per_node: 4
  gpus_per_server: 8

server_configs:
  - name: "r1_fp4_v2_dep8_mtp1"
    model_name: "deepseek_r1_0528_fp4_v2"
    gpus: 8
    gpus_per_node: 4
    trust_remote_code: true
    tensor_parallel_size: 8
    moe_expert_parallel_size: 8
    pipeline_parallel_size: 1
    max_batch_size: 512
    max_num_tokens: 2112
    attn_backend: "TRTLLM"
    enable_attention_dp: true
    attention_dp_config:
      batching_wait_iters: 0
      enable_balance: true
      timeout_iters: 60
    moe_config:
      backend: 'CUTLASS'
    cuda_graph_config:
      enable_padding: true
      max_batch_size: 512
    kv_cache_config:
      dtype: 'fp8'
      enable_block_reuse: false
      free_gpu_memory_fraction: 0.5
    client_configs:
      - name: "con32_iter12_1k1k"
        concurrency: 32
        iterations: 12
        isl: 1024
        osl: 1024
        random_range_ratio: 0.8
        backend: "openai"
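
With gpus: 8 and gpus_per_node: 4, this server has to span two nodes. The sketch below shows one way the resource calculation could work, assuming the node count is the GPU count divided by gpus_per_node, rounded up. The helper nodes_needed is hypothetical and is not taken from the scripts in this directory. --nodes and --ntasks-per-node are standard sbatch options, but launching one task per GPU is an assumption made for illustration.

```python
import math


def nodes_needed(gpus: int, gpus_per_node: int) -> int:
    """Round up: a partially used node still has to be allocated."""
    if gpus <= 0 or gpus_per_node <= 0:
        raise ValueError("gpus and gpus_per_node must be positive")
    return math.ceil(gpus / gpus_per_node)


# Values taken from the multi-node example above.
server = {"name": "r1_fp4_v2_dep8_mtp1", "gpus": 8, "gpus_per_node": 4}

num_nodes = nodes_needed(server["gpus"], server["gpus_per_node"])

# Assumption: one SLURM task per GPU on each node.
sbatch_args = [
    f"--nodes={num_nodes}",
    f"--ntasks-per-node={server['gpus_per_node']}",
]

print(num_nodes)
print(sbatch_args)
```

Rounding up matters when gpus is not a multiple of gpus_per_node: a 6-GPU server on 4-GPU nodes would still need two nodes.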