mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

TensorRT-LLM Perf Sanity Test System

Performance sanity testing scripts for TensorRT-LLM with configuration-driven test cases supporting single-node, multi-node aggregated, and multi-node disaggregated architectures.

Overview

Run performance sanity benchmarks across multiple model configurations
Support three deployment architectures: single-node, multi-node aggregated, and multi-node disaggregated
Manage test cases through YAML configuration files
Automate resource calculation and job submission via SLURM

Configuration File Types

There are three types of YAML configuration files for different deployment architectures:

1. Single-Node Aggregated Test Configuration

File Example: l0_dgx_b200.yaml

Use Case: Performance tests on a single server with multiple GPUs.

Structure:

server_configs:
  - name: "r1_fp8_dep8_mtp1_1k1k"
    model_name: "deepseek_r1_0528_fp8"
    gpus: 8
    tensor_parallel_size: 8
    moe_expert_parallel_size: 8
    pipeline_parallel_size: 1
    max_batch_size: 512
    max_num_tokens: 8192
    attention_backend: "TRTLLM"
    enable_attention_dp: true
    attention_dp_config:
      batching_wait_iters: 0
      enable_balance: true
      timeout_iters: 60
    moe_config:
      backend: 'DEEPGEMM'
    cuda_graph_config:
      enable_padding: true
      max_batch_size: 512
    kv_cache_config:
      dtype: 'fp8'
      enable_block_reuse: false
      free_gpu_memory_fraction: 0.8
    speculative_config:
      decoding_type: 'MTP'
      num_nextn_predict_layers: 1
    client_configs:
      - name: "con4096_iter10_1k1k"
        concurrency: 4096
        iterations: 10
        isl: 1024
        osl: 1024
        random_range_ratio: 0.8
        backend: "openai"

2. Multi-Node Aggregated Test Configuration

File Example: l0_gb200_multi_nodes.yaml

Use Case: Multi-node aggregated architecture, where the model runs across multiple nodes with unified execution.

Structure:

# Hardware Config
hardware:
  gpus_per_node: 4
  gpus_per_server: 8

server_configs:
  - name: "r1_fp4_v2_dep8_mtp1"
    model_name: "deepseek_r1_0528_fp4_v2"
    gpus: 8
    gpus_per_node: 4
    trust_remote_code: true
    tensor_parallel_size: 8
    moe_expert_parallel_size: 8
    pipeline_parallel_size: 1
    max_batch_size: 512
    max_num_tokens: 2112
    attn_backend: "TRTLLM"
    enable_attention_dp: true
    attention_dp_config:
      batching_wait_iters: 0
      enable_balance: true
      timeout_iters: 60
    moe_config:
      backend: 'CUTLASS'
    cuda_graph_config:
      enable_padding: true
      max_batch_size: 512
    kv_cache_config:
      dtype: 'fp8'
      enable_block_reuse: false
      free_gpu_memory_fraction: 0.5
    client_configs:
      - name: "con32_iter12_1k1k"
        concurrency: 32
        iterations: 12
        isl: 1024
        osl: 1024
        random_range_ratio: 0.8
        backend: "openai"
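
The resource calculation mentioned in the overview is not spelled out here, so the following is only a minimal sketch of the kind of arithmetic involved. It assumes the number of nodes for an aggregated server is the GPU count divided by `gpus_per_node`, rounded up. The function `nodes_required` is a hypothetical name, not a function from the repository.

```python
# Hypothetical sketch: derive the node count for one server config.
# Assumption: nodes = ceil(gpus / gpus_per_node).
def nodes_required(gpus: int, gpus_per_node: int) -> int:
    """Return how many nodes are needed to host `gpus` GPUs."""
    if gpus <= 0 or gpus_per_node <= 0:
        raise ValueError("gpus and gpus_per_node must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-gpus // gpus_per_node)


# Values taken from the example above: 8 GPUs with 4 GPUs per node.
nodes = nodes_required(gpus=8, gpus_per_node=4)
print(nodes)
```

With the example values, the server spans two nodes, which matches `gpus_per_server: 8` and `gpus_per_node: 4` in the `hardware` section.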