Disaggregated Inference Benchmark Scripts

This directory contains scripts to run disaggregated inference benchmarks using TensorRT-LLM and SLURM.

Overview

The benchmarking process is orchestrated through a set of shell scripts and Python scripts that work together:

  1. submit.sh: The main entry point for submitting benchmark jobs to SLURM. It runs a parameter sweep by calling sbatch with different configurations.
  2. disaggr_torch.slurm: The SLURM script that sets up and runs a single benchmark experiment. It launches a container, generates configuration files, starts the server and workers, and runs the benchmark client.
  3. gen_worker_config.py: A Python script that generates the worker configuration YAML file needed by trtllm-serve. It determines the worker configuration based on SLURM environment variables and script arguments.
  4. gen_server_config.py: A Python script that generates the server configuration YAML file needed by trtllm-serve. It determines the server configuration based on the number of context and generation servers.
  5. start_worker.sh: A shell script responsible for starting disaggregated workers using trtllm-serve on each allocated machine.
  6. start_server.sh: A shell script responsible for starting the disaggregated server using trtllm-serve.
  7. run_benchmark.sh: A shell script that waits for the server to be healthy and then runs the actual benchmark client (run_benchmark.py, not included in this directory).

File Descriptions

submit.sh

This script submits multiple SLURM benchmark jobs. It iterates over the configured parameter combinations and calls sbatch on disaggr_torch.slurm for each one.

Usage:

./submit.sh

You can modify the loops in this script to change the parameter space for the benchmark sweep.
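
The sweep is plain bash: nested loops assemble an argument list and hand it to sbatch. A minimal sketch of the pattern with illustrative values (the full 25-argument list is documented in the next section; container_image, mounts, workdir, model_dir, and trtllm_repo are assumed to be set by the caller):

    # Hypothetical sweep: one SLURM job per (gen_tp_size, concurrency) pair.
    for gen_tp_size in 4 8; do
        for concurrency in 64 128 256; do
            sbatch disaggr_torch.slurm \
                1 4 1 4 4480 true \
                1 "${gen_tp_size}" 1 1024 1024 true 0.8 \
                0 0 "${concurrency}" 1024 1024 1 true \
                "${container_image}" "${mounts}" "${workdir}" "${model_dir}" "${trtllm_repo}"
        done
    done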

disaggr_torch.slurm

This is the core SLURM script for a single benchmark run. It is not meant to be run directly, but rather submitted via sbatch (e.g., by submit.sh).

It takes the following arguments, in order (a worked example follows the list):

  1. num_ctx_servers: Number of context servers.
  2. ctx_tp_size: Tensor parallel size for context servers.
  3. ctx_pp_size: Pipeline parallel size for context servers.
  4. ctx_batch_size: Max batch size for context servers.
  5. ctx_max_num_tokens: Max number of tokens for context servers.
  6. ctx_enable_attention_dp: true or false to enable attention DP for context servers.
  7. num_gen_servers: Number of generation servers.
  8. gen_tp_size: Tensor parallel size for generation servers.
  9. gen_pp_size: Pipeline parallel size for generation servers.
  10. gen_batch_size: Max batch size for generation servers.
  11. gen_max_num_tokens: Max number of tokens for generation servers.
  12. gen_enable_attention_dp: true or false to enable attention DP for generation servers.
  13. gen_gpu_memory_fraction: GPU memory fraction for generation servers.
  14. eplb_num_slots: Number of slots for the expert-parallel load balancer (EPLB).
  15. mtp_size: Number of next-n prediction layers for multi-token prediction (MTP).
  16. concurrency: Concurrency level for benchmarking.
  17. isl: Input sequence length.
  18. osl: Output sequence length.
  19. multi_round: Number of rounds for the benchmark.
  20. streaming: true or false for streaming mode.
  21. container_image: Container image to use.
  22. mounts: Container mounts.
  23. workdir: Working directory.
  24. model_dir: Model directory path.
  25. trtllm_repo: TensorRT-LLM repository path.
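
Putting the list together, a single submission could look like the sketch below. Every value is illustrative rather than a recommended setting, and the variables at the end are assumed to be defined by the caller:

    # Illustrative values only; see the argument list above for meanings.
    args=(
        1        # 1.  num_ctx_servers
        4        # 2.  ctx_tp_size
        1        # 3.  ctx_pp_size
        4        # 4.  ctx_batch_size
        4480     # 5.  ctx_max_num_tokens
        true     # 6.  ctx_enable_attention_dp
        1        # 7.  num_gen_servers
        8        # 8.  gen_tp_size
        1        # 9.  gen_pp_size
        1024     # 10. gen_batch_size
        1024     # 11. gen_max_num_tokens
        true     # 12. gen_enable_attention_dp
        0.8      # 13. gen_gpu_memory_fraction
        0        # 14. eplb_num_slots
        0        # 15. mtp_size
        256      # 16. concurrency
        1024     # 17. isl
        1024     # 18. osl
        1        # 19. multi_round
        true     # 20. streaming
        "${container_image}"  # 21. container_image
        "${mounts}"           # 22. mounts
        "${workdir}"          # 23. workdir
        "${model_dir}"        # 24. model_dir
        "${trtllm_repo}"      # 25. trtllm_repo
    )
    sbatch disaggr_torch.slurm "${args[@]}"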

gen_worker_config.py

This Python script generates the worker configuration YAML file that configures the trtllm-serve workers. It creates separate configurations for context and generation workers with different tensor parallelism, batch sizes, and other parameters.

Usage:

The script is called from within disaggr_torch.slurm. It takes numerous arguments to define the model, parallelism, and worker configurations for both context and generation phases.
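
To give the call a shape, a hypothetical invocation follows; the flag names are assumptions for illustration, not the script's verified interface:

    # Flag names below are hypothetical; run: python3 gen_worker_config.py --help
    python3 gen_worker_config.py \
        --work_dir "${workdir}" \
        --model_path "${model_dir}" \
        --ctx_tp_size 4 --ctx_batch_size 4 --ctx_max_num_tokens 4480 \
        --gen_tp_size 8 --gen_batch_size 1024 --gen_max_num_tokens 1024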

gen_server_config.py

This Python script generates the server configuration YAML file that configures the trtllm-serve disaggregated server. It reads hostname information from the work directory and creates a configuration that specifies the URLs for context and generation servers.

Usage:

The script is called from within start_server.sh. It takes arguments for the number of context and generation servers and the work directory.
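
The resulting YAML is what the disaggregated server consumes. Below is a hedged sketch of the call and of the file it might emit; the flag names, hostnames, and ports are illustrative assumptions based on the standard disaggregated config layout:

    # Hypothetical flags; run: python3 gen_server_config.py --help
    python3 gen_server_config.py \
        --num_ctx_servers 1 --num_gen_servers 1 --work_dir "${workdir}"

    # Plausible shape of the emitted config (shown as comments; values illustrative):
    # hostname: node-0
    # port: 8000
    # context_servers:
    #   num_instances: 1
    #   urls:
    #     - "node-1:8001"
    # generation_servers:
    #   num_instances: 1
    #   urls:
    #     - "node-2:8002"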

start_worker.sh

This script starts a disaggregated worker with trtllm-serve. It is launched by srun from the disaggr_torch.slurm script on all allocated nodes; an example invocation follows the argument list.

Arguments:

  1. worker_type: Either "CTX" or "GEN" to specify the worker type.
  2. worker_index: Index of the worker instance.
  3. model_dir: Path to the model directory.
  4. worker_port: Port for the worker to listen on.
  5. benchmark_mode: Benchmark mode setting.
  6. concurrency: Concurrency level.
  7. enable_pdl: true or false.
  8. work_dir: Work directory for logs and configuration.
  9. nsys_on: Whether to enable nsys profiling.
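
For illustration, a single context worker could be launched by hand as below (all values are placeholders; in the real flow srun starts this script on every allocated node):

    # Args: worker_type worker_index model_dir worker_port benchmark_mode
    #       concurrency enable_pdl work_dir nsys_on   (illustrative values)
    bash start_worker.sh CTX 0 "${model_dir}" 8336 false 256 true "${workdir}" false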

start_server.sh

This script starts the trtllm-serve disaggregated server. It first generates the server configuration using gen_server_config.py, then starts the server process. An example invocation follows the argument list.

Arguments:

  1. num_ctx_servers: Number of context servers.
  2. num_gen_servers: Number of generation servers.
  3. work_dir: Work directory for logs and configuration.
  4. script_dir: Directory containing the scripts.
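
An illustrative invocation, assuming one context server, one generation server, and that script_dir points at this directory:

    # Args: num_ctx_servers num_gen_servers work_dir script_dir
    bash start_server.sh 1 1 "${workdir}" "${script_dir}"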

run_benchmark.sh

This script orchestrates the execution of the benchmark client. It waits for the configuration files to be created and for the server's /health endpoint to respond, then runs the benchmark; a sketch of the wait-and-run pattern follows the argument list.

Arguments:

  1. isl: Input sequence length.
  2. osl: Output sequence length.
  3. multi_round: Number of rounds for the benchmark.
  4. model_name: Name of the model being benchmarked.
  5. concurrency_list: Space-separated list of concurrencies.
  6. streaming: true or false.
  7. log_path: Path to the log directory.
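
The wait step amounts to polling the health endpoint until the server answers. A minimal sketch, followed by an illustrative invocation (the hostname, port, and model name are assumptions):

    # Block until the server responds on /health (endpoint location assumed).
    until curl -sf "http://localhost:8000/health" > /dev/null; do
        sleep 10
    done

    # Args: isl osl multi_round model_name concurrency_list streaming log_path
    bash run_benchmark.sh 1024 1024 1 my-model "64 128" true "${workdir}/logs"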

Workflow

  1. Make sure that SLURM parameters are correctly set in disaggr_torch.slurm.
  2. The user runs ./submit.sh.
  3. submit.sh submits one or more jobs to SLURM by calling sbatch disaggr_torch.slurm with different parameters.
  4. For each job, SLURM allocates resources and runs disaggr_torch.slurm.
  5. disaggr_torch.slurm runs gen_worker_config.py to create worker configuration files.
  6. disaggr_torch.slurm uses srun to launch start_worker.sh on all nodes, starting the MPI workers for both context and generation phases.
  7. disaggr_torch.slurm starts the main trtllm-serve process using start_server.sh, which generates the server configuration using gen_server_config.py.
  8. disaggr_torch.slurm runs run_benchmark.sh which waits for the server to be ready.
  9. run_benchmark.sh executes the benchmark for each concurrency level specified.
  10. After the benchmark, run_benchmark.sh and disaggr_torch.slurm attempt to kill the server and worker processes.
  11. Logs for each run are stored in a subdirectory specified by the sub_file parameter.