# Disaggregated Inference Benchmark Scripts
This directory contains scripts to run disaggregated inference benchmarks using TensorRT-LLM and SLURM.
## Overview
The benchmarking process is orchestrated through a set of shell scripts and Python scripts that work together:
- `submit.sh`: The main entry point for submitting benchmark jobs to SLURM. It runs a parameter sweep by calling `sbatch` with different configurations. Supports both context and generation server configurations with pipeline parallelism.
- `disaggr_torch.slurm`: The SLURM script that sets up and runs a single benchmark experiment. It launches a container, optionally builds TensorRT-LLM from source, generates configuration files, starts the server and workers, and runs the benchmark client.
- `gen_worker_config.py`: A Python script that generates the worker configuration YAML file needed by `trtllm-serve`. It determines the worker configuration based on SLURM environment variables and script arguments, supporting both context and generation workers with tensor/pipeline parallelism.
- `gen_server_config.py`: A Python script that generates the server configuration YAML file needed by `trtllm-serve`. It determines the server configuration based on the number of context and generation servers.
- `start_worker.sh`: A shell script responsible for starting disaggregated workers using `trtllm-serve` on each allocated machine. Supports both context and generation workers with profiling capabilities.
- `start_server.sh`: A shell script responsible for starting the disaggregated server using `trtllm-serve` on each allocated machine.
- `run_benchmark.sh`: A shell script that waits for the server to be healthy and then runs the actual benchmark client. Supports streaming mode and various metrics collection.
## File Descriptions

### submit.sh
This script is used to submit SLURM jobs for running benchmarks with specific configurations. It provides helper functions to calculate required nodes and submit jobs with the right parameters.
The script includes a user configuration section where you can set various parameters:
- SLURM Configuration:
  - `partition`: SLURM partition to use
  - `account`: SLURM account to use
  - `job_time`: Job time limit
  - `job_name`: Name of the job
- Hardware Configuration:
  - `gpus_per_node`: Number of GPUs per node (default: 4)
- Benchmark Configuration:
  - `use_nv_sa_benchmark`: Whether to use the NVIDIA SA benchmark script
  - `isl`: Input sequence length
  - `osl`: Output sequence length
  - `multi_round`: Number of benchmark rounds
  - `benchmark_ratio`: Benchmark ratio
  - `streaming`: Enable streaming mode
  - `cache_max_tokens`: Cache transceiver max tokens
  - `dataset_file`: Path to the dataset file for benchmarking
- Environment Configuration:
  - `mount_dir`: Directory to mount in the container
  - `container_image`: Path to the container image
  - `model_path`: Path to the model directory
  - `trtllm_repo`: Path to the TensorRT-LLM repository
  - `build_wheel`: Whether to build TensorRT-LLM from source
- Workspace and Profiling Configuration:
  - `work_dir`: Path to the work directory
  - `nsys_on`: Enable nsys profiling (`true`/`false`)
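For reference, a hedged sketch of what that user configuration section might look like, using the variable names listed above (every value and path here is a placeholder):

```bash
# Sketch of submit.sh's user configuration section; edit these values before
# submitting. All paths and values below are placeholders.
partition="batch"
account="my_account"
job_time="02:00:00"
job_name="disagg-bench"
gpus_per_node=4
isl=1024
osl=1024
streaming=true
model_path="/path/to/model"
container_image="/path/to/container.sqsh"
work_dir="/path/to/workdir"
nsys_on=false
```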
Usage:
The script provides a `run_single` function that takes all the necessary parameters for both context and generation servers. Example usage:

```bash
# CTX: num tp_size pp_size batch tokens attn_dp gpu_frac  GEN: num tp_size pp_size batch tokens attn_dp gpu_frac eplb mtp concurrency
run_single 1 4 1 4 4608 true 0.85 1 8 1 32 128 false "0.9" 0 3 "16"
```
The script automatically calculates the required number of nodes based on the tensor parallel size and server count.
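Conceptually, that is a ceiling division of the total required GPUs by the GPUs per node. A minimal sketch of the arithmetic (the actual helper in `submit.sh` may differ):

```bash
# Minimal sketch of the node-count arithmetic: total GPUs = servers * tp * pp,
# rounded up to whole nodes. The real helper in submit.sh may differ.
compute_nodes() {
    local num_servers=$1 tp_size=$2 pp_size=$3 gpus_per_node=$4
    local total_gpus=$(( num_servers * tp_size * pp_size ))
    echo $(( (total_gpus + gpus_per_node - 1) / gpus_per_node ))
}

# e.g. 1 generation server with TP=8, PP=1 on 4-GPU nodes -> 2 nodes
compute_nodes 1 8 1 4
```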
### disaggr_torch.slurm

This is the core SLURM script for a single benchmark run. It is not meant to be run directly, but rather submitted via `sbatch` (e.g., by `submit.sh`).
It takes the following arguments in order:
- `num_ctx_servers`: Number of context servers.
- `ctx_tp_size`: Tensor parallel size for context servers.
- `ctx_pp_size`: Pipeline parallel size for context servers.
- `ctx_batch_size`: Max batch size for context servers.
- `ctx_max_num_tokens`: Max number of tokens for context servers.
- `ctx_enable_attention_dp`: `true` or `false` to enable attention DP for context servers.
- `ctx_gpu_frac`: GPU memory fraction for context servers.
- `num_gen_servers`: Number of generation servers.
- `gen_tp_size`: Tensor parallel size for generation servers.
- `gen_pp_size`: Pipeline parallel size for generation servers.
- `gen_batch_size`: Max batch size for generation servers.
- `gen_max_num_tokens`: Max number of tokens for generation servers.
- `gen_enable_attention_dp`: `true` or `false` to enable attention DP for generation servers.
- `gen_gpu_memory_fraction`: GPU memory fraction for generation servers.
- `eplb_num_slots`: Number of slots for EPLB.
- `mtp_size`: Number of next-n layers for MTP (Multi-Token Prediction).
- `concurrency_list`: Space-separated list of concurrencies for benchmarking.
- `gpus_per_node`: Number of GPUs per node.
- `use_nv_sa_benchmark`: Whether to use the NVIDIA SA benchmark script.
- `isl`: Input sequence length.
- `osl`: Output sequence length.
- `multi_round`: Number of benchmark rounds.
- `benchmark_ratio`: Benchmark ratio.
- `streaming`: Enable streaming mode.
- `cache_max_tokens`: Cache transceiver max tokens.
- `dataset_file`: Path to the dataset file for benchmarking.
- `mount_dir`: Directory to mount in the container.
- `container_image`: Path to the container image.
- `model_path`: Path to the model directory.
- `trtllm_repo`: Path to the TensorRT-LLM repository.
- `build_wheel`: Whether to build TensorRT-LLM from source.
- `work_dir`: Path to the work directory.
- `nsys_on`: Enable nsys profiling.
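For illustration, a hedged sketch of a direct submission passing all 33 positional arguments in the order above (normally `submit.sh` assembles this call; every value and path below is a placeholder):

```bash
# Hypothetical direct submission (submit.sh normally computes the node count
# and assembles these arguments). All values and paths are placeholders.
args=(
  1 4 1 4 4608 true 0.85       # ctx: num tp pp batch max_tokens attn_dp gpu_frac
  1 8 1 32 128 false 0.9       # gen: num tp pp batch max_tokens attn_dp gpu_frac
  0 3 "16"                     # eplb_num_slots mtp_size concurrency_list
  4 false 1024 1024 1 0.8      # gpus_per_node use_nv_sa_benchmark isl osl multi_round benchmark_ratio
  true 4608                    # streaming cache_max_tokens
  /path/to/dataset.json /mnt /path/to/container.sqsh /path/to/model
  /path/to/TensorRT-LLM false /path/to/workdir false
)
sbatch --nodes=3 disaggr_torch.slurm "${args[@]}"
```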
### gen_worker_config.py

This Python script generates the worker configuration YAML file that configures the `trtllm-serve` workers. It creates separate configurations for context and generation workers with different tensor parallelism, batch sizes, and other parameters.
Usage:
The script is called from within `disaggr_torch.slurm`. It takes numerous arguments to define the model, parallelism, and worker configurations for both context and generation phases.
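For illustration, a hedged sketch of such an invocation; the flag names below are assumptions, not the script's verified interface (its argparse definitions are authoritative):

```bash
# Hypothetical invocation; flag names are illustrative assumptions -- consult
# the script's argparse definitions for the real interface. It is normally
# invoked from inside disaggr_torch.slurm.
python3 gen_worker_config.py \
    --work_dir /path/to/workdir \
    --ctx_tp_size 4 --ctx_pp_size 1 --ctx_batch_size 4 --ctx_max_num_tokens 4608 \
    --gen_tp_size 8 --gen_pp_size 1 --gen_batch_size 32 --gen_max_num_tokens 128
```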
### gen_server_config.py

This Python script generates the server configuration YAML file that configures the `trtllm-serve` disaggregated server. It reads hostname information from the work directory and creates a configuration that specifies the URLs for the context and generation servers.
Usage:
The script is called from within `start_server.sh`. It takes arguments for the number of context and generation servers and the work directory.
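A hedged sketch of how it might be invoked (flag names are assumptions; `start_server.sh` drives the real call):

```bash
# Hypothetical invocation; flag names are illustrative assumptions. The script
# reads worker hostnames from the work directory to build the server YAML.
python3 gen_server_config.py \
    --num_ctx_servers 1 --num_gen_servers 1 --work_dir /path/to/workdir
```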
### start_worker.sh

This script starts a `trtllm-serve disaggregated_mpi_worker`. It is launched by `srun` from the `disaggr_torch.slurm` script on all allocated nodes.
Arguments:
- `worker_type`: Either `"CTX"` or `"GEN"` to specify the worker type.
- `worker_index`: Index of the worker instance.
- `model_dir`: Path to the model directory.
- `worker_port`: Port for the worker to listen on.
- `enable_pdl`: `true` or `false` to enable PDL.
- `work_dir`: Work directory for logs and configuration.
- `nsys_on`: Enable nsys profiling (`true`/`false`).
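For reference, a sketch of a manual invocation following the positional order above (normally `srun` launches this from `disaggr_torch.slurm`; all values are placeholders):

```bash
# Arguments follow the documented order:
# worker_type worker_index model_dir worker_port enable_pdl work_dir nsys_on
bash start_worker.sh CTX 0 /path/to/model 8001 true /path/to/workdir false
```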
### start_server.sh

This script starts the `trtllm-serve` disaggregated server. It first generates the server configuration using `gen_server_config.py`, then starts the server process.
Arguments:
- `num_ctx_servers`: Number of context servers.
- `num_gen_servers`: Number of generation servers.
- `work_dir`: Work directory for logs and configuration.
- `script_dir`: Directory containing the scripts.
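A sketch of a manual invocation in the documented argument order (values are placeholders; `disaggr_torch.slurm` normally makes this call):

```bash
# Arguments: num_ctx_servers num_gen_servers work_dir script_dir
bash start_server.sh 1 1 /path/to/workdir /path/to/scripts
```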
### run_benchmark.sh and run_benchmark_nv_sa.sh

The benchmark can be run using either the default benchmark script (`run_benchmark.sh`) or the NVIDIA SA benchmark script (`run_benchmark_nv_sa.sh`), controlled by the `use_nv_sa_benchmark` parameter.
Default Benchmark Script Arguments (`run_benchmark.sh`):

- `model_name`: Path to the model directory.
- `dataset_file`: Path to the dataset file for benchmarking.
- `multi_round`: Number of rounds for the benchmark.
- `num_gen_servers`: Number of generation servers.
- `concurrency_list`: Space-separated list of concurrencies.
- `streaming`: `true` or `false` for streaming mode.
- `log_path`: Path to the log directory.
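For reference, a sketch of a manual invocation in the documented argument order (values are placeholders; the script is normally invoked by `disaggr_torch.slurm` once the server reports healthy):

```bash
# Arguments: model_name dataset_file multi_round num_gen_servers
#            concurrency_list streaming log_path
bash run_benchmark.sh /path/to/model /path/to/dataset.json 1 1 "16 32" true /path/to/logs
```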
The script collects various metrics, including:
- TTFT (Time to First Token)
- TPOT (Time Per Output Token)
- ITL (Inter-Token Latency)
- E2EL (End-to-End Latency)
NVIDIA SA Benchmark Script Arguments (`run_benchmark_nv_sa.sh`):

- `model_name`: Path to the model directory.
- `isl`: Input sequence length.
- `osl`: Output sequence length.
- `benchmark_ratio`: Ratio for benchmarking.
- `multi_round`: Number of rounds for the benchmark.
- `num_gen_servers`: Number of generation servers.
- `concurrency_list`: Space-separated list of concurrencies.
- `streaming`: `true` or `false` for streaming mode.
- `log_path`: Path to the log directory.
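Similarly, a sketch of a manual invocation in the documented argument order (placeholder values):

```bash
# Arguments: model_name isl osl benchmark_ratio multi_round num_gen_servers
#            concurrency_list streaming log_path
bash run_benchmark_nv_sa.sh /path/to/model 1024 1024 0.8 1 1 "16 32" true /path/to/logs
```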
## Workflow

1. Configure the parameters in `submit.sh` (e.g., SLURM settings, sequence lengths, dataset file, model path, container image).
2. The user runs `./submit.sh` with appropriate parameters for context and generation servers.
3. `submit.sh` calculates the required nodes based on tensor/pipeline parallelism and submits the job to SLURM using `sbatch disaggr_torch.slurm`.
4. For each job, SLURM allocates resources and runs `disaggr_torch.slurm`.
5. `disaggr_torch.slurm` validates all required parameters.
6. `disaggr_torch.slurm` starts the container and optionally builds/installs TensorRT-LLM from source.
7. `disaggr_torch.slurm` runs `gen_worker_config.py` to create worker configuration files with tensor/pipeline parallelism settings.
8. `disaggr_torch.slurm` uses `srun` to launch `start_worker.sh` on allocated nodes for context and generation workers.
9. `disaggr_torch.slurm` generates the server configuration using `gen_server_config.py` and starts the server with `start_server.sh`.
10. `disaggr_torch.slurm` runs either `run_benchmark.sh` or `run_benchmark_nv_sa.sh` based on the `use_nv_sa_benchmark` setting.
11. The benchmark script executes the benchmark for each concurrency level, collecting various metrics.
12. After completion, processes are gracefully terminated and logs are stored in the specified log directory.