# Disaggregated Inference Benchmark Scripts
This directory contains scripts to run disaggregated inference benchmarks using TensorRT-LLM and SLURM.
## Overview
The benchmarking process is orchestrated through a set of shell scripts and a Python script that work together:
- `submit.sh`: The main entry point for submitting benchmark jobs to SLURM. It runs a parameter sweep by calling `sbatch` with different configurations.
- `disaggr_torch.slurm`: The SLURM script that sets up and runs a single benchmark experiment. It launches a container, generates a configuration file, starts the server and workers, and runs the benchmark client.
- `gen_yaml.py`: A Python script that generates the `config.yaml` file needed by `trtllm-serve`. It determines the server and worker configuration based on SLURM environment variables and script arguments.
- `start_worker.sh`: A shell script responsible for starting a `trtllm-serve disaggregated_mpi_worker` on each allocated machine.
- `run_benchmark.sh`: A shell script that waits for the server to become healthy and then runs the actual benchmark client (`run_benchmark.py`, not included in this directory).
## File Descriptions

### `submit.sh`

This script is used to submit multiple SLURM jobs for running benchmarks with different parameters. It iterates through various configurations and uses `sbatch` to submit `disaggr_torch.slurm` for each one.
**Usage:**

```shell
./submit.sh
```
You can modify the loops in this script to change the parameter space for the benchmark sweep.
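For example, a sweep might be structured like the following. All parameter names and values here are illustrative, not taken from the actual script; the 13 positional arguments follow the `disaggr_torch.slurm` interface documented below.

```shell
#!/bin/bash
# Hypothetical sweep: one SLURM job per (gen_tp_size, gen_batch_size) pair.
# "echo" is included so the commands can be previewed as a dry run;
# drop it to actually submit the jobs.
for gen_tp_size in 8 16; do
    for gen_batch_size in 64 128 256; do
        sub_file="sweep_tp${gen_tp_size}_bs${gen_batch_size}"
        echo sbatch disaggr_torch.slurm \
            1 4 64 4096 true \
            1 "${gen_tp_size}" "${gen_batch_size}" 4096 true \
            0.9 "1 2 4 8" "${sub_file}"
    done
done
```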
### `disaggr_torch.slurm`

This is the core SLURM script for a single benchmark run. It is not meant to be run directly, but rather submitted via `sbatch` (e.g., by `submit.sh`).
It takes the following arguments in order:
1. `num_ctx_servers`: Number of context servers.
2. `ctx_tp_size`: Tensor parallel size for context servers.
3. `ctx_batch_size`: Max batch size for context servers.
4. `ctx_max_num_tokens`: Max number of tokens for context servers.
5. `ctx_enable_attention_dp`: `true` or `false` to enable attention DP for context servers.
6. `num_gen_servers`: Number of generation servers.
7. `gen_tp_size`: Tensor parallel size for generation servers.
8. `gen_batch_size`: Max batch size for generation servers.
9. `gen_max_num_tokens`: Max number of tokens for generation servers.
10. `gen_enable_attention_dp`: `true` or `false` to enable attention DP for generation servers.
11. `gen_gpu_memory_fraction`: GPU memory fraction for generation servers.
12. `concurrency_list`: A space-separated list of concurrencies to test (e.g., `"1 2 4 8"`).
13. `sub_file`: A subdirectory name for logs.
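A single submission with each positional argument spelled out might look as follows. All values are examples only, not recommendations.

```shell
#!/bin/bash
# Each element maps to one positional argument of disaggr_torch.slurm,
# in the order listed above. "echo" makes this a dry run; drop it to submit.
args=(
    1            # num_ctx_servers
    4            # ctx_tp_size
    64           # ctx_batch_size
    4096         # ctx_max_num_tokens
    true         # ctx_enable_attention_dp
    1            # num_gen_servers
    8            # gen_tp_size
    256          # gen_batch_size
    4096         # gen_max_num_tokens
    false        # gen_enable_attention_dp
    0.9          # gen_gpu_memory_fraction
    "1 2 4 8"    # concurrency_list
    example_run  # sub_file
)
echo sbatch disaggr_torch.slurm "${args[@]}"
```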
### `gen_yaml.py`

This Python script generates the `config.yaml` file that configures the `trtllm-serve` application. It reads SLURM environment variables (`SLURM_JOB_NODELIST`, `SLURM_TASKS_PER_NODE`) to distribute workers across nodes.
**Usage:**

The script is called from within `disaggr_torch.slurm`. It takes numerous arguments to define the model, parallelism, and server configurations.
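As an illustration of the kind of parsing involved, `SLURM_TASKS_PER_NODE` uses a compressed notation such as `8(x4),2` (8 tasks on each of four nodes, then 2 on a fifth). The hypothetical bash helper below expands that notation into one count per node; it only mirrors the sort of work `gen_yaml.py` must do and is not taken from it.

```shell
#!/bin/bash
# Expand SLURM's compressed tasks-per-node notation, e.g. "8(x4),2",
# into one task count per node: "8 8 8 8 2".
expand_tasks_per_node() {
    local spec="$1" field count repeat i
    local -a out fields
    IFS=',' read -ra fields <<< "$spec"
    for field in "${fields[@]}"; do
        if [[ "$field" =~ ^([0-9]+)\(x([0-9]+)\)$ ]]; then
            count="${BASH_REMATCH[1]}"
            repeat="${BASH_REMATCH[2]}"
            for ((i = 0; i < repeat; i++)); do
                out+=("$count")
            done
        else
            # Plain entry with no "(xN)" repeat suffix.
            out+=("$field")
        fi
    done
    echo "${out[@]}"
}

expand_tasks_per_node "8(x4),2"   # prints: 8 8 8 8 2
```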
### `start_worker.sh`

This script starts a `trtllm-serve disaggregated_mpi_worker`. It is launched by `srun` from the `disaggr_torch.slurm` script on all allocated nodes.
**Arguments:**

- `config_file`: Path to the `config.yaml` file.
- `enable_pdl`: `true` or `false`.
- `ctx_gpus`: Number of GPUs used for the context phase.
- `work_dir`: (Optional) Directory to store nsys profiling output.
### `run_benchmark.sh`

This script orchestrates the execution of the benchmark client. It waits for the `config.yaml` to be created and for the server's `/health` endpoint to respond, then it runs the benchmark.
**Arguments:**

- `isl`: Input sequence length.
- `osl`: Output sequence length.
- `multi_round`: Number of rounds for the benchmark.
- `model_name`: Name of the model being benchmarked.
- `concurrency_list`: Space-separated list of concurrencies.
- `streaming`: `true` or `false`.
- `log_path`: Path to the log directory.
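A minimal sketch of the readiness wait the script performs is shown below. The endpoint URL, polling interval, and timeout are assumptions for illustration; the real script's values may differ.

```shell
#!/bin/bash
# Poll a health endpoint with curl until it responds, or give up
# after a timeout. Returns 0 when healthy, 1 on timeout.
wait_for_server() {
    local url="$1" timeout="${2:-600}" elapsed=0
    until curl -sf "$url" > /dev/null; do
        sleep 1
        elapsed=$((elapsed + 1))
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "server at ${url} not healthy after ${timeout}s" >&2
            return 1
        fi
    done
}

# Example (hypothetical endpoint and timeout):
#   wait_for_server "http://localhost:8000/health" 600
```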
## Workflow
1. The user runs `./submit.sh`.
2. `submit.sh` submits one or more jobs to SLURM by calling `sbatch disaggr_torch.slurm` with different parameters.
3. For each job, SLURM allocates resources and runs `disaggr_torch.slurm`.
4. `disaggr_torch.slurm` runs `gen_yaml.py` to create a `config.yaml`.
5. `disaggr_torch.slurm` uses `srun` to launch `start_worker.sh` on all nodes, starting the MPI workers.
6. `disaggr_torch.slurm` starts the main `trtllm-serve` process.
7. `disaggr_torch.slurm` runs `run_benchmark.sh`, which waits for the server to be ready.
8. `run_benchmark.sh` executes the benchmark for each concurrency level specified.
9. After the benchmark, `run_benchmark.sh` and `disaggr_torch.slurm` attempt to kill the server and worker processes.
10. Logs for each run are stored in a subdirectory specified by the `sub_file` parameter.