From c23369248581e56a955566801b7166302b0e12ad Mon Sep 17 00:00:00 2001
From: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Date: Tue, 10 Feb 2026 15:31:45 +0800
Subject: [PATCH] [None][doc] add multiple-instances section in disaggregated serving doc (#11412)

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
---
 docs/source/features/disagg-serving.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/docs/source/features/disagg-serving.md b/docs/source/features/disagg-serving.md
index 88feb11b08..6e600793fb 100644
--- a/docs/source/features/disagg-serving.md
+++ b/docs/source/features/disagg-serving.md
@@ -10,6 +10,7 @@
   - [Usage](#Usage)
     - [Dynamo](#Dynamo)
     - [trtllm-serve](#trtllm-serve)
+    - [Multiple Instances](#multiple-instances)
   - [Environment Variables](#Environment-Variables)
   - [Troubleshooting and FAQ](#Troubleshooting-and-FAQ)
 
@@ -215,6 +216,23 @@ curl http://localhost:8000/v1/completions \
 
 Please refer to [Disaggregated Inference Benchmark Scripts](../../../examples/disaggregated/slurm).
 
+### Multiple Instances
+
+To increase maximum concurrency without more GPU nodes, you can deploy multiple disaggregated server instances across different nodes, while each instance manages the same context/generation servers. This is helpful when one disaggregated server becomes a performance bottleneck or runs out of ephemeral ports.
+
+Example (two-node deployment):
+
+- **Node A**
+  - Context servers: `node-a:8001`
+  - Generation servers: `node-b:8002`
+  - Disaggregated orchestrator endpoint: `node-a:8000`
+- **Node B**
+  - Context servers: `node-a:8001`
+  - Generation servers: `node-b:8002`
+  - Disaggregated orchestrator endpoint: `node-b:8000`
+- **Client entrypoint**
+  - Send requests or use a load balancer forwarding to `node-a:8000` and `node-b:8000`
+
 ## Environment Variables
 
 TRT-LLM uses some environment variables to control the behavior of disaggregated service.
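
Note on the "Client entrypoint" bullet in the new section: the patch leaves the choice of load balancer open. As a rough illustration only, the sketch below shows the simplest client-side alternative, rotating requests across the two orchestrator endpoints from the example. The names `ORCHESTRATORS` and `completion_urls` are hypothetical and not part of TRT-LLM; the `/v1/completions` path and the `http` scheme are taken from the `curl` example earlier in the same document. It only builds target URLs and sends nothing.

```python
from itertools import cycle
from typing import Iterator, Sequence

# Orchestrator endpoints from the two-node example in the patch.
# The variable name is hypothetical, chosen for this sketch.
ORCHESTRATORS = ["node-a:8000", "node-b:8000"]


def completion_urls(endpoints: Sequence[str]) -> Iterator[str]:
    """Yield completions URLs, rotating through the endpoints in order."""
    for host in cycle(endpoints):
        yield f"http://{host}/v1/completions"


# Successive requests alternate between the two orchestrator instances.
urls = completion_urls(ORCHESTRATORS)
targets = [next(urls) for _ in range(4)]
print(targets)
```

A dedicated load balancer in front of both endpoints, as the section suggests, would typically be preferable in production because it can also handle health checks and failover, which this rotation does not attempt.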