[None][doc] add multiple-instances section in disaggregated serving doc (#11412)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
parent 17cc1c13d6, commit c233692485
@@ -10,6 +10,7 @@

- [Usage](#Usage)
- [Dynamo](#Dynamo)
- [trtllm-serve](#trtllm-serve)
- [Multiple Instances](#multiple-instances)
- [Environment Variables](#Environment-Variables)
- [Troubleshooting and FAQ](#Troubleshooting-and-FAQ)
@@ -215,6 +216,23 @@ curl http://localhost:8000/v1/completions \
Please refer to [Disaggregated Inference Benchmark Scripts](../../../examples/disaggregated/slurm).

### Multiple Instances

To increase maximum concurrency without adding GPU nodes, you can deploy multiple disaggregated server instances on different nodes, with each instance managing the same set of context and generation servers. This helps when a single disaggregated server becomes a performance bottleneck or exhausts its ephemeral ports.

Example (two-node deployment; a launch sketch follows the list):

- **Node A**
  - Context servers: `node-a:8001`
  - Generation servers: `node-b:8002`
  - Disaggregated orchestrator endpoint: `node-a:8000`
- **Node B**
  - Context servers: `node-a:8001`
  - Generation servers: `node-b:8002`
  - Disaggregated orchestrator endpoint: `node-b:8000`
- **Client entrypoint**
  - Send requests directly to either endpoint, or use a load balancer that forwards to `node-a:8000` and `node-b:8000`
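
As a concrete illustration, here is a minimal sketch of what the two orchestrator configs and launch commands might look like. It assumes the `disagg_config.yaml` format and the `trtllm-serve disaggregated -c` entry point used in the TRT-LLM disaggregated examples; the hostnames `node-a`/`node-b`, the ports, and the `num_instances` values are placeholders from the list above, and exact field names may vary across versions.

```bash
# On node-a: the orchestrator binds to node-a:8000 and points at the shared
# context/generation servers. Sketch only; verify fields against your version.
cat > disagg_config_node_a.yaml <<'EOF'
hostname: node-a
port: 8000
backend: pytorch
context_servers:
  num_instances: 1
  urls:
    - "node-a:8001"
generation_servers:
  num_instances: 1
  urls:
    - "node-b:8002"
EOF
trtllm-serve disaggregated -c disagg_config_node_a.yaml

# On node-b: identical server lists, but the orchestrator binds to node-b:8000.
cat > disagg_config_node_b.yaml <<'EOF'
hostname: node-b
port: 8000
backend: pytorch
context_servers:
  num_instances: 1
  urls:
    - "node-a:8001"
generation_servers:
  num_instances: 1
  urls:
    - "node-b:8002"
EOF
trtllm-serve disaggregated -c disagg_config_node_b.yaml
```

With both instances up, clients can send completions requests to either `node-a:8000` or `node-b:8000` directly, or a load balancer can spread traffic across the two endpoints.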
## Environment Variables
TRT-LLM uses several environment variables to control the behavior of the disaggregated service.
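
As a usage sketch only: such variables are exported before launching the servers. The variable below, `TRTLLM_USE_UCX_KVCACHE`, appears in TRT-LLM's disaggregated examples as a switch for UCX-based KV-cache transfer, but treat both the name and its effect as version-dependent assumptions and confirm them against the variables documented in this section.

```bash
# Sketch only: TRTLLM_USE_UCX_KVCACHE is taken from TRT-LLM's disaggregated
# examples (selects UCX-based KV-cache transfer); names and defaults are
# version-dependent, so confirm against the documented list.
export TRTLLM_USE_UCX_KVCACHE=1

# Then launch the context/generation servers and the disaggregated server
# as shown above, e.g.:
trtllm-serve disaggregated -c disagg_config.yaml
```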