[None][doc] add multiple-instances section in disaggregated serving doc (#11412)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
parent 17cc1c13d6, commit c233692485
@@ -10,6 +10,7 @@

- [Usage](#Usage)
- [Dynamo](#Dynamo)
- [trtllm-serve](#trtllm-serve)
- [Multiple Instances](#multiple-instances)
- [Environment Variables](#Environment-Variables)
- [Troubleshooting and FAQ](#Troubleshooting-and-FAQ)
@@ -215,6 +216,23 @@ curl http://localhost:8000/v1/completions \
Please refer to [Disaggregated Inference Benchmark Scripts](../../../examples/disaggregated/slurm).

### Multiple Instances

To increase maximum concurrency without adding GPU nodes, you can deploy multiple disaggregated server instances on different nodes, with each instance managing the same set of context and generation servers. This helps when a single disaggregated server becomes a performance bottleneck or exhausts its ephemeral ports.

Example (two-node deployment; a launch sketch follows the list):

- **Node A**
  - Context servers: `node-a:8001`
  - Generation servers: `node-b:8002`
  - Disaggregated orchestrator endpoint: `node-a:8000`
- **Node B**
  - Context servers: `node-a:8001`
  - Generation servers: `node-b:8002`
  - Disaggregated orchestrator endpoint: `node-b:8000`
- **Client entrypoint**
  - Send requests directly to either endpoint, or use a load balancer that forwards to `node-a:8000` and `node-b:8000`
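
As a concrete illustration, here is a minimal sketch of what the two orchestrator configs and launch commands might look like. It assumes the `disagg_config.yaml` format and the `trtllm-serve disaggregated -c` entry point used in the TRT-LLM disaggregated examples; the hostnames `node-a`/`node-b`, the ports, and the `num_instances` values are placeholders from the list above, and exact field names may vary across versions.

```bash
# On node-a: the orchestrator binds to node-a:8000 and points at the shared
# context/generation servers. Sketch only; verify fields against your version.
cat > disagg_config_node_a.yaml <<'EOF'
hostname: node-a
port: 8000
backend: pytorch
context_servers:
  num_instances: 1
  urls:
    - "node-a:8001"
generation_servers:
  num_instances: 1
  urls:
    - "node-b:8002"
EOF
trtllm-serve disaggregated -c disagg_config_node_a.yaml

# On node-b: identical server lists, but the orchestrator binds to node-b:8000.
cat > disagg_config_node_b.yaml <<'EOF'
hostname: node-b
port: 8000
backend: pytorch
context_servers:
  num_instances: 1
  urls:
    - "node-a:8001"
generation_servers:
  num_instances: 1
  urls:
    - "node-b:8002"
EOF
trtllm-serve disaggregated -c disagg_config_node_b.yaml
```

With both instances up, clients can send completions requests to either `node-a:8000` or `node-b:8000` directly, or a load balancer can spread traffic across the two endpoints.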
## Environment Variables
TRT-LLM uses several environment variables to control the behavior of the disaggregated service.
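
As a usage sketch only: such variables are exported before launching the servers. The variable below, `TRTLLM_USE_UCX_KVCACHE`, appears in TRT-LLM's disaggregated examples as a switch for UCX-based KV-cache transfer, but treat both the name and its effect as version-dependent assumptions and confirm them against the variables documented in this section.

```bash
# Sketch only: TRTLLM_USE_UCX_KVCACHE is taken from TRT-LLM's disaggregated
# examples (selects UCX-based KV-cache transfer); names and defaults are
# version-dependent, so confirm against the documented list.
export TRTLLM_USE_UCX_KVCACHE=1

# Then launch the context/generation servers and the disaggregated server
# as shown above, e.g.:
trtllm-serve disaggregated -c disagg_config.yaml
```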