Disaggregated Serving with Ray orchestrator
TensorRT-LLM supports a prototype Ray orchestrator as an alternative to MPI.
Running disaggregated serving with Ray follows the same workflow as with MPI, except that orchestrator_type="ray" must be set on the LLM class; CUDA_VISIBLE_DEVICES can be omitted because Ray handles GPU placement.
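For illustration, here is a minimal sketch of constructing the LLM with the Ray orchestrator. The model name is just an example, and the remaining arguments are assumed to follow the usual LLM API; only the orchestrator_type setting is specific to this mode.

# Minimal sketch: select the Ray orchestrator when constructing the LLM.
# The model name below is illustrative; other arguments follow the usual LLM API.
from tensorrt_llm import LLM

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    orchestrator_type="ray",  # use Ray instead of MPI for worker orchestration
)
# CUDA_VISIBLE_DEVICES can be left unset; Ray handles GPU placement.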
Quick Start
The script below is a shorthand that launches a single-GPU context server and a single-GPU generation server, along with the disaggregated server, within a single Ray cluster. Please see this documentation for details on adjusting the parallel settings.
# requires a total of two GPUs
bash -e disagg_serving_local.sh
Once the disaggregated server is ready, you can send it requests using curl:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"prompt": "NVIDIA is a great company because",
"max_tokens": 16,
"temperature": 0
}' -w "\n"
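Equivalently, here is a sketch of the same request sent from Python with the requests library; the endpoint and payload simply mirror the curl example above.

# Sketch: the same completion request issued from Python using the requests package.
import requests

payload = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "prompt": "NVIDIA is a great company because",
    "max_tokens": 16,
    "temperature": 0,
}
response = requests.post("http://localhost:8000/v1/completions", json=payload)
print(response.json())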
Disclaimer
The code is a prototype and subject to change. Currently, there are no guarantees regarding functionality, performance, or stability.