Disaggregated Serving with Ray orchestrator
TensorRT-LLM supports a prototype Ray orchestrator as an alternative to MPI.
Running disaggregated serving with Ray follows the same workflow as with MPI, except that orchestrator_type="ray" must be set on the LLM class; CUDA_VISIBLE_DEVICES can be omitted because Ray handles GPU placement.
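For illustration, here is a minimal sketch of constructing the LLM with the Ray orchestrator. The model name is just an example, and the remaining arguments are assumed to follow the usual LLM API; only the orchestrator_type setting is specific to this mode.

# Minimal sketch: select the Ray orchestrator when constructing the LLM.
# The model name below is illustrative; other arguments follow the usual LLM API.
from tensorrt_llm import LLM

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    orchestrator_type="ray",  # use Ray instead of MPI for worker orchestration
)
# CUDA_VISIBLE_DEVICES can be left unset; Ray handles GPU placement.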
Quick Start
The script below is a shorthand that launches a single-GPU context server and a single-GPU generation server, along with the disaggregated server, within a single Ray cluster. Please see this documentation for details on adjusting the parallel settings.
# requires a total of two GPUs
bash -e disagg_serving_local.sh
Once the disaggregated server is ready, you can send it requests using curl:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"prompt": "NVIDIA is a great company because",
"max_tokens": 16,
"temperature": 0
}' -w "\n"
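Equivalently, here is a sketch of the same request sent from Python with the requests library; the endpoint and payload simply mirror the curl example above.

# Sketch: the same completion request issued from Python using the requests package.
import requests

payload = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "prompt": "NVIDIA is a great company because",
    "max_tokens": 16,
    "temperature": 0,
}
response = requests.post("http://localhost:8000/v1/completions", json=payload)
print(response.json())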
Disclaimer
The code is a prototype and subject to change. Currently, there are no guarantees regarding functionality, performance, or stability.