
Disaggregated Serving with Ray orchestrator

TensorRT-LLM supports a prototype Ray orchestrator as an alternative to MPI.

Running disaggregated serving with Ray follows the same workflow as with MPI, except that orchestrator_type="ray" must be set on the LLM class; CUDA_VISIBLE_DEVICES can be omitted because Ray handles GPU placement.
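
For reference, a minimal sketch of enabling the Ray orchestrator on the LLM class. The model name is illustrative, the exact keyword argument may differ between releases, and the remaining disaggregated configuration (context and generation servers) is omitted here.

from tensorrt_llm import LLM

# Minimal sketch: the key difference from the MPI workflow is selecting the
# Ray orchestrator when constructing the LLM.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    orchestrator_type="ray",  # use Ray instead of MPI; Ray handles GPU placement
)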

Quick Start

The disagg_serving_local.sh script below is a shorthand that launches a single-GPU context server, a single-GPU generation server, and the disaggregated server within a single Ray cluster. See the disaggregated serving documentation for details on adjusting parallelism settings.

# requires a total of two GPUs
bash -e disagg_serving_local.sh

Once the disaggregated server is ready, you can send requests to it using curl:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": "NVIDIA is a great company because",
        "max_tokens": 16,
        "temperature": 0
    }' -w "\n"
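
Equivalently, here is a minimal sketch of the same request in Python, assuming the disaggregated server is still listening on localhost:8000 and the requests package is installed:

import requests

# Send the same completion request to the OpenAI-compatible endpoint.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": "NVIDIA is a great company because",
        "max_tokens": 16,
        "temperature": 0,
    },
)
print(response.json()["choices"][0]["text"])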

Disclaimer

The code is a prototype and subject to change. Currently, there are no guarantees regarding functionality, performance, or stability.