TensorRT-LLM with Ray orchestrator

This folder contains examples for a prototype Ray orchestrator that supports on-demand LLM instance spin-up and flexible GPU placement across single- and multi-node inference. It's a first step toward making TensorRT-LLM a better fit for reinforcement learning from human feedback (RLHF) workflows. Unlike MPI, which has a fixed world size and placement, Ray can dynamically spawn and reconnect distributed inference actors, each with its own parallelism strategy.

This feature is a prototype and under active development. MPI remains the default.

Quick Start

To use the Ray orchestrator, first install Ray:

cd examples/ray_orchestrator
pip install -r requirements.txt

Run a simple TP=2 example with a Hugging Face model:

python llm_inference_distributed_ray.py

This example is the same as the one in /examples/llm-api; the only change is passing orchestrator_type="ray" to LLM(). Other examples can be adapted in the same way by setting this flag, as sketched below.
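
For reference, here is a minimal sketch of that pattern using the tensorrt_llm LLM API. The model name and sampling settings are illustrative; the only Ray-specific piece is the orchestrator_type argument.

from tensorrt_llm import LLM, SamplingParams

def main():
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative; any supported HF model
        tensor_parallel_size=2,                    # TP=2, as in the example script
        orchestrator_type="ray",                   # use Ray instead of the default MPI
    )
    prompts = ["Hello, my name is", "The capital of France is"]
    sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)
    for output in llm.generate(prompts, sampling):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()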

Features

Available

Initial testing has focused on LLaMA and DeepSeek variants. Please open an issue if you encounter problems with other models so we can prioritize support.

Upcoming

  • Performance optimization
  • Integration with RLHF frameworks such as NVIDIA Nemo-RL and Verl

Architecture

This feature introduces new classes, such as RayExecutor and RayGPUWorker, for Ray actor lifecycle management and distributed inference. In Ray mode, collective operations run on torch.distributed without MPI. We welcome contributions to improve and extend this support.

(Figure: Ray orchestrator architecture)
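
For intuition, here is a self-contained, hypothetical sketch of that pattern (not the actual RayExecutor/RayGPUWorker API): one Ray actor per GPU rank, with a torch.distributed process group handling collectives instead of MPI.

import os
import ray
import torch
import torch.distributed as dist

@ray.remote(num_gpus=1)
class Worker:
    # Illustrative stand-in for a Ray GPU worker; one actor per rank.
    def __init__(self, rank, world_size, master_addr, master_port):
        os.environ["MASTER_ADDR"] = master_addr
        os.environ["MASTER_PORT"] = str(master_port)
        # NCCL collectives over torch.distributed; no MPI anywhere.
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        self.device = torch.device("cuda:0")  # each actor is pinned to one GPU

    def allreduce(self, value):
        t = torch.tensor([value], device=self.device)
        dist.all_reduce(t)  # sums the tensor across all workers
        return t.item()

if __name__ == "__main__":
    ray.init()
    world_size = 2
    workers = [Worker.remote(r, world_size, "127.0.0.1", 29500)
               for r in range(world_size)]
    print(ray.get([w.allreduce.remote(1.0) for w in workers]))  # [2.0, 2.0]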

Disclaimer

The code is a prototype and subject to change. There are currently no guarantees regarding functionality, performance, or stability.