# Multi-node inference with Ray orchestrator
TensorRT-LLM supports a prototype Ray orchestrator as an alternative to MPI. The following example shows how to start a Ray cluster for multi-node inference.
## Quick Start
Prerequisite: a container image with TensorRT-LLM preinstalled (or suitable for installing it). The examples use Slurm and Enroot. If you use a different setup, adapt the following scripts and commands to your multi-node environment.
- Allocate nodes and open a shell on the head node:

  ```bash
  # e.g., 2 nodes with 8 GPUs per node
  >> salloc -t 240 -N 2 -p interactive
  >> srun --pty -p interactive bash
  ```
- Once on the head node, launch a multi-node Ray cluster:

  ```bash
  # Remember to set the CONTAINER and MOUNTS env vars (or the variables inside the script) to your paths.
  # You can add the TensorRT-LLM installation command to this script if it is not preinstalled in your container.
  >> bash -e run_cluster.sh
  ```

  A quick way to confirm that all nodes joined the cluster is sketched after this list.
- Enter the head container and run your TensorRT-LLM driver script:

  Note that this step requires TensorRT-LLM to be installed in the containers on all nodes. If it isn’t, install it manually inside each node’s container. A rough outline of what the driver script can look like follows after this list.

  ```bash
  # On the head node
  >> sacct  # grab the Slurm step ID with the job name "ray-head"
  >> srun --jobid=<Your Step ID> --overlap --pty bash
  >> enroot list -f  # get the container process ID
  >> enroot exec <process id> bash

  # Inside the container: adjust this script to a model and parallelism settings
  # suited to multi-node inference (e.g., TP8 or TP4PP4).
  >> python examples/ray_orchestrator/llm_inference_async_ray.py
  ```
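Once `run_cluster.sh` is up, you can optionally sanity-check that every node has joined the Ray cluster before submitting work. The snippet below is a hypothetical helper rather than part of the repository's scripts; it assumes it is run from a shell where `ray` is installed and that can reach the cluster started by `run_cluster.sh` (for example, inside the head container entered in the last step above).

```python
# check_cluster.py -- hypothetical helper, not part of the TensorRT-LLM examples.
import ray

# "auto" attaches to the already-running Ray cluster started by run_cluster.sh.
ray.init(address="auto")

alive_nodes = [node for node in ray.nodes() if node["Alive"]]
print(f"Ray sees {len(alive_nodes)} alive node(s)")
# For 2 nodes with 8 GPUs each, expect roughly {'GPU': 16.0, ...} here.
print("Cluster resources:", ray.cluster_resources())
```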
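For reference, a driver script for this setup follows roughly the pattern sketched below. This is a minimal, hypothetical illustration rather than a copy of `llm_inference_async_ray.py`: the `orchestrator_type="ray"` argument and the model name are assumptions, so consult the example script in the repository for the actual interface and options.

```python
# Hypothetical sketch of an async multi-node driver; see
# examples/ray_orchestrator/llm_inference_async_ray.py for the actual script.
import asyncio

from tensorrt_llm import LLM, SamplingParams


async def generate(llm: LLM, prompt: str, sampling_params: SamplingParams) -> None:
    # generate_async() returns an awaitable request handle.
    output = await llm.generate_async(prompt, sampling_params)
    print(f"Prompt: {prompt!r} -> {output.outputs[0].text!r}")


def main() -> None:
    # orchestrator_type="ray" is an assumption: it selects the prototype Ray
    # orchestrator instead of MPI. tensor_parallel_size=8 shards the model across
    # the 8 GPUs of one node; combine with pipeline_parallel_size for e.g. TP4PP4.
    llm = LLM(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
        orchestrator_type="ray",
        tensor_parallel_size=8,
    )

    sampling_params = SamplingParams(max_tokens=64)
    prompts = ["The capital of France is", "The future of AI is"]

    async def run_all():
        await asyncio.gather(*(generate(llm, p, sampling_params) for p in prompts))

    asyncio.run(run_all())


if __name__ == "__main__":
    main()
```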
## Disclaimer
The code is a prototype and subject to change. Currently, there are no guarantees regarding functionality, performance, or stability.