# Multi-node inference with Ray orchestrator

TensorRT-LLM supports a prototype Ray orchestrator as an alternative to MPI. The following example shows how to start a Ray cluster for multi-node inference.

## Quick Start

Prerequisite: a container image with TensorRT-LLM preinstalled (or suitable for installing it). The examples use Slurm and Enroot. If you use a different setup, adapt the following scripts and commands to your multi-node environment.
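
For reference when adapting to a non-Slurm environment, the essential Ray commands that a cluster-launch script such as `run_cluster.sh` runs look roughly like the sketch below. This is a minimal illustration rather than the script's exact contents; the head address, port, and GPU count are placeholders.

```bash
# Minimal sketch of the Ray cluster bring-up that run_cluster.sh automates.
# HEAD_IP, the port, and --num-gpus are placeholders; adapt them to your environment.

# On the head node (inside the container):
>> ray start --head --port=6379 --num-gpus=8

# On every worker node (inside the container), pointing at the head node's address:
>> ray start --address=<HEAD_IP>:6379 --num-gpus=8
```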

1. Allocate nodes and open a shell on the head node:

   ```bash
   # e.g., 2 nodes with 8 GPUs per node
   >> salloc -t 240 -N 2 -p interactive

   >> srun --pty -p interactive bash
   ```

2. Once on the head node, launch a multi-node Ray cluster:

   ```bash
   # Remember to set the CONTAINER and MOUNTS environment variables (or the corresponding variables inside the script) to your own paths.
   # If TensorRT-LLM is not preinstalled in your container, you can add its installation command to this script.
   >> bash -e run_cluster.sh
   ```

3. Enter the head container and run your TensorRT-LLM driver script.

   Note that this step requires TensorRT-LLM to be installed in the containers on all nodes. If it isn't, install it manually inside each node's container (see the sketch after these steps).

   ```bash
   # On the head node
   >> sacct

   # Grab the Slurm step ID whose job name is "ray-head"
   >> srun --jobid=<Your Step ID> --overlap --pty bash

   >> enroot list -f  # get the process ID
   >> enroot exec <process id> bash

   # You can edit this script to use a model and parallelism settings suited to multi-node inference (e.g., TP8 or TP4PP4).
   >> python examples/ray_orchestrator/llm_inference_async_ray.py
   ```
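
If TensorRT-LLM is missing from the containers, the following is a minimal sketch of how you might install and sanity-check it from inside each node's container before launching the driver script. The pip index URL is an assumption based on the general TensorRT-LLM installation instructions; defer to the official installation guide for the command that matches your release.

```bash
# Hypothetical manual installation inside a node's container (repeat on every node);
# check the official installation docs for the exact command for your release.
>> pip3 install tensorrt_llm --extra-index-url https://pypi.nvidia.com

# Sanity-check the installation and the Ray cluster before launching the driver script.
>> python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
>> ray status   # should list every node and GPU you allocated
```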
    

## Disclaimer

The code is a prototype and subject to change. Currently, there are no guarantees regarding functionality, performance, or stability.