# Multi-node inference with Ray orchestrator
TensorRT-LLM supports a prototype Ray orchestrator as an alternative to MPI. The following example shows how to start a Ray cluster for multi-node inference.
## Quick Start
Prerequisite: a container image with TensorRT-LLM preinstalled (or suitable for installing it). The examples use Slurm and Enroot. If you use a different setup, adapt the following scripts and commands to your multi-node environment.
- Allocate nodes and open a shell on the head node:

  ```bash
  # e.g., 2 nodes with 8 GPUs per node
  >> salloc -t 240 -N 2 -p interactive
  >> srun --pty -p interactive bash
  ```
- Once on the head node, launch a multi-node Ray cluster:

  ```bash
  # Remember to set the CONTAINER and MOUNTS env vars (or the variables inside the script) to your paths.
  # You can add the TensorRT-LLM installation command to this script if it is not preinstalled in your container.
  >> bash -e run_cluster.sh
  ```

  A quick way to confirm that all nodes joined the cluster is sketched after this list.
- Enter the head container and run your TensorRT-LLM driver script:

  Note that this step requires TensorRT-LLM to be installed in the containers on all nodes. If it isn’t, install it manually inside each node’s container. A rough outline of what the driver script can look like follows after this list.

  ```bash
  # On the head node
  >> sacct  # grab the Slurm step ID with the job name "ray-head"
  >> srun --jobid=<Your Step ID> --overlap --pty bash
  >> enroot list -f  # get the container process ID
  >> enroot exec <process id> bash

  # Inside the container: adjust this script to a model and parallelism settings
  # suited to multi-node inference (e.g., TP8 or TP4PP4).
  >> python examples/ray_orchestrator/llm_inference_async_ray.py
  ```
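Once `run_cluster.sh` is up, you can optionally sanity-check that every node has joined the Ray cluster before submitting work. The snippet below is a hypothetical helper rather than part of the repository's scripts; it assumes it is run from a shell where `ray` is installed and that can reach the cluster started by `run_cluster.sh` (for example, inside the head container entered in the last step above).

```python
# check_cluster.py -- hypothetical helper, not part of the TensorRT-LLM examples.
import ray

# "auto" attaches to the already-running Ray cluster started by run_cluster.sh.
ray.init(address="auto")

alive_nodes = [node for node in ray.nodes() if node["Alive"]]
print(f"Ray sees {len(alive_nodes)} alive node(s)")
# For 2 nodes with 8 GPUs each, expect roughly {'GPU': 16.0, ...} here.
print("Cluster resources:", ray.cluster_resources())
```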
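For reference, a driver script for this setup follows roughly the pattern sketched below. This is a minimal, hypothetical illustration rather than a copy of `llm_inference_async_ray.py`: the `orchestrator_type="ray"` argument and the model name are assumptions, so consult the example script in the repository for the actual interface and options.

```python
# Hypothetical sketch of an async multi-node driver; see
# examples/ray_orchestrator/llm_inference_async_ray.py for the actual script.
import asyncio

from tensorrt_llm import LLM, SamplingParams


async def generate(llm: LLM, prompt: str, sampling_params: SamplingParams) -> None:
    # generate_async() returns an awaitable request handle.
    output = await llm.generate_async(prompt, sampling_params)
    print(f"Prompt: {prompt!r} -> {output.outputs[0].text!r}")


def main() -> None:
    # orchestrator_type="ray" is an assumption: it selects the prototype Ray
    # orchestrator instead of MPI. tensor_parallel_size=8 shards the model across
    # the 8 GPUs of one node; combine with pipeline_parallel_size for e.g. TP4PP4.
    llm = LLM(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
        orchestrator_type="ray",
        tensor_parallel_size=8,
    )

    sampling_params = SamplingParams(max_tokens=64)
    prompts = ["The capital of France is", "The future of AI is"]

    async def run_all():
        await asyncio.gather(*(generate(llm, p, sampling_params) for p in prompts))

    asyncio.run(run_all())


if __name__ == "__main__":
    main()
```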
## Disclaimer
The code is a prototype and subject to change. Currently, there are no guarantees regarding functionality, performance, or stability.