# CPU Affinity configuration in TensorRT-LLM

## NUMA-aware affinity in TensorRT-LLM
TensorRT-LLM is frequently deployed on NUMA systems. To ensure consistent and optimal performance on these systems, it is critical to set the CPU affinity of the workers/tasks launched as part of a particular TensorRT-LLM instance so as to minimize the latency and maximize the bandwidth of CPU↔GPU and CPU↔DRAM communication.
Because TensorRT-LLM does the work of allocating GPU/CUDA devices to ranks, it is logically the ideal place for the CPU affinity to be determined and set. For this reason, TensorRT-LLM provides a mechanism to automatically set CPU affinity according to NUMA topology. In some deployments, however, the user may wish to configure CPU affinity manually (e.g. using `numactl`, wrappers around it, or `mpirun`), so this feature is only activated if it is explicitly enabled or if CPU affinity is not already constrained by the user or environment. It is controlled by the `TLLM_NUMA_AWARE_WORKER_AFFINITY` environment variable as follows:
| `TLLM_NUMA_AWARE_WORKER_AFFINITY` | Behavior |
|---|---|
| unset (default) | Affinity is auto-configured if it is unconstrained, and cleared if it is constrained by the user and/or environment. |
| `1` | Affinity is unconditionally auto-configured. |
| `0` or any other value | Affinity remains as configured by the user and/or environment. |
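To check whether CPU affinity is already constrained before relying on the default behavior, the current binding of the launch shell can be inspected (a minimal sketch, assuming the `taskset` utility from util-linux and `numactl` are available):

```bash
# An affinity list covering all host CPUs means affinity is unconstrained;
# a narrower list means it has been constrained by the user or environment.
taskset -cp $$
# numactl also reports the current CPU and memory binding of the shell
numactl --show
```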
## Other environmental considerations
Whether the user chooses to configure CPU affinity manually or have TensorRT-LLM configure it automatically, the environment can also constrain the CPU affinity in a way that subverts the user's intent. Both OpenMPI and Slurm may configure CPU affinity, so the following additional configuration is recommended to avoid this.
### OpenMPI
By default, OpenMPI chooses a rank-wise CPU affinity that is not sensitized to the NUMA topology of the system. Because it does not know which GPU a particular rank will be communicating with (this is determined by TensorRT-LLM at runtime), it cannot set the CPU affinity accordingly. For this reason, it is recommended that OpenMPI's default binding policy be disabled as follows:
```bash
export OMPI_MCA_hwloc_base_binding_policy=none
export OMPI_MCA_rmaps_base_inherit=1
```
The first environment variable ensures that OpenMPI will not attempt to bind or set the affinity of the ranks that are created at launch.
The second ensures that OpenMPI's binding policy will propagate to MPI workers
that are spawned by mpi4py's MPIPoolExecutor class within TensorRT-LLM
(when using mpirun).
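As a quick sanity check (a minimal sketch, assuming the `taskset` utility from util-linux is available in the environment), each rank can print its own affinity list to confirm that OpenMPI is no longer binding it:

```bash
# With the binding policy disabled, every rank should report the full set of
# host CPUs rather than a narrow per-rank core list.
export OMPI_MCA_hwloc_base_binding_policy=none
export OMPI_MCA_rmaps_base_inherit=1
mpirun -n 2 bash -c 'echo "rank ${OMPI_COMM_WORLD_RANK}: $(taskset -cp $$)"'
```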
### Slurm
If Slurm is configured to use an affinity or cgroup task plugin, then Slurm may also configure CPU affinity by default in a way that is not sensitized to NUMA topology. To prevent this, Slurm jobs should be launched as follows:
#### srun
The `srun` parameters should include `--cpu-bind=none` and exclude `--exclusive`:
```bash
srun --cpu-bind=none ...
```
#### sbatch
The `sbatch` script should set the `SLURM_CPU_BIND` environment variable to `none`:
```bash
export SLURM_CPU_BIND=none
```
Note: if this environment variable is set, it is not necessary to supply `--cpu-bind=none` to each job step (`srun` invocation).
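For illustration, a minimal sketch of how this fits into an `sbatch` script (the job name, node count, and placeholder launch command are hypothetical, not part of the original example):

```bash
#!/bin/bash
#SBATCH --job-name=trtllm-example   # hypothetical job name
#SBATCH --nodes=1                   # hypothetical node count

# Let the launched processes (or TensorRT-LLM itself) manage CPU affinity
# instead of Slurm's task plugin
export SLURM_CPU_BIND=none

# With SLURM_CPU_BIND exported, --cpu-bind=none is not needed on each srun
srun <trtllm-serve | trtllm-bench> <arguments>
```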
## CPU affinity configuration examples
### Using NUMA-aware autoconfiguration
To explicitly enable the NUMA-aware autoconfiguration feature in TensorRT-LLM, simply set `TLLM_NUMA_AWARE_WORKER_AFFINITY` in the launch script (prior to `trtllm-bench` or `trtllm-serve`) as follows:

```bash
export TLLM_NUMA_AWARE_WORKER_AFFINITY=1
```
Because autoconfiguration happens within TensorRT-LLM itself, it will override any CPU affinity or binding that has been previously set by OpenMPI or Slurm.
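One way to confirm that the autoconfiguration took effect (a minimal sketch, assuming the server was launched via `trtllm-serve` and that `pgrep` is available) is to inspect the affinity mask of each worker process after startup:

```bash
# Each worker/rank should report a CPU list restricted to the NUMA node local
# to its assigned GPU, rather than all CPUs on the host.
for pid in $(pgrep -f trtllm-serve); do
    echo "pid ${pid}: $(grep Cpus_allowed_list /proc/${pid}/status)"
done
```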
### NUMA-aware CPU affinity using bindpcie
The `bindpcie` script is designed to set a per-rank CPU affinity that is ideal for the NUMA topology. While setting `TLLM_NUMA_AWARE_WORKER_AFFINITY=1` usually results in the same CPU affinity, `bindpcie` has the distinct advantage that the optimal affinity is set at the moment TensorRT-LLM is launched, guaranteeing that each worker/rank executes on the optimal NUMA node from inception. The NUMA-aware CPU affinity autoconfiguration mechanism in TensorRT-LLM, on the other hand, is applied by each worker/rank to its own process only after that process has launched. If the worker/rank executes on a NUMA node other than the optimal one between process launch and autoconfiguration, some CPU memory may already have been allocated/touched on what becomes a remote NUMA node after autoconfiguration, potentially degrading performance. In practice, this effect has been observed to have minimal performance impact, but some degradation due to remote NUMA node access is still theoretically possible.
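If remote allocations are a concern, they can be inspected after startup (a minimal sketch, assuming `numastat` from the numactl package is installed and the server runs as `trtllm-serve`):

```bash
# Per-NUMA-node memory usage of each worker process; memory charged to a node
# other than the worker's local node indicates remote allocations.
for pid in $(pgrep -f trtllm-serve); do
    numastat -p "${pid}"
done
```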
The `bindpcie` script can only be applied to deployments that make use of `trtllm-llmapi-launch` within an `sbatch` script. One example of how to apply `bindpcie` to `trtllm-serve` in an `sbatch` script is as follows:
```bash
# Prevent TensorRT-LLM from autoconfiguring or clearing CPU affinity
export TLLM_NUMA_AWARE_WORKER_AFFINITY=0
# Prevent OpenMPI from overriding affinity set by bindpcie
export OMPI_MCA_hwloc_base_binding_policy=none
# Ensure that MPI binding policy propagates to any MPI workers dynamically
# spawned by MPIPoolExecutor
export OMPI_MCA_rmaps_base_inherit=1
# Prevent Slurm from assigning a default CPU affinity
export SLURM_CPU_BIND=none

srun -l \
    --container-image=${CONTAINER_IMAGE} \
    --container-mounts=${MOUNT_DIR}:${MOUNT_DEST} \
    --container-workdir=${WORKDIR} \
    --export=ALL,PYTHONPATH=${SOURCE_ROOT} \
    --mpi=pmix \
    bash -c "
        set -ex
        $PROLOGUE
        export PATH=$PATH:~/.local/bin
        bindpcie trtllm-llmapi-launch \
            trtllm-serve $LOCAL_MODEL \
            ${ADDITIONAL_OPTIONS}
    "
```
**Note:** This is not a complete or exhaustive example of an `sbatch` script to launch `trtllm-serve`; it is only intended to highlight the application of `bindpcie` within an existing `sbatch` script.
### Using numactl

If the same CPU and memory binding is acceptable for every rank, `numactl` can be applied to the `mpirun` launch as a whole:
```bash
# Prevent TensorRT-LLM from autoconfiguring or clearing CPU affinity
export TLLM_NUMA_AWARE_WORKER_AFFINITY=0
# Prevent OpenMPI from overriding affinity set by numactl
export OMPI_MCA_hwloc_base_binding_policy=none
# Ensure that MPI binding policy propagates to any MPI workers dynamically
# spawned by MPIPoolExecutor
export OMPI_MCA_rmaps_base_inherit=1
# Use numactl to specify CPU and memory binding for all ranks (not per-rank)
numactl --physcpubind=0,1,16,17 --membind=0 mpirun --report-bindings --oversubscribe --allow-run-as-root \
    <trtllm-serve | trtllm-bench> <arguments>
```
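The core list and memory node passed to `numactl` should correspond to the NUMA node that is local to the GPU(s) being used. A quick way to find suitable values (a minimal sketch; the exact output format varies by driver version and platform):

```bash
# Show the CPU and NUMA affinity of each GPU as reported by the driver
nvidia-smi topo -m
# Show which CPU core ranges belong to each NUMA node
lscpu | grep -i numa
```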
### Using mpirun
If a manually-specified per-rank CPU affinity is desired when running on a single node with mpirun, this can be achieved most easily using an OpenMPI rankfile. The following is an example of how a rankfile can be used to arbitrarily map each of 4 MPI ranks to a distinct set of 4 cores:
```bash
# Prevent TensorRT-LLM from autoconfiguring or clearing CPU affinity
export TLLM_NUMA_AWARE_WORKER_AFFINITY=0
# Not strictly needed here, since we are overriding with explicit bindings from
# a rankfile
# export OMPI_MCA_hwloc_base_binding_policy=none
# Ensure that MPI binding policy propagates to any MPI workers dynamically
# spawned by MPIPoolExecutor
export OMPI_MCA_rmaps_base_inherit=1

# Create a rankfile to enumerate a set of 4 cores to which each rank is bound
cat > ./rankfile <<EOF
rank 0=localhost slot=0,1,2,3
rank 1=localhost slot=4,5,6,7
rank 2=localhost slot=8,9,10,11
rank 3=localhost slot=12,13,14,15
EOF

# Run with (and report to verify) the bindings from the rankfile
mpirun --rankfile ./rankfile -n 1 --report-bindings --oversubscribe --allow-run-as-root \
    <trtllm-serve | trtllm-bench> <arguments>
```
See the official OpenMPI Documentation for more details on mapping and binding of MPI ranks.