From 80235e53cfdf0bbcab6a7ff564dae42d07b66ae9 Mon Sep 17 00:00:00 2001
From: dhansen-nvidia <218031328+dhansen-nvidia@users.noreply.github.com>
Date: Fri, 30 Jan 2026 17:02:15 -0500
Subject: [PATCH] [None][feat] Add documentation on configuring CPU affinity in TRT-LLM (#10678)

Signed-off-by: Dan Hansen <1+dhansen-nvidia@users.noreply.github.com>
Signed-off-by: dhansen-nvidia <218031328+dhansen-nvidia@users.noreply.github.com>
Co-authored-by: Dan Hansen <1+dhansen-nvidia@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
---
 .../configuring-cpu-affinity.md               | 208 ++++++++++++++++++
 docs/source/index.rst                         |   1 +
 2 files changed, 209 insertions(+)
 create mode 100644 docs/source/deployment-guide/configuring-cpu-affinity.md

diff --git a/docs/source/deployment-guide/configuring-cpu-affinity.md b/docs/source/deployment-guide/configuring-cpu-affinity.md
new file mode 100644
index 0000000000..daccaf4e65
--- /dev/null
+++ b/docs/source/deployment-guide/configuring-cpu-affinity.md
@@ -0,0 +1,208 @@
+# CPU affinity configuration in TensorRT-LLM
+
+## NUMA-aware affinity in TensorRT-LLM
+
+TensorRT-LLM is frequently deployed on
+[NUMA](https://en.wikipedia.org/wiki/Non-uniform_memory_access) systems. To
+ensure consistent and optimal performance on these systems, it is critical to
+set the CPU affinity of the workers/tasks launched as part of a particular
+TRT-LLM instance in a way that minimizes the latency and maximizes the
+bandwidth of CPU↔GPU and CPU↔DRAM communication.
+
+Because TensorRT-LLM does the work of allocating GPU/CUDA devices to ranks, it
+is logically the ideal place for the CPU affinity to be determined and set.
+For this reason, TensorRT-LLM provides a mechanism to automatically set CPU
+affinity according to the NUMA topology. In some deployments, however, the
+user may wish to configure CPU affinity manually (e.g. using
+[numactl](https://github.com/numactl/numactl), [wrappers around
+it](https://github.com/NVIDIA/mlperf-common/blob/main/client/bindpcie), or
+mpirun), so the feature is only activated if it is explicitly enabled or if
+CPU affinity is not already constrained by the user or environment. It is
+controlled by the `TLLM_NUMA_AWARE_WORKER_AFFINITY` environment variable as
+follows:
+
+| `TLLM_NUMA_AWARE_WORKER_AFFINITY` | Behavior                                                                                                             |
+|-----------------------------------|----------------------------------------------------------------------------------------------------------------------|
+| unset                             | Affinity is auto-configured if it is unconstrained, and cleared if it is constrained by the user and/or environment. |
+| `1`                               | Affinity is unconditionally auto-configured.                                                                         |
+| `0` or any other value            | Affinity remains as configured by the user and/or environment.                                                       |
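+
+Before choosing between automatic and manual affinity configuration, it can
+help to inspect the NUMA topology of the system and the affinity that the
+current environment already imposes. The following is a minimal sketch using
+standard Linux and NVIDIA tools; none of these commands are specific to
+TensorRT-LLM:
+
+```bash
+# List the NUMA nodes and the CPUs/memory that belong to each
+numactl --hardware
+
+# Show each GPU's CPU affinity and NUMA affinity
+nvidia-smi topo -m
+
+# Show the CPU affinity of the current shell, i.e. what a launched
+# worker/rank would inherit if nothing changes it
+taskset -cp $$
+```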
+
+## Other environmental considerations
+
+Whether the user chooses to configure CPU affinity manually or to have
+TensorRT-LLM configure it automatically, the environment can also constrain
+the CPU affinity in a way that subverts the user's intent. Both OpenMPI and
+Slurm may configure CPU affinity on their own, so the following additional
+configuration is recommended to avoid this.
+
+### OpenMPI
+
+By default, OpenMPI chooses a rank-wise CPU affinity that is not aware of the
+NUMA topology of the system. Because it does not know which GPU a particular
+rank will be communicating with (this is determined by TRT-LLM at runtime), it
+cannot set the CPU affinity accordingly. For this reason, it is recommended
+that OpenMPI's default binding policy be disabled as follows:
+
+```bash
+export OMPI_MCA_hwloc_base_binding_policy=none
+export OMPI_MCA_rmaps_base_inherit=1
+```
+
+The first environment variable ensures that OpenMPI will not attempt to bind
+or set the affinity of the ranks that are created at launch.
+
+The second ensures that OpenMPI's binding policy propagates to MPI workers
+that are spawned by `mpi4py`'s `MPIPoolExecutor` class within TensorRT-LLM
+(when using mpirun).
+
+### Slurm
+
+If Slurm is configured to use an affinity or cgroup task plugin, it may also
+configure CPU affinity by default in a way that is not aware of the NUMA
+topology. To prevent this, Slurm jobs should be launched as follows:
+
+#### srun
+
+The srun parameters should include `--cpu-bind=none` and exclude
+`--exclusive`:
+
+```bash
+srun --cpu-bind=none ...
+```
+
+#### sbatch
+
+The sbatch script should set the `SLURM_CPU_BIND` environment variable to
+"none":
+
+```bash
+export SLURM_CPU_BIND=none
+```
+
+Note: if this environment variable is set, it is not necessary to supply
+`--cpu-bind=none` to each job step (srun invocation).
+
+## CPU affinity configuration examples
+
+### Using NUMA-aware autoconfiguration
+
+To explicitly enable the NUMA-aware autoconfiguration feature in
+TensorRT-LLM, simply set `TLLM_NUMA_AWARE_WORKER_AFFINITY` in the launch
+script (prior to invoking `trtllm-bench` or `trtllm-serve`) as follows:
+
+```bash
+export TLLM_NUMA_AWARE_WORKER_AFFINITY=1
+```
+
+Because autoconfiguration happens within TensorRT-LLM itself, it will
+override any CPU affinity or binding that has previously been set by OpenMPI
+or Slurm.
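+
+Putting the above together, a minimal single-node launch with
+autoconfiguration enabled might look like the following sketch;
+`${LOCAL_MODEL}` and `${ADDITIONAL_OPTIONS}` are placeholders for the model
+path and any extra serving options rather than specific recommendations:
+
+```bash
+# Enable NUMA-aware affinity autoconfiguration in TensorRT-LLM
+export TLLM_NUMA_AWARE_WORKER_AFFINITY=1
+
+# Per the OpenMPI recommendation above, keep OpenMPI from applying its own
+# rank-wise binding
+export OMPI_MCA_hwloc_base_binding_policy=none
+export OMPI_MCA_rmaps_base_inherit=1
+
+# Launch as usual; each worker/rank configures its own affinity after it
+# starts
+trtllm-serve ${LOCAL_MODEL} ${ADDITIONAL_OPTIONS}
+```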
+
+### NUMA-aware CPU affinity using [bindpcie](https://github.com/NVIDIA/mlperf-common/blob/main/client/bindpcie)
+
+The bindpcie script is designed to set a per-rank CPU affinity that is ideal
+for the NUMA topology. While setting `TLLM_NUMA_AWARE_WORKER_AFFINITY=1`
+usually results in the same CPU affinity being set, bindpcie has the distinct
+advantage that the optimal CPU affinity is set _upon launching_ TensorRT-LLM,
+guaranteeing that each worker/rank executes on the optimal NUMA node from
+inception. The NUMA-aware CPU affinity autoconfiguration mechanism in
+TensorRT-LLM, on the other hand, is applied by each worker/rank to its own
+PID _after_ it has already launched. If the worker/rank executes on a NUMA
+node other than the optimal one at some point between the launch of the
+process and the NUMA-aware autoconfiguration, some CPU memory may have been
+allocated/touched on what becomes a remote NUMA node after autoconfiguration,
+potentially degrading performance. In practice, this effect has been observed
+to have minimal performance impact, but some degradation due to remote NUMA
+node access is still theoretically possible.
+
+The `bindpcie` script can only be applied to deployments that make use of
+`trtllm-llmapi-launch` within an sbatch script. One example of how to apply
+bindpcie to `trtllm-serve` in an sbatch script is as follows:
+
+```bash
+# Prevent TensorRT-LLM from autoconfiguring or clearing CPU affinity
+export TLLM_NUMA_AWARE_WORKER_AFFINITY=0
+
+# Prevent OpenMPI from overriding the affinity set by bindpcie
+export OMPI_MCA_hwloc_base_binding_policy=none
+
+# Ensure that the MPI binding policy propagates to any MPI workers
+# dynamically spawned by MPIPoolExecutor
+export OMPI_MCA_rmaps_base_inherit=1
+
+# Prevent Slurm from assigning a default CPU affinity
+export SLURM_CPU_BIND=none
+
+srun -l \
+    --container-image=${CONTAINER_IMAGE} \
+    --container-mounts=${MOUNT_DIR}:${MOUNT_DEST} \
+    --container-workdir=${WORKDIR} \
+    --export=ALL,PYTHONPATH=${SOURCE_ROOT} \
+    --mpi=pmix \
+    bash -c "
+        set -ex
+        $PROLOGUE
+        export PATH=$PATH:~/.local/bin
+
+        bindpcie trtllm-llmapi-launch \
+            trtllm-serve $LOCAL_MODEL \
+            ${ADDITIONAL_OPTIONS}"
+```
+
+> [!NOTE]
+> This is not a complete or exhaustive example of an sbatch script to launch
+> trtllm-serve; it is only intended to highlight the application of bindpcie
+> within an existing sbatch script.
+
+### Using [numactl](https://github.com/numactl/numactl)
+
+In this approach, numactl applies a single CPU and memory binding to the
+entire mpirun invocation, so all ranks share the same binding:
+
+```bash
+# Prevent TensorRT-LLM from autoconfiguring or clearing CPU affinity
+export TLLM_NUMA_AWARE_WORKER_AFFINITY=0
+
+# Prevent OpenMPI from overriding the affinity set by numactl
+export OMPI_MCA_hwloc_base_binding_policy=none
+
+# Ensure that the MPI binding policy propagates to any MPI workers
+# dynamically spawned by MPIPoolExecutor
+export OMPI_MCA_rmaps_base_inherit=1
+
+# Use numactl to specify CPU and memory binding for all ranks (not per-rank)
+numactl --physcpubind=0,1,16,17 --membind=0 mpirun --report-bindings --oversubscribe --allow-run-as-root \
+    <trtllm-serve or trtllm-bench launch command>
+```
+
+### Using mpirun
+
+If a manually specified per-rank CPU affinity is desired when running on a
+single node with mpirun, this can be achieved most easily using an OpenMPI
+rankfile. The following is an example of how a rankfile can be used to
+arbitrarily map each of 4 MPI ranks to a distinct set of 4 cores:
+
+```bash
+# Prevent TensorRT-LLM from autoconfiguring or clearing CPU affinity
+export TLLM_NUMA_AWARE_WORKER_AFFINITY=0
+
+# Not strictly needed here, since we are overriding with explicit bindings
+# from a rankfile
+# export OMPI_MCA_hwloc_base_binding_policy=none
+
+# Ensure that the MPI binding policy propagates to any MPI workers
+# dynamically spawned by MPIPoolExecutor
+export OMPI_MCA_rmaps_base_inherit=1
+
+# Create a rankfile to enumerate a set of 4 cores to which each rank is bound
+# (the core numbers are illustrative; choose cores that match the NUMA
+# topology of the system)
+cat > ./rankfile <<EOF
+rank 0=$(hostname) slot=0-3
+rank 1=$(hostname) slot=4-7
+rank 2=$(hostname) slot=16-19
+rank 3=$(hostname) slot=20-23
+EOF
+
+# Launch with the per-rank bindings taken from the rankfile
+mpirun --report-bindings --rankfile ./rankfile \
+    <trtllm-serve or trtllm-bench launch command>
+```
+
+See the official [OpenMPI documentation](https://www.open-mpi.org/doc/) for
+more details on mapping and binding of MPI ranks.
diff --git a/docs/source/index.rst b/docs/source/index.rst
index a4ca7804e0..141c164562 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -26,6 +26,7 @@ Welcome to TensorRT LLM's Documentation!
    examples/trtllm_serve_examples
    examples/dynamo_k8s_example.rst
    deployment-guide/index.rst
+   deployment-guide/configuring-cpu-affinity.md
 
 .. toctree::
    :maxdepth: 2