mirror of
https://github.com/NVIDIA/TensorRT-LLM.git
synced 2026-02-18 16:55:08 +08:00
[None][feat] Add documentation on configuring CPU affinity in TRT-LLM (#10678)
Signed-off-by: Dan Hansen <1+dhansen-nvidia@users.noreply.github.com>
Signed-off-by: dhansen-nvidia <218031328+dhansen-nvidia@users.noreply.github.com>
Co-authored-by: Dan Hansen <1+dhansen-nvidia@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
This commit is contained in:
parent
5d73194ffb
commit
80235e53cf
208 docs/source/deployment-guide/configuring-cpu-affinity.md Normal file
@@ -0,0 +1,208 @@
# CPU Affinity configuration in TensorRT-LLM

## NUMA-aware affinity in TensorRT-LLM

TensorRT-LLM is frequently deployed on
[NUMA](https://en.wikipedia.org/wiki/Non-uniform_memory_access) systems. To
ensure consistent and optimal performance on these systems, it is critical to
set the CPU affinity of the workers/tasks launched as part of a particular
TRT-LLM instance so as to minimize the latency and maximize the bandwidth of
CPU↔GPU and CPU↔DRAM communication.

Because TensorRT-LLM does the work of allocating GPU/CUDA devices to ranks, it
is logically the ideal place for the CPU affinity to be determined and set. For
this reason, TensorRT-LLM provides a mechanism to automatically set CPU
affinity according to NUMA topology. In some situations/deployments, the user
may wish to configure CPU affinity manually (e.g., using
[numactl](https://github.com/numactl/numactl),
[wrappers around the same](https://github.com/NVIDIA/mlperf-common/blob/main/client/bindpcie),
or mpirun). To accommodate such deployments, this feature is only activated if
it is explicitly enabled or if CPU affinity is not already constrained by the
user or environment. It is controlled by the `TLLM_NUMA_AWARE_WORKER_AFFINITY`
environment variable as follows:

| `TLLM_NUMA_AWARE_WORKER_AFFINITY` | Behavior |
|-----------------------------------|----------|
| `<unset>` | Affinity is auto-configured if it is unconstrained, and cleared if it is constrained by the user and/or environment |
| `1` | Affinity is unconditionally auto-configured |
| `0` or any other value | Affinity remains as configured by the user and/or environment |

## Other environmental considerations

Whether the user configures CPU affinity manually or has TensorRT-LLM configure
it automatically, the environment can also constrain the CPU affinity in a way
that subverts the user's intent. Both OpenMPI and Slurm may configure CPU
affinity, so the following additional configuration is recommended to avoid
this.

### OpenMPI

By default, OpenMPI chooses a rank-wise CPU affinity that is not sensitized to
the NUMA topology of the system. Because it does not know which GPU a
particular rank will be communicating with (this is determined by TRT-LLM at
runtime), it cannot set the CPU affinity accordingly. For this reason, it is
recommended that OpenMPI's default binding policy be disabled as follows:

```bash
export OMPI_MCA_hwloc_base_binding_policy=none
export OMPI_MCA_rmaps_base_inherit=1
```

The first environment variable ensures that OpenMPI will not attempt to bind or
set the affinity of the ranks that are created at launch.

The second ensures that OpenMPI's binding policy will propagate to MPI workers
that are spawned by `mpi4py`'s `MPIPoolExecutor` class within TensorRT-LLM
(when using mpirun).
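Equivalently, when launching with mpirun directly, the same two settings can be
passed on the command line (a sketch; OpenMPI reads `OMPI_MCA_<name>`
environment variables and `--mca <name>` options interchangeably, and the
launch command is a placeholder):

```bash
# Disable OpenMPI's default rank binding and let spawned workers inherit the policy
mpirun --mca hwloc_base_binding_policy none \
       --mca rmaps_base_inherit 1 \
       <trtllm-serve | trtllm-bench> <arguments>
```
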
### Slurm

If Slurm is configured to use an affinity or cgroup task plugin, then Slurm may
also configure CPU affinity by default in a way that is not sensitized to NUMA
topology. Whether this applies to a given cluster can be checked as sketched
below. To prevent it, Slurm jobs should be launched as described in the
following subsections.
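As a quick check (a sketch; assumes `scontrol` access to the cluster
configuration), the configured task plugin(s) can be listed. Values such as
`task/affinity` or `task/cgroup` indicate that Slurm may apply CPU binding by
default:

```bash
# Show which task plugin(s) the Slurm cluster uses
scontrol show config | grep -i taskplugin
```
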
#### srun

The srun parameters should include `--cpu-bind=none` and exclude `--exclusive`:

```bash
srun --cpu-bind=none ...
```

#### sbatch

The sbatch script should set the `SLURM_CPU_BIND` environment variable to `none`:

```bash
export SLURM_CPU_BIND=none
```

Note: if this environment variable is set, it is not necessary to supply
`--cpu-bind=none` to each job step (srun invocation).
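For example, a minimal sketch of where the export fits in an sbatch script (the
`#SBATCH` directives and launch command are placeholders):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4

# Disable Slurm's default CPU binding for every job step in this script
export SLURM_CPU_BIND=none

# Job steps launched below now run without a Slurm-imposed CPU mask
srun <srun options> <launch command>
```
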
## CPU affinity configuration examples

### Using NUMA-aware autoconfiguration

To explicitly enable the NUMA-aware autoconfiguration feature in TensorRT-LLM,
simply set `TLLM_NUMA_AWARE_WORKER_AFFINITY` in the launch script (prior to
`trtllm-bench` or `trtllm-serve`) as follows:

```bash
export TLLM_NUMA_AWARE_WORKER_AFFINITY=1
```

Because autoconfiguration happens within TensorRT-LLM itself, it will override
any CPU affinity or binding that has been previously set by OpenMPI or Slurm.
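To confirm the affinity that was actually applied, the NUMA topology and the
per-worker CPU mask can be inspected from a shell (a sketch; assumes `numactl`
and `taskset` are installed, and `<worker_pid>` is a placeholder for a
worker/rank PID):

```bash
# Show the system's NUMA nodes and which CPUs/memory belong to each
numactl --hardware

# Show the CPU affinity mask of a running TensorRT-LLM worker/rank
taskset -cp <worker_pid>
```
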
### NUMA-aware CPU affinity using [bindpcie](https://github.com/NVIDIA/mlperf-common/blob/main/client/bindpcie)

The `bindpcie` script is designed to set a per-rank CPU affinity that is ideal
for the NUMA topology. While setting `TLLM_NUMA_AWARE_WORKER_AFFINITY=1`
usually results in the same CPU affinity being set, `bindpcie` has the distinct
advantage that the optimal CPU affinity gets set _upon launching_ TensorRT-LLM,
guaranteeing that each worker/rank executes on the optimal NUMA node from
inception. The NUMA-aware CPU affinity autoconfiguration mechanism in
TensorRT-LLM, on the other hand, is triggered by each worker/rank on its own
PID _after_ it has already launched. If the worker/rank executes on a NUMA node
other than the optimal one at some point between the launch of the process and
the NUMA-aware autoconfiguration, some CPU memory may have been
allocated/touched on what will become a remote NUMA node after the point of
autoconfiguration, potentially degrading performance. In practice, this effect
has been observed to have minimal performance impact, but some degradation due
to remote NUMA node access is still theoretically possible.

The `bindpcie` script can only be applied to deployments that make use of
`trtllm-llmapi-launch` within an sbatch script. One example of how to apply
`bindpcie` to `trtllm-serve` in an sbatch script is as follows:
```bash
# Prevent TensorRT-LLM from autoconfiguring or clearing CPU affinity
export TLLM_NUMA_AWARE_WORKER_AFFINITY=0

# Prevent OpenMPI from overriding affinity set by bindpcie
export OMPI_MCA_hwloc_base_binding_policy=none

# Ensure that MPI binding policy propagates to any MPI workers dynamically
# spawned by MPIPoolExecutor
export OMPI_MCA_rmaps_base_inherit=1

# Prevent Slurm from assigning a default CPU affinity
export SLURM_CPU_BIND=none

srun -l \
    --container-image=${CONTAINER_IMAGE} \
    --container-mounts=${MOUNT_DIR}:${MOUNT_DEST} \
    --container-workdir=${WORKDIR} \
    --export=ALL,PYTHONPATH=${SOURCE_ROOT} \
    --mpi=pmix \
    bash -c "
        set -ex
        $PROLOGUE
        export PATH=$PATH:~/.local/bin

        bindpcie trtllm-llmapi-launch \
            trtllm-serve $LOCAL_MODEL \
            ${ADDITIONAL_OPTIONS}
    "
```

> [!NOTE]
> This is not a complete or exhaustive example of an sbatch script to launch
> trtllm-serve and is only intended to highlight the application of bindpcie
> within an existing sbatch script.
### Using [numactl](https://github.com/numactl/numactl)

```bash
# Prevent TensorRT-LLM from autoconfiguring or clearing CPU affinity
export TLLM_NUMA_AWARE_WORKER_AFFINITY=0

# Prevent OpenMPI from overriding affinity set by numactl
export OMPI_MCA_hwloc_base_binding_policy=none

# Ensure that MPI binding policy propagates to any MPI workers dynamically
# spawned by MPIPoolExecutor
export OMPI_MCA_rmaps_base_inherit=1

# Use numactl to specify CPU and memory binding for all ranks (not per-rank)
numactl --physcpubind=0,1,16,17 --membind=0 mpirun --report-bindings --oversubscribe --allow-run-as-root \
    <trtllm-serve | trtllm-bench> <arguments>
```
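To pick sensible values for `--physcpubind` and `--membind`, the NUMA node and
CPU affinity associated with each GPU can be inspected first (a sketch; assumes
`nvidia-smi` is available, and the PCI address below is a placeholder):

```bash
# Show the GPU topology matrix, including CPU affinity and NUMA affinity per GPU
nvidia-smi topo -m

# Alternatively, query the NUMA node of a specific GPU via sysfs
# (replace 0000:3b:00.0 with the GPU's actual PCI address)
cat /sys/bus/pci/devices/0000:3b:00.0/numa_node
```
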
### Using mpirun

If a manually specified per-rank CPU affinity is desired when running on a
single node with mpirun, this can be achieved most easily using an OpenMPI
rankfile. The following is an example of how a rankfile can be used to
arbitrarily map each of 4 MPI ranks to a distinct set of 4 cores:

```bash
# Prevent TensorRT-LLM from autoconfiguring or clearing CPU affinity
export TLLM_NUMA_AWARE_WORKER_AFFINITY=0

# Not strictly needed here, since we are overriding with explicit bindings from
# a rankfile
# export OMPI_MCA_hwloc_base_binding_policy=none

# Ensure that MPI binding policy propagates to any MPI workers dynamically
# spawned by MPIPoolExecutor
export OMPI_MCA_rmaps_base_inherit=1

# Create a rankfile to enumerate a set of 4 cores to which each rank is bound
cat > ./rankfile <<EOF
rank 0=localhost slot=0,1,2,3
rank 1=localhost slot=4,5,6,7
rank 2=localhost slot=8,9,10,11
rank 3=localhost slot=12,13,14,15
EOF

# Run with (and report to verify) the bindings from the rankfile
mpirun --rankfile ./rankfile -n 1 --report-bindings --oversubscribe --allow-run-as-root \
    <trtllm-serve | trtllm-bench> <arguments>
```
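When choosing which cores to list in the rankfile, the core-to-NUMA-node layout
can be checked first (a sketch; assumes `lscpu` from util-linux is available):

```bash
# List the NUMA nodes and the CPU IDs belonging to each
lscpu | grep -i "numa node"
```
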
See the official [OpenMPI Documentation](https://www.open-mpi.org/doc/) for
more details on mapping and binding of MPI ranks.
@@ -26,6 +26,7 @@ Welcome to TensorRT LLM's Documentation!
   examples/trtllm_serve_examples
   examples/dynamo_k8s_example.rst
   deployment-guide/index.rst
   deployment-guide/configuring-cpu-affinity.md

.. toctree::
   :maxdepth: 2