# CPU Affinity configuration in TensorRT-LLM

## NUMA-aware affinity in TensorRT-LLM
TensorRT-LLM is frequently deployed on NUMA systems. To ensure consistent and optimal performance on these systems, it is critical to set the CPU affinity of the workers/tasks launched as part of a particular TensorRT-LLM instance so as to minimize the latency and maximize the bandwidth of CPU↔GPU and CPU↔DRAM communication.
Because TensorRT-LLM does the work of allocating GPU/CUDA devices to ranks, it is logically the ideal place for the CPU affinity to be determined and set. For this reason, TensorRT-LLM provides a mechanism to automatically set CPU affinity according to NUMA topology. In some deployments, however, the user may wish to configure CPU affinity manually (e.g. using `numactl`, wrappers around it, or `mpirun`), so this feature is only activated if it is explicitly enabled or if CPU affinity is not already constrained by the user or environment. It is controlled by the `TLLM_NUMA_AWARE_WORKER_AFFINITY` environment variable as follows:
| `TLLM_NUMA_AWARE_WORKER_AFFINITY` | Behavior |
|---|---|
| unset (default) | Affinity is auto-configured if it is unconstrained, and cleared if it is constrained by the user and/or environment. |
| `1` | Affinity is unconditionally auto-configured. |
| `0` or any other value | Affinity remains as configured by the user and/or environment. |
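To check whether CPU affinity is already constrained before relying on the default behavior, the current binding of the launch shell can be inspected (a minimal sketch, assuming the `taskset` utility from util-linux and `numactl` are available):

```bash
# An affinity list covering all host CPUs means affinity is unconstrained;
# a narrower list means it has been constrained by the user or environment.
taskset -cp $$
# numactl also reports the current CPU and memory binding of the shell
numactl --show
```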
## Other environmental considerations
Whether the user chooses to configure CPU affinity manually or have TensorRT-LLM configure it automatically, the environment can also constrain the CPU affinity in a way that subverts the user's intent. Both OpenMPI and Slurm may configure CPU affinity, so the following additional configuration is recommended to avoid this.
### OpenMPI
By default, OpenMPI chooses a rank-wise CPU affinity that is not sensitized to the NUMA topology of the system. Because it does not know which GPU a particular rank will be communicating with (this is determined by TensorRT-LLM at runtime), it cannot set the CPU affinity accordingly. For this reason, it is recommended that OpenMPI's default binding policy be disabled as follows:
```bash
export OMPI_MCA_hwloc_base_binding_policy=none
export OMPI_MCA_rmaps_base_inherit=1
```
The first environment variable ensures that OpenMPI will not attempt to bind or set the affinity of the ranks that are created at launch.
The second ensures that OpenMPI's binding policy will propagate to MPI workers
that are spawned by mpi4py's MPIPoolExecutor class within TensorRT-LLM
(when using mpirun).
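As a quick sanity check (a minimal sketch, assuming the `taskset` utility from util-linux is available in the environment), each rank can print its own affinity list to confirm that OpenMPI is no longer binding it:

```bash
# With the binding policy disabled, every rank should report the full set of
# host CPUs rather than a narrow per-rank core list.
export OMPI_MCA_hwloc_base_binding_policy=none
export OMPI_MCA_rmaps_base_inherit=1
mpirun -n 2 bash -c 'echo "rank ${OMPI_COMM_WORLD_RANK}: $(taskset -cp $$)"'
```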
### Slurm
If Slurm is configured to use an affinity or cgroup task plugin, then Slurm may also configure CPU affinity by default in a way that is not sensitized to NUMA topology. To prevent this, Slurm jobs should be launched as follows:
#### srun
The `srun` parameters should include `--cpu-bind=none` and exclude `--exclusive`:
```bash
srun --cpu-bind=none ...
```
#### sbatch
The `sbatch` script should set the `SLURM_CPU_BIND` environment variable to `none`:
```bash
export SLURM_CPU_BIND=none
```
Note: if this environment variable is set, it is not necessary to supply `--cpu-bind=none` to each job step (`srun` invocation).
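For illustration, a minimal sketch of how this fits into an `sbatch` script (the job name, node count, and placeholder launch command are hypothetical, not part of the original example):

```bash
#!/bin/bash
#SBATCH --job-name=trtllm-example   # hypothetical job name
#SBATCH --nodes=1                   # hypothetical node count

# Let the launched processes (or TensorRT-LLM itself) manage CPU affinity
# instead of Slurm's task plugin
export SLURM_CPU_BIND=none

# With SLURM_CPU_BIND exported, --cpu-bind=none is not needed on each srun
srun <trtllm-serve | trtllm-bench> <arguments>
```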
## CPU affinity configuration examples
### Using NUMA-aware autoconfiguration
To explicitly enable the NUMA-aware autoconfiguration feature in TensorRT-LLM, simply set `TLLM_NUMA_AWARE_WORKER_AFFINITY` in the launch script (prior to `trtllm-bench` or `trtllm-serve`) as follows:

```bash
export TLLM_NUMA_AWARE_WORKER_AFFINITY=1
```
Because autoconfiguration happens within TensorRT-LLM itself, it will override any CPU affinity or binding that has been previously set by OpenMPI or Slurm.
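One way to confirm that the autoconfiguration took effect (a minimal sketch, assuming the server was launched via `trtllm-serve` and that `pgrep` is available) is to inspect the affinity mask of each worker process after startup:

```bash
# Each worker/rank should report a CPU list restricted to the NUMA node local
# to its assigned GPU, rather than all CPUs on the host.
for pid in $(pgrep -f trtllm-serve); do
    echo "pid ${pid}: $(grep Cpus_allowed_list /proc/${pid}/status)"
done
```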
### NUMA-aware CPU affinity using bindpcie
The `bindpcie` script is designed to set a per-rank CPU affinity that is ideal for the NUMA topology. While setting `TLLM_NUMA_AWARE_WORKER_AFFINITY=1` usually results in the same CPU affinity, `bindpcie` has the distinct advantage that the optimal affinity is set at the moment TensorRT-LLM is launched, guaranteeing that each worker/rank executes on the optimal NUMA node from inception. The NUMA-aware CPU affinity autoconfiguration mechanism in TensorRT-LLM, on the other hand, is applied by each worker/rank to its own process only after that process has launched. If the worker/rank executes on a NUMA node other than the optimal one between process launch and autoconfiguration, some CPU memory may already have been allocated/touched on what becomes a remote NUMA node after autoconfiguration, potentially degrading performance. In practice, this effect has been observed to have minimal performance impact, but some degradation due to remote NUMA node access is still theoretically possible.
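If remote allocations are a concern, they can be inspected after startup (a minimal sketch, assuming `numastat` from the numactl package is installed and the server runs as `trtllm-serve`):

```bash
# Per-NUMA-node memory usage of each worker process; memory charged to a node
# other than the worker's local node indicates remote allocations.
for pid in $(pgrep -f trtllm-serve); do
    numastat -p "${pid}"
done
```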
The `bindpcie` script can only be applied to deployments that make use of `trtllm-llmapi-launch` within an `sbatch` script. One example of how to apply `bindpcie` to `trtllm-serve` in an `sbatch` script is as follows:
```bash
# Prevent TensorRT-LLM from autoconfiguring or clearing CPU affinity
export TLLM_NUMA_AWARE_WORKER_AFFINITY=0
# Prevent OpenMPI from overriding affinity set by bindpcie
export OMPI_MCA_hwloc_base_binding_policy=none
# Ensure that MPI binding policy propagates to any MPI workers dynamically
# spawned by MPIPoolExecutor
export OMPI_MCA_rmaps_base_inherit=1
# Prevent Slurm from assigning a default CPU affinity
export SLURM_CPU_BIND=none

srun -l \
    --container-image=${CONTAINER_IMAGE} \
    --container-mounts=${MOUNT_DIR}:${MOUNT_DEST} \
    --container-workdir=${WORKDIR} \
    --export=ALL,PYTHONPATH=${SOURCE_ROOT} \
    --mpi=pmix \
    bash -c "
        set -ex
        $PROLOGUE
        export PATH=$PATH:~/.local/bin
        bindpcie trtllm-llmapi-launch \
            trtllm-serve $LOCAL_MODEL \
            ${ADDITIONAL_OPTIONS}
    "
```
**Note:** This is not a complete or exhaustive example of an `sbatch` script to launch `trtllm-serve`; it is only intended to highlight the application of `bindpcie` within an existing `sbatch` script.
### Using numactl

If the same CPU and memory binding is acceptable for every rank, `numactl` can be applied to the `mpirun` launch as a whole:
```bash
# Prevent TensorRT-LLM from autoconfiguring or clearing CPU affinity
export TLLM_NUMA_AWARE_WORKER_AFFINITY=0
# Prevent OpenMPI from overriding affinity set by numactl
export OMPI_MCA_hwloc_base_binding_policy=none
# Ensure that MPI binding policy propagates to any MPI workers dynamically
# spawned by MPIPoolExecutor
export OMPI_MCA_rmaps_base_inherit=1
# Use numactl to specify CPU and memory binding for all ranks (not per-rank)
numactl --physcpubind=0,1,16,17 --membind=0 mpirun --report-bindings --oversubscribe --allow-run-as-root \
    <trtllm-serve | trtllm-bench> <arguments>
```
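The core list and memory node passed to `numactl` should correspond to the NUMA node that is local to the GPU(s) being used. A quick way to find suitable values (a minimal sketch; the exact output format varies by driver version and platform):

```bash
# Show the CPU and NUMA affinity of each GPU as reported by the driver
nvidia-smi topo -m
# Show which CPU core ranges belong to each NUMA node
lscpu | grep -i numa
```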
### Using mpirun
If a manually-specified per-rank CPU affinity is desired when running on a single node with mpirun, this can be achieved most easily using an OpenMPI rankfile. The following is an example of how a rankfile can be used to arbitrarily map each of 4 MPI ranks to a distinct set of 4 cores:
```bash
# Prevent TensorRT-LLM from autoconfiguring or clearing CPU affinity
export TLLM_NUMA_AWARE_WORKER_AFFINITY=0
# Not strictly needed here, since we are overriding with explicit bindings from
# a rankfile
# export OMPI_MCA_hwloc_base_binding_policy=none
# Ensure that MPI binding policy propagates to any MPI workers dynamically
# spawned by MPIPoolExecutor
export OMPI_MCA_rmaps_base_inherit=1

# Create a rankfile to enumerate a set of 4 cores to which each rank is bound
cat > ./rankfile <<EOF
rank 0=localhost slot=0,1,2,3
rank 1=localhost slot=4,5,6,7
rank 2=localhost slot=8,9,10,11
rank 3=localhost slot=12,13,14,15
EOF

# Run with (and report to verify) the bindings from the rankfile
mpirun --rankfile ./rankfile -n 1 --report-bindings --oversubscribe --allow-run-as-root \
    <trtllm-serve | trtllm-bench> <arguments>
```
See the official OpenMPI Documentation for more details on mapping and binding of MPI ranks.