[TRTLLM-9762] [doc] Update documents for GB300 NVL72 (#9987)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Kaiyu Xie · 2025-12-15 11:30:28 +08:00 · committed by GitHub
parent b57650f1e6
commit 0788635d6c
8 changed files with 15 additions and 14 deletions


@@ -133,6 +133,7 @@ In addition, older architectures can have limitations for newer software release
* - GPU Model Architectures
-
- [NVIDIA GB200 NVL72](https://www.nvidia.com/en-us/data-center/gb200-nvl72/)
+- [NVIDIA GB300 NVL72](https://www.nvidia.com/en-us/data-center/gb300-nvl72/)
- [NVIDIA Blackwell Architecture](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/)
- [NVIDIA Grace Hopper Superchip](https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/)
- [NVIDIA Hopper Architecture](https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/)


@@ -4,7 +4,7 @@
## About TensorRT LLM
-[TensorRT LLM](https://developer.nvidia.com/tensorrt) is NVIDIA's comprehensive open-source library for accelerating and optimizing inference performance of the latest large language models (LLMs) on NVIDIA GPUs.
+[TensorRT LLM](https://developer.nvidia.com/tensorrt) is NVIDIA's comprehensive open-source library for accelerating and optimizing inference performance of the latest large language models (LLMs) on NVIDIA GPUs.
## Key Capabilities
@@ -40,7 +40,7 @@ TensorRT LLM strives to support the most popular models on **Day 0**.
### 🚀 **Advanced Optimization & Production Features**
- **[In-Flight Batching & Paged Attention](./features/paged-attention-ifb-scheduler.md)**: In-flight batching eliminates wait times by dynamically managing request execution, processing context and generation phases together for maximum GPU utilization and reduced latency.
- **[Multi-GPU Multi-Node Inference](./features/parallel-strategy.md)**: Seamless distributed inference with tensor, pipeline, and expert parallelism across multiple GPUs and nodes through the Model Definition API.
-- **[Advanced Quantization](./features/quantization.md)**:
+- **[Advanced Quantization](./features/quantization.md)**:
- **FP4 Quantization**: Native support on NVIDIA B200 GPUs with optimized FP4 kernels
- **FP8 Quantization**: Automatic conversion on NVIDIA H100 GPUs leveraging Hopper architecture
- **[Speculative Decoding](./features/speculative-decoding.md)**: Multiple algorithms including EAGLE, MTP and NGram
@@ -54,7 +54,7 @@ TensorRT LLM strives to support the most popular models on **Day 0**.
### 🔧 **Latest GPU Architecture Support**
TensorRT LLM supports the full spectrum of NVIDIA GPU architectures:
-- **NVIDIA Blackwell**: B200, GB200, RTX Pro 6000 SE with FP4 optimization
+- **NVIDIA Blackwell**: B200, GB200, B300, GB300, and RTX Pro 6000 SE with FP4 optimization
- **NVIDIA Hopper**: H100, H200, GH200 with FP8 acceleration
- **NVIDIA Ada Lovelace**: L40/L40S, RTX 40 series with FP8 acceleration
- **NVIDIA Ampere**: A100, RTX 30 series for production workloads


@@ -31,7 +31,7 @@ slurm:
job_name: "<job_name>"
extra_args: "" # Additional SLURM arguments (e.g., "--gres=gpu:4 --exclude=node1")
set_segment: true # Optional: whether to set the segment for the job
-numa_bind: true # Enable NUMA binding for GB200 NVL72
+numa_bind: true # Enable NUMA binding for GB200/GB300 NVL72
```
### 2. Benchmark Configuration


@@ -7,7 +7,7 @@ slurm:
job_name: "<job_name>"
extra_args: "" # Cluster specific arguments, e.g. "--gres=gpu:4 --exclude=node1,node2"
set_segment: true # Optional: whether to set the segment for the job
-numa_bind: true # Only enable for GB200 NVL72
+numa_bind: true # Only enable for GB200/GB300 NVL72
# Benchmark Mode
benchmark:


@@ -27,10 +27,10 @@ done
if [ "${numa_bind}" = "true" ]; then
numa_bind_cmd="numactl -m 0,1"
echo "numactl -m 0,1 - Only allocate memory from nodes on GB200"
echo "numactl -m 0,1 - Only allocate memory from nodes on GB200/GB300 NVL72"
else
numa_bind_cmd=""
echo "Not binding memory. If on GB200, use \"numactl -m 0,1\" to only allocate memory from nodes."
echo "Not binding memory. If on GB200/GB300 NVL72, use \"numactl -m 0,1\" to only allocate memory from nodes."
fi
if [ "${benchmark_mode}" = "gen_only" ]; then


@@ -21,13 +21,13 @@ Wide-EP solves these challenges through:
### Prerequisites
-* GPU: GB200 NVL72, H20, or RTX 6000D.
+* GPU: GB200 NVL72, GB300 NVL72, H20, or RTX 6000D.
* OS: Linux
* Drivers: CUDA Driver 575 or Later
* Docker with NVIDIA Container Toolkit installed
* Python3 and python3-pip (Optional, for accuracy evaluation only)
-For GB200 NVL72, to make sure that Multi-Node NVLink (MNNVL) is correctly set up, check if the path `/dev/nvidia-caps-imex-channels` exists in the container. If the path doesn't exist, mount it when launching the Docker container.
+For GB200/GB300 NVL72, to make sure that Multi-Node NVLink (MNNVL) is correctly set up, check if the path `/dev/nvidia-caps-imex-channels` exists in the container. If the path doesn't exist, mount it when launching the Docker container.
For more information on NVIDIA IMEX service for NVLink networks, refer to https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html.
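A minimal sketch of that mount, not taken from this commit (the image name is a placeholder; the path is a directory of IMEX channel devices, so a bind mount is sufficient):

```bash
# Check for the IMEX channel path inside the container; if it is missing,
# relaunch the container with the path bind-mounted from the host.
docker run --rm --gpus all \
  -v /dev/nvidia-caps-imex-channels:/dev/nvidia-caps-imex-channels \
  <tensorrt_llm_image> ls /dev/nvidia-caps-imex-channels
```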
@@ -108,16 +108,16 @@ If `never` is highlighted, enable Transparent HugePages by the following command
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
```
-### GB200 NUMA binding
+### GB200/GB300 NVL72 NUMA binding
-GPU memory is also on NUMA nodes on GB200 and the system can also use that. Bind memory to CPU nodes to avoid GPU memory being used as host memory.
+GPU memory is also on NUMA nodes on GB200/GB300 NVL72 and the system can also use that. Bind memory to CPU nodes to avoid GPU memory being used as host memory.
```bash
numactl -m 0,1 <command>
```
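Before binding, the topology can be inspected with standard numactl tooling (not part of this commit); the GPU memory shows up as NUMA nodes beyond the CPU nodes 0 and 1:

```bash
numactl --hardware   # list all NUMA nodes and their memory sizes
numactl --show       # show the memory binding policy of the current shell
```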
### Shared Memory on EPLB
-To achieve online load balancing, all expert weights are stored in shared host memory. Four ranks on the same GB200 node share the same expert weights to save memory.
+To achieve online load balancing, all expert weights are stored in shared host memory. Four ranks on the same GB200/GB300 NVL72 node share the same expert weights to save memory.
There is one environment variable `TRTLLM_EPLB_SHM_NAME` to specify the base name of the shared memory. This environment variable may need to be specified if there are multiple instances on one node. If not, you can ignore it.
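For example, two instances sharing a node could each be given a distinct base name; a hypothetical sketch (the launch commands are placeholders, not from this document):

```bash
# Each instance gets its own shared-memory base name so their EPLB expert
# weights do not collide.
TRTLLM_EPLB_SHM_NAME=eplb_inst0 <launch command for instance 0> &
TRTLLM_EPLB_SHM_NAME=eplb_inst1 <launch command for instance 1> &
wait
```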


@@ -51,7 +51,7 @@ Before running benchmarks, ensure you have:
1. **SLURM Cluster Access**: Valid account and partition allocation
2. **Container Environment**:
- NVIDIA Container Toolkit configured
-- Required device mappings (e.g., `/dev/nvidia-caps-imex-channels` for GB200, `/dev/gdrdrv` for GDRCopy)
+- Required device mappings (e.g., `/dev/nvidia-caps-imex-channels` for GB200/GB300 NVL72, `/dev/gdrdrv` for GDRCopy); see the sketch after this list
3. **Model Files**: Checkpoint files accessible from all cluster nodes
4. **Configuration**: Updated `config.yaml` with your cluster-specific settings
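A sketch of how those device paths might be passed through on a SLURM cluster that uses the pyxis/enroot container plugin; the image name and mount list are assumptions, not taken from this document:

```bash
# Assumed pyxis-style launch; adjust the image, mounts, and MPI plugin for your cluster.
srun --container-image=<tensorrt_llm_image> \
     --container-mounts=/dev/nvidia-caps-imex-channels:/dev/nvidia-caps-imex-channels,/dev/gdrdrv:/dev/gdrdrv \
     --mpi=pmix \
     nvidia-smi
```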


@@ -6,7 +6,7 @@ slurm:
job_time: "02:00:00"
job_name: "<job_name>"
extra_args: "" # Cluster specific arguments, e.g. "--gres=gpu:4 --exclude=node1,node2"
-numa_bind: true # Only enable for GB200 NVL72
+numa_bind: true # Only enable for GB200/GB300 NVL72
# Benchmark Mode
benchmark: