Mirror of https://github.com/NVIDIA/TensorRT-LLM.git, synced 2026-01-14 06:27:45 +08:00
[TRTLLM-9762] [doc] Update documents for GB300 NVL72 (#9987)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
parent: b57650f1e6
commit: 0788635d6c

@@ -133,6 +133,7 @@ In addition, older architectures can have limitations for newer software release
 * - GPU Model Architectures
 -
 - [NVIDIA GB200 NVL72](https://www.nvidia.com/en-us/data-center/gb200-nvl72/)
+- [NVIDIA GB300 NVL72](https://www.nvidia.com/en-us/data-center/gb300-nvl72/)
 - [NVIDIA Blackwell Architecture](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/)
 - [NVIDIA Grace Hopper Superchip](https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/)
 - [NVIDIA Hopper Architecture](https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/)

@@ -4,7 +4,7 @@

 ## About TensorRT LLM

-[TensorRT LLM](https://developer.nvidia.com/tensorrt) is NVIDIA's comprehensive open-source library for accelerating and optimizing inference performance of the latest large language models (LLMs) on NVIDIA GPUs.
+[TensorRT LLM](https://developer.nvidia.com/tensorrt) is NVIDIA's comprehensive open-source library for accelerating and optimizing inference performance of the latest large language models (LLMs) on NVIDIA GPUs.

 ## Key Capabilities

@@ -40,7 +40,7 @@ TensorRT LLM strives to support the most popular models on **Day 0**.
 ### 🚀 **Advanced Optimization & Production Features**
 - **[In-Flight Batching & Paged Attention](./features/paged-attention-ifb-scheduler.md)**: In-flight batching eliminates wait times by dynamically managing request execution, processing context and generation phases together for maximum GPU utilization and reduced latency.
 - **[Multi-GPU Multi-Node Inference](./features/parallel-strategy.md)**: Seamless distributed inference with tensor, pipeline, and expert parallelism across multiple GPUs and nodes through the Model Definition API.
-- **[Advanced Quantization](./features/quantization.md)**:
+- **[Advanced Quantization](./features/quantization.md)**:
   - **FP4 Quantization**: Native support on NVIDIA B200 GPUs with optimized FP4 kernels
   - **FP8 Quantization**: Automatic conversion on NVIDIA H100 GPUs leveraging Hopper architecture
 - **[Speculative Decoding](./features/speculative-decoding.md)**: Multiple algorithms including EAGLE, MTP and NGram

@@ -54,7 +54,7 @@ TensorRT LLM strives to support the most popular models on **Day 0**.
 ### 🔧 **Latest GPU Architecture Support**

 TensorRT LLM supports the full spectrum of NVIDIA GPU architectures:
-- **NVIDIA Blackwell**: B200, GB200, RTX Pro 6000 SE with FP4 optimization
+- **NVIDIA Blackwell**: B200, GB200, B300, GB300, and RTX Pro 6000 SE with FP4 optimization
 - **NVIDIA Hopper**: H100, H200, GH200 with FP8 acceleration
 - **NVIDIA Ada Lovelace**: L40/L40S, RTX 40 series with FP8 acceleration
 - **NVIDIA Ampere**: A100, RTX 30 series for production workloads

@@ -31,7 +31,7 @@ slurm:
 job_name: "<job_name>"
 extra_args: "" # Additional SLURM arguments (e.g., "--gres=gpu:4 --exclude=node1")
 set_segment: true # Optional: whether to set the segment for the job
-numa_bind: true # Enable NUMA binding for GB200 NVL72
+numa_bind: true # Enable NUMA binding for GB200/GB300 NVL72
 ```

 ### 2. Benchmark Configuration

@@ -7,7 +7,7 @@ slurm:
 job_name: "<job_name>"
 extra_args: "" # Cluster specific arguments, e.g. "--gres=gpu:4 --exclude=node1,node2"
 set_segment: true # Optional: whether to set the segment for the job
-numa_bind: true # Only enable for GB200 NVL72
+numa_bind: true # Only enable for GB200/GB300 NVL72

 # Benchmark Mode
 benchmark:

@@ -27,10 +27,10 @@ done

 if [ "${numa_bind}" = "true" ]; then
 numa_bind_cmd="numactl -m 0,1"
-echo "numactl -m 0,1 - Only allocate memory from nodes on GB200"
+echo "numactl -m 0,1 - Only allocate memory from nodes on GB200/GB300 NVL72"
 else
 numa_bind_cmd=""
-echo "Not binding memory. If on GB200, use \"numactl -m 0,1\" to only allocate memory from nodes."
+echo "Not binding memory. If on GB200/GB300 NVL72, use \"numactl -m 0,1\" to only allocate memory from nodes."
 fi

 if [ "${benchmark_mode}" = "gen_only" ]; then

@@ -21,13 +21,13 @@ Wide-EP solves these challenges through:

 ### Prerequisites

-* GPU: GB200 NVL72, H20, or RTX 6000D.
+* GPU: GB200 NVL72, GB300 NVL72, H20, or RTX 6000D.
 * OS: Linux
 * Drivers: CUDA Driver 575 or Later
 * Docker with NVIDIA Container Toolkit installed
 * Python3 and python3-pip (Optional, for accuracy evaluation only)

-For GB200 NVL72, to make sure that Multi-Node NVLink (MNNVL) is correctly setup, check if the path `/dev/nvidia-caps-imex-channels` exists in the container. If the path doesn't exist, mount it when launching the Docker container.
+For GB200/GB300 NVL72, to make sure that Multi-Node NVLink (MNNVL) is correctly setup, check if the path `/dev/nvidia-caps-imex-channels` exists in the container. If the path doesn't exist, mount it when launching the Docker container.

 For more information on NVIDIA IMEX service for NVLink networks, refer to https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html.

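For the mount guidance in the hunk above, a minimal sketch of what launching the container might look like, assuming Docker with the NVIDIA Container Toolkit; the image name is a placeholder and the bind-mount flag is one possible way to expose the path, not the documented launch command:

```bash
# Illustrative sketch only; <tensorrt_llm_image> is a placeholder.
# Check whether the IMEX channel path is already visible in the container.
docker run --rm --gpus all <tensorrt_llm_image> ls /dev/nvidia-caps-imex-channels

# If it is missing, bind-mount it from the host when launching the container.
docker run --rm -it --gpus all \
  -v /dev/nvidia-caps-imex-channels:/dev/nvidia-caps-imex-channels \
  <tensorrt_llm_image>
```
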
@@ -108,16 +108,16 @@ If `never` is highlighted, enable Transparent HugePages by the following command
 echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
 ```

-### GB200 NUMA binding
+### GB200/GB300 NVL72 NUMA binding

-GPU memory is also on NUMA nodes on GB200 and the system can also use that. Bind memory to CPU nodes to avoid GPU memory being used as host memory.
+GPU memory is also on NUMA nodes on GB200/GB300 NVL72 and the system can also use that. Bind memory to CPU nodes to avoid GPU memory being used as host memory.
 ```bash
 numactl -m 0,1 <command>
 ```

 ### Shared Memory on EPLB

-To achieve online load balancing, all expert weights are stored in shared host memory. Four ranks on the same GB200 node share the same expert weights to save memory.
+To achieve online load balancing, all expert weights are stored in shared host memory. Four ranks on the same GB200/GB300 NVL72 node share the same expert weights to save memory.

 There is one environment variable `TRTLLM_EPLB_SHM_NAME` to specify the base name of the shared memory. This environment variable may need to be specified if there are multiple instances on one node. If not, you can ignore it.

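To illustrate the `TRTLLM_EPLB_SHM_NAME` note in the hunk above, a hedged sketch of running two instances on one node with distinct shared-memory base names; the launch commands and names are placeholders, only the environment variable itself comes from the docs:

```bash
# Illustrative only: <command_for_instance_N> stands in for each instance's
# actual launch command; distinct base names keep their EPLB shared-memory
# segments from colliding when two instances share a node.
TRTLLM_EPLB_SHM_NAME=moe_shared_0 <command_for_instance_0> &
TRTLLM_EPLB_SHM_NAME=moe_shared_1 <command_for_instance_1> &
wait
```
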
@@ -51,7 +51,7 @@ Before running benchmarks, ensure you have:
 1. **SLURM Cluster Access**: Valid account and partition allocation
 2. **Container Environment**:
 - NVIDIA Container Toolkit configured
-- Required device mappings (e.g., `/dev/nvidia-caps-imex-channels` for GB200, `/dev/gdrdrv` for GDRCopy)
+- Required device mappings (e.g., `/dev/nvidia-caps-imex-channels` for GB200/GB300 NVL72, `/dev/gdrdrv` for GDRCopy)
 3. **Model Files**: Checkpoint files accessible from all cluster nodes
 4. **Configuration**: Updated `config.yaml` with your cluster-specific settings

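As an optional sanity check for the device mappings listed in the hunk above, a small sketch assuming a standard SLURM setup (partition and account flags omitted):

```bash
# Illustrative check: confirm the required device paths exist on a compute node
# before submitting benchmark jobs.
srun -N 1 ls -d /dev/nvidia-caps-imex-channels /dev/gdrdrv
```
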
@@ -6,7 +6,7 @@ slurm:
 job_time: "02:00:00"
 job_name: "<job_name>"
 extra_args: "" # Cluster specific arguments, e.g. "--gres=gpu:4 --exclude=node1,node2"
-numa_bind: true # Only enable for GB200 NVL72
+numa_bind: true # Only enable for GB200/GB300 NVL72

 # Benchmark Mode
 benchmark: