diff --git a/docs/source/legacy/reference/support-matrix.md b/docs/source/legacy/reference/support-matrix.md
index 1dc59fcfa0..24a3a01512 100644
--- a/docs/source/legacy/reference/support-matrix.md
+++ b/docs/source/legacy/reference/support-matrix.md
@@ -133,6 +133,7 @@ In addition, older architectures can have limitations for newer software release
 * - GPU Model Architectures
   -
   - [NVIDIA GB200 NVL72](https://www.nvidia.com/en-us/data-center/gb200-nvl72/)
+  - [NVIDIA GB300 NVL72](https://www.nvidia.com/en-us/data-center/gb300-nvl72/)
   - [NVIDIA Blackwell Architecture](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/)
   - [NVIDIA Grace Hopper Superchip](https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/)
   - [NVIDIA Hopper Architecture](https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/)
diff --git a/docs/source/overview.md b/docs/source/overview.md
index 0df4f72539..471e57ff23 100644
--- a/docs/source/overview.md
+++ b/docs/source/overview.md
@@ -4,7 +4,7 @@

 ## About TensorRT LLM

-[TensorRT LLM](https://developer.nvidia.com/tensorrt) is NVIDIA's comprehensive open-source library for accelerating and optimizing inference performance of the latest large language models (LLMs) on NVIDIA GPUs. 
+[TensorRT LLM](https://developer.nvidia.com/tensorrt) is NVIDIA's comprehensive open-source library for accelerating and optimizing inference performance of the latest large language models (LLMs) on NVIDIA GPUs.

 ## Key Capabilities

@@ -40,7 +40,7 @@ TensorRT LLM strives to support the most popular models on **Day 0**.
 ### 🚀 **Advanced Optimization & Production Features**
 - **[In-Flight Batching & Paged Attention](./features/paged-attention-ifb-scheduler.md)**: In-flight batching eliminates wait times by dynamically managing request execution, processing context and generation phases together for maximum GPU utilization and reduced latency.
 - **[Multi-GPU Multi-Node Inference](./features/parallel-strategy.md)**: Seamless distributed inference with tensor, pipeline, and expert parallelism across multiple GPUs and nodes through the Model Definition API.
-- **[Advanced Quantization](./features/quantization.md)**: 
+- **[Advanced Quantization](./features/quantization.md)**:
   - **FP4 Quantization**: Native support on NVIDIA B200 GPUs with optimized FP4 kernels
   - **FP8 Quantization**: Automatic conversion on NVIDIA H100 GPUs leveraging Hopper architecture
 - **[Speculative Decoding](./features/speculative-decoding.md)**: Multiple algorithms including EAGLE, MTP and NGram
@@ -54,7 +54,7 @@ TensorRT LLM strives to support the most popular models on **Day 0**.
 ### 🔧 **Latest GPU Architecture Support**
 TensorRT LLM supports the full spectrum of NVIDIA GPU architectures:
-- **NVIDIA Blackwell**: B200, GB200, RTX Pro 6000 SE with FP4 optimization
+- **NVIDIA Blackwell**: B200, GB200, B300, GB300, and RTX Pro 6000 SE with FP4 optimization
 - **NVIDIA Hopper**: H100, H200,GH200 with FP8 acceleration
 - **NVIDIA Ada Lovelace**: L40/L40S, RTX 40 series with FP8 acceleration
 - **NVIDIA Ampere**: A100, RTX 30 series for production workloads

diff --git a/examples/disaggregated/slurm/benchmark/README.md b/examples/disaggregated/slurm/benchmark/README.md
index 29b7301f08..5feb896aee 100644
--- a/examples/disaggregated/slurm/benchmark/README.md
+++ b/examples/disaggregated/slurm/benchmark/README.md
@@ -31,7 +31,7 @@ slurm:
   job_name: ""
   extra_args: "" # Additional SLURM arguments (e.g., "--gres=gpu:4 --exclude=node1")
   set_segment: true # Optional: whether to set the segment for the job
-  numa_bind: true # Enable NUMA binding for GB200 NVL72
+  numa_bind: true # Enable NUMA binding for GB200/GB300 NVL72
 ```

 ### 2. Benchmark Configuration
diff --git a/examples/disaggregated/slurm/benchmark/config.yaml b/examples/disaggregated/slurm/benchmark/config.yaml
index afe7282348..b0952d9b7c 100644
--- a/examples/disaggregated/slurm/benchmark/config.yaml
+++ b/examples/disaggregated/slurm/benchmark/config.yaml
@@ -7,7 +7,7 @@ slurm:
   job_name: ""
   extra_args: "" # Cluster specific arguments, e.g. "--gres=gpu:4 --exclude=node1,node2"
   set_segment: true # Optional: whether to set the segment for the job
-  numa_bind: true # Only enable for GB200 NVL72
+  numa_bind: true # Only enable for GB200/GB300 NVL72

 # Benchmark Mode
 benchmark:
diff --git a/examples/disaggregated/slurm/benchmark/start_worker.sh b/examples/disaggregated/slurm/benchmark/start_worker.sh
index f51fccd6f0..e2ac1f7530 100644
--- a/examples/disaggregated/slurm/benchmark/start_worker.sh
+++ b/examples/disaggregated/slurm/benchmark/start_worker.sh
@@ -27,10 +27,10 @@ done

 if [ "${numa_bind}" = "true" ]; then
     numa_bind_cmd="numactl -m 0,1"
-    echo "numactl -m 0,1 - Only allocate memory from nodes on GB200"
+    echo "numactl -m 0,1 - Only allocate memory from the CPU NUMA nodes (0 and 1) on GB200/GB300 NVL72"
 else
     numa_bind_cmd=""
-    echo "Not binding memory. If on GB200, use \"numactl -m 0,1\" to only allocate memory from nodes."
+    echo "Not binding memory. If on GB200/GB300 NVL72, use \"numactl -m 0,1\" to allocate memory only from the CPU NUMA nodes."
 fi

 if [ "${benchmark_mode}" = "gen_only" ]; then
diff --git a/examples/wide_ep/README.md b/examples/wide_ep/README.md
index a9b52cbe8a..cce3993b32 100644
--- a/examples/wide_ep/README.md
+++ b/examples/wide_ep/README.md
@@ -21,13 +21,13 @@ Wide-EP solves these challenges through:

 ### Prerequisites

-* GPU: GB200 NVL72, H20, or RTX 6000D.
+* GPU: GB200 NVL72, GB300 NVL72, H20, or RTX 6000D.
 * OS: Linux
 * Drivers: CUDA Driver 575 or Later
 * Docker with NVIDIA Container Toolkit installed
 * Python3 and python3-pip (Optional, for accuracy evaluation only)

-For GB200 NVL72, to make sure that Multi-Node NVLink (MNNVL) is correctly setup, check if the path `/dev/nvidia-caps-imex-channels` exists in the container. If the path doesn't exist, mount it when launching the Docker container.
+For GB200/GB300 NVL72, to make sure that Multi-Node NVLink (MNNVL) is correctly set up, check whether the path `/dev/nvidia-caps-imex-channels` exists in the container. If the path doesn't exist, mount it when launching the Docker container.
 For more information on NVIDIA IMEX service for NVLink networks, refer to https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html.
@@ -108,16 +108,16 @@ If `never` is highlighted, enable Transparent HugePages by the following command
 echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
 ```

-### GB200 NUMA binding
+### GB200/GB300 NVL72 NUMA binding

-GPU memory is also on NUMA nodes on GB200 and the system can also use that. Bind memory to CPU nodes to avoid GPU memory being used as host memory.
+On GB200/GB300 NVL72, GPU memory is also exposed as NUMA nodes, so the operating system can allocate host memory from it. Bind memory allocations to the CPU NUMA nodes to avoid GPU memory being used as host memory.

 ```bash
 numactl -m 0,1
 ```

 ### Shared Memory on EPLB

-To achieve online load balancing, all expert weights are stored in shared host memory. Four ranks on the same GB200 node share the same expert weights to save memory.
+To achieve online load balancing, all expert weights are stored in shared host memory. Four ranks on the same GB200/GB300 NVL72 node share the same expert weights to save memory.

 There is one environment variable `TRTLLM_EPLB_SHM_NAME` to specify the base name of the shared memory. This environment variable may need to be specified if there are multiple instances on one node. If not, you can ignore it.
diff --git a/examples/wide_ep/slurm_scripts/README.md b/examples/wide_ep/slurm_scripts/README.md
index a3865035fe..625dfc78e8 100644
--- a/examples/wide_ep/slurm_scripts/README.md
+++ b/examples/wide_ep/slurm_scripts/README.md
@@ -51,7 +51,7 @@ Before running benchmarks, ensure you have:
 1. **SLURM Cluster Access**: Valid account and partition allocation
 2. **Container Environment**:
    - NVIDIA Container Toolkit configured
-   - Required device mappings (e.g., `/dev/nvidia-caps-imex-channels` for GB200, `/dev/gdrdrv` for GDRCopy)
+   - Required device mappings (e.g., `/dev/nvidia-caps-imex-channels` for GB200/GB300 NVL72, `/dev/gdrdrv` for GDRCopy)
 3. **Model Files**: Checkpoint files accessible from all cluster nodes
 4. **Configuration**: Updated `config.yaml` with your cluster-specific settings
diff --git a/examples/wide_ep/slurm_scripts/config.yaml b/examples/wide_ep/slurm_scripts/config.yaml
index c019c0d29d..2f10c9707d 100644
--- a/examples/wide_ep/slurm_scripts/config.yaml
+++ b/examples/wide_ep/slurm_scripts/config.yaml
@@ -6,7 +6,7 @@ slurm:
   job_time: "02:00:00"
   job_name: ""
   extra_args: "" # Cluster specific arguments, e.g. "--gres=gpu:4 --exclude=node1,node2"
-  numa_bind: true # Only enable for GB200 NVL72
+  numa_bind: true # Only enable for GB200/GB300 NVL72

 # Benchmark Mode
 benchmark:
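The wide-EP docs touched above ask you to verify `/dev/nvidia-caps-imex-channels` inside the container on GB200/GB300 NVL72 and to map the required devices. A minimal sketch of that check and a container launch follows; the image name is a placeholder and the exact flags depend on your container runtime, so treat it as an illustration rather than the documented command.

```bash
# On the GB200/GB300 NVL72 host: confirm the IMEX channel path exists.
ls /dev/nvidia-caps-imex-channels

# Hypothetical container launch: map the IMEX channel and GDRCopy devices so
# Multi-Node NVLink (MNNVL) and GDRCopy are usable inside the container.
# <tensorrt-llm-image> is a placeholder for whatever image you actually use.
docker run --rm -it --gpus all \
  --device /dev/nvidia-caps-imex-channels \
  --device /dev/gdrdrv \
  <tensorrt-llm-image> bash

# Inside the container: the path should now be present.
ls /dev/nvidia-caps-imex-channels
```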
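Similarly, the NUMA-binding and EPLB shared-memory notes above can be combined when starting a worker by hand. This is only a sketch: `<worker start command>` is a placeholder, and the shared-memory name shown is an arbitrary example value.

```bash
# Bind host allocations to the CPU NUMA nodes (0 and 1 on GB200/GB300 NVL72) so
# that GPU memory exposed as NUMA nodes is not consumed as host memory, and give
# this instance its own EPLB shared-memory base name in case several instances
# share the node (optional when only one instance runs per node).
TRTLLM_EPLB_SHM_NAME=eplb_inst0 \
  numactl -m 0,1 <worker start command>
```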