Mirror of https://github.com/NVIDIA/TensorRT-LLM.git, synced 2026-01-14 06:27:45 +08:00
[TRTLLM-9762] [doc] Update documents for GB300 NVL72 (#9987)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
parent: b57650f1e6
commit: 0788635d6c

@@ -133,6 +133,7 @@ In addition, older architectures can have limitations for newer software release
 * - GPU Model Architectures
 -
 - [NVIDIA GB200 NVL72](https://www.nvidia.com/en-us/data-center/gb200-nvl72/)
+- [NVIDIA GB300 NVL72](https://www.nvidia.com/en-us/data-center/gb300-nvl72/)
 - [NVIDIA Blackwell Architecture](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/)
 - [NVIDIA Grace Hopper Superchip](https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/)
 - [NVIDIA Hopper Architecture](https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/)

@@ -4,7 +4,7 @@

 ## About TensorRT LLM

-[TensorRT LLM](https://developer.nvidia.com/tensorrt) is NVIDIA's comprehensive open-source library for accelerating and optimizing inference performance of the latest large language models (LLMs) on NVIDIA GPUs.
+[TensorRT LLM](https://developer.nvidia.com/tensorrt) is NVIDIA's comprehensive open-source library for accelerating and optimizing inference performance of the latest large language models (LLMs) on NVIDIA GPUs.

 ## Key Capabilities

@@ -40,7 +40,7 @@ TensorRT LLM strives to support the most popular models on **Day 0**.
 ### 🚀 **Advanced Optimization & Production Features**
 - **[In-Flight Batching & Paged Attention](./features/paged-attention-ifb-scheduler.md)**: In-flight batching eliminates wait times by dynamically managing request execution, processing context and generation phases together for maximum GPU utilization and reduced latency.
 - **[Multi-GPU Multi-Node Inference](./features/parallel-strategy.md)**: Seamless distributed inference with tensor, pipeline, and expert parallelism across multiple GPUs and nodes through the Model Definition API.
-- **[Advanced Quantization](./features/quantization.md)**:
+- **[Advanced Quantization](./features/quantization.md)**:
   - **FP4 Quantization**: Native support on NVIDIA B200 GPUs with optimized FP4 kernels
   - **FP8 Quantization**: Automatic conversion on NVIDIA H100 GPUs leveraging Hopper architecture
 - **[Speculative Decoding](./features/speculative-decoding.md)**: Multiple algorithms including EAGLE, MTP and NGram

@@ -54,7 +54,7 @@ TensorRT LLM strives to support the most popular models on **Day 0**.
 ### 🔧 **Latest GPU Architecture Support**

 TensorRT LLM supports the full spectrum of NVIDIA GPU architectures:
-- **NVIDIA Blackwell**: B200, GB200, RTX Pro 6000 SE with FP4 optimization
+- **NVIDIA Blackwell**: B200, GB200, B300, GB300, and RTX Pro 6000 SE with FP4 optimization
 - **NVIDIA Hopper**: H100, H200, GH200 with FP8 acceleration
 - **NVIDIA Ada Lovelace**: L40/L40S, RTX 40 series with FP8 acceleration
 - **NVIDIA Ampere**: A100, RTX 30 series for production workloads

@@ -31,7 +31,7 @@ slurm:
 job_name: "<job_name>"
 extra_args: "" # Additional SLURM arguments (e.g., "--gres=gpu:4 --exclude=node1")
 set_segment: true # Optional: whether to set the segment for the job
-numa_bind: true # Enable NUMA binding for GB200 NVL72
+numa_bind: true # Enable NUMA binding for GB200/GB300 NVL72
 ```

 ### 2. Benchmark Configuration

@@ -7,7 +7,7 @@ slurm:
 job_name: "<job_name>"
 extra_args: "" # Cluster specific arguments, e.g. "--gres=gpu:4 --exclude=node1,node2"
 set_segment: true # Optional: whether to set the segment for the job
-numa_bind: true # Only enable for GB200 NVL72
+numa_bind: true # Only enable for GB200/GB300 NVL72

 # Benchmark Mode
 benchmark:

@@ -27,10 +27,10 @@ done

 if [ "${numa_bind}" = "true" ]; then
 numa_bind_cmd="numactl -m 0,1"
-echo "numactl -m 0,1 - Only allocate memory from nodes on GB200"
+echo "numactl -m 0,1 - Only allocate memory from nodes on GB200/GB300 NVL72"
 else
 numa_bind_cmd=""
-echo "Not binding memory. If on GB200, use \"numactl -m 0,1\" to only allocate memory from nodes."
+echo "Not binding memory. If on GB200/GB300 NVL72, use \"numactl -m 0,1\" to only allocate memory from nodes."
 fi

 if [ "${benchmark_mode}" = "gen_only" ]; then

@@ -21,13 +21,13 @@ Wide-EP solves these challenges through:

 ### Prerequisites

-* GPU: GB200 NVL72, H20, or RTX 6000D.
+* GPU: GB200 NVL72, GB300 NVL72, H20, or RTX 6000D.
 * OS: Linux
 * Drivers: CUDA Driver 575 or Later
 * Docker with NVIDIA Container Toolkit installed
 * Python3 and python3-pip (Optional, for accuracy evaluation only)

-For GB200 NVL72, to make sure that Multi-Node NVLink (MNNVL) is correctly setup, check if the path `/dev/nvidia-caps-imex-channels` exists in the container. If the path doesn't exist, mount it when launching the Docker container.
+For GB200/GB300 NVL72, to make sure that Multi-Node NVLink (MNNVL) is correctly setup, check if the path `/dev/nvidia-caps-imex-channels` exists in the container. If the path doesn't exist, mount it when launching the Docker container.

 For more information on NVIDIA IMEX service for NVLink networks, refer to https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html.

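For the mount guidance in the hunk above, a minimal sketch of what launching the container might look like, assuming Docker with the NVIDIA Container Toolkit; the image name is a placeholder and the bind-mount flag is one possible way to expose the path, not the documented launch command:

```bash
# Illustrative sketch only; <tensorrt_llm_image> is a placeholder.
# Check whether the IMEX channel path is already visible in the container.
docker run --rm --gpus all <tensorrt_llm_image> ls /dev/nvidia-caps-imex-channels

# If it is missing, bind-mount it from the host when launching the container.
docker run --rm -it --gpus all \
  -v /dev/nvidia-caps-imex-channels:/dev/nvidia-caps-imex-channels \
  <tensorrt_llm_image>
```
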
@@ -108,16 +108,16 @@ If `never` is highlighted, enable Transparent HugePages by the following command
 echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
 ```

-### GB200 NUMA binding
+### GB200/GB300 NVL72 NUMA binding

-GPU memory is also on NUMA nodes on GB200 and the system can also use that. Bind memory to CPU nodes to avoid GPU memory being used as host memory.
+GPU memory is also on NUMA nodes on GB200/GB300 NVL72 and the system can also use that. Bind memory to CPU nodes to avoid GPU memory being used as host memory.
 ```bash
 numactl -m 0,1 <command>
 ```

 ### Shared Memory on EPLB

-To achieve online load balancing, all expert weights are stored in shared host memory. Four ranks on the same GB200 node share the same expert weights to save memory.
+To achieve online load balancing, all expert weights are stored in shared host memory. Four ranks on the same GB200/GB300 NVL72 node share the same expert weights to save memory.

 There is one environment variable `TRTLLM_EPLB_SHM_NAME` to specify the base name of the shared memory. This environment variable may need to be specified if there are multiple instances on one node. If not, you can ignore it.

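To illustrate the `TRTLLM_EPLB_SHM_NAME` note in the hunk above, a hedged sketch of running two instances on one node with distinct shared-memory base names; the launch commands and names are placeholders, only the environment variable itself comes from the docs:

```bash
# Illustrative only: <command_for_instance_N> stands in for each instance's
# actual launch command; distinct base names keep their EPLB shared-memory
# segments from colliding when two instances share a node.
TRTLLM_EPLB_SHM_NAME=moe_shared_0 <command_for_instance_0> &
TRTLLM_EPLB_SHM_NAME=moe_shared_1 <command_for_instance_1> &
wait
```
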
@@ -51,7 +51,7 @@ Before running benchmarks, ensure you have:
 1. **SLURM Cluster Access**: Valid account and partition allocation
 2. **Container Environment**:
 - NVIDIA Container Toolkit configured
-- Required device mappings (e.g., `/dev/nvidia-caps-imex-channels` for GB200, `/dev/gdrdrv` for GDRCopy)
+- Required device mappings (e.g., `/dev/nvidia-caps-imex-channels` for GB200/GB300 NVL72, `/dev/gdrdrv` for GDRCopy)
 3. **Model Files**: Checkpoint files accessible from all cluster nodes
 4. **Configuration**: Updated `config.yaml` with your cluster-specific settings

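As an optional sanity check for the device mappings listed in the hunk above, a small sketch assuming a standard SLURM setup (partition and account flags omitted):

```bash
# Illustrative check: confirm the required device paths exist on a compute node
# before submitting benchmark jobs.
srun -N 1 ls -d /dev/nvidia-caps-imex-channels /dev/gdrdrv
```
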
@@ -6,7 +6,7 @@ slurm:
 job_time: "02:00:00"
 job_name: "<job_name>"
 extra_args: "" # Cluster specific arguments, e.g. "--gres=gpu:4 --exclude=node1,node2"
-numa_bind: true # Only enable for GB200 NVL72
+numa_bind: true # Only enable for GB200/GB300 NVL72

 # Benchmark Mode
 benchmark: