[None][doc] Add llama4 hybrid guide (#6640)

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
jiahanc 2025-08-05 22:25:38 -07:00 committed by GitHub
parent da072277d1
commit 3170039e36


@@ -8,6 +8,7 @@ This document shows how to run Llama4-Maverick on B200 with PyTorch workflow and
- [Performance Benchmarks](#performance-benchmarks)
- [B200 Max-throughput](#b200-max-throughput)
- [B200 Min-latency](#b200-min-latency)
- [B200 Balanced](#b200-balanced)
- [Advanced Configuration](#advanced-configuration)
- [Configuration tuning](#configuration-tuning)
- [Troubleshooting](#troubleshooting)
@@ -94,9 +95,7 @@ Explanation:
#### 2. Launch trtllm-serve OpenAI-compatible API server
TensorRT-LLM supports NVIDIA TensorRT Model Optimizer quantized FP8 checkpoints.
Parallel weight loading currently conflicts with min_latency, so disable it for now to enable min_latency.
```bash
TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True \
trtllm-serve nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
--max_batch_size 8 \
--tp_size 8 \
@@ -121,6 +120,52 @@ python -m tensorrt_llm.serve.scripts.benchmark_serving \
--max-concurrency 1
```
### B200 Balanced
#### 1. Prepare TensorRT-LLM extra configs
```bash
cat >./extra-llm-api-config.yml <<EOF
stream_interval: 2
cuda_graph_config:
  max_batch_size: 1024
  enable_padding: true
EOF
```
Explanation:
- `stream_interval`: The iteration interval at which responses are created in streaming mode.
- `cuda_graph_config`: CUDA Graph config.
  - `max_batch_size`: Max CUDA graph batch size to capture.
  - `enable_padding`: Whether to enable CUDA graph padding.
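The same file accepts other LLM API options. As a hedged sketch, here is a variant that additionally caps the fraction of free GPU memory reserved for KV cache; `kv_cache_config.free_gpu_memory_fraction` is an assumption here, so verify the key name against your TensorRT-LLM version:
```bash
cat >./extra-llm-api-config.yml <<EOF
stream_interval: 2
cuda_graph_config:
  max_batch_size: 1024
  enable_padding: true
# Assumed option: reserve 90% of free GPU memory for the KV cache.
kv_cache_config:
  free_gpu_memory_fraction: 0.9
EOF
```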
#### 2. Launch trtllm-serve OpenAI-compatible API server
TensorRT-LLM supports NVIDIA TensorRT Model Optimizer quantized FP8 checkpoints.
```bash
trtllm-serve nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
--tp_size 8 \
--ep_size 2 \
--num_postprocess_workers 2 \
--trust_remote_code \
--extra_llm_api_options ./extra-llm-api-config.yml
```
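Once the server is up, you can sanity-check the OpenAI-compatible endpoint before benchmarking. A minimal sketch, assuming the server listens on the default `localhost:8000`:
```bash
# Send a single chat completion request to the OpenAI-compatible endpoint.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'
```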
#### 3. Run performance benchmark
TensorRT-LLM provides a benchmarking tool for trtllm-serve.
Open a new terminal and run `benchmark_serving`:
```bash
python -m tensorrt_llm.serve.scripts.benchmark_serving \
--model nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
--dataset-name random \
--ignore-eos \
--num-prompts 1000 \
--random-input-len 1024 \
--random-output-len 2048 \
--random-ids \
--max-concurrency 64
```
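The balanced configuration trades some peak throughput for lower latency, and the best operating point depends on your traffic. An illustrative sweep over `--max-concurrency` (an assumption added here, not part of the official steps) can help locate it:
```bash
# Sweep concurrency levels and save each benchmark run's output.
for conc in 16 32 64 128; do
  python -m tensorrt_llm.serve.scripts.benchmark_serving \
    --model nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
    --dataset-name random \
    --ignore-eos \
    --num-prompts 1000 \
    --random-input-len 1024 \
    --random-output-len 2048 \
    --random-ids \
    --max-concurrency "$conc" | tee "bench_conc_${conc}.log"
done
```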
## Advanced Configuration
### Configuration tuning