[None][doc] Add llama4 hybrid guide (#6640)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
@@ -8,6 +8,7 @@ This document shows how to run Llama4-Maverick on B200 with PyTorch workflow and
- [Performance Benchmarks](#performance-benchmarks)
  - [B200 Max-throughput](#b200-max-throughput)
  - [B200 Min-latency](#b200-min-latency)
  - [B200 Balanced](#b200-balanced)
- [Advanced Configuration](#advanced-configuration)
  - [Configuration tuning](#configuration-tuning)
- [Troubleshooting](#troubleshooting)
@@ -94,9 +95,7 @@ Explanation:
#### 2. Launch trtllm-serve OpenAI-compatible API server
TensorRT-LLM supports the NVIDIA TensorRT Model Optimizer quantized FP8 checkpoint.
Currently, parallel weight loading conflicts with `min_latency` mode; disable parallel weight loading to enable `min_latency` for now.
``` bash
TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True \
trtllm-serve nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
    --max_batch_size 8 \
    --tp_size 8 \
@@ -121,6 +120,52 @@ python -m tensorrt_llm.serve.scripts.benchmark_serving \
    --max-concurrency 1 \
```
### B200 Balanced
#### 1. Prepare TensorRT-LLM extra configs
```bash
cat >./extra-llm-api-config.yml <<EOF
stream_interval: 2
cuda_graph_config:
  max_batch_size: 1024
  enable_padding: true
EOF
```
Explanation:
- `stream_interval`: The iteration interval at which responses are created in streaming mode.
- `cuda_graph_config`: CUDA Graph config.
  - `max_batch_size`: The maximum CUDA graph batch size to capture.
  - `enable_padding`: Whether to enable CUDA graph padding.
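Since YAML nesting is indentation-sensitive, you can optionally confirm that `max_batch_size` and `enable_padding` are parsed as children of `cuda_graph_config` before launching the server. A minimal check (not part of the original guide), assuming Python with PyYAML is available:
```bash
# Prints the parsed config; the two CUDA graph options should appear nested under cuda_graph_config.
python3 -c 'import yaml; print(yaml.safe_load(open("./extra-llm-api-config.yml")))'
```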
#### 2. Launch trtllm-serve OpenAI-compatible API server
TensorRT-LLM supports the NVIDIA TensorRT Model Optimizer quantized FP8 checkpoint.
``` bash
trtllm-serve nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
    --tp_size 8 \
    --ep_size 2 \
    --num_postprocess_workers 2 \
    --trust_remote_code \
    --extra_llm_api_options ./extra-llm-api-config.yml
```
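Once the server is up, you can optionally send a single request to confirm the OpenAI-compatible endpoint responds before benchmarking. This is a sketch rather than part of the original guide; it assumes the default `localhost:8000` address used by `trtllm-serve`, so adjust the host and port if you launched the server differently.
``` bash
# Hypothetical smoke test against the chat completions route; adjust host/port as needed.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'
```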
#### 3. Run performance benchmark
TensorRT-LLM provides a benchmark tool to benchmark trtllm-serve.
Open a new terminal and run `benchmark_serving`:
```bash
python -m tensorrt_llm.serve.scripts.benchmark_serving \
    --model nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
    --dataset-name random \
    --ignore-eos \
    --num-prompts 1000 \
    --random-input-len 1024 \
    --random-output-len 2048 \
    --random-ids \
    --max-concurrency 64
```
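To explore the throughput and latency trade-off around this operating point, one option (not part of the original guide) is to sweep `--max-concurrency` over a few values with the same flags as above and compare the reported metrics:
```bash
# Sketch: rerun the benchmark at several concurrency levels using the flags from the guide.
for concurrency in 8 16 32 64 128; do
  python -m tensorrt_llm.serve.scripts.benchmark_serving \
    --model nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
    --dataset-name random \
    --ignore-eos \
    --num-prompts 1000 \
    --random-input-len 1024 \
    --random-output-len 2048 \
    --random-ids \
    --max-concurrency "${concurrency}"
done
```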
## Advanced Configuration
### Configuration tuning