[None][doc] Add llama4 hybrid guide (#6640)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
@@ -8,6 +8,7 @@ This document shows how to run Llama4-Maverick on B200 with PyTorch workflow and
- [Performance Benchmarks](#performance-benchmarks)
  - [B200 Max-throughput](#b200-max-throughput)
  - [B200 Min-latency](#b200-min-latency)
  - [B200 Balanced](#b200-balanced)
- [Advanced Configuration](#advanced-configuration)
  - [Configuration tuning](#configuration-tuning)
- [Troubleshooting](#troubleshooting)
@@ -94,9 +95,7 @@ Explanation:
#### 2. Launch trtllm-serve OpenAI-compatible API server
TensorRT-LLM supports the NVIDIA TensorRT Model Optimizer quantized FP8 checkpoint.
Currently, parallel weight loading conflicts with `min_latency` mode; disable parallel weight loading to enable `min_latency` for now.
``` bash
TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True \
trtllm-serve nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
    --max_batch_size 8 \
    --tp_size 8 \
@@ -121,6 +120,52 @@ python -m tensorrt_llm.serve.scripts.benchmark_serving \
    --max-concurrency 1 \
```
### B200 Balanced
#### 1. Prepare TensorRT-LLM extra configs
```bash
cat >./extra-llm-api-config.yml <<EOF
stream_interval: 2
cuda_graph_config:
  max_batch_size: 1024
  enable_padding: true
EOF
```
Explanation:
- `stream_interval`: The iteration interval at which responses are created in streaming mode.
- `cuda_graph_config`: CUDA Graph config.
  - `max_batch_size`: The maximum CUDA graph batch size to capture.
  - `enable_padding`: Whether to enable CUDA graph padding.
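Since YAML nesting is indentation-sensitive, you can optionally confirm that `max_batch_size` and `enable_padding` are parsed as children of `cuda_graph_config` before launching the server. A minimal check (not part of the original guide), assuming Python with PyYAML is available:
```bash
# Prints the parsed config; the two CUDA graph options should appear nested under cuda_graph_config.
python3 -c 'import yaml; print(yaml.safe_load(open("./extra-llm-api-config.yml")))'
```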
#### 2. Launch trtllm-serve OpenAI-compatible API server
TensorRT-LLM supports the NVIDIA TensorRT Model Optimizer quantized FP8 checkpoint.
``` bash
trtllm-serve nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
    --tp_size 8 \
    --ep_size 2 \
    --num_postprocess_workers 2 \
    --trust_remote_code \
    --extra_llm_api_options ./extra-llm-api-config.yml
```
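Once the server is up, you can optionally send a single request to confirm the OpenAI-compatible endpoint responds before benchmarking. This is a sketch rather than part of the original guide; it assumes the default `localhost:8000` address used by `trtllm-serve`, so adjust the host and port if you launched the server differently.
``` bash
# Hypothetical smoke test against the chat completions route; adjust host/port as needed.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'
```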
#### 3. Run performance benchmark
TensorRT-LLM provides a benchmark tool to benchmark trtllm-serve.
Open a new terminal and run `benchmark_serving`:
```bash
python -m tensorrt_llm.serve.scripts.benchmark_serving \
    --model nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
    --dataset-name random \
    --ignore-eos \
    --num-prompts 1000 \
    --random-input-len 1024 \
    --random-output-len 2048 \
    --random-ids \
    --max-concurrency 64
```
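To explore the throughput and latency trade-off around this operating point, one option (not part of the original guide) is to sweep `--max-concurrency` over a few values with the same flags as above and compare the reported metrics:
```bash
# Sketch: rerun the benchmark at several concurrency levels using the flags from the guide.
for concurrency in 8 16 32 64 128; do
  python -m tensorrt_llm.serve.scripts.benchmark_serving \
    --model nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
    --dataset-name random \
    --ignore-eos \
    --num-prompts 1000 \
    --random-input-len 1024 \
    --random-output-len 2048 \
    --random-ids \
    --max-concurrency "${concurrency}"
done
```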
## Advanced Configuration
### Configuration tuning