Llama4-Maverick
This document shows how to run Llama4-Maverick on B200 with the PyTorch workflow and how to run performance benchmarks.
Performance Benchmarks
This section provides the steps to launch the TensorRT-LLM server and run performance benchmarks for different scenarios.
B200 Max-throughput
1. Prepare TensorRT-LLM extra configs
cat >./extra-llm-api-config.yml <<EOF
enable_attention_dp: true
stream_interval: 2
cuda_graph_config:
  max_batch_size: 512
  enable_padding: true
EOF
Explanation:
- enable_attention_dp: Enable attention Data Parallel, which is recommended at high concurrency.
- stream_interval: The iteration interval to create responses under the streaming mode.
- cuda_graph_config: CUDA Graph config.
  - max_batch_size: Max CUDA graph batch size to capture.
  - enable_padding: Whether to enable CUDA graph padding.
2. Launch trtllm-serve OpenAI-compatible API server
TensorRT-LLM supports the NVIDIA TensorRT Model Optimizer quantized FP8 checkpoint.
trtllm-serve nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
--max_batch_size 512 \
--tp_size 8 \
--ep_size 8 \
--num_postprocess_workers 2 \
--trust_remote_code \
--extra_llm_api_options ./extra-llm-api-config.yml
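Once the server reports it is ready, you can sanity-check the OpenAI-compatible endpoint before benchmarking. The request below is a minimal sketch; it assumes trtllm-serve is listening on the default localhost:8000 (adjust the host and port if you launched it differently).
# Send a single chat completion request to the OpenAI-compatible endpoint.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 64
      }'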
3. Run performance benchmarks
TensorRT-LLM provides a benchmark tool to benchmark trtllm-serve.
In a new terminal, run benchmark_serving:
python -m tensorrt_llm.serve.scripts.benchmark_serving \
--model nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
--dataset-name random \
--ignore-eos \
--num-prompts 8192 \
--random-input-len 1024 \
--random-output-len 2048 \
--random-ids \
--max-concurrency 1024
B200 Min-latency
1. Prepare TensorRT-LLM extra configs
cat >./extra-llm-api-config.yml <<EOF
enable_attention_dp: false
enable_min_latency: true
stream_interval: 2
cuda_graph_config:
  max_batch_size: 8
  enable_padding: true
EOF
Explanation:
- enable_attention_dp: Enable attention Data Parallel, which is recommended to be disabled at low concurrency.
- enable_min_latency: Enable optimizations for low-latency scenarios, where concurrency is very small (like 1 or 2).
- stream_interval: The iteration interval to create responses under the streaming mode.
- cuda_graph_config: CUDA Graph config.
  - max_batch_size: Max CUDA graph batch size to capture.
  - enable_padding: Whether to enable CUDA graph padding.
2. Launch trtllm-serve OpenAI-compatible API server
TensorRT-LLM supports the NVIDIA TensorRT Model Optimizer quantized FP8 checkpoint.
trtllm-serve nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
--max_batch_size 8 \
--tp_size 8 \
--ep_size 1 \
--trust_remote_code \
--extra_llm_api_options ./extra-llm-api-config.yml
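For the min-latency scenario it is useful to confirm that streamed responses arrive incrementally, since stream_interval controls how often streaming responses are produced. The request below is a minimal sketch of a streamed chat completion, again assuming the default localhost:8000 endpoint.
# Streamed request: response chunks arrive as server-sent events ("data: ..." lines).
curl -s -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8",
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "max_tokens": 64,
        "stream": true
      }'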
3. Run performance benchmark
TensorRT-LLM provides a benchmark tool to benchmark trtllm-serve.
In a new terminal, run benchmark_serving:
python -m tensorrt_llm.serve.scripts.benchmark_serving \
--model nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
--dataset-name random \
--ignore-eos \
--num-prompts 1000 \
--random-input-len 1024 \
--random-output-len 2048 \
--random-ids \
--max-concurrency 1
B200 Balanced
1. Prepare TensorRT-LLM extra configs
cat >./extra-llm-api-config.yml <<EOF
stream_interval: 2
cuda_graph_config:
  max_batch_size: 1024
  enable_padding: true
EOF
Explanation:
- stream_interval: The iteration interval to create responses under the streaming mode.
- cuda_graph_config: CUDA Graph config.
  - max_batch_size: Max CUDA graph batch size to capture.
  - enable_padding: Whether to enable CUDA graph padding.
2. Launch trtllm-serve OpenAI-compatible API server
TensorRT-LLM supports the NVIDIA TensorRT Model Optimizer quantized FP8 checkpoint.
trtllm-serve nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
--tp_size 8 \
--ep_size 2 \
--num_postprocess_workers 2 \
--trust_remote_code \
--extra_llm_api_options ./extra-llm-api-config.yml
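Before benchmarking, you can check that the served model name matches the --model value passed to benchmark_serving below. This assumes the server exposes the standard OpenAI /v1/models route on the default localhost:8000; if your TensorRT-LLM version does not, skip this check.
# List served models; the "id" field should match the benchmark's --model argument.
curl -s http://localhost:8000/v1/models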
3. Run performance benchmark
TensorRT-LLM provides a benchmark tool to benchmark trtllm-serve.
In a new terminal, run benchmark_serving:
python -m tensorrt_llm.serve.scripts.benchmark_serving \
--model nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
--dataset-name random \
--ignore-eos \
--num-prompts 1000 \
--random-input-len 1024 \
--random-output-len 2048 \
--random-ids \
--max-concurrency 64
Advanced Configuration
Configuration tuning
- Attention DP only provides throughput gains in high-concurrency scenarios. Consider disabling it for low concurrency and enabling it for high concurrency. The concurrency threshold needs to be tuned based on your specific ISL/OSL configuration (see the sweep sketch after this list).
- Expert Parallel (EP) usually benefits high-concurrency scenarios. The ep_size needs to be tuned based on your specific ISL/OSL and concurrency configuration.
- enable_min_latency enables optimizations for low-latency scenarios, where concurrency is very small (like 1 or 2). The concurrency threshold needs to be tuned based on your specific ISL/OSL configuration.
- stream_interval and num_postprocess_workers are both used to reduce streaming-mode overhead. stream_interval controls the iteration interval to create responses under streaming mode, which benefits performance across all concurrency levels. num_postprocess_workers controls the number of processes used for postprocessing generated tokens, which provides benefits in high-concurrency scenarios. These values need to be tuned based on your specific ISL/OSL and concurrency configuration.
- max_batch_size and max_num_tokens can easily affect performance. Their default values are already carefully chosen and should deliver good performance in most cases; however, you may still need to tune them for peak performance. max_batch_size should not be set so low that it bottlenecks throughput. Note that with Attention DP, the whole system's max batch size is max_batch_size * dp_size.
- The CUDA graph max_batch_size should be the same value as the TensorRT-LLM server's max_batch_size.
- For more details on max_batch_size and max_num_tokens, refer to Tuning Max Batch Size and Max Num Tokens.
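One way to find these thresholds empirically is to sweep --max-concurrency with the same benchmark_serving invocation used above and compare the reported throughput and latency for each server configuration (for example, with Attention DP enabled vs. disabled). The loop below is only a sketch; the concurrency values are illustrative and it assumes a trtllm-serve instance is already running.
# Sweep benchmark concurrency against an already-running trtllm-serve instance.
for CONCURRENCY in 1 8 64 256 1024; do
  python -m tensorrt_llm.serve.scripts.benchmark_serving \
    --model nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
    --dataset-name random \
    --ignore-eos \
    --num-prompts 1000 \
    --random-input-len 1024 \
    --random-output-len 2048 \
    --random-ids \
    --max-concurrency "$CONCURRENCY"
done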
Troubleshooting
Out of memory issues
It's possible to see OOM issues in some cases. Consider reducing kv_cache_free_gpu_mem_fraction to a smaller value as a workaround; a sketch is shown below. We are investigating and working to address the problem.
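A minimal sketch of the workaround, assuming the setting is exposed through the extra LLM API config as kv_cache_config.free_gpu_memory_fraction (verify the exact option name against your TensorRT-LLM version; 0.8 is only an illustrative value, and the surrounding options mirror the max-throughput config above):
cat >./extra-llm-api-config.yml <<EOF
# Assumed option name: reserve a smaller share of free GPU memory for the KV cache.
kv_cache_config:
  free_gpu_memory_fraction: 0.8
enable_attention_dp: true
stream_interval: 2
cuda_graph_config:
  max_batch_size: 512
  enable_padding: true
EOF
Relaunch trtllm-serve with --extra_llm_api_options pointing at the updated file for the change to take effect.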