For the Pareto curves with MTP = 1, 2, 3, disaggregated serving shows a 1.7x improvement over aggregated serving at 50 tokens/sec/user (20 ms latency). Enabling MTP provides a larger speedup at higher concurrencies.


    Qwen 3#


    ISL 8192 - OSL 1024 (Machine Translation Dataset)#

Figure 15. Qwen 3 Pareto curves.

We also evaluated Qwen 3 performance on GB200 GPUs. The data indicate that disaggregation achieves speedups over aggregation ranging from 1.7x to 6.11x.


    Reproducing Steps#

    We provide a set of scripts to reproduce the performance data presented in this paper. Please refer to the usage instructions described in this document.


    How to launch Llama4 Maverick + Eagle3 TensorRT-LLM server#

Artificial Analysis has benchmarked a TensorRT-LLM server running Llama4 Maverick with Eagle3 enabled at over 1,000 tokens per second per user on 8x NVIDIA B200 GPUs. This implementation leverages NVIDIA's TensorRT-LLM combined with speculative decoding using the Eagle3 draft model to further boost performance.

In the guide below, we walk you through how to launch your own high-performance Llama4 Maverick with Eagle3 enabled TensorRT-LLM server, from build to deployment. (Note that your specific performance numbers may vary; speculative decoding speedups depend on the dataset.)


    Prerequisites#

• 8x NVIDIA B200 GPUs in a single node (we have a forthcoming guide for getting great performance on H100)
• CUDA Toolkit 12.8 or later
• Docker with NVIDIA Container Toolkit installed
• Fast SSD storage for model weights
• Access to Llama4 Maverick and Eagle3 model checkpoints
• A love of speed

    Download Artifacts#

Download the Llama4 Maverick and Eagle3 model checkpoints. In Step 4: Start the TensorRT-LLM server, /path/to/maverick and /path/to/eagle refer to the local download paths of the respective models.
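If you pull the checkpoints from Hugging Face, a minimal sketch looks like the following; the repository IDs are placeholders, so substitute the actual Maverick and Eagle3 checkpoint repositories you have access to:

# Placeholder repository IDs; replace with the actual Maverick and Eagle3 checkpoints you have access to
huggingface-cli download <maverick-repo-id> --local-dir /path/to/maverick
huggingface-cli download <eagle3-repo-id> --local-dir /path/to/eagle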


    Launching the server#


    Step 1: Clone the repository#

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull

The last command, git lfs pull, ensures that all large files stored with Git LFS are properly downloaded. If git lfs is not installed, install it by following Install Git LFS.
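As one common approach on Debian/Ubuntu-based systems (a sketch; follow the official Git LFS instructions for your platform):

# Install and initialize Git LFS (Debian/Ubuntu example)
sudo apt-get update && sudo apt-get install -y git-lfs
git lfs install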


    Step 2: Prepare the TensorRT-LLM release Docker image#


Option 1. Use the weekly release NGC Docker image#

TensorRT-LLM provides a weekly release Docker image on NGC.
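For example, pulling the image might look like the following sketch; the repository path and tag are assumptions, so check the TensorRT-LLM entry in the NGC catalog for the exact values:

# The repository path and tag below are assumptions; verify them on NGC before pulling
docker pull nvcr.io/nvidia/tensorrt-llm/release:<tag>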


Option 2. Build the TensorRT-LLM Docker image (alternative)#

If you want to compile a specific TensorRT-LLM commit, you can build the Docker image yourself by checking out that branch or commit and running the make command below. This may take 15-30 minutes depending on your system.

make -C docker release_build

    Step 3: (Optional) Tag and push the Docker image to your registry#


    If you want to use this image on multiple machines or in a cluster:

docker tag tensorrt_llm/release:latest docker.io/<username>/tensorrt_llm:main
docker push docker.io/<username>/tensorrt_llm:main

    Replace <username> with your Docker Hub username or your private registry path.


    Step 4: Start the TensorRT-LLM server#


    This command launches the server with Llama4 Maverick as the main model and Eagle3 as the draft model for speculative decoding. Make sure you have downloaded both model checkpoints before running this command.

Important: Replace /path/to/maverick and /path/to/eagle with the actual paths to your Maverick and Eagle3 model checkpoints on your host machine, downloaded in the Download Artifacts stage. If you did not tag and push your own image in Step 3, substitute the image name you prepared in Step 2 for docker.io/<username>/tensorrt_llm:main.

docker run -d --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -p 8000:8000 --gpus=all -e "TRTLLM_ENABLE_PDL=1" \
    -v /path/to/maverick:/config/models/maverick -v /path/to/eagle:/config/models/eagle \
    docker.io/<username>/tensorrt_llm:main sh \
        -c "echo -e 'enable_attention_dp: false\nenable_min_latency: true\nenable_autotuner: false\ncuda_graph_config:\n  max_batch_size: 8\nspeculative_config:\n  decoding_type: Eagle\n  max_draft_len: 3\n  speculative_model_dir: /config/models/eagle\nkv_cache_config:\n  enable_block_reuse: false' > c.yaml && \
        TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True \
        trtllm-serve /config/models/maverick \
            --host 0.0.0.0 --port 8000 \
            --backend pytorch --tp_size 8 --ep_size 1 \
            --trust_remote_code --extra_llm_api_options c.yaml \
            --kv_cache_free_gpu_memory_fraction 0.75"

    This command:

• Runs the container in detached mode (-d)
• Sets up shared memory and stack limits for optimal performance
• Maps port 8000 from the container to your host
• Enables all GPUs with tensor parallelism across all 8 GPUs
• Creates a configuration file (c.yaml) for speculative decoding with Eagle3 (shown expanded below)
• Configures memory settings for optimal throughput
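For readability, the configuration that the echo -e command writes to c.yaml expands to:

enable_attention_dp: false
enable_min_latency: true
enable_autotuner: false
cuda_graph_config:
  max_batch_size: 8
speculative_config:
  decoding_type: Eagle
  max_draft_len: 3
  speculative_model_dir: /config/models/eagle
kv_cache_config:
  enable_block_reuse: false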

    After running this command, the server will initialize, which may take several minutes as it loads and optimizes the models.

You can query the health/readiness of the server using:

curl -s -o /dev/null -w "%{http_code}" "http://localhost:8000/health"

When a 200 code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.
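If you want to block a script until the server is ready, a simple sketch is to poll the endpoint in a loop:

# Poll the health endpoint until it returns HTTP 200
until [ "$(curl -s -o /dev/null -w "%{http_code}" "http://localhost:8000/health")" = "200" ]; do
    echo "Waiting for the server to become ready..."
    sleep 10
done
echo "Server is ready."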


    Step 5: Test the server with a sample request#


    Once the server is running, you can test it with a simple curl request:

curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
        "model": "Llama4-eagle",
        "messages": [{"role": "user", "content": "Why is NVIDIA a great company?"}],
        "max_tokens": 1024
    }' -w "\n"

# {"id":"chatcmpl-e752184d1181494c940579c007ab2c5f","object":"chat.completion","created":1748018634,"model":"Llama4-eagle","choices":[{"index":0,"message":{"role":"assistant","content":"NVIDIA is considered a great company for several reasons:\n\n1. **Innovative Technology**: NVIDIA is a leader in the development of graphics processing units (GPUs) and high-performance computing hardware. Their GPUs are used in a wide range of applications, from gaming and professional visualization to artificial intelligence (AI), deep learning, and autonomous vehicles.\n2. ...","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":17,"total_tokens":552,"completion_tokens":535}}

    The server exposes a standard OpenAI-compatible API endpoint that accepts JSON requests. You can adjust parameters like max_tokens, temperature, and others according to your needs.
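For instance, a request that lowers the sampling temperature and streams tokens back as they are generated could look like the following sketch (it assumes the endpoint honors the standard OpenAI-style temperature and stream fields):

# Streaming request with an explicit sampling temperature (OpenAI-style fields)
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
        "model": "Llama4-eagle",
        "messages": [{"role": "user", "content": "Summarize speculative decoding in two sentences."}],
        "max_tokens": 256,
        "temperature": 0.2,
        "stream": true
    }'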


    Step 6: (Optional) Monitor server logs#


    To view the logs of the running container:

docker ps # get the container id
docker logs -f <container_id>

    This is useful for troubleshooting or monitoring performance statistics reported by the server.


    Step 7: (Optional) Stop the server#


    When you’re done with the server:

docker ps # get the container id
docker kill <container_id>

    Troubleshooting Tips#

• If you encounter CUDA out-of-memory errors, try reducing max_batch_size or max_seq_len
• Ensure your model checkpoints are compatible with the expected format
• For performance issues, check GPU utilization with nvidia-smi while the server is running (see the sketch after this list)
• If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed
• For connection issues, make sure port 8000 is not being used by another application
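A minimal way to keep an eye on GPU utilization and memory while the server is handling requests:

# Refresh GPU utilization and memory usage every second
watch -n 1 nvidia-smi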

    Performance Tuning#

The configuration provided is optimized for 8x B200 GPUs, but you can adjust several parameters for your specific workload (an adjusted example configuration follows the list):

• max_batch_size: Controls how many requests can be batched together
• max_draft_len: The number of tokens Eagle can speculate ahead
• kv_cache_free_gpu_memory_fraction: Controls memory allocation for the KV cache
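For example, to trade some latency for throughput you might raise the CUDA graph batch size and let Eagle3 speculate further ahead. The values below are illustrative only; the right settings depend on your workload, and kv_cache_free_gpu_memory_fraction is adjusted on the trtllm-serve command line rather than in c.yaml:

# c.yaml with adjusted example values (illustrative, not tuned recommendations)
enable_attention_dp: false
enable_min_latency: true
enable_autotuner: false
cuda_graph_config:
  max_batch_size: 16        # larger batches than the 8 used above
speculative_config:
  decoding_type: Eagle
  max_draft_len: 4          # one more speculated token than the 3 used above
  speculative_model_dir: /config/models/eagle
kv_cache_config:
  enable_block_reuse: false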