Draft-Target-Model Speculative Decoding (DTM)
This document shows how to build and run a model with DTM speculative decoding (also known as speculative sampling, as described in the original paper) in TensorRT-LLM on a single GPU, or on a single node with multiple GPUs.
Overview
Two workflows for running DTM are currently provided: using the TensorRT-LLM BLS model in the Triton Inference Server, or using TensorRT-LLM directly. This section introduces the detailed steps for running DTM in both workflows.
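At a high level, each DTM iteration lets the draft engine propose a short token sequence that the target engine verifies in a single forward pass. Below is a minimal, framework-free sketch of the token-comparison variant (illustrative only; `draft_model` and `target_model` are hypothetical stand-ins for the two engines):

```python
def speculative_decode_step(draft_model, target_model, tokens, draft_len=4):
    """One DTM iteration: the draft proposes draft_len tokens, the target verifies."""
    # 1) The draft model autoregressively proposes draft_len candidate tokens.
    draft_tokens = []
    ctx = list(tokens)
    for _ in range(draft_len):
        t = draft_model(ctx)
        draft_tokens.append(t)
        ctx.append(t)
    # 2) The target model scores all candidates in one forward pass and emits
    #    its own prediction at every position (length == draft_len + 1).
    target_tokens = target_model(tokens, draft_tokens)
    # 3) Accept the longest prefix where draft and target agree; the first
    #    disagreement is replaced by the target's token, so every iteration
    #    contributes at least one new token.
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)
            break
    else:
        accepted.append(target_tokens[-1])  # bonus token when all drafts accepted
    return tokens + accepted
```

Because the target always contributes at least one token per iteration, generation never produces fewer tokens than plain auto-regressive decoding; the speedup comes from verifying several draft tokens with one target pass.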
Support Matrix
- GPU Compute Capability >= 8.0 (Ampere or newer)
- FP16 / BF16 / FP8 (both draft and target model)
- Paged KV Cache
- Tensor Parallel
Usage
Build draft and target engines (identical in both workflows)
- We use open-source llama-7B/13B as the draft and target models in this example, assuming the paths to the model repositories are `DRAFT_MODEL_PATH` and `TARGET_MODEL_PATH`.
- `--use_paged_context_fmha=enable` must be specified, since this approach relies on KV cache reuse.
- `--speculative_decoding_mode=draft_tokens_external` and `--max_draft_len` must be specified for the target model.
- `--gather_generation_logits` is necessary if generation logits are used for selecting tokens in the target model.
- `--tp_size` can be set if running the draft / target model in TP mode.
cd examples/models/core/llama
export DRAFT_CKPT_PATH=/workspace/ckpt-draft
export TARGET_CKPT_PATH=/workspace/ckpt-target
export DRAFT_ENGINE_PATH=/workspace/engine-draft
export TARGET_ENGINE_PATH=/workspace/engine-target
export MAX_BATCH_SIZE=4
export MAX_DRAFT_LEN=10
export MAX_INPUT_LEN=3200
export MAX_SEQ_LEN=4800
python3 convert_checkpoint.py \
--model_dir=${DRAFT_MODEL_PATH} \
--output_dir=${DRAFT_CKPT_PATH} \
--dtype=float16
python3 convert_checkpoint.py \
--model_dir=${TARGET_MODEL_PATH} \
--output_dir=${TARGET_CKPT_PATH} \
--dtype=float16
trtllm-build \
--checkpoint_dir=${DRAFT_CKPT_PATH} \
--output_dir=${DRAFT_ENGINE_PATH} \
--gemm_plugin=float16 \
--use_paged_context_fmha=enable \
--max_batch_size=${MAX_BATCH_SIZE} \
--max_input_len=${MAX_INPUT_LEN} \
--max_seq_len=${MAX_SEQ_LEN}
trtllm-build \
--checkpoint_dir=${TARGET_CKPT_PATH} \
--output_dir=${TARGET_ENGINE_PATH} \
--gemm_plugin=float16 \
--use_paged_context_fmha=enable \
--speculative_decoding_mode=draft_tokens_external \
--max_batch_size=${MAX_BATCH_SIZE} \
--max_draft_len=${MAX_DRAFT_LEN} \
--max_input_len=${MAX_INPUT_LEN} \
--max_seq_len=${MAX_SEQ_LEN}
TensorRT-LLM workflow
- `--draft_engine_dir` and `--engine_dir` must be specified for the draft and target engines respectively.
- `--draft_target_model_config` is the DTM configuration. It has four hyperparameters that control the generation process:
  - `draft_len`: the number of tokens the draft model generates per iteration, commonly in the range 4 to 10. Empirically, a larger value yields a higher acceptance ratio but also higher overhead, so the right balance must be found for the given models and application scenario.
  - `draft_model_device_list`: the list of device indices for the draft model. Its length must equal the TP size of the draft engine. For instance, `draft_model_device_list=[1]` means tp_size=1 on GPU 1, and `draft_model_device_list=[4,5,6,7]` means tp_size=4 on GPUs 4 to 7.
  - `target_model_device_list`: the list of device indices for the target model. Its length must equal the TP size of the target engine. For instance, `target_model_device_list=[0]` means tp_size=1 on GPU 0, and `target_model_device_list=[2,3]` means tp_size=2 on GPUs 2 and 3.
  - `use_logits`: selects one of two methods for accepting tokens proposed by the draft model. With `use_logits=True`, draft tokens are accepted based on the ratio of the draft and target logits (the modified rejection sampling method from the original paper); with `use_logits=False`, draft tokens are accepted by per-token comparison with the target predictions, ignoring the logits.
- As an example, `[4,[0],[1],False]` means `draft_len=4`, draft model on GPU 0, target model on GPU 1, and acceptance by token comparison rather than logits.
- `--kv_cache_enable_block_reuse` must be specified for this approach.
- Only the CPP session is supported, so `--use_py_session` must not be specified.
- `--kv_cache_free_gpu_memory_fraction` should be specified when placing both models on one GPU; otherwise one of the models may run out of GPU memory.
- `--num_beams` cannot be set larger than 1, since beam search is not yet supported in this approach.
- `--output_generation_logits` is optional. The original paper accepts tokens by comparing the logits of the draft and target models, which requires this parameter. For simplicity, tokens can instead be accepted by comparing the output tokens directly, in which case this parameter can be skipped.
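For illustration, the shape of the `--draft_target_model_config` value can be checked with a small helper (hypothetical; `run.py` does its own parsing internally):

```python
import ast

def parse_dtm_config(config_str):
    """Parse a --draft_target_model_config string such as "[4,[0],[1],False]".

    Illustrative helper only; run.py performs its own parsing internally.
    """
    draft_len, draft_devices, target_devices, use_logits = ast.literal_eval(config_str)
    assert draft_len >= 1, "draft_len must be positive (4 to 10 is typical)"
    assert draft_devices and target_devices, "device lists must be non-empty"
    return {
        "draft_len": draft_len,
        "draft_model_device_list": draft_devices,
        "target_model_device_list": target_devices,
        "use_logits": use_logits,
    }
```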
python3 examples/run.py \
--tokenizer_dir=${TARGET_MODEL_PATH} \
--draft_engine_dir=/workspace/engine-draft \
--engine_dir=/workspace/engine-target \
--draft_target_model_config="[4,[0],[1],False]" \
--max_output_len=256 \
--kv_cache_enable_block_reuse \
--kv_cache_free_gpu_memory_fraction=0.5 \
--output_generation_logits \
--input_text="How does Draft-Sampling work?"
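The two acceptance methods selected by `use_logits` can be sketched as follows (a simplified illustration of the logic described above, not the actual implementation; plain probabilities stand in for softmaxed logits):

```python
import random

def accept_draft_token(draft_token, target_token, p_target=None, p_draft=None,
                       use_logits=False, rng=random.random):
    """Decide whether to keep a single draft token.

    use_logits=False: plain per-token comparison against the target's prediction.
    use_logits=True:  modified rejection sampling from the original paper --
    accept with probability min(1, p_target(x) / p_draft(x)).
    Illustrative sketch only; p_target / p_draft are dicts mapping token -> prob.
    """
    if not use_logits:
        return draft_token == target_token
    ratio = p_target[draft_token] / p_draft[draft_token]
    return rng() < min(1.0, ratio)
```

Intuitively, with `use_logits=True` a draft token survives more often when the target model assigns it at least as much probability as the draft model did, which preserves the target model's output distribution.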
Triton Inference Server workflow
- This example is based on TensorRT-LLM-0.18.0 and TRTLLM-backend-0.18.0 with the docker image `nvcr.io/nvidia/tritonserver:25.03-trtllm-python-py3`.
- The DTM approach has been supported since TensorRT-LLM-0.7.0 (using two separate Triton servers for the draft and target models respectively), and was significantly optimized in TensorRT-LLM-0.10.0 (using one Triton server with Business Logic Scripting, BLS).
Get related repository inside the container
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git checkout rel
git lfs pull
git submodule update --init --recursive
pip install -r requirements.txt
pip install SentencePiece tritonclient
export DRAFT_MODEL_NAME="tensorrt_llm_draft"
export TARGET_MODEL_NAME="tensorrt_llm"
export TRITON_MODEL_REPO=llama_dtm
Simple deploy
- Edit model configuration.
export DRAFT_DEVICE_IDS="0"
export TARGET_DEVICE_IDS="1"
rm -rf ${TRITON_MODEL_REPO}
cp -r all_models/inflight_batcher_llm/ ${TRITON_MODEL_REPO}
cp -r ${TRITON_MODEL_REPO}/tensorrt_llm ${TRITON_MODEL_REPO}/tensorrt_llm_draft
sed -i 's/name: "tensorrt_llm"/name: "tensorrt_llm_draft"/g' ${TRITON_MODEL_REPO}/tensorrt_llm_draft/config.pbtxt
python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/ensemble/config.pbtxt triton_max_batch_size:4,logits_datatype:TYPE_FP32
python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/preprocessing/config.pbtxt triton_max_batch_size:4,tokenizer_dir:${HF_MODEL},preprocessing_instance_count:1
python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/postprocessing/config.pbtxt triton_max_batch_size:4,tokenizer_dir:${HF_MODEL},postprocessing_instance_count:1
python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:4,decoupled_mode:False,logits_datatype:TYPE_FP32,bls_instance_count:1,accumulate_tokens:False,tensorrt_llm_model_name:${TARGET_MODEL_NAME},tensorrt_llm_draft_model_name:${DRAFT_MODEL_NAME}
python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/tensorrt_llm/config.pbtxt triton_max_batch_size:4,decoupled_mode:False,logits_datatype:TYPE_FP32,triton_backend:tensorrtllm,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,encoder_input_features_data_type:TYPE_FP16,engine_dir:${TARGET_ENGINE_PATH},gpu_device_ids:${TARGET_DEVICE_IDS}
python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/tensorrt_llm_draft/config.pbtxt triton_max_batch_size:4,decoupled_mode:False,logits_datatype:TYPE_FP32,triton_backend:tensorrtllm,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,encoder_input_features_data_type:TYPE_FP16,engine_dir:${DRAFT_ENGINE_PATH},gpu_device_ids:${DRAFT_DEVICE_IDS}
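What `fill_template.py` does can be pictured with a simplified re-implementation (an illustrative sketch, not the real tool): the parameter string is split on commas into `key:value` pairs, and each `${key}` placeholder in the pbtxt template is substituted. This is also why values containing commas, such as a multi-GPU `gpu_device_ids` list, cannot be passed this way and are patched with `sed` in the TP example instead.

```python
def fill_template(template: str, params: str) -> str:
    """Substitute ${key} placeholders in a pbtxt template string.

    Simplified sketch of tools/fill_template.py: `params` is a
    comma-separated list of key:value pairs, so values that themselves
    contain commas cannot be expressed here.
    """
    for pair in params.split(","):
        key, value = pair.split(":", 1)  # split on the first colon only
        template = template.replace("${" + key + "}", value)
    return template
```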
- Start the Triton Inference Server.
- Verbose logs will be written to the file `triton_log.txt` if `--log` is specified.
python3 scripts/launch_triton_server.py \
--model_repo=${TRITON_MODEL_REPO} \
--multi-model \
--log
- You can see the output below in the file if Triton server launches successfully:
Started HTTPService at 0.0.0.0:8000
Started GRPCInferenceService at 0.0.0.0:8001
Started Metrics Service at 0.0.0.0:8002
- Send a request for inference.
python3 inflight_batcher_llm/client/e2e_grpc_speculative_decoding_client.py \
--url-target=localhost:8001 \
--draft-tensorrt-llm-model-name=${DRAFT_MODEL_NAME} \
--target-tensorrt-llm-model-name=${TARGET_MODEL_NAME} \
--output-len=100 \
--num-draft-tokens=4 \
--end-id=2 \
--pad-id=2 \
--prompt "What is Ubuntu operation system?"
- You should see results like the following if everything goes smoothly.
Final text:
What is Ubuntu operation system?
Ubuntu is a free and open source operating system that runs from the desktop, to the cloud, to all your internet connected things. Ubuntu is used by millions of people around the world who want to explore new ideas and discover new opportunities.
Ubuntu is a community developed operating system that is perfect for laptops, desktops, servers, and cloud. It is used by millions of people around the world who want to explore new ideas and discover new opportunities.
- Test DTM with a script.
- Prepare a JSON file `input_data.json` containing the input data as below (more requests are acceptable).
[
{
"input": "What is Ubuntu operation system?",
"instruction": "Answer the question shortly.",
"output": " "
}
]
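The dataset file can also be generated programmatically, which is convenient when batching many requests; a minimal sketch:

```python
import json

# Write a minimal dataset file for speculative_decoding_test.py.
# Field names follow the example above; append more entries to batch requests.
requests = [
    {
        "input": "What is Ubuntu operation system?",
        "instruction": "Answer the question shortly.",
        "output": " ",
    }
]
with open("input_data.json", "w") as f:
    json.dump(requests, f, indent=4)
```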
- Use the command below to launch the test.
### Use BLS speculative decoding
python3 tools/inflight_batcher_llm/speculative_decoding_test.py \
--max-input-len 2500 \
--dataset input_data.json \
--url-control=localhost:8001 \
--url-target=localhost:8001 \
--url-draft=localhost:8001 \
--draft-tensorrt-llm-model-name="${DRAFT_MODEL_NAME}" \
--target-tensorrt-llm-model-name="${TARGET_MODEL_NAME}" \
--bls-speculative-tensorrt-llm-model-name="tensorrt_llm_bls" \
--execute-bls-speculative-decoding \
--disable-output-comparison \
--num-draft-tokens=4 \
--use-draft-logits
### Use client-side speculative decoding
python3 tools/inflight_batcher_llm/speculative_decoding_test.py \
--max-input-len 2500 \
--dataset input_data.json \
--url-control=localhost:8001 \
--url-target=localhost:8001 \
--url-draft=localhost:8001 \
--draft-tensorrt-llm-model-name="${DRAFT_MODEL_NAME}" \
--target-tensorrt-llm-model-name="${TARGET_MODEL_NAME}" \
--bls-speculative-tensorrt-llm-model-name="tensorrt_llm_bls" \
--disable-output-comparison \
--num-draft-tokens=4 \
--use-draft-logits
- You should see results like the following if everything goes smoothly.
Ubuntu is a free and open source operating system. It is a Linux based operating system. ...
- Stop the Triton Inference Server after all work is done.
pkill tritonserver
- In addition, it appears that better performance can be achieved with both the draft and target engines deployed on a single GPU (for example, llama-7B-FP8 + llama-30B-FP8, about 40 GiB in total on one H100-80GiB GPU).
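A back-of-envelope check of that sizing (FP8 stores roughly one byte per parameter, weights only; KV cache and activations add overhead on top):

```python
def fp8_weight_gib(num_params: float) -> float:
    """Approximate FP8 weight footprint: about one byte per parameter."""
    return num_params / 2**30

draft_gib = fp8_weight_gib(7e9)    # llama-7B draft
target_gib = fp8_weight_gib(30e9)  # llama-30B target
# ~34 GiB of weights alone; KV cache and activations push the total
# toward the ~40 GiB figure cited above.
total_gib = draft_gib + target_gib
```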
Usage of Tensor-Parallelism mode
- In this example, we use a draft engine with TP=1 and a target engine with TP=2 (both symmetric and asymmetric TP sizes are acceptable), placing the draft engine on GPU 0 and the target engine on GPUs 1 and 2.
- Edit model configuration.
export DRAFT_DEVICE_IDS="0"
export TARGET_DEVICE_IDS="1,2"
rm -rf ${TRITON_MODEL_REPO}
cp -r all_models/inflight_batcher_llm/ ${TRITON_MODEL_REPO}
cp -r ${TRITON_MODEL_REPO}/tensorrt_llm ${TRITON_MODEL_REPO}/tensorrt_llm_draft
sed -i 's/name: "tensorrt_llm"/name: "tensorrt_llm_draft"/g' ${TRITON_MODEL_REPO}/tensorrt_llm_draft/config.pbtxt
python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/ensemble/config.pbtxt triton_max_batch_size:4,logits_datatype:TYPE_FP32
python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/preprocessing/config.pbtxt triton_max_batch_size:4,tokenizer_dir:${HF_MODEL},preprocessing_instance_count:1
python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/postprocessing/config.pbtxt triton_max_batch_size:4,tokenizer_dir:${HF_MODEL},postprocessing_instance_count:1
python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:4,decoupled_mode:False,logits_datatype:TYPE_FP32,bls_instance_count:1,accumulate_tokens:False,tensorrt_llm_model_name:${TARGET_MODEL_NAME},tensorrt_llm_draft_model_name:${DRAFT_MODEL_NAME}
python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/tensorrt_llm/config.pbtxt triton_max_batch_size:4,decoupled_mode:False,logits_datatype:TYPE_FP32,triton_backend:tensorrtllm,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,encoder_input_features_data_type:TYPE_FP16,engine_dir:${TARGET_ENGINE_PATH}
python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/tensorrt_llm_draft/config.pbtxt triton_max_batch_size:4,decoupled_mode:False,logits_datatype:TYPE_FP32,triton_backend:tensorrtllm,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,encoder_input_features_data_type:TYPE_FP16,engine_dir:${DRAFT_ENGINE_PATH}
sed -i 's/\${gpu_device_ids}/'"${DRAFT_DEVICE_IDS}"'/g' ${TRITON_MODEL_REPO}/tensorrt_llm_draft/config.pbtxt
sed -i 's/\${gpu_device_ids}/'"${TARGET_DEVICE_IDS}"'/g' ${TRITON_MODEL_REPO}/tensorrt_llm/config.pbtxt
- As you can see, the only difference is `gpu_device_ids`, which must be patched manually with `sed`, since commas are not supported by `python3 tools/fill_template.py`.
- Start the Triton Inference Server.
  - Use `--multi-model` to enable orchestrator mode in the TP>1 scenario. See the model configuration for more information.
python3 scripts/launch_triton_server.py \
--model_repo=${TRITON_MODEL_REPO} \
--tensorrt_llm_model_name "tensorrt_llm,tensorrt_llm_draft" \
--multi-model
- All other operations are the same as in the `Simple deploy` part.
Usage of Fast logits D2D transfer
- Fast logits, supported since TensorRT-LLM-0.15.0, boosts performance (TPS) by hiding the latency of transferring logits from the draft engine to the target engine.
- In this example, we use a draft engine with TP=1 and a target engine with TP=2 (both symmetric and asymmetric TP sizes are acceptable), placing the draft engine on GPU 0 and the target engine on GPUs 1 and 2.
- For `participant_ids`, rank 0 is reserved for the orchestrator; ranks 1 to tp_size_draft are for the draft engine; ranks tp_size_draft+1 to tp_size_draft+tp_size_target are for the target engine.
- Edit model configuration.
export DRAFT_DEVICE_IDS="0"
export TARGET_DEVICE_IDS="1,2"
export DRAFT_PARTICIPANT_IDS="1"
export TARGET_PARTICIPANT_IDS="2,3"
cd tensorrtllm_backend
rm -rf ${TRITON_MODEL_REPO}
cp -r all_models/inflight_batcher_llm/ ${TRITON_MODEL_REPO}
cp -r ${TRITON_MODEL_REPO}/tensorrt_llm ${TRITON_MODEL_REPO}/tensorrt_llm_draft
sed -i 's/name: "tensorrt_llm"/name: "tensorrt_llm_draft"/g' ${TRITON_MODEL_REPO}/tensorrt_llm_draft/config.pbtxt
python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/ensemble/config.pbtxt triton_max_batch_size:4,logits_datatype:TYPE_FP32
python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/preprocessing/config.pbtxt triton_max_batch_size:4,tokenizer_dir:${HF_MODEL},preprocessing_instance_count:1
python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/postprocessing/config.pbtxt triton_max_batch_size:4,tokenizer_dir:${HF_MODEL},postprocessing_instance_count:1
python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:4,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False,tensorrt_llm_model_name:${TARGET_MODEL_NAME},logits_datatype:TYPE_FP32,tensorrt_llm_draft_model_name:${DRAFT_MODEL_NAME}
python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/tensorrt_llm/config.pbtxt triton_max_batch_size:4,triton_backend:tensorrtllm,decoupled_mode:False,max_beam_width:1,engine_dir:${TARGET_ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,encoder_input_features_data_type:TYPE_FP16,logits_datatype:TYPE_FP32,gpu_device_ids:${TARGET_DEVICE_IDS},participant_ids:2,3,speculative_decoding_fast_logits:1
python3 tools/fill_template.py -i ${TRITON_MODEL_REPO}/tensorrt_llm_draft/config.pbtxt triton_max_batch_size:4,triton_backend:tensorrtllm,decoupled_mode:False,max_beam_width:1,engine_dir:${DRAFT_ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,encoder_input_features_data_type:TYPE_FP16,logits_datatype:TYPE_FP32,gpu_device_ids:${DRAFT_DEVICE_IDS},participant_ids:1,speculative_decoding_fast_logits:1
sed -i 's/\${gpu_device_ids}/'"${DRAFT_DEVICE_IDS}"'/g' ${TRITON_MODEL_REPO}/tensorrt_llm_draft/config.pbtxt
sed -i 's/\${participant_ids}/'"${DRAFT_PARTICIPANT_IDS}"'/g' ${TRITON_MODEL_REPO}/tensorrt_llm_draft/config.pbtxt
sed -i 's/\${gpu_device_ids}/'"${TARGET_DEVICE_IDS}"'/g' ${TRITON_MODEL_REPO}/tensorrt_llm/config.pbtxt
sed -i 's/\${participant_ids}/'"${TARGET_PARTICIPANT_IDS}"'/g' ${TRITON_MODEL_REPO}/tensorrt_llm/config.pbtxt
- As you can see, the differences are `participant_ids` and `speculative_decoding_fast_logits`.
- Start the Triton Inference Server.
  - Use `--disable-spawn-processes` to enable the pre-spawn variant of orchestrator mode.
  - `--world_size` must be equal to `1 + tp_size_draft + tp_size_target`, which is 4 in this example.
python3 scripts/launch_triton_server.py \
--model_repo ${TRITON_MODEL_REPO} \
--tensorrt_llm_model_name tensorrt_llm,tensorrt_llm_draft \
--multi-model \
--world_size 4 \
--disable-spawn-processes
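The rank layout rule stated above, and the resulting `--world_size`, can be written out explicitly (a small sketch of the arithmetic, not an API of the launcher):

```python
def orchestrator_world_size(tp_size_draft: int, tp_size_target: int):
    """Rank layout for fast-logits orchestrator mode.

    Rank 0 is the orchestrator, ranks 1..tp_size_draft drive the draft
    engine, and the remaining ranks drive the target engine.
    """
    draft_ranks = list(range(1, 1 + tp_size_draft))
    target_ranks = list(range(1 + tp_size_draft,
                              1 + tp_size_draft + tp_size_target))
    world_size = 1 + tp_size_draft + tp_size_target
    return world_size, draft_ranks, target_ranks
```

For this example, `orchestrator_world_size(1, 2)` yields world size 4 with draft rank `[1]` and target ranks `[2, 3]`, matching `DRAFT_PARTICIPANT_IDS` and `TARGET_PARTICIPANT_IDS` above.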
- All other operations are the same as in the `Simple deploy` part.
Additional information
- With fast logits enabled and the optimization tips in the model configuration applied, speculative decoding with draft logits achieves roughly 2.x throughput at batch size 1 and 1.x throughput at batch size 16 compared to auto-regressive decoding, using a Llama 3.2 1B draft model and a Llama 3.1 70B target model.
- Neither streaming mode nor batched-request mode is supported in DTM yet.