Prompt-Lookup Speculative Decoding

This document shows how to build and run a model using Prompt-Lookup speculative decoding (supported as ASSISTED_GENERATION in transformers and vLLM, source: GitHub) in TensorRT-LLM on a single GPU, or on a single node with multiple GPUs.

Overview

We currently provide two workflow styles to run Prompt-Lookup, named V1 and V2. V1 uses the TRT workflow and is similar to the Draft-Target-Model workflow: it runs in orchestrator mode and calls runner.generate() multiple times to produce outputs, which is more flexible for customization but adds slightly more overhead. V2 uses the PyTorch workflow and is similar to the Look-Ahead workflow: it runs in leader mode and calls runner.generate() only once to produce outputs, which provides higher performance but a fixed process.
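
Regardless of the workflow, the underlying procedure is the same: propose draft tokens by matching the tail n-gram of the current sequence against earlier text, then verify them with the target model. The following minimal, self-contained Python simulation illustrates this propose-and-verify loop. The helper names (find_draft_tokens, target_next_token, generate) and the toy word-level "model" are illustrative only and are not part of the TensorRT-LLM API; this sketch also searches the sequence directly instead of maintaining the n-gram pool described below.

def find_draft_tokens(tokens, max_matching_ngram_size=2, prompt_lookup_num_tokens=4):
    # Match the tail n-gram of the sequence against earlier positions and
    # return the tokens that followed the most recent match as draft tokens.
    for size in range(max_matching_ngram_size, 0, -1):
        pattern = tokens[-size:]
        for start in range(len(tokens) - size - 1, -1, -1):
            if tokens[start:start + size] == pattern:
                return tokens[start + size:start + size + prompt_lookup_num_tokens]
    return []

def target_next_token(tokens):
    # Toy stand-in for one decoding step of the target model: it simply
    # continues a fixed word sequence.
    reference = "the cat sat on the mat and the cat sat on the rug".split()
    return reference[len(tokens)] if len(tokens) < len(reference) else "<eos>"

def generate(prompt_tokens, max_new_tokens=8):
    tokens = list(prompt_tokens)
    produced = 0
    while produced < max_new_tokens:
        draft = find_draft_tokens(tokens)
        # Verify the draft tokens against the target model. In a real engine this
        # verification is a single batched forward pass over all draft positions.
        for d in draft:
            expected = target_next_token(tokens)
            tokens.append(expected)          # the target model's token is always kept
            produced += 1
            if expected != d or produced == max_new_tokens:
                break                        # stop at the first mismatch
        else:
            # Empty draft or all draft tokens accepted: emit one more target token.
            if produced < max_new_tokens:
                tokens.append(target_next_token(tokens))
                produced += 1
    return tokens

print(" ".join(generate("the cat sat on the".split())))
# -> the cat sat on the mat and the cat sat on the rug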

Prompt-Lookup has three additional hyperparameters that you need to specify to control the generation process:

  • prompt_lookup_num_tokens: the maximum number of tokens provided as draft tokens in one iteration, usually between 4 and 10 in common usage (default value: 4). Empirically, larger values yield a higher acceptance rate but also higher overhead, so the right balance needs to be found for your models and application scenarios.
  • max_matching_ngram_size: the maximum number of tokens taken from the tail of the input prompt or generated output and used as a pattern to search for corresponding draft tokens (default value: 2). Empirically, larger values match more specific context in the existing sequence, indicating a higher acceptance rate, but they also increase the probability of finding no match, which falls back to normal generation (one token per iteration) and adds overhead.
  • device_list: the list of device indices to run the model in the V1 workflow. Its length must equal the TP size of the target model engine. For instance, device_list=[0] means using tp_size=1 and GPU 0 for the model, and device_list=[4,5,6,7] means using tp_size=4 and GPUs 4 to 7 for the model. This parameter is not needed in the V2 workflow.
  • For example, the process of getting draft tokens using prompt_lookup_num_tokens=4 and max_matching_ngram_size=2 with a sequence prefix=[..., t1, t2, t3, t4] looks like the following:
pattern = tuple(prefix[-2:])                        # pattern=(t3, t4) (length=2)
if pattern in pool and len(pool[pattern]) == 4:     # assuming the pool contains {(t3, t4): (t5, t6, t7, t8)}
    return pool[pattern]                            # draft tokens = (t5, t6, t7, t8)
elif pattern in pool and len(pool[pattern]) < 4:    # assuming the pool contains {(t3, t4): (t9, t10, t11)}
    return pool[pattern]                            # draft tokens = (t9, t10, t11)
pattern = tuple(prefix[-1:])                        # Try a shorter pattern if no length-2 match exists, pattern=(t4,) (length=1)
if pattern in pool and len(pool[pattern]) == 4:     # The same process as above
    return pool[pattern]
elif pattern in pool and len(pool[pattern]) < 4:
    return pool[pattern]
return None                                         # No candidate exists
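
For completeness, here is a minimal sketch of how such a pool might be populated from a token sequence and then queried with the fallback logic above. The update_pool and get_draft_tokens helpers are illustrative only and do not reflect the actual pool maintenance inside TensorRT-LLM.

def update_pool(pool, sequence, max_matching_ngram_size=2, prompt_lookup_num_tokens=4):
    # Record, for every n-gram pattern in the sequence, the (up to
    # prompt_lookup_num_tokens) tokens that most recently followed it.
    for size in range(1, max_matching_ngram_size + 1):
        for start in range(len(sequence) - size):
            pattern = tuple(sequence[start:start + size])
            candidate = tuple(sequence[start + size:start + size + prompt_lookup_num_tokens])
            pool[pattern] = candidate
    return pool

def get_draft_tokens(pool, prefix, max_matching_ngram_size=2):
    # Try the longest tail pattern first, then fall back to shorter ones.
    for size in range(max_matching_ngram_size, 0, -1):
        pattern = tuple(prefix[-size:])
        if pattern in pool and len(pool[pattern]) > 0:
            return list(pool[pattern])
    return None

pool = update_pool({}, ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"])
print(get_draft_tokens(pool, ["t7", "t8", "t3", "t4"]))   # -> ['t5', 't6', 't7', 't8']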

Support Matrix

  • GPU Compute Capability >= 8.0 (Ampere or newer)
  • FP16 / BF16 / FP8
  • Paged KV Cache
  • Tensor Parallel

Usage

V1 workflow

  • We use an open-source Llama-v2-13B model in this example.
  • --use_paged_context_fmha=enable must be specified since KV cache reuse is required in this approach.
  • --speculative_decoding_mode=draft_tokens_external must be specified.
  • --max_draft_len must be set to a value larger than or equal to prompt_lookup_num_tokens.
  • --prompt_lookup_config is the corresponding Prompt-Lookup configuration; its usage can be seen in util.py (a minimal parsing sketch follows this list).
    • As an example, [10,2,[0]] means prompt_lookup_num_tokens=10, max_matching_ngram_size=2, and the target model runs on GPU 0.
  • --kv_cache_enable_block_reuse must be specified for this approach.
  • Only the C++ session is supported, so --use_py_session must not be specified.
  • --num_beams cannot be set larger than 1 since beam search is not supported in this approach yet.
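
As a minimal sketch of how a config string like "[10,2,[0]]" can be interpreted (the actual parsing lives in util.py; the helper below is hypothetical):

import ast

def parse_prompt_lookup_config(config_str):
    # "[10,2,[0]]" -> prompt_lookup_num_tokens=10, max_matching_ngram_size=2, device_list=[0]
    prompt_lookup_num_tokens, max_matching_ngram_size, device_list = ast.literal_eval(config_str)
    return prompt_lookup_num_tokens, max_matching_ngram_size, device_list

print(parse_prompt_lookup_config("[10,2,[0]]"))   # -> (10, 2, [0])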
# Build engine
python3 examples/models/core/llama/convert_checkpoint.py \
    --model_dir=<Path To Llama-v2-13B repo> \
    --output_dir=./ckpt-target \
    --dtype=float16

trtllm-build \
    --checkpoint_dir=./ckpt-target \
    --output_dir=./target-engine \
    --gemm_plugin=float16 \
    --use_paged_context_fmha=enable \
    --speculative_decoding_mode=draft_tokens_external \
    --max_draft_len=10 \
    --max_batch_size=4 \
    --max_input_len=3200 \
    --max_seq_len=4800

# Run decoding
python3 examples/run.py \
    --tokenizer_dir <Path To Llama-v2-13B repo> \
    --engine_dir ./target-engine \
    --prompt_lookup_config="[10,2,[0]]" \
    --max_output_len=256 \
    --kv_cache_enable_block_reuse \
    --input_text="How does Draft-Sampling work?"

# Run summarization tasks
python examples/summarize.py \
    --test_hf \
    --test_trt_llm \
    --check_accuracy \
    --hf_model_dir <Path To Llama-v2-13B repo> \
    --engine_dir ./target-engine \
    --batch_size=1 \
    --prompt_lookup_config="[10,2,[0]]" \
    --kv_cache_enable_block_reuse

V2 workflow

python3 examples/llm-api/quickstart_advanced.py \
    --max_matching_ngram_size=2 \
    --spec_decode_nextn=4