Prompt-Lookup Speculative Decoding
This document shows how to build and run a model with Prompt-Lookup speculative decoding (supported as ASSISTED_GENERATION in transformers and vLLM) in TensorRT-LLM on a single GPU, or on a single node with multiple GPUs.
Overview
Prompt-Lookup has three additional hyperparameters that you need to specify to control the generation process (a minimal sketch of the matching procedure follows this list):
- prompt_lookup_num_tokens: the number of tokens extracted from the input prompt or the previously generated output as draft tokens in one iteration; values from 4 to 10 are common. Empirically, a larger value gives a higher acceptance ratio but also higher overhead, so the right balance needs to be found for each model and application scenario.
- max_matching_ngram_size: the number of tokens taken from the tail of the generated output as a pattern, which is matched against the input prompt or the previously generated output. Empirically, a larger value matches a more precise context in the existing sequence, indicating a higher acceptance ratio, but it also raises the chance of a match miss and the overhead; on a miss, the model falls back to normal generation (one token per iteration).
- device_list: the list of device indices to run the model on. Its length must equal the TP size of the target model engine. For instance, device_list=[0] means tp_size=1 using GPU 0, and device_list=[4,5,6,7] means tp_size=4 using GPUs 4 through 7.
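To make the roles of prompt_lookup_num_tokens and max_matching_ngram_size concrete, the following is a minimal, self-contained sketch of the lookup step; the function and variable names are ours for illustration and do not reflect the TensorRT-LLM implementation.

# Illustrative sketch of the Prompt-Lookup drafting step; names are hypothetical
# and not the TensorRT-LLM implementation.
from typing import List

def propose_draft_tokens(tokens: List[int],
                         prompt_lookup_num_tokens: int,
                         max_matching_ngram_size: int) -> List[int]:
    """Match the tail n-gram of `tokens` (prompt + output so far) against the
    earlier part of the same sequence and return the tokens that followed the
    most recent match as draft tokens."""
    # Try the longest pattern first; longer patterns match a more precise context.
    for ngram_size in range(max_matching_ngram_size, 0, -1):
        if len(tokens) <= ngram_size:
            continue
        pattern = tokens[-ngram_size:]
        # Scan backwards so the most recent occurrence wins.
        for start in range(len(tokens) - ngram_size - 1, -1, -1):
            if tokens[start:start + ngram_size] == pattern:
                draft = tokens[start + ngram_size:
                               start + ngram_size + prompt_lookup_num_tokens]
                if draft:
                    return draft
    # No match: fall back to normal generation (no draft tokens this iteration).
    return []

# The tail bigram [3, 4] also occurs earlier, so the tokens that followed that
# occurrence are proposed as drafts.
print(propose_draft_tokens([1, 2, 3, 4, 5, 6, 7, 2, 3, 4], 3, 2))  # -> [5, 6, 7]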
Support Matrix
- GPU Compute Capability >= 8.0 (Ampere or newer)
- FP16 / BF16 / FP8
- Paged KV Cache
- Tensor Parallel
Usage
Build engines
- We use an open-source llama-v2-13B model in this example.
- --use_paged_context_fmha=enable must be specified, since this approach relies on KV cache reuse.
- --speculative_decoding_mode=draft_tokens_external must be specified.
- --max_draft_len must be specified and be greater than or equal to prompt_lookup_num_tokens.
cd examples/llama
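# Convert the Hugging Face checkpoint into a TensorRT-LLM checkpoint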
python3 convert_checkpoint.py \
--model_dir=<Path To Llama-v2-13B repo> \
--output_dir=./ckpt-target \
--dtype=float16
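# Build the target engine; paged context FMHA and external draft-token
# speculative decoding are required for Prompt-Lookup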
trtllm-build \
--checkpoint_dir=./ckpt-target \
--output_dir=./target-engine \
--gemm_plugin=float16 \
--use_paged_context_fmha=enable \
--speculative_decoding_mode=draft_tokens_external \
--max_draft_len=10 \
--max_batch_size=4 \
--max_input_len=3200 \
--max_seq_len=4800
Run decoding
- --prompt_lookup_config is the corresponding configuration of Prompt-Lookup; its usage can be seen in util.py. A minimal parsing sketch follows this list.
  - As an example, [10,2,[0]] means prompt_lookup_num_tokens=10, max_matching_ngram_size=2, and the target model runs on GPU 0.
- --kv_cache_enable_block_reuse must be specified for this approach.
- Only the C++ session is supported, so --use_py_session must not be specified.
- --num_beams cannot be set larger than 1, since beam search is not supported in this approach yet.
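As a rough illustration (not the shipped code; the actual handling lives in util.py), the configuration string is a bracketed triple that can be parsed with Python's literal evaluation:

# Hypothetical sketch: how a --prompt_lookup_config string such as "[10,2,[0]]"
# maps onto the three hyperparameters; the real parsing is implemented in util.py.
import ast

prompt_lookup_num_tokens, max_matching_ngram_size, device_list = ast.literal_eval("[10,2,[0]]")
print(prompt_lookup_num_tokens)  # 10
print(max_matching_ngram_size)   # 2
print(device_list)               # [0] -> the target model runs on GPU 0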
cd examples/llama
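# Run greedy decoding with Prompt-Lookup on the target engine (KV cache block reuse enabled)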
python3 ../run.py \
    --tokenizer_dir <Path To Llama-v2-13B repo> \
--engine_dir ./target-engine \
--prompt_lookup_config="[10,2,[0]]" \
--max_output_len=256 \
--kv_cache_enable_block_reuse \
--input_text="How does Draft-Sampling work?"
Run summarization tasks
cd examples/llama
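# Summarize with both the HF model and the TensorRT-LLM engine and check accuracy, with Prompt-Lookup enabled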
python ../summarize.py \
--test_hf \
--test_trt_llm \
--check_accuracy \
    --hf_model_dir <Path To Llama-v2-13B repo> \
--engine_dir ./target-engine \
--batch_size=1 \
--prompt_lookup_config="[10,2,[0]]" \
--kv_cache_enable_block_reuse