[refactor] Unify name of NGram speculative decoding (#5937)

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
wili 2025-07-19 12:59:57 +08:00 committed by GitHub
parent 152e2df43b
commit 82d3587bb8
15 changed files with 140 additions and 143 deletions

View File

@ -3,7 +3,7 @@
- [About Speculative Sampling](#about-speculative-sampling)
- [Performance Improvements](#Performance-improvements)
- [Draft-Target-Model](#Draft-Target-Model)
- [Prompt-Lookup-Decoding](#prompt-lookup-decoding)
- [NGram](#ngram)
- [Medusa](#medusa)
- [Medusa Tree](#medusa-tree)
- [Using Medusa with TensorRT-LLM](#using-medusa-with-tensorrt-llm)
@ -36,7 +36,7 @@ TensorRT-LLM supports several approaches for generating draft tokens, including:
1. [Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads paper](https://arxiv.org/abs/2401.10774).
2. [Recurrent Drafter for Fast Speculative Decoding in Large Language Models](https://arxiv.org/html/2403.09919v1).
3. [EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty](https://arxiv.org/pdf/2401.15077).
3. Utilizing prompt tokens as draft tokens. For more information, refer to [Prompt Lookup Decoding](https://github.com/apoorvumang/prompt-lookup-decoding/).
3. Utilizing prompt tokens as draft tokens. For more information, refer to [NGram](https://github.com/apoorvumang/prompt-lookup-decoding/).
4. Utilizing Jacobi-like decoding to predict and verify draft tokens using the same model which does not need additional fine-tuning. Refer to [Break the Sequential Dependency of LLM Inference Using Lookahead Decoding](https://arxiv.org/pdf/2402.02057).
@ -62,13 +62,13 @@ Subsequently, the prompt, now updated with the accepted tokens, is sent back to
This iterative process continues until predefined stop conditions are met.
An example of this orchestration process can be found in the [TensorRT-LLM Triton backend](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/inflight_batcher_llm/client/e2e_grpc_speculative_decoding_client.py).
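As a rough illustration of that orchestration loop, here is a minimal sketch; `speculative_decode`, `generate_draft`, and `verify_with_target` are hypothetical helpers standing in for calls to the draft and target engines, not TensorRT-LLM APIs.
```python
# Minimal sketch of the Draft-Target-Model orchestration loop.
# Assumption: `generate_draft` and `verify_with_target` are hypothetical
# wrappers around the draft and target engines.
def speculative_decode(prompt_ids, generate_draft, verify_with_target,
                       draft_len=4, max_new_tokens=256, end_id=2):
    tokens = list(prompt_ids)
    generated = 0
    while generated < max_new_tokens:
        draft = generate_draft(tokens, draft_len)      # cheap candidate tokens
        accepted = verify_with_target(tokens, draft)   # accepted prefix + one target token
        tokens.extend(accepted)
        generated += len(accepted)
        if end_id in accepted:                         # stop condition
            break
    return tokens
```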
We currently provide two ways to run Draft-Target-Model: using TensorRT-LLM-BLS in the Triton Inference Server, or using TensorRT-LLM directly. Detailed running steps can be found in [examples/draft_target_model/README.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/draft_target_model/README.md) and the code can be found in [examples/prompt_lookup/run_dtm_pld.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/prompt_lookup/run_dtm_pld.py).
We currently provide two ways to run Draft-Target-Model: using TensorRT-LLM-BLS in the Triton Inference Server, or using TensorRT-LLM directly. Detailed running steps can be found in [examples/draft_target_model/README.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/draft_target_model/README.md) and the code can be found in [examples/ngram/run_dtm_ngram.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ngram/run_dtm_ngram.py).
## Prompt-Lookup-Decoding
## NGram
Prompt-Lookup speculative decoding directly copies tokens from the input prompt and previously generated output to serve as draft tokens while generating subsequent output. It works like Draft-Target-Model but involves only one target LLM, without further fine-tuning. Prompt-Lookup benefits from scenarios with high n-gram overlap between the input prompt and the output, such as summarization, document QA, multi-turn chat, and code editing.
NGram speculative decoding directly copies tokens from the input prompt and previously generated output to serve as draft tokens while generating subsequent output. It works like Draft-Target-Model but involves only one target LLM, without further fine-tuning. NGram benefits from scenarios with high n-gram overlap between the input prompt and the output, such as summarization, document QA, multi-turn chat, and code editing.
See the documentation in [examples/prompt_lookup/README.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/prompt_lookup/README.md) and the code in [examples/prompt_lookup/run_dtm_pld.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/prompt_lookup/run_dtm_pld.py).
See the documentation in [examples/ngram/README.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ngram/README.md) and the code in [examples/ngram/run_dtm_ngram.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ngram/run_dtm_ngram.py).
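To make the matching idea concrete, the following is a minimal, self-contained sketch; the function name `propose_ngram_draft` and the details are illustrative, not the implementation in `run_dtm_ngram.py`.
```python
def propose_ngram_draft(sequence, max_matching_ngram_size=2, max_draft_len=4):
    """Match the tail n-gram of `sequence` against earlier positions and
    return the tokens that followed the match as draft tokens."""
    for size in range(min(max_matching_ngram_size, len(sequence) - 1), 0, -1):
        pattern = tuple(sequence[-size:])
        # Scan from the most recent occurrence backwards, excluding the tail itself.
        for start in range(len(sequence) - size - 1, -1, -1):
            if tuple(sequence[start:start + size]) == pattern:
                follow = sequence[start + size:start + size + max_draft_len]
                if follow:
                    return follow
    return []  # no match: fall back to normal generation, one token per iteration

# The tail [5, 6] also appears earlier, so the tokens after it become the draft.
assert propose_ngram_draft([1, 5, 6, 7, 8, 2, 5, 6], 2, 2) == [7, 8]
```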
## Medusa

View File

@ -40,9 +40,10 @@ python3 quickstart_multimodal.py --model_dir Efficient-Large-Model/NVILA-8B --mo
python3 quickstart_advanced.py \
--model_dir meta-llama/Llama-3.1-8B-Instruct \
--spec_decode_algo NGRAM \
--max_matching_ngram_size=2 \
--spec_decode_nextn=4 \
--disable_overlap_scheduler
--spec_decode_nextn 4 \
--max_matching_ngram_size 2 \
--disable_overlap_scheduler \
--disable_kv_cache_reuse
```
```bash
@ -52,6 +53,6 @@ python3 quickstart_advanced.py \
--spec_decode_algo draft_target \
--spec_decode_nextn 5 \
--draft_model_dir meta-llama/Llama-3.2-1B-Instruct \
--disable_overlap_scheduler
--disable_overlap_scheduler \
--disable_kv_cache_reuse
```

View File

@ -1,17 +1,17 @@
# Prompt-Lookup Speculative Decoding
# NGram Speculative Decoding
This document shows how to build and run a model using Prompt-Lookup speculative decoding (supported as `ASSISTED_GENERATION` in transformers and vLLM, source: [GitHub](https://github.com/apoorvumang/prompt-lookup-decoding/tree/main)) in TensorRT-LLM on a single GPU, or on a single node with multiple GPUs.
This document shows how to build and run a model using NGram speculative decoding (supported as `ASSISTED_GENERATION` in transformers and vLLM, source: [GitHub](https://github.com/apoorvumang/prompt-lookup-decoding/tree/main)) in TensorRT-LLM on a single GPU, or on a single node with multiple GPUs.
## Overview
We currently provide two workflow styles for running Prompt-Lookup, named V1 and V2. V1 uses the TensorRT workflow and is similar to the Draft-Target-Model workflow: it runs in orchestrator mode and calls `runner.generate()` multiple times to get outputs, which is more flexible for customization but adds slightly more overhead. V2 uses the PyTorch workflow and is similar to the Lookahead workflow: it runs in leader mode and calls `runner.generate()` only once to get outputs, which provides higher performance but a fixed process.
We currently provide two workflow styles for running NGram, named V1 and V2. V1 uses the TensorRT workflow and is similar to the Draft-Target-Model workflow: it runs in orchestrator mode and calls `runner.generate()` multiple times to get outputs, which is more flexible for customization but adds slightly more overhead. V2 uses the PyTorch workflow and is similar to the Lookahead workflow: it runs in leader mode and calls `runner.generate()` only once to get outputs, which provides higher performance but a fixed process.
The Prompt-Lookup approach has 3 additional hyperparameters that you need to specify to control the generation process:
- `prompt_lookup_num_tokens`: the maximum number of tokens provided as draft tokens in one iteration, usually 4 to 10 in common usage (default value: 4). Empirically, a larger value gives a higher acceptance rate but also higher overhead, so the right balance needs to be found for the given model and application scenario.
The NGram approach has 3 additional hyperparameters that you need to specify to control the generation process:
- `max_draft_len`: the maximum number of tokens provided as draft tokens in one iteration, usually 4 to 10 in common usage (default value: 4). Empirically, a larger value gives a higher acceptance rate but also higher overhead, so the right balance needs to be found for the given model and application scenario.
- `max_matching_ngram_size`: the maximum number of tokens taken from the tail of the input prompt or generated output and used as a pattern to search for corresponding draft tokens (default value: 2). Empirically, a larger value matches more precise context from the existing sequence, indicating a higher acceptance rate, but it also increases the probability of a mismatch, which falls back to normal generation (one token per iteration), and the overhead.
- `device_list`: the index list of the device(s) used to run the model in the V1 workflow. Its length must be the same as the TP size of the draft model engine. For instance, `device_list=[0]` means using tp_size=1 and GPU 0 for the model, and `device_list=[4,5,6,7]` means using tp_size=4 and GPUs 4 to 7 for the model. This parameter is not needed in the V2 workflow.
+ For example, the process of getting draft tokens using `prompt_lookup_num_tokens=2` and `max_matching_ngram_size=4` with a sentence `prefix=[..., t1, t2, t3, t4]` is as follows:
+ For example, the process of getting draft tokens using `max_draft_len=2` and `max_matching_ngram_size=4` with a sentence `prefix=[..., t1, t2, t3, t4]` is as follows:
```Python
pattern = prefix[-2:]  # pattern=[t3, t4] (length=2)
@ -40,9 +40,9 @@ return None # No any candidate exists
+ We use the open-source `llama-v2-13B` model in this example.
+ `--use_paged_context_fmha=enable` must be specified since KV cache reuse is needed in this approach.
+ `--speculative_decoding_mode=draft_tokens_external` must be specified.
+ `--max_draft_len` must be specified to be greater than or equal to `prompt_lookup_num_tokens`.
+ `--prompt_lookup_config` is the corresponding configuration of Prompt-Lookup; its usage can be seen in [util.py](../util.py).
+ As an example, `[10,2,[0]]` means `prompt_lookup_num_tokens=10`, `max_matching_ngram_size=2`, and the target model runs on `GPU0`.
+ `--max_draft_len` must be specified as the maximum length of the draft tokens.
+ `--ngram_config` is the corresponding configuration of NGram; its usage can be seen in [util.py](../util.py).
+ As an example, `[10,2,[0]]` means `max_draft_len=10`, `max_matching_ngram_size=2`, and the target model runs on `GPU0` (see the parsing sketch after this list).
+ `--kv_cache_enable_block_reuse` must be specified for this approach.
+ Only CPP session is supported, so `--use_py_session` must not be specified.
+ `--num_beams` cannot be set larger than 1 since beam search is not supported in this approach yet.
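For reference, here is a minimal sketch of how a `"[10,2,[0]]"` string can be unpacked, mirroring the `ast.literal_eval` calls used in `run_dtm_ngram.py` and `summarize.py`; the helper name `parse_ngram_config` is illustrative and not part of the repository.
```python
import ast

def parse_ngram_config(config_str: str):
    # "[10,2,[0]]" -> max_draft_len=10, max_matching_ngram_size=2, device_list=[0]
    max_draft_len, max_matching_ngram_size, device_list = ast.literal_eval(config_str)
    assert max_draft_len > 0 and max_matching_ngram_size > 0
    return max_draft_len, max_matching_ngram_size, device_list

print(parse_ngram_config("[10,2,[0]]"))  # (10, 2, [0])
```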
@ -50,29 +50,29 @@ return None # No any candidate exists
```bash
# Build engine
python3 examples/models/core/llama/convert_checkpoint.py \
--model_dir=<Path To Llama-v2-13B repo> \
--output_dir=./ckpt-target \
--dtype=float16
--model_dir <Path To Llama-v2-13B repo> \
--output_dir ./ckpt-target \
--dtype float16
trtllm-build \
--checkpoint_dir=./ckpt-target \
--output_dir=./target-engine \
--gemm_plugin=float16 \
--use_paged_context_fmha=enable \
--speculative_decoding_mode=draft_tokens_external \
--max_draft_len=10 \
--max_batch_size=4 \
--max_input_len=3200 \
--max_seq_len=4800
--checkpoint_dir ./ckpt-target \
--output_dir ./target-engine \
--gemm_plugin float16 \
--use_paged_context_fmha enable \
--speculative_decoding_mode draft_tokens_external \
--max_draft_len 10 \
--max_batch_size 4 \
--max_input_len 3200 \
--max_seq_len 4800
# Run decoding
python3 examples/run.py \
--tokenizer_dir <Path To Llama-v2-7B repo> \
--engine_dir ./target-engine \
--prompt_lookup_config="[10,2,[0]]" \
--max_output_len=256 \
--ngram_config "[10,2,[0]]" \
--max_output_len 256 \
--kv_cache_enable_block_reuse \
--input_text="How does Draft-Sampling work?"
--input_text "How does Draft-Sampling work?"
# Run summarization tasks
python examples/summarize.py \
@ -81,8 +81,8 @@ python examples/summarize.py \
--check_accuracy \
--hf_model_dir <Path To Llama-v2-7B repo> \
--engine_dir ./target-engine \
--batch_size=1 \
--prompt_lookup_config="[10,2,[0]]" \
--batch_size 1 \
--ngram_config "[10,2,[0]]" \
--kv_cache_enable_block_reuse
```
@ -90,6 +90,8 @@ python examples/summarize.py \
```bash
python3 examples/llm-api/quickstart_advanced.py \
--max_matching_ngram_size=2 \
--spec_decode_nextn=4
--spec_decode_nextn 4 \
--max_matching_ngram_size 2 \
--disable_overlap_scheduler \
--disable_kv_cache_reuse
```

View File

@ -23,12 +23,12 @@ from tensorrt_llm.logger import logger
from tensorrt_llm.runtime import ModelRunnerCpp
class PLDPool: # Ngrams pool for Prompt-Lookup-Decoding
class NgramPool: # Ngrams pool for Ngram
def __init__(
self,
input_batch_size: int,
prompt_lookup_num_tokens: int,
max_draft_len: int,
max_matching_ngram_size: int,
end_id: int,
max_seq_len: list[int],
@ -36,7 +36,7 @@ class PLDPool: # Ngrams pool for Prompt-Lookup-Decoding
is_use_oldest: bool = True,
):
self.input_batch_size = input_batch_size
self.prompt_lookup_num_tokens = prompt_lookup_num_tokens
self.max_draft_len = max_draft_len
self.max_matching_ngram_size = max_matching_ngram_size
self.end_id = end_id
self.max_seq_len = max_seq_len
@ -45,7 +45,7 @@ class PLDPool: # Ngrams pool for Prompt-Lookup-Decoding
self.pool = [{} for _ in range(input_batch_size)]
self.start_index = [0 for _ in range(input_batch_size)]
assert self.prompt_lookup_num_tokens > 0, f"prompt_lookup_num_tokens must be greater than 0, but got {self.prompt_lookup_num_tokens}"
assert self.max_draft_len > 0, f"max_draft_len must be greater than 0, but got {self.max_draft_len}"
assert self.max_matching_ngram_size > 0, f"max_matching_ngram_size must be greater than 0, but got {self.max_matching_ngram_size}"
def print_pool(self):
@ -82,16 +82,15 @@ class PLDPool: # Ngrams pool for Prompt-Lookup-Decoding
-1):
# Find each possible key-value combination, and use tuple for hash
for l in range(len(sequence) - size):
r = min(l + size + self.prompt_lookup_num_tokens,
len(sequence))
r = min(l + size + self.max_draft_len, len(sequence))
key = tuple(sequence[l:l + size])
value = tuple(sequence[l + size:r])
if key not in self.pool[gbi] or not self.is_keep_all or \
len(self.pool[gbi][key][0]) < self.prompt_lookup_num_tokens:
len(self.pool[gbi][key][0]) < self.max_draft_len:
# Update the value if
# 1. the key does not exist
# 2. we only keep the newest one value for each key (MRU)
# 3. the length of the value saved before is less than `prompt_lookup_num_tokens`
# 3. the length of the value saved before is less than `max_draft_len`
self.pool[gbi][key] = OrderedSet((value, ))
elif value not in self.pool[gbi][key]:
# Extend the value if the key is already existed but count of values is not enough
@ -113,26 +112,26 @@ class PLDPool: # Ngrams pool for Prompt-Lookup-Decoding
break
draft_tokens.append(chosen_ids)
self.start_index[gbi] = max(
0, prefix_len[bi] - (self.prompt_lookup_num_tokens +
self.max_matching_ngram_size - 1))
0, prefix_len[bi] -
(self.max_draft_len + self.max_matching_ngram_size - 1))
return draft_tokens, None
def run_dtm_pld(batch_input_ids,
args,
runtime_rank,
end_id,
pad_id,
stop_words_list,
bad_words_list,
vocab_size,
*,
target_runner=None):
# `dtm` for Draft-Target-Model, `pld` for Prompt-Lookup-Decoding
def run_dtm_ngram(batch_input_ids,
args,
runtime_rank,
end_id,
pad_id,
stop_words_list,
bad_words_list,
vocab_size,
*,
target_runner=None):
# `dtm` for Draft-Target-Model, `ngram` for NGram
is_dtm = (args.draft_target_model_config is not None)
is_pld = (args.prompt_lookup_config is not None)
assert is_dtm ^ is_pld, "`--draft_target_model_config` and `--prompt_lookup_config` can not be specified at the same time."
is_ngram = (args.ngram_config is not None)
assert is_dtm ^ is_ngram, "`--draft_target_model_config` and `--ngram_config` can not be specified at the same time."
if is_dtm:
assert args.draft_engine_dir is not None, "`--draft_engine_dir` must be specified in Draft-Target-Model."
draft_len, draft_device_list, target_device_list, use_logits = ast.literal_eval(
@ -142,12 +141,11 @@ def run_dtm_pld(batch_input_ids,
logger.info(f"Device(s) for draft model: {draft_device_list}")
logger.info(f"Device(s) for target model: {target_device_list}")
logger.info(f"Use logits to accept tokens: {use_logits}")
if is_pld:
logger.info(
f"Using Prompt-Lookup-Decoding speculative decoding V1 workflow")
prompt_lookup_num_tokens, max_matching_ngram_size, target_device_list = ast.literal_eval(
args.prompt_lookup_config)
logger.info(f"prompt_lookup_num_tokens: {prompt_lookup_num_tokens}")
if is_ngram:
logger.info(f"Using NGram speculative decoding V1 workflow")
max_draft_len, max_matching_ngram_size, target_device_list = ast.literal_eval(
args.ngram_config)
logger.info(f"max_draft_len: {max_draft_len}")
logger.info(f"max_matching_ngram_size: {max_matching_ngram_size}")
logger.info(f"Device(s) for the model: {target_device_list}")
use_logits = False # `logits` is useless in this approach yet
@ -166,9 +164,9 @@ def run_dtm_pld(batch_input_ids,
n_draft_token = [0 for _ in range(input_batch_size)]
n_accept_token = [0 for _ in range(input_batch_size)]
if is_pld:
pld_pool = PLDPool(input_batch_size, prompt_lookup_num_tokens,
max_matching_ngram_size, end_id, max_seq_len)
if is_ngram:
ngram_pool = NgramPool(input_batch_size, max_draft_len,
max_matching_ngram_size, end_id, max_seq_len)
# Repack the output like the output of function `generate`
outputs = {}
@ -297,8 +295,8 @@ def run_dtm_pld(batch_input_ids,
if use_logits:
d_logits[bi] = draft["generation_logits"][bi, 0,
-d_len[bi]:, :]
if is_pld:
d_ids, d_logits = pld_pool.get_draft_tokens(prefix, batch_slot)
if is_ngram:
d_ids, d_logits = ngram_pool.get_draft_tokens(prefix, batch_slot)
d_len = [len(i) for i in d_ids]
# Run target model
@ -310,8 +308,8 @@ def run_dtm_pld(batch_input_ids,
draft_logits_list=d_logits)
if is_dtm:
max_new_tokens = draft_len + 1
if is_pld:
max_new_tokens = prompt_lookup_num_tokens + 1
if is_ngram:
max_new_tokens = max_draft_len + 1
target_generation_kwargs.update(max_new_tokens=max_new_tokens)
target = target_runner.generate(**target_generation_kwargs)
torch.cuda.synchronize()

View File

@ -35,7 +35,7 @@ from tensorrt_llm.runtime import PYTHON_BINDINGS, ModelRunner
if PYTHON_BINDINGS:
from tensorrt_llm.runtime import ModelRunnerCpp
from prompt_lookup.run_dtm_pld import run_dtm_pld
from ngram.run_dtm_ngram import run_dtm_ngram
def parse_arguments(args=None):
@ -430,17 +430,17 @@ def main(args):
logger.info(f"Using {'Python' if args.use_py_session else 'C++'} session")
if args.draft_target_model_config is not None or args.prompt_lookup_config is not None:
# Speculative-Decoding of Draft-Target-Model (DTM) and Prompt-Lookup-Decoding (PLD)
# If the parameters of `runner_kwargs` and `runner.generate()` in the "else" branch change, the same change should be done for `examples/prompt_lookup/run_dtm_pld.py`
if args.draft_target_model_config is not None or args.ngram_config is not None:
# Speculative-Decoding of Draft-Target-Model (DTM) and NGram
# If the parameters of `runner_kwargs` and `runner.generate()` in the "else" branch change, the same change should be done for `examples/ngram/run_dtm_ngram.py`
assert args.kv_cache_enable_block_reuse, "`--kv_cache_enable_block_reuse` must be specified in speculative decoding."
assert not args.use_py_session, "`--use_py_session` is not supported in Speculative decoding."
assert not is_enc_dec, "Encoder-Decoder model is not supported in Speculative decoding."
assert args.num_beams == 1, "`--num_beams>1` is not supported in Speculative decoding."
outputs = run_dtm_pld(batch_input_ids, args, runtime_rank, end_id,
pad_id, stop_words_list, bad_words_list,
len(tokenizer))
outputs = run_dtm_ngram(batch_input_ids, args, runtime_rank, end_id,
pad_id, stop_words_list, bad_words_list,
len(tokenizer))
if not args.streaming: # Unpack runner from the return value in No-Streaming mode
outputs, runner = list(outputs)[0]

View File

@ -41,7 +41,7 @@ from tensorrt_llm.tools.ppl import ppl
if PYTHON_BINDINGS:
from tensorrt_llm.runtime import ModelRunnerCpp
from prompt_lookup.run_dtm_pld import run_dtm_pld
from ngram.run_dtm_ngram import run_dtm_ngram
def ensemble_mrope_params(batch_input_ids, max_position_embeddings,
@ -318,17 +318,17 @@ def main(args):
return [], [], [], {}
input_lengths = [x.size(0) for x in batch_input_ids]
if args.prompt_lookup_config is not None:
# Speculative decoding of Prompt-Lookup-Decoding (PLD)
outputs = run_dtm_pld(batch_input_ids,
args,
runtime_rank,
end_id,
pad_id,
stop_words_list,
bad_words_list,
tokenizer.vocab_size,
target_runner=runner)
if args.ngram_config is not None:
# Speculative decoding of NGram
outputs = run_dtm_ngram(batch_input_ids,
args,
runtime_rank,
end_id,
pad_id,
stop_words_list,
bad_words_list,
tokenizer.vocab_size,
target_runner=runner)
if not args.streaming: # Unpack runner from the return value in No-Streaming mode
outputs, runner = list(outputs)[0]
else: # Normal run
@ -596,18 +596,17 @@ def main(args):
args.lookahead_config
) == 3, "Lookahead needs [max_window_size, max_ngram_size, max_verification_set_size]"
runner_kwargs.update(lookahead_config=args.lookahead_config)
if args.prompt_lookup_config is not None:
if args.ngram_config is not None:
assert args.kv_cache_enable_block_reuse, "`--kv_cache_enable_block_reuse` must be specified in speculative decoding."
assert not args.use_py_session, "`--use_py_session` is not supported in Speculative decoding."
assert not is_enc_dec, "Encoder-Decoder model is not supported in Speculative decoding."
assert args.num_beams == 1, "`--num_beams>1` is not supported in Speculative decoding."
prompt_lookup_num_tokens, _, target_device_list = ast.literal_eval(
args.prompt_lookup_config)
args.max_output_len = output_len # Specialization for PLD
max_draft_len, _, target_device_list = ast.literal_eval(
args.ngram_config)
args.max_output_len = output_len # Specialization for NGram
runner_kwargs.update(is_orchestrator_mode=True,
device_ids=target_device_list,
max_input_len=test_token_num +
prompt_lookup_num_tokens + output_len)
max_input_len=test_token_num + max_draft_len +
output_len)
runner = runner_cls.from_dir(**runner_kwargs)
assert not (args.eval_ppl and not runner.gather_context_logits), \

View File

@ -439,12 +439,12 @@ def add_common_args(parser):
" E.g.: [4, [0], [1], False] for [draft_len, draft_model_device_list, target_model_device_list, use_logits]."
)
parser.add_argument(
'--prompt_lookup_config',
'--ngram_config',
type=str,
default=None,
help=
"Configuration of Prompt-Lookup decoding, see `examples/prompt_lookup/README.md` for more information."
" E.g.: [10,2,[0]] for [prompt_lookup_num_tokens, max_matching_ngram_size, device_list].",
"Configuration of NGram decoding, see `examples/ngram/README.md` for more information."
" E.g.: [10,2,[0]] for [max_draft_len, max_matching_ngram_size, device_list].",
)
parser.add_argument(
'--medusa_choices',

View File

@ -124,7 +124,7 @@
"examples/test_draft_target_model.py::test_llm_draft_target_model_1gpu[streaming-gpt2-use_cpp_session-use_tokens-draft_len_4-float16-bs2]": 257.3995385244489,
"examples/test_enc_dec.py::test_llm_enc_dec_general[compare_hf-bart-large-cnn-float32-enable_gemm_plugin-enable_attention_plugin-enable_paged_kv_cache-tp:1-pp:1-nb:2-disable_fp8]": 276.10329104214907,
"examples/test_multimodal.py::test_llm_multimodal_general[llava-v1.6-mistral-7b-hf-vision-trtllm-pp:1-tp:1-float16-bs:1-cpp_e2e:False-nb:1]": 306.38610201328993,
"examples/test_prompt_lookup.py::test_llm_prompt_lookup_1gpu[streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-prompt_lookup_num_tokens_8-float16-bs2]": 195.90045699477196,
"examples/test_ngram.py::test_llm_ngram_1gpu[streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-max_draft_len_8-float16-bs2]": 195.90045699477196,
"test_unittests.py::test_unittests_v2[unittest/trt/model/test_gpt.py -k \"partition2\"]": 357.6496359631419,
"accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=eagle-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False]": 413.903915906325,
"accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=eagle-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=True-torch_compile=False]": 143.841789112892,
@ -329,7 +329,7 @@
"examples/test_gpt.py::test_llm_gpt2_medium_stop_words_1gpu[non_streaming-use_py_session]": 194.89357279613614,
"examples/test_granite.py::test_llm_granite[granite-3.0-2b-instruct-bfloat16]": 155.801738537848,
"examples/test_llama.py::test_llm_llama_v2_1gpu_auto_parallel[llama-v2-7b-hf]": 535.973838724196,
"examples/test_prompt_lookup.py::test_llm_prompt_lookup_1gpu[no_streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-prompt_lookup_num_tokens_8-float16-bs2]": 196.1214354224503,
"examples/test_ngram.py::test_llm_ngram_1gpu[no_streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-max_draft_len_8-float16-bs2]": 196.1214354224503,
"examples/test_recurrentgemma.py::test_llm_recurrentgemma_1gpu[use_cpp_session-recurrentgemma-2b-use_paged_cache-int4_awq-float16-enable_attn_plugin-enable_gemm_plugin]": 648.7579195387661,
"accuracy/test_cli_flow.py::TestLlama3_2_1B::test_smooth_quant_ootb": 457.93785213679075,
"accuracy/test_cli_flow.py::TestLlama3_2_1B::test_smooth_quant_ootb_manage_weights": 216.66169160604477,

View File

@ -308,7 +308,7 @@ def convert_weights(llm_venv,
f"--dtype={data_type}",
]
elif "prompt_lookup" in model:
elif "ngram" in model:
if "gpt" in model_path:
example_name = "gpt"
elif "llama" in model_path:

View File

@ -487,9 +487,9 @@ def draft_target_model_example_root(llm_root, llm_venv):
@pytest.fixture(scope="module")
def prompt_lookup_example_root(llm_root, llm_venv):
"Get Prompt-Lookup example root"
example_root = os.path.join(llm_root, "examples", "prompt_lookup")
def ngram_example_root(llm_root, llm_venv):
"Get NGram example root"
example_root = os.path.join(llm_root, "examples", "ngram")
llm_venv.run_cmd([
"-m", "pip", "install", "-r",
os.path.join(example_root, "requirements.txt")
@ -1084,7 +1084,7 @@ def draft_target_model_roots(request):
@pytest.fixture(scope="function")
def prompt_lookup_root(request):
def ngram_root(request):
models_root = llm_models_root()
assert models_root, "Did you set LLM_MODELS_ROOT?"
if request.param == "gpt2":
@ -1094,7 +1094,7 @@ def prompt_lookup_root(request):
"llama-models-v2/llama-v2-13b-hf")
assert os.path.exists(
models_root
), f"Prompt-Lookup model path {models_root} does not exist under NFS LLM_MODELS_ROOT dir"
), f"NGram model path {models_root} does not exist under NFS LLM_MODELS_ROOT dir"
return models_root

View File

@ -22,36 +22,34 @@ from defs.conftest import skip_post_blackwell
from defs.trt_test_alternative import check_call
# TODO: remove skip after support prompt lookup on B200
# TODO: remove skip after support NGram on B200
@skip_post_blackwell
@pytest.mark.parametrize("batch_size", [1, 2], ids=['bs1', 'bs2'])
@pytest.mark.parametrize("data_type", ['float16'])
@pytest.mark.parametrize(
"prompt_lookup_num_tokens", [4, 8],
ids=['prompt_lookup_num_tokens_4', 'prompt_lookup_num_tokens_8'])
@pytest.mark.parametrize("max_draft_len", [4, 8],
ids=['max_draft_len_4', 'max_draft_len_8'])
@pytest.mark.parametrize(
"max_matching_ngram_size", [2, 4],
ids=['max_matching_ngram_size_2', 'max_matching_ngram_size_4'])
@pytest.mark.parametrize("use_logits", [False, True],
ids=['use_tokens', 'use_logits']) # useless yet
@pytest.mark.parametrize("use_py_session", [False], ids=["use_cpp_session"])
@pytest.mark.parametrize("prompt_lookup_root", ["gpt2"], indirect=True)
@pytest.mark.parametrize("ngram_root", ["gpt2"], indirect=True)
@pytest.mark.parametrize("streaming", [False, True],
ids=["no_streaming", "streaming"])
def test_llm_prompt_lookup_1gpu(batch_size, data_type, prompt_lookup_num_tokens,
max_matching_ngram_size, use_logits,
use_py_session, prompt_lookup_root, streaming,
prompt_lookup_example_root, llm_datasets_root,
llm_rouge_root, llm_venv, cmodel_dir,
engine_dir):
model_name = "prompt_lookup"
def test_llm_ngram_1gpu(batch_size, data_type, max_draft_len,
max_matching_ngram_size, use_logits, use_py_session,
ngram_root, streaming, ngram_example_root,
llm_datasets_root, llm_rouge_root, llm_venv, cmodel_dir,
engine_dir):
model_name = "ngram"
print("Build checkpoint ...")
model_dir = convert_weights(llm_venv=llm_venv,
example_root=prompt_lookup_example_root,
example_root=ngram_example_root,
cmodel_dir=cmodel_dir,
model=model_name,
model_path=prompt_lookup_root,
model_path=ngram_root,
data_type=data_type)
print("Build engines ...")
@ -72,7 +70,7 @@ def test_llm_prompt_lookup_1gpu(batch_size, data_type, prompt_lookup_num_tokens,
target_model_build_cmd.extend([
f"--output_dir={target_engine_dir}",
"--speculative_decoding_mode=draft_tokens_external",
f"--max_draft_len={prompt_lookup_num_tokens+1}",
f"--max_draft_len={max_draft_len+1}",
])
baseline_model_build_cmd = deepcopy(common_build_cmd)
baseline_model_build_cmd.extend([
@ -88,8 +86,8 @@ def test_llm_prompt_lookup_1gpu(batch_size, data_type, prompt_lookup_num_tokens,
print("Run inferences ...")
common_run_cmd = [
f"{prompt_lookup_example_root}/../run.py",
f"--tokenizer_dir={prompt_lookup_root}",
f"{ngram_example_root}/../run.py",
f"--tokenizer_dir={ngram_root}",
f"--max_output_len=64",
f"--kv_cache_enable_block_reuse",
f"--kv_cache_free_gpu_memory_fraction=0.25",
@ -105,11 +103,11 @@ def test_llm_prompt_lookup_1gpu(batch_size, data_type, prompt_lookup_num_tokens,
assert not use_py_session, "Only CPP session is supported in Draft-Target-Model."
run_cmd = deepcopy(common_run_cmd)
prompt_lookup_config = f"[{prompt_lookup_num_tokens},{max_matching_ngram_size},[0]]"
ngram_config = f"[{max_draft_len},{max_matching_ngram_size},[0]]"
run_cmd.extend([
f"--engine_dir={target_engine_dir}",
f"--prompt_lookup_config={prompt_lookup_config}",
f"--output_csv={engine_dir}/prompt_lookup_output.csv",
f"--ngram_config={ngram_config}",
f"--output_csv={engine_dir}/ngram_output.csv",
])
baseline_run_cmd = deepcopy(common_run_cmd)
baseline_run_cmd.extend([
@ -121,7 +119,7 @@ def test_llm_prompt_lookup_1gpu(batch_size, data_type, prompt_lookup_num_tokens,
venv_check_call(llm_venv, baseline_run_cmd)
print("Compare outputs ...")
with open(f"{engine_dir}/prompt_lookup_output.csv") as dt_f, open(
with open(f"{engine_dir}/ngram_output.csv") as dt_f, open(
f"{engine_dir}/baseline_output.csv") as b_f:
for bs, (dt_request,
b_request) in enumerate(zip(csv.reader(dt_f),
@ -138,20 +136,20 @@ def test_llm_prompt_lookup_1gpu(batch_size, data_type, prompt_lookup_num_tokens,
return
print("Run summarize...")
prompt_lookup_config = f"[{prompt_lookup_num_tokens},{max_matching_ngram_size},[0]]"
ngram_config = f"[{max_draft_len},{max_matching_ngram_size},[0]]"
run_cmd = [
f"{prompt_lookup_example_root}/../summarize.py",
f"{ngram_example_root}/../summarize.py",
"--test_hf",
"--test_trt_llm",
"--check_accuracy",
"--batch_size=1",
f"--hf_model_dir={prompt_lookup_root}",
f"--hf_model_dir={ngram_root}",
f"--engine_dir={target_engine_dir}",
f"--dataset_dir={llm_datasets_root}",
f"--rouge_dir={llm_rouge_root}",
"--kv_cache_enable_block_reuse",
f"--prompt_lookup_config={prompt_lookup_config}",
f"--ngram_config={ngram_config}",
"--tensorrt_llm_rouge1_threshold=20",
f"--kv_cache_free_gpu_memory_fraction=0.25",
]

View File

@ -97,10 +97,10 @@ examples/test_draft_target_model.py::test_llm_draft_target_model_1gpu[no_streami
examples/test_draft_target_model.py::test_llm_draft_target_model_1gpu[streaming-llama_v2-use_cpp_session-use_logits-draft_len_4-float16-bs2]
examples/test_draft_target_model.py::test_llm_draft_target_llama_1gpu
examples/test_draft_target_model.py::test_llm_draft_target_llama_fp8_2gpu
examples/test_prompt_lookup.py::test_llm_prompt_lookup_1gpu[no_streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-prompt_lookup_num_tokens_8-float16-bs1]
examples/test_prompt_lookup.py::test_llm_prompt_lookup_1gpu[no_streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-prompt_lookup_num_tokens_8-float16-bs2]
examples/test_prompt_lookup.py::test_llm_prompt_lookup_1gpu[streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-prompt_lookup_num_tokens_8-float16-bs1]
examples/test_prompt_lookup.py::test_llm_prompt_lookup_1gpu[streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-prompt_lookup_num_tokens_8-float16-bs2]
examples/test_ngram.py::test_llm_ngram_1gpu[no_streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-max_draft_len_8-float16-bs1]
examples/test_ngram.py::test_llm_ngram_1gpu[no_streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-max_draft_len_8-float16-bs2]
examples/test_ngram.py::test_llm_ngram_1gpu[streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-max_draft_len_8-float16-bs1]
examples/test_ngram.py::test_llm_ngram_1gpu[streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-max_draft_len_8-float16-bs2]
examples/test_internlm.py::test_llm_internlm2_7b_1node_1gpu[bfloat16-enable_context_fmha-enable_gemm_plugin-enable_attention_plugin-nb:2]
examples/test_llama.py::test_llm_llama_1gpu_streaming_llm[ailab-deepseek-coder-6.7b-instruct]
examples/test_llama.py::test_llm_llama_2gpu_fp8_summary[llama-7b-enable_reduce_fusion-disable_fp8_context_fmha_xqa]

View File

@ -108,7 +108,7 @@ l0_a30:
- examples/test_internlm.py::test_llm_internlm2_7b_1node_1gpu[bfloat16-enable_context_fmha-enable_gemm_plugin-enable_attention_plugin-nb:2] # 5 mins
- examples/test_draft_target_model.py::test_llm_draft_target_model_1gpu[streaming-gpt2-use_cpp_session-use_tokens-draft_len_4-float16-bs2] # 1 min
- examples/test_draft_target_model.py::test_llm_draft_target_model_1gpu[streaming-gpt2-use_cpp_session-use_logits-draft_len_4-float16-bs2] # 1 min
- examples/test_prompt_lookup.py::test_llm_prompt_lookup_1gpu[streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-prompt_lookup_num_tokens_8-float16-bs2] # 1 min
- examples/test_ngram.py::test_llm_ngram_1gpu[streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-max_draft_len_8-float16-bs2] # 1 min
- condition:
ranges:
system_gpu_count:
@ -159,7 +159,7 @@ l0_a30:
- examples/test_granite.py::test_llm_granite[granite-3.0-2b-instruct-bfloat16] # 5 mins
- examples/test_draft_target_model.py::test_llm_draft_target_model_1gpu[no_streaming-gpt2-use_cpp_session-use_tokens-draft_len_4-float16-bs2] # 1 min
- examples/test_draft_target_model.py::test_llm_draft_target_model_1gpu[no_streaming-gpt2-use_cpp_session-use_logits-draft_len_4-float16-bs2] # 1 min
- examples/test_prompt_lookup.py::test_llm_prompt_lookup_1gpu[no_streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-prompt_lookup_num_tokens_8-float16-bs2] # 1 min
- examples/test_ngram.py::test_llm_ngram_1gpu[no_streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-max_draft_len_8-float16-bs2] # 1 min
- condition:
ranges:
system_gpu_count:

View File

@ -381,7 +381,6 @@ accuracy/test_disaggregated_serving.py::TestLlama4ScoutInstruct::test_auto_dtype
full:B200/examples/test_gemma.py::test_llm_gemma_1gpu_summary_vswa[gemma-3-1b-it-other-bfloat16-8] SKIP (https://nvbugs/5292737)
full:B200/accuracy/test_llm_api_pytorch.py::TestGemma3_1BInstruct::test_auto_dtype SKIP (https://nvbugs/5295470)
examples/test_mistral.py::test_llm_mistral_v1_1gpu[mistral-7b-v0.1-float16-max_attention_window_size_4096-summarization_long] SKIP (https://nvbugs/5324976)
examples/test_prompt_lookup.py::test_llm_prompt_lookup_1gpu[no_streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-prompt_lookup_num_tokens_8-float16-bs1] SKIP (https://nvbugs/5344070)
examples/test_medusa.py::test_llm_medusa_with_qaunt_base_model_1gpu[fp8-use_py_session-medusa-vicuna-7b-v1.3-4-heads-float16-bs1] SKIP (https://nvbugs/5333849)
examples/test_multimodal.py::test_llm_multimodal_general[Llama-3.2-11B-Vision-pp:1-tp:1-bfloat16-bs:1-cpp_e2e:False-nb:1] SKIP (https://nvbugs/5333818)
examples/test_multimodal.py::test_llm_multimodal_general[Llama-3.2-11B-Vision-pp:1-tp:1-bfloat16-bs:8-cpp_e2e:False-nb:1] SKIP (https://nvbugs/5333818)