[refactor] Unify name of NGram speculative decoding (#5937)

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
wili 2025-07-19 12:59:57 +08:00 committed by GitHub
parent 152e2df43b
commit 82d3587bb8
15 changed files with 140 additions and 143 deletions

View File

@ -3,7 +3,7 @@
- [About Speculative Sampling](#about-speculative-sampling)
- [Performance Improvements](#Performance-improvements)
- [Draft-Target-Model](#Draft-Target-Model)
- [Prompt-Lookup-Decoding](#prompt-lookup-decoding)
- [NGram](#ngram)
- [Medusa](#medusa)
- [Medusa Tree](#medusa-tree)
- [Using Medusa with TensorRT-LLM](#using-medusa-with-tensorrt-llm)
@ -36,7 +36,7 @@ TensorRT-LLM supports several approaches for generating draft tokens, including:
1. [Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads paper](https://arxiv.org/abs/2401.10774).
2. [Recurrent Drafter for Fast Speculative Decoding in Large Language Models](https://arxiv.org/html/2403.09919v1).
3. [EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty](https://arxiv.org/pdf/2401.15077).
3. Utilizing prompt tokens as draft tokens. For more information, refer to [Prompt Lookup Decoding](https://github.com/apoorvumang/prompt-lookup-decoding/).
3. Utilizing prompt tokens as draft tokens. For more information, refer to [NGram](https://github.com/apoorvumang/prompt-lookup-decoding/).
4. Utilizing Jacobi-like decoding to predict and verify draft tokens using the same model which does not need additional fine-tuning. Refer to [Break the Sequential Dependency of LLM Inference Using Lookahead Decoding](https://arxiv.org/pdf/2402.02057).
@ -62,13 +62,13 @@ Subsequently, the prompt, now updated with the accepted tokens, is sent back to
This iterative process continues until predefined stop conditions are met.
An example of this orchestration process can be found in the [TensorRT-LLM Triton backend](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/inflight_batcher_llm/client/e2e_grpc_speculative_decoding_client.py).
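As a rough illustration of that orchestration loop, here is a minimal sketch; `speculative_decode`, `generate_draft`, and `verify_with_target` are hypothetical helpers standing in for calls to the draft and target engines, not TensorRT-LLM APIs.
```python
# Minimal sketch of the Draft-Target-Model orchestration loop.
# Assumption: `generate_draft` and `verify_with_target` are hypothetical
# wrappers around the draft and target engines.
def speculative_decode(prompt_ids, generate_draft, verify_with_target,
                       draft_len=4, max_new_tokens=256, end_id=2):
    tokens = list(prompt_ids)
    generated = 0
    while generated < max_new_tokens:
        draft = generate_draft(tokens, draft_len)      # cheap candidate tokens
        accepted = verify_with_target(tokens, draft)   # accepted prefix + one target token
        tokens.extend(accepted)
        generated += len(accepted)
        if end_id in accepted:                         # stop condition
            break
    return tokens
```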
We currently provide two ways to run Draft-Target-Model: using TensorRT-LLM-BLS in the Triton Inference Server, or using TensorRT-LLM directly. Detailed running steps can be found in [examples/draft_target_model/README.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/draft_target_model/README.md) and the code can be found in [examples/prompt_lookup/run_dtm_pld.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/prompt_lookup/run_dtm_pld.py).
We currently provide two ways to run Draft-Target-Model: using TensorRT-LLM-BLS in the Triton Inference Server, or using TensorRT-LLM directly. Detailed running steps can be found in [examples/draft_target_model/README.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/draft_target_model/README.md) and the code can be found in [examples/ngram/run_dtm_ngram.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ngram/run_dtm_ngram.py).
## Prompt-Lookup-Decoding
## NGram
Prompt-Lookup speculative decoding directly copies tokens from the input prompt and previously generated output to serve as draft tokens while generating subsequent output. It works like Draft-Target-Model but involves only one target LLM, without further fine-tuning. Prompt-Lookup benefits from scenarios with high n-gram overlap between the input prompt and the output, such as summarization, document QA, multi-turn chat, and code editing.
NGram speculative decoding directly copies tokens from the input prompt and previously generated output to serve as draft tokens while generating subsequent output. It works like Draft-Target-Model but involves only one target LLM, without further fine-tuning. NGram benefits from scenarios with high n-gram overlap between the input prompt and the output, such as summarization, document QA, multi-turn chat, and code editing.
See the documentation in [examples/prompt_lookup/README.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/prompt_lookup/README.md) and the code in [examples/prompt_lookup/run_dtm_pld.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/prompt_lookup/run_dtm_pld.py).
See the documentation in [examples/ngram/README.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ngram/README.md) and the code in [examples/ngram/run_dtm_ngram.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ngram/run_dtm_ngram.py).
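To make the matching idea concrete, the following is a minimal, self-contained sketch; the function name `propose_ngram_draft` and the details are illustrative, not the implementation in `run_dtm_ngram.py`.
```python
def propose_ngram_draft(sequence, max_matching_ngram_size=2, max_draft_len=4):
    """Match the tail n-gram of `sequence` against earlier positions and
    return the tokens that followed the match as draft tokens."""
    for size in range(min(max_matching_ngram_size, len(sequence) - 1), 0, -1):
        pattern = tuple(sequence[-size:])
        # Scan from the most recent occurrence backwards, excluding the tail itself.
        for start in range(len(sequence) - size - 1, -1, -1):
            if tuple(sequence[start:start + size]) == pattern:
                follow = sequence[start + size:start + size + max_draft_len]
                if follow:
                    return follow
    return []  # no match: fall back to normal generation, one token per iteration

# The tail [5, 6] also appears earlier, so the tokens after it become the draft.
assert propose_ngram_draft([1, 5, 6, 7, 8, 2, 5, 6], 2, 2) == [7, 8]
```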
## Medusa

View File

@ -40,9 +40,10 @@ python3 quickstart_multimodal.py --model_dir Efficient-Large-Model/NVILA-8B --mo
python3 quickstart_advanced.py \
--model_dir meta-llama/Llama-3.1-8B-Instruct \
--spec_decode_algo NGRAM \
--max_matching_ngram_size=2 \
--spec_decode_nextn=4 \
--disable_overlap_scheduler
--spec_decode_nextn 4 \
--max_matching_ngram_size 2 \
--disable_overlap_scheduler \
--disable_kv_cache_reuse
```
```bash
@ -52,6 +53,6 @@ python3 quickstart_advanced.py \
--spec_decode_algo draft_target \
--spec_decode_nextn 5 \
--draft_model_dir meta-llama/Llama-3.2-1B-Instruct \
--disable_overlap_scheduler
--disable_overlap_scheduler \
--disable_kv_cache_reuse
```

View File

@ -1,17 +1,17 @@
# Prompt-Lookup Speculative Decoding
# NGram Speculative Decoding
This document shows how to build and run a model using Prompt-Lookup speculative decoding (supported as `ASSISTED_GENERATION` in transformers and vLLM, source: [GitHub](https://github.com/apoorvumang/prompt-lookup-decoding/tree/main)) in TensorRT-LLM on a single GPU, or on a single node with multiple GPUs.
This document shows how to build and run a model using NGram speculative decoding (supported as `ASSISTED_GENERATION` in transformers and vLLM, source: [GitHub](https://github.com/apoorvumang/prompt-lookup-decoding/tree/main)) in TensorRT-LLM on a single GPU, or on a single node with multiple GPUs.
## Overview
We currently provide two workflow styles for running Prompt-Lookup, named V1 and V2. V1 uses the TensorRT workflow and is similar to the Draft-Target-Model workflow: it runs in orchestrator mode and calls `runner.generate()` multiple times to get outputs, which is more flexible for customization but adds slightly more overhead. V2 uses the PyTorch workflow and is similar to the Lookahead workflow: it runs in leader mode and calls `runner.generate()` only once to get outputs, which provides higher performance but a fixed process.
We currently provide two workflow styles for running NGram, named V1 and V2. V1 uses the TensorRT workflow and is similar to the Draft-Target-Model workflow: it runs in orchestrator mode and calls `runner.generate()` multiple times to get outputs, which is more flexible for customization but adds slightly more overhead. V2 uses the PyTorch workflow and is similar to the Lookahead workflow: it runs in leader mode and calls `runner.generate()` only once to get outputs, which provides higher performance but a fixed process.
The Prompt-Lookup approach has 3 additional hyperparameters that you need to specify to control the generation process:
- `prompt_lookup_num_tokens`: the maximum number of tokens provided as draft tokens in one iteration, usually 4 to 10 in common usage (default value: 4). Empirically, a larger value gives a higher acceptance rate but also higher overhead, so the right balance needs to be found for the given model and application scenario.
The NGram approach has 3 additional hyperparameters that you need to specify to control the generation process:
- `max_draft_len`: the maximum number of tokens provided as draft tokens in one iteration, usually 4 to 10 in common usage (default value: 4). Empirically, a larger value gives a higher acceptance rate but also higher overhead, so the right balance needs to be found for the given model and application scenario.
- `max_matching_ngram_size`: the maximum number of tokens taken from the tail of the input prompt or generated output and used as a pattern to search for corresponding draft tokens (default value: 2). Empirically, a larger value matches more precise context from the existing sequence, indicating a higher acceptance rate, but it also increases the probability of a mismatch, which falls back to normal generation (one token per iteration), and the overhead.
- `device_list`: the index list of the device(s) used to run the model in the V1 workflow. Its length must be the same as the TP size of the draft model engine. For instance, `device_list=[0]` means using tp_size=1 and GPU 0 for the model, and `device_list=[4,5,6,7]` means using tp_size=4 and GPUs 4 to 7 for the model. This parameter is not needed in the V2 workflow.
+ For example, the process of getting draft tokens using `prompt_lookup_num_tokens=2` and `max_matching_ngram_size=4` with a sentence `prefix=[..., t1, t2, t3, t4]` is as follows:
+ For example, the process of getting draft tokens using `max_draft_len=2` and `max_matching_ngram_size=4` with a sentence `prefix=[..., t1, t2, t3, t4]` is as follows:
```Python
pattern = prefix[-2:]  # pattern=[t3, t4] (length=2)
@ -40,9 +40,9 @@ return None # No any candidate exists
+ We use the open-source `llama-v2-13B` model in this example.
+ `--use_paged_context_fmha=enable` must be specified since KV cache reuse is needed in this approach.
+ `--speculative_decoding_mode=draft_tokens_external` must be specified.
+ `--max_draft_len` must be specified to be greater than or equal to `prompt_lookup_num_tokens`.
+ `--prompt_lookup_config` is the corresponding configuration of Prompt-Lookup; its usage can be seen in [util.py](../util.py).
+ As an example, `[10,2,[0]]` means `prompt_lookup_num_tokens=10`, `max_matching_ngram_size=2`, and the target model runs on `GPU0`.
+ `--max_draft_len` must be specified as the maximum length of the draft tokens.
+ `--ngram_config` is the corresponding configuration of NGram; its usage can be seen in [util.py](../util.py).
+ As an example, `[10,2,[0]]` means `max_draft_len=10`, `max_matching_ngram_size=2`, and the target model runs on `GPU0` (see the parsing sketch after this list).
+ `--kv_cache_enable_block_reuse` must be specified for this approach.
+ Only CPP session is supported, so `--use_py_session` must not be specified.
+ `--num_beams` cannot be set larger than 1 since beam search is not supported in this approach yet.
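For reference, here is a minimal sketch of how a `"[10,2,[0]]"` string can be unpacked, mirroring the `ast.literal_eval` calls used in `run_dtm_ngram.py` and `summarize.py`; the helper name `parse_ngram_config` is illustrative and not part of the repository.
```python
import ast

def parse_ngram_config(config_str: str):
    # "[10,2,[0]]" -> max_draft_len=10, max_matching_ngram_size=2, device_list=[0]
    max_draft_len, max_matching_ngram_size, device_list = ast.literal_eval(config_str)
    assert max_draft_len > 0 and max_matching_ngram_size > 0
    return max_draft_len, max_matching_ngram_size, device_list

print(parse_ngram_config("[10,2,[0]]"))  # (10, 2, [0])
```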
@ -50,29 +50,29 @@ return None # No any candidate exists
```bash
# Build engine
python3 examples/models/core/llama/convert_checkpoint.py \
--model_dir=<Path To Llama-v2-13B repo> \
--output_dir=./ckpt-target \
--dtype=float16
--model_dir <Path To Llama-v2-13B repo> \
--output_dir ./ckpt-target \
--dtype float16
trtllm-build \
--checkpoint_dir=./ckpt-target \
--output_dir=./target-engine \
--gemm_plugin=float16 \
--use_paged_context_fmha=enable \
--speculative_decoding_mode=draft_tokens_external \
--max_draft_len=10 \
--max_batch_size=4 \
--max_input_len=3200 \
--max_seq_len=4800
--checkpoint_dir ./ckpt-target \
--output_dir ./target-engine \
--gemm_plugin float16 \
--use_paged_context_fmha enable \
--speculative_decoding_mode draft_tokens_external \
--max_draft_len 10 \
--max_batch_size 4 \
--max_input_len 3200 \
--max_seq_len 4800
# Run decoding
python3 examples/run.py \
--tokenizer_dir <Path To Llama-v2-7B repo> \
--engine_dir ./target-engine \
--prompt_lookup_config="[10,2,[0]]" \
--max_output_len=256 \
--ngram_config "[10,2,[0]]" \
--max_output_len 256 \
--kv_cache_enable_block_reuse \
--input_text="How does Draft-Sampling work?"
--input_text "How does Draft-Sampling work?"
# Run summarization tasks
python examples/summarize.py \
@ -81,8 +81,8 @@ python examples/summarize.py \
--check_accuracy \
--hf_model_dir <Path To Llama-v2-7B repo> \
--engine_dir ./target-engine \
--batch_size=1 \
--prompt_lookup_config="[10,2,[0]]" \
--batch_size 1 \
--ngram_config "[10,2,[0]]" \
--kv_cache_enable_block_reuse
```
@ -90,6 +90,8 @@ python examples/summarize.py \
```bash
python3 examples/llm-api/quickstart_advanced.py \
--max_matching_ngram_size=2 \
--spec_decode_nextn=4
--spec_decode_nextn 4 \
--max_matching_ngram_size 2 \
--disable_overlap_scheduler \
--disable_kv_cache_reuse
```

View File

@ -23,12 +23,12 @@ from tensorrt_llm.logger import logger
from tensorrt_llm.runtime import ModelRunnerCpp
class PLDPool: # Ngrams pool for Prompt-Lookup-Decoding
class NgramPool: # Ngrams pool for Ngram
def __init__(
self,
input_batch_size: int,
prompt_lookup_num_tokens: int,
max_draft_len: int,
max_matching_ngram_size: int,
end_id: int,
max_seq_len: list[int],
@ -36,7 +36,7 @@ class PLDPool: # Ngrams pool for Prompt-Lookup-Decoding
is_use_oldest: bool = True,
):
self.input_batch_size = input_batch_size
self.prompt_lookup_num_tokens = prompt_lookup_num_tokens
self.max_draft_len = max_draft_len
self.max_matching_ngram_size = max_matching_ngram_size
self.end_id = end_id
self.max_seq_len = max_seq_len
@ -45,7 +45,7 @@ class PLDPool: # Ngrams pool for Prompt-Lookup-Decoding
self.pool = [{} for _ in range(input_batch_size)]
self.start_index = [0 for _ in range(input_batch_size)]
assert self.prompt_lookup_num_tokens > 0, f"prompt_lookup_num_tokens must be greater than 0, but got {self.prompt_lookup_num_tokens}"
assert self.max_draft_len > 0, f"max_draft_len must be greater than 0, but got {self.max_draft_len}"
assert self.max_matching_ngram_size > 0, f"max_matching_ngram_size must be greater than 0, but got {self.max_matching_ngram_size}"
def print_pool(self):
@ -82,16 +82,15 @@ class PLDPool: # Ngrams pool for Prompt-Lookup-Decoding
-1):
# Find each possible key-value combination, and use tuple for hash
for l in range(len(sequence) - size):
r = min(l + size + self.prompt_lookup_num_tokens,
len(sequence))
r = min(l + size + self.max_draft_len, len(sequence))
key = tuple(sequence[l:l + size])
value = tuple(sequence[l + size:r])
if key not in self.pool[gbi] or not self.is_keep_all or \
len(self.pool[gbi][key][0]) < self.prompt_lookup_num_tokens:
len(self.pool[gbi][key][0]) < self.max_draft_len:
# Update the value if
# 1. the key does not exist
# 2. we only keep the newest one value for each key (MRU)
# 3. the length of the value saved before is less than `prompt_lookup_num_tokens`
# 3. the length of the value saved before is less than `max_draft_len`
self.pool[gbi][key] = OrderedSet((value, ))
elif value not in self.pool[gbi][key]:
# Extend the value if the key is already existed but count of values is not enough
@ -113,26 +112,26 @@ class PLDPool: # Ngrams pool for Prompt-Lookup-Decoding
break
draft_tokens.append(chosen_ids)
self.start_index[gbi] = max(
0, prefix_len[bi] - (self.prompt_lookup_num_tokens +
self.max_matching_ngram_size - 1))
0, prefix_len[bi] -
(self.max_draft_len + self.max_matching_ngram_size - 1))
return draft_tokens, None
def run_dtm_pld(batch_input_ids,
args,
runtime_rank,
end_id,
pad_id,
stop_words_list,
bad_words_list,
vocab_size,
*,
target_runner=None):
# `dtm` for Draft-Target-Model, `pld` for Prompt-Lookup-Decoding
def run_dtm_ngram(batch_input_ids,
args,
runtime_rank,
end_id,
pad_id,
stop_words_list,
bad_words_list,
vocab_size,
*,
target_runner=None):
# `dtm` for Draft-Target-Model, `ngram` for NGram
is_dtm = (args.draft_target_model_config is not None)
is_pld = (args.prompt_lookup_config is not None)
assert is_dtm ^ is_pld, "`--draft_target_model_config` and `--prompt_lookup_config` can not be specified at the same time."
is_ngram = (args.ngram_config is not None)
assert is_dtm ^ is_ngram, "`--draft_target_model_config` and `--ngram_config` can not be specified at the same time."
if is_dtm:
assert args.draft_engine_dir is not None, "`--draft_engine_dir` must be specified in Draft-Target-Model."
draft_len, draft_device_list, target_device_list, use_logits = ast.literal_eval(
@ -142,12 +141,11 @@ def run_dtm_pld(batch_input_ids,
logger.info(f"Device(s) for draft model: {draft_device_list}")
logger.info(f"Device(s) for target model: {target_device_list}")
logger.info(f"Use logits to accept tokens: {use_logits}")
if is_pld:
logger.info(
f"Using Prompt-Lookup-Decoding speculative decoding V1 workflow")
prompt_lookup_num_tokens, max_matching_ngram_size, target_device_list = ast.literal_eval(
args.prompt_lookup_config)
logger.info(f"prompt_lookup_num_tokens: {prompt_lookup_num_tokens}")
if is_ngram:
logger.info(f"Using NGram speculative decoding V1 workflow")
max_draft_len, max_matching_ngram_size, target_device_list = ast.literal_eval(
args.ngram_config)
logger.info(f"max_draft_len: {max_draft_len}")
logger.info(f"max_matching_ngram_size: {max_matching_ngram_size}")
logger.info(f"Device(s) for the model: {target_device_list}")
use_logits = False # `logits` is useless in this approach yet
@ -166,9 +164,9 @@ def run_dtm_pld(batch_input_ids,
n_draft_token = [0 for _ in range(input_batch_size)]
n_accept_token = [0 for _ in range(input_batch_size)]
if is_pld:
pld_pool = PLDPool(input_batch_size, prompt_lookup_num_tokens,
max_matching_ngram_size, end_id, max_seq_len)
if is_ngram:
ngram_pool = NgramPool(input_batch_size, max_draft_len,
max_matching_ngram_size, end_id, max_seq_len)
# Repack the output like the output of function `generate`
outputs = {}
@ -297,8 +295,8 @@ def run_dtm_pld(batch_input_ids,
if use_logits:
d_logits[bi] = draft["generation_logits"][bi, 0,
-d_len[bi]:, :]
if is_pld:
d_ids, d_logits = pld_pool.get_draft_tokens(prefix, batch_slot)
if is_ngram:
d_ids, d_logits = ngram_pool.get_draft_tokens(prefix, batch_slot)
d_len = [len(i) for i in d_ids]
# Run target model
@ -310,8 +308,8 @@ def run_dtm_pld(batch_input_ids,
draft_logits_list=d_logits)
if is_dtm:
max_new_tokens = draft_len + 1
if is_pld:
max_new_tokens = prompt_lookup_num_tokens + 1
if is_ngram:
max_new_tokens = max_draft_len + 1
target_generation_kwargs.update(max_new_tokens=max_new_tokens)
target = target_runner.generate(**target_generation_kwargs)
torch.cuda.synchronize()

View File

@ -35,7 +35,7 @@ from tensorrt_llm.runtime import PYTHON_BINDINGS, ModelRunner
if PYTHON_BINDINGS:
from tensorrt_llm.runtime import ModelRunnerCpp
from prompt_lookup.run_dtm_pld import run_dtm_pld
from ngram.run_dtm_ngram import run_dtm_ngram
def parse_arguments(args=None):
@ -430,17 +430,17 @@ def main(args):
logger.info(f"Using {'Python' if args.use_py_session else 'C++'} session")
if args.draft_target_model_config is not None or args.prompt_lookup_config is not None:
# Speculative-Decoding of Draft-Target-Model (DTM) and Prompt-Lookup-Decoding (PLD)
# If the parameters of `runner_kwargs` and `runner.generate()` in the "else" branch change, the same change should be done for `examples/prompt_lookup/run_dtm_pld.py`
if args.draft_target_model_config is not None or args.ngram_config is not None:
# Speculative-Decoding of Draft-Target-Model (DTM) and NGram
# If the parameters of `runner_kwargs` and `runner.generate()` in the "else" branch change, the same change should be done for `examples/ngram/run_dtm_ngram.py`
assert args.kv_cache_enable_block_reuse, "`--kv_cache_enable_block_reuse` must be specified in speculative decoding."
assert not args.use_py_session, "`--use_py_session` is not supported in Speculative decoding."
assert not is_enc_dec, "Encoder-Decoder model is not supported in Speculative decoding."
assert args.num_beams == 1, "`--num_beams>1` is not supported in Speculative decoding."
outputs = run_dtm_pld(batch_input_ids, args, runtime_rank, end_id,
pad_id, stop_words_list, bad_words_list,
len(tokenizer))
outputs = run_dtm_ngram(batch_input_ids, args, runtime_rank, end_id,
pad_id, stop_words_list, bad_words_list,
len(tokenizer))
if not args.streaming: # Unpack runner from the return value in No-Streaming mode
outputs, runner = list(outputs)[0]

View File

@ -41,7 +41,7 @@ from tensorrt_llm.tools.ppl import ppl
if PYTHON_BINDINGS:
from tensorrt_llm.runtime import ModelRunnerCpp
from prompt_lookup.run_dtm_pld import run_dtm_pld
from ngram.run_dtm_ngram import run_dtm_ngram
def ensemble_mrope_params(batch_input_ids, max_position_embeddings,
@ -318,17 +318,17 @@ def main(args):
return [], [], [], {}
input_lengths = [x.size(0) for x in batch_input_ids]
if args.prompt_lookup_config is not None:
# Speculative decoding of Prompt-Lookup-Decoding (PLD)
outputs = run_dtm_pld(batch_input_ids,
args,
runtime_rank,
end_id,
pad_id,
stop_words_list,
bad_words_list,
tokenizer.vocab_size,
target_runner=runner)
if args.ngram_config is not None:
# Speculative decoding of NGram
outputs = run_dtm_ngram(batch_input_ids,
args,
runtime_rank,
end_id,
pad_id,
stop_words_list,
bad_words_list,
tokenizer.vocab_size,
target_runner=runner)
if not args.streaming: # Unpack runner from the return value in No-Streaming mode
outputs, runner = list(outputs)[0]
else: # Normal run
@ -596,18 +596,17 @@ def main(args):
args.lookahead_config
) == 3, "Lookahead needs [max_window_size, max_ngram_size, max_verification_set_size]"
runner_kwargs.update(lookahead_config=args.lookahead_config)
if args.prompt_lookup_config is not None:
if args.ngram_config is not None:
assert args.kv_cache_enable_block_reuse, "`--kv_cache_enable_block_reuse` must be specified in speculative decoding."
assert not args.use_py_session, "`--use_py_session` is not supported in Speculative decoding."
assert not is_enc_dec, "Encoder-Decoder model is not supported in Speculative decoding."
assert args.num_beams == 1, "`--num_beams>1` is not supported in Speculative decoding."
prompt_lookup_num_tokens, _, target_device_list = ast.literal_eval(
args.prompt_lookup_config)
args.max_output_len = output_len # Specialization for PLD
max_draft_len, _, target_device_list = ast.literal_eval(
args.ngram_config)
args.max_output_len = output_len # Specialization for NGram
runner_kwargs.update(is_orchestrator_mode=True,
device_ids=target_device_list,
max_input_len=test_token_num +
prompt_lookup_num_tokens + output_len)
max_input_len=test_token_num + max_draft_len +
output_len)
runner = runner_cls.from_dir(**runner_kwargs)
assert not (args.eval_ppl and not runner.gather_context_logits), \

View File

@ -439,12 +439,12 @@ def add_common_args(parser):
" E.g.: [4, [0], [1], False] for [draft_len, draft_model_device_list, target_model_device_list, use_logits]."
)
parser.add_argument(
'--prompt_lookup_config',
'--ngram_config',
type=str,
default=None,
help=
"Configuration of Prompt-Lookup decoding, see `examples/prompt_lookup/README.md` for more information."
" E.g.: [10,2,[0]] for [prompt_lookup_num_tokens, max_matching_ngram_size, device_list].",
"Configuration of NGram decoding, see `examples/ngram/README.md` for more information."
" E.g.: [10,2,[0]] for [max_draft_len, max_matching_ngram_size, device_list].",
)
parser.add_argument(
'--medusa_choices',

View File

@ -124,7 +124,7 @@
"examples/test_draft_target_model.py::test_llm_draft_target_model_1gpu[streaming-gpt2-use_cpp_session-use_tokens-draft_len_4-float16-bs2]": 257.3995385244489,
"examples/test_enc_dec.py::test_llm_enc_dec_general[compare_hf-bart-large-cnn-float32-enable_gemm_plugin-enable_attention_plugin-enable_paged_kv_cache-tp:1-pp:1-nb:2-disable_fp8]": 276.10329104214907,
"examples/test_multimodal.py::test_llm_multimodal_general[llava-v1.6-mistral-7b-hf-vision-trtllm-pp:1-tp:1-float16-bs:1-cpp_e2e:False-nb:1]": 306.38610201328993,
"examples/test_prompt_lookup.py::test_llm_prompt_lookup_1gpu[streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-prompt_lookup_num_tokens_8-float16-bs2]": 195.90045699477196,
"examples/test_ngram.py::test_llm_ngram_1gpu[streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-max_draft_len_8-float16-bs2]": 195.90045699477196,
"test_unittests.py::test_unittests_v2[unittest/trt/model/test_gpt.py -k \"partition2\"]": 357.6496359631419,
"accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=eagle-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False]": 413.903915906325,
"accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=eagle-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=True-torch_compile=False]": 143.841789112892,
@ -329,7 +329,7 @@
"examples/test_gpt.py::test_llm_gpt2_medium_stop_words_1gpu[non_streaming-use_py_session]": 194.89357279613614,
"examples/test_granite.py::test_llm_granite[granite-3.0-2b-instruct-bfloat16]": 155.801738537848,
"examples/test_llama.py::test_llm_llama_v2_1gpu_auto_parallel[llama-v2-7b-hf]": 535.973838724196,
"examples/test_prompt_lookup.py::test_llm_prompt_lookup_1gpu[no_streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-prompt_lookup_num_tokens_8-float16-bs2]": 196.1214354224503,
"examples/test_ngram.py::test_llm_ngram_1gpu[no_streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-max_draft_len_8-float16-bs2]": 196.1214354224503,
"examples/test_recurrentgemma.py::test_llm_recurrentgemma_1gpu[use_cpp_session-recurrentgemma-2b-use_paged_cache-int4_awq-float16-enable_attn_plugin-enable_gemm_plugin]": 648.7579195387661,
"accuracy/test_cli_flow.py::TestLlama3_2_1B::test_smooth_quant_ootb": 457.93785213679075,
"accuracy/test_cli_flow.py::TestLlama3_2_1B::test_smooth_quant_ootb_manage_weights": 216.66169160604477,

View File

@ -308,7 +308,7 @@ def convert_weights(llm_venv,
f"--dtype={data_type}",
]
elif "prompt_lookup" in model:
elif "ngram" in model:
if "gpt" in model_path:
example_name = "gpt"
elif "llama" in model_path:

View File

@ -487,9 +487,9 @@ def draft_target_model_example_root(llm_root, llm_venv):
@pytest.fixture(scope="module")
def prompt_lookup_example_root(llm_root, llm_venv):
"Get Prompt-Lookup example root"
example_root = os.path.join(llm_root, "examples", "prompt_lookup")
def ngram_example_root(llm_root, llm_venv):
"Get NGram example root"
example_root = os.path.join(llm_root, "examples", "ngram")
llm_venv.run_cmd([
"-m", "pip", "install", "-r",
os.path.join(example_root, "requirements.txt")
@ -1084,7 +1084,7 @@ def draft_target_model_roots(request):
@pytest.fixture(scope="function")
def prompt_lookup_root(request):
def ngram_root(request):
models_root = llm_models_root()
assert models_root, "Did you set LLM_MODELS_ROOT?"
if request.param == "gpt2":
@ -1094,7 +1094,7 @@ def prompt_lookup_root(request):
"llama-models-v2/llama-v2-13b-hf")
assert os.path.exists(
models_root
), f"Prompt-Lookup model path {models_root} does not exist under NFS LLM_MODELS_ROOT dir"
), f"NGram model path {models_root} does not exist under NFS LLM_MODELS_ROOT dir"
return models_root

View File

@ -22,36 +22,34 @@ from defs.conftest import skip_post_blackwell
from defs.trt_test_alternative import check_call
# TODO: remove skip after support prompt lookup on B200
# TODO: remove skip after support NGram on B200
@skip_post_blackwell
@pytest.mark.parametrize("batch_size", [1, 2], ids=['bs1', 'bs2'])
@pytest.mark.parametrize("data_type", ['float16'])
@pytest.mark.parametrize(
"prompt_lookup_num_tokens", [4, 8],
ids=['prompt_lookup_num_tokens_4', 'prompt_lookup_num_tokens_8'])
@pytest.mark.parametrize("max_draft_len", [4, 8],
ids=['max_draft_len_4', 'max_draft_len_8'])
@pytest.mark.parametrize(
"max_matching_ngram_size", [2, 4],
ids=['max_matching_ngram_size_2', 'max_matching_ngram_size_4'])
@pytest.mark.parametrize("use_logits", [False, True],
ids=['use_tokens', 'use_logits']) # useless yet
@pytest.mark.parametrize("use_py_session", [False], ids=["use_cpp_session"])
@pytest.mark.parametrize("prompt_lookup_root", ["gpt2"], indirect=True)
@pytest.mark.parametrize("ngram_root", ["gpt2"], indirect=True)
@pytest.mark.parametrize("streaming", [False, True],
ids=["no_streaming", "streaming"])
def test_llm_prompt_lookup_1gpu(batch_size, data_type, prompt_lookup_num_tokens,
max_matching_ngram_size, use_logits,
use_py_session, prompt_lookup_root, streaming,
prompt_lookup_example_root, llm_datasets_root,
llm_rouge_root, llm_venv, cmodel_dir,
engine_dir):
model_name = "prompt_lookup"
def test_llm_ngram_1gpu(batch_size, data_type, max_draft_len,
max_matching_ngram_size, use_logits, use_py_session,
ngram_root, streaming, ngram_example_root,
llm_datasets_root, llm_rouge_root, llm_venv, cmodel_dir,
engine_dir):
model_name = "ngram"
print("Build checkpoint ...")
model_dir = convert_weights(llm_venv=llm_venv,
example_root=prompt_lookup_example_root,
example_root=ngram_example_root,
cmodel_dir=cmodel_dir,
model=model_name,
model_path=prompt_lookup_root,
model_path=ngram_root,
data_type=data_type)
print("Build engines ...")
@ -72,7 +70,7 @@ def test_llm_prompt_lookup_1gpu(batch_size, data_type, prompt_lookup_num_tokens,
target_model_build_cmd.extend([
f"--output_dir={target_engine_dir}",
"--speculative_decoding_mode=draft_tokens_external",
f"--max_draft_len={prompt_lookup_num_tokens+1}",
f"--max_draft_len={max_draft_len+1}",
])
baseline_model_build_cmd = deepcopy(common_build_cmd)
baseline_model_build_cmd.extend([
@ -88,8 +86,8 @@ def test_llm_prompt_lookup_1gpu(batch_size, data_type, prompt_lookup_num_tokens,
print("Run inferences ...")
common_run_cmd = [
f"{prompt_lookup_example_root}/../run.py",
f"--tokenizer_dir={prompt_lookup_root}",
f"{ngram_example_root}/../run.py",
f"--tokenizer_dir={ngram_root}",
f"--max_output_len=64",
f"--kv_cache_enable_block_reuse",
f"--kv_cache_free_gpu_memory_fraction=0.25",
@ -105,11 +103,11 @@ def test_llm_prompt_lookup_1gpu(batch_size, data_type, prompt_lookup_num_tokens,
assert not use_py_session, "Only CPP session is supported in Draft-Target-Model."
run_cmd = deepcopy(common_run_cmd)
prompt_lookup_config = f"[{prompt_lookup_num_tokens},{max_matching_ngram_size},[0]]"
ngram_config = f"[{max_draft_len},{max_matching_ngram_size},[0]]"
run_cmd.extend([
f"--engine_dir={target_engine_dir}",
f"--prompt_lookup_config={prompt_lookup_config}",
f"--output_csv={engine_dir}/prompt_lookup_output.csv",
f"--ngram_config={ngram_config}",
f"--output_csv={engine_dir}/ngram_output.csv",
])
baseline_run_cmd = deepcopy(common_run_cmd)
baseline_run_cmd.extend([
@ -121,7 +119,7 @@ def test_llm_prompt_lookup_1gpu(batch_size, data_type, prompt_lookup_num_tokens,
venv_check_call(llm_venv, baseline_run_cmd)
print("Compare outputs ...")
with open(f"{engine_dir}/prompt_lookup_output.csv") as dt_f, open(
with open(f"{engine_dir}/ngram_output.csv") as dt_f, open(
f"{engine_dir}/baseline_output.csv") as b_f:
for bs, (dt_request,
b_request) in enumerate(zip(csv.reader(dt_f),
@ -138,20 +136,20 @@ def test_llm_prompt_lookup_1gpu(batch_size, data_type, prompt_lookup_num_tokens,
return
print("Run summarize...")
prompt_lookup_config = f"[{prompt_lookup_num_tokens},{max_matching_ngram_size},[0]]"
ngram_config = f"[{max_draft_len},{max_matching_ngram_size},[0]]"
run_cmd = [
f"{prompt_lookup_example_root}/../summarize.py",
f"{ngram_example_root}/../summarize.py",
"--test_hf",
"--test_trt_llm",
"--check_accuracy",
"--batch_size=1",
f"--hf_model_dir={prompt_lookup_root}",
f"--hf_model_dir={ngram_root}",
f"--engine_dir={target_engine_dir}",
f"--dataset_dir={llm_datasets_root}",
f"--rouge_dir={llm_rouge_root}",
"--kv_cache_enable_block_reuse",
f"--prompt_lookup_config={prompt_lookup_config}",
f"--ngram_config={ngram_config}",
"--tensorrt_llm_rouge1_threshold=20",
f"--kv_cache_free_gpu_memory_fraction=0.25",
]

View File

@ -97,10 +97,10 @@ examples/test_draft_target_model.py::test_llm_draft_target_model_1gpu[no_streami
examples/test_draft_target_model.py::test_llm_draft_target_model_1gpu[streaming-llama_v2-use_cpp_session-use_logits-draft_len_4-float16-bs2]
examples/test_draft_target_model.py::test_llm_draft_target_llama_1gpu
examples/test_draft_target_model.py::test_llm_draft_target_llama_fp8_2gpu
examples/test_prompt_lookup.py::test_llm_prompt_lookup_1gpu[no_streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-prompt_lookup_num_tokens_8-float16-bs1]
examples/test_prompt_lookup.py::test_llm_prompt_lookup_1gpu[no_streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-prompt_lookup_num_tokens_8-float16-bs2]
examples/test_prompt_lookup.py::test_llm_prompt_lookup_1gpu[streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-prompt_lookup_num_tokens_8-float16-bs1]
examples/test_prompt_lookup.py::test_llm_prompt_lookup_1gpu[streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-prompt_lookup_num_tokens_8-float16-bs2]
examples/test_ngram.py::test_llm_ngram_1gpu[no_streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-max_draft_len_8-float16-bs1]
examples/test_ngram.py::test_llm_ngram_1gpu[no_streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-max_draft_len_8-float16-bs2]
examples/test_ngram.py::test_llm_ngram_1gpu[streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-max_draft_len_8-float16-bs1]
examples/test_ngram.py::test_llm_ngram_1gpu[streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-max_draft_len_8-float16-bs2]
examples/test_internlm.py::test_llm_internlm2_7b_1node_1gpu[bfloat16-enable_context_fmha-enable_gemm_plugin-enable_attention_plugin-nb:2]
examples/test_llama.py::test_llm_llama_1gpu_streaming_llm[ailab-deepseek-coder-6.7b-instruct]
examples/test_llama.py::test_llm_llama_2gpu_fp8_summary[llama-7b-enable_reduce_fusion-disable_fp8_context_fmha_xqa]

View File

@ -108,7 +108,7 @@ l0_a30:
- examples/test_internlm.py::test_llm_internlm2_7b_1node_1gpu[bfloat16-enable_context_fmha-enable_gemm_plugin-enable_attention_plugin-nb:2] # 5 mins
- examples/test_draft_target_model.py::test_llm_draft_target_model_1gpu[streaming-gpt2-use_cpp_session-use_tokens-draft_len_4-float16-bs2] # 1 min
- examples/test_draft_target_model.py::test_llm_draft_target_model_1gpu[streaming-gpt2-use_cpp_session-use_logits-draft_len_4-float16-bs2] # 1 min
- examples/test_prompt_lookup.py::test_llm_prompt_lookup_1gpu[streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-prompt_lookup_num_tokens_8-float16-bs2] # 1 min
- examples/test_ngram.py::test_llm_ngram_1gpu[streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-max_draft_len_8-float16-bs2] # 1 min
- condition:
ranges:
system_gpu_count:
@ -159,7 +159,7 @@ l0_a30:
- examples/test_granite.py::test_llm_granite[granite-3.0-2b-instruct-bfloat16] # 5 mins
- examples/test_draft_target_model.py::test_llm_draft_target_model_1gpu[no_streaming-gpt2-use_cpp_session-use_tokens-draft_len_4-float16-bs2] # 1 min
- examples/test_draft_target_model.py::test_llm_draft_target_model_1gpu[no_streaming-gpt2-use_cpp_session-use_logits-draft_len_4-float16-bs2] # 1 min
- examples/test_prompt_lookup.py::test_llm_prompt_lookup_1gpu[no_streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-prompt_lookup_num_tokens_8-float16-bs2] # 1 min
- examples/test_ngram.py::test_llm_ngram_1gpu[no_streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-max_draft_len_8-float16-bs2] # 1 min
- condition:
ranges:
system_gpu_count:

View File

@ -381,7 +381,6 @@ accuracy/test_disaggregated_serving.py::TestLlama4ScoutInstruct::test_auto_dtype
full:B200/examples/test_gemma.py::test_llm_gemma_1gpu_summary_vswa[gemma-3-1b-it-other-bfloat16-8] SKIP (https://nvbugs/5292737)
full:B200/accuracy/test_llm_api_pytorch.py::TestGemma3_1BInstruct::test_auto_dtype SKIP (https://nvbugs/5295470)
examples/test_mistral.py::test_llm_mistral_v1_1gpu[mistral-7b-v0.1-float16-max_attention_window_size_4096-summarization_long] SKIP (https://nvbugs/5324976)
examples/test_prompt_lookup.py::test_llm_prompt_lookup_1gpu[no_streaming-gpt2-use_cpp_session-use_tokens-max_matching_ngram_size_2-prompt_lookup_num_tokens_8-float16-bs1] SKIP (https://nvbugs/5344070)
examples/test_medusa.py::test_llm_medusa_with_qaunt_base_model_1gpu[fp8-use_py_session-medusa-vicuna-7b-v1.3-4-heads-float16-bs1] SKIP (https://nvbugs/5333849)
examples/test_multimodal.py::test_llm_multimodal_general[Llama-3.2-11B-Vision-pp:1-tp:1-bfloat16-bs:1-cpp_e2e:False-nb:1] SKIP (https://nvbugs/5333818)
examples/test_multimodal.py::test_llm_multimodal_general[Llama-3.2-11B-Vision-pp:1-tp:1-bfloat16-bs:8-cpp_e2e:False-nb:1] SKIP (https://nvbugs/5333818)