From a48bef4d9f52741f92191804291ace016b38962a Mon Sep 17 00:00:00 2001
From: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Date: Wed, 17 Dec 2025 11:30:11 +0000
Subject: [PATCH 1/2] [TRTLLM-8425][doc] Update sampling documentation

Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
---
 docs/source/features/sampling.md | 65 +++++++++++++++++++++++++-------
 1 file changed, 52 insertions(+), 13 deletions(-)

diff --git a/docs/source/features/sampling.md b/docs/source/features/sampling.md
index bac0cf355e..2b7c26a478 100644
--- a/docs/source/features/sampling.md
+++ b/docs/source/features/sampling.md
@@ -1,19 +1,45 @@
 # Sampling
-The PyTorch backend supports most of the sampling features that are supported on the C++ backend, such as temperature, top-k and top-p sampling, beam search, stop words, bad words, penalty, context and generation logits, log probability and logits processors
+
+The PyTorch backend supports a wide variety of features, listed below:
+
+| Forward Pass       | Sampling Strategies             | Sampling Features               |
+|--------------------|---------------------------------|---------------------------------|
+| No drafting        | Greedy                          | Guided Decoding                 |
+| Draft target model | TopP                            | Pluggable Logits Post-Processor |
+| Eagle 3            | TopK                            | Temperature                     |
+| Ngram              | TopK + TopP                     | MinP                            |
+|                    | Beam Search                     | Embedding / Logits Bias         |
+|                    | Best of / n (composable)        | Stop criteria                   |
+|                    | Rejection sampling (composable) | Return Logits                   |
+|                    |                                 | Return LogProbs                 |
+|                    |                                 | TopK LogProbs                   |
 
 ## General usage
 
-To use the feature:
+By default, the sampling backend is chosen to be `auto`. This will use:
 
-1. Enable the `enable_trtllm_sampler` option in the `LLM` class
-2. Pass a [`SamplingParams`](source:tensorrt_llm/sampling_params.py#L125) object with the desired options to the `generate()` function
+* TRTLLM Sampler when using Beam Search.
+* Torch Sampler otherwise.
 
-The following example prepares two identical prompts which will give different results due to the sampling parameters chosen:
+Torch Sampler currently supports a superset of the features of TRTLLM Sampler and is intended as the long-term solution. One can specify which sampler to use explicitly with:
+
+```python
+from tensorrt_llm import LLM
+
+# Chooses TorchSampler explicitly
+llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8',
+          sampler_type="TorchSampler")
+
+# Chooses TRTLLMSampler explicitly
+llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8',
+          sampler_type="TRTLLMSampler")
+```
+
+Here is an example of running a model with basic sampling parameters. The following example prepares two identical prompts which will give different results due to the sampling parameters chosen:
 
 ```python
 from tensorrt_llm import LLM, SamplingParams
-llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8',
-          enable_trtllm_sampler=True)
+llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8')
 sampling_params = SamplingParams(
         temperature=1.0,
         top_k=8,
@@ -23,7 +49,24 @@
 llm.generate(["Hello, my name is", "Hello, my name is"],
              sampling_params)
 ```
-Note: The `enable_trtllm_sampler` option is not currently supported when using speculative decoders, such as MTP or Eagle-3, so there is a smaller subset of sampling options available.
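+
+The call to `generate()` returns one result per prompt. Below is a minimal sketch of reading the sampled text back; it assumes the result objects expose the generated candidates through `outputs`, with the text of the first candidate at `outputs[0].text`:
+
+```python
+outputs = llm.generate(["Hello, my name is", "Hello, my name is"],
+                       sampling_params)
+for output in outputs:
+    # Each result carries the original prompt and its sampled continuation.
+    print(output.prompt, output.outputs[0].text)
+```
+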
+It is also possible to specify different sampling parameters on a per-prompt basis:
+
+```python
+from tensorrt_llm import LLM, SamplingParams
+llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8')
+sampling_params_0 = SamplingParams(
+    temperature=1.0,
+    top_k=8,
+    top_p=0.5,
+)
+sampling_params_1 = SamplingParams(
+    top_k=4,
+)
+llm.generate(["Hello, my name is",
+              "Hello, my name is"],
+             [sampling_params_0,
+              sampling_params_1])
+```
 
 ## Beam search
 
@@ -33,8 +76,6 @@ To enable beam search, you must:
 
 1. Enable the `use_beam_search` option in the `SamplingParams` object
 2. Set the `max_beam_width` parameter in the `LLM` class to match the `best_of` parameter in `SamplingParams`
-3. Disable overlap scheduling using the `disable_overlap_scheduler` parameter of the `LLM` class
-4. Disable the usage of CUDA Graphs by passing `None` to the `cuda_graph_config` parameter of the `LLM` class
 
 Parameter Configuration:
 - `best_of`: Controls the number of beams processed during generation (beam width)
@@ -47,10 +88,8 @@ The following example demonstrates beam search with a beam width of 4, returning
 ```python
 from tensorrt_llm import LLM, SamplingParams
 llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8',
-          enable_trtllm_sampler=True,
           max_beam_width=4, # must equal SamplingParams.best_of
-          disable_overlap_scheduler=True,
-          cuda_graph_config=None)
+          )
 sampling_params = SamplingParams(
     best_of=4, # must equal LLM.max_beam_width
     use_beam_search=True,

From 8340ae6b009a9081bc7302e930255613d7949781 Mon Sep 17 00:00:00 2001
From: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Date: Mon, 5 Jan 2026 11:01:23 +0000
Subject: [PATCH 2/2] [TRTLLM-8425][doc] Clarify sampling backends in documentation

Updated the sampling documentation to clearly outline the two available
backends: Torch Sampler and TRTLLM Sampler. Added details on default
behavior and usage examples for better clarity.

Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
---
 docs/source/features/sampling.md | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/docs/source/features/sampling.md b/docs/source/features/sampling.md
index 2b7c26a478..1b9e5b907b 100644
--- a/docs/source/features/sampling.md
+++ b/docs/source/features/sampling.md
@@ -16,10 +16,10 @@ The PyTorch backend supports a wide variety of features, listed below:
 
 ## General usage
 
-By default, the sampling backend is chosen to be `auto`. This will use:
+There are two sampling backends available:
 
-* TRTLLM Sampler when using Beam Search.
-* Torch Sampler otherwise.
+* Torch Sampler
+* TRTLLM Sampler
 
 Torch Sampler currently supports a superset of the features of TRTLLM Sampler and is intended as the long-term solution. One can specify which sampler to use explicitly with:
 
@@ -35,7 +35,12 @@ llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8',
           sampler_type="TRTLLMSampler")
 ```
 
-Here is an example of running a model with basic sampling parameters. The following example prepares two identical prompts which will give different results due to the sampling parameters chosen:
+By default, the sampling backend is chosen to be `auto`. This will use:
+
+* TRTLLM Sampler when using Beam Search.
+* Torch Sampler otherwise.
+
+Here is an example of running a model with basic sampling parameters. This example prepares two identical prompts which will give different results due to the sampling parameters chosen:
 
 ```python
 from tensorrt_llm import LLM, SamplingParams
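+# A minimal sketch of reading beams back once beam search is enabled (see the
+# Beam search section below). It assumes each result exposes its candidates via
+# `outputs`, one entry per returned beam, and reuses the constraints described
+# in that section (`best_of` must equal the LLM's `max_beam_width`, n <= best_of).
+beam_llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8',
+               max_beam_width=4)                  # must equal SamplingParams.best_of
+beam_params = SamplingParams(best_of=4,           # must equal LLM.max_beam_width
+                             n=2,                 # number of beams returned
+                             use_beam_search=True)
+for candidate in beam_llm.generate(["Hello, my name is"], beam_params)[0].outputs:
+    print(candidate.text)                         # printed once per returned beam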