liji-nv
dcbfa7e509
[ https://nvbugs/5252313 ][fix] Fix torch compile + MTP ( #6554 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-08-05 10:31:29 -04:00
Simeng Liu
8cf3faa26a
[feat] Auto-enable ngram with concurrency <= 32. ( #6232 )
...
Signed-off-by: Simeng Liu <simengl@nvidia.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
Signed-off-by: Mike Iovine <mike.iovine7@gmail.com>
Co-authored-by: Mike Iovine <miovine@nvidia.com>
Co-authored-by: Mike Iovine <mike.iovine7@gmail.com>
2025-07-31 18:45:51 -04:00
Ziyi Xiong
8062e0fe7c
[TRTLLM-6392][feat] Support turning on/off spec decoding dynamically ( #6363 )
...
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-07-31 15:31:39 -04:00
YueWeng
2dd3186727
fix: remove cudaStreamSynchronize when using relaxed acceptance ( #5262 )
...
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
2025-07-28 09:18:41 +08:00
ameynaik-hub
1e5e71aa42
MTP optimizations round 1 ( #5689 )
...
Signed-off-by: Amey Naik <212485788+ameynaik-hub@users.noreply.github.com>
Co-authored-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>
2025-07-25 13:48:27 -04:00
Mike Iovine
0f2f11f90b
[TRTLLM-6453][feat] Support chunked prefill on spec decode 2 model ( #6104 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-07-24 21:50:11 -04:00
wili
8ecdeee300
[refactor] Simplification of Speculative decoding configs - Part 2 ( #5936 )
...
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-07-23 09:20:27 +08:00
Ziyi Xiong
d7f0b0ab68
[fix] Correct the returned value of has_spec_drafter ( #6178 )
...
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-07-21 11:38:59 -04:00
Netanel Haber
d9a3530048
[nvbug/5393888][nvbug/5393042] Always use py_seq_slot ( #6147 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-07-18 22:45:16 +03:00
Ziyi Xiong
58d22a72f1
[TRTLLM-6352][feat] Migrate EAGLE3 and draft/target speculation to Drafter ( #6007 )
...
Signed-off-by: ziyixiong-nv <fxiong@nvidia.com>
2025-07-17 21:15:01 +08:00
Mike Iovine
fa34cb7234
[refactor] Clean up drafter/resource manager creation logic ( #5805 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-07-16 12:45:46 -07:00
Zhenhuan Chen
30608a5e6d
[ https://nvbugs/5355316 ] fix: update torch.compile option to fix triton store_cubin error ( #5865 )
...
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-07-14 17:17:30 +08:00
Mike Iovine
8950223f6f
[fix] Remove SpecConfig and fix thread leak issues ( #5931 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-07-12 21:03:24 +09:00
wili
2e3cf42e03
[refactor] Simplification of Speculative decoding configs ( #5639 )
...
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-07-10 11:37:30 -04:00
Raayan Dhar
e3268a4221
[TRTLLM-5847][feat] Support n-gram speculative decoding with disagg ( #5732 )
...
Signed-off-by: raayandhar <rdhar@nvidia.com>
2025-07-08 09:39:58 -04:00
Robin Kobus
30a19fcf7c
[TRTLLM-6291] feat: Add user-provided speculative decoding support ( #5204 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-07 16:30:43 +02:00
Netanel Haber
aa72d39b72
MTP and derivatives: Align sample state with trtllm sampler sample state ( #5675 )
...
This PR moves MTPSampler and derivatives to use the universal seq_slot indexing for sampling.
This is the last piece of the puzzle: After this, all of the samplers will use this format.
See: 6ee94c7
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
2025-07-03 19:55:48 +02:00
Jhao-Ting Chen
77082cde38
[ https://nvbugspro.nvidia.com/bug/5329655 ] [feat] Pytorch path add spec dec param to attention op ( #5146 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-07-02 04:54:43 -04:00
liji-nv
c345f5876c
[feat] Support torch compile for attention dp ( #5086 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-01 13:48:52 -04:00
Netanel Haber
6ee94c7ac8
Reintroduce with perf fixes: feature: unify new_tokens format sample state to trtllm sampler tokens format ( #5513 )
...
58a8a8f - these changes were previously merged to main in this commit.
6aef149 - the changes were temporarily reverted in main due to a significant perf regression in models using the TorchSampler (observed by @byshiue).
This PR re-merges these changes along with a fix to prevent the regression.
The first commit of this PR is just the revert of the revert; filter it out of the diff to see the previously unmerged changes.
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
2025-06-30 11:58:59 -07:00
Fanrong Li
6cbc9a5297
[nvbug/5354946][fix] Fix mtp vanilla draft inputs ( #5568 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-30 15:59:12 +08:00
wili
56cdfe5c6c
[TRTLLM-5000][feat] NGrams V2 ( #4569 )
...
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-06-27 23:00:17 +08:00
Netanel Haber
6aef14943c
Revert "feature: unify new_tokens format sample state to trtllm sampler new_tokens format ( #4401 )" ( #5474 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-25 20:56:04 -07:00
Netanel Haber
58a8a8fd37
feature: unify new_tokens format sample state to trtllm sampler new_tokens format ( #4401 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-23 10:38:37 -07:00
Fanrong Li
5d4ab47d5b
fix: refactor and fix mtp vanilla ( #4762 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-20 05:23:39 +08:00
Fanrong Li
c7af650d5a
Fix: fix the deterministic issue in the MTP Eagle path ( #5285 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-19 18:08:40 +08:00
Izzy Putterman
e607768e45
Speculation: Draft Target in new FW ( #4558 )
...
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-06-17 02:26:08 +08:00
Fanrong Li
39bba63758
[TRTLLM-4983] feat: enable overlap scheduler between draft forwards ( #4802 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-15 23:09:16 +08:00
Yilin Fan
06342ffb4d
[feat] Implement model-agnostic one-engine eagle3 ( #4778 )
...
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-06-13 08:11:41 -07:00
Daniel Cámpora
d68b8180d3
feat: port MakeDecodingBatchInputOutput to python in TRTLLMSampler ( #4828 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-10 07:28:34 +08:00
Mike Iovine
ec0d984656
[nvbug/5280806][fix] Fix 2 model spec decode flow ( #4807 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-08 07:40:02 -04:00
Bo Li
f414a079ad
chore: Change the type annotations of input_ids and position_ids to int32. ( #4632 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-06-07 16:10:47 +08:00
Fanrong Li
380a5d1690
[ https://nvbugs/5271281 ][fix] fix a pd+mtp accuracy issue ( #4536 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-03 10:03:34 +08:00
Yilin Fan
31bb650298
Cherry pick feat/llama4 to main ( #4739 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
Co-authored-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-05-30 05:28:40 +08:00
Yuxian Qiu
8f055f5d14
feat: Skip sampler for intermediate pp stages. ( #4514 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-26 10:08:51 +08:00
Thor Johnsen
5d438be59a
[TRTLLM-5000][feat] Pytorch implementation of ngram drafter ( #3936 )
...
* v1.5
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
v1.5.4 Add back draft_overhead to spec dec stats
Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
* v1.5.5: fix CI error
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
* v1.6: fix CI error 8196 > 8192
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
* Address reviewer concerns
Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
* Address reviewer concerns
Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
* precommit run
Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
* v2.0: Address reviewer concerns
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
* v2.1: add fix from wili
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
* Revert changes that require use of TypeAlias because that requires python version >= 3.10
Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
---------
Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-05-21 10:40:00 +08:00
liji-nv
58e405624a
[ https://nvbugs/5123103 ][fix] Fix torch compile for DeepSeekV3 ( #3952 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-05-19 22:12:25 +08:00
Netanel Haber
9cd8148f28
API Breaking Change + Readability: "decoder"->"sampler" ( #4121 )
...
* *decoder*->*sampler*; new_tensors_device: dict[str, torch.Tensor] -> device: SampleStateTensors
* **Breaking Change**, as it changes public interfaces, main changes:
* PyTorchConfig [consumed via LLM(pytorch_backend_config)]: Configuration parameters mixed_decoder and enable_trtllm_decoder -> sampler.
* Command-line argument --enable_trtllm_decoder becomes --enable_trtllm_sampler in examples/pytorch/quickstart_advanced.py.
---------
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-05-16 23:52:25 +08:00
yuxianq
4f8afe4cc6
feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism ( #4034 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-16 04:16:53 +08:00
Fanrong Li
77f8e43592
[fix] Fix relaxed acceptance to support enabling it in context phase ( #4126 )
...
* fix relaxed acceptance so that it can be enabled in the context phase.
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
* fix sample_and_accept_draft_tokens unit test.
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
---------
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-05-09 14:11:14 +08:00
YueWeng
b1621e8d4e
feat: add relaxed acceptance for DS ( #3865 )
...
* add relaxed acceptance for DS R1
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
* clean and update docs
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
* fix
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
* Modified based on review
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
* fix mtp manager issue
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
---------
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-05-01 21:50:36 +08:00
Mike Iovine
8c2c969fcb
[fix] Pad requests to maximum draft length in spec decode ( #3957 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-30 11:02:18 -04:00
Fanrong Li
e6b482ef47
fix: change the seq_lens sync copy to an async one ( #3786 )
...
---------
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-04-29 23:56:49 +08:00
Perkz Zheng
35c5e4f1c5
feat: add CGA reduction fmha kernels on Blackwell. ( #3763 )
...
* update cubins
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
* add trtllm-gen kernels for eagle3 and also kernels with cga-reduction
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
* address the comments
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
---------
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-04-29 10:43:54 +08:00
Mike Iovine
e534bf09cc
[fix] Fix flashinfer + speculation issues ( #3686 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-28 14:34:22 -04:00
Yuan Tong
57944206ba
feat: return logits in PyTorch flow ( #3221 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-24 16:56:03 -07:00
Fanrong Li
bc1c4ddcb5
fix: remove the unnecessary metadata changes in mtp. ( #3787 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-04-23 16:01:28 +08:00
Mike Iovine
41a6c98544
Support CUDA graphs for EAGLE3 ( #3176 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-17 04:53:50 +08:00
Yuan Tong
d4c0423cdb
refactor: collect executor and decoder states into dataclass ( #3234 )
...
* fix: Proper error bubbling for PyExecutor
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-15 16:31:45 +08:00
Fanrong Li
e8b97341de
fix the py_decoding_iter update in decoder. ( #3297 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-07 11:18:33 +08:00