jellysnack
99ba723e20
[None][fix] logits device and shape issues in dynamic draft path ( #9079 )
...
Signed-off-by: jellysnack <oleg.jellysnack@gmail.com>
2025-11-18 19:22:47 -08:00
Zheyu Fu
c4e02d7f04
[TRTLLM-8136][feat] Dynamic draft length in spec decode (stage 1). ( #8194 )
...
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
2025-11-18 11:13:39 -05:00
Tri Dao
fc088e642c
[None][feat] Support Glm4MoeForCausalLM ( #8256 )
...
Signed-off-by: Tri Dao <daominhtri0503@gmail.com>
Co-authored-by: Xuanyu Chen <xuanyuc@nvidia.com>
2025-11-18 09:43:21 +08:00
Mike Iovine
6151a4c9d6
[None][feat] Add simple optimizations for MTP 2-model ( #9176 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-11-17 10:05:39 -05:00
Ziyi Xiong
a7aaf50541
[TRTLLM-8084][feat] Enhance the overlap shceduler for two-model spec decoding ( #8706 )
...
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-11-13 10:20:16 -05:00
Stefan Niebler
326a201473
[ https://nvbugs/5508536 ][fix] Take Over ( #8627 ): Reintroduce: Move stop_criteria to sample_async ( #7041 ) ( #8794 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-11-07 09:01:15 +01:00
kris1025
e2c5a38879
[ https://nvbugs/5534574 ][fix] disable spec decoding forever once the request spec decoding is disabled ( #8446 )
...
Signed-off-by: linquanh <linquanh@nvidia.com>
2025-10-29 19:28:43 +08:00
Chang Liu
e47c787dd7
[TRTLLM-8535][feat] Support DeepSeek V3.2 with FP8 + BF16 KV cache/NVFP4 + BF16 KV cache ( #8405 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-10-24 13:40:41 -04:00
sunnyqgg
ea3e0eea51
[TRTLLM-7954][feat] Target model KV cache rellocation ( #8421 )
...
Signed-off-by: qgai <qgai@nvidia.com>
2025-10-23 09:36:50 +08:00
sunnyqgg
90080e0e09
[ https://nvbugs/5556020 ][fix] test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_eagle3 dimension mismatch ( #8517 )
...
Signed-off-by: qgai <qgai@nvidia.com>
2025-10-22 09:58:22 +08:00
YueWeng
8dc4aac5b6
[TRTLLM-8160][feat] Add max_total_draft_tokens ( #8366 )
...
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
2025-10-21 11:11:04 -04:00
mpikulski
87eb5086fb
[None][fix] restore list[list[list[int]]] in add_token ( #8502 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-10-20 22:34:57 -04:00
mpikulski
97ce0ecefe
[TRTLLM-8436][feat] batched sampling and top-k logprobs improvements ( #8398 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-10-20 11:15:41 +02:00
Jin Li
d594c2d0ff
[ https://nvbugs/5537348 ][fix] Use device tensor index for MTP ( #8062 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-10-16 22:46:19 +08:00
Enwei Zhu
57a4ef870a
[None][fix] Fix chunked prefill state of draft request ( #8067 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-10-16 22:46:19 +08:00
Zheyu Fu
bac665e650
[TRTLLM-7412][feat] Turn off spec decode when the rolling average acceptance length drops below threshold. ( #7283 )
...
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
2025-10-13 15:51:14 -07:00
kris1025
a7ea544dbe
[TRTLLM-7384][feat] enable rejection sampling for CDL ( #7731 )
...
Signed-off-by: linquanh <linquanh@nvidia.com>
2025-10-12 20:38:48 +08:00
Mike Iovine
7facac077b
[None][fix] Fix MTP illegal memory access ( #8161 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-10-07 14:02:55 -04:00
Mike Iovine
ca8291133a
[None][fix] Fix MTP 2-model ( #8115 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-10-03 10:13:50 -07:00
Ziyi Xiong
7bc2d9e993
[ https://nvbugs/5537878 ][fix] Reserve an extra slot for padded batch ( #7998 )
...
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-10-03 08:42:52 -07:00
Izzy Putterman
1ad7bc4c78
[None][feat] Draft: Save state first pass ( #7012 )
...
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-10-01 18:40:55 -04:00
Mike Iovine
d7087015f1
[TRTLLM-8271][fix] Fix CDL overlap scheduling performance ( #7971 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-09-26 16:05:10 -04:00
YueWeng
a4243f0da5
[TRTLLM-6393][feat] add static tree sampling and verification ( #7161 )
...
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
2025-09-26 13:16:16 -04:00
sunnyqgg
2e5850c28a
[TRTLLM-7330][feat] Eagle3 cuda graph support for the first draft model inference ( #7363 )
...
Signed-off-by: qgai <qgai@nvidia.com>
2025-09-26 11:28:05 +08:00
mpikulski
9970345919
[TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) ( #7294 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-09-23 16:05:05 -07:00
Daniel Cámpora
9f1d9b7b18
[None][feat] Use list instead of torch tensor for new tokens in update requests ( #7730 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-09-23 10:40:08 -04:00
Enwei Zhu
59f57598a7
[ https://nvbugs/5504086 ][fix] Fix MTP vanilla ( #7904 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-23 08:38:28 +08:00
Ziyi Xiong
897c4dd23b
[ https://nvbugs/5517404 ][fix] Use the correct cuda graph for dynamic spec dec ( #7728 )
...
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-21 08:20:48 +08:00
sunnyqgg
80dd8fe197
[TRTLLM-6746][feat] Enable two-model spec dec for MTP Eagle ( #7001 )
...
Signed-off-by: qgai <qgai@nvidia.com>
2025-09-18 12:05:36 -04:00
Ziyi Xiong
28469dbf27
[ https://nvbugs/5523080 ][fix] Correct the batch index in device tensors ( #7803 )
...
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-18 13:45:37 +08:00
Netanel Haber
a5cfc8368f
[ https://nvbugs/5508536 ][fix] Revert #7041 : Move stop_criteria to sample_async ( #7041 ) ( #7796 )
...
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: Mike Iovine <miovine@nvidia.com>
2025-09-17 21:27:01 -04:00
Kaiyu Xie
62042a9733
[TRTLLM-6741] [feat] enable LM tp for MTP, under attention dp case (cherry-pick #7128 ) ( #7571 )
...
Signed-off-by: Cheng Hang <chang@nvidia.com>
Co-authored-by: Cheng Hang <chang@nvidia.com>
2025-09-17 09:41:32 +08:00
Ziyi Xiong
536e8776cd
[TRTLLM-6668][feat] Enable overlap scheduler for two-model spec decoding ( #7651 )
...
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-16 07:33:44 +08:00
Izzy Putterman
8097be7e9c
[None][feat] Eagle, use last hidden post norm ( #7546 )
...
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-09-15 12:23:57 -04:00
Zheyu Fu
c353ff342e
[None][feat] Make the should_use_spec_decode logic a bit smarter ( #7112 )
...
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
2025-09-10 12:53:59 +08:00
Jin Li
d49374bc45
[TRTLLM-7408][feat] Wrap MOE with custom op. ( #7277 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-09-09 12:18:56 -04:00
Netanel Haber
0fee8cd028
[TRTLLM-7153] [feat] Move stop_criteria to sample_async ( #7041 )
...
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
2025-09-07 17:36:49 +03:00
Mike Iovine
45390402fc
[ https://nvbugs/5502352 ][fix] Fix 2-model CDL path ( #7543 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-09-06 23:53:27 -04:00
Enwei Zhu
1745102e72
[TRTLLM-7027][feat] Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec ( #7481 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-09-04 23:30:14 +08:00
Izzy Putterman
26b133f3a7
[None][feat] MultiLayer Eagle ( #7234 )
...
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-09-04 10:49:13 -04:00
kris1025
cce9556858
[ https://nvbugs/5485886 ][fix] Fix resource free of Eagle3ResourceManager ( #7437 )
...
Signed-off-by: linquanh <linquanh@nvidia.com>
2025-09-04 17:38:13 +08:00
Enwei Zhu
5ff3a65b23
[TRTLLM-7028][feat] Enable guided decoding with speculative decoding (part 2: one-model engine) ( #6948 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-03 15:16:11 -07:00
Mike Iovine
64e3bfa054
[None][fix] Fix KV cache recompute in draft_target spec decode ( #7348 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-09-03 15:04:14 -04:00
YueWeng
9a4f60687f
[ https://nvbugs/5480289 ][fix] release slot manager in mtp MTPHiddenStatesManager ( #7340 )
...
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
2025-09-02 19:37:51 -07:00
Mike Iovine
b3c57a7042
[TRTLLM-7353][feat] Implement capturable drafting loops for speculation ( #7100 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-09-01 14:37:44 -04:00
Mike Iovine
8b216135f0
[None][refactor] Move draft token padding out of Drafter ( #7134 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-08-27 11:07:50 +02:00
Izzy Putterman
b36460d7b5
[None][feat] Deepseek: Start Eagle work ( #6210 )
...
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
Co-authored-by: Mike Iovine <miovine@nvidia.com>
2025-08-22 12:57:17 -04:00
Daniel Cámpora
099f081e03
[TRTLLM-7155][feat] Unify sampler handle logits implementation. ( #6867 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-08-22 08:09:30 +02:00
Mike Iovine
078e907b16
[ https://nvbugs/5455651 ][fix] Make ngram use XQA attention on Blackwell ( #6873 )
...
Signed-off-by: Michael Iovine <miovine@nvidia.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
Signed-off-by: Mike Iovine <mike.iovine7@gmail.com>
2025-08-14 18:36:19 -04:00
kris1025
4aed7a7d19
[TRTLLM-6853][feat] refactor deepseekv3 model ( #6698 )
...
Signed-off-by: linquanh <linquanh@nvidia.com>
2025-08-14 11:03:17 -04:00