Commit Graph

670 Commits

Author SHA1 Message Date
mpikulski
37c53425c1
[TRTLLM-10030][chore] improve assert in sampler (#11475)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-13 21:54:28 +08:00
Balaram Buddharaju
9c2d23c2e5
[https://nvbugs/5888410][fix] Enable warmup for Helix CP (#11460)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2026-02-12 14:24:51 -08:00
Yukun He
cb1d8d130f
[TRTLLM-10791][feat] TorchSampler general host time optimization (#11141)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2026-02-12 18:05:58 +01:00
Wanli Jiang
421eb9e39c
[None][feat] Optimize NemotronH model with elementwise and nvfp4 fusion (#11273)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2026-02-12 09:25:31 -05:00
Simeng Liu
12085536df
[TRTLLM-10487][feat] Add user-provided UUID support for multimodal KV cache identification. (#11075)
Signed-off-by: SimengLiu-nv <simengl@nvidia.com>
2026-02-12 00:48:47 -05:00
William Zhang
ca9537e17c
[TRTLLM-10858][feat] Multi-image support for EPD disagg (#11264)
* Why?

Prior to this commit, we only supported a single multimodal input for
E/P/D disaggregated serving.

* What?

This commit does a minor refactor of the multimodal embedding handles
that cross process boundaries to enable multiple inputs.
Existing unit tests are updated accordingly.

The `RequestOutput` has its `mm_embedding_handle` field removed in favor of
`disaggregated_params`, addressing a previous TODO.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2026-02-11 20:50:00 -08:00
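The field move described in the commit body above can be sketched as follows. This is a hypothetical illustration, not TensorRT-LLM code: the class and field names (`DisaggregatedParams`, `multimodal_embedding_handles`) are assumptions standing in for the real API, showing why folding per-image handles into one params object lifts the single-input limit.

```python
# Hypothetical sketch (not the actual TensorRT-LLM API): consolidating a
# single standalone handle field into a params object that carries a list
# of handles across process boundaries, so a request can reference more
# than one multimodal input.
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class DisaggregatedParams:
    # Assumption: one embedding handle per multimodal input (e.g. per image).
    multimodal_embedding_handles: List[Any] = field(default_factory=list)


@dataclass
class RequestOutput:
    text: str = ""
    # Before: a single `mm_embedding_handle` field limited each request to
    # one multimodal input. After: handles travel inside
    # `disaggregated_params`, so any number of inputs can be attached.
    disaggregated_params: Optional[DisaggregatedParams] = None


out = RequestOutput(
    disaggregated_params=DisaggregatedParams(
        multimodal_embedding_handles=["img0_handle", "img1_handle"]
    )
)
assert len(out.disaggregated_params.multimodal_embedding_handles) == 2
```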
Liao Lanyu
58165d5394
[None][chore] Introduce an abstract WaitingQueue interface to decouple the request scheduling logic from specific queue implementations (#11330)
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com>
2026-02-12 09:18:24 +08:00
Robin Kobus
7a103035be
[None][fix] Remove overlap scheduler adjustment for max sequence length in create_py_executor function (#9229)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2026-02-11 08:46:25 -08:00
mpikulski
411fa9ff87
[TRTLLM-10030][perf] pin host memory and batch sampler setup in beam search (#11390)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-10 16:48:36 +01:00
Iman Tabrizian
7d992972b2
[TRTLLM-10273][feat] Move MambaCacheManager from Python to C++ (#10540)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2026-02-10 07:20:56 -08:00
mpikulski
adc0d82500
[https://nvbugs/5791242][chore] remove obsolete code (#11388)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-10 10:55:29 +01:00
Yuxian Qiu
5f4df89109
[None][feat] Fully non-blocking pipeline parallelism executor loop. (#10349)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2026-02-10 15:43:28 +08:00
Ziyi Xiong
e76b634251
[TRTLLM-10321][feat] Support different KV cache layout for one-model spec dec (#10502)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2026-02-10 05:16:02 +08:00
Stefan Niebler
d50010cd1f
[https://nvbugs/5769815][fix] Fix offset calculation in _are_stop_words when using speculative decoding (#10854)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-09 23:53:40 +08:00
mpikulski
196d94a419
[TRTLLM-10030][perf] avoid syncs in beam search + other improvements (#11349)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-09 16:13:58 +01:00
Robin Kobus
31db399042
[https://nvbugs/5829097][fix] Disaggregated serving: Only send finished context requests to the KV cache transceiver (#11354)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2026-02-09 17:11:45 +08:00
mpikulski
03b38e9fbf
[TRTLLM-10030][perf] avoid sync in PyTorchModelEngine when using beam search (#11341)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-07 12:31:11 +08:00
William Zhang
ffc0f54959
[https://nvbugs/5848756][fix] Re-take ownership of mrope tensors in prefill worker (#11217)
* Why?

Previously, the mrope tensors' IPC handles would just be forwarded from
encode -> prefill -> decode workers. While this is fine for the
prefill worker, it is not for the decode worker, since by the time it
tries to rebuild those tensors, they could have been garbage collected
due to their refcounts reaching zero in the producer (encode) worker.

This could lead to nasty runtime errors when running E/P/D
disaggregated serving.

* What?

This commit fixes this by having the prefill worker take ownership of
those reconstructed tensors, and stand up new copies for the decode
worker.

Closes: NvBug 5848756

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2026-02-06 22:37:42 -05:00
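The ownership bug this commit describes (a consumer rebuilding tensors from forwarded handles after the producer's refcounts hit zero) can be sketched with a CPU-only analogy. This is a hypothetical illustration, not the actual mrope/IPC code: it uses `multiprocessing.shared_memory` in place of CUDA IPC handles to show why the intermediate worker must copy the data into memory it owns rather than keep a view tied to the producer's buffer.

```python
# Hypothetical sketch (not TensorRT-LLM code): a consumer that only keeps a
# view into a producer's shared buffer loses the data once the producer
# releases it. Taking ownership -- copying the data out -- decouples the
# consumer's lifetime from the producer's, which is the fix pattern the
# commit applies to the prefill worker.
from multiprocessing import shared_memory

# Producer ("encode worker" analogue): put data in shared memory and hand
# out the segment name as the handle.
producer_shm = shared_memory.SharedMemory(create=True, size=4)
producer_shm.buf[:4] = b"mrop"
handle = producer_shm.name

# Consumer ("prefill worker" analogue): attach via the handle.
consumer_shm = shared_memory.SharedMemory(name=handle)
view = consumer_shm.buf[:4]          # a view -- dies with the segment
owned = bytes(consumer_shm.buf[:4])  # an owned copy -- survives the segment

# Release the view, then tear the segment down (the producer-side "GC"
# that invalidates anything still pointing into the shared buffer).
del view
consumer_shm.close()
producer_shm.close()
producer_shm.unlink()

# The owned copy is still valid even though the shared segment is gone.
assert owned == b"mrop"
```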
Iman Tabrizian
18e611da77
[https://nvbugs/5863392][fix] fix partial reuse disabled for disagg (#11247)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2026-02-06 14:23:51 -05:00
Yueh-Ting (eop) Chen
383c5921c2
[https://nvbugs/5756028][fix] Fix VSWA initialization with spec-dec and boundary condition in context input preparation (#10798)
Signed-off-by: eopXD <yuehtingc@nvidia.com>
2026-02-06 14:28:47 +08:00
yifeizhang-c
5521c7b7e7
[TRTLLM-9457][feat] Add cute dsl fp8 gemm for Blackwell (#10130)
Added FP8 cute dsl gemm and batch gemm.

Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
2026-02-06 09:49:30 +08:00
Chuang Zhu
a9d4927235
[TRTLLM-10752][chore] set default value of max_num_tokens_in_buffer as max_seq_len or max_input_len (#11082)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2026-02-05 14:54:00 -05:00
Harris Nover
a7494a5ff4
[None][chore] Remove outdated comment in model_engine.py (#11240)
Signed-off-by: Harris Nover <249353502+hnover-nv@users.noreply.github.com>
2026-02-05 13:54:46 -05:00
mpikulski
7d235cfb23
[TRTLLM-10030][chore] promote SampleState to TypeVar + typing fixes (#11281)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-05 16:33:22 +01:00
mpikulski
719e82c429
[TRTLLM-10030][perf] beam search (remove GPU sync + fix batching + refactor) (#11276)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-05 15:33:51 +01:00
Yuewei Na
0d18b2d7a4
[None][feat] Add priority-based KV cache offload filtering support (#10751)
Signed-off-by: Yuewei Na <yna@nvidia.com>
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Co-authored-by: Yuewei Na <nv-yna@users.noreply.github.com>
2026-02-05 05:22:56 -05:00
Yi Zhang
ada463d15d
[None][fix] Fix comments for kv cache manager v2 (#11207)
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2026-02-04 23:31:29 -05:00
Lizhi Zhou
f9c4bdf6cf
[TRTLLM-8921][feat] implement gen-first disagg_service (#11020)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2026-02-03 15:46:11 -05:00
Mike Iovine
13b0ab9c0e
[None][fix] Fix MTP 1-model sampler (#10369)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-02 16:26:46 +08:00
Mike Iovine
d9aef94431
[https://nvbugs/5814914][fix] Fix llama sm120 spec dec (#10765)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-02 16:26:46 +08:00
Enwei Zhu
ccdd8461ac
[None][fix] Always reset drafting states for GuidedDecoder (#10899)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-02 16:26:46 +08:00
Yi Zhang
0306c0f12c
[TRTLLM-9766][feat] Integration of the KVCacheManager V2 to TRTLLM Runtime (#10659)
Signed-off-by: yizhang-nv <187001205+yizhang-nv@users.noreply.github.com>
2026-02-02 14:29:02 +08:00
Liao Lanyu
fef0e4b17d
[TRTLLM-10666][chore] Refactor request fetching logic for better separation of concerns (#10988)
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
Signed-off-by: Liao Lanyu <108499334+lancelly@users.noreply.github.com>
Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com>
2026-02-02 10:36:08 +08:00
Liao Lanyu
f2dd0ee128
[None][chore] Correct sorting order for attention DP scheduling to prioritize non-relaxed requests (#11106)
Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
2026-01-30 16:06:48 +08:00
Jin Li
ef268e2062
[TRTLLM-9904][feat] Changes for future KVCacheV2 MTP support (#11029)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2026-01-30 01:49:17 -05:00
Harris Nover
ab7dd34bbe
[None][chore] Consolidate duplicate kv cache reuse variables. (#10935)
Signed-off-by: Harris Nover <249353502+hnover-nv@users.noreply.github.com>
2026-01-29 11:03:27 -08:00
Stefan Niebler
7d31532850
[TRTLLM-10312][perf] Improve performance of _write_finish_reasons in TorchSampler (#10459)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2026-01-29 11:06:09 -05:00
Balaram Buddharaju
c7a86f89de
[TRTLLM-10264][feat] Support attention DP + Helix CP (#10477)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2026-01-29 02:57:13 -05:00
Tailing Yuan
91528365a9
[None][feat] Add performance alignment to layer-wise benchmarks (#11018)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2026-01-29 14:01:51 +08:00
Lucas Liebenwein
ff3a494f5c
[#10013][feat] AutoDeploy: native cache manager integration (#10635)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2026-01-27 11:23:22 -05:00
Chuang Zhu
d6f76d2fae
[TRTLLM-9527][feat] change context params and disagg params (step3) (#10495)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2026-01-27 16:34:17 +08:00
Tailing Yuan
5553391c5e
[TRTLLM-10560][fix] Fix the time of pause() for overlap scheduler (#10943)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2026-01-27 13:18:34 +08:00
sunnyqgg
ff0dd6076e
[TRTLLM-10062][feat] Enable MTP for Nemotron Super (#10754)
Signed-off-by: qgai <qgai@nvidia.com>
2026-01-26 11:23:26 -05:00
mpikulski
0f7ec033f7
[https://nvbugs/5791242][fix] workaround for flashinfer.sampling.sampling_from_logits (#10713)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-01-25 18:12:21 +08:00
Yuxian Qiu
9fcc93ea7b
[https://nvbugs/5829097][fix] Re-init TRTLLM sampler to use sample stream in multi-stream cases. (#10918)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2026-01-24 14:04:10 +08:00
Kaiyu Xie
da967d0bd7
[TRTLLM-10334][feat] Support overlap scheduler for disagg ctx instances (#10755)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2026-01-23 22:29:37 -05:00
jthomson04
cf88da7eca
[None][feat] KV Connector Support for MTP (#10932)
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Co-authored-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2026-01-23 18:58:26 -05:00
Yi Zhang
d43be7b65e
[None][fix] Avoid Double update for previous batch (#9888)
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2026-01-22 13:15:06 -05:00
Jiayu Chang
1dc49b266e
[https://nvbugs/5322131][feat] Multi-LoRA serving with CUDA Graph (#8279)
Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>
2026-01-22 14:01:18 +01:00
Pengbo Wang
9462d90ec7
[None][feat] Add KV cache cleanup (#7439)
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2026-01-22 15:14:17 +08:00