mpikulski
37c53425c1
[TRTLLM-10030][chore] improve assert in sampler ( #11475 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-13 21:54:28 +08:00
Gal Hubara-Agam
d0e7ba102e
[ #11455 ][fix] Fallback to triton_ssm for nvfp4 quantization ( #11456 )
...
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
2026-02-13 07:38:37 +02:00
xxi
2565f0f4e4
[TRTLLM-9108][feat] refactor MoE unit tests: add unified ConfigurableMoE test framework ( #11437 )
...
Signed-off-by: xxi <xxi@nvidia.com>
2026-02-13 11:05:38 +08:00
Ludwig Schneider
5130cbd73e
[None][fix] Pre-Allocation for Auto-Tuning NCCL_SYMMETRIC ( #11326 )
...
Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>
2026-02-12 14:31:51 -08:00
Balaram Buddharaju
9c2d23c2e5
[ https://nvbugs/5888410 ][fix] Enable warmup for Helix CP ( #11460 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2026-02-12 14:24:51 -08:00
Yukun He
cb1d8d130f
[TRTLLM-10791][feat] TorchSampler general host time optimization ( #11141 )
...
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2026-02-12 18:05:58 +01:00
Wanli Jiang
421eb9e39c
[None][feat] Optimize NemotronH model with elementwise and nvfp4 fusion ( #11273 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2026-02-12 09:25:31 -05:00
Simeng Liu
12085536df
[TRTLLM-10487][feat] Add user-provided UUID support for multimodal KV cache identification. ( #11075 )
...
Signed-off-by: SimengLiu-nv <simengl@nvidia.com>
2026-02-12 00:48:47 -05:00
William Zhang
ca9537e17c
[TRTLLM-10858][feat] Multi-image support for EPD disagg ( #11264 )
...
* Why?
Prior to this commit, we only supported a single multimodal input for
E/P/D disaggregated serving.
* What?
This commit does a minor refactor of the multimodal embedding handles
that cross process boundaries to enable this.
Existing unit tests are updated accordingly to test this.
The `RequestOutput` has its `mm_embedding_handle` replaced in favor of
`disaggregated_params`, addressing a previous TODO.
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2026-02-11 20:50:00 -08:00
Liao Lanyu
58165d5394
[None][chore] Introducing an abstract WaitingQueue interface to decouple the request scheduling logic from specific queue implementations ( #11330 )
...
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com>
2026-02-12 09:18:24 +08:00
Harris Nover
2d5ebb3fe8
[None][chore] Merge residual+hidden into layer norm at the end of each NemotronH MTP, and remove a % operation ( #11406 )
...
Signed-off-by: Harris Nover <249353502+hnover-nv@users.noreply.github.com>
2026-02-11 12:01:36 -05:00
Robin Kobus
7a103035be
[None][fix] Remove overlap scheduler adjustment for max sequence length in create_py_executor function ( #9229 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2026-02-11 08:46:25 -08:00
Guoming Zhang
c47ff4da43
[None][feat] Remove the hard code for activation type definition in T… ( #11164 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2026-02-11 21:50:45 +08:00
Yihan Wang
e8b860965b
[None][feat] Initial PR for trtllm-gen attention backend ( #10784 )
...
Signed-off-by: Yihan Wang <yihwang@nvidia.com>
2026-02-11 17:16:52 +08:00
Taylor Yeonbok Lee
860054c859
[ #11203 ][feat] AutoDeploy: Refactor node caching and improve engine build time ( #11250 )
...
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
2026-02-10 13:35:44 -08:00
mpikulski
411fa9ff87
[TRTLLM-10030][perf] pin host memory and batch sampler setup in beam search ( #11390 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-10 16:48:36 +01:00
Iman Tabrizian
7d992972b2
[TRTLLM-10273][feat] Move MambaCacheManager from Python to C++ ( #10540 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2026-02-10 07:20:56 -08:00
Leslie Fang
d6e49542bd
[ https://nvbugs/5848377 ][fix] fix deepeplowlatency with trtllm moe backend running fp8 DS_R1 ( #11266 )
...
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
Signed-off-by: Leslie Fang <leslief@nvidia.com>
Co-authored-by: Tailing Yuan <yuantailing@gmail.com>
2026-02-10 20:09:00 +08:00
chenfeiz0326
eac56b793e
[ https://nvbugs/5853720 ][fix] Disable cutedsl argmax kernel to fix perf regression ( #11403 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2026-02-10 18:10:38 +08:00
mpikulski
adc0d82500
[ https://nvbugs/5791242 ][chore] remove obsolete code ( #11388 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-10 10:55:29 +01:00
Yuxian Qiu
5f4df89109
[None][feat] Fully non-blocking pipeline parallelism executor loop. ( #10349 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2026-02-10 15:43:28 +08:00
shuyixiong
c3cdc93211
[TRTLLM-9771][feat] Make update_weights compatible with CUDA Graph ( #11267 )
...
Signed-off-by: Shuyi Xiong <219646547+shuyixiong@users.noreply.github.com>
2026-02-10 01:12:49 -05:00
Jonas Li
8b2dc57823
[None][chore] Mass merge commits from release/1.2.0rc6.post1 branch ( #11384 )
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Co-authored-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
2026-02-10 14:00:42 +08:00
Lucas Liebenwein
a2fb5afecf
[ #11032 ][feat] MLA revisited and GLM 4.7 Flash support ( #11324 )
2026-02-09 23:26:51 -05:00
Yuan Tong
4fc3644705
[None][fix] Avoid reserved filename on Windows ( #11382 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2026-02-10 11:22:59 +08:00
Yuxian Qiu
af68c29d3d
[None][chore] Reduce attention module repeated warnings. ( #11335 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2026-02-10 08:58:21 +08:00
Ziyi Xiong
e76b634251
[TRTLLM-10321][feat] Support different KV cache layout for one-model spec dec ( #10502 )
...
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2026-02-10 05:16:02 +08:00
Bala Marimuthu
4a743338c3
[None][infra] AutoDeploy: Dump graph IR after every transform ( #11045 )
...
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
2026-02-09 10:43:44 -08:00
Guiju Zhang
c37531c3f7
[TRTLLM-10669][fix] Fix Eagle3 draft model weight loading for throughput checkpoint ( #11010 )
...
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-09 23:53:40 +08:00
William Zhang
abb8106c01
[ https://nvbugs/5835925 ][fix] Add EPD disagg support for Qwen3 VL MoE ( #10962 )
...
* Why?
Trying to instantiate a `MultimodalEncoder` for a Qwen3 VL MoE model
would fail during weight loading.
* What?
This commit fixes the bug, alongside:
- explicit, intentional support for EPD for Qwen3 VL MoE.
- extends EPD unit tests for Qwen3 VL MoE, albeit with dummy weights.
- unit tests for the weight mapper fixes.
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-09 23:53:40 +08:00
Jin Li
0ead17bb85
[ https://nvbugs/5800646 ][fix] Fix hang issue by avoid exposing UB buf… ( #10842 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-09 23:53:40 +08:00
Stefan Niebler
d50010cd1f
[ https://nvbugs/5769815 ][fix] Fix offset calculation in _are_stop_words when using speculative decoding ( #10854 )
...
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-09 23:53:40 +08:00
mpikulski
196d94a419
[TRTLLM-10030][perf] avoid syncs in beam search + other improvements ( #11349 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-09 16:13:58 +01:00
Gal Hubara-Agam
2b60cc181c
[ #10780 ][feat] AutoDeploy: Support per-expert scales in FP8 and NVFP4 MoE ( #11322 )
...
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
Signed-off-by: Gal Hubara-Agam <96368689+galagam@users.noreply.github.com>
2026-02-09 10:07:37 -05:00
Robin Kobus
31db399042
[ https://nvbugs/5829097 ][fix] Disaggregated serving: Only send finished context requests to the KV cache transceiver ( #11354 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2026-02-09 17:11:45 +08:00
mpikulski
03b38e9fbf
[TRTLLM-10030][perf] avoid sync in PyTorchModelEngine when using beam search ( #11341 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-07 12:31:11 +08:00
William Zhang
ffc0f54959
[ https://nvbugs/5848756 ][fix] Re-take ownership of mrope tensors in prefill worker ( #11217 )
...
* Why?
Previously, the mrope tensors' IPC handles would just be forwarded from
encode -> prefill -> decode workers. While this is fine for the
prefill worker, it is not for the decode worker, since by the time it
tries to rebuild those tensors, they could have been garbage collected
due to their refcounts reaching zero in the producer (encode) worker.
This could lead to nasty runtime errors when running E/P/D
disaggregated serving.
* What?
This commit fixes this by having the prefill worker take ownership of
those reconstructed tensors, and stand up new copies for the decode
worker.
Closes: NvBug 5848756
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2026-02-06 22:37:42 -05:00
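The lifetime bug described in the commit above can be sketched in miniature. This is illustrative Python only, not TRT-LLM code: the `Tensor` class, `encode_worker`, and the weak-reference "handle" are stand-ins for CUDA IPC handles, showing why a handle that merely references producer-owned data dangles once the producer releases it, while a copy taken by the intermediate worker survives.

```python
import gc
import weakref

class Tensor:
    """Stand-in for a producer-owned mrope tensor."""
    def __init__(self, data):
        self.data = list(data)

def encode_worker():
    # Producer creates the tensor and forwards only a weak "handle",
    # analogous to passing an IPC handle without transferring ownership.
    t = Tensor([1.0, 2.0, 3.0])
    return t, weakref.ref(t)

producer, handle = encode_worker()

# Prefill worker takes ownership: rebuild from the handle and copy
# before forwarding anything downstream to the decode worker.
owned_copy = Tensor(handle().data)

# Producer drops its reference; the original data is collected.
del producer
gc.collect()

assert handle() is None                     # forwarded handle now dangles
assert owned_copy.data == [1.0, 2.0, 3.0]   # the owned copy survives
```

The fix in #11217 follows the same principle: the prefill worker materializes its own copies of the reconstructed tensors, so the decode worker's view no longer depends on the encode worker keeping the originals alive.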
Iman Tabrizian
18e611da77
[ https://nvbugs/5863392 ][fix] fix partial reuse disabled for disagg ( #11247 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2026-02-06 14:23:51 -05:00
Shi Xiaowei
b1268e1b37
[TRTLLM-9527][feat] Modularization of the transceiver for KV manager v2 (step 4) ( #11225 )
...
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2026-02-06 07:15:18 -05:00
Yueh-Ting (eop) Chen
383c5921c2
[ https://nvbugs/5756028 ][fix] Fix VSWA initialization with spec-dec and boundary condition in context input preparation ( #10798 )
...
Signed-off-by: eopXD <yuehtingc@nvidia.com>
2026-02-06 14:28:47 +08:00
Chenghao Zhang
9644f024bd
[None][feat] AutoDeploy: add triton backend for causal conv ( #11124 )
...
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2026-02-05 21:33:00 -08:00
Chenghao Zhang
d160439ef9
[ #11148 ][feat] AutoDeploy: Better structure the custom op ( #11152 )
...
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2026-02-05 21:32:22 -08:00
yifeizhang-c
5521c7b7e7
[TRTLLM-9457][feat] Add cute dsl fp8 gemm for Blackwell ( #10130 )
...
Added FP8 cute dsl gemm and batch gemm.
Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
2026-02-06 09:49:30 +08:00
Chuang Zhu
a9d4927235
[TRTLLM-10752][chore] set default val of max_num_tokens_in_buffer as max_seq_len or max_input_len ( #11082 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2026-02-05 14:54:00 -05:00
Harris Nover
a7494a5ff4
[None][chore] Remove outdated comment in model_engine.py ( #11240 )
...
Signed-off-by: Harris Nover <249353502+hnover-nv@users.noreply.github.com>
2026-02-05 13:54:46 -05:00
jthomson04
d778b26062
[None][fix] Reduce host memory usage during model loading ( #11119 )
...
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
2026-02-05 08:57:40 -08:00
mpikulski
7d235cfb23
[TRTLLM-10030][chore] promote SampleState to TypeVar + typing fixes ( #11281 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-05 16:33:22 +01:00
chenfeiz0326
eae480b713
[ https://nvbugs/5820874 ][fix] Adjust deepgemm tuning buckets to cover a larger num_tokens range ( #11259 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2026-02-05 23:12:38 +08:00
mpikulski
719e82c429
[TRTLLM-10030][perf] beam search (remove GPU sync + fix batching + refactor) ( #11276 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-05 15:33:51 +01:00
Yuewei Na
0d18b2d7a4
[None][feat] Add priority-based KV cache offload filtering support ( #10751 )
...
Signed-off-by: Yuewei Na <yna@nvidia.com>
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Co-authored-by: Yuewei Na <nv-yna@users.noreply.github.com>
2026-02-05 05:22:56 -05:00