Commit Graph

2209 Commits

Author SHA1 Message Date
Gal Hubara-Agam
2b60cc181c
[#10780][feat] AutoDeploy: Support per-expert scales in FP8 and NVFP4 MoE (#11322)
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
Signed-off-by: Gal Hubara-Agam <96368689+galagam@users.noreply.github.com>
2026-02-09 10:07:37 -05:00
Robin Kobus
31db399042
[https://nvbugs/5829097][fix] Disaggregated serving: Only send finished context requests to the KV cache transceiver (#11354)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2026-02-09 17:11:45 +08:00
mpikulski
03b38e9fbf
[TRTLLM-10030][perf] avoid sync in PyTorchModelEngine when using beam search (#11341)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-07 12:31:11 +08:00
William Zhang
ffc0f54959
[https://nvbugs/5848756][fix] Re-take ownership of mrope tensors in prefill worker (#11217)
* Why?

Previously, the mrope tensors' IPC handles would just be forwarded from
encode -> prefill -> decode workers. While this is fine for the
prefill worker, it is not for the decode worker, since by the time it
tries to rebuild those tensors, they could have been garbage collected
due to their refcounts reaching zero in the producer (encode) worker.

This could lead to nasty runtime errors when running E/P/D
disaggregated serving.

* What?

This commit fixes this by having the prefill worker take ownership of
those reconstructed tensors, and stand up new copies for the decode
worker.

Closes: NvBug 5848756
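The ownership bug described above can be illustrated with a small, purely hypothetical sketch (not TensorRT-LLM code): it stands in for CUDA IPC handles with Python weak references, which share the key property that the handle does not keep the producer's object alive. The `Tensor`, `producer_makes_handle`, and `prefill_takes_ownership` names are illustrative only.

```python
import weakref

class Tensor:
    """Stand-in for a GPU tensor owned by some worker."""
    def __init__(self, data):
        self.data = data

def producer_makes_handle():
    # Like forwarding an IPC handle without retaining the tensor:
    # the weakref carries no ownership, so once this function returns,
    # the producer's tensor is collected and the handle dangles.
    t = Tensor([1, 2, 3])
    return weakref.ref(t)

def prefill_takes_ownership(handle):
    # The fix in spirit: while the producer's tensor is still alive,
    # the prefill worker materializes its own copy, which it can then
    # safely re-export to the decode worker.
    t = handle()
    if t is None:
        raise RuntimeError("producer already freed the tensor")
    return Tensor(list(t.data))

# Old path: the handle outlives the producer's tensor.
stale = producer_makes_handle()
assert stale() is None  # referent was garbage collected

# Fixed path: copy while the referent is still alive, then drop it.
t = Tensor([4, 5, 6])
handle = weakref.ref(t)
owned = prefill_takes_ownership(handle)
del t
assert owned.data == [4, 5, 6]  # survives independently of the producer
```

The analogy is loose (CPython refcounts vs. CUDA IPC), but it captures why rebuilding the tensors only in the decode worker was racy, while copying in the prefill worker is not.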

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2026-02-06 22:37:42 -05:00
Iman Tabrizian
18e611da77
[https://nvbugs/5863392][fix] fix partial reuse disabled for disagg (#11247)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2026-02-06 14:23:51 -05:00
Shi Xiaowei
b1268e1b37
[TRTLLM-9527][feat] Modularization of the transceiver for KV manager v2 (step 4) (#11225)
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2026-02-06 07:15:18 -05:00
Yueh-Ting (eop) Chen
383c5921c2
[https://nvbugs/5756028][fix] Fix VSWA initialization with spec-dec and boundary condition in context input preparation (#10798)
Signed-off-by: eopXD <yuehtingc@nvidia.com>
2026-02-06 14:28:47 +08:00
Chenghao Zhang
9644f024bd
[None][feat] AutoDeploy: add triton backend for causal conv (#11124)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2026-02-05 21:33:00 -08:00
Chenghao Zhang
d160439ef9
[#11148][feat] AutoDeploy: Better structure the custom op (#11152)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2026-02-05 21:32:22 -08:00
yifeizhang-c
5521c7b7e7
[TRTLLM-9457][feat] Add cute dsl fp8 gemm for Blackwell (#10130)
Added FP8 cute dsl gemm and batch gemm.

Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
2026-02-06 09:49:30 +08:00
Chuang Zhu
a9d4927235
[TRTLLM-10752][chore] set default val of max_num_tokens_in_buffer as max_seq_len or max_input_len (#11082)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2026-02-05 14:54:00 -05:00
Harris Nover
a7494a5ff4
[None][chore] Remove outdated comment in model_engine.py (#11240)
Signed-off-by: Harris Nover <249353502+hnover-nv@users.noreply.github.com>
2026-02-05 13:54:46 -05:00
jthomson04
d778b26062
[None][fix] Reduce host memory usage during model loading (#11119)
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
2026-02-05 08:57:40 -08:00
mpikulski
7d235cfb23
[TRTLLM-10030][chore] promote SampleState to TypeVar + typing fixes (#11281)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-05 16:33:22 +01:00
chenfeiz0326
eae480b713
[https://nvbugs/5820874][fix] Adjust deepgemm tuning buckets to cover a larger num_tokens range (#11259)
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2026-02-05 23:12:38 +08:00
mpikulski
719e82c429
[TRTLLM-10030][perf] beam search (remove GPU sync + fix batching + refactor) (#11276)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-05 15:33:51 +01:00
Yuewei Na
0d18b2d7a4
[None][feat] Add priority-based KV cache offload filtering support (#10751)
Signed-off-by: Yuewei Na <yna@nvidia.com>
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Co-authored-by: Yuewei Na <nv-yna@users.noreply.github.com>
2026-02-05 05:22:56 -05:00
Chang Su
9601b17459
[#11037][fix] Fix proto-to-SamplingParams conversion bugs and add gRPC tests (#11292)
Signed-off-by: Chang Su <chang.s.su@oracle.com>
2026-02-05 05:00:29 -05:00
Yao Yao
d9b936be94
[None][feat] Enhance support for complex models (#11254)
Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
2026-02-05 17:28:26 +08:00
xxi
4c1d9d0c10
[None][chore] Pass without_comm to cutlass and deepgemm (#11229)
Signed-off-by: xxi <xxi@nvidia.com>
2026-02-05 02:07:59 -05:00
Yi Zhang
ada463d15d
[None][fix] Fix comments for kv cache manager v2 (#11207)
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2026-02-04 23:31:29 -05:00
dongfengy
0bd4630cd1
[https://nvbugs/5854860][fix] Fix cutedsl argmax on sm120 (#11181)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2026-02-04 17:15:31 -05:00
Grzegorz Kwasniewski
d90a8e5700
[TRTLLM-10673][feat] Improved layer classification for sharding (#10718)
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
2026-02-04 18:06:10 +01:00
Lucas Liebenwein
925d911fc0
[#10966][feat] AutoDeploy: kv cache manager integration [2/2] (#11149)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2026-02-04 09:44:27 -05:00
Yueh-Ting (eop) Chen
f6fff18142
[https://nvbugs/5624818][fix] Work around accuracy issue by enforcing paged_context_fmha on Hopper for fmha_v2 (#11192)
Signed-off-by: eopXD <yuehtingc@nvidia.com>
2026-02-04 19:21:50 +08:00
mpikulski
f0ca62b175
[None][fix] make health_generate work with beam search (#11097)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-04 09:46:19 +01:00
xxi
02b80bfd58
[TRTLLM-9111][feat] provide the uniform test framework to test all MoE backends (#11128)
Signed-off-by: xxi <xxi@nvidia.com>
2026-02-04 15:57:56 +08:00
Gal Hubara-Agam
de6931bbfd
[None][fix] Fix selective_state_update perf regression for T=1 decode path (#11194)
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
2026-02-04 09:01:34 +02:00
tburt-nv
588db0ed64
[None][chore] bump version to 1.3.0rc3 (#11238)
Signed-off-by: Tyler Burt <tburt@nvidia.com>
2026-02-04 09:30:45 +08:00
Dmitry Barsukoff
5d522295e9
[None][fix] Set continuous_usage_stats default to False to follow OpenAI protocol (#10644)
Signed-off-by: Dmitry Barsukoff <riZZZhik@gmail.com>
Co-authored-by: Kanghwan <861393+karljang@users.noreply.github.com>
2026-02-03 16:04:54 -08:00
Taylor Yeonbok Lee
f9e6045f39
[#11086][feat] Optimize Auto Deploy weight loading by preloading weights to CPU (#11059)
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
2026-02-03 13:23:10 -08:00
Lizhi Zhou
f9c4bdf6cf
[TRTLLM-8921][feat] implement gen-first disagg_service (#11020)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2026-02-03 15:46:11 -05:00
Chenjie Luo
2532eb5adc
[None][fix] Align kv_scales with modelopt HF checkpoint (#10745)
Signed-off-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
2026-02-03 08:03:42 -05:00
gramnarayan
585fbb2734
[#10826][feat] AutoDeploy: Eagle One-Model [2/n]: Prefill-Only Implementation (#11073)
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
2026-02-02 09:51:10 -08:00
Izzy Putterman
3ef8a4639b
[None][feat] Nemotron H: Eagle3 support (#11131)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2026-02-02 10:26:25 -05:00
Rundong Li
f1b85fea4c
[None][feat] Integrate cuda.tile RMS norm kernels (#9725)
Signed-off-by: Rundong (David) Li <davidli@nvidia.com>
Co-authored-by: Jinman Xie <jinmanx@nvidia.com>
Co-authored-by: Alexey Bylinkin <abylinkin@nvidia.com>
Co-authored-by: Qiqi Xiao <qiqix@nvidia.com>
Co-authored-by: Biao Wang <biaow@nvidia.com>
Co-authored-by: Thomas Schmid <thschmid@nvidia.com>
2026-02-02 19:44:27 +08:00
Mike Iovine
13b0ab9c0e
[None][fix] Fix MTP 1-model sampler (#10369)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-02 16:26:46 +08:00
Mike Iovine
d9aef94431
[https://nvbugs/5814914][fix] Fix llama sm120 spec dec (#10765)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-02 16:26:46 +08:00
Zheyu Fu
d31482686c
[https://nvbugs/5680911][fix] Remove @cache decorator to enhance CI stability for unit tests using single process mode (#10730)
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-02 16:26:46 +08:00
Enwei Zhu
ccdd8461ac
[None][fix] Always reset drafting states for GuidedDecoder (#10899)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-02 16:26:46 +08:00
Michal Guzek
fafc22e3d4
[https://nvbugs/5691730][fix] Have LoRA bf16 ckpts work with Llama 3.3-70B-fp8 (#9808)
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
Signed-off-by: Michal Guzek <moraxu@users.noreply.github.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-02 16:26:46 +08:00
William Zhang
bc2487bc2c
[https://nvbugs/5826962][fix] Fix PD disaggregation for VLMs that use mrope (#10865)
* Why?

Commit a6a8898 enabled EPD disaggregation for VLMs that use mrope (e.g.
qwen). However, this broke PD disaggregation for these same models.

* What?

This commit fixes this, and adds a unit test that guards against it.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-02 16:26:46 +08:00
Yi Zhang
0306c0f12c
[TRTLLM-9766][feat] Integration of the KVCacheManager V2 to TRTLLM Runtime (#10659)
Signed-off-by: yizhang-nv <187001205+yizhang-nv@users.noreply.github.com>
2026-02-02 14:29:02 +08:00
Liao Lanyu
fef0e4b17d
[TRTLLM-10666][chore] Refactor request fetching logic for better separation of concerns (#10988)
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
Signed-off-by: Liao Lanyu <108499334+lancelly@users.noreply.github.com>
Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com>
2026-02-02 10:36:08 +08:00
Lizhi Zhou
b00e8338ec
[https://nvbugs/5834212][fix] prevent routing ctx and gen requests to the same worker; update doc for unique disagg ID (#11095)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2026-02-02 09:54:33 +08:00
Dmitry Barsukoff
ea49afdf0b
[None][fix] AttributeError with return_perf_metrics on tensorrt backend (#10662)
Signed-off-by: Dmitry Barsukoff <riZZZhik@gmail.com>
Co-authored-by: Kanghwan <861393+karljang@users.noreply.github.com>
2026-02-02 08:41:15 +08:00
shuyixiong
278ced972b
[TRTLLM-9771][feat] Allow overriding quantization configs (#11062)
Signed-off-by: shuyixiong <219646547+shuyixiong@users.noreply.github.com>
2026-01-31 10:48:51 -05:00
Frida Hou
7910d4d2a9
[#8242][feat] Add int4 GPTQ support for AutoDeploy (#8248)
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2026-01-30 23:07:24 -08:00
Guoming Zhang
6bace84167
[TRTLLM-10398][feat] Enable TRTLLM moe backend for Nemotron Super (#10791)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2026-01-31 13:48:25 +08:00
Balaram Buddharaju
531f85dc9b
[None][feat] Perfect routing for Deepseek models (#11127)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2026-01-30 23:46:35 -05:00
Karthik
5a97374f3c
[#9525][feat] add L2 norm pattern matcher and fusion transform (#10767)
Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
2026-01-30 16:05:53 -05:00
nvyocox
4af47208d8
[None][feat] Export ONNX for DriveOS LLM (#10117)
Signed-off-by: yocox <yocox@nvidia.com>
2026-01-30 15:43:11 -05:00
Yao Yao
53cb762ee5
[None][feat] New KVCacheManagerV2 APIs for Transceiver (#11003)
Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
2026-01-30 18:09:53 +08:00
Liao Lanyu
f2dd0ee128
[None][chore] Correct sorting order for attention DP scheduling to prioritize non-relaxed requests (#11106)
Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
2026-01-30 16:06:48 +08:00
dongfengy
4f0c1b2489
[TRTLLM-10733][feat] Make TRTLLM MOE the default one for GPTOSS on Blackwell (#11074)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2026-01-29 23:59:19 -08:00
Jin Li
ef268e2062
[TRTLLM-9904][feat] Changes for future KVCacheV2 MTP support (#11029)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2026-01-30 01:49:17 -05:00
Necofish
144b61715f
[None][fix] Add missing absolute pe in Qwen3-VL Vision Encoder (#11065)
Signed-off-by: Necofish <liuxiangyang@mail.ustc.edu.cn>
2026-01-30 09:59:36 +09:00
Chang Su
dbad94715b
[None][feat] Add gRPC server for high-performance external router integration (#11037)
Signed-off-by: Chang Su <chang.s.su@oracle.com>
2026-01-30 07:48:27 +08:00
Chenghao Zhang
e033929221
[None][feat] AutoDeploy: Flashinfer kernels bringup (#10867)
Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
2026-01-29 14:59:29 -08:00
Harris Nover
ab7dd34bbe
[None][chore] Consolidate duplicate kv cache reuse variables. (#10935)
Signed-off-by: Harris Nover <249353502+hnover-nv@users.noreply.github.com>
2026-01-29 11:03:27 -08:00
Stefan Niebler
7d31532850
[TRTLLM-10312][perf] Improve performance of _write_finish_reasons in TorchSampler (#10459)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2026-01-29 11:06:09 -05:00
Balaram Buddharaju
c7a86f89de
[TRTLLM-10264][feat] Support attention DP + Helix CP (#10477)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2026-01-29 02:57:13 -05:00
Tailing Yuan
91528365a9
[None][feat] Add performance alignment to layer-wise benchmarks (#11018)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2026-01-29 14:01:51 +08:00
Enwei Zhu
34a730aaf7
[None][fix] Fix enable_alltoall passed to CutlassFusedMoE (#11016)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2026-01-29 12:11:07 +08:00
Anish Shanbhag
24ac86c485
[https://nvbugs/5761391][fix] Include triton-kernels as a packaged dependency (#10471)
Signed-off-by: Anish Shanbhag <ashanbhag@nvidia.com>
2026-01-28 19:56:32 -08:00
Frida Hou
f03908cf9e
[None][fix] fix Qwen2/3 export for AutoDeploy (#11007)
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2026-01-28 16:53:21 -08:00
Ludwig Schneider
4e10bf8950
[None][fix] nccl symmetric with graceful fallbacks (#11042)
Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>
2026-01-28 15:43:24 -08:00
Bala Marimuthu
393c3d259e
[#10245][feat] AutoDeploy: Add Minimax M2 support (#10525)
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
2026-01-28 17:22:32 -05:00
gramnarayan
744a955cbb
[None][chore] AutoDeploy: Eagle One-Model [1/n]: PyTorch impl for Eagle3 Llama checkpoint (#10674)
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
2026-01-28 12:10:49 -08:00
Evgueni Petrov
f25a2c53bb
[#10877][fix] restore ipv6 support in serve.py (#10929)
Signed-off-by: Evgueni Petrov <evgueni.s.petrov@gmail.com>
2026-01-27 11:55:59 -08:00
Lucas Liebenwein
ff3a494f5c
[#10013][feat] AutoDeploy: native cache manager integration (#10635)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2026-01-27 11:23:22 -05:00
Yukun He
b575184fca
[TRTLLM-10308][feat] AutoTuner Cache: reorganize cache file for distributed tuning (#10956)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2026-01-27 16:39:40 +08:00
Chuang Zhu
d6f76d2fae
[TRTLLM-9527][feat] change context params and disagg params (step3) (#10495)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2026-01-27 16:34:17 +08:00
ZhichenJiang
fae4985797
[TRTLLM-9831][perf] Use TMA.RED to improve effective memory bandwidth (#10987)
Signed-off-by: zhichen jiang <zhichenj@NVIDIA.com>
2026-01-27 16:15:32 +08:00
Bo Li
6b251cc7fa
[TRTLLM-9390][chore] Add Fake OPs for One-Sided AlltoAll. (#11002)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2026-01-27 15:55:07 +08:00
Lizhi Zhou
93ae8a14ab
[#10889][fix] fix pydantic deepcopy bug (#11004)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2026-01-27 02:40:13 -05:00
Yiqing Yan
ea5d811aec
[None][chore] Bump version to 1.3.0rc2 (#11021)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2026-01-27 15:26:03 +08:00
Tailing Yuan
5553391c5e
[TRTLLM-10560][fix] Fix the time of pause() for overlap scheduler (#10943)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2026-01-27 13:18:34 +08:00
Wanli Jiang
4a206351bb
[TRTLLM-10453][feat] Update mamba decode kernel to flashinfer (#10757)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2026-01-27 13:04:40 +08:00
ameynaik-hub
df8be0c50c
[TRTLLM-10276][feat] Integrate cutedsl argmax kernel (#10476)
Signed-off-by: Amey Naik <212485788+ameynaik-hub@users.noreply.github.com>
Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
Co-authored-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
2026-01-26 22:08:47 -05:00
sunnyqgg
ff0dd6076e
[TRTLLM-10062][feat] Enable MTP for Nemotron Super (#10754)
Signed-off-by: qgai <qgai@nvidia.com>
2026-01-26 11:23:26 -05:00
Lucas Liebenwein
00f341be49
[#8982][feat] AutoDeploy attention dp support (#10728)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2026-01-26 09:43:33 -05:00
Pengyun Lin
ce37e27066
[#10614][fix] gpt_oss first iteration streaming in trtllm-serve (#10808)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2026-01-26 20:53:11 +08:00
Bo Li
e405468230
[TRTLLM-10048][feat] Fuse the AllGather for expert statistics required by the EPLB. (#10885)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2026-01-26 17:59:03 +08:00
Tian Zheng
5efee01da1
[None][feat] Add Skip Softmax MLA kernels for Blackwell and Fix an accuracy bug of NVFP4 KV (#10813)
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
2026-01-26 16:46:33 +08:00
Enwei Zhu
72ef732bcf
[TRTLLM-10147][perf] Balanced random MoE workload generator for CuteDSL kernel UT, autotuner and layerwise benchmark (#10279)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2026-01-25 21:02:30 +08:00
Yanchao Lu
ae58a7ed20
[None][chore] Revert NVIDIA/TensorRT-LLM#10819 (#10870)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-01-25 18:12:21 +08:00
Yanchao Lu
18f63dfcec
[None][chore] Reduce tedious logs (#10819)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-01-25 18:12:21 +08:00
mpikulski
0f7ec033f7
[https://nvbugs/5791242][fix] workaround for flashinfer.sampling.sampling_from_logits (#10713)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-01-25 18:12:21 +08:00
Yukun He
25bdc30162
[https://nvbugs/5782112][fix] Cherry-pick #10633: Fix hanging issue for MNNVL Allreduce under PP (#10750)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-01-25 18:12:21 +08:00
Yuxian Qiu
2b3bb2e9b0
[https://nvbugs/5811697][fix] Fix buffer reuse. (#10716)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-01-25 18:12:21 +08:00
Mike Iovine
f02948d956
[https://nvbugs/5803813][fix] Fix llama 4 min latency (#10724)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-01-25 18:12:21 +08:00
Yao Yao
6f07fa81d7
[TRTLLM-7738][feat] Adding implementation of KVCacheManagerV2 (#10736)
Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>

KVCacheManagerV2 is a new Python-based implementation of the KV cache manager, featuring a cleaner API, better abstraction, and better code quality without the accumulated legacy.
2026-01-24 04:48:39 -05:00
Yuxian Qiu
9fcc93ea7b
[https://nvbugs/5829097][fix] Re-init TRTLLM sampler to use sample stream in multi-stream cases. (#10918)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2026-01-24 14:04:10 +08:00
Kaiyu Xie
da967d0bd7
[TRTLLM-10334][feat] Support overlap scheduler for disagg ctx instances (#10755)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2026-01-23 22:29:37 -05:00
jthomson04
cf88da7eca
[None][feat] KV Connector Support for MTP (#10932)
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Co-authored-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2026-01-23 18:58:26 -05:00
Taylor Yeonbok Lee
1fbbb1f3cd
[None][feat] AutoDeploy: Enhance memory consumption for MoE fusion transform (#10772)
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
2026-01-23 15:22:54 -08:00
Yan Chunwei
54768f3f2c
[None][chore] refine placement group in ray executor (#10235)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
2026-01-23 19:31:20 +08:00
Leslie Fang
31d04dfa12
[TRTLLM-9108][feat] Add test configurable moe module multi gpu (#10699)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2026-01-23 10:16:58 +08:00
William Zhang
2146c23786
[#9306][refactor] Refactor AutoDeployConfig into LlmArgs (#10613)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2026-01-22 16:02:49 -05:00