Commit Graph

2209 Commits

Author SHA1 Message Date
Gal Hubara-Agam
2b60cc181c
[#10780][feat] AutoDeploy: Support per-expert scales in FP8 and NVFP4 MoE (#11322)
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
Signed-off-by: Gal Hubara-Agam <96368689+galagam@users.noreply.github.com>
2026-02-09 10:07:37 -05:00
Robin Kobus
31db399042
[https://nvbugs/5829097][fix] Disaggregated serving: Only send finished context requests to the KV cache transceiver (#11354)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2026-02-09 17:11:45 +08:00
mpikulski
03b38e9fbf
[TRTLLM-10030][perf] avoid sync in PyTorchModelEngine when using beam search (#11341)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-07 12:31:11 +08:00
William Zhang
ffc0f54959
[https://nvbugs/5848756][fix] Re-take ownership of mrope tensors in prefill worker (#11217)
* Why?

Previously, the mrope tensors' IPC handles would just be forwarded from
encode -> prefill -> decode workers. While this is fine for the
prefill worker, it is not for the decode worker, since by the time it
tries to rebuild those tensors, they could have been garbage collected
due to their refcounts reaching zero in the producer (encode) worker.

This could lead to nasty runtime errors when running E/P/D
disaggregated serving.

* What?

This commit fixes this by having the prefill worker take ownership of
those reconstructed tensors, and stand up new copies for the decode
worker.

Closes: NvBug 5848756
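The ownership bug described above can be illustrated with a small, purely hypothetical sketch (not TensorRT-LLM code): it stands in for CUDA IPC handles with Python weak references, which share the key property that the handle does not keep the producer's object alive. The `Tensor`, `producer_makes_handle`, and `prefill_takes_ownership` names are illustrative only.

```python
import weakref

class Tensor:
    """Stand-in for a GPU tensor owned by some worker."""
    def __init__(self, data):
        self.data = data

def producer_makes_handle():
    # Like forwarding an IPC handle without retaining the tensor:
    # the weakref carries no ownership, so once this function returns,
    # the producer's tensor is collected and the handle dangles.
    t = Tensor([1, 2, 3])
    return weakref.ref(t)

def prefill_takes_ownership(handle):
    # The fix in spirit: while the producer's tensor is still alive,
    # the prefill worker materializes its own copy, which it can then
    # safely re-export to the decode worker.
    t = handle()
    if t is None:
        raise RuntimeError("producer already freed the tensor")
    return Tensor(list(t.data))

# Old path: the handle outlives the producer's tensor.
stale = producer_makes_handle()
assert stale() is None  # referent was garbage collected

# Fixed path: copy while the referent is still alive, then drop it.
t = Tensor([4, 5, 6])
handle = weakref.ref(t)
owned = prefill_takes_ownership(handle)
del t
assert owned.data == [4, 5, 6]  # survives independently of the producer
```

The analogy is loose (CPython refcounts vs. CUDA IPC), but it captures why rebuilding the tensors only in the decode worker was racy, while copying in the prefill worker is not.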

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2026-02-06 22:37:42 -05:00
Iman Tabrizian
18e611da77
[https://nvbugs/5863392][fix] fix partial reuse disabled for disagg (#11247)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2026-02-06 14:23:51 -05:00
Shi Xiaowei
b1268e1b37
[TRTLLM-9527][feat] Modularization of the transceiver for KV manager v2 (step 4) (#11225)
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2026-02-06 07:15:18 -05:00
Yueh-Ting (eop) Chen
383c5921c2
[https://nvbugs/5756028][fix] Fix VSWA initialization with spec-dec and boundary condition in context input preparation (#10798)
Signed-off-by: eopXD <yuehtingc@nvidia.com>
2026-02-06 14:28:47 +08:00
Chenghao Zhang
9644f024bd
[None][feat] AutoDeploy: add triton backend for causal conv (#11124)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2026-02-05 21:33:00 -08:00
Chenghao Zhang
d160439ef9
[#11148][feat] AutoDeploy: Better structure the custom op (#11152)
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2026-02-05 21:32:22 -08:00
yifeizhang-c
5521c7b7e7
[TRTLLM-9457][feat] Add cute dsl fp8 gemm for Blackwell (#10130)
Added FP8 cute dsl gemm and batch gemm.

Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
2026-02-06 09:49:30 +08:00
Chuang Zhu
a9d4927235
[TRTLLM-10752][chore] set default val of max_num_tokens_in_buffer as max_seq_len or max_input_len (#11082)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2026-02-05 14:54:00 -05:00
Harris Nover
a7494a5ff4
[None][chore] Remove outdated comment in model_engine.py (#11240)
Signed-off-by: Harris Nover <249353502+hnover-nv@users.noreply.github.com>
2026-02-05 13:54:46 -05:00
jthomson04
d778b26062
[None][fix] Reduce host memory usage during model loading (#11119)
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
2026-02-05 08:57:40 -08:00
mpikulski
7d235cfb23
[TRTLLM-10030][chore] promote SampleState to TypeVar + typing fixes (#11281)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-05 16:33:22 +01:00
chenfeiz0326
eae480b713
[https://nvbugs/5820874][fix] Adjust deepgemm tuning buckets to cover a larger num_tokens range (#11259)
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2026-02-05 23:12:38 +08:00
mpikulski
719e82c429
[TRTLLM-10030][perf] beam search (remove GPU sync + fix batching + refactor) (#11276)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-05 15:33:51 +01:00
Yuewei Na
0d18b2d7a4
[None][feat] Add priority-based KV cache offload filtering support (#10751)
Signed-off-by: Yuewei Na <yna@nvidia.com>
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Co-authored-by: Yuewei Na <nv-yna@users.noreply.github.com>
2026-02-05 05:22:56 -05:00
Chang Su
9601b17459
[#11037][fix] Fix proto-to-SamplingParams conversion bugs and add gRPC tests (#11292)
Signed-off-by: Chang Su <chang.s.su@oracle.com>
2026-02-05 05:00:29 -05:00
Yao Yao
d9b936be94
[None][feat] Enhance support for complex models (#11254)
Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
2026-02-05 17:28:26 +08:00
xxi
4c1d9d0c10
[None][chore] Pass without_comm to cutlass and deepgemm (#11229)
Signed-off-by: xxi <xxi@nvidia.com>
2026-02-05 02:07:59 -05:00
Yi Zhang
ada463d15d
[None][fix] Fix comments for kv cache manager v2 (#11207)
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2026-02-04 23:31:29 -05:00
dongfengy
0bd4630cd1
[https://nvbugs/5854860][fix] Fix cutedsl argmax on sm120 (#11181)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2026-02-04 17:15:31 -05:00
Grzegorz Kwasniewski
d90a8e5700
[TRTLLM-10673][feat] Improved layer classification for sharding (#10718)
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
2026-02-04 18:06:10 +01:00
Lucas Liebenwein
925d911fc0
[#10966][feat] AutoDeploy: kv cache manager integration [2/2] (#11149)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2026-02-04 09:44:27 -05:00
Yueh-Ting (eop) Chen
f6fff18142
[https://nvbugs/5624818][fix] Work around accuracy issue by enforcing paged_context_fmha on Hopper for fmha_v2 (#11192)
Signed-off-by: eopXD <yuehtingc@nvidia.com>
2026-02-04 19:21:50 +08:00
mpikulski
f0ca62b175
[None][fix] make health_generate work with beam search (#11097)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-04 09:46:19 +01:00
xxi
02b80bfd58
[TRTLLM-9111][feat] provide the uniform test framework to test all MoE backends (#11128)
Signed-off-by: xxi <xxi@nvidia.com>
2026-02-04 15:57:56 +08:00
Gal Hubara-Agam
de6931bbfd
[None][fix] Fix selective_state_update perf regression for T=1 decode path (#11194)
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
2026-02-04 09:01:34 +02:00
tburt-nv
588db0ed64
[None][chore] bump version to 1.3.0rc3 (#11238)
Signed-off-by: Tyler Burt <tburt@nvidia.com>
2026-02-04 09:30:45 +08:00
Dmitry Barsukoff
5d522295e9
[None][fix] Set continuous_usage_stats default to False to follow OpenAI protocol (#10644)
Signed-off-by: Dmitry Barsukoff <riZZZhik@gmail.com>
Co-authored-by: Kanghwan <861393+karljang@users.noreply.github.com>
2026-02-03 16:04:54 -08:00
Taylor Yeonbok Lee
f9e6045f39
[#11086][feat] Optimize Auto Deploy weight loading by preloading weights to CPU (#11059)
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
2026-02-03 13:23:10 -08:00
Lizhi Zhou
f9c4bdf6cf
[TRTLLM-8921][feat] implement gen-first disagg_service (#11020)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2026-02-03 15:46:11 -05:00
Chenjie Luo
2532eb5adc
[None][fix] Align kv_scales with modelopt HF checkpoint (#10745)
Signed-off-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
2026-02-03 08:03:42 -05:00
gramnarayan
585fbb2734
[#10826][feat] AutoDeploy: Eagle One-Model [2/n]: Prefill-Only Implementation (#11073)
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
2026-02-02 09:51:10 -08:00
Izzy Putterman
3ef8a4639b
[None][feat] Nemotron H: Eagle3 support (#11131)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2026-02-02 10:26:25 -05:00
Rundong Li
f1b85fea4c
[None][feat] Integrate cuda.tile RMS norm kernels (#9725)
Signed-off-by: Rundong (David) Li <davidli@nvidia.com>
Co-authored-by: Jinman Xie <jinmanx@nvidia.com>
Co-authored-by: Alexey Bylinkin <abylinkin@nvidia.com>
Co-authored-by: Qiqi Xiao <qiqix@nvidia.com>
Co-authored-by: Biao Wang <biaow@nvidia.com>
Co-authored-by: Thomas Schmid <thschmid@nvidia.com>
2026-02-02 19:44:27 +08:00
Mike Iovine
13b0ab9c0e
[None][fix] Fix MTP 1-model sampler (#10369)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-02 16:26:46 +08:00
Mike Iovine
d9aef94431
[https://nvbugs/5814914][fix] Fix llama sm120 spec dec (#10765)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-02 16:26:46 +08:00
Zheyu Fu
d31482686c
[https://nvbugs/5680911][fix] Remove @cache decorator to enhance CI stability for unit tests using single process mode (#10730)
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-02 16:26:46 +08:00
Enwei Zhu
ccdd8461ac
[None][fix] Always reset drafting states for GuidedDecoder (#10899)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-02 16:26:46 +08:00
Michal Guzek
fafc22e3d4
[https://nvbugs/5691730][fix] Have LoRA bf16 ckpts work with Llama 3.3-70B-fp8 (#9808)
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
Signed-off-by: Michal Guzek <moraxu@users.noreply.github.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-02 16:26:46 +08:00
William Zhang
bc2487bc2c
[https://nvbugs/5826962][fix] Fix PD disaggregation for VLMs that use mrope (#10865)
* Why?

Commit a6a8898 enabled EPD disaggregation for VLMs that use mrope (e.g.
qwen). However, this broke PD disaggregation for these same models.

* What?

This commit fixes this, and adds a unit test that guards against it.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-02 16:26:46 +08:00
Yi Zhang
0306c0f12c
[TRTLLM-9766][feat] Integration of the KVCacheManager V2 to TRTLLM Runtime (#10659)
Signed-off-by: yizhang-nv <187001205+yizhang-nv@users.noreply.github.com>
2026-02-02 14:29:02 +08:00
Liao Lanyu
fef0e4b17d
[TRTLLM-10666][chore] Refactor request fetching logic for better separation of concerns (#10988)
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
Signed-off-by: Liao Lanyu <108499334+lancelly@users.noreply.github.com>
Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com>
2026-02-02 10:36:08 +08:00
Lizhi Zhou
b00e8338ec
[https://nvbugs/5834212][fix] prevent routing ctx and gen requests to the same worker; update doc for unique disagg ID (#11095)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2026-02-02 09:54:33 +08:00
Dmitry Barsukoff
ea49afdf0b
[None][fix] AttributeError with return_perf_metrics on tensorrt backend (#10662)
Signed-off-by: Dmitry Barsukoff <riZZZhik@gmail.com>
Co-authored-by: Kanghwan <861393+karljang@users.noreply.github.com>
2026-02-02 08:41:15 +08:00
shuyixiong
278ced972b
[TRTLLM-9771][feat] Allow overriding quantization configs (#11062)
Signed-off-by: shuyixiong <219646547+shuyixiong@users.noreply.github.com>
2026-01-31 10:48:51 -05:00
Frida Hou
7910d4d2a9
[#8242][feat] Add int4 GPTQ support for AutoDeploy (#8248)
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2026-01-30 23:07:24 -08:00
Guoming Zhang
6bace84167
[TRTLLM-10398][feat] Enable TRTLLM moe backend for Nemotron Super (#10791)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2026-01-31 13:48:25 +08:00
Balaram Buddharaju
531f85dc9b
[None][feat] Perfect routing for Deepseek models (#11127)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2026-01-30 23:46:35 -05:00
Karthik
5a97374f3c
[#9525][feat] add L2 norm pattern matcher and fusion transform (#10767)
Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
2026-01-30 16:05:53 -05:00
nvyocox
4af47208d8
[None][feat] Export ONNX for DriveOS LLM (#10117)
Signed-off-by: yocox <yocox@nvidia.com>
2026-01-30 15:43:11 -05:00
Yao Yao
53cb762ee5
[None][feat] New KVCacheManagerV2 APIs for Transceiver (#11003)
Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
2026-01-30 18:09:53 +08:00
Liao Lanyu
f2dd0ee128
[None][chore] Correct sorting order for attention DP scheduling to prioritize non-relaxed requests (#11106)
Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
2026-01-30 16:06:48 +08:00
dongfengy
4f0c1b2489
[TRTLLM-10733][feat] Make TRTLLM MOE the default one for GPTOSS on Blackwell (#11074)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2026-01-29 23:59:19 -08:00
Jin Li
ef268e2062
[TRTLLM-9904][feat] Changes for future KVCacheV2 MTP support (#11029)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2026-01-30 01:49:17 -05:00
Necofish
144b61715f
[None][fix] Add missing absolute pe in Qwen3-VL Vision Encoder (#11065)
Signed-off-by: Necofish <liuxiangyang@mail.ustc.edu.cn>
2026-01-30 09:59:36 +09:00
Chang Su
dbad94715b
[None][feat] Add gRPC server for high-performance external router integration (#11037)
Signed-off-by: Chang Su <chang.s.su@oracle.com>
2026-01-30 07:48:27 +08:00
Chenghao Zhang
e033929221
[None][feat] AutoDeploy: Flashinfer kernels bringup (#10867)
Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
2026-01-29 14:59:29 -08:00
Harris Nover
ab7dd34bbe
[None][chore] Consolidate duplicate kv cache reuse variables. (#10935)
Signed-off-by: Harris Nover <249353502+hnover-nv@users.noreply.github.com>
2026-01-29 11:03:27 -08:00
Stefan Niebler
7d31532850
[TRTLLM-10312][perf] Improve performance of _write_finish_reasons in TorchSampler (#10459)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2026-01-29 11:06:09 -05:00
Balaram Buddharaju
c7a86f89de
[TRTLLM-10264][feat] Support attention DP + Helix CP (#10477)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2026-01-29 02:57:13 -05:00
Tailing Yuan
91528365a9
[None][feat] Add performance alignment to layer-wise benchmarks (#11018)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2026-01-29 14:01:51 +08:00
Enwei Zhu
34a730aaf7
[None][fix] Fix enable_alltoall passed to CutlassFusedMoE (#11016)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2026-01-29 12:11:07 +08:00
Anish Shanbhag
24ac86c485
[https://nvbugs/5761391][fix] Include triton-kernels as a packaged dependency (#10471)
Signed-off-by: Anish Shanbhag <ashanbhag@nvidia.com>
2026-01-28 19:56:32 -08:00
Frida Hou
f03908cf9e
[None][fix] fix Qwen2/3 export for AutoDeploy (#11007)
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2026-01-28 16:53:21 -08:00
Ludwig Schneider
4e10bf8950
[None][fix] nccl symmetric with graceful fallbacks (#11042)
Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>
2026-01-28 15:43:24 -08:00
Bala Marimuthu
393c3d259e
[#10245][feat] AutoDeploy: Add Minimax M2 support (#10525)
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
2026-01-28 17:22:32 -05:00
gramnarayan
744a955cbb
[None][chore] AutoDeploy: Eagle One-Model [1/n]: PyTorch impl for Eagle3 Llama checkpoint (#10674)
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
2026-01-28 12:10:49 -08:00
Evgueni Petrov
f25a2c53bb
[#10877][fix] restore ipv6 support in serve.py (#10929)
Signed-off-by: Evgueni Petrov <evgueni.s.petrov@gmail.com>
2026-01-27 11:55:59 -08:00
Lucas Liebenwein
ff3a494f5c
[#10013][feat] AutoDeploy: native cache manager integration (#10635)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2026-01-27 11:23:22 -05:00
Yukun He
b575184fca
[TRTLLM-10308][feat] AutoTuner Cache: reorganize cache file for distributed tuning (#10956)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2026-01-27 16:39:40 +08:00
Chuang Zhu
d6f76d2fae
[TRTLLM-9527][feat] change context params and disagg params (step3) (#10495)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2026-01-27 16:34:17 +08:00
ZhichenJiang
fae4985797
[TRTLLM-9831][perf] Use TMA.RED to improve effective memory bandwidth (#10987)
Signed-off-by: zhichen jiang <zhichenj@NVIDIA.com>
2026-01-27 16:15:32 +08:00
Bo Li
6b251cc7fa
[TRTLLM-9390][chore] Add Fake OPs for One-Sided AlltoAll. (#11002)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2026-01-27 15:55:07 +08:00
Lizhi Zhou
93ae8a14ab
[#10889][fix] fix pydantic deepcopy bug (#11004)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2026-01-27 02:40:13 -05:00
Yiqing Yan
ea5d811aec
[None][chore] Bump version to 1.3.0rc2 (#11021)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2026-01-27 15:26:03 +08:00
Tailing Yuan
5553391c5e
[TRTLLM-10560][fix] Fix the time of pause() for overlap scheduler (#10943)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2026-01-27 13:18:34 +08:00
Wanli Jiang
4a206351bb
[TRTLLM-10453][feat] Update mamba decode kernel to flashinfer (#10757)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2026-01-27 13:04:40 +08:00
ameynaik-hub
df8be0c50c
[TRTLLM-10276][feat] Integrate cutedsl argmax kernel (#10476)
Signed-off-by: Amey Naik <212485788+ameynaik-hub@users.noreply.github.com>
Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
Co-authored-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
2026-01-26 22:08:47 -05:00
sunnyqgg
ff0dd6076e
[TRTLLM-10062][feat] Enable MTP for Nemotron Super (#10754)
Signed-off-by: qgai <qgai@nvidia.com>
2026-01-26 11:23:26 -05:00
Lucas Liebenwein
00f341be49
[#8982][feat] AutoDeploy attention dp support (#10728)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2026-01-26 09:43:33 -05:00
Pengyun Lin
ce37e27066
[#10614][fix] gpt_oss first iteration streaming in trtllm-serve (#10808)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2026-01-26 20:53:11 +08:00
Bo Li
e405468230
[TRTLLM-10048][feat] Fuse the AllGather for expert statistics required by the EPLB. (#10885)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2026-01-26 17:59:03 +08:00
Tian Zheng
5efee01da1
[None][feat] Add Skip Softmax MLA kernels for Blackwell and Fix an accuracy bug of NVFP4 KV (#10813)
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
2026-01-26 16:46:33 +08:00
Enwei Zhu
72ef732bcf
[TRTLLM-10147][perf] Balanced random MoE workload generator for CuteDSL kernel UT, autotuner and layerwise benchmark (#10279)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2026-01-25 21:02:30 +08:00
Yanchao Lu
ae58a7ed20
[None][chore] Revert NVIDIA/TensorRT-LLM#10819 (#10870)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-01-25 18:12:21 +08:00
Yanchao Lu
18f63dfcec
[None][chore] Reduce tedious logs (#10819)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-01-25 18:12:21 +08:00
mpikulski
0f7ec033f7
[https://nvbugs/5791242][fix] workaround for flashinfer.sampling.sampling_from_logits (#10713)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-01-25 18:12:21 +08:00
Yukun He
25bdc30162
[https://nvbugs/5782112][fix] Cherry-pick #10633: Fix hanging issue for MNNVL Allreduce under PP (#10750)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-01-25 18:12:21 +08:00
Yuxian Qiu
2b3bb2e9b0
[https://nvbugs/5811697][fix] Fix buffer reuse. (#10716)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-01-25 18:12:21 +08:00
Mike Iovine
f02948d956
[https://nvbugs/5803813][fix] Fix llama 4 min latency (#10724)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-01-25 18:12:21 +08:00
Yao Yao
6f07fa81d7
[TRTLLM-7738][feat] Adding implementation of KVCacheManagerV2 (#10736)
Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>

KVCacheManagerV2 is a new Python-based implementation of the KV cache manager, featuring a cleaner API, better abstraction, and better code quality without the accumulated legacy.
2026-01-24 04:48:39 -05:00
Yuxian Qiu
9fcc93ea7b
[https://nvbugs/5829097][fix] Re-init TRTLLM sampler to use sample stream in multi-stream cases. (#10918)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2026-01-24 14:04:10 +08:00
Kaiyu Xie
da967d0bd7
[TRTLLM-10334][feat] Support overlap scheduler for disagg ctx instances (#10755)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2026-01-23 22:29:37 -05:00
jthomson04
cf88da7eca
[None][feat] KV Connector Support for MTP (#10932)
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Co-authored-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2026-01-23 18:58:26 -05:00
Taylor Yeonbok Lee
1fbbb1f3cd
[None][feat] AutoDeploy: Enhance memory consumption for MoE fusion transform (#10772)
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
2026-01-23 15:22:54 -08:00
Yan Chunwei
54768f3f2c
[None][chore] refine placement group in ray executor (#10235)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
2026-01-23 19:31:20 +08:00
Leslie Fang
31d04dfa12
[TRTLLM-9108][feat] Add test configurable moe module multi gpu (#10699)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2026-01-23 10:16:58 +08:00
William Zhang
2146c23786
[#9306][refactor] Refactor AutoDeployConfig into LlmArgs (#10613)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2026-01-22 16:02:49 -05:00