mpikulski
37c53425c1
[TRTLLM-10030][chore] improve assert in sampler ( #11475 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-13 21:54:28 +08:00
Gal Hubara-Agam
d0e7ba102e
[ #11455 ][fix] Fallback to triton_ssm for nvfp4 quantization ( #11456 )
...
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
2026-02-13 07:38:37 +02:00
xxi
2565f0f4e4
[TRTLLM-9108][feat] refactor MoE unit tests: add unified ConfigurableMoE test framework ( #11437 )
...
Signed-off-by: xxi <xxi@nvidia.com>
2026-02-13 11:05:38 +08:00
Ludwig Schneider
5130cbd73e
[None][fix] Pre-Allocation for Auto-Tuning NCCL_SYMMETRIC ( #11326 )
...
Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>
2026-02-12 14:31:51 -08:00
Balaram Buddharaju
9c2d23c2e5
[ https://nvbugs/5888410 ][fix] Enable warmup for Helix CP ( #11460 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2026-02-12 14:24:51 -08:00
Yukun He
cb1d8d130f
[TRTLLM-10791][feat] TorchSampler general host time optimization ( #11141 )
...
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2026-02-12 18:05:58 +01:00
Wanli Jiang
421eb9e39c
[None][feat] Optimize NemotronH model with elementwise and nvfp4 fusion ( #11273 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2026-02-12 09:25:31 -05:00
Simeng Liu
12085536df
[TRTLLM-10487][feat] Add user-provided UUID support for multimodal KV cache identification. ( #11075 )
...
Signed-off-by: SimengLiu-nv <simengl@nvidia.com>
2026-02-12 00:48:47 -05:00
William Zhang
ca9537e17c
[TRTLLM-10858][feat] Multi-image support for EPD disagg ( #11264 )
...
* Why?
Prior to this commit, we only supported a single multimodal input for
E/P/D disaggregated serving.
* What?
This commit does a minor refactor of the multimodal embedding handles
that cross process boundaries to enable this.
Existing unit tests are updated accordingly to test this.
The `RequestOutput` has its `mm_embedding_handle` replaced in favor of
`disaggregated_params`, addressing a previous TODO.
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2026-02-11 20:50:00 -08:00
Liao Lanyu
58165d5394
[None][chore] Introducing an abstract WaitingQueue interface to decouple the request scheduling logic from specific queue implementations ( #11330 )
...
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com>
2026-02-12 09:18:24 +08:00
Harris Nover
2d5ebb3fe8
[None][chore] Merge residual+hidden into layer norm at the end of each NemotronH MTP, and remove a % operation ( #11406 )
...
Signed-off-by: Harris Nover <249353502+hnover-nv@users.noreply.github.com>
2026-02-11 12:01:36 -05:00
Robin Kobus
7a103035be
[None][fix] Remove overlap scheduler adjustment for max sequence length in create_py_executor function ( #9229 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2026-02-11 08:46:25 -08:00
Guoming Zhang
c47ff4da43
[None][feat] Remove the hard code for activation type definition in T… ( #11164 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2026-02-11 21:50:45 +08:00
Yihan Wang
e8b860965b
[None][feat] Initial PR for trtllm-gen attention backend ( #10784 )
...
Signed-off-by: Yihan Wang <yihwang@nvidia.com>
2026-02-11 17:16:52 +08:00
Taylor Yeonbok Lee
860054c859
[ #11203 ][feat] AutoDeploy: Refactor node caching and improve engine build time ( #11250 )
...
Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>
2026-02-10 13:35:44 -08:00
mpikulski
411fa9ff87
[TRTLLM-10030][perf] pin host memory and batch sampler setup in beam search ( #11390 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-10 16:48:36 +01:00
Iman Tabrizian
7d992972b2
[TRTLLM-10273][feat] Move MambaCacheManager from Python to C++ ( #10540 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2026-02-10 07:20:56 -08:00
Leslie Fang
d6e49542bd
[ https://nvbugs/5848377 ][fix] fix deepeplowlatency with trtllm moe backend running fp8 DS_R1 ( #11266 )
...
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
Signed-off-by: Leslie Fang <leslief@nvidia.com>
Co-authored-by: Tailing Yuan <yuantailing@gmail.com>
2026-02-10 20:09:00 +08:00
chenfeiz0326
eac56b793e
[ https://nvbugs/5853720 ][fix] Disable cutedsl argmax kernel to fix perf regression ( #11403 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2026-02-10 18:10:38 +08:00
mpikulski
adc0d82500
[ https://nvbugs/5791242 ][chore] remove obsolete code ( #11388 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-10 10:55:29 +01:00
Yuxian Qiu
5f4df89109
[None][feat] Fully non-blocking pipeline parallelism executor loop. ( #10349 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2026-02-10 15:43:28 +08:00
shuyixiong
c3cdc93211
[TRTLLM-9771][feat] Make update_weights compatible with CUDA Graph ( #11267 )
...
Signed-off-by: Shuyi Xiong <219646547+shuyixiong@users.noreply.github.com>
2026-02-10 01:12:49 -05:00
Jonas Li
8b2dc57823
[None][chore] Mass merge commits from release/1.2.0rc6.post1 branch ( #11384 )
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Co-authored-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
2026-02-10 14:00:42 +08:00
Lucas Liebenwein
a2fb5afecf
[ #11032 ][feat] MLA revisited and GLM 4.7 Flash support ( #11324 )
2026-02-09 23:26:51 -05:00
Yuan Tong
4fc3644705
[None][fix] Avoid reserved filename on Windows ( #11382 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2026-02-10 11:22:59 +08:00
Yuxian Qiu
af68c29d3d
[None][chore] Reduce attention module repeated warnings. ( #11335 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2026-02-10 08:58:21 +08:00
Ziyi Xiong
e76b634251
[TRTLLM-10321][feat] Support different KV cache layout for one-model spec dec ( #10502 )
...
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2026-02-10 05:16:02 +08:00
Bala Marimuthu
4a743338c3
[None][infra] AutoDeploy: Dump graph IR after every transform ( #11045 )
...
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
2026-02-09 10:43:44 -08:00
Guiju Zhang
c37531c3f7
[TRTLLM-10669][fix] Fix Eagle3 draft model weight loading for throughput checkpoint ( #11010 )
...
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-09 23:53:40 +08:00
William Zhang
abb8106c01
[ https://nvbugs/5835925 ][fix] Add EPD disagg support for Qwen3 VL MoE ( #10962 )
...
* Why?
Trying to instantiate a `MultimodalEncoder` for a Qwen3 VL MoE model
would fail during weight loading.
* What?
This commit fixes the bug, alongside:
- explicit, intentional support for EPD for Qwen3 VL MoE.
- extends EPD unit tests for Qwen3 VL MoE, albeit with dummy weights.
- unit tests for the weight mapper fixes.
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-09 23:53:40 +08:00
Jin Li
0ead17bb85
[ https://nvbugs/5800646 ][fix] Fix hang issue by avoid exposing UB buf… ( #10842 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-09 23:53:40 +08:00
Stefan Niebler
d50010cd1f
[ https://nvbugs/5769815 ][fix] Fix offset calculation in _are_stop_words when using speculative decoding ( #10854 )
...
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2026-02-09 23:53:40 +08:00
mpikulski
196d94a419
[TRTLLM-10030][perf] avoid syncs in beam search + other improvements ( #11349 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-09 16:13:58 +01:00
Gal Hubara-Agam
2b60cc181c
[ #10780 ][feat] AutoDeploy: Support per-expert scales in FP8 and NVFP4 MoE ( #11322 )
...
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
Signed-off-by: Gal Hubara-Agam <96368689+galagam@users.noreply.github.com>
2026-02-09 10:07:37 -05:00
Robin Kobus
31db399042
[ https://nvbugs/5829097 ][fix] Disaggregated serving: Only send finished context requests to the KV cache transceiver ( #11354 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2026-02-09 17:11:45 +08:00
mpikulski
03b38e9fbf
[TRTLLM-10030][perf] avoid sync in PyTorchModelEngine when using beam search ( #11341 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-07 12:31:11 +08:00
William Zhang
ffc0f54959
[ https://nvbugs/5848756 ][fix] Re-take ownership of mrope tensors in prefill worker ( #11217 )
...
* Why?
Previously, the mrope tensors' IPC handles would just be forwarded from
encode -> prefill -> decode workers. While this is fine for the
prefill worker, it is not for the decode worker, since by the time it
tries to rebuild those tensors, they could have been garbage collected
due to their refcounts reaching zero in the producer (encode) worker.
This could lead to nasty runtime errors when running E/P/D
disaggregated serving.
* What?
This commit fixes this by having the prefill worker take ownership of
those reconstructed tensors, and stand up new copies for the decode
worker.
Closes: NvBug 5848756
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2026-02-06 22:37:42 -05:00
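The lifetime bug described in the commit above can be sketched in miniature. This is illustrative Python only, not TRT-LLM code: the `Tensor` class, `encode_worker`, and the weak-reference "handle" are stand-ins for CUDA IPC handles, showing why a handle that merely references producer-owned data dangles once the producer releases it, while a copy taken by the intermediate worker survives.

```python
import gc
import weakref

class Tensor:
    """Stand-in for a producer-owned mrope tensor."""
    def __init__(self, data):
        self.data = list(data)

def encode_worker():
    # Producer creates the tensor and forwards only a weak "handle",
    # analogous to passing an IPC handle without transferring ownership.
    t = Tensor([1.0, 2.0, 3.0])
    return t, weakref.ref(t)

producer, handle = encode_worker()

# Prefill worker takes ownership: rebuild from the handle and copy
# before forwarding anything downstream to the decode worker.
owned_copy = Tensor(handle().data)

# Producer drops its reference; the original data is collected.
del producer
gc.collect()

assert handle() is None                     # forwarded handle now dangles
assert owned_copy.data == [1.0, 2.0, 3.0]   # the owned copy survives
```

The fix in #11217 follows the same principle: the prefill worker materializes its own copies of the reconstructed tensors, so the decode worker's view no longer depends on the encode worker keeping the originals alive.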
Iman Tabrizian
18e611da77
[ https://nvbugs/5863392 ][fix] fix partial reuse disabled for disagg ( #11247 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2026-02-06 14:23:51 -05:00
Shi Xiaowei
b1268e1b37
[TRTLLM-9527][feat] Modularization of the transceiver for KV manager v2 (step 4) ( #11225 )
...
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2026-02-06 07:15:18 -05:00
Yueh-Ting (eop) Chen
383c5921c2
[ https://nvbugs/5756028 ][fix] Fix VSWA initialization with spec-dec and boundary condition in context input preparation ( #10798 )
...
Signed-off-by: eopXD <yuehtingc@nvidia.com>
2026-02-06 14:28:47 +08:00
Chenghao Zhang
9644f024bd
[None][feat] AutoDeploy: add triton backend for causal conv ( #11124 )
...
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2026-02-05 21:33:00 -08:00
Chenghao Zhang
d160439ef9
[ #11148 ][feat] AutoDeploy: Better structure the custom op ( #11152 )
...
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2026-02-05 21:32:22 -08:00
yifeizhang-c
5521c7b7e7
[TRTLLM-9457][feat] Add cute dsl fp8 gemm for Blackwell ( #10130 )
...
Added FP8 cute dsl gemm and batch gemm.
Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
2026-02-06 09:49:30 +08:00
Chuang Zhu
a9d4927235
[TRTLLM-10752][chore] set default val of max_num_tokens_in_buffer as max_seq_len or max_input_len ( #11082 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2026-02-05 14:54:00 -05:00
Harris Nover
a7494a5ff4
[None][chore] Remove outdated comment in model_engine.py ( #11240 )
...
Signed-off-by: Harris Nover <249353502+hnover-nv@users.noreply.github.com>
2026-02-05 13:54:46 -05:00
jthomson04
d778b26062
[None][fix] Reduce host memory usage during model loading ( #11119 )
...
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
2026-02-05 08:57:40 -08:00
mpikulski
7d235cfb23
[TRTLLM-10030][chore] promote SampleState to TypeVar + typing fixes ( #11281 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-05 16:33:22 +01:00
chenfeiz0326
eae480b713
[ https://nvbugs/5820874 ][fix] Adjust deepgemm tuning buckets to cover a larger num_tokens range ( #11259 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2026-02-05 23:12:38 +08:00
mpikulski
719e82c429
[TRTLLM-10030][perf] beam search (remove GPU sync + fix batching + refactor) ( #11276 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2026-02-05 15:33:51 +01:00
Yuewei Na
0d18b2d7a4
[None][feat] Add priority-based KV cache offload filtering support ( #10751 )
...
Signed-off-by: Yuewei Na <yna@nvidia.com>
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Co-authored-by: Yuewei Na <nv-yna@users.noreply.github.com>
2026-02-05 05:22:56 -05:00