TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-02-16 15:55:08 +08:00

Author	SHA1	Message	Date
William Zhang	ffc0f54959	[https://nvbugs/5848756 ][fix] Re-take ownership of mrope tensors in prefill worker (#11217 ) * Why? Previously, the mrope tensors' IPC handles would just be forwarded from encode -> prefill -> decode workers. While this is fine for the prefill worker, it is not for the decode worker, since by the time it tries to rebuild those tensors, they could have been garbage collected due to their refcounts reaching zero in the producer (encode) worker. This could lead to nasty runtime errors when running E/P/D disaggregated serving. * What? This commit fixes this by having the prefill worker take ownership of those reconstructed tensors, and stand up new copies for the decode worker. Closes: NvBug 5848756 Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>	2026-02-06 22:37:42 -05:00
Iman Tabrizian	18e611da77	[https://nvbugs/5863392 ][fix] fix partial reuse disabled for disagg (#11247 ) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2026-02-06 14:23:51 -05:00
Shi Xiaowei	b1268e1b37	[TRTLLM-9527][feat] Modularization of the transceiver for KV manager v2 (step 4) (#11225 ) Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>	2026-02-06 07:15:18 -05:00
Yueh-Ting (eop) Chen	383c5921c2	[https://nvbugs/5756028 ][fix] Fix VSWA initialization with spec-dec and boundary condition in context input preparation (#10798 ) Signed-off-by: eopXD <yuehtingc@nvidia.com>	2026-02-06 14:28:47 +08:00
Chenghao Zhang	9644f024bd	[None][feat] AutoDeploy: add triton backend for causal conv (#11124 ) Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>	2026-02-05 21:33:00 -08:00
Chenghao Zhang	d160439ef9	[#11148 ][feat] AutoDeploy: Better structure the custom op (#11152 ) Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>	2026-02-05 21:32:22 -08:00
yifeizhang-c	5521c7b7e7	[TRTLLM-9457][feat] Add cute dsl fp8 gemm for Blackwell (#10130 ) Added FP8 cute dsl gemm and batch gemm. Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>	2026-02-06 09:49:30 +08:00
Chuang Zhu	a9d4927235	[TRTLLM-10752][chore] set default val of max_num_tokens_in_buffer as max_seq_len or max_input_len (#11082 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2026-02-05 14:54:00 -05:00
Harris Nover	a7494a5ff4	[None][chore] Remove outdated comment in model_engine.py (#11240 ) Signed-off-by: Harris Nover <249353502+hnover-nv@users.noreply.github.com>	2026-02-05 13:54:46 -05:00
jthomson04	d778b26062	[None][fix] Reduce host memory usage during model loading (#11119 ) Signed-off-by: jthomson04 <jwillthomson19@gmail.com>	2026-02-05 08:57:40 -08:00
mpikulski	7d235cfb23	[TRTLLM-10030][chore] promote SampleState to TypeVar + typing fixes (#11281 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2026-02-05 16:33:22 +01:00
chenfeiz0326	eae480b713	[https://nvbugs/5820874 ][fix] Adjust deepgemm tuning buckets to cover larger num_tokens's scope (#11259 ) Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>	2026-02-05 23:12:38 +08:00
mpikulski	719e82c429	[TRTLLM-10030][perf] beam search (remove GPU sync + fix batching + refactor) (#11276 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2026-02-05 15:33:51 +01:00
Yuewei Na	0d18b2d7a4	[None][feat] Add priority-based KV cache offload filtering support (#10751 ) Signed-off-by: Yuewei Na <yna@nvidia.com> Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com> Co-authored-by: Yuewei Na <nv-yna@users.noreply.github.com>	2026-02-05 05:22:56 -05:00
Chang Su	9601b17459	[#11037 ][fix] Fix proto-to-SamplingParams conversion bugs and add gRPC tests (#11292 ) Signed-off-by: Chang Su <chang.s.su@oracle.com>	2026-02-05 05:00:29 -05:00
Yao Yao	d9b936be94	[None][feat] Enhance support for complex models (#11254 ) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>	2026-02-05 17:28:26 +08:00
xxi	4c1d9d0c10	[None][chore] Pass without_comm to cutlass and deepgemm (#11229 ) Signed-off-by: xxi <xxi@nvidia.com>	2026-02-05 02:07:59 -05:00
Yi Zhang	ada463d15d	[None][fix] Fix comments for kv cache manager v2 (#11207 ) Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>	2026-02-04 23:31:29 -05:00
dongfengy	0bd4630cd1	[https://nvbugs/5854860 ][fix] Fix cutedsl argmax on sm120 (#11181 ) Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>	2026-02-04 17:15:31 -05:00
Grzegorz Kwasniewski	d90a8e5700	[TRTLLM-10673][feat] Improved layer classification for sharding (#10718 ) Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>	2026-02-04 18:06:10 +01:00
Lucas Liebenwein	925d911fc0	[#10966 ][feat] AutoDeploy: kv cache manager integration [2/2] (#11149 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2026-02-04 09:44:27 -05:00
Yueh-Ting (eop) Chen	f6fff18142	[https://nvbugs/5624818 ][fix] Work around accuracy issue by enforcing paged_context_fmha on Hopper for fmha_v2 (#11192 ) Signed-off-by: eopXD <yuehtingc@nvidia.com>	2026-02-04 19:21:50 +08:00
mpikulski	f0ca62b175	[None][fix] make health_generate work with beam search (#11097 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2026-02-04 09:46:19 +01:00
xxi	02b80bfd58	[TRTLLM-9111][feat] provide the uniform test framework to test all MoE backends (#11128 ) Signed-off-by: xxi <xxi@nvidia.com>	2026-02-04 15:57:56 +08:00
Gal Hubara-Agam	de6931bbfd	[None][fix] Fix selective_state_update perf regression for T=1 decode path (#11194 ) Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>	2026-02-04 09:01:34 +02:00
tburt-nv	588db0ed64	[None][chore] bump version to 1.3.0rc3 (#11238 ) Signed-off-by: Tyler Burt <tburt@nvidia.com>	2026-02-04 09:30:45 +08:00
Dmitry Barsukoff	5d522295e9	[None][fix] Set continuous_usage_stats default to False to follow OpenAI protocol (#10644 ) Signed-off-by: Dmitry Barsukoff <riZZZhik@gmail.com> Co-authored-by: Kanghwan <861393+karljang@users.noreply.github.com>	2026-02-03 16:04:54 -08:00
Taylor Yeonbok Lee	f9e6045f39	[#11086 ][feat] Optimize Auto Deploy weight loading by preloading weights to CPU (#11059 ) Signed-off-by: Taylor Yeonbok Lee <249374542+taylor-yb-lee@users.noreply.github.com>	2026-02-03 13:23:10 -08:00
Lizhi Zhou	f9c4bdf6cf	[TRTLLM-8921][feat] implement gen-first disagg_service (#11020 ) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>	2026-02-03 15:46:11 -05:00
Chenjie Luo	2532eb5adc	[None][fix] Align kv_scales with modelopt HF checkpoint (#10745 ) Signed-off-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>	2026-02-03 08:03:42 -05:00
gramnarayan	585fbb2734	[#10826 ][feat] AutoDeploy: Eagle One-Model [2/n]: Prefill-Only Implementation (#11073 ) Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>	2026-02-02 09:51:10 -08:00
Izzy Putterman	3ef8a4639b	[None][feat] Nemotron H: Eagle3 support (#11131 ) Signed-off-by: Izzy Putterman <iputterman@nvidia.com>	2026-02-02 10:26:25 -05:00
Rundong Li	f1b85fea4c	[None][feat] Integrate cuda.tile RMS norm kernels (#9725 ) Signed-off-by: Rundong (David) Li <davidli@nvidia.com> Co-authored-by: Jinman Xie <jinmanx@nvidia.com> Co-authored-by: Alexey Bylinkin <abylinkin@nvidia.com> Co-authored-by: Qiqi Xiao <qiqix@nvidia.com> Co-authored-by: Biao Wang <biaow@nvidia.com> Co-authored-by: Thomas Schmid <thschmid@nvidia.com>	2026-02-02 19:44:27 +08:00
Mike Iovine	13b0ab9c0e	[None][fix] Fix MTP 1-model sampler (#10369 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2026-02-02 16:26:46 +08:00
Mike Iovine	d9aef94431	[https://nvbugs/5814914 ][fix] Fix llama sm120 spec dec (#10765 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2026-02-02 16:26:46 +08:00
Zheyu Fu	d31482686c	[https://nvbugs/5680911 ][fix] Remove @cache decorator to enhance CI stability for unit tests using single process mode (#10730 ) Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2026-02-02 16:26:46 +08:00
Enwei Zhu	ccdd8461ac	[None][fix] Always reset drafting states for GuidedDecoder (#10899 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2026-02-02 16:26:46 +08:00
Michal Guzek	fafc22e3d4	[https://nvbugs/5691730 ][fix] Have LoRa bf16 ckpts work with Llama 3.3-70B-fp8 (#9808 ) Signed-off-by: Michal Guzek <mguzek@nvidia.com> Signed-off-by: Michal Guzek <moraxu@users.noreply.github.com> Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2026-02-02 16:26:46 +08:00
William Zhang	bc2487bc2c	[https://nvbugs/5826962 ][fix] Fix PD disaggregation for VLMs that use mrope (#10865 ) * Why? Commit `a6a8898` enabled EPD disaggregation for VLMs that use mrope (e.g. qwen). However, this broke PD disaggregation for these same models. * What? This commit fixes this, and adds a unit test that guards against it. Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2026-02-02 16:26:46 +08:00
Yi Zhang	0306c0f12c	[TRTLLM-9766][feat] Integration of the KVCacheManager V2 to TRTLLM Runtime (#10659 ) Signed-off-by: yizhang-nv <187001205+yizhang-nv@users.noreply.github.com>	2026-02-02 14:29:02 +08:00
Liao Lanyu	fef0e4b17d	[TRTLLM-10666][chore] Refactor request fetching logic for better separation of concerns (#10988 ) Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com> Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com> Signed-off-by: Liao Lanyu <108499334+lancelly@users.noreply.github.com> Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com>	2026-02-02 10:36:08 +08:00
Lizhi Zhou	b00e8338ec	[https://nvbugs/5834212 ][fix] prevent routing ctx and gen requests to the same worker; update doc for unique disagg ID (#11095 ) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>	2026-02-02 09:54:33 +08:00
Dmitry Barsukoff	ea49afdf0b	[None][fix] AttributeError with return_perf_metrics on tensorrt backend (#10662 ) Signed-off-by: Dmitry Barsukoff <riZZZhik@gmail.com> Co-authored-by: Kanghwan <861393+karljang@users.noreply.github.com>	2026-02-02 08:41:15 +08:00
shuyixiong	278ced972b	[TRTLLM-9771][feat] Allow overriding quantization configs (#11062 ) Signed-off-by: shuyixiong <219646547+shuyixiong@users.noreply.github.com>	2026-01-31 10:48:51 -05:00
Frida Hou	7910d4d2a9	[#8242 ][feat] Add int4 GPTQ support for AutoDeploy (#8248 ) Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>	2026-01-30 23:07:24 -08:00
Guoming Zhang	6bace84167	[TRTLLM-10398][feat] Enable TRTLLM moe backend for Nemotron Super (#10791 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2026-01-31 13:48:25 +08:00
Balaram Buddharaju	531f85dc9b	[None][feat] Perfect routing for Deepseek models (#11127 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2026-01-30 23:46:35 -05:00
Karthik	5a97374f3c	[#9525 ][feat] add L2 norm pattern matcher and fusion transform (#10767 ) Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>	2026-01-30 16:05:53 -05:00
nvyocox	4af47208d8	[None][feat] Export ONNX for DriveOS LLM (#10117 ) Signed-off-by: yocox <yocox@nvidia.com>	2026-01-30 15:43:11 -05:00
Yao Yao	53cb762ee5	[None][feat] New KVCacheManagerV2 APIs for Transceiver (#11003 ) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>	2026-01-30 18:09:53 +08:00

2156 Commits