TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Shunkangz	bddf183e15	[None][feat] Add Request specific exception (#6931 ) Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2025-09-04 18:43:42 -04:00
Chang Liu	08a0e06621	[TRTLLM-7410][feat] Support hashing and KV cache reuse for videos (#7360 ) Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com> Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>	2025-09-04 14:39:23 -04:00
sychen52	98a1bffb7c	[OMNIML-2336][feat] Add NVFP4 x FP8 (#6809 ) Signed-off-by: Shiyang Chen <shiychen@nvidia.com>	2025-09-04 09:03:38 -07:00
Enwei Zhu	1745102e72	[TRTLLM-7027][feat] Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec (#7481 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-09-04 23:30:14 +08:00
Izzy Putterman	26b133f3a7	[None][feat] MultiLayer Eagle (#7234 ) Signed-off-by: Izzy Putterman <iputterman@nvidia.com>	2025-09-04 10:49:13 -04:00
Wanli Jiang	4e3dded64d	[TRTLLM-6308][feat] Support Aggregate mode for phi4-mm (#7521 ) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>	2025-09-04 20:16:10 +08:00
kris1025	cce9556858	[https://nvbugs/5485886 ][fix] Fix resource free of Eagle3ResourceManager (#7437 ) Signed-off-by: linquanh <linquanh@nvidia.com>	2025-09-04 17:38:13 +08:00
jianweiwu	7090b286b2	[None][fix] fix hunyuan_moe init bug (#7502 ) Signed-off-by: sorenwu <sorenwu@tencent.com>	2025-09-04 03:06:00 -04:00
Grzegorz Kwasniewski	3755f8ab7d	[TRTLLM-6342][fix] Fixed triggering BMM sharding (#7389 ) Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>	2025-09-04 02:01:27 -04:00
William Zhang	a117e7a57e	[TRTLLM-7442][model] Remove unnecessary D2H copies (#7273 ) * Why? Initial profiling showed there were multiple D2H / H2D copies being scheduled in the mistral 3.1 small model. * What? This commit removes those unnecessary copies by returning `image_sizes` as a simple list instead of a tensor. Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>	2025-09-03 23:14:20 -04:00
Jin Li	2a2dfe273b	[https://nvbugs/5485102 ][fix] Correctly set stride for piecewise outp… (#7442 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-09-04 10:48:15 +08:00
Frida Hou	51a2b8729e	[#7222 ][autodeploy] Separate run_shape_prop as another graph utility (#7313 ) Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>	2025-09-03 19:32:50 -04:00
Leslie Fang	bd9ba97d89	[None][chore] Remove two unused parameters in create_py_executor (#7458 ) Signed-off-by: leslie-fang25 <leslief@nvidia.com>	2025-09-04 07:31:31 +08:00
Enwei Zhu	5ff3a65b23	[TRTLLM-7028][feat] Enable guided decoding with speculative decoding (part 2: one-model engine) (#6948 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-09-03 15:16:11 -07:00
Mike Iovine	64e3bfa054	[None][fix] Fix KV cache recompute in draft_target spec decode (#7348 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-09-03 15:04:14 -04:00
Anurag Mukkara	ae5136831f	[https://nvbugs/5472947 ][fix] wait on isend handles before reusing buffers (#7462 ) Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>	2025-09-03 13:20:02 +05:30
YueWeng	9a4f60687f	[https://nvbugs/5480289 ][fix] release slot manager in mtp MTPHiddenStatesManager (#7340 ) Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>	2025-09-02 19:37:51 -07:00
Jinyang Yuan	572551b586	[None][perf] Autotune TRT-LLM Gen MoE when using CUDA graphs (#7285 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>	2025-09-03 10:08:59 +08:00
Leslie Fang	42697ea32a	[None][chore] rm executor config in kv cache connector (#7372 ) Signed-off-by: leslie-fang25 <leslief@nvidia.com>	2025-09-03 08:13:13 +08:00
tomeras91	9c8d2161d0	[None][doc] fix example in docstring (#7410 ) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>	2025-09-02 11:59:49 +03:00
Leslie Fang	e81c50dbd2	[None][chore] Use llm args in create_py_executor (#7239 ) Signed-off-by: leslie-fang25 <leslief@nvidia.com>	2025-09-01 16:27:55 -07:00
Mike Iovine	b3c57a7042	[TRTLLM-7353][feat] Implement capturable drafting loops for speculation (#7100 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-09-01 14:37:44 -04:00
QI JUN	ed4087a295	[https://nvbugs/5374016 ][fix] improve error message (#6893 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2025-09-01 11:02:31 +08:00
Aurelien Chartier	93e623b455	[https://nvbugs/5449155 ][fix] Fix DeepSeek R1 weight loading for TP16 (#6913 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2025-09-01 11:02:31 +08:00
Liao Lanyu	704fca4178	[TRTLLM-6835][fix] Fix potential hang caused by python multiprocessing when prefetching weights (#6927 ) Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2025-09-01 11:02:31 +08:00
Mike Iovine	de55763f13	[https://nvbugs/5455836 ][fix] Fix llama 4 FP4 (#6911 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2025-09-01 11:02:31 +08:00
brb-nv	0253036a4e	[None][chore] Add docs for Gemma3 VLMs (#6880 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2025-09-01 11:02:31 +08:00
Yukun He	e106045fda	[None][fix] Complete the last missing allreduce op in Llama3/4. (#6850 ) The allreduce op of the last decoder layer is missing in some circumstances for the models Llama3 and Llama4. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2025-09-01 11:02:31 +08:00
Anurag Mukkara	b821883b25	[None][fix] Revert phi4-mm aggregate mode (#6907 ) Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2025-09-01 11:02:31 +08:00
2ez4bz	cf0c47ca2d	[None][fix] Fix batching bug in Mistral3 model (#6841 ) Prior to this commit, if multiple requests with images were in the same batch, the batching logic for the images would fail. This commit fixes it, and adds unit tests for it that were verified to fail prior to the fix. Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2025-09-01 11:02:31 +08:00
2ez4bz	2480aedb73	[TRTLLM-5252][feat] Add fp8 support for Mistral Small 3.1 (#6731 ) This commit adds some level of FP8 support to Mistral Small 3.1 by: * disabling quantization for the vision sub-model since `modelopt` does not support quantizing it (yet). * extending existing accuracy tests to use a modelopt produced FP8 checkpoint. Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2025-09-01 11:02:31 +08:00
Tian Zheng	e257cb3533	[None][feat] Support NVFP4 KV Cache (#6244 ) Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>	2025-09-01 09:24:52 +08:00
Zongfei Jing	a7ed26dd8b	[TRTLLM-6747][feat] Merge add sparse exp and shared exp into local reduction (#7369 ) Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-08-31 21:20:00 -04:00
Fanrong Li	37a1bd810f	[https://nvbugs/5481385 ][fix] Fix max_seq_len in cuda graph warmup and intermediate_size in fused_moe_deepgemm (#7345 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>	2025-08-29 17:00:43 +08:00
Chang Liu	31b0f0fb0c	[https://nvbugs/5445466 ][fix] Eliminate race when loading HF dynamic modules (#7268 ) Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>	2025-08-29 12:36:30 +08:00
Richard Huo	ce580ce4f5	[None][feat] KV Cache Connector API (#7228 ) Signed-off-by: jthomson04 <jwillthomson19@gmail.com> Signed-off-by: richardhuo-nv <rihuo@nvidia.com> Co-authored-by: jthomson04 <jwillthomson19@gmail.com> Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com> Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>	2025-08-28 23:09:27 -04:00
Shiyu Li	b093d94d34	[https://nvbugs/5445466 ][fix] Bypass MLP TP split for MNNVL in DeepSeek V3 to avoid hanging. (#6886 ) Signed-off-by: Shiyu Li <shili@nvidia.com>	2025-08-28 15:17:48 -07:00
dongfengy	367ff88a5e	[None][feat] Refactor llama4 for multimodal encoder IFB (#6844 ) Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>	2025-08-28 13:22:19 -07:00
Nikita Korobov	a419b77fb5	[None][fix] mxfp4 padding bug for TRT-LLM and CUTLASS MoE backends (#7214 ) Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>	2025-08-28 10:08:05 -07:00
Zongfei Jing	53163bf1df	[TRTLLM-6876][feat] Add low precision all2all for mnnvl (#7155 ) Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-08-28 18:26:16 +08:00
Mike Iovine	8b216135f0	[None][refactor] Move draft token padding out of Drafter (#7134 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-08-27 11:07:50 +02:00
Yukun He	bed5bc9f2e	[None][chore] Wrap the swiglu into custom op to avoid redundant device copy. (#7021 ) A redundant D2D copy is observed when enabling torch.compile for the Llama model due to the swiglu triton kernel, which brings perf overhead. Use a custom op to wrap the swiglu op to avoid this overhead. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-08-27 13:02:10 +08:00
Shunkangz	ff4047414b	[None][opt] Balance the request based on number of tokens in AttentionDP (#7183 ) Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2025-08-27 11:16:12 +08:00
Fanrong Li	e12868bc00	[None][fix] Remove and fuse some element-wise ops in the ds-r1-fp8 model (#7238 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-08-27 10:35:38 +08:00
Jin Li	028235404b	[TRTLLM-6633][feat] Padding for piecewise cudagraph (#6750 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-08-26 18:31:33 -04:00
Fridah-nv	0f947c64cb	[None][doc] Update autodeploy README.md, deprecate lm_eval in examples folder (#7233 ) Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>	2025-08-26 10:47:57 -07:00
Void	040f4c70d3	[None][perf] Accelerate global scale calculations for deepEP fp4 combine (#7126 ) Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>	2025-08-27 00:13:13 +08:00
qixiang-99	b165f8bc97	fix/improve kvcache allocation in PyTorch runtime (#5933 ) Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>	2025-08-26 12:40:22 +08:00
William Zhang	92576488d3	[None][feat] Skip prefetching consolidated safetensors when appropriate (#7013 ) * Why? Some models (e.g. anything produced by Mistral) can have both sharded safetensors and a consolidated safetensor in the same checkpoint directory. In such cases, prefetching both to memory is a waste of time and memory. * What? This commit skips over consolidated safetensors when they are not the only safetensor file present in the checkpoint directory. Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>	2025-08-25 23:56:21 -04:00
Grzegorz Kwasniewski	2101d46d68	[TRTLLM-6342][feat] TP Sharding read from the model config (#6972 ) Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com> Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>	2025-08-25 15:41:27 -07:00
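
Note on commit 92576488d3 (#7013): the selection rule described in that message can be sketched roughly as below. This is a minimal illustration, not the code from the commit. The helper name `select_safetensors_to_prefetch` is hypothetical, and so is the assumption that consolidated files can be recognized by the substring "consolidated" in the file name.

```python
from pathlib import Path


def select_safetensors_to_prefetch(checkpoint_dir: str) -> list[Path]:
    """Return the safetensors files worth prefetching from checkpoint_dir.

    Hypothetical helper for illustration; not the TensorRT-LLM implementation.
    """
    files = sorted(Path(checkpoint_dir).glob("*.safetensors"))
    # Assumption: consolidated files are identifiable by their file name.
    others = [f for f in files if "consolidated" not in f.name]
    # Skip consolidated files only when other safetensors files exist.
    # If consolidated files are the only ones present, keep them.
    return others if others else files
```

With a directory holding both sharded files and a consolidated file, the sketch returns only the sharded files. With a directory holding only a consolidated file, it returns that file unchanged.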

768 Commits