Yuxian Qiu
d6ebcf7c4a
[TRTLLM-6994][feat] FP8 Context MLA integration (Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/6059 from release/1.1.0rc2) ( #7610 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-09-19 09:40:49 +08:00
Ziyi Xiong
420f0fbcf5
[https://nvbugs/5522851][fix] Correct the logic to update kv_lens_cuda ( #7790 )
...
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-19 08:11:29 +08:00
sunnyqgg
80dd8fe197
[TRTLLM-6746][feat] Enable two-model spec dec for MTP Eagle ( #7001 )
...
Signed-off-by: qgai <qgai@nvidia.com>
2025-09-18 12:05:36 -04:00
Li Min
d921fc3352
[TRTLLM-6898][feat] Add swapab, tileN64, cga sync support for cute dsl nvfp4 gemm ( #7764 )
...
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-09-18 21:20:04 +08:00
bhsueh_NV
c65457db8a
[None][fix] Revert "Revert "[None][feat] support attention dp for qwen3 dense model"" ( #7780 )
...
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-09-18 20:11:05 +08:00
Wanli Jiang
fe104dc20d
[TRTLLM-7918][feat] Support kvcache reuse and chunk prefill for phi4mm ( #7723 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-09-18 17:37:16 +08:00
Stefan Niebler
a55251bf75
[None][fix] Add TP information in weight scale loading in WeightOnlyQuantLinearMethod ( #7732 )
...
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-09-18 10:30:50 +02:00
Wanli Jiang
a7ca0fff54
[TRTLLM-6577][feat] Support nano_v2_vlm in pytorch backend ( #7207 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-09-18 16:26:20 +08:00
Leslie Fang
870cfcf9a0
[None][chore] Remove executor config in create_py_executor ( #7599 )
...
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-09-18 14:24:58 +08:00
mpikulski
1c7f601265
[https://nvbugs/5508890][fix] gen. result cleanup when using PostprocWorker ( #7771 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-09-18 14:01:18 +08:00
Li Min
14e455da3e
[None][fix] Fix CI issue for dsl pkg install ( #7784 )
...
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-09-18 13:58:20 +08:00
Barry Kang
4f0e6b5f96
[None][feat] Cherry-pick DeepGEMM related commits from release/1.1.0rc2 ( #7716 )
...
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-09-18 13:51:48 +08:00
Ziyi Xiong
28469dbf27
[https://nvbugs/5523080][fix] Correct the batch index in device tensors ( #7803 )
...
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-18 13:45:37 +08:00
Guoming Zhang
e0423bfaab
[https://nvbugs/5519544][fix] fix invalid expression for disabling pa… ( #7806 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-09-18 12:54:52 +08:00
Yanchao Lu
f8e811d134
[None][chore] Version bump for 1.1.0rc6 ( #7824 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-18 11:13:56 +08:00
Yukun He
cd80e0a7f1
[None][fix] Make tile_tokens_dim calculation just in time before kernel launching. ( #7529 )
...
tile_tokens_dim depends directly on num_tokens, which is a dynamic shape during both tuning and inference. When the AutoTuner prepares dummy tensors with different num_tokens, it does not update tile_tokens_dim automatically, so the value stored in the AutoTuner cache is misaligned with the actual input. This causes many cache misses during inference and hurts performance significantly.
To avoid this, we move the calculation of tile_tokens_dim to just before kernel launch, so that it is always up to date with the num_tokens of the current input tensor used by the kernel runner.
In addition, tile_tokens_dim is now calculated from the token count of the tuned bucket rather than from the raw input token count. Since values are only tuned per bucket, not per raw token count, this avoids unexpected misalignment between tile_tokens_dim and the token count.
This PR also removes the warmup requests with extra input shapes that were triggered during the CUDA graph warmup phase.
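The bucketing idea above can be sketched in a few lines. This is a minimal illustration, not the actual TensorRT-LLM implementation: the bucket list, the tile-size formula, and all names (TUNED_BUCKETS, next_power_of_2, round_up_to_bucket, tile_tokens_dim) are hypothetical stand-ins chosen to show why deriving the tile size from the bucketed token count keeps it consistent with what the tuner cached.

```python
# Hypothetical sketch of just-in-time tile_tokens_dim calculation.
# The buckets and the tile formula are illustrative assumptions only.

TUNED_BUCKETS = [8, 16, 32, 64, 128, 256]  # token counts the tuner actually profiled


def next_power_of_2(x: int) -> int:
    """Smallest power of two >= x."""
    return 1 if x <= 1 else 1 << (x - 1).bit_length()


def round_up_to_bucket(num_tokens: int) -> int:
    """Map a raw token count to the tuned bucket it falls into."""
    for bucket in TUNED_BUCKETS:
        if num_tokens <= bucket:
            return bucket
    return TUNED_BUCKETS[-1]


def tile_tokens_dim(num_tokens: int) -> int:
    """Derive the tile size from the *bucketed* token count, so it always
    matches the configuration the tuner cached for that bucket."""
    bucket = round_up_to_bucket(num_tokens)
    return min(max(next_power_of_2(bucket // 4), 8), 64)


def launch_kernel(input_tokens):
    # Computed just-in-time at launch from the current input, never read
    # from a tuning-time cache keyed on a different num_tokens.
    tile = tile_tokens_dim(len(input_tokens))
    ...  # pass `tile` to the kernel runner
```

Because `tile_tokens_dim` is a pure function of the bucket, every request that lands in the same bucket sees the same tile size as the tuning run for that bucket, eliminating the cache misses described above.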
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-09-18 10:58:52 +08:00
Lucas Liebenwein
39eb120b96
[#7308][feat] AutoDeploy: graph-less transformers mode for HF ( #7635 )
...
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
2025-09-18 10:44:24 +08:00
Netanel Haber
a5cfc8368f
[https://nvbugs/5508536][fix] Revert #7041: Move stop_criteria to sample_async ( #7041 ) ( #7796 )
...
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: Mike Iovine <miovine@nvidia.com>
2025-09-17 21:27:01 -04:00
William Zhang
2614d71994
[TRTLLM-7410][feat] Enable KV cache reuse and chunked prefill for mistral3.1 ( #7628 )
...
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-09-17 08:11:16 -07:00
Zhenhuan Chen
6983e8a00d
[https://nvbugs/5517260][fix] move scaffolding contrib module's import to subdirectory ( #7758 )
...
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-09-17 11:36:33 +08:00
Kaiyu Xie
62042a9733
[TRTLLM-6741][feat] enable LM tp for MTP, under attention dp case (cherry-pick #7128 ) ( #7571 )
...
Signed-off-by: Cheng Hang <chang@nvidia.com>
Co-authored-by: Cheng Hang <chang@nvidia.com>
2025-09-17 09:41:32 +08:00
Yukun He
6313c9799c
[https://nvbugs/5488582][fix] Cherry-pick 7495: Avoid unexpected Triton recompilation in DG fused_moe ( #7708 )
...
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-09-17 09:00:28 +08:00
Shiyu Li
8bdbb48264
[https://nvbugs/5489015][fix] Support communicator split in MNNVL allreduce and fix the binding issues. ( #7387 )
...
Signed-off-by: Shiyu Li <shili@nvidia.com>
2025-09-17 07:43:20 +08:00
HuiGao-NV
a49cfb3e68
[https://nvbugs/5516666][fix] cherrypick fix to the CUDA graph warmup issue when using speculative decoding ( #7737 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-17 06:24:20 +08:00
Aurelien Chartier
471723bce1
[None][chore] Remove unused get_quant_scales methods ( #7687 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-09-16 12:56:11 -07:00
Lucas Liebenwein
9befd1a72f
[None][chore] AutoDeploy: neat disablement of transforms in pipeline ( #7736 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-09-16 23:31:48 +08:00
bhsueh_NV
8226ef23dc
Revert "[None][feat] support attention dp for qwen3 dense model" ( #7765 )
2025-09-16 19:09:04 +08:00
Kaiyu Xie
6eef19297f
[None][chore] cherry pick changes on slurm scripts from release/1.1.0rc2 ( #7750 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-09-16 16:07:13 +08:00
Li Min
b278d06481
[TRTLLM-6898][feat] Add Cute DSL nvfp4 linear op ( #7632 )
...
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-09-16 14:25:26 +08:00
Bo Li
3f4e160cba
[None][chore] Fix error when running trtllm-bench without cuda graph. ( #7725 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-09-15 20:30:23 -07:00
Void
103b554734
[None][fix] Ensure that the W4A8 custom input scale remains aligned across all ranks ( #7614 )
...
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-09-16 11:04:26 +08:00
Yanchao Lu
e5cead1eb9
[TRTLLM-6295][test] Exit as early as possible and propagate exit status correctly for multi-node testing ( #7739 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-16 09:59:18 +08:00
xiweny
c076a02b38
[TRTLLM-4629][feat] Add support of CUDA13 and sm103 devices ( #7568 )
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
Signed-off-by: Daniel Stokes <dastokes@nvidia.com>
Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
Signed-off-by: Xiwen Yu <xiweny@nvidia.com>
Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: xiweny <13230610+VALLIS-NERIA@users.noreply.github.com>
Co-authored-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
Co-authored-by: Daniel Stokes <dastokes@nvidia.com>
Co-authored-by: Zhanrui Sun <zhanruis@nvidia.com>
Co-authored-by: Jiagan Cheng <jiaganc@nvidia.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-09-16 09:56:18 +08:00
Necofish
96f11b10ae
[None][feat] support attention dp for qwen3 dense model ( #7618 )
...
Signed-off-by: Nekofish-L <liuxiangyang@mail.ustc.edu.cn>
2025-09-16 09:33:22 +08:00
Ziyi Xiong
536e8776cd
[TRTLLM-6668][feat] Enable overlap scheduler for two-model spec decoding ( #7651 )
...
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-16 07:33:44 +08:00
Izzy Putterman
8097be7e9c
[None][feat] Eagle, use last hidden post norm ( #7546 )
...
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-09-15 12:23:57 -04:00
jmydurant
7deefb3d2b
[TRTLLM-7192][feat] optimize MLA chunked prefill && support fp8 mla chunked prefill ( #7477 )
...
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-09-15 21:43:49 +08:00
Zheng Duan
24fc1f9acf
[None][fix] using arrival time in llmapi when creating LlmRequest in pytorch workflow ( #7553 )
...
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-09-15 07:26:01 -04:00
Wanli Jiang
e080294725
[TRTLLM-7918][feat] Revert "Support kvcache reuse for phi4mm ( #7563 )" ( #7722 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-09-15 17:19:44 +08:00
Wanli Jiang
fc9f4c9295
[TRTLLM-7918][feat] Support kvcache reuse for phi4mm ( #7563 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-09-15 15:47:00 +08:00
DylanChen-NV
d5df0af017
[https://nvbugs/5467981][fix] Fix Qwen2.5-VL fails with cuda graph padding ( #7122 )
...
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-09-15 15:02:34 +08:00
Chang Liu
47e37755a3
[TRTLLM-6903][feat] Support chunked prefill for multimodal models ( #6843 )
...
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-09-14 20:10:10 -07:00
Pengyun Lin
c2bc39af63
[TRTLLM-1302][feat] Topk logprobs for TRT backend and top1 logprob for PyT backend ( #6097 )
...
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-09-12 15:32:34 +08:00
Chang Liu
3a9847eb84
[https://nvbugs/5498165][fix] fix permission error for config file lock ( #7656 )
...
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
2025-09-11 10:36:51 +08:00
Dom Brown
fc9d426589
[https://nvbugs/5505402][fix] Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues ( #7616 )
...
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-09-10 18:30:48 +01:00
Leslie Fang
d219a4f225
[None][chore] remove executor config in kv cache creator ( #7526 )
...
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-09-10 21:14:44 +08:00
Yiqing Yan
76c5e1a12f
[None][infra] Bump version to 1.1.0rc5 ( #7668 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-09-10 16:06:54 +08:00
Kanghwan
758c22f832
[#7208][fix] Fix config type of MedusaConfig ( #7320 )
...
Signed-off-by: Kanghwan Jang <861393+karljang@users.noreply.github.com>
2025-09-09 23:25:17 -07:00
Frida Hou
bbb5ae3349
[#5861][autodeploy] Refactor: Quantization Transforms with Inheritance ( #7227 )
...
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-09-10 13:00:06 +08:00
Zheyu Fu
c353ff342e
[None][feat] Make the should_use_spec_decode logic a bit smarter ( #7112 )
...
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
2025-09-10 12:53:59 +08:00