TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
ChristinaZ	a608b00d38	Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519 ) Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>	2025-06-27 20:17:40 +08:00
Daniel Stokes	83a1f60556	feat: Expose bias and FP8_MXFP4 MOE CUTLASS backend features to pytorch (#5410 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-06-27 12:29:34 +08:00
Tailing Yuan	ef43b95aa1	Fix execute_process: check results using EQUAL (#5481 )	2025-06-27 11:57:04 +08:00
Anthony Chang	de7cd0de05	fix: MoE autotune fallback failed to query default heuristic (#5520 ) Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>	2025-06-26 17:28:48 +01:00
jmydurant	8836990bde	[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) (#5475 ) Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>	2025-06-26 22:18:08 +08:00
Robin Kobus	8dfa31c71d	refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead (#5384 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-26 19:45:52 +08:00
Bo Li	1bab9000a6	perf: Optimize swizzle_sf, unswizzle_sf, reswizzle_sf (#5318 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-06-26 14:03:56 +08:00
Alessio Netti	7e681fbe52	[chore] Allow configuring linking of NVRTC wrapper (#5189 ) Signed-off-by: Alessio Netti <netti.alessio@gmail.com> Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-26 07:26:10 +02:00
dongxuy04	490d2e5819	feat: large-scale EP(part 8: Online EP load balancer integration for PCIe fp8) (#5226 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-06-25 22:25:13 -07:00
Daniel Stokes	942841417e	opensource: Opensource MOE MXFP8-MXFP4 implementation (#5222 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-06-26 12:18:19 +08:00
qsang-nv	e9cd810071	keep sm90 headsize 128 cubins (#5320 ) Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>	2025-06-26 12:14:01 +08:00
ChristinaZ	d135f5993d	Add unit test for routing kernels (#5405 ) Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>	2025-06-26 09:49:11 +08:00
jmydurant	578dbc8d9a	feat: chunked prefill for MLA (Blackwell) (#4651 ) Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>	2025-06-26 09:01:00 +08:00
Perkz Zheng	1f292ff2a0	[https://jirasw.nvidia.com/browse/TRTLLM-4645 ] support multiCtasKvMode for high-throughput MLA kernels (#5426 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-06-25 16:31:10 +08:00
Enwei Zhu	fc7a81ceb0	test: Add LLGuidance test and refine guided decoding (#5348 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-25 14:12:56 +08:00
Robin Kobus	e2a8cbc80b	refactor: manage cache indirection in decoder state (#5315 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-24 09:15:59 +02:00
Robin Kobus	b3045c44b9	refactor: remove TrtGptModelOptionalParams (#5165 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-20 10:31:40 +02:00
dongxuy04	4f0f17ac8a	feat: Misc Opt for large scale EP (#5374 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-06-20 13:11:31 +08:00
Fanrong Li	5d4ab47d5b	fix: refactor and fix mtp vanilla (#4762 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-20 05:23:39 +08:00
Kaiyu Xie	113f6fbadd	Fix: missing clientId when serialize and deserialize response (#5231 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-06-19 23:05:11 +08:00
Fanrong Li	c7af650d5a	Fix: fix the deterministic issue in the MTP Eagle path (#5285 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-19 18:08:40 +08:00
yunruis	b3e886074e	Fix CI build time increase (#5337 ) Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>	2025-06-19 13:49:42 +08:00
jellysnack	0623ffe3bc	feat: Add LLGuidance Support for PyTorch Backend (#5214 ) Signed-off-by: jellysnack <oleg.jellysnack@gmail.com> Signed-off-by: jellysnack <158609015+jellysnack@users.noreply.github.com> Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-18 19:33:34 +08:00
Bo Li	d76bda7f2c	chore: Refine printed info of CHECK_TYPE. (#5295 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-06-18 15:35:41 +08:00
Yukun He	6711ad9cf3	[TRTLLM-5589] feat: Minor optimizations for tunable FP8 batched GEMM op. (#5139 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-06-18 14:33:46 +08:00
Robin Kobus	627062c265	refactor: Update decoder buffer and logits management (#4450 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-18 08:10:32 +08:00
qsang-nv	5236bb9084	delete cubins (#5274 ) Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>	2025-06-17 22:10:49 +08:00
QI JUN	f899c4d294	Re-implement LlmResponse in Python to reduce host overhead of pybind (#5224 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-17 21:28:09 +08:00
Dom Brown	44fb3c1673	[TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner (#5207 ) - Adds a new Python custom op (fp8_block_scale_moe_runner) and a FP8BlockScaleMoERunner class for autotuning. - Updates C++ MoE and batched GEMM kernels to accept a configIndex for workspace sizing and execution. - Extends the unit test to run both autotuned and non-autotuned code paths. Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-06-17 21:01:56 +08:00
amirkl94	8451a87742	chore: Mass integration of release/0.20 (#5082 ) Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com> Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com> Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Co-authored-by: Erin <14718778+hchings@users.noreply.github.com> Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com> Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com> Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>	2025-06-17 14:32:02 +03:00
liji-nv	13eef642e6	[feat] Piecewise cuda graph support for MLA (#4467 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-06-17 18:58:38 +08:00
Robin Kobus	dc3861b4aa	refactor: Unify decoder test with e2e workflow (#5239 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-17 12:04:58 +02:00
qsang-nv	faca19c2f0	update setup.py for special cases (#5227 ) Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>	2025-06-17 16:41:07 +08:00
qsang-nv	134cb66a53	fix mla test (#5240 ) Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>	2025-06-17 15:26:25 +08:00
Enwei Zhu	4b82b8b4c7	[TRTLLM-5330] perf: Optimize MoE supplementary kernels for large-scale EP (#5215 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-17 15:23:24 +08:00
Tracin	a2e8ae1120	Update internal cutlass commit. (#5228 ) Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>	2025-06-17 10:47:45 +08:00
Robin Kobus	b6ca677741	refactor: remove decoder request from decoder interface (#5129 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-16 09:12:30 +02:00
Anthony Chang	4f9fa9f21d	feat: MoE trtllm backend kernel update (#5183 ) Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>	2025-06-16 14:46:13 +08:00
Chuang Zhu	1d2b0d3d80	use file lock to avoid port conflict (#5123 )	2025-06-16 14:15:37 +08:00
Robin Kobus	dda64166cd	refactor: Scheduling based on KV cache state (#4865 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-16 08:14:58 +02:00
Tracin	ef3fdc8051	feat: Add w4a8_mxfp4_fp8 quantization recipe. (#4867 ) Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>	2025-06-16 11:30:57 +08:00
qsang-nv	5a01ba5260	use cu for fmha_v2 (#4694 ) Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>	2025-06-15 18:40:44 +08:00
Aurelien Chartier	1389f5a4d3	feat: Add support for fp8 rowwise quantization (#4876 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com> Co-authored-by: aikitoria <151776613+aikitoria@users.noreply.github.com>	2025-06-14 06:37:48 -07:00
Robin Kobus	443b2eb51f	refactor: Speculative decoding buffers (#5091 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-14 11:39:32 +02:00
yunruis	b99c5ce8c1	Feat/ds r1 min latency opt round3, add router gemm, fused a gemm, PDL (#4560 ) Signed-off-by: yunruis <yunruis@nvidia.com> Signed-off-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com> Signed-off-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com> Co-authored-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com>	2025-06-14 17:36:22 +08:00
dongxuy04	97657bfda2	optimize memset before alltoall communication (#5188 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-06-14 10:49:47 +08:00
Perkz Zheng	3d87770e15	[https://nvbugspro.nvidia.com/bug/5295470 ] support headDim 256 for blackwell fmha kernels (#5164 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-06-13 23:01:01 +08:00
Chuang Zhu	8e9937081d	ucxx only use ucp_feature_tag to avoid some issues on some platforms (#4994 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-06-13 19:14:25 +08:00
yunruis	e5be3a95b3	fix: fix license bug (#5200 ) Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>	2025-06-13 18:58:15 +08:00
yunruis	30c5b4183a	refactoring: port customized kernels with public cutlass version (#5027 ) Signed-off-by: yunruis Merge this to unblock others since the full CI has been run through	2025-06-13 16:19:31 +08:00
Yao Yao	12e075eb70	[nvbug 5333996 ][fix] Unload XQA cubins early to avoid static lifetime (#5133 ) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>	2025-06-13 15:53:29 +08:00
Matthias Jouanneaux	514baf1287	[fix] Fix comment to pass guardwords check (#5191 ) Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>	2025-06-13 15:49:59 +08:00
zhhuang-nv	a891013e3c	[feat] Optimize KV Cache Reuse for MLA (#4869 ) Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>	2025-06-13 11:03:05 +08:00
Matthias Jouanneaux	a0b6c635b1	[feat] trtllmGen MoE routing: added support for top groups and top K bounds (#4063 ) Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com> Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com> Co-authored-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>	2025-06-13 06:00:02 +08:00
Xiaodong (Vincent) Huang	cc2a1344be	None: fix OOM because of unnecessary mha workspace (#5056 ) Signed-off-by: Vincent Huang <vincenth@nvidia.com>	2025-06-12 21:56:05 +02:00
liji-nv	10ab9791ec	[fix] Do not reuse dummy request KVCache (#4804 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-06-12 15:24:50 +08:00
Netanel Haber	e692779ead	Solve underallocation in VSWA+/VGQA (#4667 ) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>	2025-06-12 12:12:46 +08:00
HuiGao-NV	43192379af	Use backend to replace macro to control enablement of MNNVL all reduce (#4635 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-06-12 11:22:49 +08:00
Zheng Duan	ee44fa00f8	chore: rename IOFormatter to BaseCacheFormatter (#5068 ) Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>	2025-06-12 10:50:14 +08:00
Bo Li	1b79041f5d	fix: XQA is not enabled when history_length < kMinHistoryTokensPerBlock. (#4264 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-06-11 09:38:10 +08:00
Tracin	6c91f1c7ac	Mxfp8xmxfp4 quant mode(#4978 ) Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-10 22:01:37 +08:00
Zongfei Jing	6d1f2d0fd7	[TRTLLM-3927] [feat] Finalize + Allreduce + add + rmsnorm fusion (#4756 ) Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-06-10 19:55:16 +08:00
Aurelien Chartier	dcf72c6ad3	chore: cleanup GDS Cmake interface (#4928 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-06-10 17:25:43 +08:00
dongxuy04	7137cc8f67	fix cuda driver link issue with driver version less than 12.3 (#5025 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-06-10 15:27:39 +08:00
pcastonguay	87c56ab024	perf: Removing initializing ptuning buffers to zero (#4915 ) Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>	2025-06-09 21:57:21 -04:00
Daniel Cámpora	d68b8180d3	feat: port MakeDecodingBatchInputOutput to python in TRTLLMSampler (#4828 ) Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-06-10 07:28:34 +08:00
Chang Liu	f70815c945	[TRTLLM-5007][feat] Add multimodal hashing support (image hashing) (#4145 ) Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com> Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>	2025-06-10 01:59:56 +08:00
Dom Brown	9c012d5bf8	[TRTLLM-5589] feat: Integrate TRT-LLM Gen FP8 Batched GEMM with Pytorch workflow kernel autotuner (#4872 ) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-06-09 11:02:48 +01:00
liji-nv	1d4f748773	[fix] Fix illegal mem access and possible accuracy loss. Cherry-pick … (#5017 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-06-09 17:50:57 +08:00
ChristinaZ	f45aff2b7d	Add customized renormalized moe routing kernel for moe cutlass backend (#4955 ) Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>	2025-06-09 17:38:50 +08:00
Chuang Zhu	9a874760c1	Kv cache transfer support duplicate heads (#4929 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-06-09 14:11:19 +08:00
Chuang Zhu	947571c311	Fix buffer count (#5007 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-06-09 14:01:13 +08:00
Daniel Stokes	3a4851b7c3	feat: Add Mixture of Experts FP8xMXFP4 support (#4750 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-06-09 13:25:04 +08:00
Omer Ullman Argov	8731f5f14f	chore: Mass integration of release/0.20 (#4898 ) Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Signed-off-by: Hui Gao <huig@nvidia.com> Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com> Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com> Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com> Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com> Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com> Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> Signed-off-by: moraxu <mguzek@nvidia.com> Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Co-authored-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Co-authored-by: HuiGao-NV <huig@nvidia.com> Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com> Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Co-authored-by: Bo Li <22713281+bobboli@users.noreply.github.com> Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com> Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com> Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com> Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> Co-authored-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com> Co-authored-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com> Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com> Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com> Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>	2025-06-08 23:26:26 +08:00
dongxuy04	1e369658f1	feat: large-scale EP(part 6: Online EP load balancer integration for GB200 nvfp4) (#4818 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>	2025-06-08 10:25:18 +08:00
Jinyang Yuan	20d0649f19	[feat] Support XQA-based MLA on SM120 (#4858 ) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com> Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com> Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com> Co-authored-by: Yao Yao <lowsfer@users.noreply.github.com> Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>	2025-06-06 22:32:49 +08:00
Anthony Chang	eeb555e37b	chore: memoize weight shuffle index to speed up weight preproc in moe_backend=TRTLLM (#4826 ) Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>	2025-06-06 16:13:54 +08:00
dongjiyingdjy	51652b9b2b	feat : add PositionEmbeddingType=0 to xqa support (#4934 ) Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>	2025-06-05 21:50:42 +08:00
Shiyu Li	b0d287c9b7	[TRTLLM-4647][fix] Fix the no fusion allreduce hanging (#4594 ) Signed-off-by: Shiyu Li <shili@nvidia.com>	2025-06-04 18:26:13 -07:00
Omer Ullman Argov	e71de2a13e	chore: Mass integration of release/0.20. (#4871 ) Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com> Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>	2025-06-04 14:12:27 +08:00
Zheng Duan	ded694b1aa	feat: cache reuse support (selective cache transfer) in mla cache formatter (#4749 ) Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>	2025-06-04 09:56:31 +08:00
ChristinaZ	d64af85e8c	Replace memset with data initialization within kernels (#4851 ) Signed-off-by: Christina Zhang <christinaz@nvidia.com>	2025-06-04 08:56:46 +08:00
Perkz Zheng	a089aa3225	[https://nvbugspro.nvidia.com/bug/5300080 ] Fix the bug of setting attention_chunk_size and enable chunked-attention in the generation-phase by default (#4693 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-06-03 19:02:57 -04:00
Nikita Korobov	8043d7a03c	feat: update DeepSeek FP8 TRT-LLM Gen cubins (#4643 ) Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>	2025-06-03 14:07:54 -07:00
Robin Kobus	3de02582dd	refactor: Separate DecoderState from GptDecoderBatched (#4700 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-03 09:42:01 +02:00
Robin Kobus	b9263a8e10	fix: max_num_sequences calculation with overlap scheduling (#4532 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> Co-authored-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-06-03 09:31:22 +02:00
Tian Zheng	9832787050	[feat] Enable NVFP4 output for TRTLLM attention kernels (#4737 ) Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>	2025-06-03 10:00:17 +08:00
Yilin Fan	90aab0596e	[fix] Fix Llama4 guardwords failures (#4844 ) Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>	2025-06-02 13:43:42 -07:00
Enwei Zhu	5b4852b7b5	feat: large-scale EP(part 5: Static EP load balancer with offline statistics) (#4695 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-02 01:25:02 +08:00
Netanel Haber	2ce05c3ab4	'entered copyBlock' format string expects %s, pass string rather than int (#4820 ) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>	2025-06-01 08:54:33 -07:00
tomeras91	bf9cd11fd4	[TRTLLM-4783][feat] Mamba2 kernel updates for Nemotron-H (#4494 ) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>	2025-06-01 13:56:44 +03:00
Daniel Cámpora	69c7fe8905	[TRTLLM-4987][feat] Partial support of context logits in TRTLLMSampler (#4538 ) Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-06-01 03:32:43 +08:00
Enwei Zhu	25dde49c28	fix: EP load balancer with MTP layer and route offset by EP rank (#4767 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-01 00:07:44 +08:00
Chuang Zhu	f117d6abe9	Fabric Memory for KV Cache Transfer (#4717 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-05-30 15:50:21 +08:00
Thor Johnsen	55d56f8155	[JIRA-5226219][fix] Fix Bug in KV cache manager (#4596 ) Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>	2025-05-29 22:03:20 -07:00
Jinyang Yuan	5339d367ce	[perf] Reduce the workspace size of FP4 activation scales for MoE (#4303 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>	2025-05-30 09:03:52 +08:00
Yilin Fan	31bb650298	Cherry pick feat/llama4 to main (#4739 ) Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com> Co-authored-by: Chenfei Zhang <chenfeiz@nvidia.com>	2025-05-30 05:28:40 +08:00
Robin Kobus	79a94a28f9	refactor: unique_ptr instead of shared_ptr (#4697 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-29 22:49:35 +02:00
Jhao-Ting Chen	fcadce9f8d	[fix] Eagle-2 LLMAPI pybind argument fix. (#3967 ) Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-05-29 12:23:25 -07:00
Arthur Rasmusson	812b1abf86	feature: KV Cache GPUDirect Storage (#3209 ) Signed-off-by: Arthur Rasmusson <47877520+arthurrasmusson@users.noreply.github.com.> Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Co-authored-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-05-28 23:27:43 +00:00
Robin Kobus	12763779c4	chore: Clean up cpp runtime (#4449 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-28 16:32:59 +02:00
ixlmar	fbe4db207d	feat: forward exceptions to Python and catch OOMs (#4497 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-05-28 11:58:10 +02:00
Kaiyu Xie	b800adc65c	Fix: hang on disagg when MNNVL two-shot AllReduce is enabled (#4678 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-05-28 13:03:53 +08:00
yunruis	29ac4c20e0	fix: fix dsr1 min lat cga ar rate drop(0.2) (#4561 ) Signed-off-by: yunruis <yunruis@nvidia.com>	2025-05-27 21:59:57 +08:00
Robin Kobus	93a54457ac	[nvbugs/5274894] fix: Sort requests for functional correctness and performance (adapted from #4608 ) (#4621 ) - Moved sorting related logic to a dedicated function for better clarity and maintainability. - Enhanced sorting logic to separate finished context requests from ongoing ones before sorting by Lora task ID. - Updated function documentation to reflect the sorting behavior and its purpose. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-26 17:10:55 +08:00
Robin Kobus	502758aaa9	fix: Handle additional model outputs based on pipeline parallel rank (#4498 ) - Only allocate additional outputs on last pipeline parallel rank in trtGptModelInflightBatching and executorImpl. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-26 09:04:40 +02:00
Perkz Zheng	4d711be8f4	Feat: add sliding-window-attention generation-phase kernels on Blackwell (#4564 ) * move cubins to LFS Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * update cubins Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * add sliding-window-attention generation-phase kernels on Blackwell Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * address comments Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> --------- Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-05-26 09:06:33 +08:00
shaharmor98	2b8f6d2871	Fix snake case format (#4559 ) fix snake case format Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>	2025-05-25 17:57:17 +08:00
Chuang Zhu	b60846b47d	fix datatype check (#4606 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-05-24 08:36:17 +08:00
Yao Yao	ef763b0ddc	fix: rename some terms (#4534 ) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>	2025-05-23 23:23:49 +08:00
Robin Kobus	7b2818a47b	refactor: CreateNewDecoderRequests (#4452 ) * refactor: CreateNewDecoderRequests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Consolidate request generation in CreateNewDecoderRequests - Removed the GenerateRequestOptions class and integrated its functionality into CreateNewDecoderRequests. - Updated the constructor of CreateNewDecoderRequests to accept parameters for speculative decoding and normalization options. - Modified the operator() method to handle request generation directly, improving code organization and reducing redundancy. - Cleaned up associated includes and references throughout the codebase. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Simplify request handling in CreateNewDecoderRequests - Removed the generateRequestOptions method and integrated its logic directly into the operator() method. - Updated the request generation process to improve clarity and reduce redundancy. - Adjusted the return type to streamline the handling of batch slots, decoder requests, and sampling configurations. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Enhance createDecoderRequests method in CreateNewDecoderRequests - Updated the createDecoderRequests method to include additional parameters for decoder state and CUDA streams, improving flexibility in request handling. - Removed redundant request generation logic from the operator() method, streamlining the process. - Adjusted the newRequest method to utilize the updated decoder request structure, enhancing clarity and maintainability. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Use MedusaBuffers instead of RuntimeBuffers in CreateNewDecoderRequests - Updated references from RuntimeBuffers to MedusaBuffers across the CreateNewDecoderRequests class and its methods, enhancing clarity in buffer management. 
- Adjusted method signatures and internal logic to accommodate the new MedusaBuffers type, ensuring compatibility with existing functionality. - Cleaned up unnecessary includes and improved code organization for better maintainability. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update CreateNewDecoderRequests to use DecoderState and CudaStream parameters - Modified method signatures in CreateNewDecoderRequests to replace GptDecoderBatched with runtime::decoder::DecoderState and added a separate CudaStream for the decoder. - Adjusted the implementation of the operator() method to accommodate the new parameters, enhancing flexibility in request handling. - Updated associated bindings in the pybind11 interface to reflect the changes in method signatures, ensuring consistency across the codebase. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update TRTLLMSampler to use refactored create_new_decoder_requests - Updated the sampler.py to reflect changes in the request handling logic, replacing generate_request_options with create_new_decoder_requests for improved clarity and consistency. - Updated bindings and method signatures for decoder stream handling. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update gptDecoderBatchedTest to use CreateNewDecoderRequests::newRequest Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-23 22:54:37 +08:00
zhhuang-nv	8452775db8	[TRTLLM-5070][feat] Support FP8 KV Cache Reuse for MLA (#4535 ) * optimize kv cache reuse workflow for MLA write kv cache first and only call up-projection GEMM once relax contiguous requirements of k/v for setting paged kv cache return two contiguous tensors when loading MLA KV Cache Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * support fp8 kv cache for MLA kv cache reuse Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * resolve comments Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> --------- Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>	2025-05-23 19:47:50 +08:00
Anthony Chang	bbea2647b1	Qwen3 supports TRTLLM FP4 MoE backend (#4530 ) * MoE TRTLLM backend for Qwen3 Signed-off-by: Anthony Chang <anchengc@nvidia.com> * add extra moe_backend to test Signed-off-by: Anthony Chang <anchengc@nvidia.com> * address comments Signed-off-by: Anthony Chang <anchengc@nvidia.com> * conditionally compile kernels on newer archs Signed-off-by: Anthony Chang <anchengc@nvidia.com> * missing positional arg Signed-off-by: Anthony Chang <anchengc@nvidia.com> * Update the routing kernels Signed-off-by: Christina Zhang <christinaz@nvidia.com> * Revise usage of TLLM_LOG_ERROR Signed-off-by: Christina Zhang <christinaz@nvidia.com> * Add unit test for Qwen3 moe (trtllm_gen backend) Signed-off-by: Christina Zhang <christinaz@nvidia.com> * improve weight processing speed of moe_backend=TRTLLM; roughly 2x Signed-off-by: Anthony Chang <anchengc@nvidia.com> * tidy and minor fix Signed-off-by: Anthony Chang <anchengc@nvidia.com> * temporarily disable accuracy test that has known issue Signed-off-by: Anthony Chang <anchengc@nvidia.com> --------- Signed-off-by: Anthony Chang <anchengc@nvidia.com> Signed-off-by: Christina Zhang <christinaz@nvidia.com> Co-authored-by: Christina Zhang <christinaz@nvidia.com>	2025-05-23 18:31:08 +08:00
Bo Li	9ae705af1b	perf: Add fused q_norm/k_norm/RoPE for Qwen3. (#4482 ) * Add Julien's origina kernel. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Get rid of UpdateKVCache functionality. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Add kernels. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Add torch OP. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Update cmake. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Torch OP must use double as argument dtype. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Add unittest. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Add unittest. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Fix misaligned access when head_dim=64. In this case, numElemsPerThread=2, numVecPerThread=0. But the store code incorrectly perform vectorized store, some threads (e.g., lane1) issue store to address that is not aligned to 64 bit. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Remove unroll (compiler can do that). Cleanup code. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Add switch for interleave. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Refactor vectorized load/store. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Implement is_neox. Result not correct yet. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Fix is_neox=True. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Add q_weight and k_weight. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> --------- Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-05-23 15:31:04 +08:00
CarstyYou	ef280e687e	[feat] support fp8 blockscale gemm on sm89 (#4481 ) * [feat] integrate ada blockwise gemm Signed-off-by: CarstyYou <xiy@nvidia.com> * [fix] align scale M Signed-off-by: CarstyYou <xiy@nvidia.com> * [feat] swizzle mma output Signed-off-by: CarstyYou <xiy@nvidia.com> * [test] add ut for sm89 Signed-off-by: CarstyYou <xiy@nvidia.com> * [delete] remove useless comments Signed-off-by: CarstyYou <xiy@nvidia.com> * [chore] codestyle Signed-off-by: CarstyYou <xiy@nvidia.com> * [fix] fix review comments Signed-off-by: CarstyYou <xiy@nvidia.com> * [chore] fix license Signed-off-by: CarstyYou <xiy@nvidia.com> * [chore] fix license Signed-off-by: CarstyYou <xiy@nvidia.com> --------- Signed-off-by: CarstyYou <xiy@nvidia.com> Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>	2025-05-23 10:39:10 +08:00
nv-guomingz	e3a534d0ee	chore: guardword clean for header file. (#4540 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-05-23 10:08:14 +08:00
pcastonguay	d7d455e7ea	[feat][TRTLLM-5018] Dis serving python runtime trt backend (#4243 ) * feat: Enabling dis serving with TRT backend with Python runtime Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Fixing formatting Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Fixing disagg mtp test Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> --------- Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>	2025-05-22 22:01:06 -04:00
dongxuy04	338744fba6	fix[nvbug-5295425]: [TRTLLM-5385] fix race condition in MoeLoadBalancer (#4573 ) fix moe possible race cond and add bypass worker thread for no updates Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-05-23 09:24:23 +08:00
nv-guomingz	3549b68c1c	chroe:clean useless flag (#4567 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-05-23 07:05:15 +08:00
Mike Iovine	9c0de251db	[feat] Integrate Hopper chunked attention kernels (#4330 ) * Integrate chunked attention kernels Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> * Fix cache key Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> * Fix lint Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> --------- Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-05-22 17:10:57 -04:00
Chuang Zhu	558eaecf16	fix sequence data race (#4565 ) stash for debug broken promise Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-05-22 23:13:48 +08:00
Chuang Zhu	44cfd757b2	Agent interface impl for NIXL (#4125 ) * agentConnection Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> recv Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> agentState Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> NIXL interfaces Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> update cmakelists Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> nixl improve Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> remove cppzmq Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> fix Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> transferAgent remove register Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> work for cache Test Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> reduce sleep time Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> fix test Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> intergarte Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> nixl env Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> fix rebase error Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> cpp test Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> stash for send metaData Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> loadRemoteMD after fetchRemoteMD Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> workaround for mixed gen and context Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> test_env Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> avoid port conflict in test Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * format Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * use std::string Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * typo Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * fix transferAgentTest Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> --------- Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-05-22 09:09:41 +08:00
Nikita Korobov	e1b42be3d1	fix: TRT-LLM Gen dtype declaration (#4503 ) Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>	2025-05-21 23:56:37 +02:00
Zongfei Jing	dbaddb3a29	Adding two-shot allreduce kernel and mnnvl multicasting buffer (#4216 ) * Adding two-shot allreduce kernel and mnnvl multicasting buffergit gffe Signed-off-by: Shiyu Li <shili@nvidia.com> Adding comments Signed-off-by: Shiyu Li <shili@nvidia.com> Add unittest of the twoshot kernel. Signed-off-by: Shiyu Li <shili@nvidia.com> Update dispatch logic Signed-off-by: Shiyu Li <shili@nvidia.com> Use cpu barrier instead of GPU at init Signed-off-by: Shiyu Li <shili@nvidia.com> Merge dispatch logic fix Signed-off-by: Shiyu Li <shili@nvidia.com> Update the kernel to use GPU-managed buffer Signed-off-by: Shiyu Li <shili@nvidia.com> * Refine Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Clean code Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Fix compile error Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Fix issue Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Clean up Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Simplify AllReduce interface Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Rename Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Fix warning Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Tidy code Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Rename Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Fix compile error Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Refine Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Skip ut for no_fusion Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Refine Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> --------- Signed-off-by: Shiyu Li <shili@nvidia.com> Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> Co-authored-by: Shiyu Li <shili@nvidia.com>	2025-05-22 03:42:36 +08:00
Ruoqian Guo	db7446fda7	Feat: add deep_gemm swapab Kernel (#4430 ) * feat: add deepgemm_swapab feat: add fp8_gemm_kernel_swapab Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com> feat: set threshold for deepgemm and deepgemmswapab Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com> * docs: update README.md Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com> * fix: std::runtime_error needs #include <stdexcept> Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com> * chores: remove the redundant code Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com> * feat: support for dense deep_gemm swapab Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com> * chores: remove redundant code Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com> --------- Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com> Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>	2025-05-21 10:48:43 +08:00
Thor Johnsen	5d438be59a	[TRTLLM-5000][feat] Pytorch implementation of ngram drafter (#3936 ) * v1.5 Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> v1.5.4 Add back draft_overhead to spec dec stats Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com> * v1.5.5: fix CI error Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * v1.6: fix CI error 8196 > 8192 Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * Address reviewer concerns Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com> * Address reviewer concerns Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com> * precommit run Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com> * v2.0: Address reviewer concerns Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * v2.1: add fix from wili Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * Revert changes that require use of TypeAlias because that requires python version >= 3.10 Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com> --------- Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com> Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>	2025-05-21 10:40:00 +08:00
Perkz Zheng	426f6fd2bc	Feat: add chunked-attention kernels on Blackwell (#4394 ) * update cubins Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * add chunked-attention kernels on blackwell Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> fix Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> --------- Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-05-21 10:16:46 +08:00
djns99	a030a898d1	perf: Fuse gemm setup function for SM90/SM100 MOE plugin path (#4146 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-05-21 10:00:36 +08:00
Robin Kobus	8564c5a41f	refactor: Unify request order in TRT and PyTorch workflow (#4096 ) * chore: Partition context requests in MicroBatchScheduler Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! chore: Partition context requests in MicroBatchScheduler Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-20 18:49:27 +02:00
dongxuy04	21aff2e313	feat: large-scale EP(part 2: MoE Load Balancer - core utilities) (#4384 ) * first commit of cpp moe loadbalance code Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add python bindings for moe load balance Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add python wrapper, ut and bug fixes Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add binding for layerId and update binding test Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add host tensor sharing and ut Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> --------- Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-05-20 17:53:48 +08:00
kanghui0204	6f3922f318	feat: Low Precision Allreduce for PCIe based GPU (#4344 ) This PR adds a customized allreduce to TensorRT-LLM. The new allreduce is used for communication on PCIe-based GPUs via low-precision quantization, which can accelerate the PCIe allreduce process. Signed-off-by: Hui Kang <hkang@nvidia.com> Co-authored-by: Hui Kang <hkang@nvidia.com>	2025-05-20 06:53:46 +08:00
Yuxian Qiu	c8e062bfd3	fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. (#4399 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com> Co-authored-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-05-19 14:25:36 -07:00
Perkz Zheng	1c5b0d6a13	[Feat] add chunked-attention kernels on Hopper (for llama4) (#4291 ) * update cubins Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * add mtp for fmha_v2 MLA kernels and add chunked-attention support for hopper fmha kernels Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> --------- Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>	2025-05-19 09:57:10 -07:00
Faraz	7656af1b57	[TRTLLM-4618][feat] Fix cutlass MoE GEMM fallback failure on FP8 + add e2e test for Mixtral 8x7B FP8 on RTX6000 Pro (SM120) (#4335 ) * add mixtral7x8b fp8 test with fixed cutlass fp8 moe gemm Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> * update cutlass versions Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> * added internal cutlass with fix and docker update Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> * added mixtral to pro 6000 Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> --------- Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>	2025-05-19 08:56:21 -07:00
liji-nv	58e405624a	[https://nvbugs/5123103 ][fix] Fix torch compile for DeepSeekV3 (#3952 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-05-19 22:12:25 +08:00
Shi Xiaowei	df2798e0c3	feat: NIXL interface integration (#3934 ) NIXL interfaces Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>	2025-05-19 18:18:22 +08:00
Void	62bb7f9286	fix potential issues in allreduce fusion kernel and ut (#4226 ) fix allreduce fuison kernels and ut Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com> --------- Co-authored-by: AIDC-AI <AIDC-AIB@365fanyi.com>	2025-05-19 17:38:29 +08:00
Jinyang Yuan	b618e1f55b	perf: Eliminate the need for attention DP padding when possible (#3439 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com> Co-authored-by: raccoonliukai <raccoonliu@tencent.com>	2025-05-17 13:30:55 +08:00
Robin Kobus	4e370a509a	refactor: Copy sequence lengths once in decoder setup (#4102 ) * refactor: Copy sequence lengths once in decoder setup Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update DecoderInputBuffers to remove duplicated buffers - Renamed and reorganized buffer variables in decoderBuffers.h and decoderBuffers.cpp for better readability. - Adjusted references in generateRequestOptions.cpp to align with the new buffer structure. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Move getEmbeddingBias to anonymous namespace Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Filter context requests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: GenerateRequestOptions using more fine-grained functions - Added a new method `createDecoderRequests` to encapsulate the logic for creating decoder requests from finished context requests. - Updated the `operator()` method to utilize the new method, improving code clarity and maintainability. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update TRTLLMDecoder - Updated the `generate_request_options` call. - Updated the `make_decoding_batch_input_output` call. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Remove const where we modify input buffers - Changed `DecoderInputBuffers` parameters from const references to non-const references in multiple functions to allow modifications. - Updated related function calls to ensure compatibility with the new parameter types. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! refactor: Copy sequence lengths once in decoder setup Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-16 22:03:55 +08:00
Nikita Korobov	fa3879629e	feat: TRT-LLM Gen integration for BMM and MoE refactoring (#4280 ) - Adds BatchedGemm cubins and the respective call interface from TensorRT-LLM Generator. - Refactors TRT-LLM Gen MoE runner to call to BMM interface - The accuracy is verified for DeepSeek R1 FP4 Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>	2025-05-16 13:31:53 +02:00
ixlmar	f7ad49bb9b	chore: improve log-level setting UX (#4352 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-05-16 09:47:44 +01:00
NVJiangShao	6cc3f2093a	Fix bias shape in weightOnlyGroupwiseQuantMatmulPlugin for TRT workflow (#4348 ) Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com> Co-authored-by: AIDC-AI <AIDC-AIB@365fanyi.com>	2025-05-16 10:02:30 +08:00
Erin	c44cf34373	fix: update checks that broke medusa tests when use_py_session=True (#4339 ) fix check Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>	2025-05-15 15:47:28 -07:00
yuxianq	4f8afe4cc6	feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism (#4034 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-16 04:16:53 +08:00
yuxianq	0e87fcc228	refactor: use x is None instead of x == None. (#4244 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-15 20:00:04 +08:00
zhhuang-nv	97bc680cd8	feat: support kv cache reuse for MLA (#3571 ) * support kv cache reuse for MLA load compressed_kv and k_pe and do up-projection use 192/128 head size MLA context kernel support Blackwell and Hopper now Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * add CI test Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix: set k_pe head_num to 1 for kernel 2 and kernel 2V2 Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com> * resolve comments Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * use GPTJ style RoPE for MLA Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix rebase error and some docs Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix kv_lens Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * tiny fix Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix torch compile Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix: use normal device memory instead of pinned memory for unit test Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com> * fix L0 tests Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix torch compile after rebase Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * resolve comments Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * resolve comments again Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> --------- Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com> Signed-off-by: zhhuang-nv <145532724+zhhuang-nv@users.noreply.github.com> Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>	2025-05-15 15:22:21 +08:00
Zhanrui Sun	5dc3b539ba	infra: Down the gcc toolset version from 13 to 11 (#4114 ) * Down the gcc toolset version from 13 to 11 Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> * Update rocky8 images Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> --------- Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-05-15 11:08:51 +08:00
qsang-nv	0fd59d64ab	infra: open source fmha v2 kernels (#4185 ) * add fmha repo Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * fix format Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * fix code style Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * fix header Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * fix header kernel_traits.h Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * add .gitignore file Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * add SLIDING_WINDOW_ATTENTION Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * fix style Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * fix format Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * update setup.py Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * update build_wheel.py Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> --------- Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> Signed-off-by: qsang-nv <200703406+qsang-nv@users.noreply.github.com>	2025-05-15 10:56:34 +08:00
QI JUN	498ce8a056	Revert "feat: Low Precision Allreduce for PCIe based GPU" (#4340 ) Revert "feat: Low Precision Allreduce for PCIe based GPU (#3851)" This reverts commit `5e634dd1bd`.	2025-05-15 09:52:39 +08:00
hlu1	7fb0af9320	[fix] Remove stale cublas heuristics (#4326 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com> Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>	2025-05-14 17:35:51 -07:00

479 Commits