TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
jmydurant	578dbc8d9a	feat: chunked prefill for MLA (Blackwell) (#4651 ) Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>	2025-06-26 09:01:00 +08:00
Perkz Zheng	1f292ff2a0	[https://jirasw.nvidia.com/browse/TRTLLM-4645 ] support mutliCtasKvMode for high-throughput MLA kernels (#5426 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-06-25 16:31:10 +08:00
dongxuy04	4f0f17ac8a	feat: Misc Opt for large scale EP (#5374 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-06-20 13:11:31 +08:00
Fanrong Li	5d4ab47d5b	fix: refactor and fix mtp vanilla (#4762 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-20 05:23:39 +08:00
yunruis	b3e886074e	Fix CI build time increase (#5337 ) Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>	2025-06-19 13:49:42 +08:00
qsang-nv	5236bb9084	delete cubins (#5274 ) Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>	2025-06-17 22:10:49 +08:00
Dom Brown	44fb3c1673	[TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner (#5207 ) - Adds a new Python custom op (fp8_block_scale_moe_runner) and a FP8BlockScaleMoERunner class for autotuning. - Updates C++ MoE and batched GEMM kernels to accept a configIndex for workspace sizing and execution. - Extends the unit test to run both autotuned and non-autotuned code paths. Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-06-17 21:01:56 +08:00
qsang-nv	faca19c2f0	update setup.py for special cases (#5227 ) Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>	2025-06-17 16:41:07 +08:00
qsang-nv	134cb66a53	fix mla test (#5240 ) Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>	2025-06-17 15:26:25 +08:00
Enwei Zhu	4b82b8b4c7	[TRTLLM-5330] perf: Optimize MoE supplementary kernels for large-scale EP (#5215 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-17 15:23:24 +08:00
Tracin	a2e8ae1120	Update internal cutlass commit. (#5228 ) Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>	2025-06-17 10:47:45 +08:00
Anthony Chang	4f9fa9f21d	feat: MoE trtllm backend kernel update (#5183 ) Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>	2025-06-16 14:46:13 +08:00
Tracin	ef3fdc8051	feat: Add w4a8_mxfp4_fp8 quantization recipe. (#4867 ) Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>	2025-06-16 11:30:57 +08:00
qsang-nv	5a01ba5260	use cu for fmha_v2 (#4694 ) Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>	2025-06-15 18:40:44 +08:00
Aurelien Chartier	1389f5a4d3	feat: Add support for fp8 rowwise quantization (#4876 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com> Co-authored-by: aikitoria <151776613+aikitoria@users.noreply.github.com>	2025-06-14 06:37:48 -07:00
yunruis	b99c5ce8c1	Feat/ds r1 min latency opt round3, add router gemm, fused a gemm, PDL (#4560 ) Signed-off-by: yunruis <yunruis@nvidia.com> Signed-off-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com> Signed-off-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com> Co-authored-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com>	2025-06-14 17:36:22 +08:00
dongxuy04	97657bfda2	optimize memset before alltoall communication (#5188 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-06-14 10:49:47 +08:00
Perkz Zheng	3d87770e15	[https://nvbugspro.nvidia.com/bug/5295470 ] support headDim 256 for blackwell fmha kernels (#5164 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-06-13 23:01:01 +08:00
yunruis	e5be3a95b3	fix: fix license bug (#5200 ) Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>	2025-06-13 18:58:15 +08:00
yunruis	30c5b4183a	refactoring: port customized kernels with public cutlass version (#5027 ) Signed-off-by: yunruis Merge this to unblock others since the full CI has been run through	2025-06-13 16:19:31 +08:00
Yao Yao	12e075eb70	[nvbug 5333996 ][fix] Unload XQA cubins early to avoid static lifetime (#5133 ) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>	2025-06-13 15:53:29 +08:00
Matthias Jouanneaux	514baf1287	[fix] Fix comment to pass guardwords check (#5191 ) Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>	2025-06-13 15:49:59 +08:00
zhhuang-nv	a891013e3c	[feat] Optimize KV Cache Reuse for MLA (#4869 ) Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>	2025-06-13 11:03:05 +08:00
Matthias Jouanneaux	a0b6c635b1	[feat] trtllmGen MoE routing: added support for top groups and top K bounds (#4063 ) Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com> Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com> Co-authored-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>	2025-06-13 06:00:02 +08:00
Bo Li	1b79041f5d	fix: XQA is not enabled when history_length < kMinHistoryTokensPerBlock. (#4264 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-06-11 09:38:10 +08:00
Zongfei Jing	6d1f2d0fd7	[TRTLLM-3927] [feat] Finalize + Allreduce + add + rmsnorm fusion (#4756 ) Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-06-10 19:55:16 +08:00
Dom Brown	9c012d5bf8	[TRTLLM-5589] feat: Integrate TRT-LLM Gen FP8 Batched GEMM with Pytorch workflow kernel autotuner (#4872 ) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-06-09 11:02:48 +01:00
liji-nv	1d4f748773	[fix] Fix illegal mem access and possible accuracy lose. Cherry-pick … (#5017 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-06-09 17:50:57 +08:00
ChristinaZ	f45aff2b7d	Add customized renormalized moe routing kernel for moe cutlass backend (#4955 ) Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>	2025-06-09 17:38:50 +08:00
Daniel Stokes	3a4851b7c3	feat: Add Mixture of Experts FP8xMXFP4 support (#4750 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-06-09 13:25:04 +08:00
Omer Ullman Argov	8731f5f14f	chore: Mass integration of release/0.20 (#4898 ) Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Signed-off-by: Hui Gao <huig@nvidia.com> Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com> Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com> Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com> Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com> Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com> Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> Signed-off-by: moraxu <mguzek@nvidia.com> Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Co-authored-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Co-authored-by: HuiGao-NV <huig@nvidia.com> Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com> Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Co-authored-by: Bo Li <22713281+bobboli@users.noreply.github.com> Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com> Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com> Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com> Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> Co-authored-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com> Co-authored-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com> Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com> Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com> Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>	2025-06-08 23:26:26 +08:00
dongxuy04	1e369658f1	feat: large-scale EP(part 6: Online EP load balancer integration for GB200 nvfp4) (#4818 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>	2025-06-08 10:25:18 +08:00
Jinyang Yuan	20d0649f19	[feat] Support XQA-based MLA on SM120 (#4858 ) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com> Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com> Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com> Co-authored-by: Yao Yao <lowsfer@users.noreply.github.com> Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>	2025-06-06 22:32:49 +08:00
Anthony Chang	eeb555e37b	chore: memoize weight shuffle index to speed up weight preproc in moe_backend=TRTLLM (#4826 ) Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>	2025-06-06 16:13:54 +08:00
dongjiyingdjy	51652b9b2b	feat : add PositionEmbeddingType=0 to xqa support (#4934 ) Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>	2025-06-05 21:50:42 +08:00
Shiyu Li	b0d287c9b7	[TRTLLM-4647][fix] Fix the no fusion allreduce hanging (#4594 ) Signed-off-by: Shiyu Li <shili@nvidia.com>	2025-06-04 18:26:13 -07:00
Omer Ullman Argov	e71de2a13e	chore: Mass integration of release/0.20. (#4871 ) Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com> Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>	2025-06-04 14:12:27 +08:00
ChristinaZ	d64af85e8c	Replace memset with data initialization within kernels (#4851 ) Signed-off-by: Christina Zhang <christinaz@nvidia.com>	2025-06-04 08:56:46 +08:00
Perkz Zheng	a089aa3225	[https://nvbugspro.nvidia.com/bug/5300080 ] Fix the bug of setting attention_chunk_size and enable chunked-attention in the generation-phase by default (#4693 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-06-03 19:02:57 -04:00
Nikita Korobov	8043d7a03c	feat: update DeepSeek FP8 TRT-LLM Gen cubins (#4643 ) Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>	2025-06-03 14:07:54 -07:00
Yilin Fan	90aab0596e	[fix] Fix Llama4 guradwords failures (#4844 ) Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>	2025-06-02 13:43:42 -07:00
tomeras91	bf9cd11fd4	[TRTLLM-4783][feat] Mamba2 kernel updates for Nemotron-H (#4494 ) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>	2025-06-01 13:56:44 +03:00
Enwei Zhu	25dde49c28	fix: EP load balancer with MTP layer and route offset by EP rank (#4767 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-01 00:07:44 +08:00
Jinyang Yuan	5339d367ce	[perf] Reduce the workspace size of FP4 activation scales for MoE (#4303 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>	2025-05-30 09:03:52 +08:00
Yilin Fan	31bb650298	Cherry pick feat/llama4 to main (#4739 ) Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com> Co-authored-by: Chenfei Zhang <chenfeiz@nvidia.com>	2025-05-30 05:28:40 +08:00
yunruis	29ac4c20e0	fix: fix dsr1 min lat cga ar rate drop(0.2) (#4561 ) Signed-off-by: yunruis <yunruis@nvidia.com>	2025-05-27 21:59:57 +08:00
Perkz Zheng	4d711be8f4	Feat: add sliding-window-attention generation-phase kernels on Blackwell (#4564 ) * move cubins to LFS Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * update cubins Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * add sliding-window-attention generation-phase kernels on Blackwell Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * address comments Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> --------- Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-05-26 09:06:33 +08:00
Yao Yao	ef763b0ddc	fix: rename some terms (#4534 ) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>	2025-05-23 23:23:49 +08:00
zhhuang-nv	8452775db8	[TRTLLM-5070][feat] Support FP8 KV Cache Reuse for MLA (#4535 ) * optimize kv cache reuse workflow for MLA write kv cache first and only call up-projection GEMM once relax contiguous requirements of k/v for setting paged kv cache return two contiguous tensors when loading MLA KV Cache Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * support fp8 kv cache for MLA kv cache reuse Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * resolve comments Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> --------- Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>	2025-05-23 19:47:50 +08:00
Anthony Chang	bbea2647b1	Qwen3 supports TRTLLM FP4 MoE backend (#4530 ) * MoE TRTLLM backend for Qwen3 Signed-off-by: Anthony Chang <anchengc@nvidia.com> * add extra moe_backend to test Signed-off-by: Anthony Chang <anchengc@nvidia.com> * address comments Signed-off-by: Anthony Chang <anchengc@nvidia.com> * conditionally compile kernels on newer archs Signed-off-by: Anthony Chang <anchengc@nvidia.com> * missing positional arg Signed-off-by: Anthony Chang <anchengc@nvidia.com> * Update the routing kernels Signed-off-by: Christina Zhang <christinaz@nvidia.com> * Revise usage of TLLM_LOG_ERROR Signed-off-by: Christina Zhang <christinaz@nvidia.com> * Add unit test for Qwen3 moe (trtllm_gen backend) Signed-off-by: Christina Zhang <christinaz@nvidia.com> * improve weight processing speed of moe_backend=TRTLLM; roughly 2x Signed-off-by: Anthony Chang <anchengc@nvidia.com> * tidy and minor fix Signed-off-by: Anthony Chang <anchengc@nvidia.com> * temporarily disable accuracy test that has known issue Signed-off-by: Anthony Chang <anchengc@nvidia.com> --------- Signed-off-by: Anthony Chang <anchengc@nvidia.com> Signed-off-by: Christina Zhang <christinaz@nvidia.com> Co-authored-by: Christina Zhang <christinaz@nvidia.com>	2025-05-23 18:31:08 +08:00

1 2 3 4 5

219 Commits