855 Commits

Author SHA1 Message Date
TJian aa6fb8a329 [Bugfix] [ROCm] [Critical] fallback to regular abi for ROCm (#44648)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2026-06-05 15:51:17 +00:00
Chris Leonard 56aff0dd15 [10/n] Migrate cuda_view and silu_and_mul_per_block_quant kernels to torch stable ABI. (#44334) 2026-06-04 20:14:43 -07:00
Yongye Zhu b5235fca2e [DSv4] Adding TRTLLM gen attention kernel (#43827) 2026-06-04 07:35:09 -07:00
Chris Leonard 59d0236193 [10b/n] Migrate custom all-reduce, DeepSeek V4 fused MLA, MiniMax reduce-RMS, and MXFP8 MoE to libtorch stable ABI (#44365)
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
2026-06-04 00:29:46 +08:00
Charlie Fu 71df063c49 Enable per_token_group_quant/_C_stable_libtorch for ROCm (#42758)
Signed-off-by: charlifu <charlifu@amd.com>
2026-06-02 23:23:28 -07:00
SeongJun Lee 3099de3617 [Kernel][MoE] Add GELU_TANH to CPU, CUTLASS, and WNA16 MoE backends (#42027)
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Co-authored-by: lesj0610 <lesj0610@users.noreply.github.com>
2026-06-02 17:12:08 -04:00
Chris Leonard 4d93bc35c9 Migrate header files to torch stable abi (#44013) 2026-06-02 08:09:52 -07:00
Rukhaiya2004 689b0eeb9e [HARDWARE][POWER] Enable SHM communicator support for PowerPC (#43754)
Signed-off-by: Rukhaiya <rukhaiya@c643n08aix1-lp1.pok.stglabs.ibm.com>
Signed-off-by: Rukhaiya <bibirukhaiya123@gmail.com>
Co-authored-by: Rukhaiya <rukhaiya@c643n08aix1-lp1.pok.stglabs.ibm.com>
Co-authored-by: Akash kaothalkar <61960177+Akashcodes732@users.noreply.github.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
2026-06-02 18:06:32 +08:00
Fadi Arafeh 0b25cf4419 [CPU][Perf] Enable fused kernels for GDN's gated delta rules (#43534)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
2026-06-02 08:00:48 +00:00
wcy 98f1279815 [CPU][RISC-V] Add missing RVV cpu_types helpers for WNA16 (#42730)
Signed-off-by: wcy <233313160abc@gmail.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
2026-06-01 14:56:41 +08:00
Lanze Liu 124fac10cb [Bugfix] Fix RMSNorm kernels to multiply in weight's native dtype (#42379)
Signed-off-by: Lanze Liu <lanzetech@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2026-05-29 23:16:53 -07:00
Jee Jee Li 559d6710bf [PERF]MiniMax-M2 gate kernel (#38445)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: qianlihuang <91178480+qianlihuang@users.noreply.github.com>
Co-authored-by: Yiliu Dong <91178480+qianlihuang@users.noreply.github.com>
2026-05-29 18:28:34 -07:00
JartX 0cff0741ff [Kernel][ROCm] Native W4A16 kernel for AMD RDNA3 (gfx1100) — fp16 + bf16 (#41394)
Signed-off-by: JartX <sagformas@epdcenter.es>
2026-05-29 11:04:40 +00:00
Chris Leonard 22a58640b4 [9/n] Migrate attention and cache kernels to torch stable ABI (continued) (#43717)
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
2026-05-29 04:44:45 +00:00
Matthias Gehre a9ec46d4b7 [ROCm][Perf] Support N=5 in wvSplitK skinny GEMM kernels for speculative decoding (#40687)
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
2026-05-28 16:28:21 +00:00
Wentao Ye 64e1218673 [Perf] Optimize moe permute by pre-allocate buffer, 9~14% kernel performance improvement (#43014)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-05-28 06:18:26 -07:00
tonyliu312 6cc8577421 [Kernel] Marlin MoE: include SM 12.x in default arch list (#40923)
Signed-off-by: Tony Liu <tonyliu0512@gmail.com>
Co-authored-by: Tony Liu <tonyliu0512@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
2026-05-28 15:30:26 +08:00
Chris Leonard 284e6f543d [8/n] Migrate merge_attn_states, mamba, sampler to torch stable ABI (continued) (#43361)
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
2026-05-27 09:35:24 -07:00
Yongye Zhu 6ab6ffb428 [Feat][DSV4] Fuse q pad into deepseek v4 fused kernel (#43162) 2026-05-26 05:12:54 -10:00
zhao, zhenhui 771e1e48b1 [CPU] Enable non-divisible GQA for decode workitems in mixed batches (#43032)
Signed-off-by: zhejiangxiaomai <zhenhui.zhao@intel.com>
2026-05-26 14:15:47 +08:00
Jee Jee Li d4004455d2 [Kernel] Remove NormGateLinear (#43554)
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
2026-05-25 09:49:19 +00:00
Jakub Zakrzewski 5bb8d2767a [Kernel] Batch invariant NVFP4 linear using cutlass (#39912)
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
2026-05-23 09:41:12 -04:00
Chris Leonard a7be0f342d [7/n] Migrate pos_encoding and norm kernels to libtorch stable ABI (continued) (#43209)
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
2026-05-23 13:20:00 +08:00
gnovack f743254143 DSv4 fused Q-norm kernel grid refactor (#42353) 2026-05-22 15:21:33 -07:00
velonica0 c68c55d43e [CPU][RISC-V] Add VLEN=256 support to RVV attention kernels (#42943)
Signed-off-by: velonica0 <like@mail.nankai.edu.cn>
Signed-off-by: velonica0 <47554626+velonica0@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
2026-05-21 04:50:49 -07:00
Chris Leonard 07aeaf9d4d [6/n] Migrate activation kernels, gptq, gguf, non cutlass w8a8 to libtorch stable ABI (continued) (#42663)
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
2026-05-20 00:18:12 -07:00
Wentao Ye 37ece593c1 [Perf] Padded nvfp4 quant kernel to remove additional copy, 2.4%~5.7% e2e performance improvement (#42774)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-05-18 16:38:12 -07:00
Wentao Ye 00e20e76f7 [Refactor] Remove dead cuda kernels (#42767)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-05-18 11:14:21 -07:00
Yuwen Zhou 88a860d754 [CPU] Add MXFP4 W4A16 MoE support (#41922)
Signed-off-by: yuwenzho <yuwen.zhou@intel.com>
Signed-off-by: Yuwen Zhou <yuwen.zhou@intel.com>
2026-05-18 03:04:45 -07:00
Tianmu Li cac81b6eda [CPU Backend] Improve cpu thread utilization (#42666)
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2026-05-18 03:04:41 -07:00
Li, Jiang b4601ad43f [CPU] Add fused GDN support for AMX CPU platform (#42707)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2026-05-18 03:04:36 -07:00
Nguyễn Thế Duy e3aeee5ff8 [Bugfix] moe lora align kernel grid (#40131)
Signed-off-by: TheDuyIT <nduy250299@gmail.com>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Signed-off-by: dtnguyen <dtnguyen@nvidia.com>
Co-authored-by: Jee Jee Li <jeejeelee@inferact.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2026-05-18 00:17:53 -07:00
Blake Ledden 06d020bb6e [Bugfix] Fix SM121 (DGX Spark) exclusion from Marlin/CUTLASS FP8 paths (#35568)
Signed-off-by: Blake Ledden <blake@secondnaturecomputing.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
2026-05-15 10:59:00 -07:00
lyd1992 f351455f0f [CPU][RISC-V] Add RVV-optimized attention kernels for RISC-V Vector Extension (#40119)
Signed-off-by: liuyudong <liuyudong@iscas.ac.cn>
Co-authored-by: Claude <noreply@anthropic.com>
2026-05-15 12:08:23 +08:00
Wentao Ye 6548560496 [Compile] Fix compile warning with topk softplus sqrt (#41261)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-05-14 05:12:50 -07:00
Jee Jee Li 0a65d46628 [DSV4] Fuse norm and router for low latency scenario (#41263)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: jeejeelee <jeejeelee@verda-b300-05.datacrunch.io>
Co-authored-by: jeejeelee <jeejeelee@verda-b300-05.datacrunch.io>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2026-05-14 05:11:02 -07:00
Chris Leonard 85b2fecab7 [5/n] Migrate CUTLASS MLA, hadamard, awq, allspark and DSV3 fused a gemm to torch stable ABI (continued) (#42339)
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
2026-05-13 07:24:39 +00:00
Jiahan Chang (Cyrus) dd6b3a5ef5 [Perf] Use 2D-grid to eliminate divmod in W8A8 group quant (#42153)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
2026-05-12 10:01:30 -04:00
pschlan-amd 39dff5ff39 Add VLLM_USE_SPINLOOP_EXT to use more efficient busy polling (#36517)
Signed-off-by: Patrick Schlangen <pschlan@amd.com>
2026-05-11 16:11:49 -07:00
Wentao Ye 0d453e2336 [Perf] Batch invariance with Cutlass fp8 support, 28.9% E2E latency improvement (#40408)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-05-11 12:20:58 -04:00
Wentao Ye 4b64fc2cbf [Refactor] Cleanup batch invariant dead code (#41993)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-05-11 10:48:39 -04:00
Rohan Potdar a51376b3f0 [Performance][DSR1]: Fused RoPE+KVCache+q_concat for MLA (#40392)
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com>
Co-authored-by: ElizaWszola <ewszola@redhat.com>
2026-05-11 14:10:50 +00:00
bnellnm 1b57eb41f2 [MoE] Move various experts classes to fused_moe/experts/ (#41979)
Signed-off-by: Jackmin801 <ongjackm@gmail.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Signed-off-by: Jackmin801 <56836461+Jackmin801@users.noreply.github.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Jackmin801 <ongjackm@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Robert Shaw <robertgshaw2@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Jackmin801 <56836461+Jackmin801@users.noreply.github.com>
2026-05-11 07:54:33 +08:00
Wei Zhao 986edc858a [Bugfix] Fix DeepSeek v4 topk numerical issue for unaligned max-model-len (#42169) 2026-05-09 20:30:08 -07:00
Itay Etelis 00b0618a03 Use CU_MEMCPY_SRC_ACCESS_ORDER_ANY for batch KV cache swaps (#39306)
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Signed-off-by: Itay Etelis <etelis2019@gmail.com>
Signed-off-by: Itay Etelis <92247226+Etelis@users.noreply.github.com>
Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Itay Etelis <etelis2019@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2026-05-10 05:57:09 +03:00
pmaybank 6881c754e1 use HIP_VERSION variables to guard against duplicate atomicAdd definitions (#41802)
Signed-off-by: Philip Maybank <pmaybank@amd.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
2026-05-08 18:44:37 -04:00
Li, Jiang b3945cc316 [CPU] Bump up to the latest CPU kernels (#41924)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2026-05-07 05:45:59 -07:00
Hongxia Yang 20cac26b19 [ROCm] Enable SimpleCPUOffloadConnector on ROCm (#40549)
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
2026-05-06 20:52:02 -07:00
Yongye Zhu 80d5e7d103 [Bugfix] Fix condition to clear persistent topk so that it can be captured regardless (#41665)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
2026-05-06 16:17:48 -07:00
Zhaodong Bing 66d1cc0c77 fix(rocm): remove workaround causing invalid argument on Qwen3.5 with TP=2 (#40686)
Co-authored-by: Test User <test@example.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
2026-05-06 01:38:32 -07:00