TJian
|
aa6fb8a329
|
[Bugfix] [ROCm] [Critical] fallback to regular abi for ROCm (#44648)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
|
2026-06-05 15:51:17 +00:00 |
|
Chris Leonard
|
56aff0dd15
|
[10/n] Migrate cuda_view and silu_and_mul_per_block_quant kernels to torch stable ABI. (#44334)
|
2026-06-04 20:14:43 -07:00 |
|
Yongye Zhu
|
b5235fca2e
|
[DSv4] Adding TRTLLM gen attention kernel (#43827)
|
2026-06-04 07:35:09 -07:00 |
|
Chris Leonard
|
59d0236193
|
[10b/n] Migrate custom all-reduce, DeepSeek V4 fused MLA, MiniMax reduce-RMS, and MXFP8 MoE to libtorch stable ABI (#44365)
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
|
2026-06-04 00:29:46 +08:00 |
|
Charlie Fu
|
71df063c49
|
Enable perf_token_group_quant/_C_stable_libtorch for ROCm (#42758)
Signed-off-by: charlifu <charlifu@amd.com>
|
2026-06-02 23:23:28 -07:00 |
|
SeongJun Lee
|
3099de3617
|
[Kernel][MoE] Add GELU_TANH to CPU, CUTLASS, and WNA16 MoE backends (#42027)
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Co-authored-by: lesj0610 <lesj0610@users.noreply.github.com>
|
2026-06-02 17:12:08 -04:00 |
|
Chris Leonard
|
4d93bc35c9
|
Migrate header files to torch stable abi (#44013)
|
2026-06-02 08:09:52 -07:00 |
|
Rukhaiya2004
|
689b0eeb9e
|
[HARDWARE][POWER] Enable SHM communicator support for PowerPC (#43754)
Signed-off-by: Rukhaiya <rukhaiya@c643n08aix1-lp1.pok.stglabs.ibm.com>
Signed-off-by: Rukhaiya <bibirukhaiya123@gmail.com>
Co-authored-by: Rukhaiya <rukhaiya@c643n08aix1-lp1.pok.stglabs.ibm.com>
Co-authored-by: Akash kaothalkar <61960177+Akashcodes732@users.noreply.github.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
|
2026-06-02 18:06:32 +08:00 |
|
Fadi Arafeh
|
0b25cf4419
|
[CPU][Perf] Enable fused kernels for GDN's gated delta rules (#43534)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
|
2026-06-02 08:00:48 +00:00 |
|
wcy
|
98f1279815
|
[CPU][RISC-V] Add missing RVV cpu_types helpers for WNA16 (#42730)
Signed-off-by: wcy <233313160abc@gmail.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
|
2026-06-01 14:56:41 +08:00 |
|
Lanze Liu
|
124fac10cb
|
[Bugfix] Fix RMSNorm kernels to multiply in weight's native dtype (#42379)
Signed-off-by: Lanze Liu <lanzetech@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
|
2026-05-29 23:16:53 -07:00 |
|
Jee Jee Li
|
559d6710bf
|
[PERF]MiniMax-M2 gate kernel (#38445)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: qianlihuang <91178480+qianlihuang@users.noreply.github.com>
Co-authored-by: Yiliu Dong <91178480+qianlihuang@users.noreply.github.com>
|
2026-05-29 18:28:34 -07:00 |
|
JartX
|
0cff0741ff
|
[Kernel][ROCm] Native W4A16 kernel for AMD RDNA3 (gfx1100) — fp16 + bf16 (#41394)
Signed-off-by: JartX <sagformas@epdcenter.es>
|
2026-05-29 11:04:40 +00:00 |
|
Chris Leonard
|
22a58640b4
|
[9/n] Migrate attention and cache kernels to torch stable ABI (continued) (#43717)
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
|
2026-05-29 04:44:45 +00:00 |
|
Matthias Gehre
|
a9ec46d4b7
|
[ROCm][Perf] Support N=5 in wvSplitK skinny GEMM kernels for speculative decoding (#40687)
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
|
2026-05-28 16:28:21 +00:00 |
|
Wentao Ye
|
64e1218673
|
[Perf] Optimize moe permute by pre-allocating buffer, 9~14% kernel performance improvement (#43014)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-05-28 06:18:26 -07:00 |
|
tonyliu312
|
6cc8577421
|
[Kernel] Marlin MoE: include SM 12.x in default arch list (#40923)
Signed-off-by: Tony Liu <tonyliu0512@gmail.com>
Co-authored-by: Tony Liu <tonyliu0512@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
|
2026-05-28 15:30:26 +08:00 |
|
Chris Leonard
|
284e6f543d
|
[8/n] Migrate merge_attn_states, mamba, sampler to torch stable ABI (continued) (#43361)
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
|
2026-05-27 09:35:24 -07:00 |
|
Yongye Zhu
|
6ab6ffb428
|
[Feat][DSV4] Fuse q pad into deepseek v4 fused kernel (#43162)
|
2026-05-26 05:12:54 -10:00 |
|
zhao, zhenhui
|
771e1e48b1
|
[CPU] Enable non-divisible GQA for decode workitems in mixed batches (#43032)
Signed-off-by: zhejiangxiaomai <zhenhui.zhao@intel.com>
|
2026-05-26 14:15:47 +08:00 |
|
Jee Jee Li
|
d4004455d2
|
[Kernel] Remove NormGateLinear (#43554)
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
|
2026-05-25 09:49:19 +00:00 |
|
Jakub Zakrzewski
|
5bb8d2767a
|
[Kernel] Batch invariant NVFP4 linear using cutlass (#39912)
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
|
2026-05-23 09:41:12 -04:00 |
|
Chris Leonard
|
a7be0f342d
|
[7/n] Migrate pos_encoding and norm kernels to libtorch stable ABI (continued) (#43209)
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
|
2026-05-23 13:20:00 +08:00 |
|
gnovack
|
f743254143
|
DSv4 fused Q-norm kernel grid refactor (#42353)
|
2026-05-22 15:21:33 -07:00 |
|
velonica0
|
c68c55d43e
|
[CPU][RISC-V] Add VLEN=256 support to RVV attention kernels (#42943)
Signed-off-by: velonica0 <like@mail.nankai.edu.cn>
Signed-off-by: velonica0 <47554626+velonica0@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
|
2026-05-21 04:50:49 -07:00 |
|
Chris Leonard
|
07aeaf9d4d
|
[6/n] Migrate activation kernels, gptq, gguf, non cutlass w8a8 to libtorch stable ABI (continued) (#42663)
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
|
2026-05-20 00:18:12 -07:00 |
|
Wentao Ye
|
37ece593c1
|
[Perf] Padded nvfp4 quant kernel to remove additional copy, 2.4%~5.7% e2e performance improvement (#42774)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-05-18 16:38:12 -07:00 |
|
Wentao Ye
|
00e20e76f7
|
[Refactor] Remove dead cuda kernels (#42767)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-05-18 11:14:21 -07:00 |
|
Yuwen Zhou
|
88a860d754
|
[CPU] Add MXFP4 W4A16 MoE support (#41922)
Signed-off-by: yuwenzho <yuwen.zhou@intel.com>
Signed-off-by: Yuwen Zhou <yuwen.zhou@intel.com>
|
2026-05-18 03:04:45 -07:00 |
|
Tianmu Li
|
cac81b6eda
|
[CPU Backend] Improve cpu thread utilization (#42666)
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
|
2026-05-18 03:04:41 -07:00 |
|
Li, Jiang
|
b4601ad43f
|
[CPU] Add fused GDN support for AMX CPU platform (#42707)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2026-05-18 03:04:36 -07:00 |
|
Nguyễn Thế Duy
|
e3aeee5ff8
|
[Bugfix] moe lora align kernel grid (#40131)
Signed-off-by: TheDuyIT <nduy250299@gmail.com>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Signed-off-by: dtnguyen <dtnguyen@nvidia.com>
Co-authored-by: Jee Jee Li <jeejeelee@inferact.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
|
2026-05-18 00:17:53 -07:00 |
|
Blake Ledden
|
06d020bb6e
|
[Bugfix] Fix SM121 (DGX Spark) exclusion from Marlin/CUTLASS FP8 paths (#35568)
Signed-off-by: Blake Ledden <blake@secondnaturecomputing.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
|
2026-05-15 10:59:00 -07:00 |
|
lyd1992
|
f351455f0f
|
[CPU][RISC-V] Add RVV-optimized attention kernels for RISC-V Vector Extension (#40119)
Signed-off-by: liuyudong <liuyudong@iscas.ac.cn>
Co-authored-by: Claude <noreply@anthropic.com>
|
2026-05-15 12:08:23 +08:00 |
|
Wentao Ye
|
6548560496
|
[Compile] Fix compile warning with topk softplus sqrt (#41261)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-05-14 05:12:50 -07:00 |
|
Jee Jee Li
|
0a65d46628
|
[DSV4] Fuse norm and router for low latency scenario (#41263)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: jeejeelee <jeejeelee@verda-b300-05.datacrunch.io>
Co-authored-by: jeejeelee <jeejeelee@verda-b300-05.datacrunch.io>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
|
2026-05-14 05:11:02 -07:00 |
|
Chris Leonard
|
85b2fecab7
|
[5/n] Migrate CUTLASS MLA, hadamard, awq, allspark and DSV3 fused a gemm to torch stable ABI (continued) (#42339)
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
|
2026-05-13 07:24:39 +00:00 |
|
Jiahan Chang (Cyrus)
|
dd6b3a5ef5
|
[Perf] Use 2D-grid to eliminate divmod in W8W8 group quant (#42153)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
|
2026-05-12 10:01:30 -04:00 |
|
pschlan-amd
|
39dff5ff39
|
Add VLLM_USE_SPINLOOP_EXT to use more efficient busy polling (#36517)
Signed-off-by: Patrick Schlangen <pschlan@amd.com>
|
2026-05-11 16:11:49 -07:00 |
|
Wentao Ye
|
0d453e2336
|
[Perf] Batch invariance with Cutlass fp8 support, 28.9% E2E latency improvement (#40408)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
|
2026-05-11 12:20:58 -04:00 |
|
Wentao Ye
|
4b64fc2cbf
|
[Refactor] Cleanup batch invariant dead code (#41993)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-05-11 10:48:39 -04:00 |
|
Rohan Potdar
|
a51376b3f0
|
[Performance][DSR1]: Fused RoPE+KVCache+q_concat for MLA (#40392)
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com>
Co-authored-by: ElizaWszola <ewszola@redhat.com>
|
2026-05-11 14:10:50 +00:00 |
|
bnellnm
|
1b57eb41f2
|
[MoE] Move various experts classes to fused_moe/experts/ (#41979)
Signed-off-by: Jackmin801 <ongjackm@gmail.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Signed-off-by: Jackmin801 <56836461+Jackmin801@users.noreply.github.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Jackmin801 <ongjackm@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Robert Shaw <robertgshaw2@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Jackmin801 <56836461+Jackmin801@users.noreply.github.com>
|
2026-05-11 07:54:33 +08:00 |
|
Wei Zhao
|
986edc858a
|
[Bugfix] Fix DeepSeek v4 topk numerical issue for unaligned max-model-len (#42169)
|
2026-05-09 20:30:08 -07:00 |
|
Itay Etelis
|
00b0618a03
|
Use CU_MEMCPY_SRC_ACCESS_ORDER_ANY for batch KV cache swaps (#39306)
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Signed-off-by: Itay Etelis <etelis2019@gmail.com>
Signed-off-by: Itay Etelis <92247226+Etelis@users.noreply.github.com>
Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Itay Etelis <etelis2019@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
|
2026-05-10 05:57:09 +03:00 |
|
pmaybank
|
6881c754e1
|
use HIP_VERSION variables to guard against duplicate atomicAdd definitions (#41802)
Signed-off-by: Philip Maybank <pmaybank@amd.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
|
2026-05-08 18:44:37 -04:00 |
|
Li, Jiang
|
b3945cc316
|
[CPU] Bump up to the latest CPU kernels (#41924)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2026-05-07 05:45:59 -07:00 |
|
Hongxia Yang
|
20cac26b19
|
[ROCm] Enable SimpleCPUOffloadConnector on ROCm (#40549)
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
|
2026-05-06 20:52:02 -07:00 |
|
Yongye Zhu
|
80d5e7d103
|
[Bugfix] Fix condition to clear persistent topk so that it can be captured regardless (#41665)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
|
2026-05-06 16:17:48 -07:00 |
|
Zhaodong Bing
|
66d1cc0c77
|
fix(rocm): remove workaround causing invalid argument on Qwen3.5 with TP=2 (#40686)
Co-authored-by: Test User <test@example.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
|
2026-05-06 01:38:32 -07:00 |
|