623 Commits

Author SHA1 Message Date
Jee Jee Li 559d6710bf [PERF]MiniMax-M2 gate kernel (#38445)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: qianlihuang <91178480+qianlihuang@users.noreply.github.com>
Co-authored-by: Yiliu Dong <91178480+qianlihuang@users.noreply.github.com>
2026-05-29 18:28:34 -07:00
Wentao Ye 64e1218673 [Perf] Optimize moe permute by pre-allocate buffer, 9~14% kernel performance improvement (#43014)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-05-28 06:18:26 -07:00
Jee Jee Li ec5de7fa7d [LoRA] Add one shot triton kernel For MoE LoRA (#42290)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2026-05-25 19:47:04 -07:00
Jee Jee Li d4004455d2 [Kernel] Remove NormGateLinear (#43554)
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
2026-05-25 09:49:19 +00:00
danisereb d56285c747 Tuning script and configs for Triton Mamba SSU kernel (#43083)
Signed-off-by: Banani Ghosh <bg2502@nyu.edu>
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Co-authored-by: Banani Ghosh <bg2502@nyu.edu>
2026-05-24 20:12:44 +03:00
Benjamin Chislett 4e2eba28be [Perf] Optimize hidden state extraction logic (#37374)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-05-22 18:23:08 -04:00
Viktor Pus 87a2adcb43 [Misc] Add common random prefix option to structured-output serving benchmark (#41632)
Signed-off-by: Viktor Pus <viktorpus@tenstorrent.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-05-16 00:44:48 +00:00
lyd1992 f351455f0f [CPU][RISC-V] Add RVV-optimized attention kernels for RISC-V Vector Extension (#40119)
Signed-off-by: liuyudong <liuyudong@iscas.ac.cn>
Co-authored-by: Claude <noreply@anthropic.com>
2026-05-15 12:08:23 +08:00
Matthew Bonanni 9898f94abe [Attention] Remove deprecated MLA prefill arguments (#42555)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-05-14 10:34:06 -07:00
Jee Jee Li 0a65d46628 [DSV4] Fuse norm and router for low latency scenario (#41263)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: jeejeelee <jeejeelee@verda-b300-05.datacrunch.io>
Co-authored-by: jeejeelee <jeejeelee@verda-b300-05.datacrunch.io>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2026-05-14 05:11:02 -07:00
Yongye Zhu 0d2732dd91 [MLA Attention Backend] Add TOKENSPEED_MLA backend for DSR1/Kimi K25 prefill + decode on Blackwell (#41778)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
2026-05-13 23:48:02 -07:00
bnellnm 6427603ae8 [MoE Refactor] Move remaining experts classes to experts directory (#42334)
Signed-off-by: Bill Nell <bnell@redhat.com>
2026-05-12 09:19:46 -04:00
Dao007forever 4845aee6b7 [Benchmark] Add --trust-remote-code flag to multi-turn benchmark (#41661)
Signed-off-by: Dao Le <daole@inferact.ai>
Signed-off-by: Dao Le <Dao007forever@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
2026-05-05 01:00:37 -07:00
snadampal 3179e53135 [P/D] Prefill compute optimizations with bi-directional KV cache transfers between P and D nodes (#32553)
Signed-off-by: Sunita Nadampalli <nadampal@amazon.com>
2026-04-30 10:14:20 +00:00
Zhanda Zhu 5d5c776444 [Perf] FP8 FlashInfer Attn for ViT (#38065)
Signed-off-by: Zhanda Zhu <zhandazhu@gmail.com>
Co-authored-by: Yubo Gao <ybgao-nvidia@users.noreply.github.com>
2026-04-27 13:44:15 +08:00
Ignacio Sica f88763efc3 [Bugfix] add seq_lens_cpu_upper_bound to CommonAttentionMetadata in mla_runner.py (#40844)
Signed-off-by: ignaciosica <mignacio.sica@gmail.com>
2026-04-24 23:13:52 +00:00
Jackmin801 079a4cf399 [MoE] Move cutlass moe to fused_moe/experts/ (#40574)
Signed-off-by: Jackmin801 <ongjackm@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
2026-04-24 06:05:49 +00:00
Yanan Cao fe5c115ee4 [vLLM IR] Add IR op testing and benchmarking infrastructure (#40167)
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>
Co-authored-by: Theresa Shan <Theresa.Shan@amd.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 00:23:03 +00:00
Nicolò Lucchesi 8625ec267b [Misc] Multi-turn benchmark output performance json (#39572)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-04-13 18:15:23 +00:00
Yan Ma ec68d53b2b Add platform manual_seed_all API (#38468)
Signed-off-by: Yan Ma <yan.ma@intel.com>
2026-04-10 13:43:50 +08:00
Maral 2e9034c998 [W8A8 Block Linear Refactor][2/N] Remove W8A8Fp8BlockLinearOp and adopt Fp8 block linear kernel selections. (#33892)
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: Maral <maralbahari.98@gmail.com>
2026-04-09 08:50:39 +08:00
Jackmin801 a776a48b1c [MoE] Move DEEP_GEMM into experts/ subdirectory (#39005)
Signed-off-by: Jackmin801 <ongjackm@gmail.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2026-04-08 19:23:08 +00:00
Carl Y 3bc2734dd0 [Kernel] Fuse FP8 output quantization into merge_attn_states (#36518)
Signed-off-by: Carl You <4531192+carlyou@users.noreply.github.com>
2026-04-03 01:47:04 +00:00
Xin Yang 9bd7231106 Revert "[Kernel] Add gpt-oss Router GEMM kernel (#37205)" (#38778)
Signed-off-by: Xin Yang <xyangx@amazon.com>
2026-04-01 22:02:32 -07:00
Monishver c09ad767cd Feature/silu block quant fusion v1 (#32996)
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
2026-04-01 18:50:43 +00:00
Zhanda Zhu c75a313824 [Perf] triton bilinear_pos_embed kernel for ViT (#37948)
Signed-off-by: Zhanda Zhu <zhandazhu@gmail.com>
2026-04-01 01:52:02 -07:00
whyiug 58c959a767 [Misc]: clean up non-core lint issues (#37049)
Signed-off-by: whyiug <whyiug@hotmail.com>
2026-03-28 10:28:16 -04:00
Liwen 171775f306 Fix Device Index for ROCm Ray Workers in MoE Benchmark (#38108)
Signed-off-by: Liwen <53441624+li-liwen@users.noreply.github.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2026-03-28 08:27:11 +00:00
Jee Jee Li 2bfbdca23c [Bugfix] Fix benchmark_fused_collective.py (#38082)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2026-03-25 23:51:00 -07:00
Harry Mellor d215d1efca [Mypy] Better fixes for the mypy issues in vllm/config (#37902)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-25 06:14:43 -07:00
Kyle Sayers 38364a7e32 [Sparse24] [Deprecation] Remove Sparse24 CT integration and kernels (#36799)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
2026-03-23 16:03:29 -04:00
Harry Mellor 572b432913 Stop bench CLI from recursively casting all configs to dict (#37559)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-19 14:04:03 +00:00
Wentao Ye 0ef7f79054 [Perf] Add tuned triton moe config for Qwen3.5 H200, 9.9% E2E throughput improvement (#37340)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-03-18 14:18:34 -04:00
Xin Yang b1169d7be8 [Kernel] Add gpt-oss Router GEMM kernel (#37205)
Signed-off-by: Xin Yang <xyangx@amazon.com>
2026-03-18 08:15:56 -07:00
Andrey Talman 68f783a727 [Torch 2.11] Guard torch._C._cpu attribute checks for forward compatibility (#35673)
Signed-off-by: atalman <atalman@fb.com>
2026-03-17 18:47:59 +00:00
Wei Zhao a3a51d20e7 [Benchmark] Improvements to attention benchmark script (#37115)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
2026-03-16 22:22:40 +00:00
Kunshang Ji 747b068136 [Hardware] Replace memory related torch.cuda APIs (#37031)
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com>
2026-03-16 10:24:48 +00:00
Matthew Bonanni f444c05c32 [Attention] Use FA4 for MLA prefill (#34732)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-03-12 12:10:17 -04:00
Kunshang Ji 53ec16a705 [Hardware] Replace torch.cuda.device_count/current_device/set_device API (#36145)
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2026-03-12 07:57:47 -07:00
Yan Ma 894843eb25 replace `with torch.cuda.device` with `with torch.accelerator.device_index` (#36144)
Signed-off-by: Yan Ma <yan.ma@intel.com>
2026-03-11 23:12:57 -07:00
Roberto L. Castro 580864d81e [Attention][Perf][Kernel] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2 (#34917)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
2026-03-09 09:50:36 -07:00
Roberto L. Castro 2b28b9b269 [Attention][Perf] Optimize cp_gather_and_upconvert_fp8_kv_cache - DeepSeek-v3.2 (#35290)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
2026-03-09 09:46:57 -07:00
Harry Mellor a0f44bb616 Allow markdownlint to run locally (#36398)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-08 20:05:24 -07:00
lif 00b814ba5a [V0 Deprecation] Remove unused swap_space parameter (#36216)
Signed-off-by: majiayu000 <1835304752@qq.com>
Co-authored-by: mcelrath
2026-03-07 22:09:55 +08:00
Jiayi Yan 6a895197fa [Bugfix][CI] fix typos (#34934)
Signed-off-by: 1195343015 <1195343015@qq.com>
Signed-off-by: Jiayi Yan <66017932+1195343015@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-05 17:05:46 +00:00
Kunshang Ji 66a2209645 [Hardware] Replace torch.cuda.synchronize() api with torch.accelerator.synchronize (#36085)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2026-03-05 10:36:39 +00:00
Kunshang Ji 16d2ad1d38 [Hardware] Replace torch.cuda.empty_cache with torch.accelerator.empty_cache (#30681)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-04 09:49:47 +00:00
Robert Shaw 97995f6376 [MoE Refactor] Create MK for TRTLLM Kernels (#32564)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2026-03-03 10:39:50 -08:00
Cyrus Leung 792a74b973 [Doc] Improve UX of --enable-log-requests (#35723)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-03-02 08:24:09 -08:00
Wentao Ye 05970c772c [Refactor] Remove dead code for attention benchmark script (#35418)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-02-26 09:53:46 -08:00