Jee Jee Li
559d6710bf
[PERF]MiniMax-M2 gate kernel (#38445)
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: qianlihuang <91178480+qianlihuang@users.noreply.github.com>
Co-authored-by: Yiliu Dong <91178480+qianlihuang@users.noreply.github.com>
2026-05-29 18:28:34 -07:00
Wentao Ye
64e1218673
[Perf] Optimize moe permute by pre-allocate buffer, 9~14% kernel performance improvement (#43014)
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-05-28 06:18:26 -07:00
Jee Jee Li
ec5de7fa7d
[LoRA] Add one shot triton kernel For MoE LoRA (#42290)
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2026-05-25 19:47:04 -07:00
Jee Jee Li
d4004455d2
[Kernel] Remove NormGateLinear (#43554)
...
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
2026-05-25 09:49:19 +00:00
danisereb
d56285c747
Tuning script and configs for Triton Mamba SSU kernel (#43083)
...
Signed-off-by: Banani Ghosh <bg2502@nyu.edu>
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Co-authored-by: Banani Ghosh <bg2502@nyu.edu>
2026-05-24 20:12:44 +03:00
Benjamin Chislett
4e2eba28be
[Perf] Optimize hidden state extraction logic (#37374)
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-05-22 18:23:08 -04:00
Viktor Pus
87a2adcb43
[Misc] Add common random prefix option to structured-output serving benchmark (#41632)
...
Signed-off-by: Viktor Pus <viktorpus@tenstorrent.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-05-16 00:44:48 +00:00
lyd1992
f351455f0f
[CPU][RISC-V] Add RVV-optimized attention kernels for RISC-V Vector Extension (#40119)
...
Signed-off-by: liuyudong <liuyudong@iscas.ac.cn>
Co-authored-by: Claude <noreply@anthropic.com>
2026-05-15 12:08:23 +08:00
Matthew Bonanni
9898f94abe
[Attention] Remove deprecated MLA prefill arguments (#42555)
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-05-14 10:34:06 -07:00
Jee Jee Li
0a65d46628
[DSV4] Fuse norm and router for low latency scenario (#41263)
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: jeejeelee <jeejeelee@verda-b300-05.datacrunch.io>
Co-authored-by: jeejeelee <jeejeelee@verda-b300-05.datacrunch.io>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2026-05-14 05:11:02 -07:00
Yongye Zhu
0d2732dd91
[MLA Attention Backend] Add TOKENSPEED_MLA backend for DSR1/Kimi K25 prefill + decode on Blackwell (#41778)
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
2026-05-13 23:48:02 -07:00
bnellnm
6427603ae8
[MoE Refactor] Move remaining experts classes to experts directory (#42334)
...
Signed-off-by: Bill Nell <bnell@redhat.com>
2026-05-12 09:19:46 -04:00
Dao007forever
4845aee6b7
[Benchmark] Add --trust-remote-code flag to multi-turn benchmark (#41661)
...
Signed-off-by: Dao Le <daole@inferact.ai>
Signed-off-by: Dao Le <Dao007forever@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
2026-05-05 01:00:37 -07:00
snadampal
3179e53135
[P/D] Prefill compute optimizations with bi-directional KV cache transfers between P and D nodes (#32553)
...
Signed-off-by: Sunita Nadampalli <nadampal@amazon.com>
2026-04-30 10:14:20 +00:00
Zhanda Zhu
5d5c776444
[Perf] FP8 FlashInfer Attn for ViT (#38065)
...
Signed-off-by: Zhanda Zhu <zhandazhu@gmail.com>
Co-authored-by: Yubo Gao <ybgao-nvidia@users.noreply.github.com>
2026-04-27 13:44:15 +08:00
Ignacio Sica
f88763efc3
[Bugfix] add seq_lens_cpu_upper_bound to CommonAttentionMetadata in mla_runner.py (#40844)
...
Signed-off-by: ignaciosica <mignacio.sica@gmail.com>
2026-04-24 23:13:52 +00:00
Jackmin801
079a4cf399
[MoE] Move cutlass moe to fused_moe/experts/ (#40574)
...
Signed-off-by: Jackmin801 <ongjackm@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
2026-04-24 06:05:49 +00:00
Yanan Cao
fe5c115ee4
[vLLM IR] Add IR op testing and benchmarking infrastructure (#40167)
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>
Co-authored-by: Theresa Shan <Theresa.Shan@amd.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 00:23:03 +00:00
Nicolò Lucchesi
8625ec267b
[Misc] Multi-turn benchmark output performance json (#39572)
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-04-13 18:15:23 +00:00
Yan Ma
ec68d53b2b
Add platform manual_seed_all API (#38468)
...
Signed-off-by: Yan Ma <yan.ma@intel.com>
2026-04-10 13:43:50 +08:00
Maral
2e9034c998
[W8A8 Block Linear Refactor][2/N] Remove W8A8Fp8BlockLinearOp and adopt Fp8 block linear kernel selections. (#33892)
...
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: Maral <maralbahari.98@gmail.com>
2026-04-09 08:50:39 +08:00
Jackmin801
a776a48b1c
[MoE] Move DEEP_GEMM into experts/ subdirectory (#39005)
...
Signed-off-by: Jackmin801 <ongjackm@gmail.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2026-04-08 19:23:08 +00:00
Carl Y
3bc2734dd0
[Kernel] Fuse FP8 output quantization into merge_attn_states (#36518)
...
Signed-off-by: Carl You <4531192+carlyou@users.noreply.github.com>
2026-04-03 01:47:04 +00:00
Xin Yang
9bd7231106
Revert "[Kernel] Add gpt-oss Router GEMM kernel (#37205)" (#38778)
...
Signed-off-by: Xin Yang <xyangx@amazon.com>
2026-04-01 22:02:32 -07:00
Monishver
c09ad767cd
Feature/silu block quant fusion v1 (#32996)
...
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
2026-04-01 18:50:43 +00:00
Zhanda Zhu
c75a313824
[Perf] triton bilinear_pos_embed kernel for ViT (#37948)
...
Signed-off-by: Zhanda Zhu <zhandazhu@gmail.com>
2026-04-01 01:52:02 -07:00
whyiug
58c959a767
[Misc]: clean up non-core lint issues (#37049)
...
Signed-off-by: whyiug <whyiug@hotmail.com>
2026-03-28 10:28:16 -04:00
Liwen
171775f306
Fix Device Index for ROCm Ray Workers in MoE Benchmark (#38108)
...
Signed-off-by: Liwen <53441624+li-liwen@users.noreply.github.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2026-03-28 08:27:11 +00:00
Jee Jee Li
2bfbdca23c
[Bugfix] Fix benchmark_fused_collective.py (#38082)
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2026-03-25 23:51:00 -07:00
Harry Mellor
d215d1efca
[Mypy] Better fixes for the mypy issues in vllm/config (#37902)
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-25 06:14:43 -07:00
Kyle Sayers
38364a7e32
[Sparse24] [Deprecation] Remove Sparse24 CT integration and kernels (#36799)
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
2026-03-23 16:03:29 -04:00
Harry Mellor
572b432913
Stop bench CLI from recursively casting all configs to dict (#37559)
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-19 14:04:03 +00:00
Wentao Ye
0ef7f79054
[Perf] Add tuned triton moe config for Qwen3.5 H200, 9.9% E2E throughput improvement (#37340)
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-03-18 14:18:34 -04:00
Xin Yang
b1169d7be8
[Kernel] Add gpt-oss Router GEMM kernel (#37205)
...
Signed-off-by: Xin Yang <xyangx@amazon.com>
2026-03-18 08:15:56 -07:00
Andrey Talman
68f783a727
[Torch 2.11] Guard torch._C._cpu attribute checks for forward compatibility (#35673)
...
Signed-off-by: atalman <atalman@fb.com>
2026-03-17 18:47:59 +00:00
Wei Zhao
a3a51d20e7
[Benchmark] Improvements to attention benchmark script (#37115)
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
2026-03-16 22:22:40 +00:00
Kunshang Ji
747b068136
[Hardware] Replace memory related torch.cuda APIs (#37031)
...
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com>
2026-03-16 10:24:48 +00:00
Matthew Bonanni
f444c05c32
[Attention] Use FA4 for MLA prefill (#34732)
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-03-12 12:10:17 -04:00
Kunshang Ji
53ec16a705
[Hardware] Replace torch.cuda.device_count/current_device/set_device API (#36145)
...
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2026-03-12 07:57:47 -07:00
Yan Ma
894843eb25
replace `with torch.cuda.device` with `with torch.accelerator.device_index` (#36144)
...
Signed-off-by: Yan Ma <yan.ma@intel.com>
2026-03-11 23:12:57 -07:00
Roberto L. Castro
580864d81e
[Attention][Perf][Kernel] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2 (#34917)
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
2026-03-09 09:50:36 -07:00
Roberto L. Castro
2b28b9b269
[Attention][Perf] Optimize cp_gather_and_upconvert_fp8_kv_cache - DeepSeek-v3.2 (#35290)
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
2026-03-09 09:46:57 -07:00
Harry Mellor
a0f44bb616
Allow markdownlint to run locally (#36398)
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-08 20:05:24 -07:00
lif
00b814ba5a
[V0 Deprecation] Remove unused swap_space parameter (#36216)
...
Signed-off-by: majiayu000 <1835304752@qq.com>
Co-authored-by: mcelrath
2026-03-07 22:09:55 +08:00
Jiayi Yan
6a895197fa
[Bugfix][CI] fix typos (#34934)
...
Signed-off-by: 1195343015 <1195343015@qq.com>
Signed-off-by: Jiayi Yan <66017932+1195343015@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-05 17:05:46 +00:00
Kunshang Ji
66a2209645
[Hardware] Replace torch.cuda.synchronize() api with torch.accelerator.synchronize (#36085)
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2026-03-05 10:36:39 +00:00
Kunshang Ji
16d2ad1d38
[Hardware] Replace torch.cuda.empty_cache with torch.accelerator.empty_cache (#30681)
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-04 09:49:47 +00:00
Robert Shaw
97995f6376
[MoE Refactor] Create MK for TRTLLM Kernels (#32564)
...
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2026-03-03 10:39:50 -08:00
Cyrus Leung
792a74b973
[Doc] Improve UX of --enable-log-requests (#35723)
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-03-02 08:24:09 -08:00
Wentao Ye
05970c772c
[Refactor] Remove dead code for attention benchmark script (#35418)
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-02-26 09:53:46 -08:00