17325 Commits

Author SHA1 Message Date
Andy Lo 95b1615ec9 [Perf] Improve multimodal item handling from O(n) to O(log n) per step (#44212)
Signed-off-by: Andy Lo <andy@mistral.ai>
2026-06-03 11:00:26 +00:00
Itay Etelis 1fa9ea09f6 [Perf] Triton fast path for small CPU→GPU swap_blocks_batch in the offloading connector (#42212)
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2026-06-03 13:38:17 +03:00
Yan Ma 02564b4de0 [XPU]fallback to TRITON_ATTN for vit attn on xpu when use float32 dtype (#43759)
Signed-off-by: Yan Ma <yan.ma@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2026-06-03 03:20:21 -07:00
Flora Feng 209709a8c1 [Bugfix] Fix unstreamed tool call args dropped in Responses API streaming (#44348)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
2026-06-03 03:19:08 -07:00
Wei Zhao ace95c9cf8 [Bugfix] Update TrtLLM MoE routing methods (#44347)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2026-06-03 02:56:43 -07:00
Shanshan Shen 0e2b13103b [Doc] Update ViT CUDA graph interfaces (#44388)
Signed-off-by: shen-shanshan <467638484@qq.com>
2026-06-03 01:20:59 -07:00
Bugen Zhao 449be4f934 [Rust Frontend] Fix several hf chat template rendering issues (#44311)
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
2026-06-03 01:04:43 -07:00
Xunzhuo 6550ff12f2 [Rust Frontend] Add dynamic LoRA endpoints (#43778)
Signed-off-by: xunzhuo <xunzhuo@vllm-semantic-router.ai>
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
2026-06-03 07:55:29 +00:00
NolanHo 4aaed4ca22 [Rust Frontend] Add server router extension hook (#43774)
Signed-off-by: NolanHo <kujyo.eia.serias@gmail.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
2026-06-03 07:45:31 +00:00
Varun Sundar Rabindranath 7268457999 [KV Offloading] Enable HMA models for Tiering Offloading (#44287)
Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>
Co-authored-by: varun sundar rabindranath <vsundarr@redhat.com>
2026-06-03 10:03:00 +03:00
Majid 9af53a3c13 [Perf] Add tuned selective_state_update configs for H200 and RTX PRO … (#44251)
Signed-off-by: Majid Taheri Andani <tahemaji@amazon.com>
Co-authored-by: Majid Taheri Andani <tahemaji@amazon.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
2026-06-02 23:59:01 -07:00
Andreas Karatzas 87954eb50e [ROCm][CI] Optimize ROCm Docker build: registry cache, DeepEP, and ci-bake script (#36949)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-06-02 23:43:07 -07:00
Charlie Fu 71df063c49 Enable perf_token_group_quant/_C_stable_libtorch for ROCm (#42758)
Signed-off-by: charlifu <charlifu@amd.com>
2026-06-02 23:23:28 -07:00
Albert Cheng e0081ef8cf [Benchmark] Enable reasoning-model (thinking) benchmarking via --chat-template-kwargs for client-rendered datasets (#44244)
Signed-off-by: Albert Cheng <albertching0112@gmail.com>
2026-06-02 22:49:51 -07:00
William Rom f0204358d9 [Bugfix] fix crash in postprocess for null tool args (#43862)
Signed-off-by: William-Rom <william.rom@intility.no>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2026-06-02 22:17:26 -07:00
Willow Lopez 597bc15936 fix: resolve CUTLASS fmin compatibility for DeepSeek-V4 init (#44236)
Signed-off-by: Willow Lopez <100782273+Oxygen56@users.noreply.github.com>
2026-06-03 01:07:10 -04:00
Rotem Shavitt 3f0a91bb96 Nit Changes in Tiered KV Offload (#44293)
Signed-off-by: Rotem Shavitt <rshavitt@gmail.com>
2026-06-02 21:53:21 -07:00
Flora Feng e67063826b [CI] Add missing vllm/parser/ CI trigger and fix test_parse.py (#44352)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
2026-06-02 21:05:19 -07:00
Andreas Karatzas 53b88d1dfc [CI] Reject out-of-vocabulary before they reach the GPU logprob path (#44042)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-06-02 22:27:52 -05:00
JartX 7b476c8f14 [ROCm][CI] Skip fp8 reload tests on gfx90a (MI250) (#44369)
Signed-off-by: JartX <sagformas@epdcenter.es>
2026-06-02 22:27:14 -05:00
JartX 4454a18695 [ROCm][CI] Fix stale wvSplitK GEMM fallback test for N=5 (#44368)
Signed-off-by: JartX <sagformas@epdcenter.es>
2026-06-02 22:00:25 -05:00
wangxiyuan 02a01496fc [Platform] Add is_cumem_allocator_available (#43838)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2026-06-03 10:54:50 +08:00
Kevin H. Luu 27a93cd426 [docker] Stop using extra-index-url for flashinfer-jit-cache (#44366)
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
2026-06-02 18:58:22 -07:00
Wei Zhao 969aec4bc8 [Bugfix] Fix Deepseek v4 non-mega-moe model init error (#44356)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
2026-06-02 18:26:30 -07:00
Jongseok Park ca17b6b17d [Perf] Apply single-pass min_larger finding and binary search in Triton Top-p path. (#42191)
Signed-off-by: js_park <cakeng@naver.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
2026-06-02 17:57:26 -07:00
Woosuk Kwon b254e0456c [DSV4] Minor cleanup for DeepseekV4MegaMoEExperts (#44367)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
2026-06-02 17:54:27 -07:00
Daoyuan Li bd98e97557 [Misc] Remove dead VLLM_RPC_TIMEOUT env var and fix profiling doc that references it (#44128)
Signed-off-by: Daoyuan Li <94409450+DaoyuanLi2816@users.noreply.github.com>
2026-06-03 00:22:10 +00:00
Junhao Shen a4ac746405 [MoE/b12x] Accept W4A16 (kNvfp4Static, None) in FlashInferB12xExperts supports check (#43332)
Signed-off-by: Junhao Shen <junshen@nvidia.com>
Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
2026-06-02 15:20:37 -07:00
Vadim Gimpelson 8b3b71ee9d [CI/Build] Bump flashinfer to v0.6.12 (#44036)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2026-06-02 15:19:05 -07:00
Siddharth Bedekar 0917a009d3 Fix sparse NCCL weight transfer test construction (#44345)
Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com>
2026-06-02 21:51:21 +00:00
SeongJun Lee 3099de3617 [Kernel][MoE] Add GELU_TANH to CPU, CUTLASS, and WNA16 MoE backends (#42027)
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Co-authored-by: lesj0610 <lesj0610@users.noreply.github.com>
2026-06-02 17:12:08 -04:00
Nick Hill e15f20258b [ModelRunnerV2] Avoid pipeline parallel bubbles (#42187)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-06-02 14:02:01 -07:00
Matthew Bonanni 557781131a [Misc] Remove stray empty file (#44350)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
2026-06-02 12:53:03 -07:00
Yifan Qiao e9e08c49b9 [Bugfix] Cache the EAGLE/MTP lookahead block in the SWA prefix-cache mask (#44082)
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
2026-06-02 12:21:07 -07:00
Woosuk Kwon e4a2e584e5 [MRV2] Remove assignment of graph_pool in cudagraph_utils (#44338)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
2026-06-02 11:50:27 -07:00
dependabot[bot] b8b49e2395 Bump actions/github-script from 8.0.0 to 9.0.0 (#39667)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-02 11:26:57 -07:00
Nick Hill da107a59e5 [MRV2] Also enable MRV2 for Llama and Mistral dense models (#43458)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: yewentao256 <zhyanwentao@126.com>
2026-06-02 11:18:46 -07:00
Chauncey ed9a7526b6 [Anthropic] Support system role messages inside messages array (#44283)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Co-authored-by: Aleksandar Yanakiev <alexander.yanakiev@discretestack.com>
Co-authored-by: Ang Kah Min, Kelvin <syraxius@hotmail.com>
2026-06-02 18:13:54 +00:00
Wei Zhao 2427094152 [Feature] Support EPLB for DeepSeek v4 Mega Moe (#43339)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Wei Zhao (Engrg-Hardware 1) <weizha@login-lyris01.lyris.clusters.nvidia.com>
2026-06-02 10:56:44 -07:00
Kartavya sonar fe32e7830b [Bugfix] flashinfer: fail fast when --kv-cache-dtype nvfp4 used on unsupported arch (#43669)
Signed-off-by: Kartavya Sonar <sonarkartavya@gmail.com>
2026-06-02 10:50:00 -07:00
Alireza Dadgarnia afcb580715 [BugFix] Fix Humming MoE deploy error (#43100)
Signed-off-by: Alireza Dadgarnia <dadgarnia@Alirezas-MacBook-Pro-2.local>
Signed-off-by: Alireza Dadgarnia <49554709+adotdad@users.noreply.github.com>
Co-authored-by: Alireza Dadgarnia <dadgarnia@Alirezas-MacBook-Pro-2.local>
Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com>
2026-06-02 09:32:50 -07:00
liuzhenwei 3f3e2702c2 [XPU] Enable rms_norm/act quant fusions (#43963)
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2026-06-02 16:14:41 +00:00
Flora Feng 478b49ddec [Refactor] Remove dead code from parser infrastructure (#44279)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
2026-06-02 12:08:27 -04:00
Nick Hill cab5c9a2a9 [Core] Move max_concurrent_batches to VllmConfig (#44274)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-06-02 08:57:25 -07:00
Brian Dellabetta 774e552397 [compressed-tensors] Asymmetric support for MoE WNA16 marlin (#44025)
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
2026-06-02 08:51:45 -07:00
XiaoZ 53fa09d085 [Misc] Support local image encoding in benchmarks (#43843)
Signed-off-by: xiaoz <Sukra1@outlook.com>
2026-06-02 15:15:06 +00:00
Chris Leonard 4d93bc35c9 Migrate header files to torch stable abi (#44013)
2026-06-02 08:09:52 -07:00
Bugen Zhao 586201ebdc [Rust Frontend] Cover different thinking modes in roundtrip tests (#44320)
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
2026-06-02 07:51:25 -07:00
pschlan-amd 88f172188b [ROCm] Fix AITER RMSNormQuantFusion for Kimi-Linear (#44308)
Signed-off-by: Patrick Schlangen <pschlan@amd.com>
2026-06-02 14:50:21 +00:00
Bugen Zhao 880fc032f4 [Rust Frontend] Support recursive tool parameter conversion (#44299)
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
2026-06-02 07:45:35 -07:00