17325 Commits

Author SHA1 Message Date
Chauncey 87f12e5c7c [Frontend]Responses API supports chat_template_kwargs (#43761)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2026-05-29 07:58:19 +00:00
kliuae ab7521d77c [ROCm][DSv4] Remove device pipeline stall in sparse attention (#43898)
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
2026-05-29 15:42:40 +08:00
Tianmu Li 94d3f4d205 [CPU Backend] CPU top-k and top-p sampling kernels using Triton (#43633)
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2026-05-29 15:02:39 +08:00
Yintong Lu 04516eabc8 [XPU] add gelu_tanh to xpu moe backend supported activations (#42822)
Signed-off-by: yintong-lu <yintong.lu@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2026-05-29 14:37:20 +08:00
Kevin H. Luu 648c3ebee6 [CI] Separate non-root smoke tests from image build step (#43712)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-28 23:34:16 -07:00
Chris Leonard 22a58640b4 [9/n] Migrate attention and cache kernels to torch stable ABI (continued) (#43717)
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
2026-05-29 04:44:45 +00:00
Wentao Ye 710f077617 [Refactor] Remove dead code (#43234)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2026-05-29 00:29:56 -04:00
Itay Etelis d63108fb18 [kv_offload] Skip decode-phase blocks in CPU offload (#43797)
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
2026-05-29 06:39:43 +03:00
Qiming Zhang 9636709372 [XPU] add scale transpose to prepare_fp8_moe_layer_for_xpu and bump up kernels (#43277)
Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2026-05-29 03:22:51 +00:00
Weida Hong dfe8ba7c80 Adjust design around encoder_cudagraph_forward (#42288)
Signed-off-by: Weida Hong <wdhongtw@google.com>
2026-05-29 03:02:52 +00:00
Jared Wen 212deff2ec [feat] add GlmgaProcessor specific logits in glm4_1v.py (#43575)
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <Isotr0py@outlook.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <Isotr0py@outlook.com>
2026-05-29 02:56:02 +00:00
Woosuk Kwon 7bd45da585 [DSv4] Move mHC tilelang kernels & Don't use CustomOP in dsv4/nvidia (#43905)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
2026-05-29 10:25:02 +08:00
Vadim Gimpelson bf18d7e0b4 [Misc][NUMA] Auto-bind to PCT priority cores on DGX B300 + widen EngineCore across shard NUMA nodes (#43270)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Co-authored-by: Cursor <noreply@cursor.com>
2026-05-29 10:07:44 +08:00
Bugen Zhao 1521173c17 [Rust Frontend] Add /version endpoint using engine-reported value (#43854)
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
2026-05-29 00:32:27 +00:00
ltd0924 b690b2bb67 [Model]Support Step-3.7-Flash (#43859)
Signed-off-by: luotingdan <luotingdan@stepfun.com>
Signed-off-by: Isotr0py <Isotr0py@outlook.com>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Co-authored-by: luotingdan <luotingdan@stepfun.com>
Co-authored-by: Isotr0py <Isotr0py@outlook.com>
Co-authored-by: Yu Huang <yuhuang@nvidia.com>
Co-authored-by: Jee Jee Li <jeejeelee@inferact.ai>
2026-05-28 17:01:48 -07:00
yzong-rh 325a1ec4fb [CI] Enable prefix caching in BFCL benchmark (#43925)
Signed-off-by: Yifan Zong <yzong@redhat.com>
2026-05-28 23:36:31 +00:00
Harshal Janjani 69c9f19957 fix(frontend): Add multimodal placeholders to Gemma4 tool message template (#41459)
Signed-off-by: Harshal Janjani <harshaljanjani@gmail.com>
Co-authored-by: Ben Browning <bbrownin@redhat.com>
2026-05-28 14:48:12 -07:00
rasmith 9769e2df2a [AMD][CI][BugFix] Fix Distributed Compile Unit Tests (2xH100-2xMI300) group (#43120)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
2026-05-28 14:39:01 -07:00
Michael Goin 03f03f9630 Refactor output filename handling in ci-fetch-log.sh (#43901)
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2026-05-28 14:20:12 -07:00
Benjamin Chislett 9202ea6fda [Spec Decode] Allow causal DFlash (#43445)
2026-05-28 21:18:44 +00:00
Woosuk Kwon 69b8956dcd [Model Refactoring] Remove unncessary torch op registration for DSv4 (#43891)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
2026-05-28 14:04:55 -07:00
Ronen Schaffer a3ed5ab10c [KV Offload] Add per-request offloading policy via on_new_request lifecycle hook (#43205)
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Co-authored-by: Or Ozeri <or@ozery.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2026-05-28 20:45:18 +00:00
Nick Hill 7e53283b1c [Core] Cleanup KVConnector handling with PP + fix MRV2 (#43732)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2026-05-28 13:12:03 -07:00
Raj Joshi 9090368b65 [Feat] Add support for per GPU worker RDMA NIC selection (#42083)
Signed-off-by: Raj Joshi <rajjoshi@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-28 12:45:23 -07:00
Harry Mellor 085ac221a3 Deprecate JAISLMHeadModel (#43784)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-05-28 18:29:12 +00:00
Hua Huang 9006204e90 [MM][CG] Avoid over-padding Qwen2.5-VL encoder cudagraph window metadata (#42796)
Signed-off-by: Hua Huang <huah@nvidia.com>
2026-05-28 11:22:56 -07:00
JohnQinAMD ed7fe831da [ROCm] Enable the aiter top-k/top-p sampler by default (#43331)
Signed-off-by: John Qin <yanyuan.qin@amd.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
2026-05-28 13:19:59 -05:00
Nicolò Lucchesi 5b115bb8a3 [Attention][AMD] Standardize kv layout to blocks first for AMD (#43660)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-05-28 12:28:50 -05:00
Mike G 53a2088675 Allow native KV cache dtype in Triton cache update (#43330)
Signed-off-by: Michael Gschwind <mgschwind@nvidia.com>
Co-authored-by: Michael Gschwind <mgschwind@nvidia.com>
2026-05-28 16:51:40 +00:00
Chao-Ju Chen 099024762c [Rust Frontend] Optimize multimodal prompt expansion (#43670)
Signed-off-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com>
2026-05-28 09:46:18 -07:00
MaciejBalaNV 9aa131f944 Add Cosmos3 Reasoner model (#43356)
Signed-off-by: Maciej Bala <mbala@nvidia.com>
Signed-off-by: MaciejBalaNV <mbala@nvidia.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2026-05-28 09:43:55 -07:00
Micah Williamson 1b5437cec8 [ROCm] Bump ROCm to 7.2.3 (#43136)
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
2026-05-28 09:42:43 -07:00
Jason Elie Bou Kheir 3207e7680e [XPU][MoE] Add WNA16 oracle backend for GPTQ sym-int4 (xpu_fused_moe) (#41426)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2026-05-28 16:30:48 +00:00
Matthias Gehre a9ec46d4b7 [ROCm][Perf] Support N=5 in wvSplitK skinny GEMM kernels for speculative decoding (#40687)
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
2026-05-28 16:28:21 +00:00
Ronen Schaffer 4bfa0f2b14 [KV Offload] Rename SecondaryTierManager.get_finished() to get_finished_jobs() (#43870)
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
2026-05-28 16:00:18 +00:00
Vadim Gimpelson 5d126dd155 [Bugfix] Exclude Ray DP from #42585's deferred port allocation (#43864)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2026-05-28 15:55:14 +00:00
Majid c08ebebf30 [Perf] Add do_not_specialize to Mamba SSD chunk kernels (#43803)
Signed-off-by: Majid Taheri Andani <tahemaji@amazon.com>
Co-authored-by: Majid Taheri Andani <tahemaji@amazon.com>
2026-05-28 15:40:02 +00:00
Wentao Ye be4062fd6c [Bug] Fix tests/distributed/test_elastic_ep.py - assert False (#43813)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-05-28 11:00:56 -04:00
Will.hou 577d693838 [rust] fix: aggregate is_sleeping and reset_prefix_cache across DP engines (#43429)
Signed-off-by: Will.hou <1205157517@qq.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 07:56:56 -07:00
Bugen Zhao 61a1e30473 [Rust Frontend] Reduce Gemma4 tool parser args scan complexity (#43850)
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
2026-05-28 14:52:29 +00:00
Bugen Zhao 3a282230ee [Rust Frontend] Add hy_v3 tool parser (#43872)
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
2026-05-28 14:42:47 +00:00
Li, Jiang 20d69d100a [CPU] Migrate cpu_awq into awq_marlin (#43841)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2026-05-28 22:36:31 +08:00
Simon Danielsson 552eb81918 [Bugfix][ROCm] Resolve MoRI connector hangs at high concurrency (#40344)
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
2026-05-28 14:30:21 +00:00
Woosuk Kwon 9957e4d240 [Model Refactoring] Remove torch compile dependency in DSv4 (#43746)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
2026-05-28 14:26:25 +00:00
Angelo Ruocco 864990e8d9 Add token-offset based selective offload in OffloadConnector (#39983)
Signed-off-by: Angelo Ruocco <ang@zurich.ibm.com>
Co-authored-by: Or Ozeri <or@ozery.com>
2026-05-28 14:11:02 +00:00
zexplorerhj f3b2a819f7 [Perf][KDA] Fuse gate softplus, chunk-local cumsum, and RCP_LN2 scaling (#43667)
Signed-off-by: haojiangzheng <justineric096@gmail.com>
Co-authored-by: haojiangzheng <justineric096@gmail.com>
2026-05-28 13:47:08 +00:00
Wentao Ye 64e1218673 [Perf] Optimize moe permute by pre-allocate buffer, 9~14% kernel performance improvement (#43014)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-05-28 06:18:26 -07:00
Julien Denize 02606b0b09 [BUGFIX] Multimodal benchmark with MistralTokenizer (#42965)
Signed-off-by: juliendenize <julien.denize@mistral.ai>
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
2026-05-28 05:36:24 -07:00
Harry Mellor 19af4e6dd4 Fix OlmoHybridForCausalLM not initialising (#43846)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2026-05-28 05:33:31 -07:00
omerpaz95 811d805195 [EC Connector] Add shutdown API to EC Connector. (#42423)
Signed-off-by: omerpaz95 <omerpaz95@gmail.com>
2026-05-28 12:28:01 +00:00