9451 Commits

Author SHA1 Message Date
Winston Ma f8c0a19d46 vulkan: Removed unused functions (#23175) b9451 2026-06-01 11:46:23 +02:00
Aldehir Rojas 5254a7994d common : support manually triggering the reasoning budget end sequence (#23949) 2026-06-01 11:37:11 +02:00
Georgi Gerganov e22b0de60d ci : add missing Linux label to cpu-x64-high-perf runner (#23958)
Fixes: https://github.com/ggml-org/llama.cpp/pull/23927#discussion_r3332213086

The cpu-x64-high-perf job was missing the Linux label in its runs-on
specification, causing the runner to not be discovered. All other
self-hosted Linux jobs include this label.

Assisted-by: llama.cpp:local pi
2026-06-01 10:39:59 +03:00
Neo Zhang a51142497a [SYCL] Support Q4_1, Q5_0, Q5_1 in Flash-attention (#23812)
* support Q4_1, Q5_0, Q5_1

* update ut case
2026-06-01 09:53:53 +03:00
Neo Zhang 4162522688 [SYCL] Add more types in GET_ROWS OP (#23710)
* add to support Q1_0, NVFP4, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ1_S, IQ1_M, IQ3_S, IQ4_NL, IQ4_XS, I32, MXFP4, Q2_K, Q3_K, Q5_K, and Q6_K in GET_ROWS OP

* correct the link
2026-06-01 09:53:04 +03:00
Neo Zhang 44e211cecf sycl : Optimize Q3_K mul_mat by reorder (#23725) 2026-06-01 09:50:55 +03:00
Eve af6528e6df ci: remove redundant or duplicate jobs (#23927)
* remove redundant apple job

openvino gpu and cpu test can share the same build and machine

Update build-rpc.yml

Update build-openvino.yml

cpu any doesnt make sense as we have an arm job already, so do high perf on both x86 and arm

remove duplicate x86 vulkan

combine backend sampling

Update server.yml

run server on arm as windows is x86

* emdawn on one machine only

* fix openvino, remove cpu tag as we dont have many x64 machines with that tag
b9445
2026-06-01 06:32:17 +03:00
Eric Zhang 6f165c1c64 server : handle If-None-Match weak ETags (#23916) b9444 2026-05-31 16:21:08 -05:00
Georgi Gerganov 399739d5c5 ci : limit trigger paths for the CPU workflow (#23938) 2026-05-31 19:02:47 +03:00
o7si d4c8e2c29c vocab : add tokenizer support for jina-embeddings-v2-base-zh (#18756)
* vocab : add jina-embeddings-v2-base-zh (whitespace tokenizer)

* lowercase defaults to true

* type fix

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9442
2026-05-31 12:37:35 +02:00
Eric Zhang 3292da09f6 ui: fix ETag truncation with MSVC compiler (#23917) b9441 2026-05-31 11:21:23 +02:00
Vladislav e6123e2080 docs : update ZenDNN docs for Q8 support (#23791)
* docs zendnn added information about Q8 support

* docs zendnn rm unnecessary data

* docs update, links to ZenDNN docs provided

* docs zenDNN update: clarified explanation

* docs zenDNN update: one more explanation clarified

---------

Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>
2026-05-31 10:26:42 +02:00
Ruben Ortlam 22cadc1944 llama: only use one iGPU device by default (#23897) b9439 2026-05-31 08:17:47 +02:00
Pascal d749821db3 webui: add custom CSS injection via config (#23904)
* webui: add custom CSS injection via config

register a customCSS setting in the Developer section under Custom JSON,
syncable so it rides the existing ui-config pass through. inject the value
into a single style element in the head, reactive on the setting. lets an
operator theme a prebuilt binary through --ui-config without rebuilding,
and lets a user set it from the settings panel.

* ui: address review from @niutech and @allozaur, rename custom JSON key and CSS field

* ui: address review from @allozaur, move custom CSS injection to a style tag in svelte:head

* ui: inject custom CSS through a svelte action instead of a bound element

move the textContent write into a use: action on the head style node.
the action is the idiomatic way to touch a node, so the no-dom-manipulating
lint rule is satisfied without a disable. value stays text through
textContent, never parsed as HTML.

* Update tools/ui/src/lib/constants/settings-keys.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* ui: address review from @allozaur, rename custom config key to customJson with migration

rename the custom config key to customJson across the type, the chat
request builder, the settings save check and the custom tools reader,
keeping the custom API param name unchanged. add a non destructive
migration that copies the legacy custom key to customJson at startup.
only render the head style tag when custom CSS is set.

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
b9438
2026-05-30 23:49:31 +02:00
Gaurav Garg aa46bda89b Support -fa auto in llama-bench (#23714)
* Support `-fa auto` in llama-bench

Make the default value of `-ngl` -1, similar to other tools.

Update README with latest usage and examples

* Address review comments
b9437
2026-05-31 02:03:57 +05:30
lhez d6588daa80 opencl: support bf16 by converting to f16 (#23839) b9436 2026-05-30 10:17:47 -07:00
Pascal d38d50e7ff ui: exclude generated build dirs from prettier and eslint so lint errors stop being masked (#23910) 2026-05-30 16:50:54 +02:00
Johannes Gäßler 8b0e0db606 TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs (#23843)
* TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs

* fix afmoe TP
b9434
2026-05-30 16:48:00 +03:00
Georgi Gerganov 2d9b7c8e98 metal : restore im2col implementation for large kernels (#23901) b9433 2026-05-30 15:26:13 +03:00
Xuan-Son Nguyen e674b1279b test: (test-llama-archs) log the config name first (#23885) b9432 2026-05-30 12:22:38 +02:00
Georgi Gerganov 4c4e91b799 ci : update ios-xcode release job to macos-26 (#23906)
* ci : disable libcommon build from xcframework

* ocd : fix name

* ci : ios-xcode change to macos-26

* cont : pin xcode

* cont : pin xcode to minor version
b9431
2026-05-30 13:21:46 +03:00
Jinyang He d48a56effb ggml : add some lsx support (#23798)
* loongarch : optimize LSX fp16 load/store with native intrinsics

Use __lsx_vfcvtl_s_h and __lsx_vfcvt_h_s instead of scalar loops in
__lsx_f16x4_load and __lsx_f16x4_store.

* loongarch : add LSX implementation for q8_0 dot product

* loongarch : add LSX implementation for q6_K dot product

* loongarch : add LSX implementation for iq4_xs dot product

* Improve reduce ops when sun int16 pairs to int32
b9430
2026-05-30 11:53:26 +03:00
Ruben Ortlam 6e093b80ea vulkan: add Flash Attention support for BFloat16 KV cache (#23420)
* vulkan: add flash attention bf16 kv support

* vulkan: bf16 FA coopmat1 support

* vulkan: bf16 FA coopmat2 support

* fix FA bf16 f32 fallback

* fix FA bf16 coopmat1 shader

* fix FA bf16 coopmat2 shader

* code cleanup

* cleanup comment change

* address feedback

* add O_TYPE for cm2 FA

* use O_TYPE for gqaStore function

* reduce BFLOAT16 ifdefs
2026-05-30 10:39:31 +02:00
Georgi Gerganov 337528571d ci : fix s390x release job (#23898)
* ci : fix s390x release job

* ci : multi-thread build for `ios-xcode`

* ocd : names
b9428
2026-05-30 09:21:38 +03:00
Georgi Gerganov d4204b03a5 ci : clear cache instead of "no timestamp" keys + fix macos (#23895)
* ci : ios use macos-15 again

* ci : add and test ccache-clear

* cont : fix

* cont : set permission

* cont : another permission

* cont : token

* cont : print key

* cont : bring back perms

* cont : test windows

* cont : add token

* cont : cleanup

* ci : make release jobs clean-up their ccache
2026-05-30 08:52:30 +03:00
Radoslav Gerganov 1738129bee llama : do not skip iGPU when only RPC devices are present (#23868)
After #23007 reclassified integrated CUDA/HIP devices as IGPU, the device
selection logic dropped the local iGPU whenever any RPC server was added,
because RPC devices made `model->devices` non-empty. On systems where the
"iGPU" is the main compute device (e.g. Strix Halo with 128 GiB of unified
memory), this caused all tensors to be allocated on the RPC peer alone and
model loading to fail.

Gate the iGPU inclusion on `gpus.empty()` instead, so RPC peers no longer
suppress the local iGPU.

closes: #23858
b9426
2026-05-30 07:48:22 +03:00
Xuan-Son Nguyen 0821c5fcfd server: in SSE mode, send HTTP headers when slot starts (#23884)
* server: in SSE mode, send HTTP headers when slot starts

* ref to pr

* stream should be false by default
2026-05-30 00:06:29 +02:00
Reese Levine 151f3a98e9 ggml-webgpu: Check earlier for WebGPU required features (#23879) 2026-05-29 14:16:05 -07:00
Reese Levine b22da25889 ggml-webgpu: add q4_0/q8_0 SET_ROWS (#23760)
* Add q8_0 and q4_0 set_rows

* Add fast(er) quantization set_rows path

* formatting/naming

* a little more naming

* Remove unused constant

* Don't override other override

* Avoid bitcast

* Narrow relaxation
2026-05-29 14:14:11 -07:00
Ruixiang Wang 689a9a470e server-bench : add speed-bench for speculative decoding benchmarking (#23869)
* spec: add speed-bench support for benchmarking

* speed-bench : add trailing newline to requirements.txt

* speed-bench : bump datasets to 4.8.0 to fix ty check

* server-bench : remove now-unused type: ignore after datasets bump
2026-05-29 23:09:47 +02:00
Pascal 5a46b46acd app: add llama update self updater (#23865)
* wip: llama update POC

* cleaning: llama update

* llama-gen-docs

* app: delegate llama update to the install script

* app: spawn the installer detached so llama update can replace a running binary

* cleaning: inline llama update into llama.cpp, drop app-update.{cpp,h}

* app: make llama_update static

Address review from @angt
2026-05-29 23:02:40 +02:00
ValdikSS 22d66b567e ui: handle audio/vnd.wave as audio WAV file (#23754)
Firefox on Linux uses this MIME type
2026-05-29 21:41:35 +02:00
Tarek Dakhran 2084434e66 vocab : support tokenizer for LFM2.5-8B-A1B (#23826)
* vocab: Support tokenizer for LFM2.5-8B-A1B

* Keep liquid6 tokenizer in models
2026-05-29 20:25:43 +02:00
Sigbjørn Skjæret 764f1e64a1 graph : ensure DS32 kq_mask_lid is F32 (#23864) 2026-05-29 19:55:14 +02:00
Xuan-Son Nguyen b5f52280fb server: remove obsolete scripts (#23870) 2026-05-29 19:47:30 +02:00
Georgi Gerganov dc71236b6c ci : update macos release to use macos-26 runner (#23878) 2026-05-29 20:41:57 +03:00
Xuan-Son Nguyen 06d26dfdff download: add option to skip_download (#23059)
* download: add option to skip_download

* fix

* fix 2

* if file doesn't exist, respect skip_download flag
b9415
2026-05-29 16:30:55 +02:00
Saba Fallah da3f990a47 mtmd: Add DeepSeekOCR 2 Support (#20975)
* mtmd: DeepSeek-OCR 2 support, with multi-tile dynamic resolution

* introduced clip_image_f32::add_viewsep

* address PR review

- drop redundant ggml_cpy ops in both deepseekocr versions build
- drop no-op ggml_cont in build_sam
- assert num_image_tokens deepseekocr2
- view_seperator as (1, n_embd) at conversion (for both versions)
- drop redundant ggml_reshape_2d

* Update tools/mtmd/models/deepseekocr2.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b9414
2026-05-29 16:13:51 +02:00
Oliver Simons 6ed481eea4 CUDA: Check PTX version on host side to guard PDL dispatch (#23530)
* CUDA: Check PTX version on host side to guard PDL dispatch

Checking on `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this
variable doesn't differentiate between compiling for say sm_90, sm_90a
or sm_90f (so forward-jittable PTX vs. arch/family-specific PTX).

Thus, one can have a bug when compiling with
`DCMAKE_CUDA_ARCHITECTURES="89;90a"`, where current code would wrongly
dispatch to PDL on sm_90/sm_120 in forward-JIT mode.

This PR fixes this issue by checking `cudaFuncAttributes::ptxVersion` of
the incoming kernel at runtime. A check on ptxVersion alone is
sufficient, as device-codes will always be >= ptxVersion (and any
violation of this would be a severe bug in CUDA/nvcc), see:
 https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code

* Implement MurmurHash3 mixer for better hash distribution

Magic constants were taken from boost:
https://github.com/boostorg/container_hash/blob/2698b43803c012601e6bb1a6116e83767b97986c/include/boost/container_hash/detail/hash_mix.hpp#L19-L65

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments, make seed non-zero

* Apply code-formatting

* Replace std::size_t -> size_t for consistency

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b9413
2026-05-29 12:28:18 +02:00
Xuan-Son Nguyen cb47092b00 server: bump timeout to 3600s (#23842)
* server: bump timeout to 3600s

* nits: change wording
b9412
2026-05-29 10:23:17 +02:00
fairydreaming 1f0aa2a696 model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (#23346)
* llama : support DeepSeek V3.2 model family (with DSA lightning indexer)

* convert : handle DeepseekV32ForCausalLM architecture

* ggml : support for f16 GGML_OP_FILL

* memory : separate hparams argument in llama_kv_cache constructor

* memory : add llama_kv_cache_dsa memory (KV cache + lightning indexer cache)

* llama : support for LLM_ARCH_DEEPSEEK32

* model : llama_model_deepseek32 implementation

* model : merge two scale operations into one in DSA lightning indexer implementation

* chore : remove unused code

* model : support NVFP4 in DeepSeek V3.2

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* memory : refactoring TODO

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
b9411
2026-05-29 10:15:17 +02:00
Aman Gupta 031ddb2e08 llama: use f16 mask for FA to save VRAM (#23764)
* llama: use f16 mask for FA

* review: add llama_cast + formatting

* simplify
b9410
2026-05-29 15:44:43 +08:00
Georgi Gerganov fe12e422ad sync : ggml b9409 2026-05-29 09:56:08 +03:00
Georgi Gerganov ea02bc37f5 ggml : bump version to 0.13.1 (ggml/1523) 2026-05-29 09:56:08 +03:00
Omid Azizi b000431a0b ngram-mod : Add missing include (#23857)
[no release]

Signed-off-by: Omid Azizi <oazizi@gimletlabs.ai>
2026-05-29 09:21:37 +03:00
Aman Gupta eef59a7642 llama: add llm_graph_input_mtp (#23643)
* llama: add llm_graph_input_mtp

* rename input_mtp -> input_token_embd

* add TODO about mtmd embedding

* cont : clean-up

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b9406
2026-05-29 09:17:32 +03:00
Adrien Gallouët 98e480a32e app : move licences to llama-app (#23824)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b9405
2026-05-29 07:46:11 +02:00
Andreas Kieslinger 241cbd41d2 cuda : disables launch_fattn PDL enrollment due to compiler bug (#23825) b9404 2026-05-29 07:46:10 +03:00
Matt Corallo 33c718db1f meta : Add missing buffer set in allreduce fallback !COMPUTE clear (#23480)
Without this at least the vulkan backend will skip the `* 0` for
!COMPUTE tensors, causing corrupt output.
b9403
2026-05-29 06:30:24 +03:00
Max Krasnyansky 19e92c33ef hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion (#23835)
Updating infra to enable op fusion and using RMS_NORM+MUL as the use-case.
b9402
2026-05-28 14:05:54 -07:00