9334 Commits

Author SHA1 Message Date
Johannes Gäßler 192d8ae8b8 CUDA: missing PDL sync for FWHT, better fallback (#23690) b9334 2026-05-26 11:05:51 +08:00
forforever73 35c9b1f39e metal : add apple device id (#23566)
Co-authored-by: lvyichen <lvyichen@stepfun.com>
b9333
2026-05-25 21:05:16 +03:00
Max Krasnyansky 4bead4e30d snapdragon: bump toolchain docker to v0.7 to fix ui build issues (#23680) 2026-05-25 10:57:43 -07:00
Georgi Gerganov 302e2c2652 ci : reduce PR jobs by matching backend paths (#23675)
* ci : disable SYCL f16 builds

* ci : extract android and hip into separate workflows

* ci : move webgpu to separate workflow

* ci : move the rpc to a separate workflow

* ci : extract s309x and ppcl jobs

* ci : extract opencl job into a separate workflow
b9331
2026-05-25 20:54:54 +03:00
Pascal 328874d054 model: tag ffn_latent as MUL_MAT to fix buft probe (#23664)
ffn_latent_down/up are declared GGML_OP_MUL in LLM_TENSOR_INFOS but
nemotron-h feeds them through ggml_mul_mat. The loader buft probe asks
the backend about the declared op, so it tested an elementwise MUL on a
q8_0 weight. That used to return true unconditionally and the weight
stayed on GPU by luck. Once supports_op told the truth, the probe got a
no and the loader pushed the weight and its matmul to CPU, splitting the
graph. Tagging it MUL_MAT asks the real question, the math is unchanged.

Verified on Nemotron 3 Super 120B Q5_K_M: from 64.9 back to 103.22 t/s.
b9330
2026-05-25 16:05:04 +02:00
Aman Gupta c1f1e28d29 CUDA: add fast walsh-hadamard transform (#23615)
* CUDA: add fast walsh-hadamard transform

* review: add unrolls + change size_t -> int

* warp size 64

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b9329
2026-05-25 21:12:10 +08:00
Pascal 5a4126adc1 ui: fix stop/continue during an agentic loop (#23356) 2026-05-25 14:18:59 +02:00
Michael Wand a4d2d4ae41 convert : add compressed-tensors NVFP4 support (#21095)
* Refactored Compressed Tensors NVFP4 support for new base.py

* Support compressed-tensors NVFP4 conversion

* Moved Qwen MTP remap into filter_tensors

* simplify

* pathlib no longer used

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-25 14:16:11 +02:00
Georgi Gerganov d161ea7071 sync : ggml b9326 2026-05-25 12:43:27 +03:00
Georgi Gerganov 45158f460e ggml : bump version to 0.13.0 (ggml/1510) 2026-05-25 12:43:27 +03:00
Georgi Gerganov 22307b3e8b sync : ggml 2026-05-25 12:38:01 +03:00
Georgi Gerganov ce5890b5f7 ggml : bump version to 0.12.1 (ggml/1508) 2026-05-25 12:38:01 +03:00
Ori Pekelman b251f74f49 ggml.h: correct ggml_silu_back arg docstring (a=dy, b=x) (ggml/1500) 2026-05-25 12:38:01 +03:00
Dev-X25874 fa97041524 ggml-alloc: fix out-of-bounds read in ggml_dyn_tallocr_remove_block (ggml/1492) 2026-05-25 12:38:01 +03:00
Johannes Gäßler ae251b5ff2 TP: fix ggml context size calculation (#22616)
* TP: fix ggml context size calculation, memory leak

* move split state cache back into the context

* revert to constant ggml context size for cgraphs

* increase headroom for statically allocated tensors

* remove obsolete include
b9320
2026-05-25 12:37:25 +03:00
Gilad S. 66efd13375 ggml: gguf_init_from_callback and gguf_init_from_buffer (#22341)
* ggml: implement `gguf_init_from_buffer`

* test: `gguf_init_from_buffer`

* fix: memory breakdown for a model loaded with `no_alloc` from a file is consistent with being loaded from a buffer

* fix: use `GGML_UNUSED`

Co-authored-by: Copilot <copilot@github.com>

* fix: remove `total_size` from `gguf_reader`

* fix: file offset calculation, rename `offset` to `data_offset`

Co-authored-by: Copilot <copilot@github.com>

* refactor: extract model loader bug fixes to another PR

* feat: add `gguf_init_from_callback`

* fix: always require a max expected size

* fix: change `gguf_reader_callback_t`'s `output` type to `void *`, change `max_expected_size` and offsets to `uint64_t`

* fix: harden against offset overflow in buffer read

* fix: remove seek behavior from the callback

* feat: `max_chunk_read == 0` means `SIZE_MAX`

* fix: seeking in a gguf file with no tensors

---------

Co-authored-by: Copilot <copilot@github.com>
b9319
2026-05-25 11:33:29 +02:00
Aman Gupta 6c4cbdc70b server: MTP layer kv-cache should respect draft type ctk (#23646) b9318 2026-05-25 16:46:23 +08:00
alex-spacemit 5fdf07e33b ci : update spacemit toolchain url and enhance curl command (#23642)
* fix(action): update SpacemiT toolchain URL and version

Change-Id: If4cc1c738a855274103f8c3ad52daa33528acd0c

* fix(action): add -L flag to curl command for URL redirection

Change-Id: I9b6c37390f0c7a733a36308c8fb53d22d234ab06
2026-05-25 10:43:24 +02:00
Sigbjørn Skjæret 062d3115aa ci : fix pre-tokenizer-hashes check (#23651) 2026-05-25 10:41:25 +02:00
Tim Neumann 314e729347 llama : document that only one on-device state can be saved per sequence (#23520) b9315 2026-05-25 10:29:28 +03:00
Aldehir Rojas d55fb97174 ci : install host compiler on android-ndk build (#23630) 2026-05-25 10:18:08 +03:00
Jeff Bolz 826539ce59 ggml : Parallelize quant LUT init (#23595)
- Use OpenMP to parallelize iq2xs_init_impl and iq3xs_init_impl.
- Move the OpenMP detection from ggml-cpu to ggml-base.
- Update OpenMP dependencies in ggml-config.cmake.in.
b9313
2026-05-25 10:15:46 +03:00
Saba Fallah b96487645c ui: media attachments before text (#23467)
* ui: media attachments before text

* fix prettier formatting
2026-05-25 08:50:41 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO) 9627d0f540 vendor : update cpp-httplib to 0.45.1 (#23639) b9311 2026-05-25 09:45:22 +03:00
jacekpoplawski e2ef8fe42c server: fix checkpoints creation (#22929)
* common : add common_chat_split_by_role

* cont : fix spans to reach end of message

* server: fix checkpoints creation

- extract message_spans from chat templates
- find the prompt token position before the latest user message
- split prompt batching at that position
- create a context checkpoint before the latest user input
- avoid periodic mid-prompt checkpoints when that position is known
- handle multimodal prompts when mapping text/template positions to server prompt tokens
- add --checkpoint-min-step to control minimum spacing between checkpoints

* cont : clean-up

* Support autoparser detection for message barriers

* server: fix message span delimiter and update docs

---------

Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
b9310
2026-05-25 08:56:18 +03:00
fairydreaming 6d57c26ef8 perplexity : fix even more integer overflows (#23623)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
b9309
2026-05-25 08:12:39 +03:00
Georgi Gerganov 28123a3937 ci : move most slim jobs to self-hosted runners (#23619)
* ci : remove tag from build-self-hosted.yml

* ci : slim -> self-hosted

* ci : prevent heavy CPU jobs from running on fast runners

* ci : prevent cmake pkg to run on dedicated fast runners

* ci : try to bump 3.11 -> 3.13

* ci : move lint back to 3.11

* ci : back to 3.11

* ci : add comment about UI jobs

* ci : move python requirements check to CPU runners

this job is a bit slow for a dedicated "fast" runner

* ci : add self-hosted ui workflow

* ci : fix UI naming

* tmp to check if arm64 fast is compatible with all jobs

* revert last commit
2026-05-25 08:11:19 +03:00
Georgi Gerganov 549b9d8433 ci : update build-self-hosted.yml (#23616) 2026-05-24 18:20:10 +03:00
Sigbjørn Skjæret 5d246a792d convert : minor fixes for numpy 2.x (#23571) 2026-05-24 09:51:31 +02:00
Aldehir Rojas 63248fc3e3 cmake : fix ui build (#23592)
* cmake/ui : add -fPIC to llama-ui static lib

* cmake : rename host compiled embed helper
b9305
2026-05-24 02:37:28 -05:00
Aman Gupta 83eebe9d08 server: add margin for draft model for fit (#23485) 2026-05-24 14:43:08 +08:00
Johannes Gäßler fff63b5108 TP: fix entirely zero-sized slices per device (#23525) 2026-05-24 08:19:33 +02:00
shaofeiqi f3061116ff opencl: batch profiling to improve speed and prevent memory leaks (#23495) 2026-05-23 23:11:43 -07:00
Yiwei Shao 1c0f6db545 hexagon: apply repl optimization in flash attn softmax as #22993 (#23455) b9301 2026-05-23 19:56:59 -07:00
Aparna M P cec51c7a7d snapdragon: update windows toolchain to use hsdk v6.6.0.0 (#23552) 2026-05-23 19:56:41 -07:00
Aldehir Rojas b22ff4b7b4 cmake/ui : refactor the build (#23352) 2026-05-23 17:08:22 -04:00
Aditya Singh c0c7e147e7 requirements : bump torch to 2.11.0 (#23503)
* requirements: relax torch~=2.6.0 to torch>=2.6.0 for convert_hf_to_gguf

The ~=2.6.0 operator resolves to >=2.6.0, <2.7.0, which fails on
PyPI for platform/CPython combinations where 2.6.x is not present.
The accompanying comment already says 'PyTorch 2.6.0 or later', so
the looser >=2.6.0 matches the documented intent and unblocks
pip install -r requirements/requirements-convert_hf_to_gguf.txt.

Fixes #23408

* requirements: bump torch floor to 2.11.0 per maintainer

* requirements: pin torch to ==2.11.0 per project policy

* requirements: pin mtmd torch and torchvision to 2.11.0/0.26.0 per project policy

* requirements: suppress check_requirements pin warning on mtmd

The check_requirements script flags '==' on lines in files matched by
*/**/requirements*.txt. Append the documented suppression comment to the
pinned torch and torchvision lines (and to the s390x platform marker lines)
so the check passes while keeping the pins required by project policy.

* ty: silence Tensor/Module union check on model[0].auto_model

With torch 2.11.0 stubs, nn.Sequential.__getitem__ now returns
Tensor | Module rather than Module, so model[0].auto_model fails ty
on the SentenceTransformer code path. The runtime behavior is
unchanged because SentenceTransformer always wraps a Module at
index 0. Adding a targeted unresolved-attribute ignore keeps the
type-check green without altering behavior. A follow-up issue
tracks typing the variable explicitly.
2026-05-23 18:24:39 +02:00
Michael Wand b0df4c0cfd model : add NVFP4 MTP scale tensors (#23563)
* Add NVFP4 MTP scale tensors

* Link Qwen3.5 MTP tensors

* Aligned nullptr
b9297
2026-05-23 13:30:31 +02:00
dskwe a497476330 ggml : Check the right iface method before using the fallback 2d get (#23514) b9296 2026-05-23 12:49:24 +02:00
Jeff Bolz 95405ac65f vulkan: fix windows find_package of SPIRV-Headers (#23215)
* vulkan: fix windows find_package of SPIRV-Headers

* not windows-only
b9295
2026-05-23 09:44:46 +02:00
Shawn Gu 0f3cb3fc8b opencl: generalize Adreno MoE kernels on M (#23449) b9294 2026-05-22 17:08:41 -07:00
Aldehir Rojas 1acee6bf89 server: only parse empty msg if continuing an assistant msg (#23506) 2026-05-22 11:58:15 -04:00
fairydreaming ef570f6308 perplexity : fix integer overflow (#23496)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
b9292
2026-05-22 15:50:44 +03:00
Alexey Kopytko cc9e331213 SYCL: improve MoE prefill throughput (#23142)
- change `k_copy_src1_to_contiguous` so that uses a precomputed contiguous mapping where all rows "owned" by an expert are in one slice with a know starts and ends
- switch the `O(n_as * n_routed_rows)` contraption to a counting sort-based procedure with `O(n_as + n_routed_rows)` complexity
b9291
2026-05-22 15:50:17 +03:00
Alexey Kopytko bcfd1989e9 sycl : Level Zero detection in ggml_sycl_init (#23097)
* [SYCL] Centralize Level Zero detection in ggml_sycl_init

* use the same wording

* get back the warning
b9290
2026-05-22 15:49:45 +03:00
karavayev 56f16f235c SYCL : gated_delta_net K>1 (#23174)
* sycl_gated_delta_net K>1

* editor_config
b9289
2026-05-22 15:48:56 +03:00
Katostrofik 8cc67efcd4 SYCL: add BF16 to DMMV kernel path (~4x tg speedup on Intel Arc) (#21580)
* SYCL: add BF16 to DMMV kernel path for ~4x token generation speedup

BF16 models had no dedicated token generation kernel — they fell through
to the generic full-GEMM path, resulting in ~14% memory bandwidth
utilization on Intel Arc GPUs. This adds BF16 support to the DMMV
(dequantize mul-mat-vec) path, matching the existing F16 implementation.

Fixes #20478

* SYCL: fix BF16 DMMV out-of-bounds when ncols % 64 != 0

The qk=1 kernel (used for F16 and BF16) iterates with stride
2*GGML_SYCL_DMMV_X (= 64 on Intel targets where WARP_SIZE=16). When
ncols is a multiple of DMMV_X (32) but not of 2*DMMV_X (64), the last
warp iteration accesses elements at col >= ncols, producing NaN for the
final row and wrong values for interior rows.

Fix: tighten can_use_dequantize_mul_mat_vec to require ne[0] %
(2*DMMV_X) == 0 for F16/BF16 types, and update the ASSERT in the BF16
launcher to match. Quantized types use block-structured kernels with
different access patterns and keep the existing DMMV_X check.

Verified: test-backend-ops MUL_MAT passes 913/913 on Intel Arc Pro B70.
Previously failing: m=128/129 n=1 k=1056 cases (NaN and ERR > 0.0005).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-22 15:48:24 +03:00
Jesus Talavera 95feeab52e docs: Update documentation with Granite 4.0/4.1 (#23404) 2026-05-22 20:35:46 +08:00
Sachin Sharma 99d4026b11 ggml-zendnn : add Q8_0 quantization support (#23414)
* ggml-zendnn : add Q8_0 quantization support

* ggml-zendnn : sync with latest ZenDNN

* ggml-zendnn : address review comments for Q8_0
b9286
2026-05-22 13:16:55 +02:00
fairydreaming 9c92e96a64 cmake : build router app only during standalone builds (#23521)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
b9285
2026-05-22 12:55:29 +03:00