9632 Commits

Author SHA1 Message Date
Sigbjørn Skjæret acd79d603c jinja : add count/d/e filter aliases (#24606) b9632 2026-06-14 15:07:31 +02:00
Michael Wand 6e14286eda cli : fix not copying preserved tokens (#24258) b9631 2026-06-14 11:52:15 +02:00
Bartowski 8ed274ef46 Add cohere2moe to llama-vocab for TINY_AYA (#24601) b9630 2026-06-14 09:04:46 +02:00
Sigbjørn Skjæret 46722116b9 ci : use CUDA label for cuda backend (#24594) 2026-06-14 08:27:52 +02:00
Sigbjørn Skjæret c2ba3e47a2 add sycl to check-release (#24583) b9628 2026-06-14 09:42:26 +08:00
Aldehir Rojas 53bd47ea5b ui : fix llama-ui-embed crash when no asset dir is given (#24597) b9627 2026-06-13 17:53:30 -05:00
Michael Wand 4988f6e866 Add arch support for cohere2-MoE (#24260)
* Add arch support for cohere2-MoE

* Removed redundant gating_func checks

* Changed ffn lookup to prefer prefix_dense_intermediate_size

* Renamed arch to cohere2moe

* Removed redundant lmhead check and chat template changes

* Removed lm_head.weight check from modify tensors, load output tensor not required, fallback to token_embd.weight

* Changed to (routed+shared)*0.5 for shared expert combined avg

* fixed sliding_window_pattern issue and pattern

* Fixed transformers crash 'first_k_dense_replace' error

* Remove comment

* Removed cohere2-moe as a tokenizer type and kept as tiny_aya.  Renamed North-Mini-Code-1.0.

* Fixed MTP fail, changed to use iSWA

* Fixed remaining todos: cohere2moe renamed, changed swa parsing to use get_key_or_arr, removed extra get_arr use

* Force metadata usage

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Remove Cohere2 checkpoint comment

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Remove MTP comment

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Regenerate cohere2moe tokenizer hash

* Add cohere2moe to Llama Model Saver supported list

* Check for zerobios tensors and add support for Command to use LayerNorm

* Map expert_selection_fn to sigmoid in base.py instead of command.py

* use bools for foundnorm/foundnormrms

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9626
2026-06-13 19:49:00 +02:00
Sigbjørn Skjæret f05cf4676a jinja : fix negative step slice with start/stop values (#24580) b9625 2026-06-13 18:28:40 +02:00
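Jinja slice expressions are normally expected to follow Python slice semantics, where a negative step changes both the defaults and the clamping bounds for `start` and `stop`. The following is a minimal reference sketch of those rules in Python. The function name `slice_like_python` is hypothetical, and the sketch makes no claim about how the C++ fix in this commit is written.

```python
def slice_like_python(seq, start=None, stop=None, step=None):
    """Re-implement seq[start:stop:step] by hand to show the index rules."""
    step = 1 if step is None else step
    if step == 0:
        raise ValueError("slice step cannot be zero")
    n = len(seq)

    # For a negative step the valid index range shifts down by one,
    # so that "stop = -1" can mean "run past the first element".
    lo, hi = (0, n) if step > 0 else (-1, n - 1)

    def adjust(index, default):
        if index is None:
            return default
        if index < 0:
            index += n          # negative indices count from the end
        return max(lo, min(hi, index))

    start = adjust(start, lo if step > 0 else hi)
    stop = adjust(stop, hi if step > 0 else lo)

    out = []
    i = start
    while (i < stop) if step > 0 else (i > stop):
        out.append(seq[i])
        i += step
    return out
```

For example, `slice_like_python("abcdef", 4, 1, -1)` walks indices 4, 3, 2 and yields `['e', 'd', 'c']`, matching `list("abcdef"[4:1:-1])`.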
Xuan-Son Nguyen e8067a8b36 ui: build-time gzip compression (#24571)
* ui: keep original file name and path

* fix nocache

* ui: build-time gzip compression
b9624
2026-06-13 16:57:27 +02:00
Sigbjørn Skjæret 341babcf73 jinja : fix split and replace with empty first arg (#24574)
* fix split and replace with empty first arg

* fix reserve size
b9623
2026-06-13 16:56:59 +02:00
Jeff Bolz 1a7718b4c5 vulkan: support non-contig unary/glu ops (#24215)
* vulkan: support non-contig unary/glu ops

Change unary/glu ops to pass in all strides and use fastdiv for the index
calculation. Put all unary ops in one file, similar to glu, to share the
code. codex went ahead and added expm1 without me asking, but I had to
make it do a real precision analysis rather than just making stuff up.

unary.comp initially couldn't use generic_unary_head because there wasn't
space for xielu's additional constants. Fixing this required packing the
fastdiv 'L' values.

* attempt to workaround compiler bug

* resolve conflict from #23991

* use expm1
b9622
2026-06-13 08:44:15 -05:00
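The commit above mentions using fastdiv for the index calculation. As background, fastdiv replaces division by a value fixed ahead of time with a multiply-high, an add and a shift, using two precomputed constants. The sketch below shows one common formulation for 32-bit operands, modelled with Python integers. The function names are hypothetical, and the sketch does not show how the shader packs the `L` values.

```python
def init_fastdiv_values(d):
    """Precompute (mp, L) for dividing 32-bit unsigned values by d >= 1."""
    if d < 1:
        raise ValueError("divisor must be positive")
    L = 0
    while L < 32 and (1 << L) < d:
        L += 1                      # L = ceil(log2(d))
    mp = ((1 << 32) * ((1 << L) - d)) // d + 1
    return mp, L

def fastdiv(n, mp, L):
    """Compute n // d using only multiply, add and shift."""
    hi = (n * mp) >> 32             # high 32 bits of the 64-bit product
    return (hi + n) >> L

def fastmod(n, d, mp, L):
    """Remainder derived from the fast quotient."""
    return n - fastdiv(n, mp, L) * d
```

A typical use is turning a flat element index into per-dimension coordinates: the constants are computed once per tensor shape on the host, and the inner loop then avoids integer division entirely.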
Xuan-Son Nguyen 597b6672e8 ui: keep original file name and path (#24568)
* ui: keep original file name and path

* fix nocache
b9621
2026-06-13 14:31:41 +02:00
Xuan-Son Nguyen 57fe1f07c3 server: clean up static assets handling (#24550)
* server: clean up static assets handling

* nits

* simplify file name handling, use static file name everywhere

* cmake/ui : bundle UI assets in an archive

* ui : run prettier on post-build.js

---------

Co-authored-by: Alde Rojas <hello@alde.dev>
b9620
2026-06-13 11:51:20 +02:00
Georgi Gerganov d8a24ccee2 fit : wrap llama_device_memory_data (#24522) b9619 2026-06-13 08:09:52 +03:00
Muhammad Salem c34b92235b fix sycl links in release notes (#24527)
* fix sycl links in release notes

* remove extra line
2026-06-13 08:37:55 +08:00
Xuan-Son Nguyen e37abd6b5f mtmd: add batching API (#24384)
* mtmd: add batching API

* wip

* first working version (gemma4v)

* add arg

* nits

* wire up support_batch()

* fix 0.0 output embd

* fix audio

* nits

* refactor a bit

* nits

* fix non-batching case

* fix comment
2026-06-13 00:10:29 +02:00
Sigbjørn Skjæret f58bad4137 ci : unbreak release harder (#24545)
* unbreak release harder

* missed one

* remove missing test for now
b9616
2026-06-12 23:49:36 +02:00
Sigbjørn Skjæret cd5044661c ci : unbreak release (#24544) 2026-06-12 23:29:49 +03:00
Georgi Gerganov ebc10770ac server : fix reasoning budget WebUI precedence over model.ini (#24517)
When reasoning-budget is set in model.ini, the per-request
thinking_budget_tokens from the WebUI was ignored because the
model.ini value took unconditional precedence.

Swap the precedence so the WebUI per-request value is checked
first, with the model.ini value serving as a fallback default.

Assisted-by: pi:llama.cpp/Qwen3.6-27B
2026-06-12 17:59:56 +03:00
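The precedence swap described in this commit can be sketched as a small resolution function: the per-request value wins when present, and the model.ini value acts only as a fallback. The function and parameter names are hypothetical, and the built-in default of `-1` is an assumption made for this illustration rather than a documented server value.

```python
from typing import Optional

def resolve_reasoning_budget(
    request_budget: Optional[int],
    ini_budget: Optional[int],
    builtin_default: int = -1,  # assumed placeholder, not a documented value
) -> int:
    """Pick the reasoning budget: request first, then model.ini, then default."""
    if request_budget is not None:
        return request_budget
    if ini_budget is not None:
        return ini_budget
    return builtin_default
```

Checking `is not None` rather than truthiness matters here: a per-request budget of `0` is a real value and must not fall through to the model.ini setting.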
Ruben Ortlam 3e7bd4f39a vulkan: add pipeline barriers for memcpy read operations (#23770)
* vulkan: add pipeline barriers for memcpy read/write operations

* remove unnecessary host write pipeline barriers
2026-06-12 16:43:50 +02:00
Aleksander Grygier f7ca93d12c ui: PWA support (#23871)
* feat: Add basic PWA support and service worker for offline caching

* feat: Vite PWA implementation WIP

* feat: Improve PWA icons generation

* feat: Add PWA workbox to server routes

* feat: Include `version.json` in static assets

* feat: Add HTTP cache headers for PWA static assets

* feat: Update app name for `apple-mobile-web-app-title`

* feat: Implement PWA versioning and automatic update detection

* chore: Update `.gitignore` files

* feat: Splash Screens

* feat: Add dark mode favicon support

* refactor: Cleanup

* fix: Use dark logo for dark splash screens

* refactor: Simplify favicons SVG code

* fix: Adjust caching and polling for reliable service worker updates

* fix: Add missing favicon entry

* fix: Align PWA service worker configuration with SvelteKit build structure

* fix: Replace hashed bundle paths with versioned static paths

* test: Add PWA tests

* ci: Add build output for unit tests

* refactor: Cleanup

* fix: Server build & release versioning

* chore: Update package-lock.json

* chore: Increase PWA cache size

* chore: Update packages

* feat: Update favicons

* refactor: Post-merge fix

* feat: support explicit build version for PWA cache busting

* fix: CI

* feat: Improve PWA Refresh Alert UI

* feat: Add toggleable build version display

* refactor: Cleanup

* feat: Add version mismatch detection and manual app reload

* refactor: replace dynamic imports with static

* refactor: Cleanup

* feat: Add safe space for `pwa-<size>.png` rendered icons

* fix: use relative paths for PWA assets to support base path deployment

* feat: add PWA mode detection via URL query parameter

* feat: Use ?cache=true for SW-cached PWA assets

* refactor: Build process cleanup

* refactor: Decouple PWA versioning and remove ?cache=true workaround

* chore: Update README logo

* feat: Include PWA Assets generation in build script

* refactor: `usePwa` hook for core layout

* fix: Relativize base vite plugin

* fix: remove unnecessary backslash escapes in test regexes

* test: update static asset paths for API Key test

* refactor: Move SvelteKit PWA Options config to constants

* ui: fix update notification never appearing

Keep the PWA hook object intact instead of destructuring needRefreshByStorage,
which freezes the reactive getter. Also exclude loading.html from PWA
precache to prevent 404 errors and broken SW installation.
2026-06-12 15:53:26 +02:00
Georgi Gerganov 02182fc5b9 fit : avoid including llama-ext.h in fit.h (#24506) b9611 2026-06-12 15:57:05 +03:00
Georgi Gerganov f532be8fac sync : ggml b9610 2026-06-12 15:55:35 +03:00
Georgi Gerganov e08c226a2c ggml : bump version to 0.15.1 (ggml/1541) 2026-06-12 15:55:35 +03:00
Adrien Gallouët 70b54e140c vendor : update cpp-httplib to 0.47.0 (#24395)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b9608
2026-06-12 11:34:44 +02:00
Pascal 6471e3c090 UI/jpeg exif orientation (#24196)
* ui: bake jpeg exif orientation into uploaded images

stb_image in mtmd ignores exif metadata, so rotated smartphone photos
reach the model with raw pixel orientation. The webui now reads the
exif orientation tag at send time and feeds it into the existing
capImageDataURLSize canvas pass: the browser applies the rotation when
decoding, so capped images come out upright for free, and images under
the cap threshold get a single plain redraw when orientation > 1.

At most one re-encode ever happens per image. Upright jpegs with
capping disabled pass through untouched, bit perfect.

Adds jpeg-orientation.ts with a minimal exif parser working on a
bounded base64 prefix (both endianness, returns 1 on any malformed
input) and unit tests against handcrafted jpeg byte streams.

* ui: move jpeg exif constants into lib/constants

* ui: add browser test for jpeg orientation and capping

Covers capImageDataURLSize end to end in chromium with real Pillow
generated jpeg fixtures across exif orientations 1/3/5/6/8: upright
quadrant colors checked pixel-wise, expected dimensions with and
without capping, no orientation tag left in the output, and strict
passthrough when nothing needs rewriting.
2026-06-12 10:20:27 +02:00
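To illustrate the kind of minimal EXIF parser this commit describes (both byte orders, returning 1 on any malformed input), here is a sketch in Python. It works on raw bytes rather than on a bounded base64 prefix, and all function names are hypothetical. It is not a port of `jpeg-orientation.ts`.

```python
import struct

def jpeg_orientation(data: bytes) -> int:
    """Return the EXIF orientation (1..8), or 1 if absent or malformed."""
    try:
        if data[:2] != b"\xff\xd8":                 # must start with SOI
            return 1
        pos = 2
        while pos + 4 <= len(data):
            if data[pos] != 0xFF:
                return 1
            marker = data[pos + 1]
            (seg_len,) = struct.unpack(">H", data[pos + 2:pos + 4])
            if seg_len < 2 or marker == 0xDA:       # bad length or start of scan
                return 1
            if marker == 0xE1 and data[pos + 4:pos + 10] == b"Exif\x00\x00":
                return _orientation_from_tiff(data[pos + 10:pos + 2 + seg_len])
            pos += 2 + seg_len                      # skip to the next segment
        return 1
    except (struct.error, IndexError):
        return 1

def _orientation_from_tiff(tiff: bytes) -> int:
    endian = {b"II": "<", b"MM": ">"}.get(tiff[:2])
    if endian is None:
        return 1
    magic, ifd_off = struct.unpack(endian + "HI", tiff[2:8])
    if magic != 42:
        return 1
    (count,) = struct.unpack(endian + "H", tiff[ifd_off:ifd_off + 2])
    for i in range(count):
        entry = tiff[ifd_off + 2 + 12 * i:ifd_off + 14 + 12 * i]
        tag, typ, n = struct.unpack(endian + "HHI", entry[:8])
        if tag == 0x0112:                           # Orientation tag
            if typ != 3 or n != 1:                  # must be a single SHORT
                return 1
            (value,) = struct.unpack(endian + "H", entry[8:10])
            return value if 1 <= value <= 8 else 1
    return 1

def make_test_jpeg(orientation: int, endian: str = "<") -> bytes:
    """Handcraft a tiny JPEG byte stream holding only an orientation tag."""
    tiff = (b"II" if endian == "<" else b"MM") + struct.pack(endian + "HI", 42, 8)
    tiff += struct.pack(endian + "H", 1)            # one IFD entry
    tiff += struct.pack(endian + "HHIHH", 0x0112, 3, 1, orientation, 0)
    tiff += struct.pack(endian + "I", 0)            # no next IFD
    payload = b"Exif\x00\x00" + tiff
    return b"\xff\xd8\xff\xe1" + struct.pack(">H", len(payload) + 2) + payload + b"\xff\xd9"
```

Returning 1 (the "upright" value) on every failure path means a broken or missing tag never triggers a rotation, which matches the fail-safe behaviour the commit message describes.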
Ruixiang Wang 88a39274ec spec: add EAGLE3 speculative decoding support (#18039)
* llama : enable layer input extraction

* spec: support eagle3

* eagle3: fix params bug

* eagle3: support Gemma4 eagle3 from RedHatAI

* eagle3: set sync when get features from target

Co-authored-by: tnhnyzc <115956684+tnhnyzc@users.noreply.github.com>

* eagle3 : fix ubatch handling in embd_layer_inp extraction and encoder

Co-authored-by: Doğaç Eldenk <dogacel@gmail.com>

* eagle3: adapt to upstream changes

* eagle3: fix rebase issues and adapt to upstream changes

* eagle3: exclude the eagle3 arch from test-llama-archs

* eagle3: fix editorconfig check failures

* eagle3: fix multi-seq issue in d2t vocab mapping

* cont : minor style / clean-up

* spec : remove `common_speculative_setup_draft_model()`

* llama : clean-up unused API

* eagle3: set d2t vocab mapping in decode graph

* cont : assert layer inputs are configured

* hparams : use n_embd_inp instead of n_embd_target_features

* eagle3: make output.weight optional and inherit from target model when needed

* hparams : generic norm-before-residual param

* llama-ext : consistent names

* cont : fix

* hparams : remove target_hidden_size

* cparams : rename output_layer_inp -> embeddings_layer_inp

* arch : reuse ATTN_NORM_2 instead of adding new hidden norm

* llama : clean-up names

* cont : add assert + comment

* Update conversion/llama.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: tnhnyzc <115956684+tnhnyzc@users.noreply.github.com>
Co-authored-by: Doğaç Eldenk <dogacel@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9606
2026-06-12 10:21:06 +03:00
ZihaoMu 85f99dca8b ggml: support concat for scalar types at cuda backend (#24011)
* cuda: support concat for scalar types

* Update concat.cu

* fix metal ci issue
b9605
2026-06-12 09:32:44 +03:00
Neo Zhang 099ea76fb4 [SYCL] Fix CI build & release for SYCL backend (#24387)
* restore SYCL build and release, remove github cache

* modify for test only

* verify the ccache is used

* remove debug code change

* rm duplicate action, update key in ccache

* add action ccache-clear after building in both ubuntu and windows

* set %NUMBER_OF_PROCESSORS% in Windows build
b9604
2026-06-12 09:30:24 +03:00
shaofeiqi ba1df050f3 opencl: add q5_0/q5_1 gemm and gemv kernels for Adreno (#24319)
* opencl: add q5_0 adreno support

* opencl: add q5_1 adreno support

* opencl: cosmetic fix

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
b9603
2026-06-11 21:43:09 -07:00
wencan 1593d5684d docker : support specifying the GCC version for CUDA (#24447) 2026-06-11 23:12:09 +02:00
Jeff Bolz 4c6595503f vulkan: ifdef eMesaHoneykrisp (build fix) (#24479)
Fixes build/CI after #24306.
b9601
2026-06-11 13:22:17 -05:00
Georgi Gerganov 263cc04a54 sync : ggml 2026-06-11 19:34:19 +03:00
Georgi Gerganov 17e59d6209 ggml : bump version to 0.15.0 (ggml/1539) 2026-06-11 19:34:19 +03:00
Winston Ma fdc3db9b65 vulkan: add fast path for contiguous buffer transfers (#23973) 2026-06-11 15:46:25 +02:00
Kevin Liu 1af154a76f vulkan: use medium matmul tile on Asahi Linux (#24306)
* vulkan: use medium matmul tile on Asahi Linux

* vulkan: switch Apple detection to Honeykrisp driver id
2026-06-11 15:43:04 +02:00
Xuan-Son Nguyen 18ef86ecec server: skip unused log lines on router mode (#24463) b9596 2026-06-11 11:36:35 +02:00
o7si 1bfbdb134e vocab : adopt leading TemplateProcessing special token as BOS (#24428) 2026-06-11 10:37:23 +03:00
o7si 68f30663cf vocab : refactor normalizer flags into options struct, add strip_accents (#24371)
* vocab : refactor normalizer flags into options struct, add strip_accents

* Update src/llama-vocab.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9594
2026-06-11 10:36:50 +03:00
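The refactor named in this commit groups normalizer flags into a single options struct and adds `strip_accents`. A minimal Python sketch of that shape follows, assuming the common definition of accent stripping (NFD decomposition followed by removal of combining marks). The `NormalizerOptions` class, its field names and the `normalize` function are hypothetical and do not reflect the actual declarations in `src/llama-vocab.h`.

```python
import unicodedata
from dataclasses import dataclass

@dataclass
class NormalizerOptions:
    # Hypothetical flag set; grouping flags avoids long boolean parameter lists.
    lowercase: bool = False
    strip_accents: bool = False

def normalize(text: str, opts: NormalizerOptions) -> str:
    if opts.strip_accents:
        # Decompose characters, then drop nonspacing combining marks (Mn).
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    if opts.lowercase:
        text = text.lower()
    return text
```

Letters without a canonical decomposition, such as `æ`, pass through unchanged under this definition, since there is no combining mark to remove.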
Aldehir Rojas db94854ff5 server : skip checkpoints beyond pos_next (#24411)
* server : skip checkpoints beyond pos_next

* cont : update comment + TODO + ref

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-06-11 10:18:12 +03:00
Adrien Gallouët ac4cddeb0d vendor : update LibreSSL to 4.3.2 (#24397)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b9592
2026-06-10 22:28:03 +02:00
Gaurav Garg e95dae18d6 Remove padding and multiple D2D copies for MTP (#24086)
* Make ggml_gated_delta_net take only the initial recurrent state (D, 1, n_seqs) and pass the snapshot count K as an op parameter instead of inferring it from state->ne[1].

Remove the padding hack and copy all emitted snapshots into the recurrent cache with a single strided ggml_cpy

* Make GDN changes in all backends. Address review comments.

* Fix CI build errors
b9591
2026-06-10 23:21:16 +05:30
Tarek Dakhran d2462f8f7a chat: fix LFM2/LFM2.5 ignoring json_schema (#24377)
The LFM2 specialized template handler only built a grammar for tool-calling,
silently ignoring json_schema from response_format.
b9590
2026-06-10 14:41:41 +02:00
Oliver Simons fb83cc9a07 CUDA: Fix ssm_scan_f32 data-races (#24360)
* Add missing syncthreads before reusing cub_temp_storage

__syncthreads() is required before being allowed to reuse TempStorage
smem:
https://nvidia.github.io/cccl/unstable/cub/api/classcub_1_1BlockLoad.html#_CPPv4I0EN3cub9BlockLoad4LoadEv20RandomAccessIteratorRA14ItemsPerThread_1Ti

* Add one more missing __syncthreads

Could also double-buffer, but alternative is to simply ensure all
threads have read smem* before writing to it again in the next loop
iteration

* Remove unused smem from ssm_scan_f32
b9589
2026-06-10 14:27:08 +02:00
Sigbjørn Skjæret 039e20a2db ci : bump komac version (#24396) 2026-06-10 09:45:20 +02:00
ddh0 d2e22ed975 speculative : fix "ngram-map-k4v" name in logging (#24253)
This is a non-functional change.

When using `--spec-type ngram-map-k4v`, the log messages at startup and
runtime say `ngram-map-k`. Added logic in the constructor of
`common_speculative_impl_ngram_map_k` to pass the correct
`COMMON_SPECULATIVE_TYPE_NGRAM_MAP_K4V` when `config.key_only` is
`false`.

After this change, the log messages use the correct name.
b9587
2026-06-10 09:31:35 +02:00
Rémy Mathieu 76da2450a4 webui: implement pinned conversations support (#21387)
* webui: implement pinned conversations support

* webui: linter/prettier pass

* Fix the unused handleMobileSidebarItemClick from the component.

* the search should find pinned conversations as well

Co-authored-by: Pascal <admin@serveurperso.com>

---------

Co-authored-by: Pascal <admin@serveurperso.com>
b9586
2026-06-09 21:33:22 +02:00
Aarnav Pai d73cd07674 graph: Fix granite speech model inference by applying embedding scale when deepstack is not used (#24357)
* llama-graph : apply embedding scale when deepstack is not used

* nits: remove non-existent hunyuan-vl from the tests

* apply suggestion from @gabe-l-hart

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
b9585
2026-06-09 19:46:27 +02:00
Sigbjørn Skjæret e25a32e98c ci : fix windows release (#24369) b9584 2026-06-09 19:42:23 +03:00
Pascal 483609509d ui: add opt-in run_javascript frontend tool (#24244)
* ui: add opt-in run_javascript frontend tool

Expose a run_javascript tool to the model, executed entirely in the
browser through the existing agentic loop. Code runs in a Web Worker
inside a sandboxed iframe with an opaque origin, isolated from the
WebUI and its API. Console output, errors and the return value are
fed back as the tool result. The parent enforces a hard timeout by
removing the iframe, which terminates the worker.

Disabled by default, toggle in Settings > Developer.

* ui: address review feedback from allozaur

Use the JsonSchemaType enum for the tool definition parameter types
instead of raw string literals, extending it with STRING and NUMBER.

Move the worker shim and the iframe harness html into their own files
so the service no longer carries inline source blobs.

Replace the remaining magic strings with constants: SANDBOX_EMPTY_OUTPUT
and SANDBOX_TRUNCATION_NOTICE, and reuse NEWLINE_SEPARATOR for joins.

* ui: move sandbox worker shim to a raw imported file

Replace the inline worker template string with a real sandbox-worker.js
imported as raw text, and build the iframe harness from it in
sandbox-harness.ts. The raw worker ships as a string, not a module, so
it is excluded from eslint and the typecheck program.
2026-06-09 18:02:31 +02:00