614 Commits

Author SHA1 Message Date
Georgi Gerganov 80452d65b9 server : consolidate slot selection into get_available_slot (#24755)
Absorb get_slot_by_id logic into get_available_slot so slot selection
is handled by a single function call. When a specific slot id is
requested, the LCP similarity check still runs to enable proper
prompt cache updates.

Assisted-by: pi:llama.cpp/Qwen3.6-27B
2026-06-19 09:22:34 +03:00
Reguna 40f3aafc45 server: add "X-Accel-Buffering": "no" header to streaming endpoints (#24774)
* server: add "X-Accel-Buffering": "no" header to streaming endpoints

This header tells Nginx (as a reverse proxy) to NOT buffer responses. (only affects streaming endpoints)
Without it, Nginx will break streaming with certain applications (notably the Pi coding harness).
2026-06-18 22:01:24 +02:00
Xuan-Son Nguyen fe7c8b2414 server: (router) fix stopping_thread potentially hang (#24728)
* server: (router) fix stopping_thread potentially hang

* fix windows build
2026-06-18 15:41:09 +02:00
Xuan-Son Nguyen e1efd0991d server: add "schema" and validation (#24150)
* wip

* working

* correct some limits

* add field name to error message
2026-06-18 15:40:58 +02:00
Aarni Koskela 08023072ef server : add last-5-seconds generation speed display (#24291)
* server : add last-5-seconds generation speed display

* cont : clean-up

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-06-18 14:02:20 +02:00
Anuj Attri 10786217e9 server : return HTTP 400 on invalid grammar (#24144) (#24154)
Throw on grammar parse failure so the server returns HTTP 400
instead of silently dropping the constraint.
Add a regression test for the invalid-grammar response.

Fixes #24144
2026-06-18 12:49:14 +02:00
Xuan-Son Nguyen 552258c535 server: (router) rework -hf preset repo (#24739)
* server: temporary remove HF remote preset

* rework remove preset.ini support

* rm unused get_remote_preset_whitelist()

* print warning

* add docs

* rm stray file
2026-06-18 12:45:23 +02:00
Xuan-Son Nguyen 968c43891a server: fix router args not being forwarded to child instances (#24760) 2026-06-18 12:15:46 +02:00
Xuan-Son Nguyen 4b4d13ae72 server: (router) add model management API (#23976)
* wip

* server: (router) add SSE realtime updates API

* nits

* wip

* add download API

* add download api

* update docs

* add delete endpoint

* fix std::terminate

* fix crash

* fix 2

* add tests

* nits
2026-06-17 18:04:58 +02:00
Ruixiang Wang 635b65ad7a spec: add spec metrics mean acceptance length and acceptance rate per position (#24536)
* spec: add spec metrics mean acceptance length and acceptance per pos

* fix as suggestion

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix as suggestion

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix as suggestion

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix as suggestions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-06-16 10:23:09 +03:00
Adrien Gallouët e3a74b2990 bench : add --offline (#24511)
* bench : add --offline

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add default

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-06-16 08:26:05 +02:00
Georgi Gerganov e3cab403bf mtmd : add post-decode callback (#24645)
Assisted-by: pi:llama.cpp/Qwen3.6-27B
2026-06-15 16:02:05 +03:00
Xuan-Son Nguyen e8067a8b36 ui: build-time gzip compression (#24571)
* ui: keep original file name and path

* fix nocache

* ui: build-time gzip compression
2026-06-13 16:57:27 +02:00
Xuan-Son Nguyen 597b6672e8 ui: keep original file name and path (#24568)
* ui: keep original file name and path

* fix nocache
2026-06-13 14:31:41 +02:00
Xuan-Son Nguyen 57fe1f07c3 server: clean up static assets handling (#24550)
* server: clean up static assets handling

* nits

* simplify file name handling, use static file name everywhere

* cmake/ui : bundle UI assets in an archive

* ui : run prettier on post-build.js

---------

Co-authored-by: Alde Rojas <hello@alde.dev>
2026-06-13 11:51:20 +02:00
Georgi Gerganov d8a24ccee2 fit : wrap llama_device_memory_data (#24522) 2026-06-13 08:09:52 +03:00
Xuan-Son Nguyen e37abd6b5f mtmd: add batching API (#24384)
* mtmd: add batching API

* wip

* first working version (gemma4v)

* add arg

* nits

* wire up support_batch()

* fix 0.0 output embd

* fix audio

* nits

* refactor a bit

* nits

* fix non-batching case

* fix comment
2026-06-13 00:10:29 +02:00
Georgi Gerganov ebc10770ac server : fix reasoning budget WebUI precedence over model.ini (#24517)
When reasoning-budget is set in model.ini, the per-request
thinking_budget_tokens from the WebUI was ignored because the
model.ini value took unconditional precedence.

Swap the precedence so the WebUI per-request value is checked
first, with the model.ini value serving as a fallback default.

Assisted-by: pi:llama.cpp/Qwen3.6-27B
2026-06-12 17:59:56 +03:00
Aleksander Grygier f7ca93d12c ui: PWA support (#23871)
* feat: Add basic PWA support and service worker for offline caching

* feat: Vite PWA implementation WIP

* feat: Improve PWA icons generation

* feat: Add PWA workbox to server routes

* feat: Include `version.json` in static assets

* feat: Add HTTP cache headers for PWA static assets

* feat: Update app name for `apple-mobile-web-app-title`

* feat: Implement PWA versioning and automatic update detection

* chore: Update `.gitignore` files

* feat: Splash Screens

* feat: Add dark mode favicon support

* refactor: Cleanup

* fix: Use dark logo for dark splash screens

* refactor: Simplify favicons SVG code

* fix: Adjust caching and polling for reliable service worker updates

* fix: Add missing favicon entry

* fix: Align PWA service worker configuration with SvelteKit build structure

* fix: Replace hashed bundle paths with versioned static paths

* test: Add PWA tests

* ci: Add build output for unit tests

* refactor: Cleanup

* fix: Server build & release versioning

* chore: Update package-lock.json

* chore: Increase PWA cache size

* chore: Update packages

* feat: Update favicons

* refactor: Post-merge fix

* feat: support explicit build version for PWA cache busting

* fix: CI

* feat: Improve PWA Refresh Alert UI

* feat: Add toggleable build version display

* refactor: Cleanup

* feat: Add version mismatch detection and manual app reload

* refactor: replace dynamic imports with static

* refactor: Cleanup

* feat: Add safe space for `pwa-<size>.png` rendered icons

* fix: use relative paths for PWA assets to support base path deployment

* feat: add PWA mode detection via URL query parameter

* feat: Use ?cache=true for SW-cached PWA assets

* refactor: Build process cleanup

* refactor: Decouple PWA versioning and remove ?cache=true workaround

* chore: Update README logo

* feat: Include PWA Assets generation in build script

* refactor: `usePwa` hook for core layout

* fix: Relativize base vite plugin

* fix: remove unnecessary backslash escapes in test regexes

* test: update static asset paths for API Key test

* refactor: Move SvelteKit PWA Options config to constants

* ui: fix update notification never appearing

Keep the PWA hook object intact instead of destructuring needRefreshByStorage,
which freezes the reactive getter. Also exclude loading.html from PWA
precache to prevent 404 errors and broken SW installation.
2026-06-12 15:53:26 +02:00
Xuan-Son Nguyen 18ef86ecec server: skip unused log lines on router mode (#24463) 2026-06-11 11:36:35 +02:00
Aldehir Rojas db94854ff5 server : skip checkpoints beyond pos_next (#24411)
* server : skip checkpoints beyond pos_next

* cont : update comment + TODO + ref

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-06-11 10:18:12 +03:00
jacekpoplawski 1e912561dd server: log prompts to directory (#22031)
* server: log prompts to directory

Add `--log-prompts-dir` to write each prompt to a separate text file in
the specified directory.

* Apply suggestion from @ngxson

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-06-09 12:09:07 +02:00
fiesh 961e9a3e46 server : do not clear slots without unified KV cache (#24190)
* Always export idle slots to RAM

Without this, a slot's VRAM cache may not be written to RAM.  If this
slot happens to be busy then later on, this triggers needless
preprocessing in another slot.

* cont : clean-up

---------

Co-authored-by: Christoph Weiss <weiss@wsoptics.de>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-06-09 10:45:16 +03:00
Aldehir Rojas 42a0afd594 server : do not parse when flushing http headers (#24281) 2026-06-08 13:32:41 -05:00
Xuan-Son Nguyen 8f83d6c271 mtmd : add video input support (#24269)
* wip

* ok: lazy bitmap API

* remember to free lazy text

* wip

* add mtmd_helper_video

* support video input on server (base64 input)

* add MTMD_VIDEO config

* add timestamp

* update CLI

* cli: allow auto-completion for video

* add --video arg

* fix build

* update docs

* rename as suggested
2026-06-08 14:40:12 +03:00
ddh0 9e3b928fd8 common : relax sampler name matching (#23744)
* common : relax sampler name matching

Currently, in some cases, the alternative names for samplers (like
`top-k` and `min-p` instead of the canonical `top_k` and `min_p`) are
not always recognized by the `common_sampler_types_from_names` function
in `common/sampling.cpp`.

This PR changes the signature of this function to remove the `bool
allow_alt_names` flag, and removes all occurences of the flag from call
sites. Therefore, the function will now always match all known names.

I also changed the logic of the function to unconditionally check the
provided sampler names against both the canonical and alternative names,
and to be case-insensitive.

This fixes an issue I was seeing wherein samplers specified in the
`llama-server` UI were not recognized as valid when the alternative
names were used.

* add more alt names

* cont. fix

* cast to unsigned char for correctness

* common : unify sampler name mapping

* annotate canonical vs. alt sampler name mappings per @CISC

* Update common/sampling.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* common : auto-generate sampler name aliases per @ngxson

* use merged map for matching

* use `.merge` instead of iterating

* nit: simplify comment

* nit: use insert everywhere, not index assignment

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-06-07 22:48:11 +02:00
Aman Gupta 04eb4c446d llama : add Gemma4 MTP (#23398) 2026-06-07 20:50:54 +08:00
Xuan-Son Nguyen f5c6ae1827 mtmd, server: add "placeholder bitmap" for counting tokens , add */input_tokens API (#23913)
* mtmd: add "placeholder bitmap" for counting tokens w/o preprocessing

* fast path skip preproc for placeholder

* fix build

* correct the api

* add server endpoint + tests

* add object name

* update docs

* add proxy handling

* fix build

* fix audio input path

* use is_placeholder in process_mtmd_prompt()

* nits

* nits (2)

* docs: clarify chat/completions/input_tokens is not official

* fix merge problem
2026-06-06 11:06:51 +02:00
Mario 9c955c48b0 Fix link to available UI settings (#24169)
The current link is to a non-existent file. I had a look at the repo, spotted the file containing the UI configuration key and updated the link
2026-06-05 14:39:32 +02:00
Georgi Gerganov 7c158fbb4a server : disable on-device spec checkpoints (#24108) 2026-06-04 19:30:59 +03:00
Yongyue Sun 6f3a9f3dee server: avoid unnecessary checkpoint restore when new tokens are present (#24110)
* server: avoid unnecessary checkpoint restore when new tokens are present

The pos_min_thold calculation unconditionally subtracts 1 to ensure at
least one token is evaluated for logits when no new tokens exist.
However, when the request contains new tokens beyond the cached prefix,
this -1 is overly conservative and may trigger an unnecessary checkpoint
restore.

Conditionally apply the -1 only when n_past >= task.n_tokens() (no new
tokens), avoiding redundant KV state restoration when there is actual
work to do.

* cont : add ref

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-06-04 16:09:01 +03:00
A B 0066404085 server : add header to tools/server/server-http.h (#24089) 2026-06-04 13:14:46 +03:00
Aman Gupta 166fe29492 qwen35: use post-norm hidden state for MTP (#24025)
* qwen35: use post-norm hidden state for MTP

* rename pre_norm to nextn

* fix step35
2026-06-04 01:29:09 +08:00
Mikhail Podvitskii 4fb16eccce model: add Mellum architecture (#23966)
* model: support for Mellum architecture

* model: improve mellum.py formatting

* model: improve mellum.py formatting once again

* deps: downgrade transformers to 4.57.6 (to fix CI)

* deps: remove huggingface_hub dependency

* deps: remove huggingface_hub from test requirements

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-06-02 22:11:12 +03:00
Xuan-Son Nguyen 60130d18f9 server: add SSE ping interval (#24013) 2026-06-02 14:14:55 +02:00
Pascal 354ebac8cb server: real-time reasoning interruption via control endpoint (#23971)
* server: real-time reasoning interruption via control endpoint

Builds on the manual reasoning budget trigger from #23949. Adds a
CONTROL task that mirrors the CANCEL path on the live slot and calls
common_sampler_reasoning_budget_force to end thinking mid-generation.
POST /v1/chat/completions/control with { id_slot, action }, opt-in
reasoning_control arms the budget sampler on demand. Router and single
model. Minimal WebUI button as a skeleton for further UI work.

* ui: track reasoning phase via explicit streaming state

Add isReasoning to the chat store, mirroring the isLoading pattern:
per conversation map, private setter, public accessor and reactive
export. Set from the stream callbacks, true on reasoning chunks, false
on the first content chunk, reset on stream end and resynced on
conversation switch. The skip button now keys off isReasoning so it
shows only during the thinking phase, not the whole generation.

* ui: extract control endpoint and action into constants

Move the chat completion routes, the slots route and the reasoning
control action out of chat.service into api-endpoints and a dedicated
control-actions module. No behavior change, drops the magic strings so
the control protocol has a single source of truth.

* server: target reasoning control by completion id

Address @ngxson review on the control endpoint.

Switch from id_slot to the chat completion id to avoid a TOCTOU: the
slot can be reassigned between the lookup and the control request, so
matching the live completion (oaicompat_cmpl_id) is safe and a finished
one simply matches nothing. Rename the action to reasoning_end, guard
it on the reasoning_control flag of the target slot, and reduce the
response to {success} with an optional message.

* ui: target reasoning control by completion id

Keep the streamed completion id on the message and post it back to the
control endpoint instead of probing /slots. Drops the slot discovery
and the TOCTOU that came with it. Action renamed to reasoning_end,
response read as {success}.

* server: address review from @ngxson

Move the control fields into task_params and drop the redundant
comments on the control path.

* server: document the reasoning control endpoint

* Update tools/ui/src/lib/types/database.d.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* ui: rename cmplId to completionId

Per @allozaur review, clearer name for the streamed completion id.

* ui: wire completion id capture through the agentic flow

The webui streams through the agentic flow, which relayed onModel but
not onCompletionId, so the completion id never reached the message and
the control request was never sent. Relay it through the flow and its
callbacks type, declare id on the chunk type, and log an explicit error
when the button fires without a usable id.

* ui: target reasoning control model from the message

The model is a property of the completion, so read it from the streaming
message like the id, not from the model dropdown which is unrelated UI
state. Makes the request self-consistent by construction instead of just
unlikely to drift.

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-06-02 07:26:20 +02:00
Georgi Gerganov 5dcb711666 speculative : fix n_outputs_max and remove draft-simple auto-enable (#23988)
* speculative : add common_speculative_n_max helper function

Extract the speculative max-draft-size logic from server_n_outputs_max
into a reusable common_speculative_n_max() function in common/speculative.

Assisted-by: llama.cpp:local pi

* cont : draft context always has n_parallel outputs

* llama : log n_outputs_max

* speculative : remove draft-simple auto-enable

* ci : enable server tests on PRs
2026-06-01 22:26:58 +03:00
Aman Gupta de6f727aae llama: limit max outputs of llama_context (#23861)
* llama: save more VRAM by reserving n_outputs == n_seqs when possible

* add n_outputs_per_seq

* move n_outputs_max to server-context

* change ubatch to batch everywhere
2026-06-01 18:01:38 +03:00
Eric Zhang 6f165c1c64 server : handle If-None-Match weak ETags (#23916) 2026-05-31 16:21:08 -05:00
Xuan-Son Nguyen 0821c5fcfd server: in SSE mode, send HTTP headers when slot starts (#23884)
* server: in SSE mode, send HTTP headers when slot starts

* ref to pr

* stream should be false by default
2026-05-30 00:06:29 +02:00
Ruixiang Wang 689a9a470e server-bench : add speed-bench for speculative decoding benchmarking (#23869)
* spec: add speed-bench support for benchmarking

* speed-bench : add trailing newline to requirements.txt

* speed-bench : bump datasets to 4.8.0 to fix ty check

* server-bench : remove now-unused type: ignore after datasets bump
2026-05-29 23:09:47 +02:00
Pascal 5a46b46acd app: add llama update self updater (#23865)
* wip: llama update POC

* cleaning: llama update

* llama-gen-docs

* app: delegate llama update to the install script

* app: spawn the installer detached so llama update can replace a running binary

* cleaning: inline llama update into llama.cpp, drop app-update.{cpp,h}

* app: make llama_update static

Address review from @angt
2026-05-29 23:02:40 +02:00
Xuan-Son Nguyen b5f52280fb server: remove obsolete scripts (#23870) 2026-05-29 19:47:30 +02:00
Xuan-Son Nguyen 06d26dfdff download: add option to skip_download (#23059)
* download: add option to skip_download

* fix

* fix 2

* if file doesn't exist, respect skip_download flag
2026-05-29 16:30:55 +02:00
Xuan-Son Nguyen cb47092b00 server: bump timeout to 3600s (#23842)
* server: bump timeout to 3600s

* nits: change wording
2026-05-29 10:23:17 +02:00
Adrien Gallouët 98e480a32e app : move licences to llama-app (#23824)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-29 07:46:11 +02:00
Mikolaj Kucharski 7fb1e70b59 arg: Add LLAMA_ARG_API_KEY_FILE environment variable for --api-key-file (#23167) 2026-05-28 16:25:40 +02:00
Funtowicz Morgan 0b246862b9 server: minor tweaks to use more cpp features (#23785)
* misc(server): add default port to impl RAII

* misc(server): register_gcp_compat() can be const

* misc(server): use proper cpp const/auto methods

* misc(server): do not reset a unique_ptr, use make_unique instead to be exception safe
2026-05-28 14:00:25 +02:00
Markus Tavenrath d205df6812 server, ui : Add support for HTTP ETags in llama-server (#23701)
* allow caching of ui elements in llama-server

* use fnv_hash

* Update tools/server/server-http.cpp

etag has to be set always

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-05-28 12:21:24 +02:00
Georgi Gerganov 6b4e4bd582 common : fix env names to all have LLAMA_ARG_ prefix (#23778) 2026-05-27 14:52:47 +03:00