868 Commits

Author SHA1 Message Date
Xuan-Son Nguyen cb47092b00 server: bump timeout to 3600s (#23842)
* server: bump timeout to 3600s

* nits: change wording
2026-05-29 10:23:17 +02:00
Adrien Gallouët 98e480a32e app : move licences to llama-app (#23824)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-29 07:46:11 +02:00
Xuan-Son Nguyen 751ebd17a5 mtmd-debug: add color and rainbow mode (#23829)
* mtmd-debug: add color and rainbow mode

* fix M_PI

* max_dist
2026-05-28 20:59:14 +02:00
Xuan-Son Nguyen c8914ad4f4 mtmd: fix gemma 4 projector pre_norm (#23822) 2026-05-28 20:58:55 +02:00
ValdikSS 2f6c815dc4 ui: fix audio and video modality detection (#23756)
When model props are fetched asynchronously from the server,
modelPropsVersion is incremented to trigger reactivity, but
only the vision effect was listening to it.
2026-05-28 17:36:10 +02:00
Saba Fallah 0b56d283bf mtmd: n_head_kv defaults to n_head (#23782)
removed AI-generated comment
2026-05-28 16:44:36 +02:00
Xuan-Son Nguyen d6be3158e1 mtmd: fix gemma 4 audio rms norm eps (#23815)
* mtmd: fix gemma 4 audio rms norm eps

* Update tools/mtmd/clip.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-28 16:31:37 +02:00
Mikolaj Kucharski 7fb1e70b59 arg: Add LLAMA_ARG_API_KEY_FILE environment variable for --api-key-file (#23167) 2026-05-28 16:25:40 +02:00
Funtowicz Morgan 0b246862b9 server: minor tweaks to use more cpp features (#23785)
* misc(server): add default port to impl RAII

* misc(server): register_gcp_compat() can be const

* misc(server): use proper cpp const/auto methods

* misc(server): do not reset a unique_ptr, use make_unique instead to be exception safe
2026-05-28 14:00:25 +02:00
Markus Tavenrath d205df6812 server, ui : Add support for HTTP ETags in llama-server (#23701)
* allow caching of ui elements in llama-server

* use fnv_hash

* Update tools/server/server-http.cpp

etag has to be set always

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-05-28 12:21:24 +02:00
Adrien Gallouët 48e7eae41c perplexity : fix format specifier in LOG_ERR (#23788)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-28 10:34:58 +03:00
Georgi Gerganov 6b4e4bd582 common : fix env names to all have LLAMA_ARG_ prefix (#23778) 2026-05-27 14:52:47 +03:00
Radoslav Gerganov 7085492c6f server : fix the log message when using SSL (#23393)
When llama-server is started with SSL key and cert, the log says that it
listens on http instead of https. This patch fixes this.
2026-05-27 08:06:30 +03:00
Pascal 5a4126adc1 ui: fix stop/continue during an agentic loop (#23356) 2026-05-25 14:18:59 +02:00
Aman Gupta 6c4cbdc70b server: MTP layer kv-cache should respect draft type ctk (#23646) 2026-05-25 16:46:23 +08:00
Saba Fallah b96487645c ui: media attachments before text (#23467)
* ui: media attachments before text

* fix prettier formatting
2026-05-25 08:50:41 +02:00
jacekpoplawski e2ef8fe42c server: fix checkpoints creation (#22929)
* common : add common_chat_split_by_role

* cont : fix spans to reach end of message

* server: fix checkpoints creation

- extract message_spans from chat templates
- find the prompt token position before the latest user message
- split prompt batching at that position
- create a context checkpoint before the latest user input
- avoid periodic mid-prompt checkpoints when that position is known
- handle multimodal prompts when mapping text/template positions to server prompt tokens
- add --checkpoint-min-step to control minimum spacing between checkpoints

* cont : clean-up

* Support autoparser detection for message barriers

* server: fix message span delimiter and update docs

---------

Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
2026-05-25 08:56:18 +03:00
fairydreaming 6d57c26ef8 perplexity : fix even more integer overflows (#23623)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2026-05-25 08:12:39 +03:00
Aldehir Rojas 63248fc3e3 cmake : fix ui build (#23592)
* cmake/ui : add -fPIC to llama-ui static lib

* cmake : rename host compiled embed helper
2026-05-24 02:37:28 -05:00
Aman Gupta 83eebe9d08 server: add margin for draft model for fit (#23485) 2026-05-24 14:43:08 +08:00
Aldehir Rojas b22ff4b7b4 cmake/ui : refactor the build (#23352) 2026-05-23 17:08:22 -04:00
Aditya Singh c0c7e147e7 requirements : bump torch to 2.11.0 (#23503)
* requirements: relax torch~=2.6.0 to torch>=2.6.0 for convert_hf_to_gguf

The ~=2.6.0 operator resolves to >=2.6.0, <2.7.0, which fails on
PyPI for platform/CPython combinations where 2.6.x is not present.
The accompanying comment already says 'PyTorch 2.6.0 or later', so
the looser >=2.6.0 matches the documented intent and unblocks
pip install -r requirements/requirements-convert_hf_to_gguf.txt.

Fixes #23408

* requirements: bump torch floor to 2.11.0 per maintainer

* requirements: pin torch to ==2.11.0 per project policy

* requirements: pin mtmd torch and torchvision to 2.11.0/0.26.0 per project policy

* requirements: suppress check_requirements pin warning on mtmd

The check_requirements script flags '==' on lines in files matched by
*/**/requirements*.txt. Append the documented suppression comment to the
pinned torch and torchvision lines (and to the s390x platform marker lines)
so the check passes while keeping the pins required by project policy.

* ty: silence Tensor/Module union check on model[0].auto_model

With torch 2.11.0 stubs, nn.Sequential.__getitem__ now returns
Tensor | Module rather than Module, so model[0].auto_model fails ty
on the SentenceTransformer code path. The runtime behavior is
unchanged because SentenceTransformer always wraps a Module at
index 0. Adding a targeted unresolved-attribute ignore keeps the
type-check green without altering behavior. A follow-up issue
tracks typing the variable explicitly.
2026-05-23 18:24:39 +02:00
Aldehir Rojas 1acee6bf89 server: only parse empty msg if continuing an assistant msg (#23506) 2026-05-22 11:58:15 -04:00
fairydreaming ef570f6308 perplexity : fix integer overflow (#23496)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2026-05-22 15:50:44 +03:00
Georgi Gerganov bbce619adb cmake : add install() for impl libraries + fix apple builds (#23511)
* pi : update

* ci : fix ios build

* ci : fix andoroid

* ci : fix apple builds

* cmake : add install() for impl libraries

Add install(TARGETS <target> LIBRARY) for all -impl libraries that were
changed from STATIC to shared (controlled by BUILD_SHARED_LIBS) in
commit bb28c1fe2. Without this, cmake --install fails to copy the shared
libraries, causing runtime errors like:

  llama-server: error while loading shared libraries: libllama-server-impl.so

Ref: https://github.com/ggml-org/llama.cpp/issues/23494#issuecomment-4512912515

Assisted-by: llama.cpp:local pi

* ci : fix xcframework build
2026-05-22 11:46:26 +03:00
Georgi Gerganov bb28c1fe24 cmake : remove STATIC from impl libraries, enable LLAMA_BUILD_APP by default (#23462)
* cmake : remove STATIC from impl libraries, allow BUILD_SHARED_LIBS control

Remove explicit STATIC from all -impl libraries (server, cli, completion, bench,
batched-bench, fit-params, quantize, perplexity) so BUILD_SHARED_LIBS controls
shared vs static linkage.

Add WINDOWS_EXPORT_ALL_SYMBOLS ON for proper DLL export on Windows.

Assisted-by: llama.cpp:local pi

* cmake : enable LLAMA_BUILD_APP by default

Assisted-by: llama.cpp:local pi

* ci : disable app in build-cmake-pkg.yml
2026-05-21 21:13:59 +03:00
ScrewTSW b65bb4baae server: expose prompt token counts in /slots endpoint (#23454)
Add n_prompt_tokens, n_prompt_tokens_processed, and n_prompt_tokens_cache
to the /slots JSON response. These fields are already tracked internally
but were not exposed, making it impossible for clients to monitor prompt
evaluation progress during processing.
2026-05-21 13:29:13 +02:00
Aman Gupta 52fb93a2bd server : free draft/MTP resources on sleep to fix VRAM leak (#23461)
The destroy() function in server_context_impl only cleaned up the main
model and context (via llama_init.reset()) but did not free the speculative
decoder (spec), draft context (ctx_dft), or draft model (model_dft).

For MTP (Multi-Token Prediction) models, ctx_dft holds GPU-allocated
resources (KV cache, compute buffers) that are not freed when entering
the sleeping state. On each sleep/resume cycle, new resources are
allocated without the old ones being freed, leading to a VRAM leak
that eventually crashes the server with out-of-memory errors.

Fix by explicitly resetting spec, ctx_dft, and model_dft in destroy()
before resetting llama_init, ensuring proper cleanup order to avoid
use-after-free.

ref: https://github.com/ggml-org/llama.cpp/issues/23395

Assisted-by: llama.cpp:local pi
2026-05-21 16:11:11 +08:00
Pascal c9021714e8 server: re-inject subcommand when router spawns children under unified binary (#23442) 2026-05-21 10:09:19 +02:00
Adrien Gallouët 1d7ab2b947 app : add batched-bench, fit-params, quantize & perplexity (#23459)
* app : add batched-bench, fit-params, quantize & perplexity

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add missing main.cpp

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add EOL

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-21 10:29:44 +03:00
Aleksander Grygier 5e932a1c8d ui: Improve Git Hooks for UI development (#23403)
* refactor: Improve Git Hooks for UI development

* fix: Address review comments

* fix: Use absolute git path for `/hooks`

Co-authored-by: Pascal <admin@serveurperso.com>

---------

Co-authored-by: Pascal <admin@serveurperso.com>
2026-05-21 08:27:50 +02:00
wendadawen 6a257d4463 mtmd, model : merge HunyuanOCR into HunyuanVL and fix OCR vision precision (#23329)
- HunyuanOCR shares the same HF arch and vision layout as HunyuanVL butwas split into a separate path that skipped the +0.1 bilinear sampler used by the HF reference.
- Collapse OCR into the HUNYUANVL projector + HUNYUAN_VL text arch
2026-05-21 00:35:37 +02:00
stduhpf 3a479c9132 ui: Add max image size option (#22849)
* webui: Add max image size option

* remove magic numbers

* support all image formats

* use const

* Move regex to match b64 images to constants

* use SETTINGS_KEYS to get max image resolution setting

* Do not touch the image if already under the size threshold
2026-05-21 00:00:09 +02:00
Saba Fallah a8681a0ed2 mtmd : DeepSeek-OCR image processing fixes, img_tool::resize padding refactor (#23345)
* mtmd : deepseek-ocr fixes, improvements and refactoring

- image processing changes to achieve full parity with Pillow (reference impl)
- SAM mask casting only when flash-attn is on
- SAM refactor (build_sam() extracted so deepseek-ocr-2 can reuse it)
- llama-chat changes to fix server/WebUI issue (new media_markers_first())
- adapted test-chat-template and added test cases for deepseek-ocr
- changed regression test for deepseek-ocr to use CER+chrF scores for ground-truth comparison; removed embedding-model
- ty.toml ignore unresolved-import for tools/mtmd/tests/**

* image-text reordering fix removed

* refactor bool add_padding + pad_rounding enum into a single pad_style enum
2026-05-20 17:37:10 +02:00
Aleksander Grygier 6ce96713de feat: Add WAV MIME type variants and improve audio format detection (#23396) 2026-05-20 16:55:24 +02:00
Adrien Gallouët 29f1482221 app : introduce the llama unified executable (#23296)
* app : introduce the llama unified executable

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Use serve for server

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Hide completion and bench, add help command

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Remove STATIC

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Use -impl targets instead of -lib

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Revert "Remove STATIC"

This reverts commit cc44caccb9.

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-20 13:22:22 +02:00
Aleksander Grygier e6b4acfe86 refactor: Move text attachments up before the message content in chat completions payload (#23406) 2026-05-20 13:04:01 +02:00
Xuan-Son Nguyen e2b129e1bf mtmd: fit_params now take into account mmproj (#21489)
* mtmd: fit_params now take into account mmproj

* rename alloc_compute_meta to reserve_compute_meta

* rm unused functions

* add ggml_backend_dev_t support

* add debug log
2026-05-20 11:27:44 +02:00
Aleksander Grygier 5028447384 ui: Refactor isMobile as reactive value in viewport store (#23330)
* refactor: `isMobile` as reactive value in `viewport` store

* refactor: Use Svelte media query for the viewport store
2026-05-20 10:52:00 +02:00
Aleksander Grygier 585080d310 fix: Div wrapper no pointer events on hidden (#23390) 2026-05-20 09:46:31 +02:00
Aleksander Grygier 67ace021da refactor: Chat Screen UI rendering (#23333) 2026-05-19 22:38:42 +02:00
Johannes Gäßler 7256fce047 common: fix --fit verbosity with --verbosity 4 (#23282) 2026-05-19 21:33:23 +02:00
Georgi Gerganov d14ce3dab4 llama : MTP clean-up (#23269)
* llama : disable equal splits for recurrent memory with partial rollback

* spec : re-enable p-min with MTP drafts

* spec : re-enable ngram spec in combination with RS rollback

* spec : fix ngram-map-* params

* spec : fix acceptance logic in combined ngram + draft configs

* graph : fix reuse for combined `token` + `embd` batches

* spec : log parameters for each speculative implementation

- add LOG_INF in each constructor with implementation type and parameters
- extract device string logic into common_speculative_get_devices_str()
- move 'adding speculative implementation' log from init into constructors

Assisted-by: llama.cpp:local pi

* spec : extend --spec-default with ngram-map-k4v

Assisted-by: llama.cpp:local pi

* minor : fix n_embd log

* args : update draft.n_max == 3 + regen docs

* spec : relax ngram-mod rejection thold to 0.25 @ 5 low

* logs : improve

* docs : update speculative decoding CLI argument documentation

- Add missing draft model CPU scheduling and tensor override parameters
- Update --spec-type to include all available types (excluding draft-eagle3 WIP)
- Fix default values to match implementation (n_max=3, n_min=0, p_min=0.0)
- Remove deprecated options (spec-draft-ctx-size, spec-draft-replace)
- Add environment variables for new parameters

Assisted-by: llama.cpp:local pi

* arg : step-back on adding k4v to the default spec config

* cont : fix name
2026-05-19 15:32:58 +03:00
Aleksander Grygier 6db130445d ui: Bump packages + address build warnings (#23300)
* chore: Update vulnerable packages

* chore: Formatting

* refactor: Update Tailwind CSS imports

* ci: Use `ubuntu-latest` for Unit/E2E UI tests

* chore: Bump package

* fix: Add missing tag

* refactor: Enums files naming
2026-05-19 10:16:04 +02:00
Pascal ccee426426 server-context: guarantee there is at least 1 token to decode (#23280) 2026-05-19 09:49:01 +03:00
Georgi Gerganov 3c81c8deea server : print graphs reused in slot timings (#23279)
Add graphs reused counter to the per-slot timing output, printed via
llama_perf_context().

Assisted-by: llama.cpp:local pi

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
2026-05-19 09:46:58 +03:00
Aleksander Grygier 3a9c1b854d ui: Update KaTeX package and clean up logs from sass warnings (#23275)
* ui: migrate katex imports to @use to resolve SCSS deprecation warnings

* ci: Use `ubuntu-slim` for CI (UI) workflow
2026-05-18 16:26:01 +02:00
Aleksander Grygier b9a2170fce feat: add scroll-to-bottom button to chat + prevent forced scroll down (#23270) 2026-05-18 16:17:21 +02:00
Aleksander Grygier 1ff0fc1384 ui: Refactor models store, MCP service, and gate logs behind VITE_DEBUG (#23236)
* refactor: Scope console logs to `DEV` + `VITE_DEBUG` env vars

* refactor: skip MCP proxy probe when no server requires it

* refactor: suppress expected disconnect errors during MCP client shutdown

* refactor: Deduplicate requests

* refactor: deduplicate model fetching across ROUTER and MODEL modes

* refactor: Clean up models logic

* chore: Add `.env.example` file

* refactor: replace client-side CORS proxy probe with server status flag

* refactor: Post-review fixes

* test: add vitest client setup with API fetch mocks
2026-05-18 16:09:40 +02:00
Aleksander Grygier a135ec0baa ui: Centralize monospace font styles in app.css (#23272) 2026-05-18 15:10:14 +02:00