952 Commits

Author SHA1 Message Date
Georgi Gerganov d8a24ccee2 fit : wrap llama_device_memory_data (#24522) 2026-06-13 08:09:52 +03:00
Xuan-Son Nguyen e37abd6b5f mtmd: add batching API (#24384)
* mtmd: add batching API

* wip

* first working version (gemma4v)

* add arg

* nits

* wire up support_batch()

* fix 0.0 output embd

* fix audio

* nits

* refactor a bit

* nits

* fix non-batching case

* fix comment
2026-06-13 00:10:29 +02:00
Georgi Gerganov 02182fc5b9 fit : avoid including llama-ext.h in fit.h (#24506) 2026-06-12 15:57:05 +03:00
Ruixiang Wang 88a39274ec spec: add EAGLE3 speculative decoding support (#18039)
* llama : enable layer input extraction

* spec: support eagle3

* eagle3: fix params bug

* eagle3: support Gemma4 eagle3 from RedHatAI

* eagle3: set sync when get features from target

Co-authored-by: tnhnyzc <115956684+tnhnyzc@users.noreply.github.com>

* eagle3 : fix ubatch handling in embd_layer_inp extraction and encoder

Co-authored-by: Doğaç Eldenk <dogacel@gmail.com>

* eagle3: adapt to upstream changes

* eagle3: fix rebase issues and adapt to upstream changes

* eagle3:exclude the eagle3 arch from test-llama-archs

* eagle3: fix editorconfig check failures

* eagle3: fix multi-seq issue in d2t vocab mapping

* cont : minor style / clean-up

* spec : remove `common_speculative_setup_draft_model()`

* llama : clean-up unused API

* eagle3: set d2t vocab mapping in decode graph

* cont : assert layer inputs are configured

* hparams : use n_embd_inp instead of n_embd_target_features

* eagle3: make output.weight optional and inherit from target model when needed

* haparams : generic norm-before-residual param

* llama-ext : consistent names

* cont : fix

* hparams : remove target_hidden_size

* cparams : rename output_layer_inp -> embeddings_layer_inp

* arch : reuse ATTN_NORM_2 instead of adding new hidden norm

* llama : clean-up names

* cont : add assert + comment

* Update conversion/llama.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: tnhnyzc <115956684+tnhnyzc@users.noreply.github.com>
Co-authored-by: Doğaç Eldenk <dogacel@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-06-12 10:21:06 +03:00
Tarek Dakhran d2462f8f7a chat: fix LFM2/LFM2.5 ignoring json_schema (#24377)
The LFM2 specialized template handler only built a grammar for tool-calling,
silently ignoring json_schema from response_format.
2026-06-10 14:41:41 +02:00
ddh0 d2e22ed975 speculative : fix "ngram-map-k4v" name in logging (#24253)
This is a non-functional change.

When using `--spec-type ngram-map-k4v`, the log messages at startup and
runtime say `ngram-map-k`. Added logic in the in the constructor of
`common_speculative_impl_ngram_map_k` to pass the correct
`COMMON_SPECULATIVE_TYPE_NGRAM_MAP_K4V` when `config.key_only` is
`false`.

After this change, the log messages use the correct name.
2026-06-10 09:31:35 +02:00
jacekpoplawski 1e912561dd server: log prompts to directory (#22031)
* server: log prompts to directory

Add `--log-prompts-dir` to write each prompt to a separate text file in
the specified directory.

* Apply suggestion from @ngxson

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-06-09 12:09:07 +02:00
fiesh 961e9a3e46 server : do not clear slots without unified KV cache (#24190)
* Always export idle slots to RAM

Without this, a slot's VRAM cache may not be written to RAM.  If this
slot happens to be busy then later on, this triggers needless
preprocessing in another slot.

* cont : clean-up

---------

Co-authored-by: Christoph Weiss <weiss@wsoptics.de>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-06-09 10:45:16 +03:00
Xuan-Son Nguyen 8f83d6c271 mtmd : add video input support (#24269)
* wip

* ok: lazy bitmap API

* remember to free lazy text

* wip

* add mtmd_helper_video

* support video input on server (base64 input)

* add MTMD_VIDEO config

* add timestamp

* update CLI

* cli: allow auto-completion for video

* add --video arg

* fix build

* update docs

* rename as suggested
2026-06-08 14:40:12 +03:00
ddh0 9e3b928fd8 common : relax sampler name matching (#23744)
* common : relax sampler name matching

Currently, in some cases, the alternative names for samplers (like
`top-k` and `min-p` instead of the canonical `top_k` and `min_p`) are
not always recognized by the `common_sampler_types_from_names` function
in `common/sampling.cpp`.

This PR changes the signature of this function to remove the `bool
allow_alt_names` flag, and removes all occurences of the flag from call
sites. Therefore, the function will now always match all known names.

I also changed the logic of the function to unconditionally check the
provided sampler names against both the canonical and alternative names,
and to be case-insensitive.

This fixes an issue I was seeing wherein samplers specified in the
`llama-server` UI were not recognized as valid when the alternative
names were used.

* add more alt names

* cont. fix

* cast to unsigned char for correctness

* common : unify sampler name mapping

* annotate canonical vs. alt sampler name mappings per @CISC

* Update common/sampling.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* common : auto-generate sampler name aliases per @ngxson

* use merged map for matching

* use `.merge` instead of iterating

* nit: simplify comment

* nit: use insert everywhere, not index assignment

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-06-07 22:48:11 +02:00
Aman Gupta 04eb4c446d llama : add Gemma4 MTP (#23398) 2026-06-07 20:50:54 +08:00
Sigbjørn Skjæret 8a091c47ab spec : fix vocab compatibility check (#24256) 2026-06-07 14:43:52 +03:00
konradmb 465b1f0e75 arg: Skip mmproj download when user supplied mmproj (#24239) 2026-06-07 11:18:44 +02:00
Tarek Dakhran 98d5e8ba8a common/chat : fix LFM2/LFM2.5 reasoning round-trip and <think> leak (#24234)
* common/chat : fix LFM2 reasoning round-trip and stray <think> leak
* Gate by reasoning format and whether the template supports <think>
2026-06-06 22:39:21 +02:00
Tarek Dakhran da87e9b612 common/chat : unify and fix LFM2/LFM2.5 tool parser (#24178) 2026-06-05 14:31:56 -05:00
Xuan-Son Nguyen 260862b8ca arg: fix double mtp downloads (#24128) 2026-06-04 19:23:48 +03:00
Bartowski e7bcf1c3a8 Move duplicated imatrix code into single common imatrix-loader.cpp (#22445)
* Deduplicate imatrix loading code

* Add back LLAMA_TRACE, early exit on quantize missing metadata
2026-06-04 17:45:40 +02:00
Aman Gupta 166fe29492 qwen35: use post-norm hidden state for MTP (#24025)
* qwen35: use post-norm hidden state for MTP

* rename pre_norm to nextn

* fix step35
2026-06-04 01:29:09 +08:00
Ryan Mangeno e3666269f9 arg : removed unecesary mmproj download when users pass --no-mmproj (#23425) 2026-06-03 08:04:46 +03:00
Daniel Bevenius 0b7154066e common : fix state save in common_prompt_batch_decode (#23468)
* common : fix state save in common_prompt_batch_decode

This commit addresses a bug in common_prompt_batch_decode that affects
the session state store/restore in completion.cpp and
save-load-state.cpp.

The motivation for this is that currently the code is saving n-1 tokens
in both the session_tokens and in the KV cache. Then when loading the
session tokens, and if the prompt matches, it would replay the last
saved token (n-1) into the next position, effectively replaying the
same token in the wrong position.

The fix is to store all n tokens in session_tokens, while the memory
state only reflects n-1 processed tokens as the saving happens before
the last token is decoded in common_prompt_batch_decode.

I ran both completion.cpp and save-load-state.cpp with a transformer, a
recurrent, and a hybrid model.

Resolves: https://github.com/ggml-org/llama.cpp/issues/23400

Co-authored-by: fairydreaming <166155368+fairydreaming@users.noreply.github.com>
2026-06-02 15:44:15 +02:00
Xuan-Son Nguyen 60130d18f9 server: add SSE ping interval (#24013) 2026-06-02 14:14:55 +02:00
Aleksander Grygier f8e67fc583 ui: Add Thinking mode toggle with reasoning effort levels + improvements for Chat Form Add Action UI (#23434)
* feat: Add "Thinking" toggle and status icon + redesign Chat Form Actions Add panel

* test: Update test reference

* fix: Icon

* fix: E2E test command

* fix: wait for greeting h1 to be visible in e2e test

* fix: remove duplicate PDF option in attachment dropdown

* fix: use label-based group toggle to avoid stale references

* refactor: inline MCP server and tool toggles in mobile sheet

* fix: serve correct build directory in e2e playwright config

* feat: add reasoning effort levels selector in model dropdown

* feat: Reasoning effort

* refactor: Make server origin configurable via environment variable

* feat: Add chat template thinking detector utility

* feat: Add thinking support detection to models store

* refactor: Update model selector components with thinking detection and message-specific indicators

* feat: Update chat form components for model selection and thinking support

* feat: Improve Reasoning controls UI

* refactor: Apply suggestions from code review

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* fix: Model tags

* refactor: Cleanup

* refactor: Remove unneeded components

* refactor: Cleanup
2026-06-02 10:23:19 +02:00
Georgi Gerganov 4f3a4beb8d llama : deprecate llama_set_warmup (#24009)
* llama : deprecate `llama_set_warmup`

* cont : fix type

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2026-06-02 10:30:38 +03:00
Pascal 354ebac8cb server: real-time reasoning interruption via control endpoint (#23971)
* server: real-time reasoning interruption via control endpoint

Builds on the manual reasoning budget trigger from #23949. Adds a
CONTROL task that mirrors the CANCEL path on the live slot and calls
common_sampler_reasoning_budget_force to end thinking mid-generation.
POST /v1/chat/completions/control with { id_slot, action }, opt-in
reasoning_control arms the budget sampler on demand. Router and single
model. Minimal WebUI button as a skeleton for further UI work.

* ui: track reasoning phase via explicit streaming state

Add isReasoning to the chat store, mirroring the isLoading pattern:
per conversation map, private setter, public accessor and reactive
export. Set from the stream callbacks, true on reasoning chunks, false
on the first content chunk, reset on stream end and resynced on
conversation switch. The skip button now keys off isReasoning so it
shows only during the thinking phase, not the whole generation.

* ui: extract control endpoint and action into constants

Move the chat completion routes, the slots route and the reasoning
control action out of chat.service into api-endpoints and a dedicated
control-actions module. No behavior change, drops the magic strings so
the control protocol has a single source of truth.

* server: target reasoning control by completion id

Address @ngxson review on the control endpoint.

Switch from id_slot to the chat completion id to avoid a TOCTOU: the
slot can be reassigned between the lookup and the control request, so
matching the live completion (oaicompat_cmpl_id) is safe and a finished
one simply matches nothing. Rename the action to reasoning_end, guard
it on the reasoning_control flag of the target slot, and reduce the
response to {success} with an optional message.

* ui: target reasoning control by completion id

Keep the streamed completion id on the message and post it back to the
control endpoint instead of probing /slots. Drops the slot discovery
and the TOCTOU that came with it. Action renamed to reasoning_end,
response read as {success}.

* server: address review from @ngxson

Move the control fields into task_params and drop the redundant
comments on the control path.

* server: document the reasoning control endpoint

* Update tools/ui/src/lib/types/database.d.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* ui: rename cmplId to completionId

Per @allozaur review, clearer name for the streamed completion id.

* ui: wire completion id capture through the agentic flow

The webui streams through the agentic flow, which relayed onModel but
not onCompletionId, so the completion id never reached the message and
the control request was never sent. Relay it through the flow and its
callbacks type, declare id on the chunk type, and log an explicit error
when the button fires without a usable id.

* ui: target reasoning control model from the message

The model is a property of the completion, so read it from the streaming
message like the id, not from the model dropdown which is unrelated UI
state. Makes the request self-consistent by construction instead of just
unlikely to drift.

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-06-02 07:26:20 +02:00
Georgi Gerganov 5dcb711666 speculative : fix n_outputs_max and remove draft-simple auto-enable (#23988)
* speculative : add common_speculative_n_max helper function

Extract the speculative max-draft-size logic from server_n_outputs_max
into a reusable common_speculative_n_max() function in common/speculative.

Assisted-by: llama.cpp:local pi

* cont : draft context always has n_parallel outputs

* llama : log n_outputs_max

* speculative : remove draft-simple auto-enable

* ci : enable server tests on PRs
2026-06-01 22:26:58 +03:00
Aman Gupta de6f727aae llama: limit max outputs of llama_context (#23861)
* llama: save more VRAM by reserving n_outputs == n_seqs when possible

* add n_outputs_per_seq

* move n_outputs_max to server-context

* change ubatch to batch everywhere
2026-06-01 18:01:38 +03:00
Aldehir Rojas 5254a7994d common : support manually triggering the reasoning budget end sequence (#23949) 2026-06-01 11:37:11 +02:00
Xuan-Son Nguyen 06d26dfdff download: add option to skip_download (#23059)
* download: add option to skip_download

* fix

* fix 2

* if file doesn't exist, respect skip_download flag
2026-05-29 16:30:55 +02:00
Xuan-Son Nguyen cb47092b00 server: bump timeout to 3600s (#23842)
* server: bump timeout to 3600s

* nits: change wording
2026-05-29 10:23:17 +02:00
Omid Azizi b000431a0b ngram-mod : Add missing include (#23857)
[no release]

Signed-off-by: Omid Azizi <oazizi@gimletlabs.ai>
2026-05-29 09:21:37 +03:00
Adrien Gallouët 98e480a32e app : move licences to llama-app (#23824)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-29 07:46:11 +02:00
Mikolaj Kucharski 7fb1e70b59 arg: Add LLAMA_ARG_API_KEY_FILE environment variable for --api-key-file (#23167) 2026-05-28 16:25:40 +02:00
Georgi Gerganov 6b4e4bd582 common : fix env names to all have LLAMA_ARG_ prefix (#23778) 2026-05-27 14:52:47 +03:00
jacekpoplawski e2ef8fe42c server: fix checkpoints creation (#22929)
* common : add common_chat_split_by_role

* cont : fix spans to reach end of message

* server: fix checkpoints creation

- extract message_spans from chat templates
- find the prompt token position before the latest user message
- split prompt batching at that position
- create a context checkpoint before the latest user input
- avoid periodic mid-prompt checkpoints when that position is known
- handle multimodal prompts when mapping text/template positions to server prompt tokens
- add --checkpoint-min-step to control minimum spacing between checkpoints

* cont : clean-up

* Support autoparser detection for message barriers

* server: fix message span delimiter and update docs

---------

Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
2026-05-25 08:56:18 +03:00
Aman Gupta 83eebe9d08 server: add margin for draft model for fit (#23485) 2026-05-24 14:43:08 +08:00
Aldehir Rojas b22ff4b7b4 cmake/ui : refactor the build (#23352) 2026-05-23 17:08:22 -04:00
Aldehir Rojas 1acee6bf89 server: only parse empty msg if continuing an assistant msg (#23506) 2026-05-22 11:58:15 -04:00
Gaurav Garg ad27757261 Move to backend sampling for MTP draft path (#23287)
* Move to backend sampling for MTP draft path

Run top_k(10) on the draft backend. D2H transfers happen only for the top 10 logits

Make backend sampling more robust and fallback to CPU on failure cases, such as with "-sm tensor" or when a backend doesn't support TOP_K.

* Allow sampler chains to be partially offloaded to backend

* Add --spec-draft-backend-sampling argument. Enabled by default.
2026-05-20 22:34:45 +05:30
Georgi Gerganov 510b5c2a35 common/speculative : fix nullptr crash in get_devices_str (#23386)
ggml_backend_dev_by_name always appends a nullptr sentinel to the devices
vector. Skipping nullptr entries prevents assertion failure in
ggml_backend_dev_name.

Assisted-by: llama.cpp:local pi
2026-05-20 19:44:30 +03:00
Johannes Gäßler 57cb35c886 common: fix --help for --verbosity (#23278) 2026-05-19 21:34:04 +02:00
Johannes Gäßler 7256fce047 common: fix --fit verbosity with --verbosity 4 (#23282) 2026-05-19 21:33:23 +02:00
Georgi Gerganov d14ce3dab4 llama : MTP clean-up (#23269)
* llama : disable equal splits for recurrent memory with partial rollback

* spec : re-enable p-min with MTP drafts

* spec : re-enable ngram spec in combination with RS rollback

* spec : fix ngram-map-* params

* spec : fix acceptance logic in combined ngram + draft configs

* graph : fix reuse for combined `token` + `embd` batches

* spec : log parameters for each speculative implementation

- add LOG_INF in each constructor with implementation type and parameters
- extract device string logic into common_speculative_get_devices_str()
- move 'adding speculative implementation' log from init into constructors

Assisted-by: llama.cpp:local pi

* spec : extend --spec-default with ngram-map-k4v

Assisted-by: llama.cpp:local pi

* minor : fix n_embd log

* args : update draft.n_max == 3 + regen docs

* spec : relax ngram-mod rejection thold to 0.25 @ 5 low

* logs : improve

* docs : update speculative decoding CLI argument documentation

- Add missing draft model CPU scheduling and tensor override parameters
- Update --spec-type to include all available types (excluding draft-eagle3 WIP)
- Fix default values to match implementation (n_max=3, n_min=0, p_min=0.0)
- Remove deprecated options (spec-draft-ctx-size, spec-draft-replace)
- Add environment variables for new parameters

Assisted-by: llama.cpp:local pi

* arg : step-back on adding k4v to the default spec config

* cont : fix name
2026-05-19 15:32:58 +03:00
Georgi Gerganov cd963fee6a save-load-state : refactor tests and improve readability (#23196)
* save-load-state : refactor into separate phase functions

- Split monolithic main() into 4 self-contained phase functions, each
  managing its own context/sampler/batch lifecycle
- Each function tokenizes internally using its local ctx instance
- main() is now a clean orchestrator: init -> run phases -> assert results
- Proper resource cleanup on every exit path (return {} on error)

Assisted-by: llama.cpp:local pi

* save-load-state : use params.out_file instead of separate state_file

- Remove state_file parameter from all phase functions
- Each function accesses params.out_file directly
- Initialize params.out_file in main alongside params.prompt

Assisted-by: llama.cpp:local pi

* save-load-state : use smart pointers for ctx and smpl

- Replace raw llama_context* with llama_context_ptr
- Replace raw llama_sampler* with llama_sampler_ptr
- Remove all manual llama_free() and llama_sampler_free() calls
- Keep llama_batch as raw (managed manually with llama_batch_free)

Assisted-by: llama.cpp:local pi

* save-load-state : add local llama_batch_ptr RAII wrapper

- Add llama_batch_ptr struct holding llama_batch by value
- Calls llama_batch_free() in destructor
- Eliminates all manual llama_batch_free() calls

Assisted-by: llama.cpp:local pi

* save-load-state : replace printf/fprintf with logging macros

- Add log.h include
- Replace fprintf(stderr, ...) errors with LOG_ERR
- Replace fprintf(stderr, ...) info with LOG_TRC
- Replace printf output with LOG

Assisted-by: llama.cpp:local pi

* save-load-state : refactor tests to check results inline

Each follow-up phase now accepts an expected result and performs
the comparison internally instead of collecting results in main().

Assisted-by: llama.cpp:local pi

* save-load-state : improve test output readability

Add phase labels, remove redundant run prefixes, and show
PASS after each test.

Assisted-by: llama.cpp:local pi

* pi : add rule about git signing

* save-load-state : simplify llama_batch_ptr

Change get() to return a reference and remove operator*().
Use batch.get() throughout for consistency.

Assisted-by: llama.cpp:local pi

* save-load-state : extract generate_tokens helper

Factor out the repeated token generation loop into a shared
helper function used by all phases.

Assisted-by: llama.cpp:local pi

* save-load-state : update comments to use test terminology

Replace "Phase" with "Test" and list each test's steps
as bullet points.

Assisted-by: llama.cpp:local pi

* save-load-state : rename test functions

Rename to test_baseline, test_state_load, test_seq_cp_host,
test_seq_cp_device. Update comments and logs accordingly.

Assisted-by: llama.cpp:local pi

* pi : add rule to never git push without confirmation

Assisted-by: llama.cpp:local pi

* common : add model_only option to common_init_from_params

Add bool model_only parameter to skip context creation,
sampler init, and context-dependent setup.

Use in save-load-state to initialize only the model,
with each test creating its own context.

Assisted-by: llama.cpp:local pi

---------

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
2026-05-19 09:46:34 +03:00
Adrien Gallouët 45b455e66f common : remove hf cache migration (#23266)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-18 17:11:47 +02:00
Aldehir Rojas 87589042ca cmake : fix LLAMA_BUILD_UI logic (#23190) 2026-05-17 14:42:26 -04:00
Aman Gupta 3e12fbdea5 llama: avoid copying logits during prompt decode in MTP (#23198)
* llama: avoid copying logits during prompt decode in MTP

* review: update comment

* llama-graph: call set_output for t_h_pre_norm
2026-05-17 23:30:25 +08:00
Aldehir Rojas 39cf5d6191 common : delegate assistant continuation to underlying template handlers (#23089)
* common : delegate assistant continuation to template handler

* server : implement echo parameter to exclude assistant prefill in the response

* server : fix tests for prefill

* server : use existing llama template

* cont : clean up
2026-05-17 13:36:05 +02:00
Aldehir Rojas f4cc787b9f common : enable streaming JSON argument values (#23173)
* common : remove atomic from json arguments

* common : remove parsing logic on JSON arguments
2026-05-17 03:44:34 -05:00
Rares Vernica 1a68ec9378 server : honor --embd-normalize CLI arg (#23125)
The --embd-normalize flag was registered only for the embedding and debug
examples, so llama-server rejected it and the /embedding handler used a
hard-coded default of 2 (L2). Add LLAMA_EXAMPLE_SERVER to the flag's
example set and read params.embd_normalize as the handler's default. The
per-request "embd_normalize" body field continues to override.
2026-05-17 09:39:04 +03:00
ddh0 a16cce81d3 ngram : reduce noisy logs (#23185)
* ngram : reduce noisy logs

* ngram : reduce noisy logs
2026-05-17 09:38:17 +03:00