llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-06-28 15:20:20 +00:00

Author	SHA1	Message	Date
Anuj Attri	10786217e9	server : return HTTP 400 on invalid grammar (#24144 ) (#24154 ) Throw on grammar parse failure so the server returns HTTP 400 instead of silently dropping the constraint. Add a regression test for the invalid-grammar response. Fixes #24144	2026-06-18 12:49:14 +02:00
Xuan-Son Nguyen	552258c535	server: (router) rework -hf preset repo (#24739 ) * server: temporary remove HF remote preset * rework remove preset.ini support * rm unused get_remote_preset_whitelist() * print warning * add docs * rm stray file	2026-06-18 12:45:23 +02:00
Xuan-Son Nguyen	4b4d13ae72	server: (router) add model management API (#23976 ) * wip * server: (router) add SSE realtime updates API * nits * wip * add download API * add download api * update docs * add delete endpoint * fix std::terminate * fix crash * fix 2 * add tests * nits	2026-06-17 18:04:58 +02:00
Max Krasnyansky	cda63856b8	common: update logging to enforce max_capacity and optimize queue resizing (#24490 ) * common: update logging to enforce max_capacity and optimize queue resizing logic * common/log: remove queue expansion logic	2026-06-17 09:19:11 +03:00
Ruixiang Wang	a1824902b5	spec: add backend sampling support for eagle3 (#24655 )	2026-06-16 12:05:52 +03:00
Ruixiang Wang	635b65ad7a	spec: add spec metrics mean acceptance length and acceptance rate per position (#24536 ) * spec: add spec metrics mean acceptance length and acceptance per pos * fix as suggestion Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix as suggestion Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix as suggestion Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix as suggestions --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-06-16 10:23:09 +03:00
Tarek Dakhran	7dad2f1a17	chat : fix LFM2 tool-call parsing double-escaping (#24667 ) * Add escape test cases * chat : fix LFM2 tool-call parsing double-escaping	2026-06-15 22:10:09 +02:00
Piotr Wilkin (ilintar)	38d546330a	chat: include full unparsed prompt in debug (#24650 ) message on parse error	2026-06-15 17:33:54 +02:00
Pascal	581e8eca8b	chat: harden peg-native tool call parsing (#24329 ) * chat: harden peg-native tool call parsing accept an optional leading type: function field in build_json_tools_flat_keys so openai style tool calls parse on templates whose serialization opens on the name field. return a clean error and log the unparsed fragment on a final peg parse failure instead of throwing the raw parser position and input. keep the raw arguments string in func_args_not_string when it is not valid json instead of aborting the prompt render. * chat: surface peg-native parse failures a final peg parse failure threw the raw parser position and input. log the unparsed fragment and raise a clearer error instead, so a model output that does not match the expected format no longer fails silently with an empty assistant turn. minimal change, no behavior change on successful parses. * chat: handle openai style tool calls in peg-native * nits * common: scope OpenAI wrapper grammar trigger via autoparser flag * chat: gate type:function parsing leniency on the analysis flag Thread accept_openai_wrapper from the generator to build_json_tools_flat_keys so the leading "type": "function" field is accepted only when openai_wrapper_trigger is set.	2026-06-15 15:37:04 +02:00
Piotr Wilkin (ilintar)	0ae3f450f0	chat: fix an "oldie but goodie" grammar generator bug that surfaced during last changes (#24653 ) * chat: fix an "oldie but goodie" grammar generator bug that surfaced during last changes * update erroneous case in PEG parser test	2026-06-15 15:27:47 +02:00
Piotr Wilkin (ilintar)	a6dff71270	chat: fix whitespace problems once and for all (#24624 ) * chat: fix whitespace problems once and for all * Purge trailing spaces from grammar generation * Revert "Purge trailing spaces from grammar generation" This reverts commit `b0827ecb7d`.	2026-06-15 08:27:10 +02:00
Piotr Wilkin (ilintar)	aedb2a5e9c	chat: add dedicated Cohere2MoE (North Code) parser (#24615 ) * chat: add dedicated Cohere2MoE (North Code) parser * Some renames to make @CISC happy :>	2026-06-14 20:17:40 +02:00
Sigbjørn Skjæret	acd79d603c	jinja : add count/d/e filter aliases (#24606 )	2026-06-14 15:07:31 +02:00
Sigbjørn Skjæret	f05cf4676a	jinja : fix negative step slice with start/stop values (#24580 )	2026-06-13 18:28:40 +02:00
Sigbjørn Skjæret	341babcf73	jinja : fix split and replace with empty first arg (#24574 ) * fix split and replace with empty first arg * fix reserve size	2026-06-13 16:56:59 +02:00
Georgi Gerganov	d8a24ccee2	fit : wrap llama_device_memory_data (#24522 )	2026-06-13 08:09:52 +03:00
Xuan-Son Nguyen	e37abd6b5f	mtmd: add batching API (#24384 ) * mtmd: add batching API * wip * first working version (gemma4v) * add arg * nits * wire up support_batch() * fix 0.0 output embd * fix audio * nits * refactor a bit * nits * fix non-batching case * fix comment	2026-06-13 00:10:29 +02:00
Georgi Gerganov	02182fc5b9	fit : avoid including llama-ext.h in fit.h (#24506 )	2026-06-12 15:57:05 +03:00
Ruixiang Wang	88a39274ec	spec: add EAGLE3 speculative decoding support (#18039 ) * llama : enable layer input extraction * spec: support eagle3 * eagle3: fix params bug * eagle3: support Gemma4 eagle3 from RedHatAI * eagle3: set sync when get features from target Co-authored-by: tnhnyzc <115956684+tnhnyzc@users.noreply.github.com> * eagle3 : fix ubatch handling in embd_layer_inp extraction and encoder Co-authored-by: Doğaç Eldenk <dogacel@gmail.com> * eagle3: adapt to upstream changes * eagle3: fix rebase issues and adapt to upstream changes * eagle3: exclude the eagle3 arch from test-llama-archs * eagle3: fix editorconfig check failures * eagle3: fix multi-seq issue in d2t vocab mapping * cont : minor style / clean-up * spec : remove `common_speculative_setup_draft_model()` * llama : clean-up unused API * eagle3: set d2t vocab mapping in decode graph * cont : assert layer inputs are configured * hparams : use n_embd_inp instead of n_embd_target_features * eagle3: make output.weight optional and inherit from target model when needed * hparams : generic norm-before-residual param * llama-ext : consistent names * cont : fix * hparams : remove target_hidden_size * cparams : rename output_layer_inp -> embeddings_layer_inp * arch : reuse ATTN_NORM_2 instead of adding new hidden norm * llama : clean-up names * cont : add assert + comment * Update conversion/llama.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: tnhnyzc <115956684+tnhnyzc@users.noreply.github.com> Co-authored-by: Doğaç Eldenk <dogacel@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-06-12 10:21:06 +03:00
Tarek Dakhran	d2462f8f7a	chat: fix LFM2/LFM2.5 ignoring json_schema (#24377 ) The LFM2 specialized template handler only built a grammar for tool-calling, silently ignoring json_schema from response_format.	2026-06-10 14:41:41 +02:00
ddh0	d2e22ed975	speculative : fix "ngram-map-k4v" name in logging (#24253 ) This is a non-functional change. When using `--spec-type ngram-map-k4v`, the log messages at startup and runtime say `ngram-map-k`. Added logic in the constructor of `common_speculative_impl_ngram_map_k` to pass the correct `COMMON_SPECULATIVE_TYPE_NGRAM_MAP_K4V` when `config.key_only` is `false`. After this change, the log messages use the correct name.	2026-06-10 09:31:35 +02:00
jacekpoplawski	1e912561dd	server: log prompts to directory (#22031 ) * server: log prompts to directory Add `--log-prompts-dir` to write each prompt to a separate text file in the specified directory. * Apply suggestion from @ngxson --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2026-06-09 12:09:07 +02:00
fiesh	961e9a3e46	server : do not clear slots without unified KV cache (#24190 ) * Always export idle slots to RAM Without this, a slot's VRAM cache may not be written to RAM. If this slot happens to be busy then later on, this triggers needless preprocessing in another slot. * cont : clean-up --------- Co-authored-by: Christoph Weiss <weiss@wsoptics.de> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-06-09 10:45:16 +03:00
Xuan-Son Nguyen	8f83d6c271	mtmd : add video input support (#24269 ) * wip * ok: lazy bitmap API * remember to free lazy text * wip * add mtmd_helper_video * support video input on server (base64 input) * add MTMD_VIDEO config * add timestamp * update CLI * cli: allow auto-completion for video * add --video arg * fix build * update docs * rename as suggested	2026-06-08 14:40:12 +03:00
ddh0	9e3b928fd8	common : relax sampler name matching (#23744 ) * common : relax sampler name matching Currently, in some cases, the alternative names for samplers (like `top-k` and `min-p` instead of the canonical `top_k` and `min_p`) are not always recognized by the `common_sampler_types_from_names` function in `common/sampling.cpp`. This PR changes the signature of this function to remove the `bool allow_alt_names` flag, and removes all occurrences of the flag from call sites. Therefore, the function will now always match all known names. I also changed the logic of the function to unconditionally check the provided sampler names against both the canonical and alternative names, and to be case-insensitive. This fixes an issue I was seeing wherein samplers specified in the `llama-server` UI were not recognized as valid when the alternative names were used. * add more alt names * cont. fix * cast to unsigned char for correctness * common : unify sampler name mapping * annotate canonical vs. alt sampler name mappings per @CISC * Update common/sampling.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * common : auto-generate sampler name aliases per @ngxson * use merged map for matching * use `.merge` instead of iterating * nit: simplify comment * nit: use insert everywhere, not index assignment --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-06-07 22:48:11 +02:00
Aman Gupta	04eb4c446d	llama : add Gemma4 MTP (#23398 )	2026-06-07 20:50:54 +08:00
Sigbjørn Skjæret	8a091c47ab	spec : fix vocab compatibility check (#24256 )	2026-06-07 14:43:52 +03:00
konradmb	465b1f0e75	arg: Skip mmproj download when user supplied mmproj (#24239 )	2026-06-07 11:18:44 +02:00
Tarek Dakhran	98d5e8ba8a	common/chat : fix LFM2/LFM2.5 reasoning round-trip and <think> leak (#24234 ) * common/chat : fix LFM2 reasoning round-trip and stray <think> leak * Gate by reasoning format and whether the template supports <think>	2026-06-06 22:39:21 +02:00
Tarek Dakhran	da87e9b612	common/chat : unify and fix LFM2/LFM2.5 tool parser (#24178 )	2026-06-05 14:31:56 -05:00
Xuan-Son Nguyen	260862b8ca	arg: fix double mtp downloads (#24128 )	2026-06-04 19:23:48 +03:00
Bartowski	e7bcf1c3a8	Move duplicated imatrix code into single common imatrix-loader.cpp (#22445 ) * Deduplicate imatrix loading code * Add back LLAMA_TRACE, early exit on quantize missing metadata	2026-06-04 17:45:40 +02:00
Aman Gupta	166fe29492	qwen35: use post-norm hidden state for MTP (#24025 ) * qwen35: use post-norm hidden state for MTP * rename pre_norm to nextn * fix step35	2026-06-04 01:29:09 +08:00
Ryan Mangeno	e3666269f9	arg : removed unnecessary mmproj download when users pass --no-mmproj (#23425 )	2026-06-03 08:04:46 +03:00
Daniel Bevenius	0b7154066e	common : fix state save in common_prompt_batch_decode (#23468 ) * common : fix state save in common_prompt_batch_decode This commit addresses a bug in common_prompt_batch_decode that affects the session state store/restore in completion.cpp and save-load-state.cpp. The motivation for this is that currently the code is saving n-1 tokens in both the session_tokens and in the KV cache. Then when loading the session tokens, and if the prompt matches, it would replay the last saved token (n-1) into the next position, effectively replaying the same token in the wrong position. The fix is to store all n tokens in session_tokens, while the memory state only reflects n-1 processed tokens as the saving happens before the last token is decoded in common_prompt_batch_decode. I ran both completion.cpp and save-load-state.cpp with a transformer, a recurrent, and a hybrid model. Resolves: https://github.com/ggml-org/llama.cpp/issues/23400 Co-authored-by: fairydreaming <166155368+fairydreaming@users.noreply.github.com>	2026-06-02 15:44:15 +02:00
Xuan-Son Nguyen	60130d18f9	server: add SSE ping interval (#24013 )	2026-06-02 14:14:55 +02:00
Aleksander Grygier	f8e67fc583	ui: Add Thinking mode toggle with reasoning effort levels + improvements for Chat Form Add Action UI (#23434 ) * feat: Add "Thinking" toggle and status icon + redesign Chat Form Actions Add panel * test: Update test reference * fix: Icon * fix: E2E test command * fix: wait for greeting h1 to be visible in e2e test * fix: remove duplicate PDF option in attachment dropdown * fix: use label-based group toggle to avoid stale references * refactor: inline MCP server and tool toggles in mobile sheet * fix: serve correct build directory in e2e playwright config * feat: add reasoning effort levels selector in model dropdown * feat: Reasoning effort * refactor: Make server origin configurable via environment variable * feat: Add chat template thinking detector utility * feat: Add thinking support detection to models store * refactor: Update model selector components with thinking detection and message-specific indicators * feat: Update chat form components for model selection and thinking support * feat: Improve Reasoning controls UI * refactor: Apply suggestions from code review Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * fix: Model tags * refactor: Cleanup * refactor: Remove unneeded components * refactor: Cleanup	2026-06-02 10:23:19 +02:00
Georgi Gerganov	4f3a4beb8d	llama : deprecate `llama_set_warmup` (#24009 ) * llama : deprecate `llama_set_warmup` * cont : fix type Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com> --------- Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2026-06-02 10:30:38 +03:00
Pascal	354ebac8cb	server: real-time reasoning interruption via control endpoint (#23971 ) * server: real-time reasoning interruption via control endpoint Builds on the manual reasoning budget trigger from #23949. Adds a CONTROL task that mirrors the CANCEL path on the live slot and calls common_sampler_reasoning_budget_force to end thinking mid-generation. POST /v1/chat/completions/control with { id_slot, action }, opt-in reasoning_control arms the budget sampler on demand. Router and single model. Minimal WebUI button as a skeleton for further UI work. * ui: track reasoning phase via explicit streaming state Add isReasoning to the chat store, mirroring the isLoading pattern: per conversation map, private setter, public accessor and reactive export. Set from the stream callbacks, true on reasoning chunks, false on the first content chunk, reset on stream end and resynced on conversation switch. The skip button now keys off isReasoning so it shows only during the thinking phase, not the whole generation. * ui: extract control endpoint and action into constants Move the chat completion routes, the slots route and the reasoning control action out of chat.service into api-endpoints and a dedicated control-actions module. No behavior change, drops the magic strings so the control protocol has a single source of truth. * server: target reasoning control by completion id Address @ngxson review on the control endpoint. Switch from id_slot to the chat completion id to avoid a TOCTOU: the slot can be reassigned between the lookup and the control request, so matching the live completion (oaicompat_cmpl_id) is safe and a finished one simply matches nothing. Rename the action to reasoning_end, guard it on the reasoning_control flag of the target slot, and reduce the response to {success} with an optional message. * ui: target reasoning control by completion id Keep the streamed completion id on the message and post it back to the control endpoint instead of probing /slots. Drops the slot discovery and the TOCTOU that came with it. Action renamed to reasoning_end, response read as {success}. * server: address review from @ngxson Move the control fields into task_params and drop the redundant comments on the control path. * server: document the reasoning control endpoint * Update tools/ui/src/lib/types/database.d.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * ui: rename cmplId to completionId Per @allozaur review, clearer name for the streamed completion id. * ui: wire completion id capture through the agentic flow The webui streams through the agentic flow, which relayed onModel but not onCompletionId, so the completion id never reached the message and the control request was never sent. Relay it through the flow and its callbacks type, declare id on the chunk type, and log an explicit error when the button fires without a usable id. * ui: target reasoning control model from the message The model is a property of the completion, so read it from the streaming message like the id, not from the model dropdown which is unrelated UI state. Makes the request self-consistent by construction instead of just unlikely to drift. --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2026-06-02 07:26:20 +02:00
Georgi Gerganov	5dcb711666	speculative : fix n_outputs_max and remove draft-simple auto-enable (#23988 ) * speculative : add common_speculative_n_max helper function Extract the speculative max-draft-size logic from server_n_outputs_max into a reusable common_speculative_n_max() function in common/speculative. Assisted-by: llama.cpp:local pi * cont : draft context always has n_parallel outputs * llama : log n_outputs_max * speculative : remove draft-simple auto-enable * ci : enable server tests on PRs	2026-06-01 22:26:58 +03:00
Aman Gupta	de6f727aae	llama: limit max outputs of `llama_context` (#23861 ) * llama: save more VRAM by reserving n_outputs == n_seqs when possible * add n_outputs_per_seq * move n_outputs_max to server-context * change ubatch to batch everywhere	2026-06-01 18:01:38 +03:00
Aldehir Rojas	5254a7994d	common : support manually triggering the reasoning budget end sequence (#23949 )	2026-06-01 11:37:11 +02:00
Xuan-Son Nguyen	06d26dfdff	download: add option to skip_download (#23059 ) * download: add option to skip_download * fix * fix 2 * if file doesn't exist, respect skip_download flag	2026-05-29 16:30:55 +02:00
Xuan-Son Nguyen	cb47092b00	server: bump timeout to 3600s (#23842 ) * server: bump timeout to 3600s * nits: change wording	2026-05-29 10:23:17 +02:00
Omid Azizi	b000431a0b	ngram-mod : Add missing include (#23857 ) [no release] Signed-off-by: Omid Azizi <oazizi@gimletlabs.ai>	2026-05-29 09:21:37 +03:00
Adrien Gallouët	98e480a32e	app : move licences to llama-app (#23824 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-05-29 07:46:11 +02:00
Mikolaj Kucharski	7fb1e70b59	arg: Add LLAMA_ARG_API_KEY_FILE environment variable for --api-key-file (#23167 )	2026-05-28 16:25:40 +02:00
Georgi Gerganov	6b4e4bd582	common : fix env names to all have LLAMA_ARG_ prefix (#23778 )	2026-05-27 14:52:47 +03:00
jacekpoplawski	e2ef8fe42c	server: fix checkpoints creation (#22929 ) * common : add common_chat_split_by_role * cont : fix spans to reach end of message * server: fix checkpoints creation - extract message_spans from chat templates - find the prompt token position before the latest user message - split prompt batching at that position - create a context checkpoint before the latest user input - avoid periodic mid-prompt checkpoints when that position is known - handle multimodal prompts when mapping text/template positions to server prompt tokens - add --checkpoint-min-step to control minimum spacing between checkpoints * cont : clean-up * Support autoparser detection for message barriers * server: fix message span delimiter and update docs --------- Co-authored-by: Alde Rojas <hello@alde.dev> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>	2026-05-25 08:56:18 +03:00
Aman Gupta	83eebe9d08	server: add margin for draft model for `fit` (#23485 )	2026-05-24 14:43:08 +08:00
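
Note on commit 9e3b928fd8 (common : relax sampler name matching): the matching behavior it describes, always checking canonical and alternative names and ignoring case, can be sketched roughly as follows. This is an illustrative Python sketch, not the C++ code in `common/sampling.cpp`. The alias table, the helper names and the choice to skip unknown names are assumptions made for the example; only the `top_k`/`top-k` and `min_p`/`min-p` pairs come from the commit message.

```python
# Hypothetical alias table: canonical sampler name -> alternative spellings.
# Only these two pairs are mentioned in the commit message; the real table
# in llama.cpp is larger and is not reproduced here.
CANONICAL_ALIASES = {
    "top_k": ["top-k"],
    "min_p": ["min-p"],
}


def build_name_map(table):
    """Merge canonical names and aliases into one lowercase lookup map."""
    merged = {}
    for canonical, aliases in table.items():
        merged[canonical.lower()] = canonical
        for alias in aliases:
            # setdefault keeps the first mapping if two entries collide
            merged.setdefault(alias.lower(), canonical)
    return merged


NAME_MAP = build_name_map(CANONICAL_ALIASES)


def sampler_types_from_names(names):
    """Map user-supplied names to canonical names, case-insensitively.

    Assumption for this sketch: unrecognized names are silently skipped.
    """
    result = []
    for name in names:
        canonical = NAME_MAP.get(name.strip().lower())
        if canonical is not None:
            result.append(canonical)
    return result


print(sampler_types_from_names(["Top-K", "MIN_P", "bogus"]))
```

Building a single merged map up front means each lookup is one dictionary access, and there is no flag deciding whether alternative names are allowed, which mirrors the removal of `allow_alt_names` described in the commit.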


967 Commits