llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-06-28 15:20:20 +00:00

Author	SHA1	Message	Date
Andrea Richiardi	7ac5a4225e	cmake: skip cvector-generator and export-lora when CPU backend is disabled (#24053 ) b9504	2026-06-04 13:13:19 +03:00
Andrei	e3ba22d6cc	fix(mtmd): handle Gemma 4 audio projector embedding size (#24091 ) * mtmd: handle Gemma 4 audio projector embedding size * rm projection_dim from clip_n_mmproj_embd --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> b9503	2026-06-04 11:51:23 +02:00
Georgi Gerganov	6ddc9430b1	readme : add status badges (#24104 )	2026-06-04 10:58:13 +03:00
Georgi Gerganov	65ef50a0a4	tests : refactor test-save-load-state to accept token input (#24073 ) * tests : refactor test-save-load-state to accept token input - Default prompt is now empty; when not provided, generate n_batch random tokens (useful for models without a tokenizer) - Tokenization happens once upfront; pass token vector to test functions - generate_tokens prints token IDs instead of decoded pieces - Use llama_model_get_vocab / llama_vocab_n_tokens API - Upgrade log level from LOG_TRC to LOG_INF for visibility Assisted-by: llama.cpp:local pi * cont : use llama_tokens alias b9501	2026-06-04 08:06:36 +03:00
Georgi Gerganov	3d1998634e	metal : reduce rset heartbeat from 500ms -> 5ms (#24074 ) b9500	2026-06-04 08:05:32 +03:00
Reese Levine	e8c54893f2	ggml-webgpu: FlashAttention refactor + standardize quantization support (#23834 ) * Start work on flash_attn refactor * Refactor * Split k/v quantization * Refactor and abstract quantization logic for flash_attn and mul_mat * Add quantization support to tile path * formatting * Move to functions, add a check b9499	2026-06-04 08:05:04 +03:00
rehan-10xengineer	3c7450cee1	ggml-cpu: extend RVV quantization vec dot to higher VLENs (#22754 ) * ggml-cpu: add rvv 512b,1024b impls for iq4_xs * ggml-cpu: refactor; add rvv 512b, 1024b impls for q6_K, i-quants * ggml-cpu: refactor; add 512 and 1024 implementations of tq3_s, iq3_xxs, iq2_s, iq2_xs, iq2_xxs improve iq2_xs impl for rvv 256 Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai> --------- Co-authored-by: taimur-10x <taimur.ahmad@10xengineers.ai> Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai> b9498	2026-06-04 08:03:40 +03:00
Todd Malsbary	f478f1b6d7	sycl : Improve SYCL doc (#23025 ) * Tidy up SYCL doc a bit - Add explicit links to referenced items - Fix spelling errors Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> * Correct documented default for GGML_SYCL_GRAPH The default is ON, not OFF: $ cmake -LAH -B build | grep GGML_SYCL_GRAPH ... GGML_SYCL_GRAPH:BOOL=ON Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> * Move docker instructions from SYCL.md to docker.md This makes them directly accessible from the Quick Start section of the top-level README.md. Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> * Refer to intel.Dockerfile for ARGs and their defaults The defaults are always changing; this avoids accuracy errors from duplicating the information. Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> * Remove mention of Nvidia in SYCL row of backend table This support was removed in 2026.02 - refer to the SYCL.md News. Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> --------- Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>	2026-06-04 08:02:54 +03:00
Andrei	94a220cd67	mtmd: fix Gemma 4 unified FPE (#24088 ) b9496	2026-06-03 21:51:18 +02:00
Aman Gupta	166fe29492	qwen35: use post-norm hidden state for MTP (#24025 ) * qwen35: use post-norm hidden state for MTP * rename pre_norm to nextn * fix step35 b9495	2026-06-04 01:29:09 +08:00
Xuan-Son Nguyen	c8d6a00636	mtmd: enable non-causal vision for gemma 4 unified (#24082 ) b9494	2026-06-03 19:05:17 +02:00
Xuan-Son Nguyen	a731805ced	mtmd, model: allow skip build_vit() (#24077 ) * add model * nits b9493	2026-06-03 17:10:35 +02:00
Aleksander Grygier	ee4cf705bb	ui: Mermaid Diagrams in chat + interactive preview (#24032 )	2026-06-03 16:55:36 +02:00
Andreas Kieslinger	9e58d4d692	Avoid PDL race conditions by disabling __restrict__ when PDL is used (#24030 ) * Removes __restrict__ from PDL kernel headers due to incompatibility with PDL. Adds preprocessor directives based on arch in kernel body to add __restrict__ to retain performance on older architectures. * Simplifies new __restrict__ usage via macro * Add hopper to PDL __restrict__ fix. Co-authored-by: Oliver Simons <osimons@nvidia.com> --------- Co-authored-by: Oliver Simons <osimons@nvidia.com> b9491	2026-06-03 13:56:42 +02:00
Charles Xu	3571fa5435	ggml-cpu: use runtime SVE width in FWHT (#24059 ) b9490	2026-06-03 13:45:10 +03:00
Aman Gupta	f8f0a47a55	cuda: reserve space for quantize kv-cache at startup (#23907 ) * cuda: reserve space for quantize kv-cache at startup * address review comments * remove forward decl Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * remove assert in ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b9489	2026-06-03 18:39:59 +08:00
Georgi Gerganov	06938ac129	tests : add support for qwen3 SSM archs (#24031 ) * tests : add support for qwen3 SSM archs * arch : add LLM_KV_ATTENTION_RECURRENT_LAYERS * cont : naming + TODOs b9488	2026-06-03 10:15:27 +03:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	d545a2a993	update BoringSSL to 0.20260526.0 (#23794 ) b9487	2026-06-03 07:42:58 +02:00
Georgi Gerganov	4da6370d43	ci : disable ccache for msvc windows release jobs (#23911 ) b9486	2026-06-03 08:05:21 +03:00
Ryan Mangeno	e3666269f9	arg : removed unnecessary mmproj download when users pass --no-mmproj (#23425 ) b9485	2026-06-03 08:04:46 +03:00
lhez	63e66fdd23	opencl: use flat variants of q4_K and q6_K gemv for very large M (#24006 ) b9484	2026-06-02 14:16:17 -07:00
Max Krasnyansky	5c394fdc8b	hexagon: profiler output fix and script updates (#24042 ) * hex-ops: fix profiler output (ie remove the redundant NONEs) * hex-prof: update profiling script to support tot.usec column b9483	2026-06-02 14:08:29 -07:00
Mikhail Podvitskii	4fb16eccce	model: add Mellum architecture (#23966 ) * model: support for Mellum architecture * model: improve mellum.py formatting * model: improve mellum.py formatting once again * deps: downgrade transformers to 4.57.6 (to fix CI) * deps: remove huggingface_hub dependency * deps: remove huggingface_hub from test requirements --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b9482	2026-06-02 22:11:12 +03:00
Hans Florian	bfb4308b05	model : support granite multilingual embeddings R2 (ibm-granite/granite-embedding-{97,311}m-multilingual-r2) (#22716 ) * Add support for the ibm-granite/granite-embedding-{97m,311m}-multilingual-r2 embedding models: * Added a version of the gpt4o tokenizer that has a fixed regex (better handling of marks), and different token merging setting for the 97m model * Reused gemma4 tokenizer for the 311m model * granite-embedding--multilingual-r2 : add support SwiGLU FFN for Granite Embedding Multilingual R2 added new GGUF key <arch>.hidden_activation (LLM_KV_HIDDEN_ACT) + writer * added a forward declaration of llm_ffn_op_type to llama-hparams.h * added llm_ffn_op in hparams * added LLM_FFN_NONE = 0 sentinel to llm_ffn_op_type (value-initialization), modern-bert: explicitly assigns LLM_FFN_GEGLU before reading GGUF (unchanged). * centralized hidden_act mapping in llama-model.cpp, added llm_ffn_op_type_from_string() helper, mirroring rope_scaling_type/llama_rope_scaling_type_from_string() * modern-bert reads the GGUF key (when present) and uses the resulting op in its FFN graph * Added granite-embedding-{97m,311m}-multilingual-r2 to the converter code * Added the hashes for the granite embedding multilingual R2 models * Set the hidden_activation in the GGUF if the field is present in config.json (such as for the granite embedding models) b9481	2026-06-02 17:55:11 +02:00
Piotr Wilkin (ilintar)	2187e00337	StepFun 3.5 MTP (#23274 ) * StepFun 3.5 MTP * Simplify to single layer * Rollback core changes * fix flake8 errors * Remove scripts * modify to convention * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * dos2unix --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b9480	2026-06-02 17:44:35 +02:00
Daniel Bevenius	0b7154066e	common : fix state save in common_prompt_batch_decode (#23468 ) * common : fix state save in common_prompt_batch_decode This commit addresses a bug in common_prompt_batch_decode that affects the session state store/restore in completion.cpp and save-load-state.cpp. The motivation for this is that currently the code is saving n-1 tokens in both the session_tokens and in the KV cache. Then when loading the session tokens, and if the prompt matches, it would replay the last saved token (n-1) into the next position, effectively replaying the same token in the wrong position. The fix is to store all n tokens in session_tokens, while the memory state only reflects n-1 processed tokens as the saving happens before the last token is decoded in common_prompt_batch_decode. I ran both completion.cpp and save-load-state.cpp with a transformer, a recurrent, and a hybrid model. Resolves: https://github.com/ggml-org/llama.cpp/issues/23400 Co-authored-by: fairydreaming <166155368+fairydreaming@users.noreply.github.com> b9479	2026-06-02 15:44:15 +02:00
Xuan-Son Nguyen	60130d18f9	server: add SSE ping interval (#24013 ) b9478	2026-06-02 14:14:55 +02:00
Georgi Gerganov	a468b89018	ci : reduce self-hosted server workflow jobs (#24012 ) Reduce the number of parallel jobs in server-self-hosted.yml by stacking test configurations as sequential steps within a single job, following the pattern from #23927. - server-metal: 4 matrix jobs -> 1 job with 4 sequential test steps - server-cuda: 2 matrix jobs -> 1 job with 2 sequential test steps - server-kleidiai: removed unnecessary single-entry matrix - removed unused Setup Node.js step from server-metal Total: 7 parallel jobs -> 3 parallel jobs Assisted-by: llama.cpp:local pi	2026-06-02 13:17:59 +03:00
Mikhail Podvitskii	d5ab0834ab	docs : update HOWTO-add-model.md (#23883 ) * docs: update HOWTO-add-model.md with new model registration and graph-building instructions * docs: improve formatting in HOWTO-add-model.md * Update docs/development/HOWTO-add-model.md Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-06-02 11:40:22 +02:00
Marcos Del Sol Vives	69cea5b669	ui: simplify network error handling (#23431 ) Previously error to string conversion was split in two different files, with one converting errors into strings, and another function analyzing those strings to generate yet another string. Now the error handling for network fetches has been centralised and uses HTTP error codes directly where possible to generate the human-readable error strings. It also fixes an issue where all JSON errors reported from the backend, such as "Invalid API key", would incorrectly get turned into "Failed to connect to server" due to poor matching logic in the now-gone getErrorMessage function.	2026-06-02 10:45:25 +02:00
Aleksander Grygier	f8e67fc583	ui: Add Thinking mode toggle with reasoning effort levels + improvements for Chat Form Add Action UI (#23434 ) * feat: Add "Thinking" toggle and status icon + redesign Chat Form Actions Add panel * test: Update test reference * fix: Icon * fix: E2E test command * fix: wait for greeting h1 to be visible in e2e test * fix: remove duplicate PDF option in attachment dropdown * fix: use label-based group toggle to avoid stale references * refactor: inline MCP server and tool toggles in mobile sheet * fix: serve correct build directory in e2e playwright config * feat: add reasoning effort levels selector in model dropdown * feat: Reasoning effort * refactor: Make server origin configurable via environment variable * feat: Add chat template thinking detector utility * feat: Add thinking support detection to models store * refactor: Update model selector components with thinking detection and message-specific indicators * feat: Update chat form components for model selection and thinking support * feat: Improve Reasoning controls UI * refactor: Apply suggestions from code review Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * fix: Model tags * refactor: Cleanup * refactor: Remove unneeded components * refactor: Cleanup b9474	2026-06-02 10:23:19 +02:00
Georgi Gerganov	2365315955	kv-cache : SWA checkpoints store only non-masked cells (#23981 ) b9473	2026-06-02 11:06:29 +03:00
forforever73	f7a0777a5c	convert : support Step3.7-Flash (#23845 ) * feat: support step3.7 * fix: register Step-3.7 BPE pre-tokenizer hash * delete fromjson * register step3.7 arch to Step35Model * drop vit projector in base filter * Apply suggestion from @CISC Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * restore blank line --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-06-02 09:54:49 +02:00
Georgi Gerganov	4f3a4beb8d	llama : deprecate `llama_set_warmup` (#24009 ) * llama : deprecate `llama_set_warmup` * cont : fix type Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com> --------- Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com> b9471	2026-06-02 10:30:38 +03:00
Max Krasnyansky	8f7f3bf141	hexagon: MUL_MAT, MUL_MAT_ID, FLASH_ATTN and GDN cleanup and optimizations for latest models (#23989 ) * hex-mm: initial support for F32 * F32 -> F32 matmuls * hex-rms-norm: fix src1 stride use in fused rms_norm_mul * hex-ops: clear spad pointers in the ops that clobber it This fixes an odd case where fused rms-norm-mul was failing but only in qwen3.5-2B and only at certain op-batch sizes. * hmx-mm: add support for F32 * F32 -> F32 matmul_2d on HMX Decided to use Q4_0 * F32 -> F32 matmul for this. Q4_0 gets dequantized and tiled into F16, and here we quantize and tile F32 into F16. Super simple and pretty efficient. * hmx-mm: route f16 2D matmuls through the same kernel used for all other types * hmx-mm: re-introduce pipelined vs non-pipelined mode that we used to have but in a much more generic way This update further improves matmul performance and at the same time removes most of the redundant logic we had in different paths. * hmx-fa: slightly improved pipeline similar to matmul updates * hmx-mm: initial version of MUL_MAT_ID support for HMX * hmx-mm: fixed mxfp4 handling for MUL_MAT_ID * hex-gdn: optimize GATED_DELTA_NET DMA prefetch/double-buff, vectorize everything with HVX, in other words -- the usual :) * hmx-mm: missed one more case where we can use fastmod * hexagon: update DCVS settings for a slight perf bump * hmx-fa: use fastdiv in hmx-flash-attn * hmx-fa: precompute slope values to avoid disrupting the inner loop * hvx-utils/fa: new HVX helpers for powf and logf and using those to speed up FA alibi * hex-ops: fixed a bug in fusion logic that was messing up the order of the src tensors when some srcs are empty * hex-fa: correctly fall back to HVX if we have sinks or the dims are not quite right b9470	2026-06-01 23:40:08 -07:00
Todor Boinovski	d178a11818	hexagon: add gelu_quick (#24007 ) b9469	2026-06-01 23:19:07 -07:00
Pascal	354ebac8cb	server: real-time reasoning interruption via control endpoint (#23971 ) * server: real-time reasoning interruption via control endpoint Builds on the manual reasoning budget trigger from #23949. Adds a CONTROL task that mirrors the CANCEL path on the live slot and calls common_sampler_reasoning_budget_force to end thinking mid-generation. POST /v1/chat/completions/control with { id_slot, action }, opt-in reasoning_control arms the budget sampler on demand. Router and single model. Minimal WebUI button as a skeleton for further UI work. * ui: track reasoning phase via explicit streaming state Add isReasoning to the chat store, mirroring the isLoading pattern: per conversation map, private setter, public accessor and reactive export. Set from the stream callbacks, true on reasoning chunks, false on the first content chunk, reset on stream end and resynced on conversation switch. The skip button now keys off isReasoning so it shows only during the thinking phase, not the whole generation. * ui: extract control endpoint and action into constants Move the chat completion routes, the slots route and the reasoning control action out of chat.service into api-endpoints and a dedicated control-actions module. No behavior change, drops the magic strings so the control protocol has a single source of truth. * server: target reasoning control by completion id Address @ngxson review on the control endpoint. Switch from id_slot to the chat completion id to avoid a TOCTOU: the slot can be reassigned between the lookup and the control request, so matching the live completion (oaicompat_cmpl_id) is safe and a finished one simply matches nothing. Rename the action to reasoning_end, guard it on the reasoning_control flag of the target slot, and reduce the response to {success} with an optional message. * ui: target reasoning control by completion id Keep the streamed completion id on the message and post it back to the control endpoint instead of probing /slots. Drops the slot discovery and the TOCTOU that came with it. Action renamed to reasoning_end, response read as {success}. * server: address review from @ngxson Move the control fields into task_params and drop the redundant comments on the control path. * server: document the reasoning control endpoint * Update tools/ui/src/lib/types/database.d.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * ui: rename cmplId to completionId Per @allozaur review, clearer name for the streamed completion id. * ui: wire completion id capture through the agentic flow The webui streams through the agentic flow, which relayed onModel but not onCompletionId, so the completion id never reached the message and the control request was never sent. Relay it through the flow and its callbacks type, declare id on the chunk type, and log an explicit error when the button fires without a usable id. * ui: target reasoning control model from the message The model is a property of the completion, so read it from the streaming message like the id, not from the model dropdown which is unrelated UI state. Makes the request self-consistent by construction instead of just unlikely to drift. --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> b9468	2026-06-02 07:26:20 +02:00
Anav Prasad	1fd5f48037	clean up unused variables warnings (#23975 ) b9467	2026-06-02 10:38:37 +08:00
lhez	210a6570ce	opencl: fix compiler warnings for non-adreno path (#23922 ) * opencl: fix compiler warnings for non-adreno path * opencl: fix const cast warning b9466	2026-06-01 19:15:09 -07:00
Masashi Yoshimura	b8275a8acc	revert to using global_invocation_id for cpy shader (#23955 )	2026-06-01 16:59:06 -07:00
Georgi Gerganov	5dcb711666	speculative : fix n_outputs_max and remove draft-simple auto-enable (#23988 ) * speculative : add common_speculative_n_max helper function Extract the speculative max-draft-size logic from server_n_outputs_max into a reusable common_speculative_n_max() function in common/speculative. Assisted-by: llama.cpp:local pi * cont : draft context always has n_parallel outputs * llama : log n_outputs_max * speculative : remove draft-simple auto-enable * ci : enable server tests on PRs b9464	2026-06-01 22:26:58 +03:00
Christian Hoener zu Siederdissen	5aa3a64596	nix : add nix-nodejs facilities to build Web UI (#23846 ) * nix: add nix-nodejs facilities to build Web UI Build the Web UI locally using standard Nix systems for building NodeJS packages. - Create derivation for the web UI - npm dependencies are imported via buildNodeModules. Does not require setting any shasum. - Copy build artifacts to the correct folders. - Prevents having to download from huggingface.co Fixes #23067 * nix: simplify webui derivation using LLAMA_UI_OUT_DIR - Move npm build to installPhase with LLAMA_UI_OUT_DIR=$out to write output directly to the Nix store - Copy built assets to tools/ui/dist (source tree) instead of build/tools/ui/dist so CMake's copy_src_dist() finds them	2026-06-01 14:01:26 -04:00
shaofeiqi	27d9ed8397	opencl: add basic support for q5_0 and q5_1 (#23548 ) * opencl: add general q5_0 support * opencl: add general q5_1 support * opencl: support non-uniform workgrp size --------- Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-06-01 10:06:50 -07:00
Adrien Gallouët	335abed17d	vendor : update cpp-httplib to 0.46.1 (#23980 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-06-01 19:40:10 +03:00
Aman Gupta	de6f727aae	llama: limit max outputs of `llama_context` (#23861 ) * llama: save more VRAM by reserving n_outputs == n_seqs when possible * add n_outputs_per_seq * move n_outputs_max to server-context * change ubatch to batch everywhere b9460	2026-06-01 18:01:38 +03:00
Shrivas Shankar	95b8b8ec1a	metal: template GLU kernels to support f16/f32 (#23882 ) Drops the hardcoded f32 GLU kernels in favor of a single template. We now load/store in the native tensor type (half or float) to save memory bandwidth, but keep the actual ALU compute in float to avoid exploding math in geglu/swiglu. Also opened up the dispatch gate to allow f16 inputs. b9459	2026-06-01 15:40:28 +03:00
Jeff Bolz	55ac0909e5	vulkan: don't hold the device mutex while compiling pipelines (#23641 ) * vulkan: don't hold the device mutex while compiling pipelines We need to hold a lock while we traverse all pipelines and lazily initialize them, but we don't need to hold it while the pipeline is being compiled. And it doesn't need to be the same lock as the device mutex. We call load_shaders each time a pipeline is needed, so we only need to compile that one pipeline (and, for example, don't want to end up compiling a pipeline that another thread should be compiling). * remove 'needed' b9458	2026-06-01 14:04:01 +02:00
Winston Ma	bef69f1306	vulkan: reduce host memory lock contention (#23376 ) * vulkan: reduces lock contention * replace unique_lock with lock_guard b9457	2026-06-01 14:03:32 +02:00
o7si	5aba5364d9	vocab: add normalizer.lowercase support to WPM (#23899 ) * vocab : add jina-embeddings-v2-base-zh (whitespace tokenizer) * vocab : add normalizer.lowercase support to WPM * vocab : default normalizer.lowercase to false for whitespace pre-tokenizer	2026-06-01 14:26:47 +03:00
Johannes Gäßler	8e6fff84de	TP: quantized KV cache support (#23792 ) * TP: quantized KV cache support * fix partial view * remove overly strict assert b9455	2026-06-01 12:30:10 +02:00

9504 Commits