llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-06-28 15:20:20 +00:00

Author	SHA1	Message	Date
Todor Boinovski	d178a11818	hexagon: add gelu_quick (#24007 ) b9469	2026-06-01 23:19:07 -07:00
Pascal	354ebac8cb	server: real-time reasoning interruption via control endpoint (#23971 ) * server: real-time reasoning interruption via control endpoint Builds on the manual reasoning budget trigger from #23949. Adds a CONTROL task that mirrors the CANCEL path on the live slot and calls common_sampler_reasoning_budget_force to end thinking mid-generation. POST /v1/chat/completions/control with { id_slot, action }, opt-in reasoning_control arms the budget sampler on demand. Router and single model. Minimal WebUI button as a skeleton for further UI work. * ui: track reasoning phase via explicit streaming state Add isReasoning to the chat store, mirroring the isLoading pattern: per conversation map, private setter, public accessor and reactive export. Set from the stream callbacks, true on reasoning chunks, false on the first content chunk, reset on stream end and resynced on conversation switch. The skip button now keys off isReasoning so it shows only during the thinking phase, not the whole generation. * ui: extract control endpoint and action into constants Move the chat completion routes, the slots route and the reasoning control action out of chat.service into api-endpoints and a dedicated control-actions module. No behavior change, drops the magic strings so the control protocol has a single source of truth. * server: target reasoning control by completion id Address @ngxson review on the control endpoint. Switch from id_slot to the chat completion id to avoid a TOCTOU: the slot can be reassigned between the lookup and the control request, so matching the live completion (oaicompat_cmpl_id) is safe and a finished one simply matches nothing. Rename the action to reasoning_end, guard it on the reasoning_control flag of the target slot, and reduce the response to {success} with an optional message. * ui: target reasoning control by completion id Keep the streamed completion id on the message and post it back to the control endpoint instead of probing /slots. Drops the slot discovery and the TOCTOU that came with it. Action renamed to reasoning_end, response read as {success}. * server: address review from @ngxson Move the control fields into task_params and drop the redundant comments on the control path. * server: document the reasoning control endpoint * Update tools/ui/src/lib/types/database.d.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * ui: rename cmplId to completionId Per @allozaur review, clearer name for the streamed completion id. * ui: wire completion id capture through the agentic flow The webui streams through the agentic flow, which relayed onModel but not onCompletionId, so the completion id never reached the message and the control request was never sent. Relay it through the flow and its callbacks type, declare id on the chunk type, and log an explicit error when the button fires without a usable id. * ui: target reasoning control model from the message The model is a property of the completion, so read it from the streaming message like the id, not from the model dropdown which is unrelated UI state. Makes the request self-consistent by construction instead of just unlikely to drift. --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> b9468	2026-06-02 07:26:20 +02:00
Anav Prasad	1fd5f48037	clean up unused variables warnings (#23975 ) b9467	2026-06-02 10:38:37 +08:00
lhez	210a6570ce	opencl: fix compiler warnings for non-adreno path (#23922 ) * opencl: fix compiler warnings for non-adreno path * opencl: fix const cast warning b9466	2026-06-01 19:15:09 -07:00
Masashi Yoshimura	b8275a8acc	revert to using global_invocation_id for cpy shader (#23955 )	2026-06-01 16:59:06 -07:00
Georgi Gerganov	5dcb711666	speculative : fix n_outputs_max and remove draft-simple auto-enable (#23988 ) * speculative : add common_speculative_n_max helper function Extract the speculative max-draft-size logic from server_n_outputs_max into a reusable common_speculative_n_max() function in common/speculative. Assisted-by: llama.cpp:local pi * cont : draft context always has n_parallel outputs * llama : log n_outputs_max * speculative : remove draft-simple auto-enable * ci : enable server tests on PRs b9464	2026-06-01 22:26:58 +03:00
Christian Hoener zu Siederdissen	5aa3a64596	nix : add nix-nodejs facilities to build Web UI (#23846 ) * nix: add nix-nodejs facilities to build Web UI Build the Web UI locally using standard Nix systems for building NodeJS packages. - Create derivation for the web UI - npm dependencies are imported via buildNodeModules. Does not require setting any shasum. - Copy build artifacts to the correct folders. - Prevents having to download from huggingface.co Fixes #23067 * nix: simplify webui derivation using LLAMA_UI_OUT_DIR - Move npm build to installPhase with LLAMA_UI_OUT_DIR=$out to write output directly to the Nix store - Copy built assets to tools/ui/dist (source tree) instead of build/tools/ui/dist so CMake's copy_src_dist() finds them	2026-06-01 14:01:26 -04:00
shaofeiqi	27d9ed8397	opencl: add basic support for q5_0 and q5_1 (#23548 ) * opencl: add general q5_0 support * opencl: add general q5_1 support * opencl: support non-uniform workgrp size --------- Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-06-01 10:06:50 -07:00
Adrien Gallouët	335abed17d	vendor : update cpp-httplib to 0.46.1 (#23980 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-06-01 19:40:10 +03:00
Aman Gupta	de6f727aae	llama: limit max outputs of `llama_context` (#23861 ) * llama: save more VRAM by reserving n_outputs == n_seqs when possible * add n_outputs_per_seq * move n_outputs_max to server-context * change ubatch to batch everywhere b9460	2026-06-01 18:01:38 +03:00
Shrivas Shankar	95b8b8ec1a	metal: template GLU kernels to support f16/f32 (#23882 ) Drops the hardcoded f32 GLU kernels in favor of a single template. We now load/store in the native tensor type (half or float) to save memory bandwidth, but keep the actual ALU compute in float to avoid exploding math in geglu/swiglu. Also opened up the dispatch gate to allow f16 inputs. b9459	2026-06-01 15:40:28 +03:00
Jeff Bolz	55ac0909e5	vulkan: don't hold the device mutex while compiling pipelines (#23641 ) * vulkan: don't hold the device mutex while compiling pipelines We need to hold a lock while we traverse all pipelines and lazily initialize them, but we don't need to hold it while the pipeline is being compiled. And it doesn't need to be the same lock as the device mutex. We call load_shaders each time a pipeline is needed, so we only need to compile that one pipeline (and, for example, don't want to end up compiling a pipeline that another thread should be compiling). * remove 'needed' b9458	2026-06-01 14:04:01 +02:00
Winston Ma	bef69f1306	vulkan: reduce host memory lock contention (#23376 ) * vulkan: reduces lock contention * replace unique_lock with lock_guard b9457	2026-06-01 14:03:32 +02:00
o7si	5aba5364d9	vocab: add normalizer.lowercase support to WPM (#23899 ) * vocab : add jina-embeddings-v2-base-zh (whitespace tokenizer) * vocab : add normalizer.lowercase support to WPM * vocab : default normalizer.lowercase to false for whitespace pre-tokenizer	2026-06-01 14:26:47 +03:00
Johannes Gäßler	8e6fff84de	TP: quantized KV cache support (#23792 ) * TP: quantized KV cache support * fix partial view * remove overly strict assert b9455	2026-06-01 12:30:10 +02:00
Georgi Gerganov	02a57017f6	security : disable private disclosures (#23963 )	2026-06-01 13:14:12 +03:00
Junwon Hwang	48b88c3b00	model: Add EXAONE 4.5 implementations (#21733 ) * Add EXAONE 4.5 and Add GQA for MMproj * mtmd: EXAONE 4.5 vision markers and projector path EXAONE 4.5 uses <vision> and </vision> for image boundaries; Qwen keeps <|vision_start|> and <|vision_end|>. Route EXAONE 4.5 through the Qwen2.5-VL-style encode path (window attention pattern, optional mmproj input norm). Update exaone4_5 projector weights and convert_hf_to_gguf for mmproj export. * mtmd: load EXAONE4 nextn tensors correctly Align EXAONE4 tensor registration with EXAONE_MOE for NextN/MTP slots and avoid skip-flag propagation on duplicated rope_freqs so model loading succeeds for EXAONE 4.5 GGUF. * Minor fixes * Address PR feedback * Address PR feedback * Fix EXAONE after merge * Fix EXAONE 4.5 conversion * Address PR feedback * Refactor EXAONE 4.5 conversion * Address PR feedback * Fix unintended deletion * Minor fix --------- Co-authored-by: LG-AI-EXAONE <exaonemodels@lgresearch.ai> b9453	2026-06-01 11:48:53 +02:00
Matt Corallo	19620004f5	vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints (#23056 ) Q2_K/Q3_K/Q6_K do much better when using MMVQ on Intel BMG even though they're only 2-byte aligned, and Q3_K still wins on NVIDIA as well. mesa isn't all that great at coalescing back-to-back loads from alternating arrays, so we force it instead. Further, we can do subtraction directly on a full int32_t rather than an i8vec4 with bit twiddling because the high bit is always free to start. On Intel BMG on mesa, the switch to MMVQ provides an immediate ~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and ~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K. The further switch to block loads leads to a ~24% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K. Finally, Xe2 wins on MMVQ even for small k, so we take the NVIDIA override for K quants on Xe2 as well. b9452	2026-06-01 11:46:48 +02:00
Winston Ma	f8c0a19d46	vulkan: Removed unused functions (#23175 ) b9451	2026-06-01 11:46:23 +02:00
Aldehir Rojas	5254a7994d	common : support manually triggering the reasoning budget end sequence (#23949 )	2026-06-01 11:37:11 +02:00
Georgi Gerganov	e22b0de60d	ci : add missing Linux label to cpu-x64-high-perf runner (#23958 ) Fixes: https://github.com/ggml-org/llama.cpp/pull/23927#discussion_r3332213086 The cpu-x64-high-perf job was missing the Linux label in its runs-on specification, causing the runner to not be discovered. All other self-hosted Linux jobs include this label. Assisted-by: llama.cpp:local pi	2026-06-01 10:39:59 +03:00
Neo Zhang	a51142497a	[SYCL] Support Q4_1, Q5_0, Q5_1 in Flash-attention (#23812 ) * support Q4_1, Q5_0, Q5_1 * update ut case	2026-06-01 09:53:53 +03:00
Neo Zhang	4162522688	[SYCL] Add more types in GET_ROWS OP (#23710 ) * add to support Q1_0, NVFP4, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ1_S, IQ1_M, IQ3_S, IQ4_NL, IQ4_XS, I32, MXFP4, Q2_K, Q3_K, Q5_K, and Q6_K in GET_ROWS OP * correct the link	2026-06-01 09:53:04 +03:00
Neo Zhang	44e211cecf	sycl : Optimize Q3_K mul_mat by reorder (#23725 )	2026-06-01 09:50:55 +03:00
Eve	af6528e6df	ci: remove redundant or duplicate jobs (#23927 ) * remove redundant apple job openvino gpu and cpu test can share the same build and machine Update build-rpc.yml Update build-openvino.yml cpu any doesnt make sense as we have an arm job already, so do high perf on both x86 and arm remove duplicate x86 vulkan combine backend sampling Update server.yml run server on arm as windows is x86 * emdawn on one machine only * fix openvino, remove cpu tag as we dont have many x64 machines with that tag b9445	2026-06-01 06:32:17 +03:00
Eric Zhang	6f165c1c64	server : handle If-None-Match weak ETags (#23916 ) b9444	2026-05-31 16:21:08 -05:00
Georgi Gerganov	399739d5c5	ci : limit trigger paths for the CPU workflow (#23938 )	2026-05-31 19:02:47 +03:00
o7si	d4c8e2c29c	vocab : add tokenizer support for jina-embeddings-v2-base-zh (#18756 ) * vocab : add jina-embeddings-v2-base-zh (whitespace tokenizer) * lowercase defaults to true * type fix --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b9442	2026-05-31 12:37:35 +02:00
Eric Zhang	3292da09f6	ui: fix ETag truncation with MSVC compiler (#23917 ) b9441	2026-05-31 11:21:23 +02:00
Vladislav	e6123e2080	docs : update ZenDNN docs for Q8 support (#23791 ) * docs zendnn added information about Q8 support * docs zendnn rm unnecessary data * docs update, links to ZenDNN docs provided * docs zenDNN update: clarified explanation * docs zenDNN update: one more explanation clarified --------- Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>	2026-05-31 10:26:42 +02:00
Ruben Ortlam	22cadc1944	llama: only use one iGPU device by default (#23897 ) b9439	2026-05-31 08:17:47 +02:00
Pascal	d749821db3	webui: add custom CSS injection via config (#23904 ) * webui: add custom CSS injection via config register a customCSS setting in the Developer section under Custom JSON, syncable so it rides the existing ui-config pass through. inject the value into a single style element in the head, reactive on the setting. lets an operator theme a prebuilt binary through --ui-config without rebuilding, and lets a user set it from the settings panel. * ui: address review from @niutech and @allozaur, rename custom JSON key and CSS field * ui: address review from @allozaur, move custom CSS injection to a style tag in svelte:head * ui: inject custom CSS through a svelte action instead of a bound element move the textContent write into a use: action on the head style node. the action is the idiomatic way to touch a node, so the no-dom-manipulating lint rule is satisfied without a disable. value stays text through textContent, never parsed as HTML. * Update tools/ui/src/lib/constants/settings-keys.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * ui: address review from @allozaur, rename custom config key to customJson with migration rename the custom config key to customJson across the type, the chat request builder, the settings save check and the custom tools reader, keeping the custom API param name unchanged. add a non destructive migration that copies the legacy custom key to customJson at startup. only render the head style tag when custom CSS is set. --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> b9438	2026-05-30 23:49:31 +02:00
Gaurav Garg	aa46bda89b	Support `-fa auto` in llama-bench (#23714 ) * Support `-fa auto` in llama-bench Make the default value of `-ngl` -1, similar to other tools. Update README with latest usage and examples * Address review comments b9437	2026-05-31 02:03:57 +05:30
lhez	d6588daa80	opencl: support bf16 by converting to f16 (#23839 ) b9436	2026-05-30 10:17:47 -07:00
Pascal	d38d50e7ff	ui: exclude generated build dirs from prettier and eslint so lint errors stop being masked (#23910 )	2026-05-30 16:50:54 +02:00
Johannes Gäßler	8b0e0db606	TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs (#23843 ) * TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs * fix afmoe TP b9434	2026-05-30 16:48:00 +03:00
Georgi Gerganov	2d9b7c8e98	metal : restore im2col implementation for large kernels (#23901 ) b9433	2026-05-30 15:26:13 +03:00
Xuan-Son Nguyen	e674b1279b	test: (test-llama-archs) log the config name first (#23885 ) b9432	2026-05-30 12:22:38 +02:00
Georgi Gerganov	4c4e91b799	ci : update ios-xcode release job to macos-26 (#23906 ) * ci : disable libcommon build from xcframework * ocd : fix name * ci : ios-xcode change to macos-26 * cont : pin xcode * cont : pin xcode to minor version b9431	2026-05-30 13:21:46 +03:00
Jinyang He	d48a56effb	ggml : add some lsx support (#23798 ) * loongarch : optimize LSX fp16 load/store with native intrinsics Use __lsx_vfcvtl_s_h and __lsx_vfcvt_h_s instead of scalar loops in __lsx_f16x4_load and __lsx_f16x4_store. * loongarch : add LSX implementation for q8_0 dot product * loongarch : add LSX implementation for q6_K dot product * loongarch : add LSX implementation for iq4_xs dot product * Improve reduce ops when summing int16 pairs to int32 b9430	2026-05-30 11:53:26 +03:00
Ruben Ortlam	6e093b80ea	vulkan: add Flash Attention support for BFloat16 KV cache (#23420 ) * vulkan: add flash attention bf16 kv support * vulkan: bf16 FA coopmat1 support * vulkan: bf16 FA coopmat2 support * fix FA bf16 f32 fallback * fix FA bf16 coopmat1 shader * fix FA bf16 coopmat2 shader * code cleanup * cleanup comment change * address feedback * add O_TYPE for cm2 FA * use O_TYPE for gqaStore function * reduce BFLOAT16 ifdefs	2026-05-30 10:39:31 +02:00
Georgi Gerganov	337528571d	ci : fix s390x release job (#23898 ) * ci : fix s390x release job * ci : multi-thread build for `ios-xcode` * ocd : names b9428	2026-05-30 09:21:38 +03:00
Georgi Gerganov	d4204b03a5	ci : clear cache instead of "no timestamp" keys + fix macos (#23895 ) * ci : ios use macos-15 again * ci : add and test ccache-clear * cont : fix * cont : set permission * cont : another permission * cont : token * cont : print key * cont : bring back perms * cont : test windows * cont : add token * cont : cleanup * ci : make release jobs clean-up their ccache	2026-05-30 08:52:30 +03:00
Radoslav Gerganov	1738129bee	llama : do not skip iGPU when only RPC devices are present (#23868 ) After #23007 reclassified integrated CUDA/HIP devices as IGPU, the device selection logic dropped the local iGPU whenever any RPC server was added, because RPC devices made `model->devices` non-empty. On systems where the "iGPU" is the main compute device (e.g. Strix Halo with 128 GiB of unified memory), this caused all tensors to be allocated on the RPC peer alone and model loading to fail. Gate the iGPU inclusion on `gpus.empty()` instead, so RPC peers no longer suppress the local iGPU. closes: #23858 b9426	2026-05-30 07:48:22 +03:00
Xuan-Son Nguyen	0821c5fcfd	server: in SSE mode, send HTTP headers when slot starts (#23884 ) * server: in SSE mode, send HTTP headers when slot starts * ref to pr * stream should be false by default	2026-05-30 00:06:29 +02:00
Reese Levine	151f3a98e9	ggml-webgpu: Check earlier for WebGPU required features (#23879 )	2026-05-29 14:16:05 -07:00
Reese Levine	b22da25889	ggml-webgpu: add q4_0/q8_0 SET_ROWS (#23760 ) * Add q8_0 and q4_0 set_rows * Add fast(er) quantization set_rows path * formatting/naming * a little more naming * Remove unused constant * Don't override other override * Avoid bitcast * Narrow relaxation	2026-05-29 14:14:11 -07:00
Ruixiang Wang	689a9a470e	server-bench : add speed-bench for speculative decoding benchmarking (#23869 ) * spec: add speed-bench support for benchmarking * speed-bench : add trailing newline to requirements.txt * speed-bench : bump datasets to 4.8.0 to fix ty check * server-bench : remove now-unused type: ignore after datasets bump	2026-05-29 23:09:47 +02:00
Pascal	5a46b46acd	app: add llama update self updater (#23865 ) * wip: llama update POC * cleaning: llama update * llama-gen-docs * app: delegate llama update to the install script * app: spawn the installer detached so llama update can replace a running binary * cleaning: inline llama update into llama.cpp, drop app-update.{cpp,h} * app: make llama_update static Address review from @angt	2026-05-29 23:02:40 +02:00
ValdikSS	22d66b567e	ui: handle audio/vnd.wave as audio WAV file (#23754 ) Firefox on Linux uses this MIME type	2026-05-29 21:41:35 +02:00
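
The device-selection fix described in 1738129bee (#23868) can be illustrated with a minimal sketch. This is not the llama.cpp implementation: the `Device` type, the `kind` labels and `select_devices` are hypothetical names invented for illustration. It assumes only what the commit message states, namely that the iGPU is included when the list of discrete GPUs is empty (rather than when the overall device list is empty), and, per 22cadc1944 (#23897), that a single iGPU is used by default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    kind: str  # hypothetical labels for this sketch: "rpc", "gpu" or "igpu"


def select_devices(available):
    """Pick default devices; a sketch of the gating idea, not the real code."""
    rpc = [d for d in available if d.kind == "rpc"]
    gpus = [d for d in available if d.kind == "gpu"]
    igpus = [d for d in available if d.kind == "igpu"]

    selected = rpc + gpus
    # Gate on "no discrete GPUs" instead of "nothing selected yet":
    # RPC peers alone must not suppress the local iGPU.
    if not gpus and igpus:
        selected.append(igpus[0])  # assumption: only one iGPU by default
    return selected


if __name__ == "__main__":
    devs = [Device("rpc0", "rpc"), Device("igpu0", "igpu")]
    print([d.name for d in select_devices(devs)])
```

With an RPC peer and a local iGPU, both are selected; once a discrete GPU is present, the iGPU is left out.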


9469 Commits