llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-06-28 07:10:21 +00:00

Author	SHA1	Message	Date
Sigbjørn Skjæret	f58bad4137	ci : unbreak release harder (#24545 ) * unbreak release harder * missed one * remove missing test for now b9616	2026-06-12 23:49:36 +02:00
Sigbjørn Skjæret	cd5044661c	ci : unbreak release (#24544 )	2026-06-12 23:29:49 +03:00
Georgi Gerganov	ebc10770ac	server : fix reasoning budget WebUI precedence over model.ini (#24517 ) When reasoning-budget is set in model.ini, the per-request thinking_budget_tokens from the WebUI was ignored because the model.ini value took unconditional precedence. Swap the precedence so the WebUI per-request value is checked first, with the model.ini value serving as a fallback default. Assisted-by: pi:llama.cpp/Qwen3.6-27B	2026-06-12 17:59:56 +03:00
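The precedence swap described in ebc10770ac can be sketched as follows. This is a minimal illustration, not the server's code (which is C++ and not shown in this log): the function name `resolve_thinking_budget` and both parameter names are hypothetical, and `None` stands in for "not set".

```python
from typing import Optional


def resolve_thinking_budget(request_budget: Optional[int],
                            ini_budget: Optional[int]) -> Optional[int]:
    """Pick the effective reasoning budget (hypothetical helper).

    The per-request value is checked first; the model.ini value
    only serves as a fallback default when the request sets nothing.
    """
    if request_budget is not None:
        return request_budget
    return ini_budget


if __name__ == "__main__":
    print(resolve_thinking_budget(512, 2048))   # request value is used
    print(resolve_thinking_budget(None, 2048))  # falls back to model.ini
```

Comparing against `None` rather than relying on truthiness keeps an explicit per-request budget of 0 from being mistaken for "unset".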
Ruben Ortlam	3e7bd4f39a	vulkan: add pipeline barriers for memcpy read operations (#23770 ) * vulkan: add pipeline barriers for memcpy read/write operations * remove unnecessary host write pipeline barriers	2026-06-12 16:43:50 +02:00
Aleksander Grygier	f7ca93d12c	ui: PWA support (#23871 ) * feat: Add basic PWA support and service worker for offline caching * feat: Vite PWA implementation WIP * feat: Improve PWA icons generation * feat: Add PWA workbox to server routes * feat: Include `version.json` in static assets * feat: Add HTTP cache headers for PWA static assets * feat: Update app name for `apple-mobile-web-app-title` * feat: Implement PWA versioning and automatic update detection * chore: Update `.gitignore` files * feat: Splash Screens * feat: Add dark mode favicon support * refactor: Cleanup * fix: Use dark logo for dark splash screens * refactor: Simplify favicons SVG code * fix: Adjust caching and polling for reliable service worker updates * fix: Add missing favicon entry * fix: Align PWA service worker configuration with SvelteKit build structure * fix: Replace hashed bundle paths with versioned static paths * test: Add PWA tests * ci: Add build output for unit tests * refactor: Cleanup * fix: Server build & release versioning * chore: Update package-lock.json * chore: Increase PWA cache size * chore: Update packages * feat: Update favicons * refactor: Post-merge fix * feat: support explicit build version for PWA cache busting * fix: CI * feat: Improve PWA Refresh Alert UI * feat: Add toggleable build version display * refactor: Cleanup * feat: Add version mismatch detection and manual app reload * refactor: replace dynamic imports with static * refactor: Cleanup * feat: Add safe space for `pwa-<size>.png` rendered icons * fix: use relative paths for PWA assets to support base path deployment * feat: add PWA mode detection via URL query parameter * feat: Use ?cache=true for SW-cached PWA assets * refactor: Build process cleanup * refactor: Decouple PWA versioning and remove ?cache=true workaround * chore: Update README logo * feat: Include PWA Assets generation in build script * refactor: `usePwa` hook for core layout * fix: Relativize base vite plugin * fix: remove unnecessary backslash escapes in test regexes * test: update static asset paths for API Key test * refactor: Move SvelteKit PWA Options config to constants * ui: fix update notification never appearing Keep the PWA hook object intact instead of destructuring needRefreshByStorage, which freezes the reactive getter. Also exclude loading.html from PWA precache to prevent 404 errors and broken SW installation.	2026-06-12 15:53:26 +02:00
Georgi Gerganov	02182fc5b9	fit : avoid including llama-ext.h in fit.h (#24506 ) b9611	2026-06-12 15:57:05 +03:00
Georgi Gerganov	f532be8fac	sync : ggml b9610	2026-06-12 15:55:35 +03:00
Georgi Gerganov	e08c226a2c	ggml : bump version to 0.15.1 (ggml/1541)	2026-06-12 15:55:35 +03:00
Adrien Gallouët	70b54e140c	vendor : update cpp-httplib to 0.47.0 (#24395 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co> b9608	2026-06-12 11:34:44 +02:00
Pascal	6471e3c090	UI/jpeg exif orientation (#24196 ) * ui: bake jpeg exif orientation into uploaded images stb_image in mtmd ignores exif metadata, so rotated smartphone photos reach the model with raw pixel orientation. The webui now reads the exif orientation tag at send time and feeds it into the existing capImageDataURLSize canvas pass: the browser applies the rotation when decoding, so capped images come out upright for free, and images under the cap threshold get a single plain redraw when orientation > 1. At most one re-encode ever happens per image. Upright jpegs with capping disabled pass through untouched, bit perfect. Adds jpeg-orientation.ts with a minimal exif parser working on a bounded base64 prefix (both endianness, returns 1 on any malformed input) and unit tests against handcrafted jpeg byte streams. * ui: move jpeg exif constants into lib/constants * ui: add browser test for jpeg orientation and capping Covers capImageDataURLSize end to end in chromium with real Pillow generated jpeg fixtures across exif orientations 1/3/5/6/8: upright quadrant colors checked pixel-wise, expected dimensions with and without capping, no orientation tag left in the output, and strict passthrough when nothing needs rewriting.	2026-06-12 10:20:27 +02:00
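The "minimal exif parser" idea from 6471e3c090 can be illustrated with a short sketch. The assumptions are stated loudly: the WebUI implementation is TypeScript (`jpeg-orientation.ts`) and works on a bounded base64 prefix, whereas this sketch is Python, takes raw bytes, and uses the hypothetical names `jpeg_exif_orientation` and `_orientation_from_tiff`. It only shares the stated behavior: both byte orders are handled and any malformed input yields 1.

```python
import struct

ORIENTATION_TAG = 0x0112  # standard EXIF/TIFF orientation tag


def _orientation_from_tiff(tiff: bytes) -> int:
    """Read the orientation from a TIFF block (the part after 'Exif\\0\\0')."""
    if tiff[:2] == b"II":
        e = "<"  # little-endian
    elif tiff[:2] == b"MM":
        e = ">"  # big-endian
    else:
        return 1
    magic, ifd0 = struct.unpack(e + "HI", tiff[2:8])
    if magic != 42:
        return 1
    (count,) = struct.unpack(e + "H", tiff[ifd0:ifd0 + 2])
    for i in range(count):
        off = ifd0 + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, typ, n = struct.unpack(e + "HHI", tiff[off:off + 8])
        if tag == ORIENTATION_TAG:
            if typ != 3 or n != 1:  # must be a single SHORT
                return 1
            (value,) = struct.unpack(e + "H", tiff[off + 8:off + 10])
            return value if 1 <= value <= 8 else 1
    return 1


def jpeg_exif_orientation(data: bytes) -> int:
    """Return the EXIF orientation (1..8) of a JPEG, or 1 on any problem."""
    try:
        if data[:2] != b"\xff\xd8":  # missing SOI marker
            return 1
        pos = 2
        while pos + 4 <= len(data):
            if data[pos] != 0xFF:
                return 1
            marker = data[pos + 1]
            (seg_len,) = struct.unpack(">H", data[pos + 2:pos + 4])
            if marker == 0xE1 and data[pos + 4:pos + 10] == b"Exif\x00\x00":
                return _orientation_from_tiff(data[pos + 10:pos + 2 + seg_len])
            if marker == 0xDA:  # start of scan: metadata segments are over
                return 1
            pos += 2 + seg_len
        return 1
    except struct.error:  # truncated field somewhere
        return 1
```

Returning 1 for every failure mode means a caller never has to distinguish "upright" from "unreadable": in both cases the image is left as it is.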
Ruixiang Wang	88a39274ec	spec: add EAGLE3 speculative decoding support (#18039 ) * llama : enable layer input extraction * spec: support eagle3 * eagle3: fix params bug * eagle3: support Gemma4 eagle3 from RedHatAI * eagle3: set sync when get features from target Co-authored-by: tnhnyzc <115956684+tnhnyzc@users.noreply.github.com> * eagle3 : fix ubatch handling in embd_layer_inp extraction and encoder Co-authored-by: Doğaç Eldenk <dogacel@gmail.com> * eagle3: adapt to upstream changes * eagle3: fix rebase issues and adapt to upstream changes * eagle3:exclude the eagle3 arch from test-llama-archs * eagle3: fix editorconfig check failures * eagle3: fix multi-seq issue in d2t vocab mapping * cont : minor style / clean-up * spec : remove `common_speculative_setup_draft_model()` * llama : clean-up unused API * eagle3: set d2t vocab mapping in decode graph * cont : assert layer inputs are configured * hparams : use n_embd_inp instead of n_embd_target_features * eagle3: make output.weight optional and inherit from target model when needed * hparams : generic norm-before-residual param * llama-ext : consistent names * cont : fix * hparams : remove target_hidden_size * cparams : rename output_layer_inp -> embeddings_layer_inp * arch : reuse ATTN_NORM_2 instead of adding new hidden norm * llama : clean-up names * cont : add assert + comment * Update conversion/llama.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: tnhnyzc <115956684+tnhnyzc@users.noreply.github.com> Co-authored-by: Doğaç Eldenk <dogacel@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b9606	2026-06-12 10:21:06 +03:00
ZihaoMu	85f99dca8b	ggml: support concat for scalar types at cuda backend (#24011 ) * cuda: support concat for scalar types * Update concat.cu * fix metal ci issue b9605	2026-06-12 09:32:44 +03:00
Neo Zhang	099ea76fb4	[SYCL] Fix CI build & release for SYCL backend (#24387 ) * restore SYCL build and release, remove github cache * modify for test only * verify the ccache is used * remove debug code change * rm duplicate action, update key in ccache * add action ccache-clear after building in both ubuntu and windows * set %NUMBER_OF_PROCESSORS% in windows build b9604	2026-06-12 09:30:24 +03:00
shaofeiqi	ba1df050f3	opencl: add q5_0/q5_1 gemm and gemv kernels for Adreno (#24319 ) * opencl: add q5_0 adreno support * opencl: add q5_1 adreno support * opencl: cosmetic fix --------- Co-authored-by: Li He <lih@qti.qualcomm.com> b9603	2026-06-11 21:43:09 -07:00
wencan	1593d5684d	docker : support specifying the GCC version for CUDA (#24447 )	2026-06-11 23:12:09 +02:00
Jeff Bolz	4c6595503f	vulkan: ifdef eMesaHoneykrisp (build fix) (#24479 ) Fixes build/CI after #24306. b9601	2026-06-11 13:22:17 -05:00
Georgi Gerganov	263cc04a54	sync : ggml	2026-06-11 19:34:19 +03:00
Georgi Gerganov	17e59d6209	ggml : bump version to 0.15.0 (ggml/1539)	2026-06-11 19:34:19 +03:00
Winston Ma	fdc3db9b65	vulkan: add fast path for contiguous buffer transfers (#23973 )	2026-06-11 15:46:25 +02:00
Kevin Liu	1af154a76f	vulkan: use medium matmul tile on Asahi Linux (#24306 ) * vulkan: use medium matmul tile on Asahi Linux * vulkan: switch Apple detection to Honeykrisp driver id	2026-06-11 15:43:04 +02:00
Xuan-Son Nguyen	18ef86ecec	server: skip unused log lines on router mode (#24463 ) b9596	2026-06-11 11:36:35 +02:00
o7si	1bfbdb134e	vocab : adopt leading TemplateProcessing special token as BOS (#24428 )	2026-06-11 10:37:23 +03:00
o7si	68f30663cf	vocab : refactor normalizer flags into options struct, add strip_accents (#24371 ) * vocab : refactor normalizer flags into options struct, add strip_accents * Update src/llama-vocab.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-vocab.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b9594	2026-06-11 10:36:50 +03:00
Aldehir Rojas	db94854ff5	server : skip checkpoints beyond pos_next (#24411 ) * server : skip checkpoints beyond pos_next * cont : update comment + TODO + ref --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-06-11 10:18:12 +03:00
Adrien Gallouët	ac4cddeb0d	vendor : update LibreSSL to 4.3.2 (#24397 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co> b9592	2026-06-10 22:28:03 +02:00
Gaurav Garg	e95dae18d6	Remove padding and multiple D2D copies for MTP (#24086 ) * Make ggml_gated_delta_net take only the initial recurrent state (D, 1, n_seqs) and passes the snapshot count K as an op parameter instead of inferring it from state->ne[1]. Remove the padding hack and copy all emitted snapshots into the recurrent cache with a single strided ggml_cpy * Make GDN changes in all backends. Address review comments. * Fix CI build errors b9591	2026-06-10 23:21:16 +05:30
Tarek Dakhran	d2462f8f7a	chat: fix LFM2/LFM2.5 ignoring json_schema (#24377 ) The LFM2 specialized template handler only built a grammar for tool-calling, silently ignoring json_schema from response_format. b9590	2026-06-10 14:41:41 +02:00
Oliver Simons	fb83cc9a07	CUDA: Fix ssm_scan_f32 data-races (#24360 ) * Add missing syncthreads before reusing cub_temp_storage __syncthreads() is required before being allowed to reuse TempStorage smem: https://nvidia.github.io/cccl/unstable/cub/api/classcub_1_1BlockLoad.html#_CPPv4I0EN3cub9BlockLoad4LoadEv20RandomAccessIteratorRA14ItemsPerThread_1Ti * Add one more missing __syncthreads Could also double-buffer, but alternative is to simply ensure all threads have read smem* before writing to it again in the next loop iteration * Remove unused smem from ssm_scan_f32 b9589	2026-06-10 14:27:08 +02:00
Sigbjørn Skjæret	039e20a2db	ci : bump komac version (#24396 )	2026-06-10 09:45:20 +02:00
ddh0	d2e22ed975	speculative : fix "ngram-map-k4v" name in logging (#24253 ) This is a non-functional change. When using `--spec-type ngram-map-k4v`, the log messages at startup and runtime say `ngram-map-k`. Added logic in the constructor of `common_speculative_impl_ngram_map_k` to pass the correct `COMMON_SPECULATIVE_TYPE_NGRAM_MAP_K4V` when `config.key_only` is `false`. After this change, the log messages use the correct name. b9587	2026-06-10 09:31:35 +02:00
Rémy Mathieu	76da2450a4	webui: implement pinned conversations support (#21387 ) * webui: implement pinned conversations support * webui: linter/prettier pass * Fix the unused handleMobileSidebarItemClick from the component. * the search should find pinned conversations as well Co-authored-by: Pascal <admin@serveurperso.com> --------- Co-authored-by: Pascal <admin@serveurperso.com> b9586	2026-06-09 21:33:22 +02:00
Aarnav Pai	d73cd07674	graph: Fix granite speech model inference by applying embedding scale when deepstack is not used (#24357 ) * llama-graph : apply embedding scale when deepstack is not used * nits: remove non-existent hunyuan-vl from the tests * apply suggestion from @gabe-l-hart --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> b9585	2026-06-09 19:46:27 +02:00
Sigbjørn Skjæret	e25a32e98c	ci : fix windows release (#24369 ) b9584	2026-06-09 19:42:23 +03:00
Pascal	483609509d	ui: add opt-in run_javascript frontend tool (#24244 ) * ui: add opt-in run_javascript frontend tool Expose a run_javascript tool to the model, executed entirely in the browser through the existing agentic loop. Code runs in a Web Worker inside a sandboxed iframe with an opaque origin, isolated from the WebUI and its API. Console output, errors and the return value are fed back as the tool result. The parent enforces a hard timeout by removing the iframe, which terminates the worker. Disabled by default, toggle in Settings > Developer. * ui: address review feedback from allozaur Use the JsonSchemaType enum for the tool definition parameter types instead of raw string literals, extending it with STRING and NUMBER. Move the worker shim and the iframe harness html into their own files so the service no longer carries inline source blobs. Replace the remaining magic strings with constants: SANDBOX_EMPTY_OUTPUT and SANDBOX_TRUNCATION_NOTICE, and reuse NEWLINE_SEPARATOR for joins. * ui: move sandbox worker shim to a raw imported file Replace the inline worker template string with a real sandbox-worker.js imported as raw text, and build the iframe harness from it in sandbox-harness.ts. The raw worker ships as a string, not a module, so it is excluded from eslint and the typecheck program.	2026-06-09 18:02:31 +02:00
Saba Fallah	49f3542190	mtmd: build_vit batching (#24352 )	2026-06-09 16:32:08 +02:00
Jeff Bolz	d6d0ce8215	vulkan: reduce iq1 shared memory usage for mul_mm (#24287 ) b9581	2026-06-09 13:27:38 +02:00
Ruben Ortlam	b4e3dc613b	vulkan: add `v_dot2_f32_f16` support in matrix-matrix multiplication and Flash Attention (#24123 ) * vulkan: add support for valve fp16 dot2 extension * use macro for dot2 path choice * properly check for the feature * add dot_product abstraction to reduce preprocessor branching b9580	2026-06-09 13:27:04 +02:00
Nick Towle	ae735b1314	ui: Fix excessive style recalculation on hover (#24243 )	2026-06-09 12:52:20 +02:00
Xuan-Son Nguyen	9682e351b8	mtmd: refactor video subproc handling (#24316 ) * mtmd: refactor video subproc handling * Update tools/mtmd/mtmd-helper.cpp Co-authored-by: Mikko Juola <mikjuo@gmail.com> --------- Co-authored-by: Mikko Juola <mikjuo@gmail.com> b9578	2026-06-09 13:15:12 +03:00
jacekpoplawski	1e912561dd	server: log prompts to directory (#22031 ) * server: log prompts to directory Add `--log-prompts-dir` to write each prompt to a separate text file in the specified directory. * Apply suggestion from @ngxson --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> b9577	2026-06-09 12:09:07 +02:00
Pascal	efbacf8d21	ui: fix mobile chat form overflow and bust stale bundle cache (#24158 )	2026-06-09 11:12:58 +02:00
Pascal	26021699bc	ggml : add GGML_OP_COL2IM_1D (#24206 ) * cpu: add GGML_OP_COL2IM_1D Add the overlap-add (scatter-add) step of a 1D transposed convolution. A ConvTranspose1d factorizes as a GEMM followed by col2im: a weight pre-permuted to [IC, K*OC] is contracted against the [IC, T_in] input with mul_mat to produce a column matrix [K*OC, T_in], and col2im_1d scatters those columns back into the [T_out, OC] signal, with T_out = (T_in - 1)*s0 + K - 2*p0. Keeping the contraction as a plain mul_mat leaves the heavy work on the optimized (and quantizable) matmul kernels, so col2im_1d only does the cheap overlap-add. CPU uses a gather formulation parallelized over output channels, supporting F32, F16 and BF16 with an F32 accumulator. * tests: add backend coverage for GGML_OP_COL2IM_1D Add test_col2im_1d next to the conv_transpose_1d cases, covering F32, F16 and BF16 across eight geometries: the canonical kernel = 2*stride DAC upsampling shape, overlap, no overlap, cropping (p0 = 1 and p0 = stride/2), kernel < stride with zeroed gaps, kernel not a multiple of stride, and a single column unfold. Perf mode gets three real vocoder stage shapes reporting memory bandwidth. max_nmse_err relaxes to 5e-4 for F16 and BF16. cpu: harden GGML_OP_COL2IM_1D ggml_col2im_1d validates s0, oc, p0 and input contiguity at graph build time, before the oc division, protecting every backend at once. The kernel asserts the contiguity its flat indexing assumes and its doc states the full output length including the crop term. The kernel parallelizes over the time axis: the split stays balanced down to OC = 1, where the previous channel split was single threaded. Values are bit identical on the three real vocoder chains, two out of three improve. * tests: extend the GGML_OP_COL2IM_1D grid The eval grid grows to eleven geometries: OC = 1 (mono output stage), K = 1 with stride > 1 (sparse scatter, every gap position zeroed) and a crop down to T_out = 2 where all the gather bounds act at once. * tests: add col2im_1d equivalence test tests/test-col2im-1d.cpp proves mul_mat + col2im_1d matches the native ggml_conv_transpose_1d on the CPU backend, F32 bit exact, F16 and BF16 through casts of the column matrix. test-backend-ops cannot cover this for a CPU only op since the CPU backend is its own reference there. * rpc: bump protocol patch version for GGML_OP_COL2IM_1D GGML_OP_COUNT goes from 96 to 97 with the new op, which trips the static_assert in ggml-rpc.h. Bump RPC_PROTO_PATCH_VERSION since the op is appended and no existing op code shifts. b9575	2026-06-09 12:01:37 +03:00
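The factorization described in 26021699bc (a contraction over input channels followed by an overlap-add) can be sketched in plain Python. This is an illustration under assumptions, not the ggml kernel: the function names are hypothetical, tensors are nested lists, and the memory layouts (`w[ic][oc][k]`, `cols[t][oc*K + k]`, output as `y[oc][j]`) are chosen for readability rather than taken from ggml. The gather loop follows the output-length relation T_out = (T_in - 1)*s0 + K - 2*p0.

```python
def conv_transpose_1d_direct(x, w, s0, p0):
    """Reference: scatter every input sample through the kernel."""
    IC, T_in = len(x), len(x[0])
    OC, K = len(w[0]), len(w[0][0])
    T_out = (T_in - 1) * s0 + K - 2 * p0
    y = [[0.0] * T_out for _ in range(OC)]
    for ic in range(IC):
        for t in range(T_in):
            for oc in range(OC):
                for k in range(K):
                    j = t * s0 + k - p0  # output position, after cropping
                    if 0 <= j < T_out:
                        y[oc][j] += x[ic][t] * w[ic][oc][k]
    return y


def gemm_columns(x, w):
    """Matmul step: cols[t][oc*K + k] = sum over ic of x[ic][t] * w[ic][oc][k]."""
    IC, T_in = len(x), len(x[0])
    OC, K = len(w[0]), len(w[0][0])
    cols = [[0.0] * (OC * K) for _ in range(T_in)]
    for t in range(T_in):
        for oc in range(OC):
            for k in range(K):
                cols[t][oc * K + k] = sum(x[ic][t] * w[ic][oc][k] for ic in range(IC))
    return cols


def col2im_1d(cols, OC, K, s0, p0):
    """Overlap-add step, written as a gather over output positions."""
    T_in = len(cols)
    T_out = (T_in - 1) * s0 + K - 2 * p0
    y = [[0.0] * T_out for _ in range(OC)]
    for oc in range(OC):
        for j in range(T_out):
            acc = 0.0
            for k in range(K):
                num = j + p0 - k  # equals t * s0 for a contributing column t
                if num < 0 or num % s0 != 0:
                    continue
                t = num // s0
                if t < T_in:
                    acc += cols[t][oc * K + k]
            y[oc][j] = acc
    return y


if __name__ == "__main__":
    x = [[1.0, 2.0, 3.0]]          # IC = 1, T_in = 3
    w = [[[1.0, 0.5, -1.0, 2.0]]]  # OC = 1, K = 4
    print(col2im_1d(gemm_columns(x, w), 1, 4, 2, 1))
    print(conv_transpose_1d_direct(x, w, 2, 1))
```

The gather form writes each output element exactly once, so it can be split across workers without atomics, whereas the scatter form in the reference would need them.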
fiesh	961e9a3e46	server : do not clear slots without unified KV cache (#24190 ) * Always export idle slots to RAM Without this, a slot's VRAM cache may not be written to RAM. If this slot happens to be busy then later on, this triggers needless preprocessing in another slot. * cont : clean-up --------- Co-authored-by: Christoph Weiss <weiss@wsoptics.de> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b9574	2026-06-09 10:45:16 +03:00
Sigbjørn Skjæret	f0152efe40	models : fix plamo2 attention_key/value_length regression (#24317 ) b9573	2026-06-09 10:26:44 +03:00
Yash Raj Pandey	fd3271e0b4	ggml-cpu : fix rms_norm_back wrong output under in-place aliasing (#24305 ) * ggml-cpu : fix rms_norm_back wrong output under in-place aliasing * cont : clean-up comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b9572	2026-06-09 10:24:27 +03:00
ravel7524	e3471b3e73	Remove case for GGML_TYPE_Q4_K in mvvq.cu (#23528 ) b9571	2026-06-09 07:46:23 +02:00
Reese Levine	3ac3c20c96	ggml-webgpu: Add clang-format job (#24308 ) * Add clang-format job * try local formatting b9570	2026-06-08 20:54:24 -07:00
Masashi Yoshimura	1e1aca09da	ggml-webgpu: Improve prefill speeds for k-quants + refactor matmul for Q4/Q5/Q8 and k-quants (#24225 ) * ggml-webgpu: Improve prefill speeds + refactor matmul for quants * Fixes for editorconfig checker	2026-06-08 15:19:56 -07:00
Max Krasnyansky	7d2b45b4f7	mtp: support for gemma-4 E2B and E4B assistants (#24282 ) * models: update converter to support smaller assistants * models: add masked_embd tensors to gemma4-assist arch * gemma-4: remove temp debug for conversion * gemma-4-mtp: filter out masked_embedding tensors during conversion b9568	2026-06-08 13:48:52 -07:00
Aldehir Rojas	42a0afd594	server : do not parse when flushing http headers (#24281 ) b9567	2026-06-08 13:32:41 -05:00


9616 Commits