llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-06-28 15:20:20 +00:00

Author	SHA1	Message	Date
Xuan-Son Nguyen	18ef86ecec	server: skip unused log lines on router mode (#24463 ) b9596	2026-06-11 11:36:35 +02:00
o7si	1bfbdb134e	vocab : adopt leading TemplateProcessing special token as BOS (#24428 )	2026-06-11 10:37:23 +03:00
o7si	68f30663cf	vocab : refactor normalizer flags into options struct, add strip_accents (#24371 ) * vocab : refactor normalizer flags into options struct, add strip_accents * Update src/llama-vocab.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-vocab.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b9594	2026-06-11 10:36:50 +03:00
Aldehir Rojas	db94854ff5	server : skip checkpoints beyond pos_next (#24411 ) * server : skip checkpoints beyond pos_next * cont : update comment + TODO + ref --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-06-11 10:18:12 +03:00
Adrien Gallouët	ac4cddeb0d	vendor : update LibreSSL to 4.3.2 (#24397 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co> b9592	2026-06-10 22:28:03 +02:00
Gaurav Garg	e95dae18d6	Remove padding and multiple D2D copies for MTP (#24086 ) * Make ggml_gated_delta_net take only the initial recurrent state (D, 1, n_seqs) and passes the snapshot count K as an op parameter instead of inferring it from state->ne[1]. Remove the padding hack and copy all emitted snapshots into the recurrent cache with a single strided ggml_cpy * Make GDN changes in all backends. Address review comments. * Fix CI build errors b9591	2026-06-10 23:21:16 +05:30
Tarek Dakhran	d2462f8f7a	chat: fix LFM2/LFM2.5 ignoring json_schema (#24377 ) The LFM2 specialized template handler only built a grammar for tool-calling, silently ignoring json_schema from response_format. b9590	2026-06-10 14:41:41 +02:00
Oliver Simons	fb83cc9a07	CUDA: Fix ssm_scan_f32 data-races (#24360 ) * Add missing syncthreads before reusing cub_temp_storage __syncthreads() is required before being allowed to reuse TempStorage smem: https://nvidia.github.io/cccl/unstable/cub/api/classcub_1_1BlockLoad.html#_CPPv4I0EN3cub9BlockLoad4LoadEv20RandomAccessIteratorRA14ItemsPerThread_1Ti * Add one more missing __syncthreads Could also double-buffer, but alternative is to simply ensure all threads have read smem* before writing to it again in the next loop iteration * Remove unused smem from ssm_scan_f32 b9589	2026-06-10 14:27:08 +02:00
Sigbjørn Skjæret	039e20a2db	ci : bump komac version (#24396 )	2026-06-10 09:45:20 +02:00
ddh0	d2e22ed975	speculative : fix "ngram-map-k4v" name in logging (#24253 ) This is a non-functional change. When using `--spec-type ngram-map-k4v`, the log messages at startup and runtime say `ngram-map-k`. Added logic in the constructor of `common_speculative_impl_ngram_map_k` to pass the correct `COMMON_SPECULATIVE_TYPE_NGRAM_MAP_K4V` when `config.key_only` is `false`. After this change, the log messages use the correct name. b9587	2026-06-10 09:31:35 +02:00
Rémy Mathieu	76da2450a4	webui: implement pinned conversations support (#21387 ) * webui: implement pinned conversations support * webui: linter/prettier pass * Fix the unused handleMobileSidebarItemClick from the component. * the search should find pinned conversations as well Co-authored-by: Pascal <admin@serveurperso.com> --------- Co-authored-by: Pascal <admin@serveurperso.com> b9586	2026-06-09 21:33:22 +02:00
Aarnav Pai	d73cd07674	graph: Fix granite speech model inference by applying embedding scale when deepstack is not used (#24357 ) * llama-graph : apply embedding scale when deepstack is not used * nits: remove non-existent hunyuan-vl from the tests * apply suggestion from @gabe-l-hart --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> b9585	2026-06-09 19:46:27 +02:00
Sigbjørn Skjæret	e25a32e98c	ci : fix windows release (#24369 ) b9584	2026-06-09 19:42:23 +03:00
Pascal	483609509d	ui: add opt-in run_javascript frontend tool (#24244 ) * ui: add opt-in run_javascript frontend tool Expose a run_javascript tool to the model, executed entirely in the browser through the existing agentic loop. Code runs in a Web Worker inside a sandboxed iframe with an opaque origin, isolated from the WebUI and its API. Console output, errors and the return value are fed back as the tool result. The parent enforces a hard timeout by removing the iframe, which terminates the worker. Disabled by default, toggle in Settings > Developer. * ui: address review feedback from allozaur Use the JsonSchemaType enum for the tool definition parameter types instead of raw string literals, extending it with STRING and NUMBER. Move the worker shim and the iframe harness html into their own files so the service no longer carries inline source blobs. Replace the remaining magic strings with constants: SANDBOX_EMPTY_OUTPUT and SANDBOX_TRUNCATION_NOTICE, and reuse NEWLINE_SEPARATOR for joins. * ui: move sandbox worker shim to a raw imported file Replace the inline worker template string with a real sandbox-worker.js imported as raw text, and build the iframe harness from it in sandbox-harness.ts. The raw worker ships as a string, not a module, so it is excluded from eslint and the typecheck program.	2026-06-09 18:02:31 +02:00
Saba Fallah	49f3542190	mtmd: build_vit batching (#24352 )	2026-06-09 16:32:08 +02:00
Jeff Bolz	d6d0ce8215	vulkan: reduce iq1 shared memory usage for mul_mm (#24287 ) b9581	2026-06-09 13:27:38 +02:00
Ruben Ortlam	b4e3dc613b	vulkan: add `v_dot2_f32_f16` support in matrix-matrix multiplication and Flash Attention (#24123 ) * vulkan: add support for valve fp16 dot2 extension * use macro for dot2 path choice * properly check for the feature * add dot_product abstraction to reduce preprocessor branching b9580	2026-06-09 13:27:04 +02:00
Nick Towle	ae735b1314	ui: Fix excessive style recalculation on hover (#24243 )	2026-06-09 12:52:20 +02:00
Xuan-Son Nguyen	9682e351b8	mtmd: refactor video subproc handling (#24316 ) * mtmd: refactor video subproc handling * Update tools/mtmd/mtmd-helper.cpp Co-authored-by: Mikko Juola <mikjuo@gmail.com> --------- Co-authored-by: Mikko Juola <mikjuo@gmail.com> b9578	2026-06-09 13:15:12 +03:00
jacekpoplawski	1e912561dd	server: log prompts to directory (#22031 ) * server: log prompts to directory Add `--log-prompts-dir` to write each prompt to a separate text file in the specified directory. * Apply suggestion from @ngxson --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> b9577	2026-06-09 12:09:07 +02:00
Pascal	efbacf8d21	ui: fix mobile chat form overflow and bust stale bundle cache (#24158 )	2026-06-09 11:12:58 +02:00
Pascal	26021699bc	ggml : add GGML_OP_COL2IM_1D (#24206 ) * cpu: add GGML_OP_COL2IM_1D Add the overlap-add (scatter-add) step of a 1D transposed convolution. A ConvTranspose1d factorizes as a GEMM followed by col2im: a weight pre-permuted to [IC, K*OC] is contracted against the [IC, T_in] input with mul_mat to produce a column matrix [K*OC, T_in], and col2im_1d scatters those columns back into the [T_out, OC] signal, with T_out = (T_in - 1)*s0 + K - 2*p0. Keeping the contraction as a plain mul_mat leaves the heavy work on the optimized (and quantizable) matmul kernels, so col2im_1d only does the cheap overlap-add. CPU uses a gather formulation parallelized over output channels, supporting F32, F16 and BF16 with an F32 accumulator. * tests: add backend coverage for GGML_OP_COL2IM_1D Add test_col2im_1d next to the conv_transpose_1d cases, covering F32, F16 and BF16 across eight geometries: the canonical kernel = 2*stride DAC upsampling shape, overlap, no overlap, cropping (p0 = 1 and p0 = stride/2), kernel < stride with zeroed gaps, kernel not a multiple of stride, and a single column unfold. Perf mode gets three real vocoder stage shapes reporting memory bandwidth. max_nmse_err relaxes to 5e-4 for F16 and BF16. cpu: harden GGML_OP_COL2IM_1D ggml_col2im_1d validates s0, oc, p0 and input contiguity at graph build time, before the oc division, protecting every backend at once. The kernel asserts the contiguity its flat indexing assumes and its doc states the full output length including the crop term. The kernel parallelizes over the time axis: the split stays balanced down to OC = 1, where the previous channel split was single threaded. Values are bit identical on the three real vocoder chains, two out of three improve. * tests: extend the GGML_OP_COL2IM_1D grid The eval grid grows to eleven geometries: OC = 1 (mono output stage), K = 1 with stride > 1 (sparse scatter, every gap position zeroed) and a crop down to T_out = 2 where all the gather bounds act at once. 
* tests: add col2im_1d equivalence test tests/test-col2im-1d.cpp proves mul_mat + col2im_1d matches the native ggml_conv_transpose_1d on the CPU backend, F32 bit exact, F16 and BF16 through casts of the column matrix. test-backend-ops cannot cover this for a CPU only op since the CPU backend is its own reference there. * rpc: bump protocol patch version for GGML_OP_COL2IM_1D GGML_OP_COUNT goes from 96 to 97 with the new op, which trips the static_assert in ggml-rpc.h. Bump RPC_PROTO_PATCH_VERSION since the op is appended and no existing op code shifts. b9575	2026-06-09 12:01:37 +03:00
fiesh	961e9a3e46	server : do not clear slots without unified KV cache (#24190 ) * Always export idle slots to RAM Without this, a slot's VRAM cache may not be written to RAM. If this slot happens to be busy then later on, this triggers needless preprocessing in another slot. * cont : clean-up --------- Co-authored-by: Christoph Weiss <weiss@wsoptics.de> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b9574	2026-06-09 10:45:16 +03:00
Sigbjørn Skjæret	f0152efe40	models : fix plamo2 attention_key/value_length regression (#24317 ) b9573	2026-06-09 10:26:44 +03:00
Yash Raj Pandey	fd3271e0b4	ggml-cpu : fix rms_norm_back wrong output under in-place aliasing (#24305 ) * ggml-cpu : fix rms_norm_back wrong output under in-place aliasing * cont : clean-up comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b9572	2026-06-09 10:24:27 +03:00
ravel7524	e3471b3e73	Remove case for GGML_TYPE_Q4_K in mvvq.cu (#23528 ) b9571	2026-06-09 07:46:23 +02:00
Reese Levine	3ac3c20c96	ggml-webgpu: Add clang-format job (#24308 ) * Add clang-format job * try local formatting b9570	2026-06-08 20:54:24 -07:00
Masashi Yoshimura	1e1aca09da	ggml-webgpu: Improve prefill speeds for k-quants + refactor matmul for Q4/Q5/Q8 and k-quants (#24225 ) * ggml-webgpu: Improve prefill speeds + refactor matmul for quants * Fixes for editorconfig checker	2026-06-08 15:19:56 -07:00
Max Krasnyansky	7d2b45b4f7	mtp: support for gemma-4 E2B and E4B assistants (#24282 ) * models: update converter to support smaller assistants * models: add masked_embd tensors to gemma4-assist arch * gemma-4: remove temp debug for conversion * gemma-4-mtp: filter out masked_embedding tensors during conversion b9568	2026-06-08 13:48:52 -07:00
Aldehir Rojas	42a0afd594	server : do not parse when flushing http headers (#24281 ) b9567	2026-06-08 13:32:41 -05:00
Pascal	a66d50588b	graph: guard iswa kq_mask on its own buffer (#24294 ) A SWA-only draft head (e.g. StepFun MTP) leaves the base sub-cache empty, so its kq_mask buffer stays null and asserts at load. Guard each mask on its own buffer in set_input and can_reuse, base and swa. Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b9566	2026-06-08 19:20:28 +02:00
Nikhil Jain	1705d434f6	[ggml-webgpu] Handle buffer overlap / buffer aliasing for concat operator (#24000 ) * Only run webgpu CI on my fork * Add webgpu only workflow * handle buffer overlap case for concat operator * restore build-webgpu.yml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Run clang-format * Update ggml/src/ggml-webgpu/wgsl-shaders/concat.wgsl --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Reese Levine <reeselevine1@gmail.com> b9565	2026-06-08 08:07:31 -07:00
Nikhil Jain	3b3da01dc2	[ggml-webgpu] Implement 2D workgroups for scale, binary, and unary ops (#24044 ) * Only run webgpu CI on my fork * Add webgpu only workflow * Implement 2d workgroups for more operations * fix * Fix type * Move back to global_invocation_id b9564	2026-06-08 08:07:15 -07:00
Xuan-Son Nguyen	3ebe862b5d	docker: install ffmpeg in the released image (#24302 ) b9563	2026-06-08 16:59:57 +02:00
Xuan-Son Nguyen	8f83d6c271	mtmd : add video input support (#24269 ) * wip * ok: lazy bitmap API * remember to free lazy text * wip * add mtmd_helper_video * support video input on server (base64 input) * add MTMD_VIDEO config * add timestamp * update CLI * cli: allow auto-completion for video * add --video arg * fix build * update docs * rename as suggested b9562	2026-06-08 14:40:12 +03:00
Georgi Gerganov	c2b1518fd4	sync : ggml b9561	2026-06-08 14:31:33 +03:00
Georgi Gerganov	6a1de6fbf1	ggml : bump version to 0.14.0 (ggml/1533)	2026-06-08 14:31:33 +03:00
Xuan-Son Nguyen	715b86a366	cli: fix spinner not showing during prompt processing (#24283 ) b9559	2026-06-08 11:11:45 +02:00
Jeff Bolz	c74759a244	vulkan: Use cm2 decode_vector for mul_mat_id B matrix loads (#23991 ) This allows vec4 loads of the B elements. Also increase BK to 64 when this is enabled. Neither of these alone is consistently faster, but together these give a nice speedup. In ggml-vulkan.cpp, we need to make sure the B matrix alignment and stride are multiples of 4. b9558	2026-06-08 10:40:37 +02:00
Ruben Ortlam	0f7fada56b	cuda: reset cuda context after reading memory size (#23935 ) * cuda: reset device in get_memory function if no backend is active * also count device and host buffers * exclude hip and musa from counting and device reset * use device mutex instead of atomic * undo backend_free function move b9557	2026-06-08 10:22:44 +02:00
Harkirat Gill	19bba67c1f	HIP: add gfx1152 and gfx1153 to RDNA3.5 (#24129 ) b9556	2026-06-08 08:33:23 +02:00
Xuan-Son Nguyen	daf6bc9f2d	metal : fix im2col 1D case (audio models) (#24220 ) b9555	2026-06-08 09:03:18 +03:00
Neo Zhang	d403f00ec3	[SYCL] Update compute runtime version to 26.x in docker (#24070 ) * update compute runtime from 25 to 26 in docker * add comment with old driver for multiple GPUs b9554	2026-06-08 10:35:18 +08:00
ddh0	9e3b928fd8	common : relax sampler name matching (#23744 ) * common : relax sampler name matching Currently, in some cases, the alternative names for samplers (like `top-k` and `min-p` instead of the canonical `top_k` and `min_p`) are not always recognized by the `common_sampler_types_from_names` function in `common/sampling.cpp`. This PR changes the signature of this function to remove the `bool allow_alt_names` flag, and removes all occurrences of the flag from call sites. Therefore, the function will now always match all known names. I also changed the logic of the function to unconditionally check the provided sampler names against both the canonical and alternative names, and to be case-insensitive. This fixes an issue I was seeing wherein samplers specified in the `llama-server` UI were not recognized as valid when the alternative names were used. * add more alt names * cont. fix * cast to unsigned char for correctness * common : unify sampler name mapping * annotate canonical vs. alt sampler name mappings per @CISC * Update common/sampling.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * common : auto-generate sampler name aliases per @ngxson * use merged map for matching * use `.merge` instead of iterating * nit: simplify comment * nit: use insert everywhere, not index assignment --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b9553	2026-06-07 22:48:11 +02:00
David Friehs	8a963fc10e	convert : fix conversion for Mistral-Medium-3.5-128B (#24268 ) Mistral explicitly sets `moe` and `llama_4_scaling` to `null` in params.json, breaking `key in dict` checks during conversion. Replace with `dict.get(key) is not None` where this matters. Fixes `convert-hf-to-gguf.py --mistral-format Mistral-Medium-3.5-128B`	2026-06-07 21:41:39 +02:00
Georgi Gerganov	379ac6673b	kv-cache : avoid kv cells copies (#24277 ) b9551	2026-06-07 21:42:54 +03:00
Pascal	f0156d1401	kv-cache: follow the source cache size when sharing cells (#24267 ) A fitted target context can end up smaller than the draft default, the oversized assistant views then overflow the shared K/V tensors and trip the ggml_view_4d size assert during graph reserve. b9550	2026-06-07 18:33:00 +03:00
Aman Gupta	04eb4c446d	llama : add Gemma4 MTP (#23398 ) b9549	2026-06-07 20:50:54 +08:00
Sigbjørn Skjæret	8a091c47ab	spec : fix vocab compatibility check (#24256 ) b9548	2026-06-07 14:43:52 +03:00
konradmb	465b1f0e75	arg: Skip mmproj download when user supplied mmproj (#24239 ) b9547	2026-06-07 11:18:44 +02:00
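
Note on 26021699bc (ggml : add GGML_OP_COL2IM_1D): the commit message describes a ConvTranspose1d as a GEMM followed by an overlap-add, with T_out = (T_in - 1)*s0 + K - 2*p0. The sketch below is a rough pure-Python illustration of that factorization, not the ggml kernel. The function names, the nested-list tensors and the column index layout (k*OC + oc) are assumptions made for the example, and it uses the scatter form for readability where the commit says the CPU kernel uses a gather formulation.

```python
def gemm_columns(x, w):
    """Contract x [IC][T_in] with w [IC][OC][K] into columns [T_in][K*OC].

    The k*OC + oc column index is an assumed layout for this sketch only.
    """
    ic, t_in = len(x), len(x[0])
    oc, k = len(w[0]), len(w[0][0])
    cols = [[0.0] * (k * oc) for _ in range(t_in)]
    for t in range(t_in):
        for kk in range(k):
            for c in range(oc):
                cols[t][kk * oc + c] = sum(x[i][t] * w[i][c][kk] for i in range(ic))
    return cols


def col2im_1d(cols, oc, k, s0, p0):
    """Overlap-add the columns into an [OC][T_out] signal (scatter form)."""
    t_in = len(cols)
    t_out = (t_in - 1) * s0 + k - 2 * p0
    out = [[0.0] * t_out for _ in range(oc)]
    for t in range(t_in):
        for kk in range(k):
            pos = t * s0 + kk - p0  # output position before cropping
            if 0 <= pos < t_out:    # p0 crops both ends of the signal
                for c in range(oc):
                    out[c][pos] += cols[t][kk * oc + c]
    return out


def conv_transpose_1d_reference(x, w, s0, p0):
    """Direct transposed convolution, used only to check the factorization."""
    ic, t_in = len(x), len(x[0])
    oc, k = len(w[0]), len(w[0][0])
    t_out = (t_in - 1) * s0 + k - 2 * p0
    out = [[0.0] * t_out for _ in range(oc)]
    for i in range(ic):
        for t in range(t_in):
            for c in range(oc):
                for kk in range(k):
                    pos = t * s0 + kk - p0
                    if 0 <= pos < t_out:
                        out[c][pos] += x[i][t] * w[i][c][kk]
    return out


# Small example: IC = 2, OC = 2, K = 4, stride 2 (kernel = 2*stride), p0 = 1.
x = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
w = [[[1.0, 2.0, 3.0, 4.0], [1.0, 0.0, 1.0, 0.0]],
     [[2.0, 2.0, 2.0, 2.0], [0.0, 1.0, 0.0, 1.0]]]
y = col2im_1d(gemm_columns(x, w), oc=2, k=4, s0=2, p0=1)
```

With small integer-valued inputs such as these, the two paths should agree exactly; with general floats the summation order differs, so a tolerance would be the safer comparison.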


9596 Commits
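
Note on 8a963fc10e (convert : fix conversion for Mistral-Medium-3.5-128B): the commit message says that `moe` and `llama_4_scaling` are set to `null` in params.json, which breaks `key in dict` checks. The sketch below illustrates that pitfall in isolation. It is not the converter's code: the JSON snippet, the `dim` key and the `has_value` helper are hypothetical and exist only for this example.

```python
import json

# Hypothetical params.json fragment: the keys are present but explicitly null.
params = json.loads('{"dim": 4096, "moe": null, "llama_4_scaling": null}')


def has_value(d, key):
    """True only when the key exists and its value is not None."""
    return d.get(key) is not None


# A membership test treats an explicit null as "present" ...
present_by_membership = "moe" in params

# ... while the .get() form treats null and missing keys the same way.
present_by_value = has_value(params, "moe")
```

Here `present_by_membership` is true while `present_by_value` is false, which is the distinction the commit relies on when it swaps one check for the other.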