llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-06-28 07:10:21 +00:00

Author	SHA1	Message	Date
Johannes Gäßler	192d8ae8b8	CUDA: missing PDL sync for FWHT, better fallback (#23690 ) b9334	2026-05-26 11:05:51 +08:00
forforever73	35c9b1f39e	metal : add apple device id (#23566 ) Co-authored-by: lvyichen <lvyichen@stepfun.com> b9333	2026-05-25 21:05:16 +03:00
Max Krasnyansky	4bead4e30d	snapdragon: bump toolchain docker to v0.7 to fix ui build issues (#23680 )	2026-05-25 10:57:43 -07:00
Georgi Gerganov	302e2c2652	ci : reduce PR jobs by matching backend paths (#23675 ) * ci : disable SYCL f16 builds * ci : extract android and hip into separate workflows * ci : move webgpu to separate workflow * ci : move the rpc to a separate workflow * ci : extract s309x and ppcl jobs * ci : extract opencl job into a separate workflow b9331	2026-05-25 20:54:54 +03:00
Pascal	328874d054	model: tag ffn_latent as MUL_MAT to fix buft probe (#23664 ) ffn_latent_down/up are declared GGML_OP_MUL in LLM_TENSOR_INFOS but nemotron-h feeds them through ggml_mul_mat. The loader buft probe asks the backend about the declared op, so it tested an elementwise MUL on a q8_0 weight. That used to return true unconditionally and the weight stayed on GPU by luck. Once supports_op told the truth, the probe got a no and the loader pushed the weight and its matmul to CPU, splitting the graph. Tagging it MUL_MAT asks the real question, the math is unchanged. Verified on Nemotron 3 Super 120B Q5_K_M: from 64.9 back to 103.22 t/s. b9330	2026-05-25 16:05:04 +02:00
Aman Gupta	c1f1e28d29	CUDA: add fast walsh-hadamard transform (#23615 ) * CUDA: add fast walsh-hadamard transform * review: add unrolls + change size_t -> int * warp size 64 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b9329	2026-05-25 21:12:10 +08:00
Pascal	5a4126adc1	ui: fix stop/continue during an agentic loop (#23356 )	2026-05-25 14:18:59 +02:00
Michael Wand	a4d2d4ae41	convert : add compressed-tensors NVFP4 support (#21095 ) * Refactored Compressed Tensors NVFP4 support for new base.py * Support compressed-tensors NVFP4 conversion * Moved Qwen MTP remap into filter_tensors * simplify * pathlib no longer used --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-25 14:16:11 +02:00
Georgi Gerganov	d161ea7071	sync : ggml b9326	2026-05-25 12:43:27 +03:00
Georgi Gerganov	45158f460e	ggml : bump version to 0.13.0 (ggml/1510)	2026-05-25 12:43:27 +03:00
Georgi Gerganov	22307b3e8b	sync : ggml	2026-05-25 12:38:01 +03:00
Georgi Gerganov	ce5890b5f7	ggml : bump version to 0.12.1 (ggml/1508)	2026-05-25 12:38:01 +03:00
Ori Pekelman	b251f74f49	ggml.h: correct ggml_silu_back arg docstring (a=dy, b=x) (ggml/1500)	2026-05-25 12:38:01 +03:00
Dev-X25874	fa97041524	ggml-alloc: fix out-of-bounds read in ggml_dyn_tallocr_remove_block (ggml/1492)	2026-05-25 12:38:01 +03:00
Johannes Gäßler	ae251b5ff2	TP: fix ggml context size calculation (#22616 ) * TP: fix ggml context size calculation, memory leak * move split state cache back into the context * revert to constant ggml context size for cgraphs * increase headroom for statically allocated tensors * remove obsolete include b9320	2026-05-25 12:37:25 +03:00
Gilad S.	66efd13375	ggml: `gguf_init_from_callback` and `gguf_init_from_buffer` (#22341 ) * ggml: implement `gguf_init_from_buffer` * test: `gguf_init_from_buffer` * fix: memory breakdown for a model loaded with `no_alloc` from a file is consistent with being loaded from a buffer * fix: use `GGML_UNUSED` Co-authored-by: Copilot <copilot@github.com> * fix: remove `total_size` from `gguf_reader` * fix: file offset calculation, rename `offset` to `data_offset` Co-authored-by: Copilot <copilot@github.com> * refactor: extract model loader bug fixes to another PR * feat: add `gguf_init_from_callback` * fix: always require a max expected size * fix: change `gguf_reader_callback_t`'s `output` type to `void `, change `max_expected_size` and offsets to `uint64_t` fix: harden against offset overflow in buffer read * fix: remove seek behavior from the callback * feat: `max_chunk_read == 0` means `SIZE_MAX` * fix: seeking in a gguf file with no tensors --------- Co-authored-by: Copilot <copilot@github.com> b9319	2026-05-25 11:33:29 +02:00
Aman Gupta	6c4cbdc70b	server: MTP layer kv-cache should respect draft type ctk (#23646 ) b9318	2026-05-25 16:46:23 +08:00
alex-spacemit	5fdf07e33b	ci : update spacemit toolchain url and enhance curl command (#23642 ) * fix(action): update SpacemiT toolchain URL and version Change-Id: If4cc1c738a855274103f8c3ad52daa33528acd0c * fix(action): add -L flag to curl command for URL redirection Change-Id: I9b6c37390f0c7a733a36308c8fb53d22d234ab06	2026-05-25 10:43:24 +02:00
Sigbjørn Skjæret	062d3115aa	ci : fix pre-tokenizer-hashes check (#23651 )	2026-05-25 10:41:25 +02:00
Tim Neumann	314e729347	llama : document that only one on-device state can be saved per sequence (#23520 ) b9315	2026-05-25 10:29:28 +03:00
Aldehir Rojas	d55fb97174	ci : install host compiler on android-ndk build (#23630 )	2026-05-25 10:18:08 +03:00
Jeff Bolz	826539ce59	ggml : Parallelize quant LUT init (#23595 ) - Use OpenMP to parallelize iq2xs_init_impl and iq3xs_init_impl. - Move the OpenMP detection from ggml-cpu to ggml-base. - Update OpenMP dependencies in ggml-config.cmake.in. b9313	2026-05-25 10:15:46 +03:00
Saba Fallah	b96487645c	ui: media attachments before text (#23467 ) * ui: media attachments before text * fix prettier formatting	2026-05-25 08:50:41 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	9627d0f540	vendor : update cpp-httplib to 0.45.1 (#23639 ) b9311	2026-05-25 09:45:22 +03:00
jacekpoplawski	e2ef8fe42c	server: fix checkpoints creation (#22929 ) * common : add common_chat_split_by_role * cont : fix spans to reach end of message * server: fix checkpoints creation - extract message_spans from chat templates - find the prompt token position before the latest user message - split prompt batching at that position - create a context checkpoint before the latest user input - avoid periodic mid-prompt checkpoints when that position is known - handle multimodal prompts when mapping text/template positions to server prompt tokens - add --checkpoint-min-step to control minimum spacing between checkpoints * cont : clean-up * Support autoparser detection for message barriers * server: fix message span delimiter and update docs --------- Co-authored-by: Alde Rojas <hello@alde.dev> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com> b9310	2026-05-25 08:56:18 +03:00
fairydreaming	6d57c26ef8	perplexity : fix even more integer overflows (#23623 ) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> b9309	2026-05-25 08:12:39 +03:00
Georgi Gerganov	28123a3937	ci : move most slim jobs to self-hosted runners (#23619 ) * ci : remove tag from build-self-hosted.yml * ci : slim -> self-hosted * ci : prevent heavy CPU jobs from running on fast runners * ci : prevent cmake pkg to run on dedicated fast runners * ci : try to bump 3.11 -> 3.13 * ci : move lint back to 3.11 * ci : back to 3.11 * ci : add comment about UI jobs * ci : move python requirements check to CPU runners this job is a bit slow for a dedicated "fast" runner * ci : add self-hosted ui workflow * ci : fix UI naming * tmp to check if arm64 fast is compatible with all jobs * revert last commit	2026-05-25 08:11:19 +03:00
Georgi Gerganov	549b9d8433	ci : update build-self-hosted.yml (#23616 )	2026-05-24 18:20:10 +03:00
Sigbjørn Skjæret	5d246a792d	convert : minor fixes for numpy 2.x (#23571 )	2026-05-24 09:51:31 +02:00
Aldehir Rojas	63248fc3e3	cmake : fix ui build (#23592 ) * cmake/ui : add -fPIC to llama-ui static lib * cmake : rename host compiled embed helper b9305	2026-05-24 02:37:28 -05:00
Aman Gupta	83eebe9d08	server: add margin for draft model for `fit` (#23485 )	2026-05-24 14:43:08 +08:00
Johannes Gäßler	fff63b5108	TP: fix entirely zero-sized slices per device (#23525 )	2026-05-24 08:19:33 +02:00
shaofeiqi	f3061116ff	opencl: batch profiling to improve speed and prevent memory leaks (#23495 )	2026-05-23 23:11:43 -07:00
Yiwei Shao	1c0f6db545	hexagon: apply repl optimization in flash attn softmax as #22993 (#23455 ) b9301	2026-05-23 19:56:59 -07:00
Aparna M P	cec51c7a7d	snapdragon: update windows toolchain to use hsdk v6.6.0.0 (#23552 )	2026-05-23 19:56:41 -07:00
Aldehir Rojas	b22ff4b7b4	cmake/ui : refactor the build (#23352 )	2026-05-23 17:08:22 -04:00
Aditya Singh	c0c7e147e7	requirements : bump torch to 2.11.0 (#23503 ) * requirements: relax torch~=2.6.0 to torch>=2.6.0 for convert_hf_to_gguf The ~=2.6.0 operator resolves to >=2.6.0, <2.7.0, which fails on PyPI for platform/CPython combinations where 2.6.x is not present. The accompanying comment already says 'PyTorch 2.6.0 or later', so the looser >=2.6.0 matches the documented intent and unblocks pip install -r requirements/requirements-convert_hf_to_gguf.txt. Fixes #23408 * requirements: bump torch floor to 2.11.0 per maintainer * requirements: pin torch to ==2.11.0 per project policy * requirements: pin mtmd torch and torchvision to 2.11.0/0.26.0 per project policy * requirements: suppress check_requirements pin warning on mtmd The check_requirements script flags '==' on lines in files matched by //requirements.txt. Append the documented suppression comment to the pinned torch and torchvision lines (and to the s390x platform marker lines) so the check passes while keeping the pins required by project policy. * ty: silence Tensor/Module union check on model[0].auto_model With torch 2.11.0 stubs, nn.Sequential.__getitem__ now returns Tensor | Module rather than Module, so model[0].auto_model fails ty on the SentenceTransformer code path. The runtime behavior is unchanged because SentenceTransformer always wraps a Module at index 0. Adding a targeted unresolved-attribute ignore keeps the type-check green without altering behavior. A follow-up issue tracks typing the variable explicitly.	2026-05-23 18:24:39 +02:00
Michael Wand	b0df4c0cfd	model : add NVFP4 MTP scale tensors (#23563 ) * Add NVFP4 MTP scale tensors * Link Qwen3.5 MTP tensors * Aligned nullptr b9297	2026-05-23 13:30:31 +02:00
dskwe	a497476330	ggml : Check the right iface method before using the fallback 2d get (#23514 ) b9296	2026-05-23 12:49:24 +02:00
Jeff Bolz	95405ac65f	vulkan: fix windows find_package of SPIRV-Headers (#23215 ) * vulkan: fix windows find_package of SPIRV-Headers * not windows-only b9295	2026-05-23 09:44:46 +02:00
Shawn Gu	0f3cb3fc8b	opencl: generalize Adreno MoE kernels on M (#23449 ) b9294	2026-05-22 17:08:41 -07:00
Aldehir Rojas	1acee6bf89	server: only parse empty msg if continuing an assistant msg (#23506 )	2026-05-22 11:58:15 -04:00
fairydreaming	ef570f6308	perplexity : fix integer overflow (#23496 ) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> b9292	2026-05-22 15:50:44 +03:00
Alexey Kopytko	cc9e331213	SYCL: improve MoE prefill throughput (#23142 ) - change `k_copy_src1_to_contiguous` so that it uses a precomputed contiguous mapping where all rows "owned" by an expert are in one slice with a known start and end - switch the `O(n_as * n_routed_rows)` contraption to a counting sort-based procedure with `O(n_as + n_routed_rows)` complexity b9291	2026-05-22 15:50:17 +03:00
Alexey Kopytko	bcfd1989e9	sycl : Level Zero detection in ggml_sycl_init (#23097 ) * [SYCL] Centralize Level Zero detection in ggml_sycl_init * use the same wording * get back the warning b9290	2026-05-22 15:49:45 +03:00
karavayev	56f16f235c	SYCL : gated_delta_net K>1 (#23174 ) * sycl_gated_delta_net K>1 * editor_config b9289	2026-05-22 15:48:56 +03:00
Katostrofik	8cc67efcd4	SYCL: add BF16 to DMMV kernel path (~4x tg speedup on Intel Arc) (#21580 ) * SYCL: add BF16 to DMMV kernel path for ~4x token generation speedup BF16 models had no dedicated token generation kernel; they fell through to the generic full-GEMM path, resulting in ~14% memory bandwidth utilization on Intel Arc GPUs. This adds BF16 support to the DMMV (dequantize mul-mat-vec) path, matching the existing F16 implementation. Fixes #20478 * SYCL: fix BF16 DMMV out-of-bounds when ncols % 64 != 0 The qk=1 kernel (used for F16 and BF16) iterates with stride 2*GGML_SYCL_DMMV_X (= 64 on Intel targets where WARP_SIZE=16). When ncols is a multiple of DMMV_X (32) but not of 2*DMMV_X (64), the last warp iteration accesses elements at col >= ncols, producing NaN for the final row and wrong values for interior rows. Fix: tighten can_use_dequantize_mul_mat_vec to require ne[0] % (2*DMMV_X) == 0 for F16/BF16 types, and update the ASSERT in the BF16 launcher to match. Quantized types use block-structured kernels with different access patterns and keep the existing DMMV_X check. Verified: test-backend-ops MUL_MAT passes 913/913 on Intel Arc Pro B70. Previously failing: m=128/129 n=1 k=1056 cases (NaN and ERR > 0.0005). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-22 15:48:24 +03:00
Jesus Talavera	95feeab52e	docs: Update documentation with Granite 4.0/4.1 (#23404 )	2026-05-22 20:35:46 +08:00
Sachin Sharma	99d4026b11	ggml-zendnn : add Q8_0 quantization support (#23414 ) * ggml-zendnn : add Q8_0 quantization support * ggml-zendnn : sync with latest ZenDNN * ggml-zendnn : address review comments for Q8_0 b9286	2026-05-22 13:16:55 +02:00
fairydreaming	9c92e96a64	cmake : build router app only during standalone builds (#23521 ) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> b9285	2026-05-22 12:55:29 +03:00
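
Note on cc9e331213 (SYCL: improve MoE prefill throughput): the commit message describes replacing an `O(n_as * n_routed_rows)` procedure with a counting-sort-based one that runs in `O(n_as + n_routed_rows)` and gives each expert one contiguous slice of rows. The sketch below illustrates that general technique in plain Python. It is not the SYCL kernel code; `group_rows_by_expert`, `expert_ids`, and `n_experts` are hypothetical names chosen for illustration, and the real implementation may differ in layout and details.

```python
from typing import List, Tuple


def group_rows_by_expert(expert_ids: List[int], n_experts: int) -> Tuple[List[int], List[int]]:
    """Group routed rows by expert with a counting sort.

    Hypothetical helper (not from llama.cpp). Returns (offsets, order):
    rows owned by expert e are order[offsets[e]:offsets[e + 1]].
    """
    # Pass 1: count how many routed rows each expert owns.
    counts = [0] * n_experts
    for e in expert_ids:
        counts[e] += 1

    # Exclusive prefix sum: offsets[e] is where expert e's slice starts.
    offsets = [0] * (n_experts + 1)
    for e in range(n_experts):
        offsets[e + 1] = offsets[e] + counts[e]

    # Pass 2: scatter each row index into its expert's slice.
    cursor = offsets[:n_experts]  # slicing copies, so offsets stays intact
    order = [0] * len(expert_ids)
    for row, e in enumerate(expert_ids):
        order[cursor[e]] = row
        cursor[e] += 1

    return offsets, order


if __name__ == "__main__":
    offsets, order = group_rows_by_expert([2, 0, 2, 1, 0], n_experts=3)
    for e in range(3):
        print(e, order[offsets[e]:offsets[e + 1]])
```

Each row is visited twice and each expert once, which is where the `O(n_as + n_routed_rows)` bound comes from. The scatter is stable, so rows keep their original relative order inside each expert's slice.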

9334 Commits