llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-06-25 22:00:21 +00:00

Author	SHA1	Message	Date
Kashif Rasul	e8ecce53b8	docs : Eagle3 qwen3 draft model support (#24977 ) * eagle3: accept Eagle3LlamaForCausalLM draft checkpoints * docs: add eagle3 speculative decoding section * docs: address eagle3 review comments * docs: add more angelslim eagle3 models * docs: add gpt-oss eagle3 models and link to pr 18039	2026-06-25 15:58:00 +03:00
David Spruill	e9fb3b3fc0	sycl : support --split-mode tensor (#24152 ) * Sycl tp stage1 (#1) * SYCL: tensor parallelism (--split-mode tensor) for dual-GPU Adds the comm_init/comm_free/comm_allreduce_tensor trio that the meta-backend queries via get_proc_address to enable backend-specific all-reduce, mirroring the pattern used by ggml-cuda.cu. For N=2 (the common dual-GPU case) implements a degenerate ring all-reduce with two size-branched paths: * Small (nelem < 32768): FP32 direct memcpy + per-device ADD kernel chained via depends_on(memcpy_event). 4 SYCL submissions/call. * Large (nelem >= 32768): BF16-compressed. Each device compresses FP32 -> BF16 in a local outbox, cross-device memcpys to the peer's inbox (HALF the PCIe bytes), then decompresses + adds into the local FP32 partial. 6 SYCL submissions/call but PCIe bytes halved -- wins for any tensor where PCIe dominates kernel time. Threshold and BF16 path pattern mirror the CUDA NCCL allreduce. Storage: ONE persistent uint8_t buffer per device, 4 * nelem bytes (matches both path layouts: FP32 nelem floats; BF16 outbox+inbox = 2 * nelem uint16_t each). Single alloc+free per device keeps the SYCL pool's strict-LIFO invariant trivial. Initial impl handles N=2 FP32 contiguous tensors. Other cases return false, causing the meta-backend to use its generic butterfly fallback. Per-call sync is intentionally omitted. SYCL in-order queue semantics ensure that the meta-backend's next compute on the same per-device queue waits for our final ADD, and the next allreduce's first op on the same persistent buffer waits via the same queue. Only comm_free does an explicit final wait. OneCCL is NOT used: OneCCL 2021.17 hardcodes single-device-per-process in communicator_impl.hpp:47 (condition devices.size() == 1), which is incompatible with llama.cpp's single-process multi-GPU model. 
Measured on dual Intel Arc Pro B70 (NEO 26.05.x, oneAPI 2025.3 + DPC++ nightly): Llama-3.3-70B Q4_K_M, -sm tensor -fa 1 -ctk f16 -ctv f16: pp512 = 377.08 t/s (vs 313.65 layer mode = +20.2%) tg128 = 17.40 t/s (vs 9.74 layer mode = +78.6%) Qwen3-Coder-Next-80B-A3B Q3_K_M (MoE): pp512 = 216.56 t/s (vs 156.58 meta-backend butterfly = +38.3%) tg128 = 17.60 t/s (vs 14.31 meta-backend butterfly = +23.0%) Qwen3-4B Q4_K_M: pp64 = 984.51 t/s, tg16 = 49.29 t/s Llama-3.3-70B in SYCL TP now comfortably beats production layer mode on both prefill and decode. Coder-Next-80B-A3B (MoE) also wins on both; the BF16 path is what unlocks the many-medium-allreduces prefill pattern. Build/CMake: no changes. No new dependencies. ~210 lines added across ggml-sycl.h and ggml-sycl.cpp. * Fix comments * documentation update to address PR feedback * Bring over my device-to-device memcpy changes * move the dev2dev_memcpy calls to the upstream 7-parameter variety * Fix a typo and remove a trailing whitespace	2026-06-25 08:35:21 +03:00
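The two size-branched paths described in the commit above can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions, not the backend's code: it uses plain `std::vector` buffers in place of SYCL device memory, the names `fp32_to_bf16`, `bf16_to_fp32`, `scratch_bytes`, and `allreduce_pair` are hypothetical, and the round-to-nearest-even conversion is an assumption (the commit message does not state the rounding mode). Only the 32768-element threshold and the 4 * nelem byte buffer size come from the message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Threshold quoted in the commit message: fewer elements take the FP32 path,
// this many or more take the BF16-compressed path.
constexpr std::size_t kBf16Threshold = 32768;

// Hypothetical helper: FP32 -> BF16 with round-to-nearest-even.
// The rounding mode is an assumption; NaN/Inf handling is omitted.
inline std::uint16_t fp32_to_bf16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>(bits >> 16);
}

// BF16 -> FP32 is exact: the 16 stored bits become the high half of the float.
inline float bf16_to_fp32(std::uint16_t h) {
    const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// One scratch buffer per device, 4 * nelem bytes: either nelem floats (FP32
// path) or an outbox plus an inbox of nelem uint16_t each (BF16 path).
inline std::size_t scratch_bytes(std::size_t nelem) { return 4 * nelem; }

// Host-side stand-in for the N=2 all-reduce: each "device" holds a partial
// sum and ends up with its own partial plus what it received from the peer.
inline void allreduce_pair(std::vector<float> & dev0, std::vector<float> & dev1) {
    assert(dev0.size() == dev1.size());
    const std::size_t n = dev0.size();
    if (n < kBf16Threshold) {
        // Small path: exchange FP32 data directly, then add on each side.
        const std::vector<float> inbox0 = dev1;
        const std::vector<float> inbox1 = dev0;
        for (std::size_t i = 0; i < n; ++i) {
            dev0[i] += inbox0[i];
            dev1[i] += inbox1[i];
        }
    } else {
        // Large path: compress to BF16, exchange half as many bytes, then
        // decompress and add into the local FP32 partial.
        std::vector<std::uint16_t> out0(n), out1(n);
        for (std::size_t i = 0; i < n; ++i) {
            out0[i] = fp32_to_bf16(dev0[i]);
            out1[i] = fp32_to_bf16(dev1[i]);
        }
        for (std::size_t i = 0; i < n; ++i) {
            dev0[i] += bf16_to_fp32(out1[i]);
            dev1[i] += bf16_to_fp32(out0[i]);
        }
    }
}
```

In the large path each side adds a rounded copy of the peer's data to its own unrounded partial, so the two results can differ slightly whenever the inputs are not exactly representable in BF16.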
Max Krasnyansky	8be759e6f7	hexagon: MUL_MAT and MUL_MAT_ID rework : 32x32 tiled weight repack, kernel-params, cached graphs (#24954 ) * hex-mm: new weight layout and fusion updates * hvx-mm: unroll the new tiled vec_dots to optimize hvx register util * hex-mm: optimize dyn.quant format for q8_0 and q8_1 to reduce overhead in vec_dots. * hvx-mm: parallel quantizer per block for large rows * hvx-mm: simplify and futher optimize dyn.quant and vec_dots * hvx-mm: keep intermediate per tile accumulators in fp16 * hmx-mm: optimize weight dequant by aligning the repacked tiles with the DMA * hmx-mm: remove qweight scratch and just use vtcm_weight * hmx-mm: remove all unused and obsolete code * hmx-mm: the new tiled repack format is here to stay -- rename all x4x2 to _tiled * hmx-mm: improve activation processing with dma prefetch * hex-mm: fix hmx/hvx fallback logic and MUL_MAT_ID allocation (unbreaks OLMoE) * hex-mm: align the weight tiles with dma just like we did in hmx-mm * hex-mm: factor out common mm bits into htp/matmul-ops.h * hex-mm: start moving mm kernel selection to the host * hex-mm: move all of the matmul param compute into the host * hmx-mm: restore pipelined mode * hmx-mm: unroll the dequant functions to optimize register usage * hmx-mm: further improve activation process * hex-mm: use vtcm_seq_alloc for all vtcm allocations and define more common functions * hex-mm: improve mm optimizer to acount for number of activation threads * hex-mm: fix matmul-id kernel params selection (unbreaks OLMoE and LFM) * hexagon: remove support for arch < v73 since HMX is now required for most use-cases * hex-mm: cleanup naming for consistency * hex-mm: make sure matmul fusion accounts for vtcm allocation * hex-mm: minor cleanup for kernel_params definition * hex-mm: replace hardcoded limits with proper checks for vtcm requirements * hex-mm: add support for non-tiled mm as a fallback option and factor out hvx kernels into separate header * hex-mm: remove unused functions * 
hex-mm: add shorthand for MM_SELECT in run-tool script * hvx-mm: factor out hvx/hmx microkernels and unify matmul entry and dispatch * hex-mm: further cleanup matmul fallback path * hex-mm: refactor matmul entry point and dispatch a bit further * hexagon: update cmake build to enable hmx for everything * hex-ops: optimize kernel_param updates and include summary in the logs * hex-mm: add support for GGML_HEXAGON_MM_SELECT * hex-mm: add hex-common header * hex-mm: pass correct number of tasks to workpool * hex-mm: add proper checks for no-work in dyn.quant tasks * hex-mm: convert all quantizers into a macro * hex-mm: fix hvx-flat fallback to pass all MUL_MAT tests * hex-mm: vectorize q8_1 quantizer * hex-mm: improve fused ffn mm stride handling * hex-mm: consistent use of n_threads and pipeline in kernel_params * hexagon: minor formatting * hex-mm: update MUL_MAT_ID kernel_param handling to make sure host/npu are in sync * hvx-mm: go back to accumulating in fp32 in tiled hvx kernels, more accurate and same perf * hvx-mm: unroll the loops and remove masking that is not needed for tiled accums * hmx-mm: optimize activation processing (slit loops, some unrolling, etc) * hmx-mm: minor optimization for output processing * hex-mm: consistent use of uint32_t and size_t in mm kernels * hex-mm: remove legacy restrictions for rows to be multiple of 256 * hexagon: replace sprintf with snprintf * hex-mm: relax hardcoded nrows checks and rely on VTCM size requirements * hexagon: minor alignment fix * hexagon: fix trailing spaces * hex-mm: relax padding from 256 to 128 (leftovers) * hex-mm: remove redundant checks for weight align to 128 we always use 2D dma for the weights and align them properly * hmx-mm: MUL_MAT_ID better work distribution between hvx threads and hmx tracing * hex-mm: specialize per-token mmid activation handling * hex-profile: update python scripts to handle kernel-params section in the logging output * hex-mm: move n_prefetch (aka dma_depth) into kernel 
params and remove unused fields * hex-trace: use easier to parse format, simply and fix post-proc scripts * hmx-mm: relax 32 row limit for output processing which helps utilization * hmx-mm: use start-chunk idx for tracing info * hmx-mm: parameterize activation dma pipeline * hexagon: add support for simple graph caching to avoid recomputing kernel-params * hex-mm: remove left-over repack functions * hex-mm: tighten n_prefetch asserts * hex-mm: remove duplicate round/align_up helper * hexagon: cleanup common header used in host/npu * hexagon: update early wakeup threshold * hmx-mm: define cost constants and update solver to assume that repacked ne[1] is padded to 32 * hmx-mm: make precompute_matmul a bit more readable (split into smaller functions, etc) * hex-mm: remove n_threads constraint * hex-mm: minor formatting updates * hex-mm: remove obsolete profiling logs * hex-mm: restore hardcode gate to refuse lm-head to avoid repacking that tensor	2026-06-24 12:14:25 -07:00
aafsmarak	0ef6f06d55	docs/android.md: Add dependency `libandroid-spawn` for building in termux (#21812 ) Fixes https://github.com/ggml-org/llama.cpp/issues/18615	2026-06-22 05:48:31 +02:00
Xuan-Son Nguyen	552258c535	server: (router) rework -hf preset repo (#24739 ) * server: temporary remove HF remote preset * rework remove preset.ini support * rm unused get_remote_preset_whitelist() * print warning * add docs * rm stray file	2026-06-18 12:45:23 +02:00
Xuan-Son Nguyen	24bba7b98e	mtmd: refactor preprocessor, add mtmd_image_preproc_out (#24736 ) * add mtmd_image_preproc_out * add dev docs * remove unused clip API * rm unused clip_image_f32_batch::grid * change preprocess() call signature	2026-06-18 12:04:39 +02:00
Neo Zhang	9724f664e8	[SYCL] rename GGML_SYCL_SUPPORT_LEVEL_ZERO (#24719 ) * rename GGML_SYCL_SUPPORT_LEVEL_ZERO to GGML_SYCL_SUPPORT_LEVEL_ZERO_API, and GGML_SYCL_ENABLE_LEVEL_ZERO to GGML_SYCL_USE_LEVEL_ZERO_API * fix code format * fix error when rebase	2026-06-18 11:18:26 +03:00
Neo Zhang	6f1034b32a	[SYCL] support OPs: conv_2d, conv_2d_dw, conv2d_transpose (#24600 ) * fix conflict * fix format issue, rename * rm debug code * correct the file name	2026-06-18 09:40:03 +03:00
Neo Zhang	74a80dd9c0	[SYCL] add dev2dev memcpy by SYCL API (#24476) * add dev2dev memcpy by SYCL API * mv GGML_SYCL_DEV2DEV_MEMCPY to runtime table * update the detect method for p2p comm * fix the error created while fixing a conflict --------- Co-authored-by: Neo Zhang <NA>	2026-06-17 17:21:34 +03:00
Neo Zhang	d1759e4156	[SYCL] Add conv_3d (#24691 ) * add conv_3d * optimize * update ops.md * restore test script * rm unused code * rm copyright notes	2026-06-17 17:20:01 +03:00
Zijun Yu	890f1a27ed	openvino: OV 2026.2, context-shift, Q5_1 support, gemma4 dense/embedding, and -fa off (#24503 ) * Add interface is_model_splitted() to check the c-graph is splited or not * Infer and propagate dynamic-dimension indices for all tensors in the GGML graph in api compute_model_outputs() * Only do this for fallback sub graph * Move dynamic dims compute in graph missmatch * ggml-openvino: fix tensor data handling for PERMUTE/VIEW ops in split models * ggml-openvino:add comments * ggml-openvino: override VIEW op_case to 0 for split model inputs * openvino backend: Handle unsupported VIEW shape-mismatch in OpenVINO backend * Enable additional mul_mat tests and add tensor data saving function (#81) * ggml-openvino: fix CONT/TRANSPOSE mapping and improve dynamic-dimension handling * OpenVINO: add NORM/TANH support and rework SOFT_MAX translation * ggml-openvino: extend VIEW handling * Enable -fa off (#118) * Enable --context-shift * Fix llm param compute error for normal softmax not the softmax in attention * OpenVINO backend: fix error for attention size compute in llm param * use tensor->extra in infer_request i/o * OpenVINO backend: refacter the compute_llm_params() func add get_attention_pattern_case to easy extand * OpenVINO backend: clean unused code * 1to1 match op update (#146) * added translate_1to1_match_1_input function and updated gelu and tanh translations * Remove unused translation function calls --------- Co-authored-by: Mustafa Cavus <mustafacavus@intel.com> * initial gemma4 support * removed hardcoded names for kv cache slicing * OpenVINO backend: Add new attention pattern for llm parameters compute * flash attn Q shape static conversion * Remove slice in permute translation when n_seq is 1 * return optional in extract_layer_from_name * OpenVINO backend: refactor VIEW related operation (#148) * OpenVINO backend: refactor VIEW related operation * Enable VIEW handling in following ops * OpenVINO backend does not support GGML_OP_NORM & 
GGML_OP_L2_NORM with VIEW input accuracy issue from OpenVINO * OpenVINO backend: Add ops l2_norm & pad * OpenVINO backend does not support CPY with non-contiguous data or mismatched types * add op SSM_CONV GATED_DELTA_NET * OpenVINO backend: fix error for bf16 in OV gpu plugin * reverted static Q input shape for attention layer * OpenVINO backend: remove hardcode name inp_tokens, which ignore some leaf case * Disable remote tensor due to bug in ov gpu * Disable n_token > 1 GATED_DELTA_NET on gpu * OpenVINO backend: fix the view op dynamic handling issue in gemma4 & enable view + get_row * OpenVINO backend: clean code * OpenVINO backend: enable view + norm/rms_norm * OpenVINO backend: concat op * OpenVINO backend: argsort op * OpenVINO backend: enable unary + view & GGML_UNARY_OP_SOFTPLUS * Fix issue for test-backend-ops in TOPK_MOE, which compare VIEW ops result, VIEW node in OpenVINO no need compare, the whole graph result is correct * OpenVINO backend: enable sum_rows * OpenVINO backend: enable clamp * OpenVINO backend: enable DIV * OpenVINO backend: enable GGML_OP_MUL_MAT_ID * OpenVINO backend: disable MUL_MAT_ID_FUSION case with large mem needed * OpenVINO backend: Disable GGML_OP_ARGSORT, cause test_backend-ops failed * OpenVINO backend: fix issue in mul_mat_id * OpenVINO backend: Disable DIV with broadcast on GPU * OpenVINO backend: update DIV * use ov internal op GatedDeltaNet * OpenVINO backend: enable llama erch test qwen3next * OpenVINO backend: enable RMS_NORM + VIEW & remove op_case 2 for rope * OpenVINO backend: fix error * suggested changes, need review * suggested changes, need review * OpenVINO backend: clean unused code & fix build warning * OpenVINO backend: enable minicpm3 for arch test * Disable GDN op (#177) * disable gated_delta_net * update stateful_kv_size correctly in mismatch case * OpenVINO backend: enable arch test for qwen3vl * OpenVINO backend: enable cohere2 for arch test * OpenVINO backend: enable t5 for arch test * OpenVINO backend: 
enable jamba for arch test * OpenVINO backend: remove warning for tmp * OpenVINO backend: enable kimi-linear for arch test * Remove unused * Fix gpt-oss accuracy issue * OpenVINO backend: enable arctic for arch test * OpenVINO backend: enable grok for arch test * Gemma4 initial npu support (#179) * Initiall gemma4 npu support * temp. fix for gemma4 accuracy bug on npu * Remove hardcoded names for npu-fold handling * revert static n tokens for cont translation as it is not needed * removed unused variable * ggml-openvino: add GGML_OPENVINO_ENABLE_CACHE env var to control decoder cache. Add environment variable GGML_OPENVINO_ENABLE_CACHE (default: YES). When set to NO, the decoder_cache is bypassed and models are rebuilt from the cgraph on every inference call in both dynamic and static compute paths. This is useful for debugging and verifying correctness without caching interference. * Revert "Gemma4 initial npu support (#179)" This reverts commit 0d29a9c4a52dc2c8aa52990f1a3854cfb01768ad. * OpenVINO backend: disable debug log print * Update TBB discovery. Delegated to OpenVINOs own config. * OpenVINO backend: GGML_OPENVINO_ENABLE_CACHE YES -> 1 * OpenVINO backend: fallback FLASH_ATTN_EXT in gemma3n to CPU backend * Add raw ov infer profiling metric * Add OV raw infer time metric to static compute path Co-authored-by: virajwad <84867530+virajwad@users.noreply.github.com> * Modify precision of static profiling * update to OV 2026.2, add OV windows CI * fix editorconfig-checks * Initiall gemma4 npu support * temp. 
fix for gemma4 accuracy bug on npu * Remove hardcoded names for npu-fold handling * revert static n tokens for cont translation as it is not needed * removed unused variable * test-llama-archs fix * Fix gemma4 flash_attn fallback * support im2col * fix code style * disable add_rope_sin_cos optimization * stateless boradcast and rope optimizations * Enable manual gqa attn by default for stateless gpu * manual gqa: fixed static batch * gemma4 llama-bench ctx update fix * Update OV win CI * stateful rope fusion temp. fix * OpenVINO backend: Conslolidate supported ops * Exclude unsupported GGML_OP_SUB cases * Exclude unsupported TOPK_MOE cases * OpenVINO Backend: MUL_MAT enhancements * Update OV CI * support f16 mask input for npu * Make GGML_OPENVINO_* env vars usage uniform Standardize all GGML_OPENVINO_* env flags: positive integers >0 to enable. Unset, empty, =0, or non-numeric values to disable. This fixes cases where text values or empty strings enabled features. * OpenVINO backend: Enhance envvar handling * more cleanup * move ggml_openvino_env_flag to appropriate place * OpenVINO backend: add REPEAT translator, Q5_1 weights, and GLU view-input fix * ggml-openvino: fix -Werror=cast-qual in extract_q5_1_data * Update openvino.Dockerfile Use BuildKit cache mounts for faster Docker rebuilds. Use apt instead of dpkg, remove unused .ddeb downloads, add DLLAMA_BUILD_TESTS=OFF. * ggml-openvino: centralize env var access via getenv_str/getenv_int helpers Replace getenv and legacy flags with _str and _int helpers.Minor cleanup, doc updates. 
OpenVINO backend: Enable GGML_OP_ADD_ID * Uptade openvino backend clamg-format * clang-format * Update OPENVINO.md (#211) * OpenVINO backend: fix accuracy issue for op CONCAT with i64 precision * Remove strict concurrency for gpu-openvino-low-perf * Update openvino CI keynames; add ccache-clear * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com> * Fix formatting --------- Co-authored-by: Xuejun Zhai <Xuejun.Zhai@intel.com> Co-authored-by: Mustafa Cavus <mustafa.cavus@intel.com> Co-authored-by: Mustafa Cavus <mustafacavus@intel.com> Co-authored-by: Xuejun <XuejunZhai@intel.com> Co-authored-by: Wang Yang <yang4.wang@intel.com> Co-authored-by: Ravi Panchumarthy <ravi.panchumarthy@intel.com> Co-authored-by: virajwad <84867530+virajwad@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Mostafa Faheem <mostafaaafaheem@gmail.com> Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>	2026-06-17 09:11:21 +03:00
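The uniform GGML_OPENVINO_* flag rule stated in the commit above (a positive integer enables; unset, empty, 0, or non-numeric values disable) could be expressed roughly as follows. This is a sketch only: `env_flag_enabled` and `env_flag` are hypothetical names, not the backend's helpers, and treating a value with trailing non-digit characters (such as "1x") as non-numeric is an assumption.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: decide whether a raw environment value enables a flag.
// Enabled only for an integer > 0; unset (nullptr), empty, "0", negative, or
// non-numeric text all disable the flag.
inline bool env_flag_enabled(const char * value) {
    if (value == nullptr || *value == '\0') {
        return false;  // unset or empty
    }
    char * end = nullptr;
    const long parsed = std::strtol(value, &end, 10);
    if (end == value || *end != '\0') {
        return false;  // no digits consumed, or trailing junk (assumption)
    }
    return parsed > 0;
}

// Convenience wrapper that reads the variable from the process environment.
inline bool env_flag(const char * name) {
    return env_flag_enabled(std::getenv(name));
}
```

With this rule a text value such as "YES" no longer enables a feature, which matches the behavior change the commit message describes.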
Neo Zhang	58728bdbf0	sycl : Enable to support fp16 by OPs: SQR, SQRT, LOG, SIN, COS, CLAMP (#24692 )	2026-06-17 08:58:03 +03:00
Francois Dugast	9b260fc9ef	sycl: Add optional USM system allocations (#22526 ) This introduces an optional feature to allocate large GPU buffers (≥ 1GB) using USM system allocations if supported by the device. It allows using buffers from the system allocator then letting the system manage memory migrations between host and device as necessary. This feature is disabled by default and requires the GGML_SYCL_USM_SYSTEM environment variable to enable. If USM system allocations are not supported by the device or the system, we fallback to regular allocations. This feature can allow VRAM overcommit. For example, the test below fails on B580 due to lack of memory for allocation, but it passes when enabling USM system allocations: ./examples/sycl/test.sh -m Qwen3.5-27B-Q3_K_M.gguf -lv 4 Signed-off-by: Francois Dugast <francois.dugast@intel.com>	2026-06-17 08:54:21 +03:00
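The selection rule in the commit above (opt-in through GGML_SYCL_USM_SYSTEM, large buffers only, fallback when unsupported) can be summarized as a small decision function. This is a sketch under assumptions: `alloc_kind`, `choose_alloc`, and `kUsmSystemMinBytes` are hypothetical names, the "1GB" threshold is taken to mean 1 GiB, and how the real backend parses the environment variable or queries device support is not shown.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical labels for the two allocation strategies.
enum class alloc_kind { device, usm_system };

// Assumption: the "1GB" in the commit message is interpreted as 1 GiB.
constexpr std::size_t kUsmSystemMinBytes = static_cast<std::size_t>(1) << 30;

// USM system allocation is chosen only when the user opted in, the device or
// system supports it, and the buffer is large enough; otherwise fall back to
// a regular device allocation.
inline alloc_kind choose_alloc(std::size_t bytes, bool env_enabled, bool supported) {
    if (env_enabled && supported && bytes >= kUsmSystemMinBytes) {
        return alloc_kind::usm_system;
    }
    return alloc_kind::device;
}
```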
Neo Zhang	fdd109883d	[SYCL] Support OP EXPM1, support all UT cases of FLOOR, TRUNC, ROUND (#24363 ) * support OP EXPM1, support all UT cases of FLOOR, TRUNC, ROUND * fix conflict * rebase, support new UT case of repeat, concat	2026-06-16 08:34:29 +03:00
Todd Malsbary	4196b477da	sycl : Make GGML_SYCL_F16=ON the default (#23996) * Add -cl-fp32-correctly-rounded-divide-sqrt to F16=ON builds Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> * Make GGML_SYCL_F16=ON the default Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> * Leave F32 the default F16 remains explicitly set for example and Dockerfile builds. Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> * Revert changes to examples/sycl/build scripts Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> --------- Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>	2026-06-16 08:34:02 +03:00
Julien Jerphanion	a1eb756c0b	docs: Add instructions to install `llama.cpp` from conda-forge (#22219 ) * docs: Add instructions to install `llama.cpp` from conda-forge Signed-off-by: Julien Jerphanion <git@jjerphan.xyz> * Rewording of instructions Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Julien Jerphanion <git@jjerphan.xyz> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-06-15 17:12:25 +02:00
Neo Zhang	8872ab5467	sycl : enhance set_rows to support q1_0, mxfp4, nvfp4 (#24564 )	2026-06-15 10:01:40 +03:00
Neo Zhang	987fbd821d	[SYCL] add to support pool_1d, move pool_1d/2d code to pool.cpp/hpp (#24584 ) * add to support pool_1d, move pool_1d/2d code to pool.cpp/hpp * update ops.md	2026-06-15 10:01:07 +03:00
Mohammad Athar	8edaca9034	docs : fix typos in CUDA-FEDORA.md and grammars/README.md (#24459 )	2026-06-15 01:33:38 +08:00
Jeff Bolz	1a7718b4c5	vulkan: support non-contig unary/glu ops (#24215 ) * vulkan: support non-contig unary/glu ops Change unary/glu ops to pass in all strides and use fastdiv for the index calculation. Put all unary ops in one file, similar to glu, to share the code. codex went ahead and added expm1 without me asking, but I had to make it do a real precision analysis rather than just making stuff up. unary.comp initially couldn't use generic_unary_head because there wasn't space for xielu's additional constants. Fixing this required packing the fastdiv 'L' values. * attempt to workaround compiler bug * resolve conflict from #23991 * use expm1	2026-06-13 08:44:15 -05:00
Todd Malsbary	f478f1b6d7	sycl : Improve SYCL doc (#23025) * Tidy up SYCL doc a bit - Add explicit links to referenced items - Fix spelling errors Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> * Correct documented default for GGML_SYCL_GRAPH The default is ON, not OFF: $ cmake -LAH -B build | grep GGML_SYCL_GRAPH ... GGML_SYCL_GRAPH:BOOL=ON Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> * Move docker instructions from SYCL.md to docker.md This makes them directly accessible from the Quick Start section of the top-level README.md. Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> * Refer to intel.Dockerfile for ARGs and their defaults The defaults are always changing; this avoids accuracy errors from duplicating the information. Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> * Remove mention of Nvidia in SYCL row of backend table This support was removed in 2026.02 - refer to the SYCL.md News. Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> --------- Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>	2026-06-04 08:02:54 +03:00
Mikhail Podvitskii	d5ab0834ab	docs : update HOWTO-add-model.md (#23883 ) * docs: update HOWTO-add-model.md with new model registration and graph-building instructions * docs: improve formatting in HOWTO-add-model.md * Update docs/development/HOWTO-add-model.md Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-06-02 11:40:22 +02:00
Neo Zhang	a51142497a	[SYCL] Support Q4_1, Q5_0, Q5_1 in Flash-attention (#23812 ) * support Q4_1, Q5_0, Q5_1 * update ut case	2026-06-01 09:53:53 +03:00
Neo Zhang	4162522688	[SYCL] Add more types in GET_ROWS OP (#23710 ) * add to support Q1_0, NVFP4, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ1_S, IQ1_M, IQ3_S, IQ4_NL, IQ4_XS, I32, MXFP4, Q2_K, Q3_K, Q5_K, and Q6_K in GET_ROWS OP * correct the link	2026-06-01 09:53:04 +03:00
Vladislav	e6123e2080	docs : update ZenDNN docs for Q8 support (#23791 ) * docs zendnn added information about Q8 support * docs zendnn rm unnecessary data * docs update, links to ZenDNN docs provided * docs zenDNN update: clarified explanation * docs zenDNN update: one more explanation clarified --------- Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>	2026-05-31 10:26:42 +02:00
Ruixiang Wang	689a9a470e	server-bench : add speed-bench for speculative decoding benchmarking (#23869 ) * spec: add speed-bench support for benchmarking * speed-bench : add trailing newline to requirements.txt * speed-bench : bump datasets to 4.8.0 to fix ty check * server-bench : remove now-unused type: ignore after datasets bump	2026-05-29 23:09:47 +02:00
Georgi Gerganov	6b4e4bd582	common : fix env names to all have LLAMA_ARG_ prefix (#23778 )	2026-05-27 14:52:47 +03:00
quyentonndbs	1d971bba36	docs : fix duplicated "the" in granitevision and model-conversion docs (#23767 ) Co-authored-by: Kai Tanaka <275430420+quyentonndbs@users.noreply.github.com>	2026-05-27 09:34:06 +02:00
Alexey Kopytko	581d020b12	SYCL: implement ggml_sycl_pool_vmm (#22862 ) * SYCL: implement ggml_sycl_pool_vmm * Add an option to bypass VMM with GGML_SYCL_DISABLE_VMM * Clean up debugging logging * document GGML_SYCL_DISABLE_VMM * Multi-stream MoE optimization * Revert "Multi-stream MoE optimization" This reverts commit `938929c3f1`. * Update common.hpp Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * Flip GGML_SYCL_DISABLE_VMM to GGML_SYCL_ENABLE_VMM * add logging for GGML_SYCL_ENABLE_VMM when extension is not available (SYCL_EXT_ONEAPI_VIRTUAL_MEM macro) * Apply suggestions from code review Co-authored-by: Alexey Kopytko <alexey@kopytko.com> * Apply suggestion from @sanmai * Apply suggestion from @sanmai --------- Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>	2026-05-26 07:59:00 +03:00
Max Krasnyansky	4bead4e30d	snapdragon: bump toolchain docker to v0.7 to fix ui build issues (#23680 )	2026-05-25 10:57:43 -07:00
alex-spacemit	5fdf07e33b	ci : update spacemit toolchain url and enhance curl command (#23642 ) * fix(action): update SpacemiT toolchain URL and version Change-Id: If4cc1c738a855274103f8c3ad52daa33528acd0c * fix(action): add -L flag to curl command for URL redirection Change-Id: I9b6c37390f0c7a733a36308c8fb53d22d234ab06	2026-05-25 10:43:24 +02:00
Aparna M P	cec51c7a7d	snapdragon: update windows toolchain to use hsdk v6.6.0.0 (#23552 )	2026-05-23 19:56:41 -07:00
Jesus Talavera	95feeab52e	docs: Update documentation with Granite 4.0/4.1 (#23404 )	2026-05-22 20:35:46 +08:00
Reese Levine	ee7c30578a	Update WebGPU support and add link to blog/demo (#23483 )	2026-05-21 11:00:27 -07:00
Ruixiang Wang	2fc8d1851e	doc: fix spec mtp typo (#23435 )	2026-05-21 09:30:55 +03:00
Max Krasnyansky	871b0b70f8	snapdragon: update toolchain to v0.6 (#23369 ) * snapdragon: update compiler flags to enable all CPU features * snapdragon: update readme to point to toolchain v0.6 * snapdragon: bump toolchain docker to v0.6	2026-05-19 22:04:04 -07:00
Georgi Gerganov	d14ce3dab4	llama : MTP clean-up (#23269 ) * llama : disable equal splits for recurrent memory with partial rollback * spec : re-enable p-min with MTP drafts * spec : re-enable ngram spec in combination with RS rollback * spec : fix ngram-map-* params * spec : fix acceptance logic in combined ngram + draft configs * graph : fix reuse for combined `token` + `embd` batches * spec : log parameters for each speculative implementation - add LOG_INF in each constructor with implementation type and parameters - extract device string logic into common_speculative_get_devices_str() - move 'adding speculative implementation' log from init into constructors Assisted-by: llama.cpp:local pi * spec : extend --spec-default with ngram-map-k4v Assisted-by: llama.cpp:local pi * minor : fix n_embd log * args : update draft.n_max == 3 + regen docs * spec : relax ngram-mod rejection thold to 0.25 @ 5 low * logs : improve * docs : update speculative decoding CLI argument documentation - Add missing draft model CPU scheduling and tensor override parameters - Update --spec-type to include all available types (excluding draft-eagle3 WIP) - Fix default values to match implementation (n_max=3, n_min=0, p_min=0.0) - Remove deprecated options (spec-draft-ctx-size, spec-draft-replace) - Add environment variables for new parameters Assisted-by: llama.cpp:local pi * arg : step-back on adding k4v to the default spec config * cont : fix name	2026-05-19 15:32:58 +03:00
Neo Zhang	aabee047d8	[SYCL] add chapter for performance reference in SYCL.md (#23315) * add chapter for performance reference * rm unsupported GPU	2026-05-19 09:44:51 +03:00
alex-spacemit	81b0d882ae	ggml-cpu: Add IME2 Instruction Support for the SpacemiT Backend (#22863 )	2026-05-14 17:39:30 +08:00
Katostrofik	9ed6e19b9d	SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations (#21597 ) * SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation in the SYCL backend. sycl::malloc_device triggers the xe kernel driver's DMA-buf/TTM path which mirrors every VRAM allocation 1:1 in system RAM. zeMemAllocDevice uses the SVM/P2P path with no host staging. On a dual Intel Arc Pro B70 system (64GB VRAM, 64GB RAM), a 15.6 GiB model consumed 60 GiB of system RAM via sycl::malloc_device, causing OOM crashes. With zeMemAllocDevice, the same workload uses ~6.7 GiB of system RAM with no performance regression. All Level Zero calls include automatic fallback to the original SYCL allocation path if Level Zero interop is unavailable. * SYCL: address review feedback - remove try/catch, check device types, deduplicate - Remove try/catch from malloc/free/memcpy helpers, check backend and device type upfront instead (ggml_sycl_is_level_zero, ggml_sycl_is_dgpu) - Move shared helpers (is_level_zero, is_dgpu, free_device) to common.cpp and declare in common.hpp to eliminate code duplication - Use SYCL_CHECK(CHECK_TRY_ERROR()) for fallback sycl::free calls - Guard dev2dev_memcpy L0 path to dGPU-to-dGPU only, preserving the host-staged path for iGPU-to-dGPU transfers - Add Windows Level Zero SDK path detection (LEVEL_ZERO_V1_SDK_PATH) in CMakeLists.txt (co-authored with @arthw) * SYCL: add build/runtime flags for Level Zero, address review feedback Implements the architecture suggested by @arthw: compile-time and runtime flags to cleanly separate Level Zero and SYCL memory API paths. - Add GGML_SYCL_SUPPORT_LEVEL_ZERO cmake option (default ON). All Level Zero code is wrapped in #ifdef so the build works on systems without the Level Zero SDK installed (e.g. CPU-only CI servers). Both the loader library and headers are checked before enabling. 
- Add GGML_SYCL_ENABLE_LEVEL_ZERO runtime env var (default 1). Controls whether Level Zero or SYCL memory APIs are used. Only one API style is used per session, no mixing. If Level Zero is enabled but the devices don't support the Level Zero backend, it auto-disables with a warning. - Remove Level Zero code from dpct_malloc. It was unused (dpct::device_memory is not called anywhere in the backend) and used try/catch for flow control. - Update SYCL.md with documentation for both new parameters. Tested on Intel Arc Pro B70 (32GB), single-GPU and dual-GPU, with both GGML_SYCL_SUPPORT_LEVEL_ZERO=ON and OFF builds. AI-assisted development (Claude). Code reviewed and tested on my hardware. * SYCL: unify Level Zero malloc/free call sites, address review feedback Move ggml_sycl_malloc_device to common.cpp alongside ggml_sycl_free_device. Both functions are now unconditionally available — Level Zero code is #ifdef'd inside the functions, not at call sites. All call sites use uniform SYCL_CHECK(CHECK_TRY_ERROR()) wrapping with no #ifdef blocks. Addresses arthw's review: wrap all malloc/free in SYCL_CHECK for stack traces on failure, eliminate duplicated #ifdef/else patterns at 6 call sites (-29 lines net). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * SYCL: add Level Zero SDK to CI, fix device check and missed alloc paths Add Level Zero SDK installation to Ubuntu and Windows SYCL CI jobs so the Level Zero code path is compiled and tested in CI. Fix two bugs found during extended dual-GPU testing (no ONEAPI_DEVICE_SELECTOR set): - The Level Zero backend check was iterating all SYCL devices including CPU. The OpenCL CPU device caused Level Zero to be disabled for the GPUs, defeating the fix on multi-GPU systems. Added is_gpu() filter so only GPU devices are checked. - sycl_ext_malloc_device/sycl_ext_free (tensor reorder temp buffers) were still calling sycl::malloc/sycl::free directly, bypassing the Level Zero path. 
Routed through ggml_sycl_malloc_device/free_device for consistency with the other device memory call sites. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * SYCL: address arthw review feedback on Level Zero memory API structure - Move ggml_sycl_malloc_device to static function in ggml-sycl.cpp; only ggml_sycl_free_device (used by common.cpp) stays in common.cpp - Switch both helpers to use g_ggml_sycl_enable_level_zero global instead of per-call queue backend checks - Remove #ifdef wrapper from global definition; always declare at 0, add #else branch in init block so it stays 0 when L0 not compiled in - Update init loop comment to explain GPU-only device check - CMakeLists: message(STATUS) before the if block; align option wording AI-assisted implementation. Reviewed and tested on dual Intel Arc Pro B70 (32 GB each): test-backend-ops OK on both GPUs, single/dual-GPU Q4_K_M and Q8_0 bench correct, zeMemAllocDevice GTT delta confirmed <5 MiB per 4 GiB allocation (vs ~4 GiB shadow with sycl::malloc_device). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * SYCL: remove unused cstdio/cstdlib includes from common.cpp Leftover from the deleted ggml_sycl_queue_supports_level_zero helper. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * Apply suggestions from code review Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * SYCL: preserve Level Zero allocation path during early malloc * ci: fix Level Zero package conflict in Intel Docker build * ci: find Level Zero loader in oneAPI package step * ci: allow Windows SYCL package without Level Zero DLL --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>	2026-05-14 13:39:14 +08:00
Ravi Panchumarthy	7e16646015	docs : Update OPENVINO.md (#22959) Updated OPENVINO.md with Validated models and quantizations Co-authored-by: Haarika Madaka <haarika.madaka@intel.com>	2026-05-13 17:12:15 +03:00
Masashi Yoshimura	927dada6c9	ggml-webgpu: Enables running gpt-oss-20b (#22906) * Enable to run gpt-oss-20b and refactor mulmat-q * disable test-backend-ops in ubuntu-24-webgpu	2026-05-12 07:27:40 -07:00
Neo Zhang	7d442abf5c	[SYCL] Add OP im2col_3d (#22903) * add im2col_3d * format code * update the ops.md	2026-05-11 08:01:47 +03:00
Alexey Kopytko	e20b83930c	SYCL: reduce allocation overhead during flash attention (#22732) * SYCL: reduce allocation overhead during flash attention * tidy up whitespace * add a note about the flag * move ggml_sycl_fattn_* into fattn-buffers.hpp * refactor implementation into fattn-buffers.cpp * move new_fattn_kv_buffers back into ggml-sycl.cpp	2026-05-09 09:30:39 +03:00
Intel AI Get-to Market Customer Success and Solutions	ad09224658	sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET (#22149 ) * sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET Signed-off-by: Chun Tao <chun.tao@intel.com> * Fix abort during test-backend-ops Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> * Regenerate ops.md Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> * Add scope_dbg_print to newly added SYCL ops. Also add scope_dbg_print to existing ssm_conv op. Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> --------- Signed-off-by: Chun Tao <chun.tao@intel.com> Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Todd Malsbary <todd.malsbary@intel.com>	2026-05-07 18:51:33 +03:00
Gaurav Garg	b9afc19cb4	Write a readme on Multi-GPU usage in llama.cpp (#22729 ) * Write a readme on Multi-GPU usage in llama.cpp * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Address review comments * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-07 17:48:40 +02:00
tc-mb	2496f9c149	mtmd : support MiniCPM-V 4.6 (#22529 ) * Support MiniCPM-V 4.6 in new branch Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix code bug Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix pre-commit Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix convert Signed-off-by: tc-mb <tianchi_cai@icloud.com> * rename clip_graph_minicpmv4_6 Signed-off-by: tc-mb <tianchi_cai@icloud.com> * use new TYPE_MINICPMV4_6 Signed-off-by: tc-mb <tianchi_cai@icloud.com> * use build_attn to allow flash attention support Signed-off-by: tc-mb <tianchi_cai@icloud.com> * no use legacy code, restored here. Signed-off-by: tc-mb <tianchi_cai@icloud.com> * use the existing tensors name Signed-off-by: tc-mb <tianchi_cai@icloud.com> * unused ctx->model.hparams.minicpmv_version Signed-off-by: tc-mb <tianchi_cai@icloud.com> * use n_merge for slice alignment Signed-off-by: tc-mb <tianchi_cai@icloud.com> * borrow wa_layer_indexes for vit_merger insertion point Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix code style Signed-off-by: tc-mb <tianchi_cai@icloud.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * use filter_tensors and add model.vision_tower Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix chkhsh Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix type check Signed-off-by: tc-mb <tianchi_cai@icloud.com> --------- Signed-off-by: tc-mb <tianchi_cai@icloud.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-06 21:54:09 +02:00
Georgi Gerganov	846262d787	docs : update speculative decoding parameters after refactor (#22397 ) (#22539 ) * docs : update speculative decoding parameters after refactor (#22397) Update docs/speculative.md to reflect the new parameter naming scheme introduced in PR #22397: - Replace --draft-max/--draft-min with --spec-draft-n-max/--spec-draft-n-min - Replace --spec-ngram-size-n/m with per-implementation variants - Add documentation for all new --spec-ngram-- parameters - Update all example commands Assisted-by: llama.cpp:local pi pi : add rule to use gh CLI for GitHub resources Assisted-by: llama.cpp:local pi * docs : run llama-gen-docs * arg : fix typo	2026-05-04 08:52:07 +03:00
Neo Zhang	eddd7a13a5	[SYCL] Optimize Q4_0 mul_mat for Arc770, add scripts (#22291) * opt arc770 for Q4_0 * add for Q4_0 * update the script * add help script for windows * update guide * fix format issue * convert from dos to unix for format issue * fix missed -sm parameter	2026-04-25 09:20:14 +03:00
Reese Levine	dd2914dc81	ggml-webgpu: support for SSM_SCAN and disable set_rows error checking (#22327) * Implement ssm_scan * Remove blocking in graph_compute and check for set rows * Fix bindings * Update op support	2026-04-25 09:18:15 +03:00


332 Commits