llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-06-26 06:10:19 +00:00

Author	SHA1	Message	Date
leonardHONG	f818065d75	CUDA: batch out_prod broadcast (dps2>1) path with cublasSgemmBatched (#24426 )	2026-06-26 08:51:25 +03:00
shaofeiqi	5c7c22c3e1	opencl: flush profiling batch at shutdown for incomplete batches (#25016 )	2026-06-25 18:48:24 -07:00
Oliver Simons	1ec44d178d	CUDA: Various fixes to `cpy.cu` (#25000 ) * Add failing test-case to test-backend-ops Extracted from https://github.com/ggml-org/llama.cpp/issues/24072 * Minimize repro with help of AI N = 8 * (65535 - 1) + 1 = 524273 * Port and adjust workaround from https://github.com/LostRuins/koboldcpp/commit/0ba798341e0c70517cb226cb63c966b086a3b5b3 Fall-back should share code, also relax y-z constraint to be inclusive * Add test-case + fallback also for y dim * Fix x-guards which is 2^{31}-1, so inclusive of INT_MAX * Fix overflow problems for transposed copy kernel	2026-06-25 17:29:23 +02:00
fairydreaming	f728adab68	ggml : address integer overflows in binary ops CUDA implementation (#24706 ) * ggml : address integer overflows in binary ops CUDA implementation * ggml : add size_t casts to avoid integer overflows * ggml : add more asserts checking integer overflows in binary ops CUDA implementation --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2026-06-25 10:06:44 +02:00
David Spruill	e9fb3b3fc0	sycl : support --split-mode tensor (#24152 ) * Sycl tp stage1 (#1) * SYCL: tensor parallelism (--split-mode tensor) for dual-GPU Adds the comm_init/comm_free/comm_allreduce_tensor trio that the meta-backend queries via get_proc_address to enable backend-specific all-reduce, mirroring the pattern used by ggml-cuda.cu. For N=2 (the common dual-GPU case) implements a degenerate ring all-reduce with two size-branched paths: * Small (nelem < 32768): FP32 direct memcpy + per-device ADD kernel chained via depends_on(memcpy_event). 4 SYCL submissions/call. * Large (nelem >= 32768): BF16-compressed. Each device compresses FP32 -> BF16 in a local outbox, cross-device memcpys to the peer's inbox (HALF the PCIe bytes), then decompresses + adds into the local FP32 partial. 6 SYCL submissions/call but PCIe bytes halved -- wins for any tensor where PCIe dominates kernel time. Threshold and BF16 path pattern mirror the CUDA NCCL allreduce. Storage: ONE persistent uint8_t buffer per device, 4 * nelem bytes (matches both path layouts: FP32 nelem floats; BF16 outbox+inbox = 2 * nelem uint16_t each). Single alloc+free per device keeps the SYCL pool's strict-LIFO invariant trivial. Initial impl handles N=2 FP32 contiguous tensors. Other cases return false, causing the meta-backend to use its generic butterfly fallback. Per-call sync is intentionally omitted. SYCL in-order queue semantics ensure that the meta-backend's next compute on the same per-device queue waits for our final ADD, and the next allreduce's first op on the same persistent buffer waits via the same queue. Only comm_free does an explicit final wait. OneCCL is NOT used: OneCCL 2021.17 hardcodes single-device-per-process in communicator_impl.hpp:47 (condition devices.size() == 1), which is incompatible with llama.cpp's single-process multi-GPU model. 
Measured on dual Intel Arc Pro B70 (NEO 26.05.x, oneAPI 2025.3 + DPC++ nightly): Llama-3.3-70B Q4_K_M, -sm tensor -fa 1 -ctk f16 -ctv f16: pp512 = 377.08 t/s (vs 313.65 layer mode = +20.2%) tg128 = 17.40 t/s (vs 9.74 layer mode = +78.6%) Qwen3-Coder-Next-80B-A3B Q3_K_M (MoE): pp512 = 216.56 t/s (vs 156.58 meta-backend butterfly = +38.3%) tg128 = 17.60 t/s (vs 14.31 meta-backend butterfly = +23.0%) Qwen3-4B Q4_K_M: pp64 = 984.51 t/s, tg16 = 49.29 t/s Llama-3.3-70B in SYCL TP now comfortably beats production layer mode on both prefill and decode. Coder-Next-80B-A3B (MoE) also wins on both -- the BF16 path is what unlocks the many-medium-allreduces prefill pattern. Build/CMake: no changes. No new dependencies. ~210 lines added across ggml-sycl.h and ggml-sycl.cpp. * Fix comments * documentation update to address PR feedback * Bring over my device-to-device memcpy changes * move the dev2dev_memcpy calls to the upstream 7-parameter variety * Fix a typo and remove a trailing whitespace	2026-06-25 08:35:21 +03:00
Neo Zhang	9c10954865	sycl : fix the failed UT cases of conv_3d (#24900 )	2026-06-25 08:27:58 +03:00
lhez	fdb2c11c70	opencl: support non-contig rows in norm (#24965 )	2026-06-24 19:21:25 -07:00
Max Krasnyansky	8be759e6f7	hexagon: MUL_MAT and MUL_MAT_ID rework : 32x32 tiled weight repack, kernel-params, cached graphs (#24954 ) * hex-mm: new weight layout and fusion updates * hvx-mm: unroll the new tiled vec_dots to optimize hvx register util * hex-mm: optimize dyn.quant format for q8_0 and q8_1 to reduce overhead in vec_dots. * hvx-mm: parallel quantizer per block for large rows * hvx-mm: simplify and futher optimize dyn.quant and vec_dots * hvx-mm: keep intermediate per tile accumulators in fp16 * hmx-mm: optimize weight dequant by aligning the repacked tiles with the DMA * hmx-mm: remove qweight scratch and just use vtcm_weight * hmx-mm: remove all unused and obsolete code * hmx-mm: the new tiled repack format is here to stay -- rename all x4x2 to _tiled * hmx-mm: improve activation processing with dma prefetch * hex-mm: fix hmx/hvx fallback logic and MUL_MAT_ID allocation (unbreaks OLMoE) * hex-mm: align the weight tiles with dma just like we did in hmx-mm * hex-mm: factor out common mm bits into htp/matmul-ops.h * hex-mm: start moving mm kernel selection to the host * hex-mm: move all of the matmul param compute into the host * hmx-mm: restore pipelined mode * hmx-mm: unroll the dequant functions to optimize register usage * hmx-mm: further improve activation process * hex-mm: use vtcm_seq_alloc for all vtcm allocations and define more common functions * hex-mm: improve mm optimizer to acount for number of activation threads * hex-mm: fix matmul-id kernel params selection (unbreaks OLMoE and LFM) * hexagon: remove support for arch < v73 since HMX is now required for most use-cases * hex-mm: cleanup naming for consistency * hex-mm: make sure matmul fusion accounts for vtcm allocation * hex-mm: minor cleanup for kernel_params definition * hex-mm: replace hardcoded limits with proper checks for vtcm requirements * hex-mm: add support for non-tiled mm as a fallback option and factor out hvx kernels into separate header * hex-mm: remove unused functions * 
hex-mm: add shorthand for MM_SELECT in run-tool script * hvx-mm: factor out hvx/hmx microkernels and unify matmul entry and dispatch * hex-mm: further cleanup matmul fallback path * hex-mm: refactor matmul entry point and dispatch a bit further * hexagon: update cmake build to enable hmx for everything * hex-ops: optimize kernel_param updates and include summary in the logs * hex-mm: add support for GGML_HEXAGON_MM_SELECT * hex-mm: add hex-common header * hex-mm: pass correct number of tasks to workpool * hex-mm: add proper checks for no-work in dyn.quant tasks * hex-mm: convert all quantizers into a macro * hex-mm: fix hvx-flat fallback to pass all MUL_MAT tests * hex-mm: vectorize q8_1 quantizer * hex-mm: improve fused ffn mm stride handling * hex-mm: consistent use of n_threads and pipeline in kernel_params * hexagon: minor formatting * hex-mm: update MUL_MAT_ID kernel_param handling to make sure host/npu are in sync * hvx-mm: go back to accumulating in fp32 in tiled hvx kernels, more accurate and same perf * hvx-mm: unroll the loops and remove masking that is not needed for tiled accums * hmx-mm: optimize activation processing (slit loops, some unrolling, etc) * hmx-mm: minor optimization for output processing * hex-mm: consistent use of uint32_t and size_t in mm kernels * hex-mm: remove legacy restrictions for rows to be multiple of 256 * hexagon: replace sprintf with snprintf * hex-mm: relax hardcoded nrows checks and rely on VTCM size requirements * hexagon: minor alignment fix * hexagon: fix trailing spaces * hex-mm: relax padding from 256 to 128 (leftovers) * hex-mm: remove redundant checks for weight align to 128 we always use 2D dma for the weights and align them properly * hmx-mm: MUL_MAT_ID better work distribution between hvx threads and hmx tracing * hex-mm: specialize per-token mmid activation handling * hex-profile: update python scripts to handle kernel-params section in the logging output * hex-mm: move n_prefetch (aka dma_depth) into kernel 
params and remove unused fields * hex-trace: use easier to parse format, simply and fix post-proc scripts * hmx-mm: relax 32 row limit for output processing which helps utilization * hmx-mm: use start-chunk idx for tracing info * hmx-mm: parameterize activation dma pipeline * hexagon: add support for simple graph caching to avoid recomputing kernel-params * hex-mm: remove left-over repack functions * hex-mm: tighten n_prefetch asserts * hex-mm: remove duplicate round/align_up helper * hexagon: cleanup common header used in host/npu * hexagon: update early wakeup threshold * hmx-mm: define cost constants and update solver to assume that repacked ne[1] is padded to 32 * hmx-mm: make precompute_matmul a bit more readable (split into smaller functions, etc) * hex-mm: remove n_threads constraint * hex-mm: minor formatting updates * hex-mm: remove obsolete profiling logs * hex-mm: restore hardcode gate to refuse lm-head to avoid repacking that tensor	2026-06-24 12:14:25 -07:00
Wagner Bruna	51eae8cfca	vulkan: allow reducing the graph submission batches to avoid timeouts (#24872 )	2026-06-24 16:29:24 +02:00
liminfei-amd	1191758c5d	vulkan: fail the build when a shader fails to compile (#24450 ) * vulkan-shaders-gen: fail the build when a shader fails to compile vulkan-shaders-gen did not detect shader-compile subprocess failures, so a broken libggml-vulkan could be produced while the build reported success and the breakage only surfaced at run time. execute_command() discarded the child exit code (POSIX waitpid passed nullptr for status; the Windows branch never called GetExitCodeProcess) and string_to_spv decided success only from whether stderr was empty, so a non-zero exit with empty stderr, or a subprocess that failed to launch, was treated as success. Return the child exit code from execute_command() (WEXITSTATUS on POSIX, GetExitCodeProcess on Windows), treat a non-zero exit or non-empty stderr or a launch exception as a failure, and record it in an atomic flag. main() checks the flag after process_shaders() and returns EXIT_FAILURE before writing the output files, so the build stops instead of emitting a broken backend. Fixes #24393 Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com> * vulkan-shaders-gen: simplify compile_failed access and drop unreachable return Address review feedback on #24450: - Access the std::atomic<bool> compile_failed directly (= / implicit bool) instead of .store()/.load(); the flag stays atomic because the worker threads in process_shaders() set it concurrently. - Remove the unreachable trailing return -1 in execute_command(): on POSIX the child _exit()s after execvp and the parent returns (fork()<0 throws); on Windows the block returns the exit code. Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com> --------- Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com>	2026-06-24 11:42:03 +02:00
Jeff Bolz	ac4105d68b	vulkan: Apply bias before softmax in FA, to avoid overflow (#24909 )	2026-06-23 22:34:00 -05:00
Jeff Bolz	72a9269172	vulkan: support all backend tests for SQR/SQRT/SIN/COS/CLAMP/LEAKY_RELU/NORM (#24582 ) * vulkan: make SQR/SQRT/SIN/COS/CLAMP/LEAKY_RELU use unary.comp * vulkan: make NORM support noncontig * add noncontiguous row test cases for norm/l2_norm, handle this in the CPU backend and l2_norm.comp * fix supports_op for cuda and webgpu	2026-06-23 09:48:24 -05:00
Jeff Bolz	92e854ab83	vulkan: Support GET_ROWS_BACK (#24883 )	2026-06-23 15:39:37 +02:00
Jeff Bolz	c5606364b2	vulkan: support CONV_3D (#24612 ) * vulkan: support CONV_3D This is a pretty direct port of conv2d_mm.comp to CONV_3D, done by codex and cleaned up by me. * disable slower perf tests	2026-06-23 15:39:20 +02:00
Jeff Bolz	0eb874d374	vulkan: make mul_mm ALIGNED a spec constant (#24689 ) This trims down some of the shader variant explosion and reduces binary size.	2026-06-23 14:26:17 +02:00
Wyatt Caldwell	c926ad0985	vulkan: link ggml-cpu when GGML_VULKAN_CHECK_RESULTS / RUN_TESTS are enabled (#24444 ) The result-checking and test debug paths in ggml-vulkan.cpp call ggml_graph_compute_with_ctx() to compute a CPU reference graph, but that symbol is defined in ggml-cpu, which ggml-vulkan does not link. Enabling -DGGML_VULKAN_CHECK_RESULTS=ON (or -DGGML_VULKAN_RUN_TESTS=ON) therefore fails to link with an unresolved external (e.g. LNK2019 on MSVC, undefined reference on GCC/Clang). This regressed after ggml-cpu was split into its own library. Link ggml-cpu under those two options so the debug builds link again. Signed-off-by: Wyatt Caldwell <218154709+Detensable@users.noreply.github.com>	2026-06-23 12:55:46 +02:00
Masashi Yoshimura	7c908502ea	ggml-webgpu: improve MTP inference by using mat-vec path for small batches (#24811 ) * ggml-webgpu: improve small batches decoding * Add barrier to the NUM_COLS loop in mul-mat-vec	2026-06-23 17:13:55 +09:00
Shawn Gu	23ee8797e1	opencl: q8_0 gemv precision improvement (#24923 )	2026-06-22 22:25:21 -07:00
Neo Zhang	f8cc15f163	[SYCL] support bf16 on bin_bcast OP and unary OPs (#24838 ) * support bf16 on bin_bcast OP and unary OPs * support the older Intel compiler than 2026.0	2026-06-22 14:09:02 +03:00
Guanhuai Zhang	4a80943174	fix(hexagon): use padded stride for ssm-conv weights (#24470 )	2026-06-20 14:58:49 -07:00
Adrien Gallouët	37a77fb057	ggml : optimize AMX (#24806 ) Flatten the partition over n_batch * M so every thread participates in the quantization \| CPU \| Model \| Test \| t/s OLD \| t/s NEW \| Speedup \| \|:--------------------------------\|:------------------------------\|:-------\|----------:\|----------:\|----------:\| \| Intel(R) Xeon(R) Platinum 8488C \| qwen35 0.8B IQ4_NL - 4.5 bpw \| pp512 \| 730.71 \| 779.86 \| 1.07 \| \| Intel(R) Xeon(R) Platinum 8488C \| qwen35 0.8B IQ4_NL - 4.5 bpw \| tg128 \| 87.88 \| 86.79 \| 0.99 \| \| Intel(R) Xeon(R) Platinum 8488C \| qwen35 0.8B IQ4_XS - 4.25 bpw \| pp512 \| 725.09 \| 1023.31 \| 1.41 \| \| Intel(R) Xeon(R) Platinum 8488C \| qwen35 0.8B IQ4_XS - 4.25 bpw \| tg128 \| 83.64 \| 83.62 \| 1.00 \| \| Intel(R) Xeon(R) Platinum 8488C \| qwen35 0.8B Q4_0 \| pp512 \| 820.51 \| 924.05 \| 1.13 \| \| Intel(R) Xeon(R) Platinum 8488C \| qwen35 0.8B Q4_0 \| tg128 \| 90.59 \| 92.46 \| 1.02 \| \| Intel(R) Xeon(R) Platinum 8488C \| qwen35 0.8B Q4_1 \| pp512 \| 776.88 \| 872.79 \| 1.12 \| \| Intel(R) Xeon(R) Platinum 8488C \| qwen35 0.8B Q4_1 \| tg128 \| 89.39 \| 90.94 \| 1.02 \| \| Intel(R) Xeon(R) Platinum 8488C \| qwen35 0.8B Q4_K_M \| pp512 \| 719.28 \| 1009.27 \| 1.40 \| \| Intel(R) Xeon(R) Platinum 8488C \| qwen35 0.8B Q4_K_M \| tg128 \| 80.62 \| 80.86 \| 1.00 \| \| Intel(R) Xeon(R) Platinum 8488C \| qwen35 0.8B Q4_K_S \| pp512 \| 732.29 \| 1077.29 \| 1.47 \| \| Intel(R) Xeon(R) Platinum 8488C \| qwen35 0.8B Q4_K_S \| tg128 \| 86.42 \| 83.53 \| 0.97 \| Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-06-20 13:43:06 +03:00
Masashi Yoshimura	f449e05537	ggml-webgpu: add adapter toggles for F16 on Vulkan + NVIDIA	2026-06-20 08:12:32 +09:00
Xuan-Son Nguyen	e475fa2b5f	mtmd, arg: fix utf8 handling on windows (#24779 ) * mtmd, arg: fix utf8 handling on windows * also fix ggml_fopen * fix build fail * also fix CLI	2026-06-19 22:28:38 +02:00
Georgi Gerganov	1868af13ac	ggml : bump version to 0.15.2 (ggml/1548)	2026-06-19 10:19:14 +03:00
shalinib-ibm	8141e730f1	ggml-cpu: support K tails in power10 Q8/Q4 MMA matmul (#24753 ) * ggml-cpu: support K tails in Power10 MMA Q8/Q4 matmul This patch removes the requirement that K be divisible by kc in the tinyBlas_Q0_PPC tiled matmul path. Process the final K panel using its actual depth and pass the reduced panel size through packing and kernel execution. This allows more workloads to use the MMA kernel and reduces fallback to mnpack. * Apply suggestion from @taronaeo Co-authored-by: Aaron Teo <taronaeo@gmail.com> --------- Co-authored-by: Aaron Teo <taronaeo@gmail.com>	2026-06-19 08:55:38 +03:00
Pascal	3a3edc9ac6	Ggml/cuda col2im 1d (#24417 ) * cuda: add GGML_OP_COL2IM_1D, follow-up to the CPU op * cuda: col2im_1d use fast_div_modulo for the index decomposition * cuda: col2im_1d tighten supports_op, type match and contiguous dst	2026-06-18 22:23:01 +02:00
Max Krasnyansky	d2c67959b3	hexagon: support for op-trace (fine-grain tracing of HVX/HMX/DMA events) (#24592 ) * hex-optrace: add support for optrace and instrument matmul and flash-atten code * hex-trace: improve trace event and perfetto generator * hex-trace: add new script dedicated to handling traces, specifically perfetto traces * hex-trace: add --head/--tail options to profile and trace tools * hex-trace: fix whitespaces * hex-trace: fix flake8 warnings * hex-trace: fix flake8 warnings * hmx-fa: restore q_tiles clearing * hex-profile: remove circular dep in includes * hex-trace: simplify trace sizing check * hex-profile: sort events in the summary by name	2026-06-18 08:35:02 -07:00
Neo Zhang	9724f664e8	[SYCL] rename GGML_SYCL_SUPPORT_LEVEL_ZERO (#24719 ) * rename GGML_SYCL_SUPPORT_LEVEL_ZERO to GGML_SYCL_SUPPORT_LEVEL_ZERO_API, and GGML_SYCL_ENABLE_LEVEL_ZERO to GGML_SYCL_USE_LEVEL_ZERO_API * fix code format * fix error when rebase	2026-06-18 11:18:26 +03:00
Neo Zhang	dd69db2924	sycl : support MUL_MAT and OUT_PROD with Q1_0 (#24721 )	2026-06-18 11:17:37 +03:00
Neo Zhang	6f1034b32a	[SYCL] support OPs: conv_2d, conv_2d_dw, conv2d_transpose (#24600 ) * fix conflict * fix format issue, rename * rm debug code * correct the file name	2026-06-18 09:40:03 +03:00
Georgi Gerganov	cae0a3b0b0	metal : check for BF16 support in concat kernel (#24747 )	2026-06-18 09:16:06 +03:00
shalinib-ibm	2e88c49c90	ggml-cpu: Conditionally enable power11 backend based on compiler support (#24687 ) * ggml: Conditionally enable power11 backend based on compiler support Guard POWER11 backend creation behind a compiler flag check for -mcpu=power11. This avoids build failures on current GCC/Clang toolchains while preserving forward compatibility once POWER11 support becomes available. * Update CMakeLists.txt ggml-cpu: Use -mcpu=power10 for P10 and P11	2026-06-18 02:45:19 +08:00
Georgi Gerganov	0843245cb1	metal : implement rope_back operator (#24725 ) Reuse existing rope kernels with a function constant to toggle forward/backward rotation, avoiding duplicate kernel code. Assisted-by: pi:llama.cpp/Qwen3.6-27B	2026-06-17 20:36:05 +03:00
Georgi Gerganov	8d2e580632	metal : add f16 and bf16 support for concat operator (#24724 ) * metal : add f16 and bf16 support for concat operator Extend the Metal backend concat operator to support f16 and bf16 tensor types in addition to the existing f32 and i32 support. - Template kernel_concat on type T with specializations for float, half, bfloat, and int - Add type-specific pipeline getter ggml_metal_library_get_pipeline_concat() - Update device support check to allow f16 unconditionally and bf16 when device supports bfloat16 - Update dispatch to select the correct kernel specialization by type Assisted-by: pi:llama.cpp/Qwen3.6-27B * metal : extend concat operator to support f16, bf16, i8, i16 and i64 Assisted-by: pi:llama.cpp/Qwen3.6-27B	2026-06-17 19:38:55 +03:00
Neo Zhang	74a80dd9c0	[SYCL] add dev2dev memcpy by SYCL API (#24476 ) * add dev2dev memcpy by SYCL API * mv GGML_SYCL_DEV2DEV_MEMCPY to runtime table * update the detect method for p2p comm * fix the error created during fix conflict --------- Co-authored-by: Neo Zhang <NA>	2026-06-17 17:21:34 +03:00
Neo Zhang	d1759e4156	[SYCL] Add conv_3d (#24691 ) * add conv_3d * optimize * update ops.md * restore test script * rm unused code * rm copyright notes	2026-06-17 17:20:01 +03:00
Winston Ma	558e221b70	vulkan: record actual memory properties during buffer creation (#24326 )	2026-06-17 11:14:48 +02:00
Ruben Ortlam	ea21e03955	Revert "cuda: reset cuda context after reading memory size (#23935 )" (#24715 ) This reverts commit `0f7fada56b`.	2026-06-17 10:59:35 +02:00
kononnable	d5376cf5d7	ci: fix vulkan docker images (#24595 ) * Update vulkan-shaders-gen.cpp * Update vulkan-shaders-gen.cpp add comment describing code change intention * Update vulkan-shaders-gen.cpp fix potential UB	2026-06-17 09:43:45 +02:00
lhez	51571722aa	opencl: optimize mul_mat_f16_f32_l4 for decode (#24504 )	2026-06-16 23:21:26 -07:00
Zijun Yu	890f1a27ed	openvino: OV 2026.2, context-shift, Q5_1 support, gemma4 dense/embedding, and -fa off (#24503 ) * Add interface is_model_splitted() to check the c-graph is splited or not * Infer and propagate dynamic-dimension indices for all tensors in the GGML graph in api compute_model_outputs() * Only do this for fallback sub graph * Move dynamic dims compute in graph missmatch * ggml-openvino: fix tensor data handling for PERMUTE/VIEW ops in split models * ggml-openvino:add comments * ggml-openvino: override VIEW op_case to 0 for split model inputs * openvino backend: Handle unsupported VIEW shape-mismatch in OpenVINO backend * Enable additional mul_mat tests and add tensor data saving function (#81) * ggml-openvino: fix CONT/TRANSPOSE mapping and improve dynamic-dimension handling * OpenVINO: add NORM/TANH support and rework SOFT_MAX translation * ggml-openvino: extend VIEW handling * Enable -fa off (#118) * Enable --context-shift * Fix llm param compute error for normal softmax not the softmax in attention * OpenVINO backend: fix error for attention size compute in llm param * use tensor->extra in infer_request i/o * OpenVINO backend: refacter the compute_llm_params() func add get_attention_pattern_case to easy extand * OpenVINO backend: clean unused code * 1to1 match op update (#146) * added translate_1to1_match_1_input function and updated gelu and tanh translations * Remove unused translation function calls --------- Co-authored-by: Mustafa Cavus <mustafacavus@intel.com> * initial gemma4 support * removed hardcoded names for kv cache slicing * OpenVINO backend: Add new attention pattern for llm parameters compute * flash attn Q shape static conversion * Remove slice in permute translation when n_seq is 1 * return optional in extract_layer_from_name * OpenVINO backend: refactor VIEW related operation (#148) * OpenVINO backend: refactor VIEW related operation * Enable VIEW handling in following ops * OpenVINO backend does not support GGML_OP_NORM & 
GGML_OP_L2_NORM with VIEW input accuracy issue from OpenVINO * OpenVINO backend: Add ops l2_norm & pad * OpenVINO backend does not support CPY with non-contiguous data or mismatched types * add op SSM_CONV GATED_DELTA_NET * OpenVINO backend: fix error for bf16 in OV gpu plugin * reverted static Q input shape for attention layer * OpenVINO backend: remove hardcode name inp_tokens, which ignore some leaf case * Disable remote tensor due to bug in ov gpu * Disable n_token > 1 GATED_DELTA_NET on gpu * OpenVINO backend: fix the view op dynamic handling issue in gemma4 & enable view + get_row * OpenVINO backend: clean code * OpenVINO backend: enable view + norm/rms_norm * OpenVINO backend: concat op * OpenVINO backend: argsort op * OpenVINO backend: enable unary + view & GGML_UNARY_OP_SOFTPLUS * Fix issue for test-backend-ops in TOPK_MOE, which compare VIEW ops result, VIEW node in OpenVINO no need compare, the whole graph result is correct * OpenVINO backend: enable sum_rows * OpenVINO backend: enable clamp * OpenVINO backend: enable DIV * OpenVINO backend: enable GGML_OP_MUL_MAT_ID * OpenVINO backend: disable MUL_MAT_ID_FUSION case with large mem needed * OpenVINO backend: Disable GGML_OP_ARGSORT, cause test_backend-ops failed * OpenVINO backend: fix issue in mul_mat_id * OpenVINO backend: Disable DIV with broadcast on GPU * OpenVINO backend: update DIV * use ov internal op GatedDeltaNet * OpenVINO backend: enable llama erch test qwen3next * OpenVINO backend: enable RMS_NORM + VIEW & remove op_case 2 for rope * OpenVINO backend: fix error * suggested changes, need review * suggested changes, need review * OpenVINO backend: clean unused code & fix build warning * OpenVINO backend: enable minicpm3 for arch test * Disable GDN op (#177) * disable gated_delta_net * update stateful_kv_size correctly in mismatch case * OpenVINO backend: enable arch test for qwen3vl * OpenVINO backend: enable cohere2 for arch test * OpenVINO backend: enable t5 for arch test * OpenVINO backend: 
enable jamba for arch test * OpenVINO backend: remove warning for tmp * OpenVINO backend: enable kimi-linear for arch test * Remove unused * Fix gpt-oss accuracy issue * OpenVINO backend: enable arctic for arch test * OpenVINO backend: enable grok for arch test * Gemma4 initial npu support (#179) * Initiall gemma4 npu support * temp. fix for gemma4 accuracy bug on npu * Remove hardcoded names for npu-fold handling * revert static n tokens for cont translation as it is not needed * removed unused variable * ggml-openvino: add GGML_OPENVINO_ENABLE_CACHE env var to control decoder cache. Add environment variable GGML_OPENVINO_ENABLE_CACHE (default: YES). When set to NO, the decoder_cache is bypassed and models are rebuilt from the cgraph on every inference call in both dynamic and static compute paths. This is useful for debugging and verifying correctness without caching interference. * Revert "Gemma4 initial npu support (#179)" This reverts commit 0d29a9c4a52dc2c8aa52990f1a3854cfb01768ad. * OpenVINO backend: disable debug log print * Update TBB discovery. Delegated to OpenVINOs own config. * OpenVINO backend: GGML_OPENVINO_ENABLE_CACHE YES -> 1 * OpenVINO backend: fallback FLASH_ATTN_EXT in gemma3n to CPU backend * Add raw ov infer profiling metric * Add OV raw infer time metric to static compute path Co-authored-by: virajwad <84867530+virajwad@users.noreply.github.com> * Modify precision of static profiling * update to OV 2026.2, add OV windows CI * fix editorconfig-checks * Initiall gemma4 npu support * temp. 
fix for gemma4 accuracy bug on npu * Remove hardcoded names for npu-fold handling * revert static n tokens for cont translation as it is not needed * removed unused variable * test-llama-archs fix * Fix gemma4 flash_attn fallback * support im2col * fix code style * disable add_rope_sin_cos optimization * stateless boradcast and rope optimizations * Enable manual gqa attn by default for stateless gpu * manual gqa: fixed static batch * gemma4 llama-bench ctx update fix * Update OV win CI * stateful rope fusion temp. fix * OpenVINO backend: Conslolidate supported ops * Exclude unsupported GGML_OP_SUB cases * Exclude unsupported TOPK_MOE cases * OpenVINO Backend: MUL_MAT enhancements * Update OV CI * support f16 mask input for npu * Make GGML_OPENVINO_* env vars usage uniform Standardize all GGML_OPENVINO_* env flags: positive integers >0 to enable. Unset, empty, =0, or non-numeric values to disable. This fixes cases where text values or empty strings enabled features. * OpenVINO backend: Enhance envvar handling * more cleanup * move ggml_openvino_env_flag to appropriate place * OpenVINO backend: add REPEAT translator, Q5_1 weights, and GLU view-input fix * ggml-openvino: fix -Werror=cast-qual in extract_q5_1_data * Update openvino.Dockerfile Use BuildKit cache mounts for faster Docker rebuilds. Use apt instead of dpkg, remove unused .ddeb downloads, add DLLAMA_BUILD_TESTS=OFF. * ggml-openvino: centralize env var access via getenv_str/getenv_int helpers Replace getenv and legacy flags with _str and _int helpers.Minor cleanup, doc updates. 
OpenVINO backend: Enable GGML_OP_ADD_ID * Uptade openvino backend clamg-format * clang-format * Update OPENVINO.md (#211) * OpenVINO backend: fix accuracy issue for op CONCAT with i64 precision * Remove strict concurrency for gpu-openvino-low-perf * Update openvino CI keynames; add ccache-clear * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com> * Fix formatting --------- Co-authored-by: Xuejun Zhai <Xuejun.Zhai@intel.com> Co-authored-by: Mustafa Cavus <mustafa.cavus@intel.com> Co-authored-by: Mustafa Cavus <mustafacavus@intel.com> Co-authored-by: Xuejun <XuejunZhai@intel.com> Co-authored-by: Wang Yang <yang4.wang@intel.com> Co-authored-by: Ravi Panchumarthy <ravi.panchumarthy@intel.com> Co-authored-by: virajwad <84867530+virajwad@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Mostafa Faheem <mostafaaafaheem@gmail.com> Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>	2026-06-17 09:11:21 +03:00
Neo Zhang	58728bdbf0	sycl : Enable to support fp16 by OPs: SQR, SQRT, LOG, SIN, COS, CLAMP (#24692 )	2026-06-17 08:58:03 +03:00
Alexey Kopytko	ebbc1e51c1	SYCL: fix use-after-free bug with async memcpy in MoE prefill (#24676 ) * SYCL: fix a bug with async memcpy * make mmid_row_mapping_host persistent * comment on stream->wait * Apply suggestion from @sanmai * Apply suggestion from @sanmai * Apply suggestion from @sanmai	2026-06-17 08:57:29 +03:00
Francois Dugast	9b260fc9ef	sycl: Add optional USM system allocations (#22526 ) This introduces an optional feature to allocate large GPU buffers (≥ 1GB) using USM system allocations if supported by the device. It allows using buffers from the system allocator then letting the system manage memory migrations between host and device as necessary. This feature is disabled by default and requires the GGML_SYCL_USM_SYSTEM environment variable to enable. If USM system allocations are not supported by the device or the system, we fallback to regular allocations. This feature can allow VRAM overcommit. For example, the test below fails on B580 due to lack of memory for allocation, but it passes when enabling USM system allocations: ./examples/sycl/test.sh -m Qwen3.5-27B-Q3_K_M.gguf -lv 4 Signed-off-by: Francois Dugast <francois.dugast@intel.com>	2026-06-17 08:54:21 +03:00
Winston Ma	32120c10e3	vulkan: prefer host-visible memory buffers on UMA devices (#22930 ) * implement UMA host-visible memory * update based on 0cc4m's suggestion	2026-06-16 09:36:52 +02:00
Jeff Bolz	d5fb104293	vulkan: Support gated_delta_net with S_v=16 (#24581 )	2026-06-16 09:26:57 +02:00
Frosty40	ac79caa7ce	sycl: support reordered Q4_K/Q5_K/Q6_K MoE MUL_MAT_ID (#24452 ) * sycl: support reordered Q4_K and Q5_K MoE MUL_MAT_ID Extend reordered-weight handling to fused MoE MUL_MAT_ID for Q4_K and Q5_K expert tensors and add Q5_K reordered DMMV coverage. Unsupported 3D reorder cases now fall back instead of aborting. * sycl: extend MoE reorder to Q6_K mul_mat_id	2026-06-16 08:35:00 +03:00
Neo Zhang	fdd109883d	[SYCL] Support OP EXPM1, support all UT cases of FLOOR, TRUNC, ROUND (#24363 ) * support OP EXPM1, support all UT cases of FLOOR, TRUNC, ROUND * fix conflict * rebase, support new UT case of repeat, concat	2026-06-16 08:34:29 +03:00
Pascal	ad39ccaa19	vulkan: add col2im_1d op (#24425 ) * vulkan: add GGML_OP_COL2IM_1D, follow-up to the CPU op * vulkan: col2im_1d bounded gather loop instead of full-K scan with modulo * vulkan: col2im_1d address review from @jeffbolznv * vulkan: col2im_1d return nullptr for unsupported types, address review from @0cc4m	2026-06-16 06:34:43 +02:00
Jeff Bolz	9dbc6621ae	vulkan: support more CONCAT types (#24579 )	2026-06-15 13:19:21 +02:00
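
Commit 890f1a27ed above states the rule for the GGML_OPENVINO_* environment flags: a positive integer enables the feature, while unset, empty, =0, or non-numeric values disable it. The Python sketch below only illustrates that rule; it is not the backend's C++ code. The function name `env_flag` is hypothetical, and rejecting signed or partly numeric strings such as "+1" or "1abc" is an assumption that the commit message does not spell out.

```python
import os


def env_flag(value):
    """Return True only when value is a plain positive integer string.

    Hypothetical helper mirroring the rule in the commit message:
    unset (None), empty, "0", and non-numeric text all disable the flag.
    Assumption: signed or mixed strings ("+1", "1abc") count as non-numeric.
    """
    if value is None:
        return False
    text = value.strip()
    # Require ASCII digits only, so "YES", "", and "1abc" are rejected.
    if not (text.isascii() and text.isdigit()):
        return False
    return int(text) > 0


# Example: read one of the flags named in the commit message.
cache_enabled = env_flag(os.environ.get("GGML_OPENVINO_ENABLE_CACHE"))
```

Taking the raw string rather than the variable name keeps the rule easy to check without touching the process environment.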


2655 Commits
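
Commit 1191758c5d in the log above describes the failure rule adopted for vulkan-shaders-gen: a shader compile counts as failed on a non-zero exit code, on non-empty stderr, or when the subprocess cannot be launched. The sketch below restates that rule in Python with the standard `subprocess` module. It is not the tool's C++ implementation, and the function name `compile_ok` and the placeholder commands are assumptions made for illustration.

```python
import subprocess
import sys


def compile_ok(cmd):
    """Run cmd and apply the three failure conditions from the commit message.

    Hypothetical helper: returns False if the process cannot be launched,
    exits with a non-zero status, or writes anything to stderr.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True)
    except OSError:
        # Launch failure (for example, the compiler binary is missing).
        return False
    return proc.returncode == 0 and proc.stderr == ""


# Placeholder commands stand in for real shader-compiler invocations.
ok = compile_ok([sys.executable, "-c", "pass"])
bad_exit = compile_ok([sys.executable, "-c", "import sys; sys.exit(3)"])
```

The earlier behaviour described in the commit corresponds to checking only `proc.stderr == ""`, which is why a silent non-zero exit used to pass.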