12 Commits

Author SHA1 Message Date
Zijun Yu 890f1a27ed openvino: OV 2026.2, context-shift, Q5_1 support, gemma4 dense/embedding, and -fa off (#24503)
* Add interface is_model_splitted() to check the c-graph is splited or not

* Infer and propagate dynamic-dimension indices for all tensors in the GGML graph in api compute_model_outputs()

* Only do this for fallback sub graph

* Move dynamic dims compute in graph missmatch

* ggml-openvino: fix tensor data handling for PERMUTE/VIEW ops in split models

* ggml-openvino:add comments

* ggml-openvino: override VIEW op_case to 0 for split model inputs

* openvino backend: Handle unsupported VIEW shape-mismatch in OpenVINO backend

* Enable additional mul_mat tests and add tensor data saving function (#81)

* ggml-openvino: fix CONT/TRANSPOSE mapping and improve dynamic-dimension handling

* OpenVINO: add NORM/TANH support and rework SOFT_MAX translation

* ggml-openvino: extend VIEW handling

* Enable -fa off (#118)

* Enable --context-shift

* Fix llm param compute error for normal softmax not the softmax in attention

* OpenVINO backend: fix error for attention size compute in llm param

* use tensor->extra in infer_request i/o

* OpenVINO backend: refacter the compute_llm_params() func add get_attention_pattern_case to easy extand

* OpenVINO backend: clean unused code

* 1to1 match op update (#146)

* added translate_1to1_match_1_input function and updated gelu and tanh translations

* Remove unused translation function calls

---------

Co-authored-by: Mustafa Cavus <mustafacavus@intel.com>

* initial gemma4 support

* removed hardcoded names for kv cache slicing

* OpenVINO backend: Add new attention pattern for llm parameters compute

* flash attn Q shape static conversion

* Remove slice in permute translation when n_seq is 1

* return optional in extract_layer_from_name

* OpenVINO backend: refactor VIEW related operation (#148)

* OpenVINO backend: refactor VIEW related operation

* Enable VIEW handling in following ops

* OpenVINO backend does not support GGML_OP_NORM & GGML_OP_L2_NORM with VIEW input accuracy issue from OpenVINO

* OpenVINO backend: Add ops l2_norm & pad

* OpenVINO backend does not support CPY with non-contiguous data or mismatched types

* add op SSM_CONV GATED_DELTA_NET

* OpenVINO backend: fix error for bf16 in OV gpu plugin

* reverted static Q input shape for attention layer

* OpenVINO backend: remove hardcode name inp_tokens, which ignore some leaf case

* Disable remote tensor due to bug in ov gpu

* Disable n_token > 1 GATED_DELTA_NET on gpu

* OpenVINO backend: fix the view op dynamic handling issue in gemma4 & enable view + get_row

* OpenVINO backend: clean code

* OpenVINO backend: enable view + norm/rms_norm

* OpenVINO backend: concat op

* OpenVINO backend: argsort op

* OpenVINO backend: enable unary + view & GGML_UNARY_OP_SOFTPLUS

* Fix issue for test-backend-ops in TOPK_MOE, which compare VIEW ops result, VIEW node in OpenVINO no need compare, the whole graph result is correct

* OpenVINO backend: enable sum_rows

* OpenVINO backend: enable clamp

* OpenVINO backend: enable DIV

* OpenVINO backend: enable GGML_OP_MUL_MAT_ID

* OpenVINO backend: disable MUL_MAT_ID_FUSION case with large mem needed

* OpenVINO backend: Disable GGML_OP_ARGSORT, cause test_backend-ops failed

* OpenVINO backend: fix issue in mul_mat_id

* OpenVINO backend: Disable DIV with broadcast on GPU

* OpenVINO backend: update DIV

* use ov internal op GatedDeltaNet

* OpenVINO backend: enable llama erch test qwen3next

* OpenVINO backend: enable RMS_NORM + VIEW & remove op_case 2 for rope

* OpenVINO backend: fix error

* suggested changes, need review

* suggested changes, need review

* OpenVINO backend: clean unused code & fix build warning

* OpenVINO backend: enable minicpm3 for arch test

* Disable GDN op (#177)

* disable gated_delta_net

* update stateful_kv_size correctly in mismatch case

* OpenVINO backend: enable arch test for qwen3vl

* OpenVINO backend: enable cohere2 for arch test

* OpenVINO backend: enable t5 for arch test

* OpenVINO backend: enable jamba for arch test

* OpenVINO backend: remove warning for tmp

* OpenVINO backend: enable kimi-linear for arch test

* Remove unused

* Fix gpt-oss accuracy issue

* OpenVINO backend: enable arctic for arch test

* OpenVINO backend: enable grok for arch test

* Gemma4 initial npu support (#179)

* Initiall gemma4 npu support

* temp. fix for gemma4 accuracy bug on npu

* Remove hardcoded names for npu-fold handling

* revert static n tokens for cont translation as it is not needed

* removed unused variable

* ggml-openvino: add GGML_OPENVINO_ENABLE_CACHE env var to control decoder cache. Add environment variable GGML_OPENVINO_ENABLE_CACHE (default: YES). When set to NO, the decoder_cache is bypassed and models are rebuilt from the cgraph on every inference call in both dynamic and static compute paths. This is useful for debugging and verifying correctness without caching interference.

* Revert "Gemma4 initial npu support (#179)"

This reverts commit 0d29a9c4a52dc2c8aa52990f1a3854cfb01768ad.

* OpenVINO backend: disable debug log print

* Update TBB discovery. Delegated to OpenVINOs own config.

* OpenVINO backend: GGML_OPENVINO_ENABLE_CACHE YES -> 1

* OpenVINO backend: fallback FLASH_ATTN_EXT in gemma3n to CPU backend

* Add raw ov infer profiling metric

* Add OV raw infer time metric to static compute path

Co-authored-by: virajwad <84867530+virajwad@users.noreply.github.com>

* Modify precision of static profiling

* update to OV 2026.2, add OV windows CI

* fix editorconfig-checks

* Initiall gemma4 npu support

* temp. fix for gemma4 accuracy bug on npu

* Remove hardcoded names for npu-fold handling

* revert static n tokens for cont translation as it is not needed

* removed unused variable

* test-llama-archs fix

* Fix gemma4 flash_attn fallback

* support im2col

* fix code style

* disable add_rope_sin_cos optimization

* stateless boradcast and rope optimizations

* Enable manual gqa attn by default for stateless gpu

* manual gqa: fixed static batch

* gemma4 llama-bench ctx update fix

* Update OV win CI

* stateful rope fusion temp. fix

* OpenVINO backend: Conslolidate supported ops

* Exclude unsupported GGML_OP_SUB cases

* Exclude unsupported TOPK_MOE cases

* OpenVINO Backend: MUL_MAT enhancements

* Update OV CI

* support f16 mask input for npu

* Make GGML_OPENVINO_* env vars usage uniform

Standardize all GGML_OPENVINO_* env flags:
positive integers >0 to enable. Unset, empty, =0, or non-numeric values to disable.
This fixes cases where text values or empty strings enabled features.

* OpenVINO backend: Enhance envvar handling

* more cleanup

* move ggml_openvino_env_flag to appropriate place

* OpenVINO backend: add REPEAT translator, Q5_1 weights, and GLU view-input fix

* ggml-openvino: fix -Werror=cast-qual in extract_q5_1_data

* Update openvino.Dockerfile

Use BuildKit cache mounts for faster Docker rebuilds.
Use apt instead of dpkg, remove unused .ddeb downloads, add DLLAMA_BUILD_TESTS=OFF.

* ggml-openvino: centralize env var access via *getenv_str/getenv_int helpers

Replace getenv and legacy flags with _str and _int helpers.Minor cleanup, doc updates.

* OpenVINO backend: Enable GGML_OP_ADD_ID

* Uptade openvino backend clamg-format

* clang-format

* Update OPENVINO.md (#211)

* OpenVINO backend: fix accuracy issue for op CONCAT with i64 precision

* Remove strict concurrency for gpu-openvino-low-perf

* Update openvino CI keynames; add ccache-clear

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

* Fix formatting

---------

Co-authored-by: Xuejun Zhai <Xuejun.Zhai@intel.com>
Co-authored-by: Mustafa Cavus <mustafa.cavus@intel.com>
Co-authored-by: Mustafa Cavus <mustafacavus@intel.com>
Co-authored-by: Xuejun <XuejunZhai@intel.com>
Co-authored-by: Wang Yang <yang4.wang@intel.com>
Co-authored-by: Ravi Panchumarthy <ravi.panchumarthy@intel.com>
Co-authored-by: virajwad <84867530+virajwad@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Mostafa Faheem <mostafaaafaheem@gmail.com>
Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>
2026-06-17 09:11:21 +03:00
Georgi Gerganov d4204b03a5 ci : clear cache instead of "no timestamp" keys + fix macos (#23895)
* ci : ios use macos-15 again

* ci : add and test ccache-clear

* cont : fix

* cont : set permission

* cont : another permission

* cont : token

* cont : print key

* cont : bring back perms

* cont : test windows

* cont : add token

* cont : cleanup

* ci : make release jobs clean-up their ccache
2026-05-30 08:52:30 +03:00
Sigbjørn Skjæret 2d0656fbdd ci : bump cuda release to 13.3 (#23749) 2026-05-27 15:06:08 +03:00
alex-spacemit 5fdf07e33b ci : update spacemit toolchain url and enhance curl command (#23642)
* fix(action): update SpacemiT toolchain URL and version

Change-Id: If4cc1c738a855274103f8c3ad52daa33528acd0c

* fix(action): add -L flag to curl command for URL redirection

Change-Id: I9b6c37390f0c7a733a36308c8fb53d22d234ab06
2026-05-25 10:43:24 +02:00
Zijun Yu 9789c4ecdc ggml : add OpenVINO backend (#15307)
* Update build doc

* Add cgraph tensor output name to OV op name

* Update openvino build instructions

* Add initial NPU support

* draft NPU support version 2: prefill + kvcache

* NPU support version 2: prefill + kvcache

* Change due to ggml cgraph changes, not correct yet

* Change due to ggml cgraph changes, llama-3.2 CPU work

* Add AMD64 to CMakeLists

* Change due to ggml cgraph changes, all device work

* Refactor: clean, fix warning

* Update clang-format

* Statful transformation for CPU GPU

* Add SwiGLU

* Fuse to SDPA

* Replace Concat with Broadcast in MulMat for GQA

* Pull out indices creation for kv cache update

* Refactor: remove past_token_len from extra_inputs

* Fix Phi3 SwiGLU and SoftMax

* Pull out sin cos from rope

* Reduce memory: free ov weights node after graph conversion

* Fix CPY due to cgraph change

* Added OpenVINO CI/CD. Updated docs

* Fix llama-cli

* Fix Phi3 ROPE; Add test-backend-ops

* Fix NPU

* Fix llama-bench; Clang-format

* Fix llama-perplexity

* temp. changes for mark decomp

* matmul in fp32

* mulmat input conversion fix

* mulmat type conversion update

* add mark decomp pass

* Revert changes in fuse_to_sdpa

* Update build.md

* Fix test-backend-ops

* Skip test-thread-safety; Run ctest only in ci/run.sh

* Use CiD for NPU

* Optimize tensor conversion, improve TTFT

* Support op SET_ROWS

* Fix NPU

* Remove CPY

* Fix test-backend-ops

* Minor updates for raising PR

* Perf: RMS fused to OV internal RMS op

* Fix after rebasing

- Layout of cache k and cache v are unified: [seq, n_head, head_size]
- Add CPY and FLASH_ATTN_EXT, flash attn is not used yet
- Skip test-backend-ops due to flash attn test crash
- Add mutex around graph conversion to avoid test-thread-safety fali in the future
- Update NPU config
- Update GPU config to disable SDPA opt to make phi-3 run

* Change openvino device_type to GPU; Enable flash_attn

* Update supports_buft and supports_op for quantized models

* Add quant weight conversion functions from genai gguf reader

* Quant models run with accuracy issue

* Fix accuracy: disable cpu_repack

* Fix CI; Disable test-backend-ops

* Fix Q4_1

* Fix test-backend-ops: Treat quantized tensors as weights

* Add NPU Q4_0 support

* NPU perf: eliminate zp

* Dequantize q4_1 q4_k q6_k for NPU

* Add custom quant type: q8_1_c, q4_0_128

* Set m_is_static=false as default in decoder

* Simpilfy translation of get_rows

* Fix after rebasing

* Improve debug util; Eliminate nop ReshapeReshape

* STYLE: make get_types_to_requant a function

* Support BF16 model

* Fix NPU compile

* WA for npu 1st token acc issue

* Apply EliminateZP only for npu

* Add GeGLU

* Fix Hunyuan

* Support iSWA

* Fix NPU accuracy

* Fix ROPE accuracy when freq_scale != 1

* Minor: not add attention_size_swa for non-swa model

* Minor refactor

* Add Q5_K to support phi-3-q4_k_m

* Requantize Q6_K (gs16) to gs32 on GPU

* Fix after rebasing

* Always apply Eliminate_ZP to fix GPU compile issue on some platforms

* kvcachefusion support

* env variable GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION added

* Fix for Phi3

* Fix llama-cli (need to run with --no-warmup)

* Fix add_sliced_mask; Revert mulmat, softmax; Remove input attention_size, iSWA model not working

* fix after rebasing

* Fix llama-3-8b and phi3-mini q4_0 NPU

* Update to OV-2025.3 and CMakeLists.txt

* Add OV CI cache

* Apply CISC review and update CI to OV2025.3

* Update CI to run OV dep install before build

* Update OV dockerfile to use OV2025.3 and update build docs

* Style: use switch in supports_ops

* Style: middle ptr and ref align, omit optional struct keyword

* NPU Unify PD (#14)

* Stateless. Fix llama-cli llama-server

* Simplify broadcast op in attention

* Replace get_output_tensor+memcpy with set_output_tensor

* NPU unify PD. Unify dynamic and static dims

* Clean placeholders in ggml-openvino.cpp

* NPU unify PD (handled internally)

* change graph to 4d, support multi sequences

* Fix llama-bench

* Fix NPU

* Update ggml-decoder.cpp

Hitting error while compiling on windows:

error C3861: 'unsetenv': identifier not found

Reason: unsetenv() is a POSIX function; it doesn’t exist on Windows. Visual Studio (MSVC) won’t recognize it.

Proposed fix: Use _putenv_s() (Windows equivalent)
This is supported by MSVC and achieves the same effect: it removes the environment variable from the process environment.

This keeps cross-platform compatibility.

* Update ggml-decoder.cpp

* Update ggml-decoder.cpp

* Update ggml-decoder.cpp

* Update ggml-decoder.cpp

* Update ggml-decoder.cpp

* Remove the second decoder for node. Moving the function into the model decoder

* Fix error for naive

* NPU prefill chunking

* NPU fix llama-bench

* fallback naive run with accuracy issue

* NPU support llma-perplexity -b 512 --no-warmup

* Refactor: split ov_graph_compute for dynamic and static

* remove unused API GgmlOvDecoder::get_output_stride(const std::string & name)

* minor update due to ov 2025.4

* remove unused API GgmlOvDecoder::get_output_names()

* remove unused API get_output_shape(const std::string & name)

* Modified API GgmlOvDecoder::get_output_type(const std::string & name)

* Removed API GgmlOvDecoder::get_output_op_params(const std::string & name)

* Removed API get_output_ggml_tensor(const std::string & name)

* Removed API m_outputs

* Removed m_output_names

* Removed API GgmlOvDecoder::get_input_names()

* Removed API GgmlOvDecoder::get_input_stride(const std::string& name)

* Removed API get_input_type

* Removed API get_input_type

* Removed API GgmlOvDecoder::get_input_shape(const std::string & name)

* Removed API GgmlOvDecoder::get_input_op_params(const std::string & name)

* Fix error for decoder cache

* Reuse cached decoder

* GPU remove Q6_K requantization

* NPU fix wrong model output shape

* NPU fix q4 perf regression

* Remove unused variable nodes

* Fix decoder can_reuse for llama-bench

* Update build.md for Windows

* backend buffer: allocate on host

* Use shared_buffer for GPU NPU; Refactor

* Add ov_backend_host_buffer; Use cached remote context

* Put kvcache on GPU

* Use ggml_aligned_malloc

* only use remote tensor for kvcache

* only use remote tensor for kvcache for GPU

* FIX: use remote tensor from singleton

* Update build.md to include OpenCL

* NPU always requant to q4_0_128

* Optimize symmetric quant weight extraction: use single zp

* Use Q8_0_C in token embd, lm_head, and for 5 and 6 bits quant

* Update build.md

* Support -ctk f32

* Initial stateful graph support

* Update ggml/src/ggml-openvino/ggml-decoder.cpp

Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>

* code cleanup

* npu perf fix

* requant to f16 for Q6 embed on NPU

* Update ggml/src/ggml-openvino/ggml-decoder.cpp

* Update ggml/src/ggml-openvino/ggml-openvino-extra.cpp

* Create OPENVINO.md in llama.cpp backend docs

* Update OPENVINO.md

* Update OPENVINO.md

* Update OPENVINO.md

* Update build.md

* Update OPENVINO.md

* Update OPENVINO.md

* Update OPENVINO.md

* kq_mask naming fix

* Syntax correction for workflows build file

* Change ov backend buffer is_host to false

* Fix llama-bench -p -n where p<=256

* Fix --direct-io 0

* Don't put kvcache on GPU in stateful mode

* Remove hardcode names

* Fix stateful shapes

* Simplification for stateful and update output shape processing

* Remove hardcode names

* Avoid re-compilation in llama-bench

* Extract zp directly instead of bias

* Refactor weight tensor processing

* create_weight_node accept non-ov backend buffer

* remove changes in llama-graph.cpp

* stateful masking fix (#38)

Fix for stateful accuracy issues and cl_out_of_resources error in stateful GPU with larger context sizes.

* Fix test-backend-ops crash glu, get_rows, scale, rms_norm, add

* hardcoded name handling for rope_freqs.weight

* Suppress logging and add error handling to allow test-backend-ops to complete

* Fix MUL_MAT with broadcast; Add unsupported MUL_MAT FLASH_ATTN cases

* Use bias instead of zp in test-backend-ops

* Update OV in CI, Add OV CI Tests in GH Actions

* Temp fix for multithreading bug

* Update OV CI, fix review suggestions.

* fix editorconfig-checker, update docs

* Fix tabs to spaces for editorconfig-checker

* fix editorconfig-checker

* Update docs

* updated model link to be GGUF model links

* Remove GGML_CPU_REPACK=OFF

* Skip permuted ADD and MUL

* Removed static variables from utils.cpp

* Removed initializing non-existing variable

* Remove unused structs

* Fix test-backend-ops for OV GPU

* unify api calling

* Update utils.cpp

* When the dim is dynamic, throw an error, need to is stastic forst

* Add interface compute_model_outputs(), which get the model output through computing the node use count & status in the cgraph to avoid the flag using

* No need to return

* Fix test-backend-ops for OV GPU LNL

* Fix test-thread-safety

* use the shape from infer request of output tensor create to avoid issue

* fix dynamic output shape  issue

* fix issue for the unused node in tests

* Remove unused lock

* Add comment

* Update openvino docs

* update to OV release version 2026.0

* add ci ov-gpu self hosted runner

* fix editorconfig

* Fix perplexity

* Rewrite the model inputs finding mechanism  (#54)

* Rewrite the model inputs finding logistic

* Put stateful shape handle in get input shape

* Put the iteration logistic in func

* Added ggml-ci-intel-openvino-gpu and doc update

* .hpp files converted to .h

* fix ggml-ci-x64-intel-openvino-gpu

* Fix for stateful execution bug in llama-bench

* Minor updates after stateful llama-bench fix

* Update ggml/src/ggml-openvino/utils.cpp

Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>

* Remove multiple get_shape calls

* Bring back mutex into compute

* Fix VIEW op, which slice the input node

* Added token_len_per_seq existence check before slicing masks and moved node retrieval inside guarded block to prevent missing-key access

* Temp. fix for test requant errors

* Update to OV ggml-ci to low-perf

* ci : temporary disable "test-llama-archs"

* ci : cache v4 -> v5, checkout v4 -> v6, fix runner tag

* docs : update url

* Fix OV link in docker and Update docs

---------

Co-authored-by: Ravi Panchumarthy <ravi.panchumarthy@intel.com>
Co-authored-by: Cavus Mustafa <mustafa.cavus@intel.com>
Co-authored-by: Arshath <arshath.ramzan@intel.com>
Co-authored-by: XuejunZhai <Xuejun.Zhai@intel.com>
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
Co-authored-by: Xuejun Zhai <Xuejun.Zhai@intel>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-14 07:56:55 +02:00
Slobodan Josic 3af34b9ff5 ci : update the ROCm/HIP toolchain versions [no ci] (#19891)
* [HIP] Update ROCm build container to rocm/dev-ubuntu-22.04:7.2 and HIP_SDK to 26.Q1

* revert container version

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-25 15:54:49 +01:00
Adrien Gallouët 537d4240d4 ci : remove libcurl in releases (#18775)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-12 21:43:02 +01:00
Sigbjørn Skjæret 0a540f9abd ci : add windows-cuda 13.1 release (#17839) 2025-12-07 14:02:04 +01:00
Sigbjørn Skjæret 3a002afafa ci : refactor sdk caching to minimize storage (#16414)
* refactor sdk caching to minimize storage

* use correct action

* add myself as owner to /.github/actions/ [no ci]
2025-10-06 17:40:21 +02:00
Diego Devesa 415e40a357 releases : use arm version of curl for arm releases (#13592) 2025-05-16 19:36:51 +02:00
Diego Devesa 70a6991edf ci : move release workflow to a separate file (#13362) 2025-05-08 13:15:28 +02:00
Xuan-Son Nguyen bd3f59f812 cmake : enable curl by default (#12761)
* cmake : enable curl by default

* no curl if no examples

* fix build

* fix build-linux-cross

* add windows-setup-curl

* fix

* shell

* fix path

* fix windows-latest-cmake*

* run: include_directories

* LLAMA_RUN_EXTRA_LIBS

* sycl: no llama_curl

* no test-arg-parser on windows

* clarification

* try riscv64 / arm64

* windows: include libcurl inside release binary

* add msg

* fix mac / ios / android build

* will this fix xcode?

* try clearing the cache

* add bunch of licenses

* revert clear cache

* fix xcode

* fix xcode (2)

* fix typo
2025-04-07 13:35:19 +02:00