9221 Commits

Author SHA1 Message Date
Pranav Dhinakar b7340443d4 ggml-hexagon: add PAD op HVX kernel (#23078)
* ggml-hexagon: add PAD op HVX kernel

Implements GGML_OP_PAD on the Hexagon HTP backend using HVX vectorized
kernels. Supports zero-padding and circular padding across all 4 tensor
dimensions.

* hex-ggml: remove duplicate op cases (merge conflict)

* hex-pad: fix editorconfig checks and macro alignment

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
b9221
2026-05-18 13:39:36 -07:00
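To illustrate the two padding modes named in commit b7340443d4 above, here is a minimal one-dimensional sketch in Python. It is not the HVX kernel: `pad_1d` and its `left`, `right`, and `circular` parameters are hypothetical names chosen for illustration, and the sketch assumes a non-empty input. The commit message says the op covers all 4 tensor dimensions; the same indexing idea would apply along each one.

```python
def pad_1d(x, left, right, circular=False):
    """Pad a non-empty 1-D list with `left` and `right` extra elements.

    Hypothetical helper for illustration only; not part of ggml.
    """
    n = len(x)
    out = []
    for i in range(left + n + right):
        src = i - left  # index into the unpadded input
        if circular:
            # Circular padding wraps around the input.
            out.append(x[src % n])
        elif 0 <= src < n:
            # Inside the original extent: copy the value through.
            out.append(x[src])
        else:
            # Zero padding fills positions outside the input with 0.
            out.append(0)
    return out
```

For example, `pad_1d([1, 2, 3], 1, 2)` gives `[0, 1, 2, 3, 0, 0]`, while the circular variant gives `[3, 1, 2, 3, 1, 2]`.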
SamareshSingh 5cbaa5e69e docker : add OCI image labels for version and build date (#21653)
* docker: add OCI image labels to all published images

* docker: propagate OCI labels as manifest and index annotations

* docker: drop hardcoded org URL and revert accidental intel version bump

The OCI image URL and source are now driven by build args with sensible defaults. The workflow passes the actual repository URL so fork builds get labels pointing at the fork instead of upstream. Also restores the IGC, compute runtime, and IGDGMM versions in the Intel Dockerfile labeled stage, which I accidentally bumped in the first commit.

* docker: add skip_s390x workflow_dispatch input for fast test runs

Lets maintainers and PR authors trigger the docker workflow without the s390x build target, which depends on the IBM Z runner and is by far the slowest job in the matrix. The flag filters the s390x row out of the build matrix before merge_matrix is derived, so the merge job sees a consistent shape too.

Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>

---------

Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
2026-05-18 22:14:45 +02:00
Adrien Gallouët 45b455e66f common : remove hf cache migration (#23266)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b9219
2026-05-18 17:11:47 +02:00
Aleksander Grygier 3a9c1b854d ui: Update KaTeX package and clean up logs from sass warnings (#23275)
* ui: migrate katex imports to @use to resolve SCSS deprecation warnings

* ci: Use `ubuntu-slim` for CI (UI) workflow
2026-05-18 16:26:01 +02:00
Aleksander Grygier b9a2170fce feat: add scroll-to-bottom button to chat + prevent forced scroll down (#23270) 2026-05-18 16:17:21 +02:00
Aleksander Grygier 1ff0fc1384 ui: Refactor models store, MCP service, and gate logs behind VITE_DEBUG (#23236)
* refactor: Scope console logs to `DEV` + `VITE_DEBUG` env vars

* refactor: skip MCP proxy probe when no server requires it

* refactor: suppress expected disconnect errors during MCP client shutdown

* refactor: Deduplicate requests

* refactor: deduplicate model fetching across ROUTER and MODEL modes

* refactor: Clean up models logic

* chore: Add `.env.example` file

* refactor: replace client-side CORS proxy probe with server status flag

* refactor: Post-review fixes

* test: add vitest client setup with API fetch mocks
b9216
2026-05-18 16:09:40 +02:00
Aleksander Grygier a135ec0baa ui: Centralize monospace font styles in app.css (#23272) 2026-05-18 15:10:14 +02:00
Martin Andersson 232f466583 webui: fix Tailwind v4 utility classes missing when built via cmake (#23253) 2026-05-18 14:08:02 +02:00
Andrei 49c21f97cd llama: initialize pre-norm embedding mask flag (#23256) b9213 2026-05-18 14:20:49 +03:00
Sigbjørn Skjæret 77e38d68f2 add myself to conversion (#23261) 2026-05-18 12:42:56 +02:00
Martin Klacer 053e01dff6 ci : added kleidiai-server to server-self-hosted workflow (#22435)
* kleidiai: added kleidiai-server to server-self-hosted workflow

 * Added KleidiAI-enabled Arm64 Linux llama-server CI/integration test
   workflow into the server-self-hosted.yml configuration file

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Change-Id: I032e33c525b7e26bc5d53719f638bee610cec1ee

* Added self-hosted executor for KleidiAI server workflow

Signed-off-by: Martin Klacer <martin.klacer@arm.com>

* Update .github/workflows/server-self-hosted.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-18 11:14:57 +02:00
Georgi Gerganov c3f95c1f06 scripts : allow wc2wt with an existing branch (#23189) 2026-05-18 08:57:28 +03:00
Intel AI Get-to Market Customer Success and Solutions 0caf2a1d48 sycl: scalar SWAR byte-subtract in Q6_K MMVQ dot product (#22156)
Signed-off-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
b9209
2026-05-18 08:12:21 +03:00
Intel AI Get-to Market Customer Success and Solutions 5511965b19 sycl: route small f32 matmuls to oneMKL, bypass oneDNN (#22150)
Signed-off-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
b9208
2026-05-18 08:11:51 +03:00
Neo Zhang e98bcfec28 sycl : fix error when using -mg 1 (#23140) 2026-05-18 08:11:19 +03:00
Incarnas 1867a0c692 update bid to match each layers MTP source (#23237)
* update bid to match each layers MTP source

* Update conversion/qwen.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-18 12:37:12 +08:00
Sigbjørn Skjæret dd7cad7197 cmake : do not check for bin install dir (#23234) 2026-05-18 02:33:14 +02:00
Gabe Goodhart 726704a160 feat: Support d_conv=15 for ssm-conv.cu (#23017)
Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
b9204
2026-05-17 23:05:11 +02:00
Aldehir Rojas 87589042ca cmake : fix LLAMA_BUILD_UI logic (#23190) b9203 2026-05-17 14:42:26 -04:00
Sigbjørn Skjæret e0de4c2419 cmake : do not install conversion script (#23204) b9202 2026-05-17 18:07:21 +02:00
Oliver Simons 84c678242a CUDA: Continue directly including cuda/iterator (#23102)
Cont of #22936, forgot to update one site
2026-05-17 18:00:10 +02:00
Aman Gupta 3e12fbdea5 llama: avoid copying logits during prompt decode in MTP (#23198)
* llama: avoid copying logits during prompt decode in MTP

* review: update comment

* llama-graph: call set_output for t_h_pre_norm
b9200
2026-05-17 23:30:25 +08:00
Aldehir Rojas 39cf5d6191 common : delegate assistant continuation to underlying template handlers (#23089)
* common : delegate assistant continuation to template handler

* server : implement echo parameter to exclude assistant prefill in the response

* server : fix tests for prefill

* server : use existing llama template

* cont : clean up
2026-05-17 13:36:05 +02:00
Jan Ekström a6d6183dbc ggml-vulkan/CMakeLists: add a check for SPIRV-Headers (#22009)
* ci/run: set explicit SPIR-V Headers search path for macOS vulkan CI

For whatever reason, the files are under an additional sub-path
`vulkan/` under the cmake directory, which matches neither the
current LunarG macOS Vulkan SDK structure (`lib/cmake/SPIRV-Headers`)
nor what gets installed when you run the cmake build+install for
SPIRV-Headers itself on at least Linux (`share/cmake/SPIRV-Headers`).

This allows SPIRV-Headers to be found, as the CI runner's setup
currently does not seem to include the relevant path in its
list of search locations.

* ggml-vulkan/CMakeLists: add a check for SPIRV-Headers

This is installed by the project if it is built and installed.
Receiving an error during the configuration step is generally
preferred to receiving an error in the middle of a build.
b9198
2026-05-17 13:12:11 +02:00
Pascal fcae601e44 vulkan: add cpy bf16 -> f32 pipelines (#22677) b9197 2026-05-17 11:31:20 +02:00
Jeff Bolz 7ba22c6a09 vulkan: Support unaligned tensors for ROPE (#22637) b9196 2026-05-17 11:30:16 +02:00
Aldehir Rojas f4cc787b9f common : enable streaming JSON argument values (#23173)
* common : remove atomic from json arguments

* common : remove parsing logic on JSON arguments
2026-05-17 03:44:34 -05:00
Jeff Bolz 3fbadb06dc vulkan: fuse SSM_CONV + BIAS + SILU (#22653) b9194 2026-05-17 10:25:50 +02:00
Rares Vernica 1a68ec9378 server : honor --embd-normalize CLI arg (#23125)
The --embd-normalize flag was registered only for the embedding and debug
examples, so llama-server rejected it and the /embedding handler used a
hard-coded default of 2 (L2). Add LLAMA_EXAMPLE_SERVER to the flag's
example set and read params.embd_normalize as the handler's default. The
per-request "embd_normalize" body field continues to override.
b9193
2026-05-17 09:39:04 +03:00
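The default-versus-override behaviour described in commit 1a68ec9378 above can be sketched as follows. This is a Python illustration, not the server code: `resolve_embd_normalize` and `l2_normalize` are hypothetical names, the request body is modelled as a plain dict, and only the value 2 (L2) mentioned in the commit message is implemented.

```python
import math


def resolve_embd_normalize(request_body, cli_default=2):
    """Pick the normalization mode for one request (hypothetical helper).

    The per-request "embd_normalize" field wins; otherwise fall back to
    the value supplied on the command line.
    """
    return int(request_body.get("embd_normalize", cli_default))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length; leave a zero vector as is."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return list(vec)
    return [v / norm for v in vec]


def postprocess(vec, request_body, cli_default=2):
    """Apply L2 normalization only when the resolved mode is 2."""
    mode = resolve_embd_normalize(request_body, cli_default)
    return l2_normalize(vec) if mode == 2 else list(vec)
```

With this shape, starting the server with a different default changes the result of requests that omit the field, while a request that sets `"embd_normalize"` explicitly is unaffected by the CLI value.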
ddh0 a16cce81d3 ngram : reduce noisy logs (#23185)
* ngram : reduce noisy logs

* ngram : reduce noisy logs
b9192
2026-05-17 09:38:17 +03:00
Judd 4f13cb7424 webui: support video files as input (#22830) b9191 2026-05-17 02:13:44 +02:00
Xuan-Son Nguyen b64739ea39 server: (router) alloc tmp buffer on heap (#23159) b9190 2026-05-16 23:42:16 +02:00
Pascal 64b38b561b server: skip device enumeration in router mode to avoid creating CUDA primary context (#23137) b9189 2026-05-16 21:21:06 +02:00
Winston Ma 6049906133 vulkan: removed duplicate #include <memory> in headers (#23144) 2026-05-16 19:57:35 +02:00
Aleksander Grygier 0253fb21f5 ui: Add request timeout for MCP tool calls (#23138)
* feat: Add request timeout for MCP tool calls in llama-ui

* feat: MCP Settings tab with max timeout setting
2026-05-16 15:20:27 +02:00
Georgi Gerganov 3a92bc99db sync : ggml b9186 2026-05-16 16:11:29 +03:00
Georgi Gerganov e6c37a1adc ggml : bump version to 0.12.0 (ggml/1494) 2026-05-16 16:11:29 +03:00
CrispStrobe 560445bf34 metal : tighten input-position loop in kernel_conv_transpose_1d (ggml/1477)
For a given output position j on the time axis, only input positions
i such that i*s0 <= j < i*s0 + K contribute -- i.e.
i in [ceil((j - K + 1)/s0), floor(j/s0)] intersected with [0, IL-1].
That's at most ceil(K/s0) values (typically 2 for stride==K/2
transposed convs).

The current kernel iterates the full IL range and filters with an
`if`, amplifying per-thread work by IL/ceil(K/s0) (~160x for IL=320,
K=10, s0=5 -- a representative codec-decoder shape). On Apple M1
the wasted work trips the macOS GPU watchdog
(kIOGPUCommandBufferCallbackErrorImpactingInteractivity) on long
graphs.

Compute i_min, i_max analytically before the inner loop and iterate
only [i_min, i_max]. Output is bit-identical (same multiplies and
adds in the same order); loop bound shrinks by IL/ceil(K/s0).

Tested on M1 with a downstream consumer running a TTS codec at full
T_codec; end-to-end codec decode ~3-4x faster, zero watchdog hits
across long synthesis runs vs ~30% pre-patch.
2026-05-16 16:11:29 +03:00
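The loop-bound tightening described in commit 560445bf34 above can be checked with a small single-channel sketch. This is plain Python rather than the Metal kernel, and the function names are hypothetical; it assumes no padding and no dilation, so the output length is `(IL - 1) * s0 + K`. Both versions visit the contributing inputs in ascending `i`, which is why the results match exactly.

```python
def conv_transpose_1d_naive(x, w, s0):
    """Scan every input position and filter with an `if` (the old approach)."""
    IL, K = len(x), len(w)
    out = []
    for j in range((IL - 1) * s0 + K):
        acc = 0.0
        for i in range(IL):
            k = j - i * s0  # kernel tap that maps input i to output j
            if 0 <= k < K:
                acc += x[i] * w[k]
        out.append(acc)
    return out


def conv_transpose_1d_tight(x, w, s0):
    """Iterate only over i in [i_min, i_max], computed analytically."""
    IL, K = len(x), len(w)
    out = []
    for j in range((IL - 1) * s0 + K):
        # ceil((j - K + 1) / s0) using integer floor division, clamped to 0
        i_min = max(0, -((K - 1 - j) // s0))
        # floor(j / s0), clamped to the last input position
        i_max = min(IL - 1, j // s0)
        acc = 0.0
        for i in range(i_min, i_max + 1):
            acc += x[i] * w[j - i * s0]
        out.append(acc)
    return out
```

For `x = [1, 2]`, `w = [1, 1, 1]`, and `s0 = 2`, both functions return `[1.0, 1.0, 3.0, 2.0, 2.0]`; the tight version runs its inner loop at most `ceil(K / s0)` times per output position.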
Steve Lhomme 2eb3e6b242 ggml: install ggml.pc in <libdir>/pkgconfig (ggml/1480)
That's always how it's done: https://github.com/search?q=path%3ACMakeLists.txt%20%22%24%7BCMAKE_INSTALL_LIBDIR%7D%2Fpkgconfig%22&type=code
2026-05-16 16:11:29 +03:00
Holger Voormann 25b1bc9c2f ui: Correct links in tools/ui/README.md [no ci] (#23139)
In `tools/ui/README.md`, update the relative links, now that the `README.md` file has been moved from `tools/server/webui/` to `tools/ui/`.

See https://github.com/ggml-org/llama.cpp/commit/59778f0196a82db32580bb649d5d839355d6d7bf.
2026-05-16 14:42:38 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO) 18675b6bbc vendor : update cpp-httplib to 0.45.0 (#23103) b9181 2026-05-16 15:25:21 +03:00
Aman Gupta 255582687b llama + spec: MTP Support (#22673)
* spec: support MTP

* fix batch size

* rename files

* cont : simplify (#7)

* MTP: clean-up (#9)

* MTP: clean-up

* review: use llama_context_type instead of llama_graph_type

* review: remove llama_model_has_mtp

* review: fix convert issues

* convert: fix pycheck

* review: formatting

* use `mtp-` for identifying mtp models

* convert: fix mtp conversion

* mtp -> draft-mtp

* remove unused llama_arch

* add need_embd in speculative

* llama: allow partial seq_rm for GDN models for speculative decoding

Currently, a speculative run must restart from a checkpoint when some
draft tokens are not accepted, which wastes work by running the target
again. This PR adds the ability to roll back up to `draft_max` tokens by
storing the GDN intermediates.

* fix pending state

* vulkan: add GDN partial rollback

* meta: extend check to axis 1

* metal: add GDN partial rollback

Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.

- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior

Ref: https://github.com/ggml-org/llama.cpp/commit/8c05923630110223669f069af2000e9cf10c02bc

Assisted-by: llama.cpp:local pi

* delta_net_base: use ggml_pad instead of new_tensor

* review: add need_rs_seq

* review: rename part_bounded to n_rs

* review: deslop comments

* review: rename, add asserts

* server : adjust checkpoint logic (#11)

* server : adjust checkpoint logic

* cont : rm asserts

* server-context: fix early exit

* spec : fix compatibility with n-gram and add TODOs (#13)

* metal : cleanup

* llama : fix faulty bitwise check in recurrent memory

* server : disable RS-based MTP in combination with other spec types

* spec : add TODOs

* cont : fix comment

* cont : update comment

* common : fix logic for ngram + mtp compat

* llama-memory: enable checkpointing with partial rollback

* cont: add test-case for loading into a dirty ctx

* llama-memory-recurrent: clear rs_idx in clear

* download: fix mtp path

* llama-arch: fix enorm op

* docs: update docs

* conversion: fix type annotations

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b9180
2026-05-16 20:06:23 +08:00
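The partial-rollback idea in commit 255582687b above (storing intermediate recurrent states so that rejected draft tokens do not force a restart from a checkpoint) can be sketched with a toy recurrence. Everything here is an assumption for illustration: `step` stands in for a real gated delta net update, the names are hypothetical, and the slot layout is simplified and does not mirror the 3D state tensor used by the actual kernels.

```python
def step(state, token, decay=0.5):
    """Toy recurrent update standing in for one GDN step (illustrative only)."""
    return decay * state + token


def decode_with_snapshots(state0, draft_tokens, n_slots):
    """Run the draft tokens, storing the state after each token in its own slot."""
    assert len(draft_tokens) <= n_slots, "cannot roll back past the stored slots"
    slots = []
    state = state0
    for tok in draft_tokens:
        state = step(state, tok)
        slots.append(state)  # snapshot taken inside the token loop
    return slots


def rollback(state0, slots, n_accepted):
    """Return the state after the first n_accepted draft tokens, without recomputing."""
    return state0 if n_accepted == 0 else slots[n_accepted - 1]
```

If three draft tokens are processed and only the first two are accepted, `rollback(state0, slots, 2)` yields the same state as re-running the first two tokens from `state0`, at the cost of keeping one extra state per draft token.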
kubawoo b81c2cdd74 ui: Fix handling of MCP resource template parameters (#23117)
* Fix handling of MCP resource template parameters

* Fix formatting for uri-template.test.ts

---------

Co-authored-by: kuba <kuba@laptop.local.net>
2026-05-16 13:25:41 +02:00
viggy 1428004808 webui : [ChatFormActionAdd][a11y] fix accessibility issues in add menu trigger and items (#22736)
* fix tab order on attach button, and don't focus on disabled menu item

* add a11y tests
2026-05-16 12:00:46 +02:00
Pascal 366c5e2a3b ui: untrack settings sync in props effect to prevent reactive loop (#23127) 2026-05-16 11:25:34 +02:00
Aleksander Grygier 1d9f99aa75 fix: Add build step using build workflow to publish workflow (#23134) 2026-05-16 11:22:59 +02:00
ynankani 42928bc14d model : NvFP4 quantized LM head support (#23046)
* NvFP4 quantized LM head support

Signed-off-by: ynankani <ynankani@nvidia.com>

* Address review comments

Signed-off-by: ynankani <ynankani@nvidia.com>

* Add assert for NvFp4 lm head and tied embeddings

Signed-off-by: ynankani <ynankani@nvidia.com>

* Address review comments

Signed-off-by: ynankani <ynankani@nvidia.com>

* Create output_s tensor only when LM head NvFp4

Signed-off-by: ynankani <ynankani@nvidia.com>

---------

Signed-off-by: ynankani <ynankani@nvidia.com>
2026-05-16 11:09:27 +02:00
Aleksander Grygier 59778f0196 ui: Restructure repo to use tools/ui folder and ui / UI / llama-ui / LLAMA_UI naming (#23064)
* webui: Move static build output from `tools/server/public` to `build/ui` directory

* refactor: Move to `tools/ui`

* refactor: rename CMake variables and preprocessor defines

- Rename LLAMA_BUILD_WEBUI -> LLAMA_BUILD_UI (old kept as deprecated)
- Rename LLAMA_USE_PREBUILT_WEBUI -> LLAMA_USE_PREBUILT_UI (old kept as deprecated)
- Backward compat: old vars auto-forward to new ones with DEPRECATION warning
- Rename internal vars: WEBUI_SOURCE -> UI_SOURCE, WEBUI_SOURCE_DIR -> UI_SOURCE_DIR, etc.
- Rename HF bucket: LLAMA_WEBUI_HF_BUCKET -> LLAMA_UI_HF_BUCKET
- Emit both LLAMA_BUILD_WEBUI and LLAMA_BUILD_UI preprocessor defines
- Emit both LLAMA_WEBUI_DEFAULT_ENABLED and LLAMA_UI_DEFAULT_ENABLED

* refactor: rename CLI flags (--webui -> --ui) with backward compat

- Add --ui/--no-ui (old --webui/--no-webui kept as deprecated aliases)
- Add --ui-config (old --webui-config kept as deprecated alias)
- Add --ui-config-file (old --webui-config-file kept as deprecated alias)
- Add --ui-mcp-proxy/--no-ui-mcp-proxy (old --webui-mcp-proxy kept as deprecated)
- Add new env vars: LLAMA_ARG_UI, LLAMA_ARG_UI_CONFIG, LLAMA_ARG_UI_CONFIG_FILE, LLAMA_ARG_UI_MCP_PROXY
- C++ struct fields: params.ui, params.ui_config_json, params.ui_mcp_proxy added alongside old fields
- Backward compat: old fields synced to new ones in g_params_to_internals

* refactor: update C++ server internals with backward compat

- Rename json_webui_settings -> json_ui_settings (both kept in server_context_meta)
- Rename params.webui usage -> params.ui (both synced, old still works)
- JSON API emits both "ui"/"ui_settings" and "webui"/"webui_settings" keys
- Server routes use params.ui_mcp_proxy || params.webui_mcp_proxy
- Preprocessor guards use #if defined(LLAMA_BUILD_UI) || defined(LLAMA_BUILD_WEBUI)

* refactor: rename CI/CD workflows, artifacts, and build script

- Rename webui-build.yml -> ui-build.yml; artifact webui-build -> ui-build
- Rename webui-publish.yml -> ui-publish.yml; var HF_BUCKET_WEBUI_STATIC_OUTPUT -> HF_BUCKET_UI_STATIC_OUTPUT
- Rename server-webui.yml -> server-ui.yml; job webui-build/checks -> ui-build/checks
- Update server.yml: job/artifact refs webui-build -> ui-build
- Update release.yml: all webui-build/publish refs -> ui-build/publish; HF_TOKEN_WEBUI_STATIC_OUTPUT -> HF_TOKEN_UI_STATIC_OUTPUT
- Update server-self-hosted.yml: webui-build -> ui-build
- Update build-self-hosted.yml: HF_WEBUI_VERSION -> HF_UI_VERSION
- Rename webui-download.cmake -> ui-download.cmake (internal refs updated)
- Update labeler.yml: server/webui -> server/ui path label

* docs: update CODEOWNERS and server README docs

- Update CODEOWNERS: team ggml-org/llama-webui -> ggml-org/llama-ui, path /tools/server/webui/ -> /tools/ui/
- Update server README.md: CLI tables show --ui flags with deprecated --webui aliases
- Update server README-dev.md: "WebUI" -> "UI", paths updated to tools/ui/

* fix: Small fixes for UI build

* fix: CMake.txt syntax

* chore: Formatting

* fix: `.editorconfig` for llama-ui

* chore: Formatting

* refactor: Use `APP_NAME` in Error route

* refactor: Cleanup

* refactor: Single migration service

* make llama-ui a linkable target

* fix: UI Build output

* fix: Missing change

* fix: separate llama-ui npm build output into build/tools/ui/dist subfolder + use cmake npm build instead of downloading ui-build.yml artifacts in CI

* refactor: UI workflows cleanup

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
b9174
2026-05-16 02:02:40 +02:00
Sigbjørn Skjæret 49d1701bd2 ci : fix release symlinks (#23119) b9173 2026-05-16 01:09:28 +02:00
Omer Ozarslan 1348f67c58 webui: Use lowercase hash for HF checksum check (#23107) b9172 2026-05-15 19:38:16 +02:00