* ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128
Optimize the inner loop of ggml_vec_dot_q4_1_q8_1_generic using
WASM SIMD128 intrinsics, gated behind #ifdef __wasm_simd128__ so
non-wasm builds are completely unaffected.
Approach:
- single wasm_v128_load covers all 32 packed 4-bit weights
- nibbles unpacked via AND/SHR into two u8x16 registers
- widened to i16 before multiply (WASM SIMD has no i8*i8 instruction)
- 4x wasm_i32x4_dot_i16x8 calls accumulate all 32 element pairs
- horizontal reduce via 4x wasm_i32x4_extract_lane
Benchmark (node v25, emcc -O3 -msimd128, 64 blocks x QK8_1=32,
200k iterations):
| impl | ns/call | speedup |
|--------|---------|---------|
| scalar | 880.7 | 1.00x |
| simd | 257.8 | 3.42x |
Correctness verified against scalar reference across 10 random seeds
with exact output match.
* ggml: move q4_1_q8_1 WASM SIMD implementation to wasm backend
Relocate the SIMD128 implementation of ggml_vec_dot_q4_1_q8_1 to ggml/src/ggml-cpu/arch/wasm/quants.c to follow architecture-specific layout. Restore the generic implementation in ggml/src/ggml-cpu/quants.c.
Move for loop in the else block.
* ggml: use generic q4_1_q8_1 fallback in wasm backend
* server: avoid unnecessary checkpoint restore when new tokens are present
The pos_min_thold calculation unconditionally subtracts 1 to ensure at
least one token is evaluated for logits when no new tokens exist.
However, when the request contains new tokens beyond the cached prefix,
this -1 is overly conservative and may trigger an unnecessary checkpoint
restore.
Conditionally apply the -1 only when n_past >= task.n_tokens() (no new
tokens), avoiding redundant KV state restoration when there is actual
work to do.
* cont : add ref
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* webui: fix tool selector toggle/counter, key tools by stable identity
Key the disabled set, counts and toggles by a stable per-tool key
instead of bare function name, deduped from one canonical list. Per-tool
checkboxes become presentational (single row handler, no nested button),
category checkboxes drop the tristate (n/total carries partial). One
getEnabledToolsForLLM keeps normalized MCP schemas and dedupes by name.
* ui: use SvelteSet and SvelteMap for local tool collections to satisfy svelte/prefer-svelte-reactivity
The XCFramework generated by build-xcframework.sh creates a module map
that manually lists public headers.
That list can fall out of sync with the framework's Headers directory.
The module map is currently missing ggml-opt.h, which is present in the
framework headers. This can cause downstream Apple builds to fail with:
Include of non-modular header inside framework module 'llama'
Use the framework's Headers directory itself as the module map umbrella
instead of maintaining a manual header list. This makes all public headers
under the generated framework's Headers directory part of the llama module.
* tests : refactor test-save-load-state to accept token input
- Default prompt is now empty; when not provided, generate n_batch
random tokens (useful for models without a tokenizer)
- Tokenization happens once upfront; pass token vector to test functions
- generate_tokens prints token IDs instead of decoded pieces
- Use llama_model_get_vocab / llama_vocab_n_tokens API
- Upgrade log level from LOG_TRC to LOG_INF for visibility
Assisted-by: llama.cpp:local pi
* cont : use llama_tokens alias
* Start work on flash_attn refactor
* Refactor
* Split k/v quantization
* Refactor and abstract quantization logic for flash_attn and mul_mat
* Add quantization support to tile path
* formatting
* Move to functions, add a check
* Tidy up SYCL doc a bit
- Add explicit links to referenced items
- Fix spelling errors
Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>
* Correct documented default for GGML_SYCL_GRAPH
The default is ON, not OFF:
$ cmake -LAH -B build | grep GGML_SYCL_GRAPH
...
GGML_SYCL_GRAPH:BOOL=ON
Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>
* Move docker instructions from SYCL.md to docker.md
This makes them directly accesible from the Quick Start section
of the top-level README.md.
Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>
* Refer to intel.Dockerfile for ARGs and their defaults
The defaults are always changing; this avoids accuracy errors
from duplicating the information.
Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>
* Remove mention of Nvidia in SYCL row of backend table
This support was removed in 2026.02 - refer to the SYCL.md News.
Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>
---------
Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>
* Removes __restrict__ from PDL kernel headers due to incompatibility with
PDL. Adds preprocessor directives based on arch in kernel body to add
__restrict__ to retain performance on older architectures.
* Simplifies new __restrict__ usage via macro
* Add hopper to PDL __restrict__ fix.
Co-authored-by: Oliver Simons <osimons@nvidia.com>
---------
Co-authored-by: Oliver Simons <osimons@nvidia.com>
* cuda: reserve space for quantize kv-cache at startup
* address review comments
* remove forward decl
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* remove assert in ggml-cuda.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Add support for the ibm-granite/granite-embedding-{97m,311m}-multilingual-r2 embedding models:
* Added a version of the gpt4o tokenizer that has a fixed regex (better handling of marks), and different token merging setting for the 97m model
* Reused gemma4 tokenizer for the 311m model
* granite-embedding-*-multilingual-r2 : add support SwiGLU FFN for Granite Embedding Multilingual R2
* added new GGUF key <arch>.hidden_activation (LLM_KV_HIDDEN_ACT) + writer
* added a forward declaration of llm_ffn_op_type to llama-hparams.h
* added llm_ffn_op in hparams
* added LLM_FFN_NONE = 0 sentinel to llm_ffn_op_type (value-initialization), modern-bert: explicitly assigns LLM_FFN_GEGLU before reading GGUF (unchanged).
* centralized hidden_act mapping in llama-model.cpp, added llm_ffn_op_type_from_string() helper, mirroring rope_scaling_type/llama_rope_scaling_type_from_string()
* modern-bert reads the GGUF key (when present) and uses the resulting op in its FFN graph
* Added granite-embedding-{97m,311m}-multilingual-r2 to the converter code
* Added the hashes for the granite embedding multilingual R2 models
* Set the hidden_activation in the GGUF if the field is present in config.json (such as for the granite embedding models)
* common : fix state save in common_prompt_batch_decode
This commit addresses a bug in common_prompt_batch_decode that affects
the session state store/restore in completion.cpp and
save-load-state.cpp.
The motivation for this is that currently the code is saving n-1 tokens
in both the session_tokens and in the KV cache. Then when loading the
session tokens, and if the prompt matches, it would replay the last
saved token (n-1) into the next position, effectively replaying the
same token in the wrong position.
The fix is to store all n tokens in session_tokens, while the memory
state only reflects n-1 processed tokens as the saving happens before
the last token is decoded in common_prompt_batch_decode.
I ran both completion.cpp and save-load-state.cpp with a transformer, a
recurrent, and a hybrid model.
Resolves: https://github.com/ggml-org/llama.cpp/issues/23400
Co-authored-by: fairydreaming <166155368+fairydreaming@users.noreply.github.com>
Reduce the number of parallel jobs in server-self-hosted.yml by stacking
test configurations as sequential steps within a single job, following the
pattern from #23927.
- server-metal: 4 matrix jobs -> 1 job with 4 sequential test steps
- server-cuda: 2 matrix jobs -> 1 job with 2 sequential test steps
- server-kleidiai: removed unnecessary single-entry matrix
- removed unused Setup Node.js step from server-metal
Total: 7 parallel jobs -> 3 parallel jobs
Assisted-by: llama.cpp:local pi
Previously error to string conversion was split in two different files,
with one converting errors into strings, and another function analyzing
those strings to generate yet another string.
Now the the error handling for network fetches has been centralised and
uses directly HTTP error codes whereas possible to generate the
human-readable error strings.
It also fixes an issue where all JSON errors reported from the backend,
such as "Invalid API key", would get turned incorrectly in to
"Failed to connect to server" due to poor matching logic in the
now-gone getErrorMessage function.
* feat: Add "Thinking" toggle and status icon + redesign Chat Form Actions Add panel
* test: Update test reference
* fix: Icon
* fix: E2E test command
* fix: wait for greeting h1 to be visible in e2e test
* fix: remove duplicate PDF option in attachment dropdown
* fix: use label-based group toggle to avoid stale references
* refactor: inline MCP server and tool toggles in mobile sheet
* fix: serve correct build directory in e2e playwright config
* feat: add reasoning effort levels selector in model dropdown
* feat: Reasoning effort
* refactor: Make server origin configurable via environment variable
* feat: Add chat template thinking detector utility
* feat: Add thinking support detection to models store
* refactor: Update model selector components with thinking detection and message-specific indicators
* feat: Update chat form components for model selection and thinking support
* feat: Improve Reasoning controls UI
* refactor: Apply suggestions from code review
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* fix: Model tags
* refactor: Cleanup
* refactor: Remove unneeded components
* refactor: Cleanup
* hex-mm: initial support for F32 * F32 -> F32 matmuls
* hex-rms-norm: fix src1 stride use in fused rms_norm_mul
* hex-ops: clear spad pointers in the ops that clober it
This fixes an odd case where fused rms-norm-mul was failing but only in qwen3.5-2B and only at searth op-bath sizes.
* hmx-mm: add support for F32 * F32 -> F32 matmul_2d on HMX
Decided to use Q4_0 * F32 -> F32 matmul for this.
Q4_0 gets dequantized and tiled into F16, and here we quantize and tile F32 into F16.
Super simple and pretty efficient.
* hmx-mm: route f16 2D matmuls through the same kernel used for all other types
* hmx-mm: re-introduce pipelined vs non-pipelined mode that we used to have but is much more generic way
This update futher improves matmul performance and at the same time removes most of the redudant logic
we had in different paths.
* hmx-fa: slighlty improved pipeline simimar to matmul updates
* hmx-mm: initial version of MAT_MUL_ID support for HMX
* hmx-mm: fixed mxfp4 handling for MUL_MAT_ID
* hex-gdn: optimize GATED_DELTA_NET
DMA prefetch/double-buff, vectorize everything with HVX, in other words -- the usual :)
* hmx-mm: missed one more case where we can use fastmod
* hexagon: update DCVS settings for a slight perf bump
* hmx-fa: use fastdiv in hmx-flash-attn
* hmx-fa: precompute slope values to avoid disrupting the inner loop
* hvx-utils/fa: new HVX helpers for powf and logf and using those to speed up FA alibi
* hex-ops: fixed a bug in fusion logic that was messing up the order of the src tensors when some srcs are empty
* hex-fa: correctly fallback to HVX if we have sinks or the dims are not quite right
* server: real-time reasoning interruption via control endpoint
Builds on the manual reasoning budget trigger from #23949. Adds a
CONTROL task that mirrors the CANCEL path on the live slot and calls
common_sampler_reasoning_budget_force to end thinking mid-generation.
POST /v1/chat/completions/control with { id_slot, action }, opt-in
reasoning_control arms the budget sampler on demand. Router and single
model. Minimal WebUI button as a skeleton for further UI work.
* ui: track reasoning phase via explicit streaming state
Add isReasoning to the chat store, mirroring the isLoading pattern:
per conversation map, private setter, public accessor and reactive
export. Set from the stream callbacks, true on reasoning chunks, false
on the first content chunk, reset on stream end and resynced on
conversation switch. The skip button now keys off isReasoning so it
shows only during the thinking phase, not the whole generation.
* ui: extract control endpoint and action into constants
Move the chat completion routes, the slots route and the reasoning
control action out of chat.service into api-endpoints and a dedicated
control-actions module. No behavior change, drops the magic strings so
the control protocol has a single source of truth.
* server: target reasoning control by completion id
Address @ngxson review on the control endpoint.
Switch from id_slot to the chat completion id to avoid a TOCTOU: the
slot can be reassigned between the lookup and the control request, so
matching the live completion (oaicompat_cmpl_id) is safe and a finished
one simply matches nothing. Rename the action to reasoning_end, guard
it on the reasoning_control flag of the target slot, and reduce the
response to {success} with an optional message.
* ui: target reasoning control by completion id
Keep the streamed completion id on the message and post it back to the
control endpoint instead of probing /slots. Drops the slot discovery
and the TOCTOU that came with it. Action renamed to reasoning_end,
response read as {success}.
* server: address review from @ngxson
Move the control fields into task_params and drop the redundant
comments on the control path.
* server: document the reasoning control endpoint
* Update tools/ui/src/lib/types/database.d.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* ui: rename cmplId to completionId
Per @allozaur review, clearer name for the streamed completion id.
* ui: wire completion id capture through the agentic flow
The webui streams through the agentic flow, which relayed onModel but
not onCompletionId, so the completion id never reached the message and
the control request was never sent. Relay it through the flow and its
callbacks type, declare id on the chunk type, and log an explicit error
when the button fires without a usable id.
* ui: target reasoning control model from the message
The model is a property of the completion, so read it from the streaming
message like the id, not from the model dropdown which is unrelated UI
state. Makes the request self-consistent by construction instead of just
unlikely to drift.
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* speculative : add common_speculative_n_max helper function
Extract the speculative max-draft-size logic from server_n_outputs_max
into a reusable common_speculative_n_max() function in common/speculative.
Assisted-by: llama.cpp:local pi
* cont : draft context always has n_parallel outputs
* llama : log n_outputs_max
* speculative : remove draft-simple auto-enable
* ci : enable server tests on PRs
* nix: add nix-nodejs facilities to build Web UI
Build the Web UI locally using standard Nix systems for building NodeJS
packages.
- Create derivation for the web UI
- npm dependencies are imported via buildNodeModules. Does not require
setting any shasum.
- Copy build artifacts to the correct folders.
- Prevents having to download from huggingface.co
Fixes#23067
* nix: simplify webui derivation using LLAMA_UI_OUT_DIR
- Move npm build to installPhase with LLAMA_UI_OUT_DIR=$out to write
output directly to the Nix store
- Copy built assets to tools/ui/dist (source tree) instead of
build/tools/ui/dist so CMake's copy_src_dist() finds them