* cpu: add GGML_OP_COL2IM_1D
Add the overlap-add (scatter-add) step of a 1D transposed convolution.
A ConvTranspose1d factorizes as a GEMM followed by col2im: a weight
pre-permuted to [IC, K*OC] is contracted against the [IC, T_in] input
with mul_mat to produce a column matrix [K*OC, T_in], and col2im_1d
scatters those columns back into the [T_out, OC] signal, with
T_out = (T_in - 1)*s0 + K - 2*p0.
Keeping the contraction as a plain mul_mat leaves the heavy work on the
optimized (and quantizable) matmul kernels, so col2im_1d only does the
cheap overlap-add.
CPU uses a gather formulation parallelized over output channels,
supporting F32, F16 and BF16 with an F32 accumulator.
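As a rough illustration of the gather formulation described above, the sketch below computes each output sample by summing the column entries that land on it. The name `col2im_1d_ref`, the `k*OC + oc` ordering inside a column and the flat F32 layout are assumptions made for this example, not the actual kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Gather-style col2im_1d reference (illustrative sketch, not the ggml kernel).
// Assumed layouts:
//   col[t_in*(K*OC) + k*OC + oc]   column matrix [K*OC, T_in]
//   dst[oc*T_out + t_out]          output signal [T_out, OC]
std::vector<float> col2im_1d_ref(const std::vector<float> & col,
                                 int64_t T_in, int64_t K, int64_t OC,
                                 int64_t s0, int64_t p0) {
    assert(T_in > 0 && K > 0 && OC > 0 && s0 > 0 && p0 >= 0);
    assert(col.size() == static_cast<size_t>(K*OC*T_in));

    const int64_t T_out = (T_in - 1)*s0 + K - 2*p0;
    assert(T_out > 0);

    std::vector<float> dst(static_cast<size_t>(OC*T_out), 0.0f);
    for (int64_t oc = 0; oc < OC; ++oc) {
        for (int64_t t = 0; t < T_out; ++t) {
            const int64_t tp = t + p0;       // position in the uncropped signal
            float acc = 0.0f;                // F32 accumulator
            for (int64_t k = 0; k < K; ++k) {
                const int64_t d = tp - k;    // must equal t_in*s0 for some t_in
                if (d < 0 || d % s0 != 0) {
                    continue;
                }
                const int64_t ti = d / s0;
                if (ti >= T_in) {
                    continue;
                }
                acc += col[static_cast<size_t>(ti*(K*OC) + k*OC + oc)];
            }
            dst[static_cast<size_t>(oc*T_out + t)] = acc;
        }
    }
    return dst;
}
```

Because every output sample is written by exactly one loop iteration, the outer loops can be split across threads without atomics, which is the property a gather formulation is usually chosen for.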
* tests: add backend coverage for GGML_OP_COL2IM_1D
Add test_col2im_1d next to the conv_transpose_1d cases, covering F32,
F16 and BF16 across eight geometries: the canonical kernel = 2*stride
DAC upsampling shape, overlap, no overlap, cropping (p0 = 1 and
p0 = stride/2), kernel < stride with zeroed gaps, kernel not a
multiple of stride, and a single column unfold.
Perf mode gets three real vocoder stage shapes reporting memory
bandwidth. max_nmse_err relaxes to 5e-4 for F16 and BF16.
* cpu: harden GGML_OP_COL2IM_1D
ggml_col2im_1d validates s0, oc, p0 and input contiguity at graph
build time, before the oc division, protecting every backend at once.
The kernel asserts the contiguity its flat indexing assumes and its
doc states the full output length including the crop term.
The kernel parallelizes over the time axis: the split stays balanced
down to OC = 1, where the previous channel split was single threaded.
Values are bit identical on the three real vocoder chains, and two of the
three improve.
* tests: extend the GGML_OP_COL2IM_1D grid
The eval grid grows to eleven geometries: OC = 1 (mono output stage),
K = 1 with stride > 1 (sparse scatter, every gap position zeroed) and
a crop down to T_out = 2 where all the gather bounds act at once.
* tests: add col2im_1d equivalence test
tests/test-col2im-1d.cpp proves mul_mat + col2im_1d matches the
native ggml_conv_transpose_1d on the CPU backend, F32 bit exact, F16
and BF16 through casts of the column matrix. test-backend-ops cannot
cover this for a CPU only op since the CPU backend is its own
reference there.
* rpc: bump protocol patch version for GGML_OP_COL2IM_1D
GGML_OP_COUNT goes from 96 to 97 with the new op, which trips the
static_assert in ggml-rpc.h. Bump RPC_PROTO_PATCH_VERSION since the
op is appended and no existing op code shifts.
This allows vec4 loads of the B elements. Also increase BK to 64 when this is
enabled. Neither of these alone is consistently faster, but together these give
a nice speedup.
In ggml-vulkan.cpp, we need to make sure the B matrix alignment and stride are
multiples of 4.
* tests : refactor test-save-load-state to accept token input
- Default prompt is now empty; when not provided, generate n_batch
random tokens (useful for models without a tokenizer)
- Tokenization happens once upfront; pass token vector to test functions
- generate_tokens prints token IDs instead of decoded pieces
- Use llama_model_get_vocab / llama_vocab_n_tokens API
- Upgrade log level from LOG_TRC to LOG_INF for visibility
Assisted-by: llama.cpp:local pi
* cont : use llama_tokens alias
* Start work on flash_attn refactor
* Refactor
* Split k/v quantization
* Refactor and abstract quantization logic for flash_attn and mul_mat
* Add quantization support to tile path
* formatting
* Move to functions, add a check
* common : fix state save in common_prompt_batch_decode
This commit addresses a bug in common_prompt_batch_decode that affects
the session state store/restore in completion.cpp and
save-load-state.cpp.
The motivation is that the code currently saves n-1 tokens in both
session_tokens and the KV cache. When the session tokens are loaded and
the prompt matches, the last saved token (n-1) is replayed into the
next position, effectively replaying the same token in the wrong
position.
The fix is to store all n tokens in session_tokens, while the memory
state only reflects n-1 processed tokens as the saving happens before
the last token is decoded in common_prompt_batch_decode.
I ran both completion.cpp and save-load-state.cpp with a transformer, a
recurrent, and a hybrid model.
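The invariant described above can be pictured with a toy model. The types and function names below (`saved_session`, `save_before_last_decode`, `resume_pos`) are hypothetical and do not correspond to the real code in common or completion.cpp; the sketch only shows how n stored tokens and n-1 cached positions line up on restore.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of a saved session (hypothetical, for illustration only).
struct saved_session {
    std::vector<int> tokens;    // all n prompt tokens
    size_t           n_cached;  // positions present in the memory state: n-1
};

// The state is saved before the last prompt token is decoded.
saved_session save_before_last_decode(const std::vector<int> & prompt) {
    assert(!prompt.empty());
    return { prompt, prompt.size() - 1 };
}

// Position at which decoding resumes after a restore: the matching prefix,
// limited to what the memory state actually holds.
size_t resume_pos(const saved_session & s, const std::vector<int> & prompt) {
    size_t n_match = 0;
    while (n_match < s.tokens.size() && n_match < prompt.size() &&
           s.tokens[n_match] == prompt[n_match]) {
        ++n_match;
    }
    return std::min(n_match, s.n_cached);
}
```

With a fully matching prompt of n tokens, `resume_pos` returns n-1, so the last token is decoded at its own position rather than one slot later.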
Resolves: https://github.com/ggml-org/llama.cpp/issues/23400
Co-authored-by: fairydreaming <166155368+fairydreaming@users.noreply.github.com>
* ci : skip release workflow on master when commit message contains [no release]
Assisted-by: llama.cpp:local pi
* ci : restrict sanitizer builds to x86_64 + fix build type
the spark is apparently too slow for some reason
* tests : fix undefined warning
[no ci]
Create a pool of N threads that grab a chunk of up to 100 tests at a time to
iterate through. The number of tests at a time decreases as fewer remain.
Each thread uses its own dev and cpu backend, and set_n_threads_fn is not
called on the cpu backend.
Fix some TSAN issues that arose:
- In init_tensor_uniform, don't use static vector of generators.
- Replace gmtime with versions that don't use a global variable.
- Mutex calls to print_test_result.
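A minimal sketch of the chunked thread pool described above, using only the standard library. The shrink policy (remaining tests divided by the thread count, capped at 100) and the names `chunk_queue` and `run_all` are assumptions; the commit only states that chunks hold up to 100 tests and get smaller as fewer remain.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hands out [begin, end) ranges of test indices under a mutex.
class chunk_queue {
public:
    chunk_queue(size_t n_tests, size_t n_threads, size_t max_chunk = 100)
        : next_(0), total_(n_tests),
          n_threads_(std::max<size_t>(1, n_threads)),
          max_chunk_(std::max<size_t>(1, max_chunk)) {}

    // Returns false once no tests remain.
    bool grab(size_t & begin, size_t & end) {
        std::lock_guard<std::mutex> lock(mtx_);
        if (next_ >= total_) {
            return false;
        }
        const size_t remaining = total_ - next_;
        // assumed policy: an even share of what is left, at least 1, at most max_chunk_
        const size_t chunk = std::min(max_chunk_, std::max<size_t>(1, remaining / n_threads_));
        begin = next_;
        end   = next_ + chunk;
        next_ = end;
        return true;
    }

private:
    std::mutex mtx_;
    size_t next_, total_, n_threads_, max_chunk_;
};

// Runs run_test(i) for every i in [0, n_tests) and returns how many tests ran.
size_t run_all(size_t n_tests, size_t n_threads, const std::function<void(size_t)> & run_test) {
    chunk_queue queue(n_tests, n_threads);
    std::mutex  print_mtx;  // stands in for the mutex around print_test_result
    size_t      done = 0;

    std::vector<std::thread> workers;
    for (size_t t = 0; t < n_threads; ++t) {
        workers.emplace_back([&] {
            size_t b = 0, e = 0;
            while (queue.grab(b, e)) {
                for (size_t i = b; i < e; ++i) {
                    run_test(i);
                    std::lock_guard<std::mutex> lock(print_mtx);
                    ++done;  // shared result reporting is serialized
                }
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
    return done;
}
```

Shrinking the chunk near the end keeps one thread from being left with a large tail while the others sit idle.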
* ggml: implement `gguf_init_from_buffer`
* test: `gguf_init_from_buffer`
* fix: memory breakdown for a model loaded with `no_alloc` from a file is consistent with being loaded from a buffer
* fix: use `GGML_UNUSED`
Co-authored-by: Copilot <copilot@github.com>
* fix: remove `total_size` from `gguf_reader`
* fix: file offset calculation, rename `offset` to `data_offset`
Co-authored-by: Copilot <copilot@github.com>
* refactor: extract model loader bug fixes to another PR
* feat: add `gguf_init_from_callback`
* fix: always require a max expected size
* fix: change `gguf_reader_callback_t`'s `output` type to `void *`, change `max_expected_size` and offsets to `uint64_t`
* fix: harden against offset overflow in buffer read
* fix: remove seek behavior from the callback
* feat: `max_chunk_read == 0` means `SIZE_MAX`
* fix: seeking in a gguf file with no tensors
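The overflow hardening for buffer reads mentioned above typically comes down to a bounds check that never computes `offset + len`. The helper below is a hypothetical sketch of that pattern; `read_at` and its parameters are not part of the gguf API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies len bytes starting at buf[offset] into dst.
// Rejects requests that fall outside the buffer, including ones where
// offset + len would wrap around in 64-bit arithmetic.
bool read_at(const uint8_t * buf, uint64_t buf_size, uint64_t offset, void * dst, uint64_t len) {
    assert(buf != nullptr);
    // buf_size - offset cannot underflow because offset <= buf_size is checked first
    if (offset > buf_size || len > buf_size - offset) {
        return false;
    }
    if (len > 0) {
        std::memcpy(dst, buf + offset, static_cast<size_t>(len));
    }
    return true;
}
```

The naive check `offset + len <= buf_size` would accept a huge `len` whose sum wraps to a small value, which is the case the subtraction form rules out.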
---------
Co-authored-by: Copilot <copilot@github.com>
* common : add common_chat_split_by_role
* cont : fix spans to reach end of message
* server: fix checkpoints creation
- extract message_spans from chat templates
- find the prompt token position before the latest user message
- split prompt batching at that position
- create a context checkpoint before the latest user input
- avoid periodic mid-prompt checkpoints when that position is known
- handle multimodal prompts when mapping text/template positions to server prompt tokens
- add --checkpoint-min-step to control minimum spacing between checkpoints
* cont : clean-up
* Support autoparser detection for message barriers
* server: fix message span delimiter and update docs
---------
Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
* tests : move save-load-state from examples to tests
- Move examples/save-load-state/ to tests/test-save-load-state.cpp
- Remove subdirectory reference from examples/CMakeLists.txt
- Add test to tests/CMakeLists.txt as a model test
- Remove CODEOWNERS entry for removed example directory
Assisted-by: llama.cpp:local pi
* cont : update ci
* metal : fix GGML_OP_SET kernel threads
* tests : extend test_cpy to support different src/dst shapes
Extend test_cpy to support different source and destination tensor shapes
for CPY operations (reshaping), where the total number of elements must match.
- Renamed ne -> ne_src, added ne_dst parameter (default: use src shape)
- Added 50 new reshaping test cases covering 1D<->2D<->3D<->4D conversions
- Tests exercise 1024 boundary, small shapes, and large dimensionality changes
- Fixed dangling reference bug (storing & to temporary std::array)
- Updated all existing test calls with permute/transpose args for compatibility
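The dangling reference fix can be illustrated with a reduced test-case struct. `cpy_case` below is a hypothetical stand-in, not the real test_cpy, and the negative sentinel used to mean "reuse the src shape" is an assumption made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Reduced stand-in for a CPY test case (hypothetical, for illustration only).
struct cpy_case {
    // Stored by value, so the shapes live as long as the case does.
    // A member declared as `const std::array<int64_t, 4> & ne_src;` would bind to the
    // caller's temporary and dangle once the constructor call has finished.
    const std::array<int64_t, 4> ne_src;
    const std::array<int64_t, 4> ne_dst;

    // Assumed convention: a negative first dimension in dst means "use the src shape".
    explicit cpy_case(std::array<int64_t, 4> src,
                      std::array<int64_t, 4> dst = {-1, -1, -1, -1})
        : ne_src(src), ne_dst(dst[0] < 0 ? src : dst) {
        // a reshaping copy must preserve the total number of elements
        assert(nelements(ne_src) == nelements(ne_dst));
    }

    static int64_t nelements(const std::array<int64_t, 4> & ne) {
        return ne[0]*ne[1]*ne[2]*ne[3];
    }
};
```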
Assisted-by: llama.cpp:local pi
* metal : optimize concat kernel with row batching for small widths
When ne0 < 256, batch multiple rows into a single threadgroup to improve
occupancy. This avoids underutilizing the GPU when processing narrow tensors.
- Dispatch nth = min(256, ne0) threads per group
- Calculate nrptg (rows per threadgroup) to fill up to 256 threads
- Update kernel index calculation to handle the row batching
- Add boundary check for i1 >= ne1
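The dispatch arithmetic in the list above can be sketched on the host side as follows. The struct and function names are hypothetical, and the exact `nrptg` formula (256 divided by `nth`, at least 1) is an assumption; the commit only says rows are batched to fill up to 256 threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct concat_dispatch {
    int64_t nth;    // threads assigned to one row
    int64_t nrptg;  // rows per threadgroup
    int64_t ntg;    // threadgroups needed along the row axis
};

// ne0: row width, ne1: number of rows.
concat_dispatch plan_concat(int64_t ne0, int64_t ne1) {
    assert(ne0 > 0 && ne1 > 0);
    const int64_t nth   = std::min<int64_t>(256, ne0);
    const int64_t nrptg = std::max<int64_t>(1, 256 / nth);  // assumed: pack rows up to 256 threads
    const int64_t ntg   = (ne1 + nrptg - 1) / nrptg;         // round up, last group may be partial
    return { nth, nrptg, ntg };
}

// Maps (threadgroup index, row slot inside the group) to a row index.
// Returns false for slots past the last row, mirroring the `i1 >= ne1` check.
bool row_for_slot(const concat_dispatch & d, int64_t tg, int64_t slot, int64_t ne1, int64_t & i1) {
    i1 = tg * d.nrptg + slot;
    return i1 < ne1;
}
```

Since `ntg` is rounded up, the last threadgroup can contain slots with no row behind them, which is why the boundary check is needed once rows are batched.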
Assisted-by: llama.cpp:local pi
* tests : clean-up
* tests : refactor CPY shape tests to use dimension permutations
Replace 75 hardcoded test cases with a loop over permutations of
{3, 5, 7, 32} (total elements: 3360). Each src permutation is tested
against canonical sorted and reverse dst, skipping identical shapes.
Covers F32, F16, and Q4_0 (when both src and dst ne0 == 32).
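A small sketch of how such a permutation loop can be written with `std::next_permutation`. The helper names are hypothetical and the real test code may be organized differently.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using shape4 = std::array<int64_t, 4>;

// Every permutation of {3, 5, 7, 32} as src, paired with the sorted and the
// reversed shape as dst, skipping pairs where src and dst are identical.
std::vector<std::pair<shape4, shape4>> make_cpy_shape_pairs() {
    shape4 src = {3, 5, 7, 32};  // starts sorted, so next_permutation visits all 24 orders
    const shape4 dst_sorted  = {3, 5, 7, 32};
    const shape4 dst_reverse = {32, 7, 5, 3};

    std::vector<std::pair<shape4, shape4>> pairs;
    do {
        for (const shape4 & dst : { dst_sorted, dst_reverse }) {
            if (src != dst) {
                pairs.emplace_back(src, dst);
            }
        }
    } while (std::next_permutation(src.begin(), src.end()));
    return pairs;
}

// Q4_0 packs 32 values per block along ne0, so both shapes need ne0 == 32.
bool q4_0_eligible(const shape4 & src, const shape4 & dst) {
    return src[0] == 32 && dst[0] == 32;
}
```

Under these assumptions the loop yields 24 * 2 - 2 = 46 shape pairs per type, and every shape holds 3 * 5 * 7 * 32 = 3360 elements.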
Assisted-by: llama.cpp:local pi
* hexagon: remove gathers and better handling of vtcm in ssm-conv
* hexagon: relax ssm-conv gating requirements
* hexagon: add new prefill ssm-conv backend test
* hexagon: remove trailing white space
* hex-rope: uninline rope_cache_init, otherwise it breaks after rebasing with SSM_CONV changes
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* common : delegate assistant continuation to template handler
* server : implement echo parameter to exclude assistant prefill in the response
* server : fix tests for prefill
* server : use existing llama template
* cont : clean up
* spec: support MTP
* fix batch size
* rename files
* cont : simplify (#7)
* MTP: clean-up (#9)
* MTP: clean-up
* review: use llama_context_type instead of llama_graph_type
* review: remove llama_model_has_mtp
* review: fix convert issues
* convert: fix pycheck
* review: formatting
* use `mtp-` for identifying mtp models
* convert: fix mtp conversion
* mtp -> draft-mtp
* remove unused llama_arch
* add need_embd in speculative
* llama: allow partial seq_rm for GDN models for speculative decoding
Currently, speculative checkpointing needs to restart from a checkpoint
when some draft tokens are not accepted, which wastes work by running
the target again. This PR adds the ability to roll back up to
`draft_max` by storing the GDN intermediates.
* fix pending state
* vulkan: add GDN partial rollback
* meta: extend check to axis 1
* metal: add GDN partial rollback
Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.
- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior
Ref: https://github.com/ggml-org/llama.cpp/commit/8c05923630110223669f069af2000e9cf10c02bc
Assisted-by: llama.cpp:local pi
* delta_net_base: use ggml_pad instead of new_tensor
* review: add need_rs_seq
* review: rename part_bounded to n_rs
* review: deslop comments
* review: rename, add asserts
* server : adjust checkpoint logic (#11)
* server : adjust checkpoint logic
* cont : rm asserts
* server-context: fix early exit
* spec : fix compatibility with n-gram and add TODOs (#13)
* metal : cleanup
* llama : fix faulty bitwise check in recurrent memory
* server : disable RS-based MTP in combination with other spec types
* spec : add TODOs
* cont : fix comment
* cont : update comment
* common : fix logic for ngram + mtp compat
* llama-memory: enable checkpointing with partial rollback
* cont: add test-case for loading into a dirty ctx
* llama-memory-recurrent: clear rs_idx in clear
* download: fix mtp path
* llama-arch: fix enorm op
* docs: update docs
* conversion: fix type annotations
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The MUL_MAT test loop iterates over base_types[] to generate non-contig
permutation cases (3 standard permutations across n in {1, 8, 16}).
BF16 is absent from base_types[], so these 9 cases were never generated
for BF16 even though every other type covered by base_types[] tests them.
Add the missing 9 cases explicitly: BF16 x F32, m=16, k=256, bs=[2,3],
permutations {0,2,1,3}, {0,1,3,2}, {0,3,2,1}, with n in {1, 8, 16}.
Suggested-by: @jeffbolznv
* Support for Codex CLI by skipping unsupported Responses tools
* Warn on skipped Responses tools and preserve gpt-oss apply_patch rejection
* Revert gpt-oss apply_patch special handling
* unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests
- Add unicode_regex_split_custom_qwen35() to [src/unicode.cpp](src/unicode.cpp), a non-backtracking handler for Qwen3.5's [\p{L}\p{M}]+ regex (letters + combining marks).
- Register the handler in the custom tokenizer dispatch table to prevent stack overflows on long inputs (fixes #21919).
- Add [models/ggml-vocab-qwen35.gguf](models/ggml-vocab-qwen35.gguf) (test vocab), [models/ggml-vocab-qwen35.gguf.inp](models/ggml-vocab-qwen35.gguf.inp) (test cases), and [models/ggml-vocab-qwen35.gguf.out](models/ggml-vocab-qwen35.gguf.out) (expected output) for regression testing.
- Update [tests/CMakeLists.txt](tests/CMakeLists.txt) to include the new test entry.
This mirrors the Qwen2 fix (commit 0d049d6), but adapts for Qwen3.5's regex. Ensures robust Unicode tokenization and prevents std::regex stack overflows.
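To show what "non-backtracking" means for a pattern such as `[\p{L}\p{M}]+`, here is a single-pass scanner over code points. It is a simplified sketch, not the actual `unicode_regex_split_custom_qwen35()`: it handles only the letter/mark run and treats every other code point as its own piece, and `is_letter_or_mark` is an ASCII-plus-combining-marks stand-in for real Unicode category tables.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for a \p{L} / \p{M} lookup: ASCII letters plus the Combining
// Diacritical Marks block U+0300..U+036F. A real handler would consult
// full Unicode category tables.
static bool is_letter_or_mark(uint32_t cp) {
    return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z') ||
           (cp >= 0x0300 && cp <= 0x036F);
}

// One forward pass with constant extra state: no recursion, no backtracking.
// Returns the length, in code points, of each piece.
std::vector<size_t> split_letter_runs(const std::vector<uint32_t> & cpts) {
    std::vector<size_t> lens;
    size_t i = 0;
    while (i < cpts.size()) {
        size_t j = i + 1;
        if (is_letter_or_mark(cpts[i])) {
            while (j < cpts.size() && is_letter_or_mark(cpts[j])) {
                ++j;  // extend the current letter/mark run
            }
        }
        lens.push_back(j - i);
        i = j;
    }
    return lens;
}
```

Each code point is visited once and the scanner uses an explicit loop, so stack usage does not grow with input length, unlike a recursive regex matcher.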
Closes #21919.
* fix: enhance regex handling for Qwen3.5 tokenizer to include accent marks
* cont : remove trailing whitespace
---------
Co-authored-by: Kabir <kabir@example.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
* cuda: tighten snake fusion type checks for all operands (defensive, sync vulkan)
* cuda: reject snake fusion when ne[2] or ne[3] > 1 (mirror vulkan PR review)
* cuda: merge type_ok and types_ok into a single types_ok (address am17an review)
* cuda: filter ADD/SUB/MUL/DIV in supports_op to F32/F16
bin_bcast only dispatches F32/F16 type triplets, mirror the
vulkan filter so unsupported types fall back through cpy
instead of aborting.
* test-backend-ops: extend snake_fuse to rank-4 with ne[2]/ne[3] > 1 cases
* mimo-v2.5: add flash attention mma/tiles for d_kq=192 d_v=128
* mimo-v2.5: follow (256, 256) fattn templates
* mimo-v2.5: cleanup comments
* mimo-v2.5: further comment cleanup
* mimo-v2.5: address PR feedback
fix GQA handling
check for other dangling 320/576 carveouts and mirror them for 192
Add to backend ops test so new paths are covered
* cuda: fuse snake activation (mul, sin, sqr, mul, add)
Add ggml_cuda_op_snake_fused with F32 / F16 / BF16 templates. The
matcher recognizes the naive 5 op decomposition emitted by audio
decoders (BigVGAN, Vocos) for snake activation
y = x + sin(a*x)^2 * inv_b and rewrites it to a single elementwise
kernel.
Add test_snake_fuse comparing CPU naive vs CUDA fused across
F32 / F16 / BF16.
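For reference, the two forms being compared can be written on the CPU as below. This is an illustrative F32 sketch with scalar `a` and `inv_b` for brevity (the real operands would be tensors), and the function names are hypothetical; it is not the CUDA kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive form: five elementwise passes (mul, sin, sqr, mul, add), each with its own temporary.
std::vector<float> snake_naive(const std::vector<float> & x, float a, float inv_b) {
    const size_t n = x.size();
    std::vector<float> ax(n), s(n), s2(n), scaled(n), y(n);
    for (size_t i = 0; i < n; ++i) { ax[i]     = a * x[i];          }  // mul
    for (size_t i = 0; i < n; ++i) { s[i]      = std::sin(ax[i]);   }  // sin
    for (size_t i = 0; i < n; ++i) { s2[i]     = s[i] * s[i];       }  // sqr
    for (size_t i = 0; i < n; ++i) { scaled[i] = s2[i] * inv_b;     }  // mul
    for (size_t i = 0; i < n; ++i) { y[i]      = x[i] + scaled[i];  }  // add
    return y;
}

// Fused form: y = x + sin(a*x)^2 * inv_b in a single pass with no temporaries.
std::vector<float> snake_fused(const std::vector<float> & x, float a, float inv_b) {
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        const float s = std::sin(a * x[i]);
        y[i] = x[i] + s * s * inv_b;
    }
    return y;
}
```

The fused form reads and writes each element once instead of five times, which is where an elementwise fusion of this kind is expected to save memory traffic.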
* cuda: address review feedback from @am17an
Use ggml_cuda_cast for F32/F16/BF16 conversions and rename
kernel_snake to snake_kernel to match upstream conventions.
* cuda: snake fusion fastdiv on T_len, Suggested-by: @am17an
* Update tests/test-backend-ops.cpp
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* cuda: snake fusion check add->type matches x->type
Address review feedback from @am17an
* cuda: snake fusion check add->type matches x->type
Moved for readability (equivalent)
Address review feedback from @am17an
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* CUDA: batch out_prod inner loop with cublasSgemmStridedBatched
* CUDA: add cublasSgemmStridedBatched mapping for HIP and MUSA backends
* chat/autoparser: the fixes
* Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls.
* Trim whitespace on apply instead
* vulkan: Support asymmetric FA in coopmat2 path
There has been some recent interest/experimentation with mixed quantization
types for FA. I had originally designed the cm2 FA shader with this in mind
(because I didn't realize it wasn't supported at the time!). This change
adds the missing pieces and enables it.
Also support Q1_0 since people have been trying that out (seems crazy, but
who knows).
We should be able to do similar things in the coopmat1/scalar path, but
there's another change open against the scalar path and I don't want to
conflict.
* reorder cases
* Changed to leak logger singleton to prevent hanging on Windows
* Fix comment
* Stopped using static vector
Using a static std::vector causes g_col to be destroyed before the logger thread exits, so the logger thread touches freed memory and crashes.
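The general pattern behind leaking a singleton to avoid this kind of shutdown-order problem looks roughly like the sketch below. The `logger` class here is a hypothetical minimal example, not the actual common logger.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <vector>

class logger {
public:
    // Intentionally leaked: allocated on first use and never deleted, so no static
    // destructor can free it while a background thread may still be logging.
    static logger & instance() {
        static logger * inst = new logger();
        return *inst;
    }

    void log(const std::string & msg) {
        std::lock_guard<std::mutex> lock(mtx_);
        lines_.push_back(msg);
    }

    size_t size() {
        std::lock_guard<std::mutex> lock(mtx_);
        return lines_.size();
    }

private:
    logger() = default;
    logger(const logger &) = delete;
    logger & operator=(const logger &) = delete;

    std::mutex               mtx_;
    std::vector<std::string> lines_;  // owned by the leaked object, not a separate static
};
```

Keeping the data inside the leaked object, rather than in a separate static container, means nothing the logger thread touches is destroyed during static teardown. The cost is that the memory is reclaimed only by the OS at process exit.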
* Change so all logs are output before exit
* Added debug logging
* added more logging
* Added logging
* Explicitly free logger to avoid hanging on Win
* Reverted to leak logger instance again
* Removed debug log and fixed comment
* Fixed comment
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>