GLM-5.2 ships the DSA "lightning indexer" on only a subset of layers (the
"full" layers; others omit it), but the GLM_DSA loader created the five
indexer tensors on every layer as required, so loading any GLM-5.2 GGUF
failed with e.g. `missing tensor 'blk.3.indexer.k_norm.weight'`.
GLM_DSA's graph is llama_model_deepseek2::graph (plain MLA) and does not use
the indexer tensors (indexer runtime not yet implemented), so they are
loaded-but-unused. Marking them TENSOR_NOT_REQUIRED lets layers without an
indexer load as nullptr and the model runs as full MLA attention.
DeepSeek-V3.2 (uniform indexer on all layers) is unaffected.
Use std::partial_sort to order only the requested top-n tokens instead
of the full vocabulary
logprobs sort: vocab=128000 n_top=0 iters=100
full sort: 8555.6 us/op
partial sort: 704.3 us/op
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Absorb get_slot_by_id logic into get_available_slot so slot selection
is handled by a single function call. When a specific slot id is
requested, the LCP similarity check still runs to enable proper
prompt cache updates.
Assisted-by: pi:llama.cpp/Qwen3.6-27B
* ggml-cpu: support K tails in Power10 MMA Q8/Q4 matmul
This patch removes the requirement that K be divisible by kc in the tinyBlas_Q0_PPC tiled matmul path. Process the final K panel using its actual depth and pass the reduced panel size through packing and kernel execution. This allows more workloads to use the MMA kernel and reduces fallback to mnpack.
* Apply suggestion from @taronaeo
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
---------
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
* cuda: add GGML_OP_COL2IM_1D, follow-up to the CPU op
* cuda: col2im_1d use fast_div_modulo for the index decomposition
* cuda: col2im_1d tighten supports_op, type match and contiguous dst
* server: add "X-Accel-Buffering": "no" header to streaming endpoints
This header tells Nginx (as a reverse proxy) to NOT buffer responses. (only affects streaming endpoints)
Without it, Nginx will break streaming with certain applications (notably the Pi coding harness).
* ui : add model selector storybook stories
Covers list, favorites, single-model, all status states
(loading/loaded/sleeping/failed/idle), and selection states.
* ui : improve model selector mobile UX with hover media queries
Use @media (hover:none) to show action buttons directly on touch
devices and color-code them by model status (amber=sleeping,
green=loaded, muted=idle). Status dots hidden on touch. Desktop
hover behavior unchanged.
Throw on grammar parse failure so the server returns HTTP 400
instead of silently dropping the constraint.
Add a regression test for the invalid-grammar response.
Fixes#24144
* rename GGML_SYCL_SUPPORT_LEVEL_ZERO to GGML_SYCL_SUPPORT_LEVEL_ZERO_API, and GGML_SYCL_ENABLE_LEVEL_ZERO to GGML_SYCL_USE_LEVEL_ZERO_API
* fix code format
* fix error when rebase
* ggml: Conditionally enable power11 backend based on compiler support
Guard POWER11 backend creation behind a compiler flag check for -mcpu=power11. This avoids build failures on current GCC/Clang toolchains while preserving forward compatibility once POWER11 support becomes available.
* Update CMakeLists.txt
ggml-cpu: Use -mcpu=power10 for P10 and P11
Reuse existing rope kernels with a function constant to toggle forward/backward
rotation, avoiding duplicate kernel code.
Assisted-by: pi:llama.cpp/Qwen3.6-27B
* metal : add f16 and bf16 support for concat operator
Extend the Metal backend concat operator to support f16 and bf16 tensor
types in addition to the existing f32 and i32 support.
- Template kernel_concat on type T with specializations for float, half,
bfloat, and int
- Add type-specific pipeline getter ggml_metal_library_get_pipeline_concat()
- Update device support check to allow f16 unconditionally and bf16 when
device supports bfloat16
- Update dispatch to select the correct kernel specialization by type
Assisted-by: pi:llama.cpp/Qwen3.6-27B
* metal : extend concat operator to support f16, bf16, i8, i16 and i64
Assisted-by: pi:llama.cpp/Qwen3.6-27B