* ui: model status and load progress via /models/sse feed
* ui: centralize SSE wire-format delimiters into shared constants for the chat and /models/sse parsers
* ui: type /models/sse event names as a ServerModelsSseEventType enum
Address review from allozaur
line_start -1 normalized to n+1, so append inserted at lines.begin() + n + 1,
one past end() -> heap-buffer-overflow in vector::_M_range_insert.
Normalize -1 to n (insert at end()), restrict -1 to append mode and reject it
for replace/delete instead of silently clobbering the last line. Parenthesize
the insert offset so empty-file append computes the position as int first,
avoiding a transient begin() - 1 on a null vector data pointer.
* common/peg : refactor until gbnf grammar into an ac automaton
* cont : add a test with multiple strings
* cont : pad state with 0s so rules line up
* cont : clean up comments
* cont : use set everywhere
* cont : inline state num string padding
* cont : add a ref to PR
* cont : fix regression in server-tools.cpp
* arg: try fixing test-args-parser randomly fails
* return ref
* try triggering the workflow
* exception wrapper
* wip
* test
* test 2
* arg: guard win32 utf8 argv override
make_utf8_argv rebuilds argv from GetCommandLineW to fix utf8 handling of
non ascii arguments on windows. the override runs unconditionally inside
common_params_parse, so it also clobbers a programmatic argv passed by a
caller. test-arg-parser builds a synthetic argv but then sees the real
process command line instead, the model argument is never parsed, and the
assert that expects success aborts via fastfail (0xC0000409). this shows up
as a random failure in the openvino windows workflow.
only override argv when its length matches the caller argc, so the utf8
repair still applies to real binaries while a programmatic argv stays intact.
---------
Co-authored-by: Pascal <admin@serveurperso.com>
* server: avoid forwarding auth headers in CORS proxy
* format
* fix test
* fix e2e test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
GLM-5.2 ships the DSA "lightning indexer" on only a subset of layers (the
"full" layers; others omit it), but the GLM_DSA loader created the five
indexer tensors on every layer as required, so loading any GLM-5.2 GGUF
failed with e.g. `missing tensor 'blk.3.indexer.k_norm.weight'`.
GLM_DSA's graph is llama_model_deepseek2::graph (plain MLA) and does not use
the indexer tensors (indexer runtime not yet implemented), so they are
loaded-but-unused. Marking them TENSOR_NOT_REQUIRED lets layers without an
indexer load as nullptr and the model runs as full MLA attention.
DeepSeek-V3.2 (uniform indexer on all layers) is unaffected.
Use std::partial_sort to order only the requested top-n tokens instead
of the full vocabulary
logprobs sort: vocab=128000 n_top=0 iters=100
full sort: 8555.6 us/op
partial sort: 704.3 us/op
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Absorb get_slot_by_id logic into get_available_slot so slot selection
is handled by a single function call. When a specific slot id is
requested, the LCP similarity check still runs to enable proper
prompt cache updates.
Assisted-by: pi:llama.cpp/Qwen3.6-27B
* ggml-cpu: support K tails in Power10 MMA Q8/Q4 matmul
This patch removes the requirement that K be divisible by kc in the tinyBlas_Q0_PPC tiled matmul path. Process the final K panel using its actual depth and pass the reduced panel size through packing and kernel execution. This allows more workloads to use the MMA kernel and reduces fallback to mnpack.
* Apply suggestion from @taronaeo
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
---------
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
* cuda: add GGML_OP_COL2IM_1D, follow-up to the CPU op
* cuda: col2im_1d use fast_div_modulo for the index decomposition
* cuda: col2im_1d tighten supports_op, type match and contiguous dst
* server: add "X-Accel-Buffering": "no" header to streaming endpoints
This header tells Nginx (as a reverse proxy) to NOT buffer responses. (only affects streaming endpoints)
Without it, Nginx will break streaming with certain applications (notably the Pi coding harness).