* server : improve message span logic
* cont : cast size_t to int32_t in comparisons
* server : create checkpoints before every user msg
* chat : remove \n in gemma4 delimiters
* chat : merge msg delimiter structs into one
* cont : reword comment
* cont : initialize tokens in delimiter
* cont : add server_tokens::get_raw_tokens() for mtmd
* cont : move message finding to server_tokens and skip mtmd tokens
* cont : update cohere2moe parser
* cont : increase min-step to 8192 and always produce a chkpt for last user message
* server: real-time model load progress tracking via /models/sse
* update docs
* server: move model download to child process
* rm unused
* fix most problems
* clean up
* nit fixes
* fix test case
* do not detach() thread
* shorter MODEL_DOWNLOAD_TIMEOUT in test
* throttle
line_start -1 normalized to n+1, so append inserted at lines.begin() + n + 1,
one past end() -> heap-buffer-overflow in vector::_M_range_insert.
Normalize -1 to n (insert at end()), restrict -1 to append mode and reject it
for replace/delete instead of silently clobbering the last line. Parenthesize
the insert offset so empty-file append computes the position as int first,
avoiding a transient begin() - 1 on a null vector data pointer.
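
The normalization described above can be sketched as follows. This is a minimal illustration, not the actual patch: `edit_mode`, `resolve_append_offset` and `append_lines` are hypothetical names, and the sketch assumes `line_start` means "insert after this 1-based line", with -1 meaning "end of file".

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical edit modes, named for illustration only.
enum class edit_mode { append, replace, remove };

// Resolve line_start to a 0-based insert offset.
// -1 means "after the last line" and is accepted in append mode only.
static int resolve_append_offset(int line_start, int n_lines, edit_mode mode) {
    if (line_start == -1) {
        if (mode != edit_mode::append) {
            throw std::invalid_argument("line_start -1 is only valid in append mode");
        }
        line_start = n_lines; // normalize to n, i.e. insert at end()
    }
    if (line_start < 0 || line_start > n_lines) {
        throw std::out_of_range("line_start out of range");
    }
    return line_start;
}

static void append_lines(std::vector<std::string> & lines, int line_start,
                         const std::vector<std::string> & new_lines) {
    const int n      = static_cast<int>(lines.size());
    // Compute the offset as an int first, then add it to begin() exactly once,
    // so no intermediate iterator ever points outside [begin(), end()].
    const int offset = resolve_append_offset(line_start, n, edit_mode::append);
    lines.insert(lines.begin() + offset, new_lines.begin(), new_lines.end());
}
```

With this shape, appending with -1 to an empty vector resolves to offset 0, so the insert lands at `begin()` (which equals `end()`).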
* common/peg : refactor until gbnf grammar into an ac automaton
* cont : add a test with multiple strings
* cont : pad state with 0s so rules line up
* cont : clean up comments
* cont : use set everywhere
* cont : inline state num string padding
* cont : add a ref to PR
* cont : fix regression in server-tools.cpp
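
For readers unfamiliar with the technique, here is a minimal Aho-Corasick sketch for "consume until any of several strings appears". It is illustrative only: `ac_automaton` and `first_match_end` are hypothetical names, and the sketch does not show how the automaton states are emitted as GBNF rules.

```cpp
#include <array>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

struct ac_automaton {
    struct node {
        std::array<int, 256> next;
        int  fail     = 0;
        bool terminal = false; // true if some pattern ends in this state
        node() { next.fill(-1); }
    };
    std::vector<node> nodes;

    explicit ac_automaton(const std::vector<std::string> & patterns) : nodes(1) {
        // 1) build a trie of all patterns
        for (const auto & p : patterns) {
            int cur = 0;
            for (unsigned char c : p) {
                if (nodes[cur].next[c] < 0) {
                    const int id = static_cast<int>(nodes.size());
                    nodes.emplace_back();
                    nodes[cur].next[c] = id;
                }
                cur = nodes[cur].next[c];
            }
            nodes[cur].terminal = true;
        }
        // 2) BFS: fill failure links and complete the transition table
        std::queue<int> q;
        for (int c = 0; c < 256; ++c) {
            const int v = nodes[0].next[c];
            if (v < 0) { nodes[0].next[c] = 0; } else { q.push(v); }
        }
        while (!q.empty()) {
            const int u = q.front(); q.pop();
            const int f = nodes[u].fail;
            nodes[u].terminal = nodes[u].terminal || nodes[f].terminal;
            for (int c = 0; c < 256; ++c) {
                const int v = nodes[u].next[c];
                if (v < 0) {
                    nodes[u].next[c] = nodes[f].next[c];
                } else {
                    nodes[v].fail = nodes[f].next[c];
                    q.push(v);
                }
            }
        }
    }

    // Index one past the end of the earliest-ending match, or npos if none.
    std::size_t first_match_end(const std::string & text) const {
        if (nodes[0].terminal) { return 0; }
        int state = 0;
        for (std::size_t i = 0; i < text.size(); ++i) {
            state = nodes[state].next[static_cast<unsigned char>(text[i])];
            if (nodes[state].terminal) { return i + 1; }
        }
        return std::string::npos;
    }
};
```

Because every state has a complete transition table, scanning is one table lookup per input byte regardless of how many delimiter strings are registered.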
* server: avoid forwarding auth headers in CORS proxy
* format
* fix test
* fix e2e test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Use std::partial_sort to order only the requested top-n tokens instead
of the full vocabulary
logprobs sort: vocab=128000 n_top=0 iters=100
full sort: 8555.6 us/op
partial sort: 704.3 us/op
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
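
A minimal sketch of the idea, assuming a flat vector of probabilities indexed by token id. `token_prob` and `top_n_probs` are hypothetical names for illustration, not the server's actual types.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical pair of token id and probability.
struct token_prob {
    int   id;
    float p;
};

// Return the n_top most probable tokens in descending order of probability.
static std::vector<token_prob> top_n_probs(const std::vector<float> & probs, std::size_t n_top) {
    std::vector<token_prob> cur(probs.size());
    for (std::size_t i = 0; i < probs.size(); ++i) {
        cur[i] = { static_cast<int>(i), probs[i] };
    }
    n_top = std::min(n_top, cur.size());
    // Only the first n_top elements end up sorted; the rest are left in
    // unspecified order, which is what makes this cheaper than a full sort.
    std::partial_sort(cur.begin(), cur.begin() + n_top, cur.end(),
                      [](const token_prob & a, const token_prob & b) { return a.p > b.p; });
    cur.resize(n_top);
    return cur;
}
```

`std::partial_sort` does roughly N log n_top comparisons instead of N log N, so the saving grows as n_top shrinks relative to the vocabulary size.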
Absorb get_slot_by_id logic into get_available_slot so slot selection
is handled by a single function call. When a specific slot id is
requested, the LCP similarity check still runs to enable proper
prompt cache updates.
Assisted-by: pi:llama.cpp/Qwen3.6-27B
* server: add "X-Accel-Buffering": "no" header to streaming endpoints
This header tells Nginx (when used as a reverse proxy) not to buffer responses. It is only added to streaming endpoints.
Without it, Nginx breaks streaming for certain applications (notably the Pi coding harness).
Throw on grammar parse failure so the server returns HTTP 400
instead of silently dropping the constraint.
Add a regression test for the invalid-grammar response.
Fixes #24144
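
The control flow can be sketched as below. This is a simplified stand-in, not the server code: `http_result`, `parse_grammar_or_throw` and `handle_request` are hypothetical names, and the "parser" here only checks for a `::=` rule marker.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical response type for illustration.
struct http_result {
    int         status;
    std::string body;
};

// Hypothetical stand-in for a real grammar parser: it accepts any text that
// contains a "::=" rule definition and throws otherwise.
static void parse_grammar_or_throw(const std::string & grammar) {
    if (grammar.find("::=") == std::string::npos) {
        throw std::invalid_argument("failed to parse grammar");
    }
}

static http_result handle_request(const std::string & grammar) {
    try {
        if (!grammar.empty()) {
            parse_grammar_or_throw(grammar);
        }
    } catch (const std::invalid_argument & e) {
        // Surface the failure to the client instead of running unconstrained.
        return { 400, e.what() };
    }
    return { 200, "ok" };
}
```

The point of throwing rather than logging and continuing is that the caller learns the constraint was rejected, instead of receiving unconstrained output that looks like a success.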
* spec: add spec metrics mean acceptance length and acceptance per pos
* fix as suggestion
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix as suggestion
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix as suggestion
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix as suggestions
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server: clean up static assets handling
* nits
* simplify file name handling, use static file name everywhere
* cmake/ui : bundle UI assets in an archive
* ui : run prettier on post-build.js
---------
Co-authored-by: Alde Rojas <hello@alde.dev>
When reasoning-budget is set in model.ini, the per-request
thinking_budget_tokens from the WebUI was ignored because the
model.ini value took unconditional precedence.
Swap the precedence so the WebUI per-request value is checked
first, with the model.ini value serving as a fallback default.
Assisted-by: pi:llama.cpp/Qwen3.6-27B
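
The new precedence amounts to a two-level fallback. A minimal sketch, assuming both values are optional and that -1 stands for "no budget"; `resolve_thinking_budget` and the -1 convention are assumptions for illustration, not the actual code.

```cpp
#include <optional>

// Hypothetical helper: pick the per-request value if present, otherwise fall
// back to the model.ini value, otherwise report "no budget" as -1.
static int resolve_thinking_budget(std::optional<int> request_budget,
                                   std::optional<int> ini_budget) {
    if (request_budget) {
        return *request_budget; // per-request value wins
    }
    if (ini_budget) {
        return *ini_budget;     // model.ini acts as the default
    }
    return -1;
}
```

For example, a request carrying 512 with a model.ini value of 2048 resolves to 512, while a request with no value resolves to 2048.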
* server: log prompts to directory
Add `--log-prompts-dir` to write each prompt to a separate text file in
the specified directory.
* Apply suggestion from @ngxson
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Always export idle slots to RAM
Without this, a slot's VRAM cache may not be written to RAM. If that
slot is busy later on, this triggers needless preprocessing in
another slot.
* cont : clean-up
---------
Co-authored-by: Christoph Weiss <weiss@wsoptics.de>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* common : relax sampler name matching
Currently, the alternative names for samplers (like `top-k` and `min-p`
instead of the canonical `top_k` and `min_p`) are not always recognized
by the `common_sampler_types_from_names` function in
`common/sampling.cpp`.
This PR changes the signature of this function to remove the `bool
allow_alt_names` flag, and removes all occurrences of the flag from call
sites. Therefore, the function will now always match all known names.
I also changed the logic of the function to unconditionally check the
provided sampler names against both the canonical and alternative names,
and to be case-insensitive.
This fixes an issue I was seeing wherein samplers specified in the
`llama-server` UI were not recognized as valid when the alternative
names were used.
* add more alt names
* cont. fix
* cast to unsigned char for correctness
* common : unify sampler name mapping
* annotate canonical vs. alt sampler name mappings per @CISC
* Update common/sampling.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* common : auto-generate sampler name aliases per @ngxson
* use merged map for matching
* use `.merge` instead of iterating
* nit: simplify comment
* nit: use insert everywhere, not index assignment
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
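
A rough sketch of the matching approach described in this series: aliases generated from canonical names, merged into a single map, and looked up case-insensitively. The enum, the two samplers and the helper names are hypothetical; the real mapping in `common/sampling.cpp` may differ.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

// Hypothetical subset of sampler types for illustration.
enum class sampler_type { none, top_k, min_p };

static std::string to_lower(std::string s) {
    for (char & c : s) {
        // Cast to unsigned char first: passing a negative char to
        // std::tolower is undefined behavior.
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return s;
}

static std::map<std::string, sampler_type> build_name_map() {
    std::map<std::string, sampler_type> canonical;
    canonical.insert({ "top_k", sampler_type::top_k });
    canonical.insert({ "min_p", sampler_type::min_p });

    // Auto-generate aliases by swapping '_' for '-'.
    std::map<std::string, sampler_type> aliases;
    for (const auto & kv : canonical) {
        std::string alt = kv.first;
        std::replace(alt.begin(), alt.end(), '_', '-');
        aliases.insert({ alt, kv.second });
    }

    std::map<std::string, sampler_type> merged = canonical;
    merged.merge(aliases); // C++17: moves nodes whose keys are not yet present
    return merged;
}

static sampler_type sampler_from_name(const std::string & name) {
    static const std::map<std::string, sampler_type> names = build_name_map();
    const auto it = names.find(to_lower(name));
    return it == names.end() ? sampler_type::none : it->second;
}
```

Under these assumptions, `top_k`, `top-k` and `Top-K` all resolve to the same sampler, and unknown names map to `sampler_type::none`.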