* server : improve message span logic
* cont : cast size_t to int32_t in comparisons
* server : create checkpoints before every user msg
* chat : remove \n in gemma4 delimiters
* chat : merge msg delimiter structs into one
* cont : reword comment
* cont : initialize tokens in delimiter
* cont : add server_tokens::get_raw_tokens() for mtmd
* cont : move message finding to server_tokens and skip mtmd tokens
* cont : update cohere2moe parser
* cont : increase min-step to 8192 and always produce a chkpt for last user message
* server: real-time model load progress tracking via /models/sse
* update docs
* server: move model download to child process
* rm unused
* fix most problems
* clean up
* nit fixes
* fix test case
* do not detach() thread
* shorter MODEL_DOWNLOAD_TIMEOUT in test
* throttle
* common/peg : refactor until gbnf grammar into an ac automaton
* cont : add a test with multiple strings
* cont : pad state with 0s so rules line up
* cont : clean up comments
* cont : use set everywhere
* cont : inline state num string padding
* cont : add a ref to PR
* cont : fix regression in server-tools.cpp
* arg: try fixing random test-arg-parser failures
* return ref
* try triggering the workflow
* exception wrapper
* wip
* test
* test 2
* arg: guard win32 utf8 argv override
make_utf8_argv rebuilds argv from GetCommandLineW to fix utf8 handling of
non-ascii arguments on windows. the override runs unconditionally inside
common_params_parse, so it also clobbers a programmatic argv passed by a
caller. test-arg-parser builds a synthetic argv but then sees the real
process command line instead, the model argument is never parsed, and the
assert that expects success aborts via fastfail (0xC0000409). this shows up
as a random failure in the openvino windows workflow.
only override argv when its length matches the caller argc, so the utf8
repair still applies to real binaries while a programmatic argv stays intact.
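
A minimal, portable sketch of the guard described above. `pick_argv` and the
pre-built `utf8_argv` vector are hypothetical stand-ins for illustration; the
actual code rebuilds argv from GetCommandLineW inside make_utf8_argv and may
be structured differently.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not the real llama.cpp function): choose which argv to parse.
// `argv`/`argc` are what the caller passed in; `utf8_argv` stands in for the
// vector rebuilt from the real process command line on Windows.
std::vector<std::string> pick_argv(int argc, char ** argv,
                                   const std::vector<std::string> & utf8_argv) {
    // Only use the rebuilt argv when its length matches the caller's argc.
    // A mismatch suggests the caller passed a programmatic argv.
    if (static_cast<int>(utf8_argv.size()) == argc) {
        return utf8_argv;
    }
    return std::vector<std::string>(argv, argv + argc);
}
```

With a synthetic three-element argv and a one-element process command line,
the synthetic argv is kept. When the lengths match, the rebuilt argv wins.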
---------
Co-authored-by: Pascal <admin@serveurperso.com>
Throw on grammar parse failure so the server returns HTTP 400
instead of silently dropping the constraint.
Add a regression test for the invalid-grammar response.
Fixes #24144
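
A rough sketch of the throw-then-400 pattern, assuming a toy validity check.
`parse_grammar_or_throw`, `handle_completion`, and `http_result` are
hypothetical names, and the "has a root rule" test is only a placeholder for
the real GBNF parser.

```cpp
#include <stdexcept>
#include <string>

struct http_result {
    int         status;
    std::string body;
};

// Hypothetical stand-in for the grammar parser: throws instead of returning
// an empty result. The check for "root ::=" is an assumption for illustration.
void parse_grammar_or_throw(const std::string & grammar) {
    if (grammar.find("root ::=") == std::string::npos) {
        throw std::invalid_argument("failed to parse grammar");
    }
}

// Hypothetical request handler: a parse failure becomes HTTP 400 rather than
// a request that silently runs without the constraint.
http_result handle_completion(const std::string & grammar) {
    try {
        if (!grammar.empty()) {
            parse_grammar_or_throw(grammar);
        }
    } catch (const std::invalid_argument & e) {
        return { 400, e.what() };
    }
    return { 200, "ok" };
}
```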
* spec: add spec metrics mean acceptance length and acceptance per pos
* fix as suggestion
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix as suggestion
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix as suggestion
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix as suggestions
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* chat: harden peg-native tool call parsing
accept an optional leading type: function field in
build_json_tools_flat_keys so openai style tool calls parse on
templates whose serialization opens on the name field.
return a clean error and log the unparsed fragment on a final peg
parse failure instead of throwing the raw parser position and input.
keep the raw arguments string in func_args_not_string when it is not
valid json instead of aborting the prompt render.
* chat: surface peg-native parse failures
a final peg parse failure threw the raw parser position and input. log
the unparsed fragment and raise a clearer error instead, so a model
output that does not match the expected format no longer fails silently
with an empty assistant turn.
minimal change, no behavior change on successful parses.
* chat: handle openai style tool calls in peg-native
* nits
* common: scope OpenAI wrapper grammar trigger via autoparser flag
* chat: gate type:function parsing leniency on the analysis flag
Thread accept_openai_wrapper from the generator to build_json_tools_flat_keys
so the leading "type": "function" field is accepted only when openai_wrapper_trigger is set.
* chat: fix whitespace problems once and for all
* Purge trailing spaces from grammar generation
* Revert "Purge trailing spaces from grammar generation"
This reverts commit b0827ecb7d.
This is a non-functional change.
When using `--spec-type ngram-map-k4v`, the log messages at startup and
runtime say `ngram-map-k`. Added logic in the constructor of
`common_speculative_impl_ngram_map_k` to pass the correct
`COMMON_SPECULATIVE_TYPE_NGRAM_MAP_K4V` when `config.key_only` is
`false`.
After this change, the log messages use the correct name.
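
The selection can be sketched roughly as follows. The enum and struct here
are local stand-ins (assumptions), not the real `common_speculative` types;
only the `key_only` field and the two type names come from the text above.

```cpp
#include <string>

// Local stand-ins for the real enum values (assumed names).
enum spec_type { SPEC_TYPE_NGRAM_MAP_K, SPEC_TYPE_NGRAM_MAP_K4V };

struct ngram_map_config {
    bool key_only;
};

// Pick the type reported in logs from the config instead of hardcoding it.
spec_type ngram_map_type(const ngram_map_config & config) {
    return config.key_only ? SPEC_TYPE_NGRAM_MAP_K : SPEC_TYPE_NGRAM_MAP_K4V;
}

// Name used in log messages for each type.
std::string spec_type_name(spec_type t) {
    return t == SPEC_TYPE_NGRAM_MAP_K4V ? "ngram-map-k4v" : "ngram-map-k";
}
```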
* server: log prompts to directory
Add `--log-prompts-dir` to write each prompt to a separate text file in
the specified directory.
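
A minimal sketch of writing one prompt per file, assuming C++17
`std::filesystem`. `write_prompt_file` and the `prompt-NNNNNN.txt` naming
scheme are hypothetical; the naming used by the server is not stated here.

```cpp
#include <cstdio>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: write `prompt` to its own text file inside `dir`.
// The file name pattern is an assumption made for this example.
fs::path write_prompt_file(const fs::path & dir, int id, const std::string & prompt) {
    fs::create_directories(dir); // no-op if the directory already exists

    char name[32];
    std::snprintf(name, sizeof(name), "prompt-%06d.txt", id);

    const fs::path path = dir / name;
    std::ofstream out(path, std::ios::binary);
    out << prompt;
    return path;
}
```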
* Apply suggestion from @ngxson
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Always export idle slots to RAM
Without this, a slot's VRAM cache may not be written to RAM. If that
slot is busy later on, this triggers needless preprocessing in another
slot.
* cont : clean-up
---------
Co-authored-by: Christoph Weiss <weiss@wsoptics.de>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* common : relax sampler name matching
Currently, the alternative names for samplers (like `top-k` and `min-p`
instead of the canonical `top_k` and `min_p`) are not always recognized
by the `common_sampler_types_from_names` function in
`common/sampling.cpp`.
This PR changes the signature of this function to remove the `bool
allow_alt_names` flag, and removes all occurrences of the flag from call
sites. Therefore, the function will now always match all known names.
I also changed the logic of the function to unconditionally check the
provided sampler names against both the canonical and alternative names,
and to be case-insensitive.
This fixes an issue I was seeing wherein samplers specified in the
`llama-server` UI were not recognized as valid when the alternative
names were used.
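
A small sketch of case-insensitive matching against both canonical and
alternative names. The enum, the two-entry table, and the rule of deriving
aliases by swapping `_` for `-` are assumptions for illustration; the real
mapping in `common/sampling.cpp` covers more samplers and may generate
aliases differently.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_map>
#include <vector>

enum sampler_type { SAMPLER_TOP_K, SAMPLER_MIN_P }; // local stand-ins

// Lowercase a copy of the string. The lambda takes unsigned char because
// passing a negative char value to std::tolower is undefined behavior.
static std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical re-implementation: always match canonical and alias names.
std::vector<sampler_type> sampler_types_from_names(const std::vector<std::string> & names) {
    static const std::unordered_map<std::string, sampler_type> canonical = {
        { "top_k", SAMPLER_TOP_K },
        { "min_p", SAMPLER_MIN_P },
    };

    // Build a merged map: canonical names plus '-' variants of each.
    std::unordered_map<std::string, sampler_type> all = canonical;
    for (const auto & kv : canonical) {
        std::string alias = kv.first;
        std::replace(alias.begin(), alias.end(), '_', '-');
        all.insert({ alias, kv.second });
    }

    std::vector<sampler_type> out;
    for (const auto & name : names) {
        const auto it = all.find(to_lower(name));
        if (it != all.end()) {
            out.push_back(it->second); // unknown names are skipped
        }
    }
    return out;
}
```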
* add more alt names
* cont. fix
* cast to unsigned char for correctness
* common : unify sampler name mapping
* annotate canonical vs. alt sampler name mappings per @CISC
* Update common/sampling.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* common : auto-generate sampler name aliases per @ngxson
* use merged map for matching
* use `.merge` instead of iterating
* nit: simplify comment
* nit: use insert everywhere, not index assignment
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* common : fix state save in common_prompt_batch_decode
This commit addresses a bug in common_prompt_batch_decode that affects
the session state store/restore in completion.cpp and
save-load-state.cpp.
The motivation for this is that currently the code is saving n-1 tokens
in both the session_tokens and in the KV cache. Then when loading the
session tokens, and if the prompt matches, it would replay the last
saved token (n-1) into the next position, effectively replaying the
same token in the wrong position.
The fix is to store all n tokens in session_tokens, while the memory
state only reflects n-1 processed tokens as the saving happens before
the last token is decoded in common_prompt_batch_decode.
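
The bookkeeping can be illustrated with a toy model, assuming the snapshot
records the token list and a count of processed tokens. `session_snapshot`,
`save_before_last_decode`, and `tokens_to_decode` are hypothetical names and
do not mirror the real llama.cpp session API.

```cpp
#include <cstddef>
#include <vector>

// Toy snapshot: all prompt tokens, plus how many the memory state covers.
struct session_snapshot {
    std::vector<int> tokens;      // all n prompt tokens
    std::size_t      n_processed; // n-1: saved before the last token is decoded
};

struct pending_token {
    int         token;
    std::size_t pos;
};

// Save happens before the last token is decoded, so memory covers n-1 tokens
// while the token list keeps all n.
session_snapshot save_before_last_decode(const std::vector<int> & prompt) {
    return { prompt, prompt.empty() ? 0 : prompt.size() - 1 };
}

// On restore, only tokens not yet covered by the memory state are decoded,
// each at its own position.
std::vector<pending_token> tokens_to_decode(const session_snapshot & s) {
    std::vector<pending_token> out;
    for (std::size_t i = s.n_processed; i < s.tokens.size(); ++i) {
        out.push_back({ s.tokens[i], i });
    }
    return out;
}
```

For a four-token prompt, the snapshot holds four tokens with three processed,
so a restore decodes only the last token at position 3 instead of replaying
an earlier token at the wrong position.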
I ran both completion.cpp and save-load-state.cpp with a transformer, a
recurrent, and a hybrid model.
Resolves: https://github.com/ggml-org/llama.cpp/issues/23400
Co-authored-by: fairydreaming <166155368+fairydreaming@users.noreply.github.com>