* server : improve message span logic
* cont : cast size_t to int32_t in comparisons
* server : create checkpoints before every user msg
* chat : remove \n in gemma4 delimiters
* chat : merge msg delimiter structs into one
* cont : reword comment
* cont : initialize tokens in delimiter
* cont : add server_tokens::get_raw_tokens() for mtmd
* cont : move message finding to server_tokens and skip mtmd tokens
* cont : update cohere2moe parser
* cont : increase min-step to 8192 and always produce a chkpt for last user message
Use std::partial_sort to order only the requested top-n tokens instead
of the full vocabulary
logprobs sort: vocab=128000 n_top=0 iters=100
full sort: 8555.6 us/op
partial sort: 704.3 us/op
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* from previous PR
* Make instruction(system) as first message
* Convert [input_message] (text/image/file)
* Rename convert_responses_to_chatcmpl(body) -> response_body
* Initial tool call support
* Erase instructions field from chatcmpl body
* Feed reasoning texts to chat template
* Use std::vector instead of opaque json array
* Make output_item.added events consistent
* Move `server_task_result_cmpl_partial::update` from header to source
* Match ID of output_item.added and .done events
* Add function_call only if there is no "fc_" prefix
* Add function call output at non-streaming API
* Test if ID is persistent
* Add doc
* Fix style - use trailing comma
* Rewrite state management
* catch up with upstream/master
* Fix style - "type" is the first item of SSE data
* Explicitly check "instructions" from response_body
* Make lambdas static
* Check if reasoning content exists
* Add `oai_resp_id` to task_result_state(also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final
* Reject `input_file` since it is not supported by chatcmpl
* Add "fc_" prefix to non-straming function call id as coderabbit pointed out
---------
Co-authored-by: openingnow <>
* server: prevent data race from HTTP threads
* fix params
* fix default_generation_settings
* nits: make handle_completions_impl looks less strange
* stricter const
* fix GGML_ASSERT(idx < states.size())
* move index to be managed by server_response_reader
* http: make sure req & res lifecycle are tied together
* fix compile
* fix index handling buggy
* fix data race for lora endpoint
* nits: fix shadow variable
* nits: revert redundant changes
* nits: correct naming for json_webui_settings
* server: improve speed of speculative decoding
* fix small draft case
* add link to the PR
* server : fix generation time measurement
* server : fix draft acceptance logs (add SRV_CNT, SLT_CNT macros)
* server : add comment
* add PR to docs
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* backend support
* server: support multiple generations from one prompt (OAI "n" option)
* fix invalid batch
* format oai
* clean up
* disable ctx shift
* add test
* update comments
* fix style
* add n_cmpl to docs [no ci]
* allowing using both n_cmpl and n
* server : add Anthropic Messages API support
* remove -@pytest.mark.slow from tool calling/jinja tests
* server : remove unused code and slow/skip on test_anthropic_vision_base64_with_multimodal_model in test_anthropic_api.py
* server : removed redundant n field logic in anthropic_params_from_json
* server : use single error object instead of error_array in streaming response handler for /v1/chat/completions and use unordered_set instead of set in to_json_anthropic_stream()
* server : refactor Anthropic API to use OAI conversion
* make sure basic test always go first
* clean up
* clean up api key check, add test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>