* chat: harden peg-native tool call parsing
accept an optional leading type: function field in
build_json_tools_flat_keys so openai style tool calls parse on
templates whose serialization opens on the name field.
return a clean error and log the unparsed fragment on a final peg
parse failure instead of throwing the raw parser position and input.
keep the raw arguments string in func_args_not_string when it is not
valid json instead of aborting the prompt render.
* chat: surface peg-native parse failures
a final peg parse failure threw the raw parser position and input. log
the unparsed fragment and raise a clearer error instead, so a model
output that does not match the expected format no longer fails silently
with an empty assistant turn.
minimal change, no behavior change on successful parses.
* chat: handle openai style tool calls in peg-native
* nits
* common: scope OpenAI wrapper grammar trigger via autoparser flag
* chat: gate type:function parsing leniency on the analysis flag
Thread accept_openai_wrapper from the generator to build_json_tools_flat_keys
so the leading "type": "function" field is accepted only when openai_wrapper_trigger is set.
* common : add common_chat_split_by_role
* cont : fix spans to reach end of message
* server: fix checkpoints creation
- extract message_spans from chat templates
- find the prompt token position before the latest user message
- split prompt batching at that position
- create a context checkpoint before the latest user input
- avoid periodic mid-prompt checkpoints when that position is known
- handle multimodal prompts when mapping text/template positions to server prompt tokens
- add --checkpoint-min-step to control minimum spacing between checkpoints
* cont : clean-up
* Support autoparser detection for message barriers
* server: fix message span delimiter and update docs
---------
Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
* common : delegate assistant continuation to template handler
* server : implement echo parameter to exclude assistant prefill in the response
* server : fix tests for prefill
* server : use existing llama template
* cont : clean up
* chat/autoparser: the fixes
* Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls.
* Trim whitespace on apply instead
* fix: enable reasoning budget sampler for gemma4
Add thinking_start_tag and thinking_end_tag to
common_chat_params_init_gemma4(). Without these, the reasoning
budget sampler never activates for gemma4.
Make the newline after "thought" optional in the PEG parser to
handle budget=0 (sampler forces end tag before the newline).
Add test case for empty thinking block.
Fixes#21487
* use p.space() instead of p.optional(p.literal("\n")) in gemma4 thought parser
* Relax prefill parser to allow space.
* Move changes from prefix() to parser generation
* Only allow spaces if we're not having a pure content parser next
* chat : fix out_of_range crash in throw path (#20424 regression)
#20424 introduced effective_input = generation_prompt + input, but the
throw path uses input.substr(result.end) where result.end is a position
within effective_input. Every thinking model with a non-empty
generation_prompt crashes with std::out_of_range instead of the intended
error message.
Test crashes on unpatched master, passes with fix:
cmake -B build -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_TOOLS=OFF
cmake --build build --target test-chat
./build/bin/test-chat
* Update test-chat.cpp
* Update test-chat.cpp
* Update test-chat.cpp
---------
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
* common : handle incomplete UTF-8 at end of input in PEG parser
* cont : if reached end prematurely, emit needs_more_input to propagate partial output
* cont: refactor peg parse context to add lenient flag
* cont : remove partial flag, keep lenient flag
* common : fix Step-3.5-Flash format detection and thinking support
Step-3.5-Flash uses the same XML-style tool call format as Qwen3-Coder
(<tool_call><function=...><parameter=...>) but its Jinja template lacks
the bare <function> and plural <parameters> markers that the detection
logic previously required. This caused it to fall through to Hermes 2
Pro, which doesn't call func_args_not_string(), so arguments stayed as
JSON strings and templates using arguments|items crashed.
Additionally, the Qwen3-Coder-XML format handler had no thinking support.
Models like Step-3.5-Flash that unconditionally emit <think> in their
generation prompt need the same thinking_forced_open handling that
Nemotron v3 and Hermes 2 Pro already have, otherwise reasoning_content
is never separated from content in API responses.
Changes:
- Relax Qwen3-Coder XML detection to only require the 3 shared markers
- Tighten Nemotron v3 branch to also require bare <function> and plural
<parameters>, preventing Step-3.5-Flash from being misrouted via <think>
- Add thinking_forced_open support to Qwen3-Coder-XML init function
- Add <think>/</think> to preserved tokens
- Fix build_grammar_xml_tool_call to handle thinking_forced_open in the
grammar root rule, allowing </think> before tool calls
- Add Step-3.5-Flash chat template and format detection test
Builds on: https://github.com/ggml-org/llama.cpp/pull/19283
* chat : route Step-3.5-Flash to Nemotron v3 PEG parser, add tests
Step-3.5-Flash uses the same XML tool call format as Qwen3-Coder and
Nemotron 3 Nano (<tool_call>/<function=...>/<parameter=...>) but with
unconditional <think> output. Route it to the Nemotron v3 PEG parser
for streaming and schema-aware parameter parsing.
Detection: templates with <think> + XML tool tags use Nemotron v3 PEG
parser; templates without <think> (Qwen3-Coder) use GBNF grammar.
Tests cover: basic messages, tool calls with/without thinking content,
parallel tool calls, code string parameters, optional </parameter>
closing tags, and JSON schema response format.
* chat : remove dead thinking code from qwen3_coder_xml
Remove thinking handling code that became unreachable after routing
Step-3.5-Flash to the Nemotron v3 PEG parser. Qwen3-Coder has no
<think> in its template, so the thinking_forced_open logic, preserved
tokens, and grammar prefix were dead paths.