mirror of
https://github.com/ggml-org/llama.cpp.git
synced 2026-06-28 15:20:20 +00:00
b7739
481 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
d98b548120 |
Restore clip's cb() to its rightful glory - extract common debugging elements in llama (#17914)
* Extract common debugging functions; plug eval-callback and mtmd's MTMD_DEBUG_GRAPH with same functionality * Move to common * Remove unneeded header * Unlink from common * chore: update webui build output * Cleanup; properly pass params to mtmd without depending on common; factorize debug.cpp to use common debug code. * Revert change to webapp * Post-merge adjust * Apply suggestions from code review Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Apply code review changes * Remove changes to server-context * Remove mtmd.h include * Remove utility functions from header * Apply suggestions from code review Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Rename functions * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> |
||
|
|
516a4ca9b5 | refactor : remove libcurl, use OpenSSL when available (#18828) | ||
|
|
e047f9ee9d |
mtmd: fix use_non_causal being reported incorrectly (#18793)
* mtmd: fix use_non_causal being reported incorrectly * move clip_is_mrope to mtmd_decode_use_mrope * fix sloppy code ggml_cpy |
||
|
|
db79dc06b1 | llama-bench: add direct_io parameter (#18778) | ||
|
|
bcf7546160 |
server : add arg for disabling prompt caching (#18776)
* server : add arg for disabling prompt caching Disabling prompt caching is useful for clients who are restricted to sending only OpenAI-compat requests and want deterministic responses. * address review comments * address review comments |
||
|
|
ce3bf9b1a4 | server: update docs for sleeping [no ci] (#18777) | ||
|
|
f307926482 | server : adjust unified KV cache tests (#18716) | ||
|
|
9ac2693a30 |
server: fix n_cmpl not skipping processing prompt (#18663)
* server: fix n_cmpl not skipping processing * fix infinite loop on empty batch * cont : init child samplers + modify child logic * cont : cleanup * cont : improve n_cmpl logic - launch the parent task first so it finds the slot with best cache - parent task waits for child tasks to be launched - when a child task finishes - remove its cache * cont : remove redundant function * cont : reduce parent checks * fix : nullptr task dereference --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> |
||
|
|
a61c8bc3bf |
mtmd: Add Gemma3n multimodal support with MobileNetV5 vision encoder (#18256)
* Add Gemma3nVisionModel - MobileNetV5 vision encoder convertor to convert_hf_to_gguf.py. Add gemma3n to vision projectors in gguf-py/gguf/constants.py. * Add mobilenetv5 impl * Fix comments, remove unused vars * Fix permute and remove transpose of projection weights * Fix comments, remove debugging prints from hf_to_gguf * 1. Hard-code image_mean = 0 and image_std = 1 2. Use available tensor mapping logic 3. Remove redundant chat template replacement of soft tokens placeholder with media placeholder * 1. Move mobilenetv5 helpers declarations to `clip_graph_mobilenetv5` struct and definitions to mobilenetv5.cpp 2.Remove unused `clip_is_gemma3n` func declarations and definitions 3. Remove redundant `rescale_image_u8_to_f32` func and use `normalize_image_u8_to_f32` with zero mean and unit std 4. Calculate n_patches using image_size / patch_size * Remove obsolete comments * - convert_hf_to_gguf.py & constants.py & tensor_mapping.py: Use explicit mapping: Custom map for double indexed blocks and tensor_mapping.py for rest - convert_hf_to_gguf.py: Unsqueeze Stem Bias and Layer scale tensors to correct shape while converting to gguf - mobilenetv5.cpp: Remove explicit reshaping of Stem Bias and Layer scale which are now handled while converting to gguf, replace fprintf with LOG_* - clip.cpp: Remove unused embedding and hard_emb_norm tensor loading * - Rename tensors to v.conv..., v.blk..., v.msfa... to better align with already existing terminology * Fix stem conv bias name * Remove explicit handling of bias term for stem conv * - Change order of addition in "project_per_layer_inputs" to support broadcasting of vision inp_per_layer - Simplify the vision embeddings path of "get_per_layer_inputs" to output [n_embd_altup, n_layer, 1], broadcastable * clean up conversion script * fix code style * also preserve audio tensors * trailing space * split arch A and V * rm unused gemma3 func * fix alignment --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> |
||
|
|
ec8fd7876b |
Webui/file upload (#18694)
* webui: fix restrictive file type validation * webui: simplify file processing logic * chore: update webui build output * webui: remove file picker extension whitelist (1/2) * webui: remove file picker extension whitelist (2/2) * chore: update webui build output * refactor: Cleanup * chore: update webui build output * fix: update ChatForm storybook test after removing accept attribute * chore: update webui build output * refactor: more cleanup * chore: update webui build output |
||
|
|
a180ba78c7 | cmake: only build cli when server is enabled (#18670) | ||
|
|
53eb9435da | server : fix timing of prompt/generation (#18713) | ||
|
|
f5f8812f7c |
server : use different seeds for child completions (#18700)
* server : use different seeds for child completions * cont : handle default seed * cont : note |
||
|
|
55abc39355 |
vendor : update cpp-httplib to 0.30.0 (#18660)
* vendor : update cpp-httplib to 0.30.0 * common : allow custom headers when downloading |
||
|
|
64848deb18 | llama-fit-params: free memory target per device (#18679) | ||
|
|
56d2fed2b3 |
tools : remove llama-run (#18661)
* tools : remove llama-run * Remove licenses/LICENSE-linenoise Signed-off-by: Adrien Gallouët <angt@huggingface.co> |
||
|
|
ccbc84a537 |
mtmd: mtmd_audio_streaming_istft (#18645)
Change is decoupled from https://github.com/ggml-org/llama.cpp/pull/18641. [LFM2.5-Audio-1.5B](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) needs streaming istft for generating output audio. * add streaming ISTFT class (`mtmd_audio_streaming_istft`) with overlap-add for audio reconstruction * replace global audio cache with per-instance cache, the model requires two independent caches, for preprocessing (audio input) and for istft (audio output). * unified templated FFT/IFFT implementation supporting both forward and inverse transforms |
||
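The overlap-add step named in the `mtmd_audio_streaming_istft` commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the class name `overlap_add_sketch` and its `push` method are hypothetical, each frame is assumed to be already inverse-transformed and windowed, and window normalization is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical streaming overlap-add accumulator (not the mtmd API).
// Each call to push() adds one time-domain frame into a running buffer
// and returns the `hop` samples that can no longer be changed by later frames.
class overlap_add_sketch {
public:
    overlap_add_sketch(std::size_t frame_len, std::size_t hop_len)
        : n(frame_len), hop(hop_len), buf(frame_len, 0.0f) {
        assert(hop > 0 && hop <= n);
    }

    std::vector<float> push(const std::vector<float> & frame) {
        assert(frame.size() == n);
        // accumulate the new frame on top of the tail of the previous ones
        for (std::size_t i = 0; i < n; ++i) {
            buf[i] += frame[i];
        }
        // the first `hop` samples are final: emit them
        const auto hop_off = static_cast<std::ptrdiff_t>(hop);
        std::vector<float> out(buf.begin(), buf.begin() + hop_off);
        // shift the remaining samples to the front and zero the freed tail
        std::copy(buf.begin() + hop_off, buf.end(), buf.begin());
        std::fill(buf.end() - hop_off, buf.end(), 0.0f);
        return out;
    }

private:
    std::size_t n;          // frame length in samples
    std::size_t hop;        // hop size in samples
    std::vector<float> buf; // per-instance state, one buffer per stream
};
```

Keeping the buffer as a member rather than a global mirrors the per-instance cache idea in the commit message: two independent streams each own their own state.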
|
|
3d26a09dc7 |
server : add thinking content blocks to Anthropic Messages API (#18551)
* server : add thinking content blocks to Anthropic Messages API Add support for returning reasoning/thinking content in Anthropic API responses when using models with --reasoning-format deepseek and the thinking parameter enabled. - Non-streaming: adds thinking block before text in content array - Streaming: emits thinking_delta events with correct block indices - Partial streaming: tracks reasoning state across chunks via anthropic_has_reasoning member variable Tested with bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF model. * server : fix Anthropic API streaming for thinking content blocks Add signature field and fix duplicate content_block_start events in Anthropic Messages API streaming responses for reasoning models. * server: refactor Anthropic streaming state to avoid raw pointer Replace raw pointer to task_result_state with direct field copies: - Copy state fields in update() before processing chunk - Use local copies in to_json_anthropic() instead of dereferencing - Pre-compute state updates for next chunk in update() This makes the data flow clearer and avoids unsafe pointer patterns. |
||
|
|
73d284a250 |
model : add LFM2-ColBert-350M (#18607)
* model : add LFM2-ColBert-350M * llama_model_n_embd_out() - returns `hparams.n_embd_out` if set and falls back to `hparams.n_embd` |
||
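The fallback described for `llama_model_n_embd_out()` in the commit above might look like the sketch below. The struct `hparams_sketch`, the function name, and the convention that `0` means "not set" are assumptions made for illustration; the source does not show how the real field signals an unset value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the model hyperparameters.
struct hparams_sketch {
    uint32_t n_embd     = 0; // embedding size of the model
    uint32_t n_embd_out = 0; // output embedding size; 0 = not set (assumption)
};

// Return the output embedding size, falling back to n_embd when unset.
uint32_t n_embd_out_or_default(const hparams_sketch & hp) {
    assert(hp.n_embd > 0); // a loaded model is assumed to have a non-zero size
    return hp.n_embd_out > 0 ? hp.n_embd_out : hp.n_embd;
}
```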
|
|
da143b9940 | server : fix router child env in containerized environments (#18562) | ||
|
|
d3dce4e0a5 |
sampling : add support for backend sampling (#17004)
* sampling : add support for backend sampling
This commit adds support for performing sampling operations on the backend (e.g. GPU) as part of the model computation graph. The motivation for this feature is to enable sampling to be performed directly on the backend as part of the computation graph being executed, allowing for some or all of the sampling to be done on the backend. For example, the backend sampler chain might select/sample a token directly, in which case only the sampled token needs to be transferred from device memory to host memory. It is also possible for the backend samplers to perform filtering of the logits, or compute and filter the probability distribution, in which case only the filtered logits or probabilities need to be transferred back to system memory for further processing by CPU samplers. Currently the backend sampling works in a similar manner to how pooling works: it is a function that is called by build_graph, and the sampler operations become part of the model's computation graph.
* llama-cli : add backend sampler configuration
* server : add backend sampling options/configuration
* webui : add backend sampling options
* ggml : add initial cumsum implementation for CUDA
* sampling : enable all backend sampler tests
This commit enables all existing backend sampler tests in the test-backend-sampler. Previously, some tests were disabled because there were missing ggml operation implementations.
* graph : do not include llama-model.h
* sampling : always expose sampled_ids
This commit precomputes and caches the full-vocab token id list in llama_context's constructor, so llama_get_backend_sampled_token_ids_ith always returns a valid pointer. The motivation for this is that it lets both common/sampling.cpp and src/llama-sampling.cpp simplify their logic. Not all backend samplers that process logits need to set the sampled_tokens_id, as they may not change the order of the logits. For example, the temperature sampler only scales the logits but does not change their order. Similarly, the logit bias sampler only adds bias to specific token ids but does not change the order of the logits. In these cases there will not be a device to host copy of the sampled token ids, and this is the use case where having this precomputed list is useful.
* sampling : ensure at most one output token per seq
This commit adds a check in the batch allocator to ensure that when backend sampling is enabled, at most one output token is specified per sequence.
* CUDA: Optimize argsort for gpu-based token sampling
Argsort is used for top-k currently. We optimize argsort in two ways:
1. Use `DeviceRadixSort` for single-row/sequence to parallelize it across our SMs
2. Use `DeviceSegmentedSort` for multi-row/sequence as this is the correct entrypoint (the function chooses different execution paths, it contains `DeviceSegmentedRadixSort` as one of the paths and will choose the best one according to heuristics). https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#overview
Some perf numbers for a RTX PRO 6000:
On the kernel level, tested with `GGML_CUDA_DISABLE_GRAPHS=1 ./test-backend-ops -o ARGSORT perf`
Before:
```
ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 359.24 us/run
ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 8192 runs - 861.34 us/run
ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 1020.01 us/run
```
After:
```
ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 312.41 us/run
ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 16384 runs - 63.48 us/run
ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 874.36 us/run
```
---
On the model level, tested with `llama-cli -m gpt-oss-20b-mxfp4.gguf -n 200 -p "What is the Capital of Sweden?" -no-cnv -fa 1 --backend-sampling`
Before:
```
llama_perf_sampler_print: sampling time = 0.25 ms / 207 runs ( 0.00 ms per token, 824701.20 tokens per second)
llama_perf_context_print: load time = 18215.58 ms
llama_perf_context_print: prompt eval time = 28.20 ms / 7 tokens ( 4.03 ms per token, 248.19 tokens per second)
llama_perf_context_print: eval time = 714.79 ms / 199 runs ( 3.59 ms per token, 278.40 tokens per second)
llama_perf_context_print: total time = 857.62 ms / 206 tokens
```
After:
```
llama_perf_sampler_print: sampling time = 0.25 ms / 207 runs ( 0.00 ms per token, 828000.00 tokens per second)
llama_perf_context_print: load time = 18366.92 ms
llama_perf_context_print: prompt eval time = 35.92 ms / 7 tokens ( 5.13 ms per token, 194.87 tokens per second)
llama_perf_context_print: eval time = 532.79 ms / 199 runs ( 2.68 ms per token, 373.50 tokens per second)
llama_perf_context_print: total time = 683.65 ms / 206 tokens
```
* sampling : remove version from sampler chain
This commit removes the version field from the sampler chain and instead uses the sampler pointer itself for change detection.
* sampling : always populate logits for sampled probs
This commit updates common/sampler.cpp set_logits and src/llama-sampling.cpp llama_sampler_sample to always populate the logits field when backend sampled probabilities are available. The motivation for this is that it ensures that CPU samplers always have access to the logits values even when probabilities have been produced by backend samplers.
* sampling : simplify backend sampling logic decode
This commit tries to simplify the backend sampling logic in llama_context::decode.
* squash! sampling : simplify backend sampling logic decode
Fix condition to check if backend actually sampled tokens, not just that backend samplers are available.
* common : fix regression caused by extra memory allocations during sampling
* squash! sampling : simplify backend sampling logic decode
The commit fixes a variable shadowing issue in the `llama_context::decode` function which was introduced in a previous refactoring.
* squash! common : fix regression caused by extra memory allocations during sampling
Apply the same changes to llama-sampling.cpp, llama_sampler_sample as were applied in commit |
||
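The point in the backend sampling commit above about order-preserving samplers can be illustrated on the CPU with a small sketch. None of the names below come from llama.cpp: `make_identity_token_ids`, `apply_temperature`, and `top_k_ids` are hypothetical helpers, and the real backend samplers run as graph operations rather than as host loops. The idea shown is only that scaling logits by a temperature leaves token order intact, so a precomputed `0..n_vocab-1` id list stays valid, whereas top-k produces a reordered id list that would need to be copied back.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical: the full-vocab id list 0..n_vocab-1, computed once and cached.
std::vector<int32_t> make_identity_token_ids(std::size_t n_vocab) {
    std::vector<int32_t> ids(n_vocab);
    std::iota(ids.begin(), ids.end(), 0);
    return ids;
}

// Order-preserving transform: dividing by a positive temperature
// changes the values but not which token has which rank.
void apply_temperature(std::vector<float> & logits, float temp) {
    assert(temp > 0.0f);
    for (float & l : logits) {
        l /= temp;
    }
}

// Order-changing transform: keep the k largest logits and return their
// token ids sorted by descending logit.
std::vector<int32_t> top_k_ids(const std::vector<float> & logits, std::size_t k) {
    std::vector<int32_t> ids = make_identity_token_ids(logits.size());
    k = std::min(k, ids.size());
    std::partial_sort(ids.begin(), ids.begin() + static_cast<std::ptrdiff_t>(k), ids.end(),
                      [&](int32_t a, int32_t b) { return logits[a] > logits[b]; });
    ids.resize(k);
    return ids;
}
```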
|
|
4974bf53cf |
model : mtmd : make input norm optional in LFM2-VL (#18594)
Upcoming LFM2-VL releases will have configurable input norm. See https://github.com/huggingface/transformers/pull/43087 for details. |
||
|
|
ced765be44 |
model: support youtu-vl model (#18479)
* Support Youtu-VL Model * merge code * fix bug * revert qwen2 code & support rsplit in minja.hpp * update warm info * fix annotation * u * revert minja.hpp * fix * Do not write routed_scaling_factor to gguf when routed_scaling_factor is None * fix expert_weights_scale * LGTM after whitespace fixes * fix * fix * fix * layers to layer_index * enum fix --------- Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> |
||
|
|
d5574c919c |
webui: fix code copy stripping XML/HTML tags (#18518)
* webui: fix code copy stripping XML/HTML tags * webui: update static build |
||
|
|
33ded988ba |
quantize: prevent input/output file collision (#18451)
Check if input and output files are the same before quantizing to prevent file corruption when mmap reads from a file being written to. Fixes #12753 |
||
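A check of the kind described in the quantize commit above could be sketched with `std::filesystem`. This is an assumption about one way to do it, not the code from the PR: the function name `is_same_file` is hypothetical, and the fallback to comparing normalized absolute paths (for when the output file does not exist yet) is an added illustration.

```cpp
#include <cassert>
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: true if `input` and `output` refer to the same file.
bool is_same_file(const fs::path & input, const fs::path & output) {
    assert(!input.empty() && !output.empty());

    std::error_code ec;
    // Resolves symlinks and hard links, but needs both paths to exist.
    if (fs::equivalent(input, output, ec)) {
        return true;
    }

    // Fallback for a not-yet-created output: compare normalized absolute paths.
    ec.clear();
    const fs::path a = fs::absolute(input, ec).lexically_normal();
    if (ec) {
        return false;
    }
    const fs::path b = fs::absolute(output, ec).lexically_normal();
    if (ec) {
        return false;
    }
    return a == b;
}
```

A caller would run this before opening the output for writing and abort with an error when it returns true.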
|
|
9b8329de7a |
mtmd : Adding support for Nvidia Music Flamingo Model (#18470)
* Initial commit, debugging q5_k_s quant * Made hf_to_gguf extend whisper to reduce code duplication * addressed convert_hf_to_gguf pull request issue --------- Co-authored-by: Henry D <henrydorsey147@gmail.com> |
||
|
|
f14f4e421b | server: fix files built redundantly (#18474) | ||
|
|
51a48720b8 |
webui: fix prompt progress ETA calculation (#18468)
* webui: fix prompt progress ETA calculation * handle case done === 0 |
||
|
|
c9a3b40d65 |
Webui/prompt processing progress (#18300)
* webui: display prompt preprocessing progress * webui: add percentage/ETA and exclude cached tokens from progress Address review feedback from ngxson * webui: add minutes and first chunk (0%) case * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: address review feedback from allozaur * chore: update webui build output * webui: address review feedback from allozaur * nit * chore: update webui build output * feat: Enhance chat processing state * feat: Improve chat processing statistics UI * chore: update webui build output * feat: Add live generation statistics to processing state hook * feat: Persist prompt processing stats in hook for better UX * refactor: Enhance ChatMessageStatistics for live stream display * feat: Implement enhanced live chat statistics into assistant message * chore: update webui build output * fix: Proper tab for each stage of prompt processing/generation * chore: update webui build output * fix: Improved ETA calculation & display logic * chore: update webui build output * feat: Simplify logic & remove ETA from prompt progress * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> |
||
|
|
5b1248c9af |
server : Cmdline arg -to changes http read timeout from current 600sec default (#18279)
* Prevent crash if TTFT >300sec, boosted to 90 days * server : allow configurable HTTP timeouts for child models * server : pass needed timeouts from params only --------- Co-authored-by: Greg Slocum <fromgit@wbtek.slocum.net> |
||
|
|
2a85f720b8 | server : handle closed connection for tasks (#18459) | ||
|
|
daa242dfc8 |
common: fix return value check for setpriority (#18412)
* common: fix return value check for setpriority * tools: add logging for process priority setting |
||
|
|
cffa5c46ea | mtmd: clarify that we no longer accept AI-generated PRs (#18406) | ||
|
|
a52dc60ba3 | llama_fit_params: return enum for fail vs. error (#18374) | ||
|
|
4893cc07bb |
server : fix crash when seq_rm fails for hybrid/recurrent models (#18391)
* server : fix crash when seq_rm fails for hybrid/recurrent models * server : add allow_processing param to clear_slot |
||
|
|
f5acfb2ffa |
server: (router) add stop-timeout option (#18350)
* server: (router) add stop-timeout option * also allow stop while loading * add docs * unload_lru: also wait for unload to complete |
||
|
|
c184284230 | fit-params : fix race condition in fit-params output (#18276) | ||
|
|
5ee4e43f26 | server: return_progress to also report 0% processing state (#18305) | ||
|
|
5b6c9bc0f3 |
webui: apply webui_settings on first load (#18223)
* webui: apply webui_settings on first load The webui_settings from /props were not applied on initial load when default_generation_settings.params was null. Now syncs whenever serverProps is available, regardless of params; works for both single-model and router modes * chore: update webui build output |
||
|
|
849d021104 | server: fix crash with model not having BOS/EOS (#18321) | ||
|
|
179fd82a72 |
gen-docs: automatically update markdown file (#18294)
* gen-docs: automatically update markdown file * also strip whitespace * do not add extra newline * update TOC |
||
|
|
6ce863c803 |
server: prevent data race from HTTP threads (#18263)
* server: prevent data race from HTTP threads * fix params * fix default_generation_settings * nits: make handle_completions_impl looks less strange * stricter const * fix GGML_ASSERT(idx < states.size()) * move index to be managed by server_response_reader * http: make sure req & res lifecycle are tied together * fix compile * fix index handling buggy * fix data race for lora endpoint * nits: fix shadow variable * nits: revert redundant changes * nits: correct naming for json_webui_settings |
||
|
|
3997c78e33 | server: fix data race in to_json_anthropic (#18283) | ||
|
|
86af848153 | server: (docs) remove mention about extra_args (#18262) | ||
|
|
147a521636 | tool/ex/tests: consistently free ctx, then model (#18168) | ||
|
|
ddcb75dd8a |
server: add auto-sleep after N seconds of idle (#18228)
* implement sleeping at queue level * implement server-context suspend * add test * add docs * optimization: add fast path * make sure to free llama_init * nits * fix use-after-free * allow /models to be accessed during sleeping, fix use-after-free * don't allow accessing /models during sleep, it is not thread-safe * fix data race on accessing props and model_meta * small clean up * trailing whitespace * rm outdated comments |
||
|
|
408616adbd |
server : [easy] fix per round speculative decode logging (#18211)
Currently we always log 0, as we clear slot.drafted before. To reproduce: Run llama-server with devstral-2 as main model and devstral-2-small as md, and verbose logging:
```
% ./build/bin/llama-server -v \
    -m ~/llms/Devstral-2-123B-Instruct-2512-UD-Q6_K_XL-00001-of-00003.gguf \
    -md ~/llms/Devstral-Small-2-24B-Instruct-2512-UD-Q2_K_XL.gguf \
    -c 8192 2> /tmp/llama.cpp.debug
Check the log:
slot update_slots: id 3 | task 0 | accepted 11/0 draft tokens, new n_tokens = 741
slot update_slots: id 3 | task 0 | accepted 4/0 draft tokens, new n_tokens = 746
slot update_slots: id 3 | task 0 | accepted 16/0 draft tokens, new n_tokens = 763
slot update_slots: id 3 | task 0 | accepted 11/0 draft tokens, new n_tokens = 775
slot update_slots: id 3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 778
slot update_slots: id 3 | task 0 | accepted 4/0 draft tokens, new n_tokens = 783
slot update_slots: id 3 | task 0 | accepted 8/0 draft tokens, new n_tokens = 792
slot update_slots: id 3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 795
slot update_slots: id 3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 797
slot update_slots: id 3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 799
slot update_slots: id 3 | task 0 | accepted 0/0 draft tokens, new n_tokens = 800
slot update_slots: id 3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 803
slot update_slots: id 3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 805
slot update_slots: id 3 | task 0 | accepted 6/0 draft tokens, new n_tokens = 812
slot update_slots: id 3 | task 0 | accepted 3/0 draft tokens, new n_tokens = 816
```
After the fix, get correct per round logging:
```
slot update_slots: id 3 | task 0 | accepted 7/8 draft tokens, new n_tokens = 654
slot update_slots: id 3 | task 0 | accepted 1/2 draft tokens, new n_tokens = 656
slot update_slots: id 3 | task 0 | accepted 2/16 draft tokens, new n_tokens = 659
slot update_slots: id 3 | task 0 | accepted 1/16 draft tokens, new n_tokens = 661
slot update_slots: id 3 | task 0 | accepted 2/16 draft tokens, new n_tokens = 664
slot update_slots: id 3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 681
slot update_slots: id 3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 698
slot update_slots: id 3 | task 0 | accepted 3/4 draft tokens, new n_tokens = 702
slot update_slots: id 3 | task 0 | accepted 5/12 draft tokens, new n_tokens = 708
slot update_slots: id 3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 725
slot update_slots: id 3 | task 0 | accepted 1/1 draft tokens, new n_tokens = 727
slot update_slots: id 3 | task 0 | accepted 8/16 draft tokens, new n_tokens = 736
```
|
||
|
|
9e39a1e6a9 |
server: support load model on startup, support preset-only options (#18206)
* server: support autoload model, support preset-only options * add docs * load-on-startup * fix * Update common/arg.cpp Co-authored-by: Pascal <admin@serveurperso.com> --------- Co-authored-by: Pascal <admin@serveurperso.com> |
||
|
|
14931a826e |
arg: fix order to use short form before long form (#18196)
* arg: fix order to use short form before long form * arg: update doc * arg: update test-arg-parser * arg: address review feedback from ngxson: simplified to check first.length() <= last.length() only; fixed --sampler-seq, --rerank, --draft ordering; note: middle positions in 3+ arg sets are not verified * arg: update doc |
||
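The ordering rule mentioned in the arg commit above (`first.length() <= last.length()`) can be written as a small predicate. The function name `short_form_first` and the flag names used with it are hypothetical; the sketch only mirrors the stated check, including its noted limitation that middle entries in sets of three or more are not inspected.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical check: the first alias of an option must not be longer
// than the last one, so "-x" is listed before "--example".
// Middle entries are deliberately not examined.
bool short_form_first(const std::vector<std::string> & aliases) {
    assert(!aliases.empty()); // an option is assumed to have at least one name
    if (aliases.size() < 2) {
        return true;
    }
    return aliases.front().length() <= aliases.back().length();
}
```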
|
|
cc0a04343e |
server: friendlier error msg when ctx < input (#18174)
* llama-server: friendlier error msg when ctx < input This PR adds formatted strings to the server's send_error function * llama-server: use string_format inline * fix test |