mirror of
https://github.com/ggml-org/llama.cpp.git
synced 2026-06-28 15:20:20 +00:00
b7739
309 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
bcf7546160 |
server : add arg for disabling prompt caching (#18776)
* server : add arg for disabling prompt caching Disabling prompt caching is useful for clients who are restricted to sending only OpenAI-compat requests and want deterministic responses. * address review comments * address review comments |
||
|
|
ce3bf9b1a4 | server: update docs for sleeping [no ci] (#18777) | ||
|
|
f307926482 | server : adjust unified KV cache tests (#18716) | ||
|
|
9ac2693a30 |
server: fix n_cmpl not skipping processing prompt (#18663)
* server: fix n_cmpl not skipping processing * fix infinite loop on empty batch * cont : init child samplers + modify child logic * cont : cleanup * cont : improve n_cmpl logic - launch the parent task first so it finds the slot with best cache - parent task waits for child tasks to be launched - when a child task finishes - remove its cache * cont : remove redundant function * cont : reduce parent checks * fix : nullptr task dereference --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> |
||
|
|
ec8fd7876b |
Webui/file upload (#18694)
* webui: fix restrictive file type validation * webui: simplify file processing logic * chore: update webui build output * webui: remove file picker extension whitelist (1/2) * webui: remove file picker extension whitelist (2/2) * chore: update webui build output * refactor: Cleanup * chore: update webui build output * fix: update ChatForm storybook test after removing accept attribute * chore: update webui build output * refactor: more cleanup * chore: update webui build output |
||
|
|
53eb9435da | server : fix timing of prompt/generation (#18713) | ||
|
|
f5f8812f7c |
server : use different seeds for child completions (#18700)
* server : use different seeds for child completions * cont : handle default seed * cont : note |
||
|
|
55abc39355 |
vendor : update cpp-httplib to 0.30.0 (#18660)
* vendor : update cpp-httplib to 0.30.0 * common : allow custom headers when downloading |
||
|
|
3d26a09dc7 |
server : add thinking content blocks to Anthropic Messages API (#18551)
* server : add thinking content blocks to Anthropic Messages API

Add support for returning reasoning/thinking content in Anthropic API responses when using models with --reasoning-format deepseek and the thinking parameter enabled.

- Non-streaming: adds thinking block before text in content array
- Streaming: emits thinking_delta events with correct block indices
- Partial streaming: tracks reasoning state across chunks via anthropic_has_reasoning member variable

Tested with bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF model.

* server : fix Anthropic API streaming for thinking content blocks

Add signature field and fix duplicate content_block_start events in Anthropic Messages API streaming responses for reasoning models.

* server: refactor Anthropic streaming state to avoid raw pointer

Replace raw pointer to task_result_state with direct field copies:

- Copy state fields in update() before processing chunk
- Use local copies in to_json_anthropic() instead of dereferencing
- Pre-compute state updates for next chunk in update()

This makes the data flow clearer and avoids unsafe pointer patterns. |
||
|
|
73d284a250 |
model : add LFM2-ColBert-350M (#18607)
* model : add LFM2-ColBert-350M * llama_model_n_embd_out() - returns `hparams.n_embd_out` if set and falls back to `hparams.n_embd` |
||
|
|
da143b9940 | server : fix router child env in containerized environments (#18562) | ||
|
|
d3dce4e0a5 |
sampling : add support for backend sampling (#17004)
* sampling : add support for backend sampling

This commit adds support for performing sampling operations on the backend (e.g. GPU) as part of the model computation graph. The motivation for this feature is to enable sampling to be performed directly on the backend as part of the computation graph being executed, allowing for some or all of the sampling to be done on the backend.

For example, the backend sampler chain might select/sample a token directly, in which case only the sampled token needs to be transferred from device memory to host memory. It is also possible for the backend samplers to perform filtering of the logits, or compute and filter the probability distribution, in which case only the filtered logits or probabilities need to be transferred back to system memory for further processing by CPU samplers.

Currently the backend sampling works in a similar manner to how pooling works: it is a function that is called by build_graph, and the sampler operations become part of the model's computation graph.

* llama-cli : add backend sampler configuration
* server : add backend sampling options/configuration
* webui : add backend sampling options
* ggml : add initial cumsum implementation for CUDA
* sampling : enable all backend sampler tests

This commit enables all existing backend sampler tests in the test-backend-sampler. Previously, some tests were disabled because there were missing ggml operation implementations.

* graph : do not include llama-model.h
* sampling : always expose sampled_ids

This commit precomputes and caches the full-vocab token id list in llama_context's constructor, so llama_get_backend_sampled_token_ids_ith always returns a valid pointer. The motivation for this is that it allows both common/sampling.cpp and src/llama-sampling.cpp to simplify their logic.

Not all backend samplers that process logits need to set the sampled_tokens_id, as they may not change the order of the logits. For example, the temperature sampler only scales the logits but does not change their order. Similarly, the logit bias sampler only adds bias to specific token ids but does not change the order of the logits. In these cases there will not be a device to host copy of the sampled token ids, and this is the use case where having this precomputed list is useful.

* sampling : ensure at most one output token per seq

This commit adds a check in the batch allocator to ensure that when backend sampling is enabled, at most one output token is specified per sequence.

* CUDA: Optimize argsort for gpu-based token sampling

Argsort is used for top-k currently. We optimize argsort in two ways:

1. Use `DeviceRadixSort` for single-row/sequence to parallelize it across our SMs
2. Use `DeviceSegmentedSort` for multi-row/sequence as this is the correct entrypoint (the function chooses different execution paths; it contains `DeviceSegmentedRadixSort` as one of the paths and will choose the best one according to heuristics). https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#overview

Some perf numbers for a RTX PRO 6000:

On the kernel level, tested with `GGML_CUDA_DISABLE_GRAPHS=1 ./test-backend-ops -o ARGSORT perf`

Before:
```
ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 359.24 us/run
ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 8192 runs - 861.34 us/run
ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 1020.01 us/run
```
After:
```
ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 312.41 us/run
ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 16384 runs - 63.48 us/run
ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 874.36 us/run
```

---

On the model level, tested with `llama-cli -m gpt-oss-20b-mxfp4.gguf -n 200 -p "What is the Capital of Sweden?" -no-cnv -fa 1 --backend-sampling`

Before:
```
llama_perf_sampler_print: sampling time = 0.25 ms / 207 runs ( 0.00 ms per token, 824701.20 tokens per second)
llama_perf_context_print: load time = 18215.58 ms
llama_perf_context_print: prompt eval time = 28.20 ms / 7 tokens ( 4.03 ms per token, 248.19 tokens per second)
llama_perf_context_print: eval time = 714.79 ms / 199 runs ( 3.59 ms per token, 278.40 tokens per second)
llama_perf_context_print: total time = 857.62 ms / 206 tokens
```
After:
```
llama_perf_sampler_print: sampling time = 0.25 ms / 207 runs ( 0.00 ms per token, 828000.00 tokens per second)
llama_perf_context_print: load time = 18366.92 ms
llama_perf_context_print: prompt eval time = 35.92 ms / 7 tokens ( 5.13 ms per token, 194.87 tokens per second)
llama_perf_context_print: eval time = 532.79 ms / 199 runs ( 2.68 ms per token, 373.50 tokens per second)
llama_perf_context_print: total time = 683.65 ms / 206 tokens
```

* sampling : remove version from sampler chain

This commit removes the version field from the sampler chain and instead uses the sampler pointer itself for change detection.

* sampling : always populate logits for sampled probs

This commit updates common/sampler.cpp set_logits and src/llama-sampling.cpp llama_sampler_sample to always populate the logits field when backend sampled probabilities are available. The motivation for this is that it ensures that CPU samplers always have access to the logits values even when probabilities have been produced by backend samplers.

* sampling : simplify backend sampling logic decode

This commit tries to simplify the backend sampling logic in llama_context::decode.

* squash! sampling : simplify backend sampling logic decode

Fix condition to check if backend actually sampled tokens, not just that backend samplers are available.

* common : fix regression caused by extra memory allocations during sampling
* squash! sampling : simplify backend sampling logic decode

The commit fixes a variable shadowing issue in the `llama_context::decode` function which was introduced in a previous refactoring.

* squash! common : fix regression caused by extra memory allocations during sampling

Apply the same changes to llama-sampling.cpp, llama_sampler_sample as were applied in commit |
||
|
|
d5574c919c |
webui: fix code copy stripping XML/HTML tags (#18518)
* webui: fix code copy stripping XML/HTML tags * webui: update static build |
||
|
|
f14f4e421b | server: fix files built redundantly (#18474) | ||
|
|
51a48720b8 |
webui: fix prompt progress ETA calculation (#18468)
* webui: fix prompt progress ETA calculation * handle case done === 0 |
||
|
|
c9a3b40d65 |
Webui/prompt processing progress (#18300)
* webui: display prompt preprocessing progress * webui: add percentage/ETA and exclude cached tokens from progress Address review feedback from ngxson * webui: add minutes and first chunk (0%) case * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: address review feedback from allozaur * chore: update webui build output * webui: address review feedback from allozaur * nit * chore: update webui build output * feat: Enhance chat processing state * feat: Improve chat processing statistics UI * chore: update webui build output * feat: Add live generation statistics to processing state hook * feat: Persist prompt processing stats in hook for better UX * refactor: Enhance ChatMessageStatistics for live stream display * feat: Implement enhanced live chat statistics into assistant message * chore: update webui build output * fix: Proper tab for each stage of prompt processing/generation * chore: update webui build output * fix: Improved ETA calculation & display logic * chore: update webui build output * feat: Simplify logic & remove ETA from prompt progress * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> |
||
|
|
5b1248c9af |
server : Cmdline arg -to changes http read timeout from current 600sec default (#18279)
* Prevent crash if TTFT >300sec, boosted to 90 days * server : allow configurable HTTP timeouts for child models * server : pass needed timeouts from params only --------- Co-authored-by: Greg Slocum <fromgit@wbtek.slocum.net> |
||
|
|
2a85f720b8 | server : handle closed connection for tasks (#18459) | ||
|
|
4893cc07bb |
server : fix crash when seq_rm fails for hybrid/recurrent models (#18391)
* server : fix crash when seq_rm fails for hybrid/recurrent models * server : add allow_processing param to clear_slot |
||
|
|
f5acfb2ffa |
server: (router) add stop-timeout option (#18350)
* server: (router) add stop-timeout option * also allow stop while loading * add docs * unload_lru: also wait for unload to complete |
||
|
|
5ee4e43f26 | server: return_progress to also report 0% processing state (#18305) | ||
|
|
5b6c9bc0f3 |
webui: apply webui_settings on first load (#18223)
* webui: apply webui_settings on first load The webui_settings from /props were not applied on initial load when default_generation_settings.params was null. Now syncs whenever serverProps is available, regardless of params; works for both single-model and router modes * chore: update webui build output |
||
|
|
849d021104 | server: fix crash with model not having BOS/EOS (#18321) | ||
|
|
179fd82a72 |
gen-docs: automatically update markdown file (#18294)
* gen-docs: automatically update markdown file * also strip whitespace * do not add extra newline * update TOC |
||
|
|
6ce863c803 |
server: prevent data race from HTTP threads (#18263)
* server: prevent data race from HTTP threads * fix params * fix default_generation_settings * nits: make handle_completions_impl looks less strange * stricter const * fix GGML_ASSERT(idx < states.size()) * move index to be managed by server_response_reader * http: make sure req & res lifecycle are tied together * fix compile * fix index handling buggy * fix data race for lora endpoint * nits: fix shadow variable * nits: revert redundant changes * nits: correct naming for json_webui_settings |
||
|
|
3997c78e33 | server: fix data race in to_json_anthropic (#18283) | ||
|
|
86af848153 | server: (docs) remove mention about extra_args (#18262) | ||
|
|
ddcb75dd8a |
server: add auto-sleep after N seconds of idle (#18228)
* implement sleeping at queue level * implement server-context suspend * add test * add docs * optimization: add fast path * make sure to free llama_init * nits * fix use-after-free * allow /models to be accessed during sleeping, fix use-after-free * don't allow accessing /models during sleep, it is not thread-safe * fix data race on accessing props and model_meta * small clean up * trailing whitespace * rm outdated comments |
||
|
|
408616adbd |
server : [easy] fix per round speculative decode logging (#18211)
Currently we always log 0, as we clear slot.drafted before.

To reproduce: Run llama-server with devstral-2 as main model and devstral-2-small as md, and verbose logging:

```
% ./build/bin/llama-server -v \
  -m ~/llms/Devstral-2-123B-Instruct-2512-UD-Q6_K_XL-00001-of-00003.gguf \
  -md ~/llms/Devstral-Small-2-24B-Instruct-2512-UD-Q2_K_XL.gguf \
  -c 8192 2> /tmp/llama.cpp.debug

Check the log:

slot update_slots: id 3 | task 0 | accepted 11/0 draft tokens, new n_tokens = 741
slot update_slots: id 3 | task 0 | accepted 4/0 draft tokens, new n_tokens = 746
slot update_slots: id 3 | task 0 | accepted 16/0 draft tokens, new n_tokens = 763
slot update_slots: id 3 | task 0 | accepted 11/0 draft tokens, new n_tokens = 775
slot update_slots: id 3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 778
slot update_slots: id 3 | task 0 | accepted 4/0 draft tokens, new n_tokens = 783
slot update_slots: id 3 | task 0 | accepted 8/0 draft tokens, new n_tokens = 792
slot update_slots: id 3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 795
slot update_slots: id 3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 797
slot update_slots: id 3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 799
slot update_slots: id 3 | task 0 | accepted 0/0 draft tokens, new n_tokens = 800
slot update_slots: id 3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 803
slot update_slots: id 3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 805
slot update_slots: id 3 | task 0 | accepted 6/0 draft tokens, new n_tokens = 812
slot update_slots: id 3 | task 0 | accepted 3/0 draft tokens, new n_tokens = 816
```

After the fix, get correct per round logging:

```
slot update_slots: id 3 | task 0 | accepted 7/8 draft tokens, new n_tokens = 654
slot update_slots: id 3 | task 0 | accepted 1/2 draft tokens, new n_tokens = 656
slot update_slots: id 3 | task 0 | accepted 2/16 draft tokens, new n_tokens = 659
slot update_slots: id 3 | task 0 | accepted 1/16 draft tokens, new n_tokens = 661
slot update_slots: id 3 | task 0 | accepted 2/16 draft tokens, new n_tokens = 664
slot update_slots: id 3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 681
slot update_slots: id 3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 698
slot update_slots: id 3 | task 0 | accepted 3/4 draft tokens, new n_tokens = 702
slot update_slots: id 3 | task 0 | accepted 5/12 draft tokens, new n_tokens = 708
slot update_slots: id 3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 725
slot update_slots: id 3 | task 0 | accepted 1/1 draft tokens, new n_tokens = 727
slot update_slots: id 3 | task 0 | accepted 8/16 draft tokens, new n_tokens = 736
```
|
||
|
|
9e39a1e6a9 |
server: support load model on startup, support preset-only options (#18206)
* server: support autoload model, support preset-only options * add docs * load-on-startup * fix * Update common/arg.cpp Co-authored-by: Pascal <admin@serveurperso.com> --------- Co-authored-by: Pascal <admin@serveurperso.com> |
||
|
|
14931a826e |
arg: fix order to use short form before long form (#18196)
* arg: fix order to use short form before long form * arg: update doc * arg: update test-arg-parser * arg: address review feedback from ngxson: simplified to check first.length() <= last.length() only; fixed: --sampler-seq, --rerank, --draft ordering; note: middle positions in 3+ arg sets are not verified * arg: update doc |
||
|
|
cc0a04343e |
server: friendlier error msg when ctx < input (#18174)
* llama-server: friendlier error msg when ctx < input This PR adds formatted strings to the server's send_error function * llama-server: use string_format inline * fix test |
||
|
|
98c1c7a7bf |
presets: refactor, allow cascade presets from different sources, add global section (#18169)
* presets: refactor, allow cascade presets from different sources * update docs * fix neg arg handling * fix empty mmproj * also filter out server-controlled args before to_ini() * skip loading custom_models if not specified * fix unset_reserved_args * fix crash on windows |
||
|
|
acb73d8340 |
webui: Add editing attachments in user messages (#18147)
* feat: Enable editing attachments in user messages * feat: Improvements for data handling & UI * docs: Update Architecture diagrams * chore: update webui build output * refactor: Exports * chore: update webui build output * feat: Add handling paste for Chat Message Edit Form * chore: update webui build output * refactor: Cleanup * chore: update webui build output |
||
|
|
f9ec8858ed |
webui: display prompt processing stats (#18146)
* webui: display prompt processing stats * feat: Improve UI of Chat Message Statistics * chore: update webui build output * refactor: Post-review improvements * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> |
||
|
|
9ce64aed7d |
webui: Fix selecting generated output issues during active streaming (#18091)
* draft: incremental markdown rendering with stable blocks * refactor: Logic improvements * refactor: DRY Markdown post-processing logic * refactor: ID generation improvements * fix: Remove runes * refactor: Clean up & add JSDocs * chore: update webui static output * fix: Add tick to prevent race conditions for rendering Markdown blocks Suggestion from @ServeurpersoCom Co-authored-by: Pascal <admin@serveurperso.com> * chore: Run `npm audit fix` * chore: update webui static output * feat: Improve performance using global counter & id instead of UUID * refactor: Enhance Markdown rendering with link and code features * chore: update webui static output * fix: Code block content extraction * chore: update webui static output * chore: update webui static output --------- Co-authored-by: Pascal <admin@serveurperso.com> |
||
|
|
900316da4e |
webui: fix chat screen shadow width (#18010)
* webui: fix chat screen shadow width * chore: add index.html.gz |
||
|
|
6ce3d85796 |
server: (webui) add --webui-config (#18028)
* server/webui: add server-side WebUI config support Add CLI arguments --webui-config (inline JSON) and --webui-config-file (file path) to configure WebUI default settings from server side. Backend changes: - Parse JSON once in server_context::load_model() for performance - Cache parsed config in webui_settings member (zero overhead on /props) - Add proper error handling in router mode with try/catch - Expose webui_settings in /props endpoint for both router and child modes Frontend changes: - Add 14 configurable WebUI settings via parameter sync - Add tests for webui settings extraction - Fix subpath support with base path in API calls Addresses feedback from @ngxson and @ggerganov * server: address review feedback from ngxson * server: regenerate README with llama-gen-docs |
||
|
|
e85e9d7637 | server: (router) disable SSL on child process (#18141) | ||
|
|
d37fc93505 |
webui: fix chat header width when sidebar is closed (#17981)
* webui: fix chat header width when sidebar is closed * chore: add index.html.gz |
||
|
|
bde461de8c |
server: (router) allow child process to report status via stdout (#18110)
* server: (router) allow child process to report status via stdout * apply suggestions |
||
|
|
59977eba7b |
server: fix crash when batch > ubatch with embeddings (#17912)
* server: fix crash when batch > ubatch with embeddings (#12836)

Fixes #12836 where the server crashes with GGML_ASSERT failure when running with embeddings enabled and n_batch > n_ubatch.

Root cause: Embeddings use non-causal attention which requires all tokens to be processed within a single ubatch. When n_batch > n_ubatch, the server attempts to split processing, causing assertion failure.

Solution:
- Add parameter validation in main() after common_params_parse()
- When embeddings enabled and n_batch > n_ubatch:
  * Log warnings explaining the issue
  * Automatically set n_batch = n_ubatch
  * Prevent server crash

This follows the approach suggested by @ggerganov in issue #12836.

Note: This supersedes stalled PR #12940 which attempted a runtime fix in the old examples/server/server.cpp location. This implementation validates at startup in tools/server/server.cpp (current location).

Testing:
- Build: Compiles successfully
- Validation triggers: Warns when -b > -ub with --embedding
- Auto-correction works: Adjusts n_batch = n_ubatch
- No false positives: Valid params don't trigger warnings
- Verified on macOS M3 Pro with embedding model

* Update tools/server/server.cpp

---------

Co-authored-by: ytian218 <ytian218@bloomberg.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> |
||
|
|
7b1db3d3b7 |
arg: clarify auto kvu/np being set on server (#17997)
* arg: clarify auto kvu/np being set on server * improve docs * use invalid_argument |
||
|
|
5f5f9b4637 |
server: Update README.md incorrect argument (#18073)
`n-gpu-layer` is incorrect; the argument is `n-gpu-layers`, with the 's' |
||
|
|
3034836d36 |
webui: Improve copy to clipboard with text attachments (#17969)
* feat: Create copy/paste user message including "pasted text" attachments * chore: update webui build output * chore: update webui static output * fix: UI issues * chore: update webui static output * fix: Decode HTML entities using `DOMParser` * chore: update webui build output * chore: update webui static output |
||
|
|
a20979d433 |
webui: Add setting to always show sidebar on Desktop (#17809)
* feat: Add setting to always show Sidebar on Desktop * chore: update webui build output * feat: Add auto-show sidebar setting * fix: Mobile settings dialog UI * chore: update webui build output * feat: UI label update * chore: update webui build output * chore: update webui build output * chore: update webui build output * refactor: Cleanup * chore: update webui build output |
||
|
|
40d9c394f4 |
Webui: Disable attachment button and model selector button when prompt textbox is disabled. (#17925)
* Pass disabled state to the file attachments button and the model selector button. * Update index.html.gz * Fix model info card in non-router mode. * Update index.html.gz |
||
|
|
0f4f35e7be |
Fix unreadable user markdown colors and truncate long texts in deletion dialogs (#17555)
* webui: limit conversation name length in dialogs * webui: fix unreadable colors on links and table cell hover in user markdown * webui: keep table borders visible in user markdown * webui: updating unified exports * Update tools/server/webui/src/lib/components/app/chat/ChatAttachments/ChatAttachmentThumbnailFile.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * chore: update webui build output * chore: update webui build output * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> |
||
|
|
e73d548659 |
webui: add "delete all conversations" button to import/export tab (#17444)
* webui: add "delete all conversations" button to import/export tab - Add 'Delete all conversations' functionality with confirmation dialog - Add Trash icon and destructive styling for clear visual indication - Redirects to "?new_chat=true#/" by using conversationsStore.deleteAll() * chore: update webui build output |
||
|
|
254098a279 |
common : refactor common_sampler + grammar logic changes (#17937)
* common : refactor common_sampler + grammar logic changes * tests : increase max_tokens to get needed response * batched : fix uninitialized samplers |