llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-06-26 06:10:19 +00:00

Author	SHA1	Message	Date
Xuan-Son Nguyen	60bc8866b1	common: refactor model handling (#24980 ) * common: refactor models handling * remote preset * cont * rm skip_download option * missing header * fix plan.model_files * fix --offline case * move hf_plan to download * refactor * rm redundant curr_ex, add comments * adapt	2026-06-25 15:17:51 +02:00
Adrien Gallouët	683b04cc4a	app : add the llama download subcommand (#24982 ) * app : add the download command (with llama-download) Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Remove llama-download tool for now Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-06-25 13:36:36 +02:00
Aldehir Rojas	73618f27a8	server: improve user message detection and create checkpoints at every user message (#24176 ) * server : improve message span logic * cont : cast size_t to int32_t in comparisons * server : create checkpoints before every user msg * chat : remove \n in gemma4 delimiters * chat : merge msg delimiter structs into one * cont : reword comment * cont : initialize tokens in delimiter * cont : add server_tokens::get_raw_tokens() for mtmd * cont : move message finding to server_tokens and skip mtmd tokens * cont : update cohere2moe parser * cont : increase min-step to 8192 and always produce a chkpt for last user message	2026-06-23 08:27:28 +03:00
Xuan-Son Nguyen	2b686a9120	server: refactor child --> router communication (#24821 ) * server: refactor child --> router communication * fix wakeup case * add docs * improve update_status() * nits	2026-06-20 01:02:26 +02:00
Xuan-Son Nguyen	e475fa2b5f	mtmd, arg: fix utf8 handling on windows (#24779 ) * mtmd, arg: fix utf8 handling on windows * also fix ggml_fopen * fix build fail * also fix CLI	2026-06-19 22:28:38 +02:00
Xuan-Son Nguyen	8c2d6f6475	server: add --agent arg, remove redundant webui naming compat (#24801 ) * server: add --agent arg, remove redundant webui naming compat * corrent env * fix the test * llama-gen-docs * nits: wordings	2026-06-19 16:06:13 +02:00
Ruixiang Wang	b14e3fb90c	spec: support eagle3 for qwen3.5 & 3.6 (#24593 ) * spec: support qwen3.5 & 3.6 eagle3 draft * eagle3: Add deferred boundary checkpoints restore support for hybrid models * apply suggestions Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * spec: adapt to API change * spec: fix naming * cont : add TODO --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-06-19 13:08:50 +03:00
Xuan-Son Nguyen	552258c535	server: (router) rework -hf preset repo (#24739 ) * server: temporary remove HF remote preset * rework remove preset.ini support * rm unused get_remote_preset_whitelist() * print warning * add docs * rm stray file	2026-06-18 12:45:23 +02:00
Xuan-Son Nguyen	e37abd6b5f	mtmd: add batching API (#24384 ) * mtmd: add batching API * wip * first working version (gemma4v) * add arg * nits * wire up support_batch() * fix 0.0 output embd * fix audio * nits * refactor a bit * nits * fix non-batching case * fix comment	2026-06-13 00:10:29 +02:00
jacekpoplawski	1e912561dd	server: log prompts to directory (#22031 ) * server: log prompts to directory Add `--log-prompts-dir` to write each prompt to a separate text file in the specified directory. * Apply suggestion from @ngxson --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2026-06-09 12:09:07 +02:00
Xuan-Son Nguyen	8f83d6c271	mtmd : add video input support (#24269 ) * wip * ok: lazy bitmap API * remember to free lazy text * wip * add mtmd_helper_video * support video input on server (base64 input) * add MTMD_VIDEO config * add timestamp * update CLI * cli: allow auto-completion for video * add --video arg * fix build * update docs * rename as suggested	2026-06-08 14:40:12 +03:00
Daniel Bevenius	0b7154066e	common : fix state save in common_prompt_batch_decode (#23468 ) * common : fix state save in common_prompt_batch_decode This commit addresses a bug in common_prompt_batch_decode that affects the session state store/restore in completion.cpp and save-load-state.cpp. The motivation for this is that currently the code is saving n-1 tokens in both the session_tokens and in the KV cache. Then when loading the session tokens, and if the prompt matches, it would replay the last saved token (n-1) into the next position, effectively replaying the same token in the wrong position. The fix is to store all n tokens in session_tokens, while the memory state only reflects n-1 processed tokens as the saving happens before the last token is decoded in common_prompt_batch_decode. I ran both completion.cpp and save-load-state.cpp with a transformer, a recurrent, and a hybrid model. Resolves: https://github.com/ggml-org/llama.cpp/issues/23400 Co-authored-by: fairydreaming <166155368+fairydreaming@users.noreply.github.com>	2026-06-02 15:44:15 +02:00
Xuan-Son Nguyen	60130d18f9	server: add SSE ping interval (#24013 )	2026-06-02 14:14:55 +02:00
Pascal	354ebac8cb	server: real-time reasoning interruption via control endpoint (#23971 ) * server: real-time reasoning interruption via control endpoint Builds on the manual reasoning budget trigger from #23949. Adds a CONTROL task that mirrors the CANCEL path on the live slot and calls common_sampler_reasoning_budget_force to end thinking mid-generation. POST /v1/chat/completions/control with { id_slot, action }, opt-in reasoning_control arms the budget sampler on demand. Router and single model. Minimal WebUI button as a skeleton for further UI work. * ui: track reasoning phase via explicit streaming state Add isReasoning to the chat store, mirroring the isLoading pattern: per conversation map, private setter, public accessor and reactive export. Set from the stream callbacks, true on reasoning chunks, false on the first content chunk, reset on stream end and resynced on conversation switch. The skip button now keys off isReasoning so it shows only during the thinking phase, not the whole generation. * ui: extract control endpoint and action into constants Move the chat completion routes, the slots route and the reasoning control action out of chat.service into api-endpoints and a dedicated control-actions module. No behavior change, drops the magic strings so the control protocol has a single source of truth. * server: target reasoning control by completion id Address @ngxson review on the control endpoint. Switch from id_slot to the chat completion id to avoid a TOCTOU: the slot can be reassigned between the lookup and the control request, so matching the live completion (oaicompat_cmpl_id) is safe and a finished one simply matches nothing. Rename the action to reasoning_end, guard it on the reasoning_control flag of the target slot, and reduce the response to {success} with an optional message. * ui: target reasoning control by completion id Keep the streamed completion id on the message and post it back to the control endpoint instead of probing /slots. Drops the slot discovery and the TOCTOU that came with it. Action renamed to reasoning_end, response read as {success}. * server: address review from @ngxson Move the control fields into task_params and drop the redundant comments on the control path. * server: document the reasoning control endpoint * Update tools/ui/src/lib/types/database.d.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * ui: rename cmplId to completionId Per @allozaur review, clearer name for the streamed completion id. * ui: wire completion id capture through the agentic flow The webui streams through the agentic flow, which relayed onModel but not onCompletionId, so the completion id never reached the message and the control request was never sent. Relay it through the flow and its callbacks type, declare id on the chunk type, and log an explicit error when the button fires without a usable id. * ui: target reasoning control model from the message The model is a property of the completion, so read it from the streaming message like the id, not from the model dropdown which is unrelated UI state. Makes the request self-consistent by construction instead of just unlikely to drift. --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2026-06-02 07:26:20 +02:00
Aman Gupta	de6f727aae	llama: limit max outputs of `llama_context` (#23861 ) * llama: save more VRAM by reserving n_outputs == n_seqs when possible * add n_outputs_per_seq * move n_outputs_max to server-context * change ubatch to batch everywhere	2026-06-01 18:01:38 +03:00
Xuan-Son Nguyen	06d26dfdff	download: add option to skip_download (#23059 ) * download: add option to skip_download * fix * fix 2 * if file doesn't exist, respect skip_download flag	2026-05-29 16:30:55 +02:00
Xuan-Son Nguyen	cb47092b00	server: bump timeout to 3600s (#23842 ) * server: bump timeout to 3600s * nits: change wording	2026-05-29 10:23:17 +02:00
jacekpoplawski	e2ef8fe42c	server: fix checkpoints creation (#22929 ) * common : add common_chat_split_by_role * cont : fix spans to reach end of message * server: fix checkpoints creation - extract message_spans from chat templates - find the prompt token position before the latest user message - split prompt batching at that position - create a context checkpoint before the latest user input - avoid periodic mid-prompt checkpoints when that position is known - handle multimodal prompts when mapping text/template positions to server prompt tokens - add --checkpoint-min-step to control minimum spacing between checkpoints * cont : clean-up * Support autoparser detection for message barriers * server: fix message span delimiter and update docs --------- Co-authored-by: Alde Rojas <hello@alde.dev> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>	2026-05-25 08:56:18 +03:00
Aldehir Rojas	b22ff4b7b4	cmake/ui : refactor the build (#23352 )	2026-05-23 17:08:22 -04:00
Gaurav Garg	ad27757261	Move to backend sampling for MTP draft path (#23287 ) * Move to backend sampling for MTP draft path Run top_k(10) on the draft backend. D2H transfers happen only for the top 10 logits Make backend sampling more robust and fallback to CPU on failure cases, such as with "-sm tensor" or when a backend doesn't support TOP_K. * Allow sampler chains to be partially offloaded to backend * Add --spec-draft-backend-sampling argument. Enabled by default.	2026-05-20 22:34:45 +05:30
Georgi Gerganov	d14ce3dab4	llama : MTP clean-up (#23269 ) * llama : disable equal splits for recurrent memory with partial rollback * spec : re-enable p-min with MTP drafts * spec : re-enable ngram spec in combination with RS rollback * spec : fix ngram-map-* params * spec : fix acceptance logic in combined ngram + draft configs * graph : fix reuse for combined `token` + `embd` batches * spec : log parameters for each speculative implementation - add LOG_INF in each constructor with implementation type and parameters - extract device string logic into common_speculative_get_devices_str() - move 'adding speculative implementation' log from init into constructors Assisted-by: llama.cpp:local pi * spec : extend --spec-default with ngram-map-k4v Assisted-by: llama.cpp:local pi * minor : fix n_embd log * args : update draft.n_max == 3 + regen docs * spec : relax ngram-mod rejection thold to 0.25 @ 5 low * logs : improve * docs : update speculative decoding CLI argument documentation - Add missing draft model CPU scheduling and tensor override parameters - Update --spec-type to include all available types (excluding draft-eagle3 WIP) - Fix default values to match implementation (n_max=3, n_min=0, p_min=0.0) - Remove deprecated options (spec-draft-ctx-size, spec-draft-replace) - Add environment variables for new parameters Assisted-by: llama.cpp:local pi * arg : step-back on adding k4v to the default spec config * cont : fix name	2026-05-19 15:32:58 +03:00
Georgi Gerganov	cd963fee6a	save-load-state : refactor tests and improve readability (#23196 ) * save-load-state : refactor into separate phase functions - Split monolithic main() into 4 self-contained phase functions, each managing its own context/sampler/batch lifecycle - Each function tokenizes internally using its local ctx instance - main() is now a clean orchestrator: init -> run phases -> assert results - Proper resource cleanup on every exit path (return {} on error) Assisted-by: llama.cpp:local pi * save-load-state : use params.out_file instead of separate state_file - Remove state_file parameter from all phase functions - Each function accesses params.out_file directly - Initialize params.out_file in main alongside params.prompt Assisted-by: llama.cpp:local pi * save-load-state : use smart pointers for ctx and smpl - Replace raw llama_context* with llama_context_ptr - Replace raw llama_sampler* with llama_sampler_ptr - Remove all manual llama_free() and llama_sampler_free() calls - Keep llama_batch as raw (managed manually with llama_batch_free) Assisted-by: llama.cpp:local pi * save-load-state : add local llama_batch_ptr RAII wrapper - Add llama_batch_ptr struct holding llama_batch by value - Calls llama_batch_free() in destructor - Eliminates all manual llama_batch_free() calls Assisted-by: llama.cpp:local pi * save-load-state : replace printf/fprintf with logging macros - Add log.h include - Replace fprintf(stderr, ...) errors with LOG_ERR - Replace fprintf(stderr, ...) info with LOG_TRC - Replace printf output with LOG Assisted-by: llama.cpp:local pi * save-load-state : refactor tests to check results inline Each follow-up phase now accepts an expected result and performs the comparison internally instead of collecting results in main(). Assisted-by: llama.cpp:local pi * save-load-state : improve test output readability Add phase labels, remove redundant run prefixes, and show PASS after each test. Assisted-by: llama.cpp:local pi * pi : add rule about git signing * save-load-state : simplify llama_batch_ptr Change get() to return a reference and remove operator(). Use batch.get() throughout for consistency. Assisted-by: llama.cpp:local pi save-load-state : extract generate_tokens helper Factor out the repeated token generation loop into a shared helper function used by all phases. Assisted-by: llama.cpp:local pi * save-load-state : update comments to use test terminology Replace "Phase" with "Test" and list each test's steps as bullet points. Assisted-by: llama.cpp:local pi * save-load-state : rename test functions Rename to test_baseline, test_state_load, test_seq_cp_host, test_seq_cp_device. Update comments and logs accordingly. Assisted-by: llama.cpp:local pi * pi : add rule to never git push without confirmation Assisted-by: llama.cpp:local pi * common : add model_only option to common_init_from_params Add bool model_only parameter to skip context creation, sampler init, and context-dependent setup. Use in save-load-state to initialize only the model, with each test creating its own context. Assisted-by: llama.cpp:local pi --------- Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>	2026-05-19 09:46:34 +03:00
Aldehir Rojas	87589042ca	cmake : fix LLAMA_BUILD_UI logic (#23190 )	2026-05-17 14:42:26 -04:00
Pascal	64b38b561b	server: skip device enumeration in router mode to avoid creating CUDA primary context (#23137 )	2026-05-16 21:21:06 +02:00
Aman Gupta	255582687b	llama + spec: MTP Support (#22673 ) * spec: support MTP * fix batch size * rename files * cont : simplify (#7) * MTP: clean-up (#9) * MTP: clean-up * review: use llama_context_type instead of llama_graph_type * review: remove llama_model_has_mtp * review: fix convert issues * convert: fix pycheck * review: formatting * use `mtp-` for identifying mtp models * convert: fix mtp conversion * mtp -> draft-mtp * remove unused llama_arch * add need_embd in speculative * llama: allow partial seq_rm for GDN models for speculative decoding Currently speculative checkpoint needs to restart from a checkpoint after some draft tokens are not accepted, this leads to some wastage in running the target again. This PR adds the ability to rollback upto `draft_max` by storing the GDN intermediates. * fix pending state * vulkan: add GDN partial rollback * meta: extend check to axis 1 * metal: add GDN partial rollback Extend the gated delta net kernel to store intermediate states for partial rollback support on the Metal backend. - Add K (snapshot slot count) as a function constant - Read input state from slot 0 of the 3D state tensor - Write intermediate states to different slots during token loop - For K=1, maintain backward-compatible single-slot behavior Ref: https://github.com/ggml-org/llama.cpp/commit/8c05923630110223669f069af2000e9cf10c02bc Assisted-by: llama.cpp:local pi * delta_net_base: use ggml_pad instead of new_tensor * review: add need_rs_seq * review: rename part_bounded to n_rs * review: deslop comments * review: rename, add asserts * server : adjust checkpoint logic (#11) * server : adjust checkpoint logic * cont : rm asserts * server-context: fix early exit * spec : fix compatibility with n-gram and add TODOs (#13) * metal : cleanup * llama : fix faulty bitwise check in recurrent memory * server : disable RS-based MTP in combination with other spec types * spec : add TODOs * cont : fix comment * cont : update comment * common : fix logic for ngram + mtp compat * llama-memory: enable checkpointing with partial rollback * cont: add test-case for loading into a dirty ctx * llama-memory-recurrent: clear rs_idx in clear * download: fix mtp path * llama-arch: fix enorm op * docs: update docs * conversion: fix type annotations --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-05-16 20:06:23 +08:00
Aleksander Grygier	59778f0196	ui: Restructure repo to use `tools/ui` folder and `ui` / `UI` / `llama-ui` / `LLAMA_UI` naming (#23064 ) * webui: Move static build output from `tools/server/public` to `build/ui` directory * refactor: Move to `tools/ui` * refactor: rename CMake variables and preprocessor defines - Rename LLAMA_BUILD_WEBUI -> LLAMA_BUILD_UI (old kept as deprecated) - Rename LLAMA_USE_PREBUILT_WEBUI -> LLAMA_USE_PREBUILT_UI (old kept as deprecated) - Backward compat: old vars auto-forward to new ones with DEPRECATION warning - Rename internal vars: WEBUI_SOURCE -> UI_SOURCE, WEBUI_SOURCE_DIR -> UI_SOURCE_DIR, etc. - Rename HF bucket: LLAMA_WEBUI_HF_BUCKET -> LLAMA_UI_HF_BUCKET - Emit both LLAMA_BUILD_WEBUI and LLAMA_BUILD_UI preprocessor defines - Emit both LLAMA_WEBUI_DEFAULT_ENABLED and LLAMA_UI_DEFAULT_ENABLED * refactor: rename CLI flags (--webui -> --ui) with backward compat - Add --ui/--no-ui (old --webui/--no-webui kept as deprecated aliases) - Add --ui-config (old --webui-config kept as deprecated alias) - Add --ui-config-file (old --webui-config-file kept as deprecated alias) - Add --ui-mcp-proxy/--no-ui-mcp-proxy (old --webui-mcp-proxy kept as deprecated) - Add new env vars: LLAMA_ARG_UI, LLAMA_ARG_UI_CONFIG, LLAMA_ARG_UI_CONFIG_FILE, LLAMA_ARG_UI_MCP_PROXY - C++ struct fields: params.ui, params.ui_config_json, params.ui_mcp_proxy added alongside old fields - Backward compat: old fields synced to new ones in g_params_to_internals * refactor: update C++ server internals with backward compat - Rename json_webui_settings -> json_ui_settings (both kept in server_context_meta) - Rename params.webui usage -> params.ui (both synced, old still works) - JSON API emits both "ui"/"ui_settings" and "webui"/"webui_settings" keys - Server routes use params.ui_mcp_proxy \|\| params.webui_mcp_proxy - Preprocessor guards use #if defined(LLAMA_BUILD_UI) \|\| defined(LLAMA_BUILD_WEBUI) * refactor: rename CI/CD workflows, artifacts, and build script - Rename webui-build.yml -> ui-build.yml; artifact webui-build -> ui-build - Rename webui-publish.yml -> ui-publish.yml; var HF_BUCKET_WEBUI_STATIC_OUTPUT -> HF_BUCKET_UI_STATIC_OUTPUT - Rename server-webui.yml -> server-ui.yml; job webui-build/checks -> ui-build/checks - Update server.yml: job/artifact refs webui-build -> ui-build - Update release.yml: all webui-build/publish refs -> ui-build/publish; HF_TOKEN_WEBUI_STATIC_OUTPUT -> HF_TOKEN_UI_STATIC_OUTPUT - Update server-self-hosted.yml: webui-build -> ui-build - Update build-self-hosted.yml: HF_WEBUI_VERSION -> HF_UI_VERSION - Rename webui-download.cmake -> ui-download.cmake (internal refs updated) - Update labeler.yml: server/webui -> server/ui path label * docs: update CODEOWNERS and server README docs - Update CODEOWNERS: team ggml-org/llama-webui -> ggml-org/llama-ui, path /tools/server/webui/ -> /tools/ui/ - Update server README.md: CLI tables show --ui flags with deprecated --webui aliases - Update server README-dev.md: "WebUI" -> "UI", paths updated to tools/ui/ * fix: Small fixes for UI build * fix: CMake.txt syntax * chore: Formatting * fix: `.editorconfig` for llama-ui * chore: Formatting * refactor: Use `APP_NAME` in Error route * refactor: Cleanup * refactor: Single migration service * make llama-ui a linkable target * fix: UI Build output * fix: Missing change * fix: separate llama-ui npm build output into build/tools/ui/dist subfolder + use cmake npm build instead of downloading ui-build.yml artifacts in CI * refactor: UI workflows cleanup --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2026-05-16 02:02:40 +02:00
Aleksander Grygier	253ba110bc	webui: Move static build output from repo code to HF Bucket (#22937 ) * ci: add workflow to publish webui to Hugging Face bucket * ci: add webui release job to release workflow * ci: test webui release job * chore: Return to default minification strategy for build output files * ci: extract webui build into separate workflow and job * chore: Ignore webui static output + clean up references * chore: Delete legacy webui static output * chore: Ignore webui build static output * fix: Workflow * fix: Versioning naming * chore: Update package name * test: Test CI fix * refactor: Naming * server: implement webui build strategy with HF Bucket support * chore: Remove test workflow * chore: Use WebUI build workflow call in other workflows * server: HF Buckets fallback for WebUI build * refactor: App name variable * refactor: Naming * fix: Retrieve loading.html * fix: workflow syntax * fix: Rewrite malformed release.yml * fix: Req param * test: Re-add missing Playwright installation for CI tests * refactor: Logic & security improvements * refactor: Retrieve publishing jobs and DRY the workflows * fix: Test workflow syntax * fix: Upstream Release Tag for test workflow * chore: Remove test workflow * ci: Run WebUI jobs on `ubuntu-24.04-arm` * refactor: Post-CR cleanup Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * refactor: CI cleanup * refactor: Cleanup * test: Test workflow * refactor: use LLAMA_BUILD_NUMBER instead of LLAMA_BUILD_TAG for HF Bucket webui downloads * server: add fallback mechanism for HF Bucket webui downloads from latest directory * fix: Incorrect argument order in file(SHA256) calls for checksum verification * refactor: Use cmake script for handling the HF Bucket download on build time * feat: support local npm build for WebUI assets * refactor: add `HF_ENABLED` flag to control WebUI build/download provisioning * refactor: Cleanup * chore: Remove test workflow * fix: remove s390x from release workflow * fix: add webui-build dependency to ubuntu-22-rocm and windows-hip * Revert "fix: remove s390x from release workflow" This reverts commit debcfffa9bc1e3112eae41f2d29741b682e4eb19. * fix: Release workflow file * fix: Proper release tag used for HF Bucket upload * fix: Remove duplicate steps in release workflow --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-14 13:21:41 +02:00
Georgi Gerganov	67b2b7f2f2	logs : reduce (#23021 ) * logs : reduce * args : fix envs * server : fix build * common : print verbosity level at start * server : clean-up logs * server : print prompt processing timings + sampling params * minor : whitespaces	2026-05-14 13:05:52 +03:00
Georgi Gerganov	634275fbbb	spec : update CLI arguments for better consistency (#22964 ) * spec : update CLI arguments for better consistency * cont : fix CLI arg message	2026-05-13 09:15:39 +03:00
Georgi Gerganov	68e7ea3eab	spec : parallel drafting support (#22838 ) * spec : refactor * spec : drop support for incompatible vocabs * spec : update common_speculative_init() * cont : pass seq_id * cont : dedup ctx_seq_rm_type * server : sketch the ctx_dft decode loop * server : draft prompt cache and checkpoints * server : improve ctx names * server, spec : transition to unified spec context * cont : sync main and drft contexts * cont : async drft eval when possible * cont : handle non-ckpt models * cont : pass correct n_past for drafting * cont : process images throught the draft context * spec : handle draft running out of context * server : fix mtmd draft processing * server : fix URL for draft model * server : add comment * server : clean-up + dry * speculative-simple : update * spec : fix n_past type * server : fix slot ctx_drft ptr * tools : update readme * naming : improve consistency * spec : refactor for multi-sequence speculative context * cont : prepare params * cont : prepare params * spec : support parallel drafts * server : support parallel drafting * llama : reuse device buffers when possible * server, spec : clean-up * cont : clean-up * cont : minor * spec : reset `drafting` flag at the end * spec : introduce `common_speculative_process()` * spec : allow for multiple spec types (chain of speculators) * replace old type field of type common_speculative_type in the common_params_speculative struct with a vector to allow multiple types to be specified * introduce common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>) to figure out which implementations the user has enabled * introduce common_speculative_type_from_names(const std::vector<std::string> & names) to parse the already user provided spec types * all speculators run sequentially, best one wins (we verify its drafted tokens) * maximize expected accepted tokens for current round by calculating the product between the probability of accepting current token (n_acc_tokens / n_gen_drafts) and the draft's length --------- Co-authored-by: Petros Sideris <petros.sideris@nokia.com>	2026-05-11 19:09:43 +03:00
Georgi Gerganov	14e733e36f	spec : refactor params (#22397 ) * spec : refactor params * cont : fix * cont : rename "sparam" to "sampling" * cont : add spec params category * cont : add info about removed arguments * cont : skip param length check for spec params * cont : adapt server tests	2026-04-28 09:07:33 +03:00
Matthias Straka	0dd7f915fd	cli : cleanup auto-completion code (#21745 )	2026-04-23 15:03:28 +02:00
Ethan Turner	750579ff14	common: Refactoring sampler parameters (#20429 ) (#22233 ) This change refactors the reasoning_budget_message parameter from the common params into the sampling parameters specifically. It also removes the reasoning_budget common parameter and standardizes on the existing reasoning_budget_tokens parameter in the sampling configuration. Issue: https://github.com/ggml-org/llama.cpp/issues/20429 Original PR: https://github.com/ggml-org/llama.cpp/pull/20297	2026-04-22 10:40:19 +02:00
Georgi Gerganov	cfe9838d26	fit-params : refactor + add option to output estimated memory per device (#22171 ) * fit-params : add option to output estimated memory per device * cont : minor * cont : refactor * cont : move fit params implementation to libcommon * cont : header * cont : headers * cont : codeowners	2026-04-21 09:54:36 +03:00
Georgi Gerganov	de71b5f81c	server : refactor "use checkpoint" logic (#22114 )	2026-04-20 08:42:37 +03:00
Yes You Can Have Your Own	9d49acb2a7	server: rename --clear-idle to --cache-idle-slots (#21741 )	2026-04-20 08:30:24 +03:00
Sascha Rogmann	455d8e4be8	server : speculative checkpointing (#19493 ) * server : speculative decoding using checkpoints * server : fix draft check with checkpoints * server : rename spec vars * server : log levels * server : refactored spec logic to speculative.cpp * server : renamed spec checkpoints option * server : fix spec checkpoints, logging * speculative : checkpoints with draft model, logging * server : n_tokens_cur and create_checkpoint in draft * server : fix server_speculative_callback (slot.id) * spec : fix ngram-map/begin idx_last_check * spec : init ckpt (begin() wasn't called) * chore: update webui build output * server : restore sampler in spec checkpoint and clear mem * cont : avoid --spec-use-checkpoints argument * cont : remove server_prompt_checkpoint_with_size * spec : rename (leave_draft_state) * cont : clean-up * cont : do not ignore partial drafts even if the are short * cont : spec callback owned by session * cont : simplify * cont : avoid empty speculative session * cont : simplify * cont : simplify * cont : enable mtmd speculative decoding * cont : keep the spec sampler alive * cont : simplify * cont : fix nullptr deref + draft checkpoints * cont : remove common_speculative_accept_response * cont : remove callback * cont : simplify * cont : minor * cont : simplify * cont : fix accepted number --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-19 10:24:06 +03:00
Georgi Gerganov	6990e2f1f7	libs : rename libcommon -> libllama-common (#21936 ) * cmake : allow libcommon to be shared * cmake : rename libcommon to libllama-common * cont : set -fPIC for httplib * cont : export all symbols * cont : fix build_info exports * libs : add libllama-common-base * log : add common_log_get_verbosity_thold()	2026-04-17 11:11:46 +03:00
Yes You Can Have Your Own	50e0ad08fb	server: save and clear idle slots on new task (`--clear-idle`) (#20993 ) * server: clear idle slots KV from VRAM (LLAMA_KV_KEEP_ONLY_ACTIVE) * server: move idle slot KV clearing to slot release The save "cost" is now paid by the finishing request. * server: add --kv-clear-idle flag, enable by default * server: skip clearing last idle slot, clear on launch * server: test --no-kv-clear-idle flag * server: simplify on-release clearing loop * server: remove on-release KV clearing, keep launch-only * cont : clean-up * tests: update log strings after --clear-idle rename * tests: use debug tags instead of log message matching * test: fix Windows CI by dropping temp log file unlink --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-03 19:02:27 +02:00
Ruben Ortlam	5803c8d115	tests: allow exporting graph ops from HF file without downloading weights (#21182 ) * tests: allow exporting graph ops from HF file without downloading weights * use unique_ptr for llama_context in HF metadata case * fix missing non-required tensors falling back to type f32 * use unique pointers where possible * use no_alloc instead of fixing f32 fallback * fix missing space	2026-04-02 18:19:20 +02:00
Sigbjørn Skjæret	c46758d28f	cli : add /glob command (#21084 ) * add /glob command * output error when max files reached * support globbing outside curdir	2026-03-28 02:33:04 +01:00
Adrien Gallouët	5c1a7b8355	server : add custom socket options to disable SO_REUSEPORT (#21056 ) * server : add custom socket options to disable SO_REUSEPORT Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add --reuse-port $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2 --reuse-port setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 setsockopt(3, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0 bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Update tools/server/README.md (llama-gen-docs) Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Fix windows Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-28 01:12:43 +01:00
Xuan-Son Nguyen	20197b6fe3	server: add built-in tools backend support (#20898 ) * wip: server_tools * refactor * displayName -> display_name * snake_case everywhere * rm redundant field * change arg to --tools all * add readme mention * llama-gen-docs	2026-03-27 10:07:11 +01:00
Piotr Wilkin (ilintar)	5e54d51b19	common/parser: add proper reasoning tag prefill reading (#20424 ) * Implement proper prefill extraction * Refactor cli parameters, update docs, move reasoning budget sampler part to common/reasoning-budget.cpp * Update tools/server/server-task.cpp * refactor: move grammars to variant, remove grammar_external, handle exception internally * Make code less C++y Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-19 16:58:21 +01:00
Piotr Wilkin (ilintar)	d2ecd2d1cf	common/parser: add `--skip-chat-parsing` to force a pure content parser. (#20289 ) * Add `--force-pure-content` to force a pure content parser. * Update common/arg.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Change parameter name [no ci] --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-17 16:16:43 +01:00
Ruben Ortlam	128142fe7d	test-backend-ops: allow loading tests from file and parsing model operators into file (#19896 ) * tests: allow loading test-backend-ops tests from json * add error threshold based on op * add error when file cannot be read * add graph operator json extraction tool * add nb parameter for non-contiguous input tensors * fix view check * only use view if non-contiguous/permuted, use C++ random instead of rand() * replace internal API calls with public llama_graph_reserve call * reduce test description length * fix nb[0] not getting set for view * add name to tests * fix inplace error * use text file instead of json * move llama_graph_reserve function to new llama-ext header, move export-graph-ops to tests/ * fix missing declaration * use pragma once * fix indent * fix Windows build	2026-03-12 13:26:00 +01:00
ddh0	4a748b8f15	common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (#20416 )	2026-03-12 00:13:28 +01:00
Piotr Wilkin (ilintar)	acb7c79069	common/parser: handle reasoning budget (#20297 ) * v1 * Finished! * Handlie cli * Reasoning sampler * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Less explosive terminology :) * Add utf-8 case and tests * common : migrate reasoning budget sampler to common * cont : clean up * cont : expose state and allow passing as initial state * cont : remove unused imports * cont : update state machine doc string --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Alde Rojas <hello@alde.dev>	2026-03-11 10:26:12 +01:00
Johannes Gäßler	a976ff081b	llama: end-to-end tests (#19802 ) * tests: add end-to-end tests per model architecture * fixup for rebase * fix use-after-free in llama-model-loader.cpp * fix CI * fix WebGPU * fix CI * disable CI for macOS-latest-cmake-arm64 * use expert_weights_scale only if != 0.0f * comments	2026-03-08 12:30:21 +01:00
Piotr Wilkin (ilintar)	f5ddcd1696	Checkpoint every n tokens: squash (#20087 )	2026-03-06 11:39:26 +01:00

1 2 3 4 5 ...

321 Commits