Commit Graph

  • 38f7c28795 server: can_speculate() tests self-spec Sascha Rogmann 2026-01-02 00:10:46 +01:00
  • e3e809cc01 can_speculate() includes self-speculation Sascha Rogmann 2026-01-02 00:17:53 +01:00
  • 1faeb628db server: moved self-call into speculative.cpp Sascha Rogmann 2025-12-31 00:55:39 +01:00
  • 1fb2658b0d server: introduce self-speculative decoding Sascha Rogmann 2025-12-29 20:46:32 +01:00
  • 8f91ca54ec CUDA: re-use MLA K data for V in MMA FA (#19057) b7822 Johannes Gäßler 2026-01-24 10:09:36 +01:00
  • 81ab64f3c8 ggml-cuda: enable cuda-graphs for n-cpu-moe (#18934) b7821 Aman Gupta 2026-01-24 14:25:20 +08:00
  • 8af1f5f430 ggml-hexagon: flash-attn opt (#19025) b7820 nullname 2026-01-24 14:02:07 +08:00
  • 557515be1e graph : utilize ggml_build_forward_select() to avoid reallocations (#18898) b7819 Georgi Gerganov 2026-01-23 18:22:34 +02:00
  • cb6caca191 [SYCL] use malloc to support both iGPU and dGPU in same time (#18992) b7818 Neo Zhang 2026-01-23 20:54:10 +08:00
  • b5b8fa1c8b chat : fix translategemma crash on common_chat_format_example (#19019) Xuan-Son Nguyen 2026-01-23 12:03:42 +01:00
  • a14b960bc7 model-conversion : use BUILD_DIR variable in all scripts (#19015) Daniel Bevenius 2026-01-23 09:01:36 +01:00
  • 091a46cb8d ggml-cpu: aarm64: q5_K repack gemm and gemv (and generic) implementations (i8mm) (#18860) b7815 Alberto Cabrera Pérez 2026-01-23 07:55:08 +00:00
  • a3e812811d cli : load parser definition (#19031) b7814 Aldehir Rojas 2026-01-22 20:31:22 -06:00
  • 51fa458a92 server : support preserving reasoning_content in assistant message (#18994) b7813 Xuan-Son Nguyen 2026-01-22 21:30:06 +01:00
  • a5eaa1d6a3 mla : make the V tensor a view of K (#18986) b7812 Georgi Gerganov 2026-01-22 22:09:01 +02:00
  • e2baf02162 CUDA: fix alignment check for FA (#19023) b7811 Johannes Gäßler 2026-01-22 20:39:25 +01:00
  • e34d6d03b2 convert_hf_to_gguf.py: refactor modify_tensors to call super (#18866) Aman Gupta 2026-01-23 02:58:07 +08:00
  • 9c96465f99 opencl: enable the general fp mm for non-cont input and as a fallback for specialized kqv kernel for adreno (#18970) b7809 lhez 2026-01-22 10:29:25 -08:00
  • 4e595b250a server: do not log certain endpoints (avoid log spam) (#19028) b7808 Xuan-Son Nguyen 2026-01-22 19:24:37 +01:00
  • 0e4ebeb057 quant : manual overrides of tensor types take precedence (#18952) b7807 Georgi Gerganov 2026-01-22 16:17:06 +02:00
  • 8b30840703 release: update github api (#19022) b7806 Aaron Teo 2026-01-22 21:38:02 +08:00
  • 9eb5bfec1a mtmd : update docs to use llama_model_n_embd_inp (#18999) b7805 Xuan-Son Nguyen 2026-01-22 14:36:32 +01:00
  • c6926d1d95 server: Reorder methods in server-task.cpp (#19016) b7804 손희준 2026-01-22 22:36:04 +09:00
  • b70d251076 CUDA: add gqa_ratio 4 for GLM 4.7 flash (#18953) Aman Gupta 2026-01-22 18:51:53 +08:00
  • 5516b9c16a opencl: add TRI op support (#18979) b7802 shaofeiqi 2026-01-21 22:05:54 -08:00
  • 94242a62c0 ggml-zdnn : mark zDNN buffers as non-host (#18967) b7801 Aleksei Nikiforov 2026-01-22 01:16:21 +01:00
  • 6b99a223e3 ci : update GitHub Actions versions [no ci] (#18935) Pádraic Slattery 2026-01-22 00:57:18 +01:00
  • 77078e80e5 convert : add Devstral-2 (Ministral3ForCausalLM) arch (#18972) Mariusz Woloszyn 2026-01-22 00:55:55 +01:00
  • c301172f66 jinja: support none|string (#18995) b7798 Piotr Wilkin (ilintar) 2026-01-21 19:24:37 +01:00
  • 3802d3c78f fix: Use tabular-nums for chat message statistics (#18915) Hendrik Erz 2026-01-21 18:46:01 +01:00
  • 9da3dcd753 llama : clarify nemotron-h.cpp comment about RoPE [no ci] (#18997) Daniel Bevenius 2026-01-21 18:31:34 +01:00
  • bd544c94a3 vulkan: Remove transfer_ctx, do everything in compute_ctx. (#18945) b7795 Jeff Bolz 2026-01-21 11:01:40 -06:00
  • 14be5a39b1 common : improve error message when HTTPS is missing but required (#18987) b7794 Adrien Gallouët 2026-01-21 17:58:38 +01:00
  • fbbf3ad190 server: /v1/responses (partial) (#18486) b7793 손희준 2026-01-22 01:47:23 +09:00
  • 33f890e579 vulkan: support flash attention GQA/split_k with small batches (#18938) b7792 Jeff Bolz 2026-01-21 10:43:43 -06:00
  • 067b8d7af3 Revert "vulkan: force full subgroups for flash attention to fix intel subgroup crash (#17356)" (#18831) b7791 Masato Nakasaka 2026-01-22 01:13:43 +09:00
  • 50b7f076a5 vulkan: Use mul_mat_vec_id for small values of n (#18918) b7790 Jeff Bolz 2026-01-21 09:22:02 -06:00
  • ad8d85bd94 memory : add llama_memory_hybrid_iswa (#18601) b7789 Tarek Dakhran 2026-01-21 13:30:23 +01:00
  • 12a4a47e6a Fix GLM 4.7 Lite MoE gating func (#18980) b7788 Piotr Wilkin (ilintar) 2026-01-21 12:35:20 +01:00
  • 37c35f0e1c gguf: display strerrno when cant load a model (#18884) b7787 Matthieu Coudron 2026-01-21 07:52:46 +01:00
  • 5bd341c9a1 CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator (#18964) b7786 Oliver Simons 2026-01-21 02:34:29 +01:00
  • 1c7cf94b22 common, server : use the same User-Agent by default (#18957) b7785 Adrien Gallouët 2026-01-20 18:28:43 +01:00
  • 2c1f199653 cli : fix reasoning responses in CLI (#18961) b7784 Xuan-Son Nguyen 2026-01-20 18:23:25 +01:00
  • d1e3556481 CUDA: Replace init_offsets kernel with iterators in cub-based argsort (#18930) b7783 Oliver Simons 2026-01-20 13:11:01 +01:00
  • 08f3f4a8a3 ggml : cleanup path_str() (#18928) b7782 Adrien Gallouët 2026-01-20 11:42:49 +01:00
  • 271191906c metal : enable FA for MLA heads (#18950) b7781 Georgi Gerganov 2026-01-20 12:21:28 +02:00
  • 8b407e3978 quant : manual overrides of tensor types take precedence gg/quant-manual-overrides Georgi Gerganov 2026-01-20 11:16:46 +02:00
  • 7dee9ff59a convert : use n_groups instead of hardcoded values in reshape (#18929) Daniel Bevenius 2026-01-20 06:55:24 +01:00
  • 6df686bee6 server : refactor oai_parser_opt, move it to server_chat_params (#18937) b7779 Xuan-Son Nguyen 2026-01-19 23:28:01 +01:00
  • 1706a6d7c6 convert : support Glm4MoeLite (#18936) ddh0 2026-01-19 16:09:20 -06:00
  • 959ecf7f23 jinja : fix undefined keys and attributes and int/float as bool (#18924) b7777 Sigbjørn Skjæret 2026-01-19 20:29:43 +01:00
  • 4037093c66 ci : run test-jinja -py on high perf [no ci] (#18916) Sigbjørn Skjæret 2026-01-19 20:29:15 +01:00
  • 18361c579c server: fix memory reservations in populate_token_probs (#18787) b7775 Lennart Austenfeld 2026-01-19 19:13:31 +01:00
  • 365a3e8c31 ggml : add ggml_build_forward_select (#18550) b7774 Georgi Gerganov 2026-01-19 20:03:19 +02:00
  • 3d55846a5c model-conversion : add BUILD_DIR variable to run-converted-model scripts (#18927) Daniel Bevenius 2026-01-19 13:12:38 +01:00
  • 287a33017b llama : Extend fallback, fix fileno for dio file, exclude case that mmap uses dio file (#18887) b7772 Julius Tischbein 2026-01-18 17:35:57 +01:00
  • 293a1565dc docs: add linux to index (#18907) Francisco Herrera 2026-01-18 05:03:35 -05:00
  • 3bfbbcc5fc winget : update komac version gg/winget-update Georgi Gerganov 2026-01-18 10:29:03 +02:00
  • fe44d35574 tests : add test-jinja -py option for cross-checking (#18906) b7770 Xuan-Son Nguyen 2026-01-18 08:14:27 +01:00
  • bbcdac0189 jinja : fix object item order (and properly implement dictsort) (#18904) b7769 Sigbjørn Skjæret 2026-01-18 03:40:06 +01:00
  • d03c45c9c5 jinja : attribute support for join, map and sort (#18883) b7768 Sigbjørn Skjæret 2026-01-18 02:53:01 +01:00
  • 10c98cbdf6 jinja : add missing tojson filter for bool (#18900) b7767 Sigbjørn Skjæret 2026-01-18 01:05:09 +01:00
  • 420960ab92 jinja : fix lexing of float literals with sign (#18901) b7766 Sigbjørn Skjæret 2026-01-18 00:57:51 +01:00
  • f55b033ae6 jinja: correct member access rule (#18905) b7765 Xuan-Son Nguyen 2026-01-18 00:48:55 +01:00
  • d1b4757ded opencl: fix q6_K mv for m=1 (#18893) lhez 2026-01-17 13:50:32 -08:00
  • 57c0beaed0 ci : add label for jinja changes (#18903) Sigbjørn Skjæret 2026-01-17 21:52:02 +01:00
  • 2fbde785bc kv-cache : optimize KQ mask construction (#18842) b7762 Georgi Gerganov 2026-01-17 15:42:42 +02:00
  • e2751545b9 cont : inline verification gg/kv-mask-opt-verify Georgi Gerganov 2026-01-17 14:33:07 +02:00
  • e08d3ac323 tests : add test-kq-mask.cpp Georgi Gerganov 2026-01-17 14:04:20 +02:00
  • a89002f07b ggml webgpu: support for backend sampling (#18880) b7761 Reese Levine 2026-01-16 16:12:43 -08:00
  • 388ce82241 ggml : extend ggml_pool_1d + metal (#16429) b7760 Thore Koritzius 2026-01-16 15:59:56 +01:00
  • 490f6f70c0 cont : fix Georgi Gerganov 2026-01-16 16:16:37 +02:00
  • bac56aef91 cont : add explanation + improve Georgi Gerganov 2026-01-16 15:38:30 +02:00
  • 6ba6a3c76f docs : update ops.md for CANN backend (#18654) hipudding 2026-01-16 20:32:17 +08:00
  • 6628f5186a kv-cache : optimize KQ mask construction Georgi Gerganov 2026-01-14 17:35:24 +02:00
  • 0802d4cfb3 ggml-blas: hide warnings from included BLAS headers (#18818) b7758 Perry Naseck 2026-01-16 06:38:25 -05:00
  • c945aaaef2 mtmd : Fix ASR for LFM2.5-Audio-1.5B (#18876) b7757 Tarek Dakhran 2026-01-16 11:23:08 +01:00
  • c15395f73c common : implement new jinja template engine (#18462) b7756 Xuan-Son Nguyen 2026-01-16 11:22:06 +01:00
  • aa1dc3770a Setting mmap and direct_io to false as default in llama-bench.cpp (#18841) b7755 Julius Tischbein 2026-01-16 09:46:51 +01:00
  • 4ea2eaac01 CANN: Remove unused ggml_cann_get_device function (#18625) b7754 Raul Torres 2026-01-16 08:34:09 +00:00
  • e20fa27a02 CANN: fix an issue where get_env was not fully renamed (#18796) b7753 Chenguang Li 2026-01-16 16:24:04 +08:00
  • baa4ba0aec CANN: support gated linear attn (#18653) b7752 hipudding 2026-01-16 16:18:49 +08:00
  • 7b78bfa984 eagle3: add support for RedHtAI eagle3 speculator series models ruixiangw 2026-01-16 00:54:14 +00:00
  • 785a710085 OpenCL: add SOLVE_TRI op support (#18846) b7751 shaofeiqi 2026-01-15 11:17:17 -08:00
  • 6e7fc8a146 cuda : print less debug logs when disabling cuda graphs (#18868) b7750 Georgi Gerganov 2026-01-15 20:53:01 +02:00
  • be8e3d9515 context : do not reserve scheduler for warmups (#18867) b7749 Georgi Gerganov 2026-01-15 19:35:57 +02:00
  • 13f1e4a9ca llama : add adaptive-p sampler (#17927) b7748 ddh0 2026-01-15 11:16:29 -06:00
  • a04c2b06a3 server: improve slots scheduling for n_cmpl (#18789) b7747 Xuan-Son Nguyen 2026-01-15 17:10:28 +01:00
  • 39173bcacb context : reserve new scheduler when graph topology changes (#18547) b7746 Georgi Gerganov 2026-01-15 16:39:17 +02:00
  • 5c662d21a3 CUDA: fix allignment on register spill for FA (#18815) b7745 Johannes Gäßler 2026-01-15 15:14:50 +01:00
  • 8cc0ba957b ggml-cpu: optimize ggml_vec_dot_bf16 for Power9 (#18837) b7744 shalinib-ibm 2026-01-15 15:01:18 +05:30
  • a7e6ddb8bd lora: make sure model keep track of associated adapters (#18490) b7743 Xuan-Son Nguyen 2026-01-15 10:24:28 +01:00
  • 2a13180100 model-loader : support bool array sliding window pattern (#18850) b7742 Sigbjørn Skjæret 2026-01-15 10:12:46 +01:00
  • ec997b4f2b tests : download models only when running ctest (#18843) b7741 Adrien Gallouët 2026-01-15 09:47:29 +01:00
  • cff777f226 hexagon: support for OP_CPY, host buffers now optional, hvx-utils refactoring and optimizations (#18822) b7740 Max Krasnyansky 2026-01-14 21:46:12 -08:00
  • 36f0132464 CUDA: Factor out and re-use block_reduce function (#18785) b7739 Oliver Simons 2026-01-15 03:44:54 +01:00
  • d98b548120 Restore clip's cb() to its rightful glory - extract common debugging elements in llama (#17914) b7738 Piotr Wilkin (ilintar) 2026-01-14 20:29:35 +01:00
  • 8fb7175576 model : clean up and fix EXAONE-MoE configuration (#18840) b7737 Junwon Hwang 2026-01-15 03:38:21 +09:00
  • 60864997fe fit-params : print signed int for -ngl param gg/fit-params-ngl Georgi Gerganov 2026-01-14 19:59:23 +02:00
  • 516a4ca9b5 refactor : remove libcurl, use OpenSSL when available (#18828) b7736 Adrien Gallouët 2026-01-14 18:02:47 +01:00