Commit Graph

  • 3dd95914d0 quantize: add option --tensor-type-file to llama-quantize (#18572) b7897 EugeoSynthesisThirtyTwo 2026-01-31 04:39:21 +01:00
  • ec6c7421e4 mtmd: support MiniCPM-o 4.5(vision only) (#19211) b7896 tc-mb 2026-01-31 06:19:30 +08:00
  • 1488339138 lookup, lookahead: fix crash when n_ctx not specified (#18729) b7895 Daniele Pinna 2026-01-30 21:10:24 +01:00
  • 4927795810 ngram-mod : fix build [no ci] (#19216) Georgi Gerganov 2026-01-30 21:27:27 +02:00
  • 971facc38e opencl: add optimized q8_0 mm kernel for adreno (#18871) shaofeiqi 2026-01-30 10:19:27 -08:00
  • d9a2a4bcaa sync : ggml Georgi Gerganov 2026-01-30 16:27:14 +02:00
  • dfd6106c84 cuda : fix compile warnings (whisper/0) Georgi Gerganov 2026-01-30 15:56:15 +02:00
  • bbada8bfb9 server : wrap around the "id_slot" parameter (#19207) Georgi Gerganov 2026-01-30 19:46:10 +02:00
  • 13f3ebfae1 Correctly fetch q8_1 quantize pipeline in test as needed by 8a3519b (#19194) Simon Redman 2026-01-30 11:27:16 -05:00
  • dabaa2e77a spec : add ngram-mod (#19164) Georgi Gerganov 2026-01-30 18:21:48 +02:00
  • 2e916f996a jinja : add unordered_map include to value.h [no ci] (#19205) Marcello Seri 2026-01-30 16:09:44 +01:00
  • f3bc98890c memory : clarify comments for r_l and s_l tensors [no ci] (#19203) Daniel Bevenius 2026-01-30 15:18:41 +01:00
  • c3b87cebff tests : add GQA=20 FA test (#19095) b7885 Georgi Gerganov 2026-01-30 13:52:57 +02:00
  • 0562503154 convert : add missing return statement for GraniteMoeModel (#19202) Daniel Bevenius 2026-01-30 11:12:53 +01:00
  • 83bcdf7217 memory : remove unused tmp_buf (#19199) b7883 Daniel Bevenius 2026-01-30 10:37:06 +01:00
  • b316895ff9 docs: Add LlamaLib to UI projects (#19181) Antonis Makropoulos 2026-01-30 08:54:28 +02:00
  • ecbf01d441 add tensor type checking as part of cuda graph properties (#19186) b7881 bssrdf 2026-01-29 23:57:52 -05:00
  • 1025fd2c09 sycl: implement GGML_UNARY_OP_SOFTPLUS (#19114) b7880 s8322 2026-01-30 06:01:38 +02:00
  • c7358ddf64 sycl: implement GGML_OP_TRI (#19089) b7879 RachelMantel 2026-01-30 06:00:49 +02:00
  • d284baf1b5 Fix typos in SYCL documentation (#19162) DDXDB 2026-01-30 09:46:57 +08:00
  • bd90fc74c3 ggml-webgpu: improve flastAttention performance by software pipelining (#19151) Zheyuan Chen 2026-01-29 14:05:30 -08:00
  • ce38a4db47 hexagon: enable offloading to Hexagon on Windows on Snapdragon (#19150) b7876 Todor Boinovski 2026-01-29 12:33:21 -08:00
  • 4fdbc1e4db cuda : fix nkvo, offload and cuda graph node properties matching (#19165) b7875 Georgi Gerganov 2026-01-29 18:45:30 +02:00
  • 7b7ae857f6 chat : add parsing for solar-open-100b (#18540) Aldehir Rojas 2026-01-29 09:06:15 -06:00
  • 84b0a98319 webui: Update Svelte to fix effect_update_depth_exceeded errors (#19144) Andrew Marshall 2026-01-29 09:56:39 -05:00
  • b45ef2702c jinja : do not pass empty tools and add some none filters (#19176) b7872 Sigbjørn Skjæret 2026-01-29 14:06:54 +01:00
  • f3dd7b8e68 HIP: add mmf for CDNA (#18896) b7871 yulo 2026-01-29 18:10:53 +08:00
  • eed25bc6b0 arg : add -kvu to llama-batched-bench (#19172) b7870 Georgi Gerganov 2026-01-29 08:50:47 +02:00
  • b33df266d0 ggml-zendnn : resolve ZenDNN backend cross-module symbol dependency (#19159) b7869 Vishal Singh 2026-01-29 09:58:57 +05:30
  • 3bcc990997 CUDA: refactor topk-moe to enable more models (GLM 4.7, Nemotron etc.) (#19126) b7868 Aman Gupta 2026-01-29 10:31:28 +08:00
  • d4964a7c66 sycl: fix norm kernels: l2_norm, group_norm, rms_norm by remove assert to support more cases (#19154) b7867 Neo Zhang 2026-01-29 09:20:22 +08:00
  • 50e8962f79 ci : find latest release with asset for winget (#19161) Sigbjørn Skjæret 2026-01-28 22:05:39 +01:00
  • f6b533d898 Vulkan Flash Attention Coopmat1 Refactor (#19075) b7865 Ruben Ortlam 2026-01-28 18:52:45 +01:00
  • 72d3b1898a spec : add self‑speculative decoding (no draft model required) + refactor (#18471) b7864 Sascha Rogmann 2026-01-28 18:42:42 +01:00
  • ebf5725870 convert : yield Mamba2Model/GraniteMoeModel modify_tensors (#19157) Daniel Bevenius 2026-01-28 16:49:36 +01:00
  • 0cd7032ca4 ggml-sycl: remove unused syclcompat header (#19140) b7862 Patryk Kaminski 2026-01-28 16:33:54 +01:00
  • 60368e1d73 jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147) b7861 Sigbjørn Skjæret 2026-01-28 14:40:29 +01:00
  • 88d23ad515 vulkan: handle device dedup on MacOS + Vega II Duo cards (#19058) b7860 Oleksandr Kuvshynov 2026-01-28 06:35:54 -05:00
  • 0a95026da9 doc: add build instruction to use Vulkan backend on macos (#19029) Ben Chen 2026-01-28 19:30:16 +08:00
  • b7feacf7f3 ggml: new backend for Virglrenderer API Remoting acceleration (v2) (#18718) b7858 Kevin Pouget 2026-01-28 10:49:40 +01:00
  • 6c8a04576e experiments gg/ngram-mod Georgi Gerganov 2026-01-28 09:45:07 +02:00
  • 6ad70c5a77 ggml-cpu: arm64: Q4_K scale unroll and vectorization (#19108) b7857 Alberto Cabrera Pérez 2026-01-28 07:15:56 +00:00
  • 631cbfcc7a cuda : fix "V is K view" check for non-unified KV cache (#19145) b7856 Georgi Gerganov 2026-01-28 09:15:27 +02:00
  • 2eee6c866c CUDA: tune GLM 4.7 Flash FA kernel selection logic (DGX Spark) (#19142) b7855 Georgi Gerganov 2026-01-28 09:15:11 +02:00
  • b931f81b5a server : adjust spec tests to generate up to 16 tokens (#19093) Georgi Gerganov 2026-01-28 09:11:40 +02:00
  • c5c64f72ac llama : disable Direct IO by default (#19109) b7853 Georgi Gerganov 2026-01-28 09:11:13 +02:00
  • eef375ce16 sampling : remove sampling branching in output_reserve (#18811) b7852 Daniel Bevenius 2026-01-28 05:59:30 +01:00
  • 06961e2876 ggml webgpu: Split shared state (webgpu_context) into global state and per-thread state (#18976) b7851 Nikhil Jain 2026-01-27 20:53:36 -08:00
  • f2571df8b7 ggml-zendnn : update ZenDNN git tag to main branch (#19133) b7850 Vishal Singh 2026-01-28 03:51:36 +05:30
  • 2b4cbd2834 jinja : implement mixed type object keys (#18955) b7849 Sigbjørn Skjæret 2026-01-27 19:50:42 +01:00
  • 68ac3acb43 docs: Remove duplicated word on CUDA build section (#19136) David Lima 2026-01-27 10:48:51 -03:00
  • a5bb8ba4c5 CUDA: tune GLM 4.7 Flash FA kernel selection logic (#19097) b7847 Johannes Gäßler 2026-01-27 14:28:56 +01:00
  • c0204a0893 ci : revert slim runner for winget (#19129) Sigbjørn Skjæret 2026-01-27 11:54:25 +01:00
  • 003c90352d ngram-map : take into account the input can become shorter Georgi Gerganov 2026-01-27 11:56:13 +02:00
  • be8890e721 ggml-cpu: aarm64: q6_K repack gemm and gemv (and generic) implementations (i8mm) #18860 (#18888) b7845 Alberto Cabrera Pérez 2026-01-27 09:08:10 +00:00
  • 9f8401a533 ngram-map : fix uninitialized values Georgi Gerganov 2026-01-27 11:07:18 +02:00
  • bc33838037 common : rename speculative.draftless_type -> speculative.type Georgi Gerganov 2026-01-27 10:19:36 +02:00
  • 351e798b2a Merge branch 'master' into pr/18471 Georgi Gerganov 2026-01-27 10:04:19 +02:00
  • a83c73a18a [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full (#19042) b7844 Gaurav Garg 2026-01-27 06:52:44 +00:00
  • fc3cdf32ce common : clarify HTTPS build options in error message (#19103) b7843 Daniel Bevenius 2026-01-27 06:16:00 +01:00
  • 7afdfc9b84 ggml-cpu: Enable FP16 MMA kernels on PPC (#19060) b7842 shalinib-ibm 2026-01-27 09:22:34 +05:30
  • 94eeb5967c opencl: add flattened q6_K mv (#19054) b7841 lhez 2026-01-26 19:36:24 -08:00
  • b0311c16d2 CUDA: fix padding of GQA to power of 2 in FA (#19115) Johannes Gäßler 2026-01-26 23:24:58 +01:00
  • dd23149dea CODEOWNERS: add common/ngram-map.* (#18471) Sascha Rogmann 2026-01-26 22:06:43 +01:00
  • 72f416e973 minor: comments Sascha Rogmann 2026-01-26 22:04:00 +01:00
  • 8f80d1b254 graph : fix nkvo offload with FA (#19105) b7839 Georgi Gerganov 2026-01-26 20:18:34 +02:00
  • 142cbe2ac6 ci : use new 1vCPU runner for lightweight jobs (#19107) Sigbjørn Skjæret 2026-01-26 15:22:49 +01:00
  • 1f8d36665d minor : cleanup + fix build Georgi Gerganov 2026-01-26 14:05:17 +02:00
  • a3300937e5 common : better names Georgi Gerganov 2026-01-26 13:59:08 +02:00
  • f895bca71a minor : cleanup Georgi Gerganov 2026-01-26 13:56:28 +02:00
  • 56f3ebf38e model : add correct type for GLM 4.7 Flash (#19106) b7837 Georgi Gerganov 2026-01-26 11:24:30 +02:00
  • fd4d803c60 common: print performance in spec decoding Sascha Rogmann 2026-01-26 00:20:05 +01:00
  • 288ab50597 doc: (draftless) speculative decoding Sascha Rogmann 2026-01-25 23:58:55 +01:00
  • 8ea068e5f8 spec: remove --spec-config Sascha Rogmann 2026-01-25 23:56:29 +01:00
  • 0c21677e43 CUDA: faster FA for GQA > 1 but not power of 2 (#19092) b7836 Johannes Gäßler 2026-01-25 21:19:47 +01:00
  • 9ac881767c cont : naming Georgi Gerganov 2026-01-25 21:15:15 +02:00
  • 0440bfd160 metal : fix recommendedMaxWorkingSetSize availability on legacy iOS/macOS (#19088) b7835 ccbinn 2026-01-26 02:07:19 +08:00
  • 0bf5636938 convert : yield Gemma3N custom_map tensors directly (#19091) Sigbjørn Skjæret 2026-01-25 18:03:34 +01:00
  • 924517dd38 spec : refactor Georgi Gerganov 2026-01-25 17:15:46 +02:00
  • af382c384a common: cleanup (use common_speculative_state_draft) Sascha Rogmann 2026-01-25 16:41:44 +01:00
  • bcb43163ae ggml-cpu: Use tiled FA for prompt-processing (#19012) b7833 Aman Gupta 2026-01-25 23:25:58 +08:00
  • d9c6ce46f7 kv-cache : support V-less cache (#19067) b7832 Georgi Gerganov 2026-01-25 15:48:56 +02:00
  • 70d860824a convert : fix Gemma3N, GraniteMoe and Ernie4.5Moe (#19084) Sigbjørn Skjæret 2026-01-25 13:05:05 +01:00
  • 080b161995 completion : fix prompt cache for recurrent models (#19045) b7830 Georgi Gerganov 2026-01-25 09:12:50 +02:00
  • 1243f93a2d readme: update RWKV7 model links (#19061) Molly Sophia 2026-01-25 15:11:19 +08:00
  • 24bc238303 llama: fix integer type consistency in split helpers (#18894) b7828 Jakkala Mahesh 2026-01-25 12:40:52 +05:30
  • 16639ba217 common : use two decimal places for float arg help messages (#19048) b7827 Daniel Bevenius 2026-01-25 07:31:42 +01:00
  • 9981c30130 convert : fix conversion for inheriting models that were bypassing modify_tensors (#19064) b7826 Bartowski 2026-01-24 20:36:47 -05:00
  • cb3a40277a common: moved self-spec impl to ngram-map Sascha Rogmann 2026-01-25 01:16:06 +01:00
  • e9fd8dcab4 llama-fit-params: keep explicit --ctx-size 0 (#19070) b7825 Johannes Gäßler 2026-01-24 22:13:08 +01:00
  • 4e5b83b226 GGUF: check that tensor size is representable (#19072) b7824 Johannes Gäßler 2026-01-24 21:57:51 +01:00
  • bb02f74c61 chat: fix language input for translategemma (#19052) b7823 Xuan-Son Nguyen 2026-01-24 17:58:45 +01:00
  • a1584ac80f server: cleanup (remove slot.batch_spec, rename) Sascha Rogmann 2026-01-23 23:31:32 +01:00
  • 1e29af4ea5 common: add option --spec-draftless Sascha Rogmann 2026-01-22 23:17:56 +01:00
  • eb43748b05 common: add vector of speculative states Sascha Rogmann 2026-01-21 22:46:28 +01:00
  • b38eb5907c common: add enum common_speculative_type Sascha Rogmann 2026-01-18 18:45:10 +01:00
  • 456268fa7f common: ngram map, config self-speculative decoding Sascha Rogmann 2026-01-14 23:44:23 +01:00
  • 907d094f9e server: can_speculate() requires a task instance Sascha Rogmann 2026-01-03 10:16:22 +01:00
  • f1f6584ce6 common: use %zu format specifier for size_t in logging Sascha Rogmann 2026-01-03 09:54:22 +01:00
  • 917f4bb14b server: replace can_speculate() with slot.can_speculate() Sascha Rogmann 2026-01-02 22:42:59 +01:00