Commit Graph

  • 2187e00337 StepFun 3.5 MTP (#23274) b9480 Piotr Wilkin (ilintar) 2026-06-02 17:44:35 +02:00
  • 0b7154066e common : fix state save in common_prompt_batch_decode (#23468) b9479 Daniel Bevenius 2026-06-02 15:44:15 +02:00
  • 60130d18f9 server: add SSE ping interval (#24013) b9478 Xuan-Son Nguyen 2026-06-02 14:14:55 +02:00
  • a468b89018 ci : reduce self-hosted server workflow jobs (#24012) Georgi Gerganov 2026-06-02 13:17:59 +03:00
  • d5ab0834ab docs : update HOWTO-add-model.md (#23883) Mikhail Podvitskii 2026-06-02 11:40:22 +02:00
  • 69cea5b669 ui: simplify network error handling (#23431) Marcos Del Sol Vives 2026-06-02 10:45:25 +02:00
  • f8e67fc583 ui: Add Thinking mode toggle with reasoning effort levels + improvements for Chat Form Add Action UI (#23434) b9474 Aleksander Grygier 2026-06-02 10:23:19 +02:00
  • 2365315955 kv-cache : SWA checkpoints store only non-masked cells (#23981) b9473 Georgi Gerganov 2026-06-02 11:06:29 +03:00
  • f7a0777a5c convert : support Step3.7-Flash (#23845) forforever73 2026-06-02 15:54:49 +08:00
  • 4f3a4beb8d llama : deprecate llama_set_warmup (#24009) b9471 Georgi Gerganov 2026-06-02 10:30:38 +03:00
  • 8f7f3bf141 hexagon: MUL_MAT, MUL_MAT_ID, FLASH_ATTN and GDN cleanup and optimizations for latest models (#23989) b9470 Max Krasnyansky 2026-06-01 23:40:08 -07:00
  • d178a11818 hexagon: add gelu_quick (#24007) b9469 Todor Boinovski 2026-06-01 23:19:07 -07:00
  • 354ebac8cb server: real-time reasoning interruption via control endpoint (#23971) b9468 Pascal 2026-06-02 07:26:20 +02:00
  • 1fd5f48037 clean up unused variables warnings (#23975) b9467 Anav Prasad 2026-06-01 19:38:37 -07:00
  • 210a6570ce opencl: fix compiler warnings for non-adreno path (#23922) b9466 lhez 2026-06-01 19:15:09 -07:00
  • b8275a8acc revert to using global_invocation_id for cpy shader (#23955) Masashi Yoshimura 2026-06-02 08:59:06 +09:00
  • 5dcb711666 speculative : fix n_outputs_max and remove draft-simple auto-enable (#23988) b9464 Georgi Gerganov 2026-06-01 22:26:58 +03:00
  • 5aa3a64596 nix : add nix-nodejs facilities to build Web UI (#23846) Christian Hoener zu Siederdissen 2026-06-01 20:01:26 +02:00
  • 27d9ed8397 opencl: add basic support for q5_0 and q5_1 (#23548) shaofeiqi 2026-06-01 10:06:50 -07:00
  • 335abed17d vendor : update cpp-httplib to 0.46.1 (#23980) Adrien Gallouët 2026-06-01 18:40:10 +02:00
  • de6f727aae llama: limit max outputs of llama_context (#23861) b9460 Aman Gupta 2026-06-01 23:01:38 +08:00
  • 95b8b8ec1a metal: template GLU kernels to support f16/f32 (#23882) b9459 Shrivas Shankar 2026-06-01 07:40:28 -05:00
  • 55ac0909e5 vulkan: don't hold the device mutex while compiling pipelines (#23641) b9458 Jeff Bolz 2026-06-01 07:04:01 -05:00
  • bef69f1306 vulkan: reduce host memory lock contention (#23376) b9457 Winston Ma 2026-06-01 20:03:32 +08:00
  • 5aba5364d9 vocab: add normalizer.lowercase support to WPM (#23899) o7si 2026-06-01 19:26:47 +08:00
  • 8e6fff84de TP: quantized KV cache support (#23792) b9455 Johannes Gäßler 2026-06-01 12:30:10 +02:00
  • 02a57017f6 security : disable private disclosures (#23963) Georgi Gerganov 2026-06-01 13:14:12 +03:00
  • 48b88c3b00 model: Add EXAONE 4.5 implementations (#21733) b9453 Junwon Hwang 2026-06-01 18:48:53 +09:00
  • 19620004f5 vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints (#23056) b9452 Matt Corallo 2026-06-01 09:46:48 +00:00
  • f8c0a19d46 vulkan: Removed unused functions (#23175) b9451 Winston Ma 2026-06-01 17:46:23 +08:00
  • 5254a7994d common : support manually triggering the reasoning budget end sequence (#23949) Aldehir Rojas 2026-06-01 05:37:11 -04:00
  • e22b0de60d ci : add missing Linux label to cpu-x64-high-perf runner (#23958) Georgi Gerganov 2026-06-01 10:39:59 +03:00
  • a51142497a [SYCL] Support Q4_1, Q5_0, Q5_1 in Flash-attention (#23812) Neo Zhang 2026-06-01 14:53:53 +08:00
  • 4162522688 [SYCL] Add more types in GET_ROWS OP (#23710) Neo Zhang 2026-06-01 14:53:04 +08:00
  • 44e211cecf sycl : Optimize Q3_K mul_mat by reorder (#23725) Neo Zhang 2026-06-01 14:50:55 +08:00
  • af6528e6df ci: remove redundant or duplicate jobs (#23927) b9445 Eve 2026-06-01 03:32:17 +00:00
  • 6f165c1c64 server : handle If-None-Match weak ETags (#23916) b9444 Eric Zhang 2026-06-01 05:21:08 +08:00
  • 399739d5c5 ci : limit trigger paths for the CPU workflow (#23938) Georgi Gerganov 2026-05-31 19:02:47 +03:00
  • d4c8e2c29c vocab : add tokenizer support for jina-embeddings-v2-base-zh (#18756) b9442 o7si 2026-05-31 18:37:35 +08:00
  • 3292da09f6 ui: fix ETag truncation with MSVC compiler (#23917) b9441 Eric Zhang 2026-05-31 17:21:23 +08:00
  • e6123e2080 docs : update ZenDNN docs for Q8 support (#23791) Vladislav 2026-05-31 11:26:42 +03:00
  • 9cc707ad7b vulkan: fix check results async upload issue 0cc4m/vulkan-check-results-async Ruben Ortlam 2026-05-31 09:52:49 +02:00
  • 22cadc1944 llama: only use one iGPU device by default (#23897) b9439 Ruben Ortlam 2026-05-31 08:17:47 +02:00
  • d749821db3 webui: add custom CSS injection via config (#23904) b9438 Pascal 2026-05-30 23:49:31 +02:00
  • aa46bda89b Support -fa auto in llama-bench (#23714) b9437 Gaurav Garg 2026-05-31 02:03:57 +05:30
  • d6588daa80 opencl: support bf16 by converting to f16 (#23839) b9436 lhez 2026-05-30 10:17:47 -07:00
  • d38d50e7ff ui: exclude generated build dirs from prettier and eslint so lint errors stop being masked (#23910) Pascal 2026-05-30 16:50:54 +02:00
  • 8b0e0db606 TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs (#23843) b9434 Johannes Gäßler 2026-05-30 15:48:00 +02:00
  • 2d9b7c8e98 metal : restore im2col implementation for large kernels (#23901) b9433 Georgi Gerganov 2026-05-30 15:26:13 +03:00
  • e674b1279b test: (test-llama-archs) log the config name first (#23885) b9432 Xuan-Son Nguyen 2026-05-30 12:22:38 +02:00
  • 4c4e91b799 ci : update ios-xcode release job to macos-26 (#23906) b9431 Georgi Gerganov 2026-05-30 13:21:46 +03:00
  • d48a56effb ggml : add some lsx support (#23798) b9430 Jinyang He 2026-05-30 16:53:26 +08:00
  • 6e093b80ea vulkan: add Flash Attention support for BFloat16 KV cache (#23420) Ruben Ortlam 2026-05-30 10:39:31 +02:00
  • 926b94a1bc server: allow API calls to set a lower thinking budget if a global budget is set 0cc4m/server-api-thinking-budget Ruben Ortlam 2026-05-30 08:51:06 +02:00
  • 337528571d ci : fix s390x release job (#23898) b9428 Georgi Gerganov 2026-05-30 09:21:38 +03:00
  • d4204b03a5 ci : clear cache instead of "no timestamp" keys + fix macos (#23895) Georgi Gerganov 2026-05-30 08:52:30 +03:00
  • 1738129bee llama : do not skip iGPU when only RPC devices are present (#23868) b9426 Radoslav Gerganov 2026-05-30 07:48:22 +03:00
  • 0821c5fcfd server: in SSE mode, send HTTP headers when slot starts (#23884) Xuan-Son Nguyen 2026-05-30 00:06:29 +02:00
  • 151f3a98e9 ggml-webgpu: Check earlier for WebGPU required features (#23879) Reese Levine 2026-05-29 14:16:05 -07:00
  • b22da25889 ggml-webgpu: add q4_0/q8_0 SET_ROWS (#23760) Reese Levine 2026-05-29 14:14:11 -07:00
  • 689a9a470e server-bench : add speed-bench for speculative decoding benchmarking (#23869) Ruixiang Wang 2026-05-29 23:09:47 +02:00
  • 5a46b46acd app: add llama update self updater (#23865) Pascal 2026-05-29 23:02:40 +02:00
  • 22d66b567e ui: handle audio/vnd.wave as audio WAV file (#23754) ValdikSS 2026-05-29 22:41:35 +03:00
  • 2084434e66 vocab : support tokenizer for LFM2.5-8B-A1B (#23826) Tarek Dakhran 2026-05-29 20:25:43 +02:00
  • 764f1e64a1 graph : ensure DS32 kq_mask_lid is F32 (#23864) Sigbjørn Skjæret 2026-05-29 19:55:14 +02:00
  • b5f52280fb server: remove obsolete scripts (#23870) Xuan-Son Nguyen 2026-05-29 19:47:30 +02:00
  • dc71236b6c ci : update macos release to use macos-26 runner (#23878) Georgi Gerganov 2026-05-29 20:41:57 +03:00
  • 06d26dfdff download: add option to skip_download (#23059) b9415 Xuan-Son Nguyen 2026-05-29 16:30:55 +02:00
  • da3f990a47 mtmd: Add DeepSeekOCR 2 Support (#20975) b9414 Saba Fallah 2026-05-29 16:13:51 +02:00
  • 6ed481eea4 CUDA: Check PTX version on host side to guard PDL dispatch (#23530) b9413 Oliver Simons 2026-05-29 12:28:18 +02:00
  • cb47092b00 server: bump timeout to 3600s (#23842) b9412 Xuan-Son Nguyen 2026-05-29 10:23:17 +02:00
  • 1f0aa2a696 model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (#23346) b9411 fairydreaming 2026-05-29 10:15:17 +02:00
  • 031ddb2e08 llama: use f16 mask for FA to save VRAM (#23764) b9410 Aman Gupta 2026-05-29 15:44:43 +08:00
  • fe12e422ad sync : ggml b9409 Georgi Gerganov 2026-05-29 09:53:41 +03:00
  • ea02bc37f5 ggml : bump version to 0.13.1 (ggml/1523) Georgi Gerganov 2026-05-29 09:46:12 +03:00
  • b000431a0b ngram-mod : Add missing include (#23857) Omid Azizi 2026-05-28 23:21:37 -07:00
  • eef59a7642 llama: add llm_graph_input_mtp (#23643) b9406 Aman Gupta 2026-05-29 14:17:32 +08:00
  • 98e480a32e app : move licences to llama-app (#23824) b9405 Adrien Gallouët 2026-05-29 07:46:11 +02:00
  • 241cbd41d2 cuda : disables launch_fattn PDL enrollment due to compiler bug (#23825) b9404 Andreas Kieslinger 2026-05-29 06:46:10 +02:00
  • 33c718db1f meta : Add missing buffer set in allreduce fallback !COMPUTE clear (#23480) b9403 Matt Corallo 2026-05-29 03:30:24 +00:00
  • 24c307d261 server: create checkpoint on task cancel xsn/server_checkpoint_on_cancel Xuan Son Nguyen 2026-05-28 23:19:01 +02:00
  • 19e92c33ef hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion (#23835) b9402 Max Krasnyansky 2026-05-28 14:05:54 -07:00
  • 751ebd17a5 mtmd-debug: add color and rainbow mode (#23829) b9401 Xuan-Son Nguyen 2026-05-28 20:59:14 +02:00
  • c8914ad4f4 mtmd: fix gemma 4 projector pre_norm (#23822) b9400 Xuan-Son Nguyen 2026-05-28 20:58:55 +02:00
  • 408ae2b9e5 opencl: move backend info printing into its own function (#23702) b9399 lhez 2026-05-28 11:05:42 -07:00
  • 3ef2369551 ci : run ui publish on ubuntu-slim (#23818) Sigbjørn Skjæret 2026-05-28 19:58:32 +02:00
  • 2f6c815dc4 ui: fix audio and video modality detection (#23756) ValdikSS 2026-05-28 18:36:10 +03:00
  • 445b7cef62 ci : releases use Github-hosted builds for the UI (#23823) Georgi Gerganov 2026-05-28 17:50:32 +03:00
  • 479a9a1b03 app : improve help output (#23805) b9395 Adrien Gallouët 2026-05-28 16:45:06 +02:00
  • 0b56d283bf mtmd: n_head_kv defaults to n_head (#23782) b9394 Saba Fallah 2026-05-28 16:44:36 +02:00
  • d6be3158e1 mtmd: fix gemma 4 audio rms norm eps (#23815) b9393 Xuan-Son Nguyen 2026-05-28 16:31:37 +02:00
  • dd1557907a ci : change Vulkan builds to Release to reduce ccache (#23820) Georgi Gerganov 2026-05-28 17:29:11 +03:00
  • 7fb1e70b59 arg: Add LLAMA_ARG_API_KEY_FILE environment variable for --api-key-file (#23167) b9391 Mikolaj Kucharski 2026-05-28 14:25:40 +00:00
  • d374e71e55 test-llama-archs: fix table format [no release] (#23810) Johannes Gäßler 2026-05-28 15:53:54 +02:00
  • 30af6e2b98 ggml: auto apply iGPU flag CUDA/HIP if integrated device (#23007) b9389 fl0rianr 2026-05-28 15:01:14 +02:00
  • d7be46189f mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for … (#23729) b9388 redfox 2026-05-28 20:51:14 +08:00
  • bc81d47aba CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware (#23227) b9387 Jaden_Mach 2026-05-28 08:50:25 -04:00
  • 0b246862b9 server: minor tweaks to use more cpp features (#23785) b9386 Funtowicz Morgan 2026-05-28 14:00:25 +02:00
  • a919001134 hexagon: minor refresh for HMX FA and MM (#23796) Max Krasnyansky 2026-05-28 04:49:11 -07:00
  • 48e7078ee0 vulkan: fast path for walsh-hadamard transform (#23687) b9384 Jeff Bolz 2026-05-28 06:18:43 -05:00