Commit Graph

  • 930e0210d1 gitignore: add AGENTS.local.md (#22246) Georgi Gerganov 2026-04-23 08:22:24 +03:00
  • 96c1db26c4 ggml-base: use MATH_LIBRARY variable instead of hardcoded 'm' (#22239) Georgi Gerganov 2026-04-23 08:22:08 +03:00
  • 4ead6fd957 [SYCL] Update oneapi 2025.3.3, Seperate SYCL build, release Ubuntu 24 package. (#22078) Neo Zhang Jianyu 2026-04-23 13:21:36 +08:00
  • 5eaee65384 convert : Handle ModelOpt produced mixed precision model during convert to GGUF (#22247) ynankani 2026-04-23 05:19:51 +00:00
  • 60b68a6279 sycl : fused MoE mul_mat_vec_q for TG (#21920) abotsis 2026-04-22 23:18:56 -06:00
  • b76429a69c ggml-webgpu: add support for im2col (#22259) b8895 Chen Yuan 2026-04-22 23:17:41 -04:00
  • 86db42e97f CUDA: fuse relu + sqr (#22249) Anav Prasad 2026-04-23 02:28:56 +00:00
  • 6217b49583 HIP: flip GGML_HIP_GRAPHS to default on (#22254) b8893 uvos 2026-04-23 02:34:31 +02:00
  • 0d0764dfd2 [WebGPU] Implement async tensor api and event api (#22099) b8892 Nikhil Jain 2026-04-22 10:52:01 -07:00
  • 6da7168312 ggml-webgpu: Add fused RMS_NORM + MUL (#21983) b8891 Masashi Yoshimura 2026-04-23 02:51:40 +09:00
  • 8bccdbbff9 chat: fix parallel_tool_calls default setting based on model capabilities, add tests for parallel tool calls and structured outputs (#22217) b8890 Piotr Wilkin (ilintar) 2026-04-22 18:10:56 +02:00
  • bcb5eeb645 speculative-simple : add checkpoint support (#22227) b8889 Georgi Gerganov 2026-04-22 15:44:45 +03:00
  • 225088ea76 sycl: Improve mul_mat_id memory efficiency and add BF16 fast path (#22119) b8888 Akarshan Biswas 2026-04-22 18:02:56 +05:30
  • 82d3f4d3b2 mtmd: also support LLAMA_ROPE_TYPE_NONE (#22242) b8887 Xuan-Son Nguyen 2026-04-22 12:16:29 +02:00
  • 17f6245168 server: ignore reasoning content from transcription api (#21905) b8886 Xuan-Son Nguyen 2026-04-22 12:10:50 +02:00
  • 7bfe60fdf9 mtmd, llama : Update HunyuanVL vision-language model support (#22037) b8885 manayang 2026-04-22 17:58:43 +08:00
  • 750579ff14 common: Refactoring sampler parameters (#20429) (#22233) b8884 Ethan Turner 2026-04-22 01:40:19 -07:00
  • 134d6e54d4 common/chat, server: refactor, move all conversion functions to common, add tests (#20690) b8883 Piotr Wilkin (ilintar) 2026-04-22 10:28:45 +02:00
  • a5355a0226 server: keep router model refcount to avoid unloading models that have running requests 0cc4m/server-router-fix-reload-deadlock Ruben Ortlam 2026-04-16 13:40:13 +02:00
  • ca7f7b7b94 ggml-webgpu(shader): support conv2d kernels. (#21964) b8882 Chen Yuan 2026-04-21 23:18:57 -04:00
  • 0dedb9ef7a hexagon: add support for FILL op (#22198) b8881 Aparna M P 2026-04-22 04:54:20 +05:30
  • 2799d933b5 ggml-webgpu: reset CPU/GPU profiling time when freeing context (#22050) b8880 Masashi Yoshimura 2026-04-22 08:05:21 +09:00
  • 04fe84b69d server: allow cancel loading model (#21814) Xuan-Son Nguyen 2026-04-22 00:26:09 +02:00
  • 5a4cd6741f Hexagon: DAIG op (#22195) b8878 Shreya Jain 2026-04-21 14:16:04 -07:00
  • 2248799a58 hexagon: fix missing v79 entry in libggml-htp.inf (#22194) Mengsheng Wu 2026-04-22 04:53:44 +08:00
  • 72d693e4fb spec : reset i_last when low acceptance streak occurs (#22168) b8876 Paul Dubs 2026-04-21 20:29:07 +02:00
  • 98d2d2884e mtmd: Add support for Reka Edge 2603 (#21616) b8875 Kwa Jie Hao 2026-04-22 02:02:49 +08:00
  • 84652b80cf arg : add --spec-default (#22223) b8874 Georgi Gerganov 2026-04-21 19:52:02 +03:00
  • 52f1096f21 openvino: driver setup, CI split, thread safety, and NPU optimizations (#21944) b8873 Zijun Yu 2026-04-21 23:58:34 +08:00
  • 606fa42f5d vendor : update cpp-httplib to 0.43.1 (#22143) b8872 Alessandro de Oliveira Faria (A.K.A.CABELO) 2026-04-21 11:45:48 -03:00
  • 7fc1c4ef78 metal : workaround macOS GPU interactivity watchdog (#22216) b8871 Georgi Gerganov 2026-04-21 17:24:55 +03:00
  • cf0ebc4e64 load directly from downloaded state Ruben Ortlam 2026-04-21 13:22:50 +02:00
  • b1623a614c handle models that need to be downloaded before estimation Ruben Ortlam 2026-04-20 14:48:55 +02:00
  • 1a8aec0afd cont : clean-up Georgi Gerganov 2026-04-16 14:32:47 +03:00
  • eb2cf73ff9 also strip models memory margin from child processes Ruben Ortlam 2026-04-13 10:14:53 +02:00
  • 69e3086190 improve variable naming, fix style Ruben Ortlam 2026-04-07 13:35:02 +02:00
  • 173da43c95 improve memory_per_device map naming Ruben Ortlam 2026-04-07 13:28:49 +02:00
  • 7500063065 fix model count exceeded check Ruben Ortlam 2026-04-02 11:39:36 +02:00
  • ba2521c6a0 move llama_context_device_memory function to llama-ext.h Ruben Ortlam 2026-04-02 11:39:07 +02:00
  • 51538c1f78 add server memory debug logging Ruben Ortlam 2026-04-02 10:07:04 +02:00
  • 56122b35ad use memory margin instead of total size limit, apply to each device separately Ruben Ortlam 2026-04-02 09:24:53 +02:00
  • 9b5af58a9a only set model memory_mb if not previously calculated Ruben Ortlam 2026-03-31 17:37:16 +02:00
  • 2603b4c5bc use no_alloc to get memory requirements for model load Ruben Ortlam 2026-03-31 16:18:03 +02:00
  • 777395f643 estimate with to-be-loaded model size included Ruben Ortlam 2026-03-29 12:18:51 +02:00
  • 8e8e200726 server: add --models-memory-max parameter to allow dynamically unloading models when they exceed a memory size threshold Ruben Ortlam 2026-03-29 10:00:49 +02:00
  • 82209efb7e vulkan: Support F16 OP_FILL (#22177) b8870 Jeff Bolz 2026-04-21 11:01:56 +02:00
  • 9998d88bc8 mtmd: correct mtmd_decode_use_mrope() (#22188) b8869 Xuan-Son Nguyen 2026-04-21 10:53:37 +02:00
  • cd03ec7642 llama-ext : fix exports (#22202) b8868 Georgi Gerganov 2026-04-21 11:04:46 +03:00
  • 4889afba5f sync : ggml Georgi Gerganov 2026-04-21 11:03:42 +03:00
  • 041fe83d74 ggml : bump version to 0.10.0 (ggml/1463) Georgi Gerganov 2026-04-21 11:02:56 +03:00
  • cfe9838d26 fit-params : refactor + add option to output estimated memory per device (#22171) Georgi Gerganov 2026-04-21 09:54:36 +03:00
  • ff6b1062af server : fix hardcoded proxy connection timeout in router mode (#18760) (#22003) b8864 xris99 2026-04-21 06:41:14 +02:00
  • 97895129e5 ggml-cuda: flush legacy pool on OOM and retry (#22155) b8863 leonardHONG 2026-04-21 05:30:38 +08:00
  • 86f8daacfe mtmd: correct get_n_pos / get_decoder_pos (#22175) b8862 Xuan-Son Nguyen 2026-04-20 23:29:19 +02:00
  • cf8b0dbda9 server : remove /api endpoints (#22165) b8861 Georgi Gerganov 2026-04-20 20:41:19 +03:00
  • fd6ae4ca1c Tensor-parallel: Fix delayed AllReduce on Gemma-4 MoE (#22129) b8860 Gaurav Garg 2026-04-20 21:55:39 +05:30
  • fb19f94c71 TP: fix 0-sized tensor slices, AllReduce fallback (#21808) b8859 Johannes Gäßler 2026-04-20 18:09:39 +02:00
  • 7f251fdbce ggml-cpu: Optimized x86 and generic cpu q1_0 dot (follow up) (#21636) b8858 pl752 2026-04-20 21:02:54 +05:00
  • a6cc43c286 ggml-webgpu: updated matrix-vector multiplication (#21738) b8857 neha-ha 2026-04-20 07:37:17 -07:00
  • 35df147d80 cont : remove /api/tags gg/server-remove-api Georgi Gerganov 2026-04-20 15:45:42 +03:00
  • a678916623 mtmd: refactor mtmd_decode_use_mrope (#22161) Xuan-Son Nguyen 2026-04-20 14:45:11 +02:00
  • c1891fd6eb server : remove /api endpoints Georgi Gerganov 2026-04-20 15:34:18 +03:00
  • 81df3f7cfa fix: GLM-DSA crash in llama-tokenize when using vocab_only (#22102) b8855 SamareshSingh 2026-04-20 02:32:46 -05:00
  • de71b5f81c server : refactor "use checkpoint" logic (#22114) b8854 Georgi Gerganov 2026-04-20 08:42:37 +03:00
  • 788fcbc5dd [SYCL] Fix reorder MMVQ assert on unaligned vocab sizes (#22035) b8853 Katostrofik 2026-04-20 01:39:45 -04:00
  • 9d49acb2a7 server: rename --clear-idle to --cache-idle-slots (#21741) b8852 Yes You Can Have Your Own 2026-04-20 08:30:24 +03:00
  • e365e658f0 vendor : update cpp-httplib to 0.42.0 (#21781) b8851 Alessandro de Oliveira Faria (A.K.A.CABELO) 2026-04-19 19:41:43 -03:00
  • 4eac5b4509 CUDA: refactor mma data loading for AMD (#22051) b8850 Johannes Gäßler 2026-04-19 18:26:59 +02:00
  • d5b780a676 common/autoparser : allow space after tool call (#22073) b8849 Aldehir Rojas 2026-04-19 06:28:35 -05:00
  • 471540ae8a HIP: Remove unesscary NCCL_CHECK (#21914) b8848 uvos 2026-04-19 12:59:44 +02:00
  • 19124078be mtmd: add pos_0 to mtmd_image_tokens_get_decoder_pos (breaking change) (#22082) b8847 Xuan-Son Nguyen 2026-04-19 11:57:21 +02:00
  • bcdcc1044f ggml : reduce CPU overhead in meta backend (#22041) b8846 Gaurav Garg 2026-04-19 15:18:35 +05:30
  • 037bfe38d0 ci : install spirv-headers for vulkan-cross (#22109) Sigbjørn Skjæret 2026-04-19 09:32:08 +02:00
  • 8685e7b075 convert : support sentence-transformer 5.4 config files (#22087) Dowon 2026-04-19 16:25:39 +09:00
  • 09b4efa95f cmake: remove CMP0194 policy to restore MSVC builds (#21934) b8843 texasich 2026-04-19 02:25:05 -05:00
  • 455d8e4be8 server : speculative checkpointing (#19493) b8842 Sascha Rogmann 2026-04-19 09:24:06 +02:00
  • 91fef95362 rpc : refactor the RPC transport (#21998) b8841 Radoslav Gerganov 2026-04-19 10:21:53 +03:00
  • 9e5647affa server: Expose media_tag on /props endpoint. (#22028) b8840 Cetarthoriphros 2026-04-18 19:27:17 -03:00
  • 4f02d47339 model : refactor bias tensor variable names (#22079) b8839 Sigbjørn Skjæret 2026-04-18 20:12:00 +02:00
  • 23b8cc4991 android : libcommon -> libllama-common (#22076) b8838 Sigbjørn Skjæret 2026-04-18 11:19:40 +02:00
  • 59accc8863 ggml-backend-meta: add multi-segment read support in get_tensor (#22063) b8837 SamareshSingh 2026-04-18 03:04:51 -05:00
  • 83d58e02fc ci : free disk space for rocm release (#22012) b8836 Sigbjørn Skjæret 2026-04-18 09:37:30 +02:00
  • 89a5474f0e convert : fix (ignore for now) typings errors (#22002) Sigbjørn Skjæret 2026-04-18 09:36:41 +02:00
  • fd1c0ec3f0 llama: fit ctx size for CPU only (#21568) Johannes Gäßler 2026-04-18 08:16:04 +02:00
  • 45cac7ca70 ggml-webgpu: fix compiler warnings and refactor FlashAttention encoding (#21052) b8833 Reese Levine 2026-04-17 09:17:11 -07:00
  • b94050e896 CUDA: use LRU based eviction for cuda graphs (#21611) b8832 Aman Gupta 2026-04-17 23:24:21 +08:00
  • a279d0f0f4 ci : add android arm64 build and release (#21647) b8831 Yuri Khrustalev 2026-04-17 05:32:24 -04:00
  • 268d61e178 mtmd: add missing struct tag (#22023) b8830 65a 2026-04-17 01:48:33 -07:00
  • 6990e2f1f7 libs : rename libcommon -> libllama-common (#21936) b8829 Georgi Gerganov 2026-04-17 11:11:46 +03:00
  • fcc7508759 model : Gemma4 model type detection (#22027) b8828 Eric Zhang 2026-04-17 16:07:11 +08:00
  • 5e6c0e18b6 opencl: refactor q8_0 set_tensor and mul_mat host side dispatch for Adreno (#21938) b8827 lhez 2026-04-16 22:28:33 -07:00
  • 30dce2cf29 cli : use get_media_marker (#22017) b8826 Sigbjørn Skjæret 2026-04-17 00:12:31 +02:00
  • 089dd41fe3 cmake: use glob to collect src/models sources (#22005) b8825 Xuan-Son Nguyen 2026-04-16 23:25:16 +02:00
  • 85dde8dc4a hexagon: optimize HMX matmul operations (#21071) b8824 nullname 2026-04-17 04:48:34 +08:00
  • 4fbdabdc61 model: using single llm_build per arch (#21970) b8823 Xuan-Son Nguyen 2026-04-16 21:10:22 +02:00
  • e45dbdece8 opencl: add q5_K gemm and gemv kernels for Adreno (#21595) b8822 shaofeiqi 2026-04-16 12:08:33 -07:00
  • 4adac43f6f server: tests: fetch random media marker via /apply-template (#21962) (#21980) b8821 Pascal 2026-04-16 19:46:21 +02:00
  • 9db77a020c model : refactor QKV into common build_qkv and create_tensor_qkv helpers (#21245) PikaPikachu 2026-04-16 23:41:34 +08:00
  • f772f6e434 model : support NVFP4 tensors for Gemma4 (#21971) Sigbjørn Skjæret 2026-04-16 16:51:47 +02:00
  • b572d1ecd6 codeowners: add team member comments (#21714) Ruben Ortlam 2026-04-16 12:13:11 +02:00