Commit Graph

  • 2e76e01360 vulkan: fuse mul_mat+add and mul_mat_id+add_id (#16868) b6909 Jeff Bolz 2025-11-01 00:45:28 -05:00
  • d3dc9dd898 CUDA: Remove unneded bias/gate dims in fused mmvq (#16858) b6908 Oliver Simons 2025-11-01 06:13:26 +01:00
  • bea04522ff refactor : llama-model.cpp (#16252) b6907 Piotr Wilkin (ilintar) 2025-10-31 23:40:23 +01:00
  • 0de0a01576 model : Minimax M2 (#16831) b6906 Piotr Wilkin (ilintar) 2025-10-31 21:20:47 +01:00
  • e58d585604 model : add Granite Hybrid nano types (#16896) b6905 Giuseppe Scrivano 2025-10-31 21:20:07 +01:00
  • 31c511a968 CUDA: Volta tensor core support for MMF (#16843) b6904 Johannes Gäßler 2025-10-31 15:57:19 +01:00
  • 6d39015a74 sync : ggml Georgi Gerganov 2025-10-31 16:25:50 +02:00
  • 4146d6a1a6 CUDA: add expert reduce kernel (#16857) Aman Gupta 2025-10-31 20:05:07 +08:00
  • 8da3c0e200 batch : fix consistency checks for the input positions (#16890) b6901 Georgi Gerganov 2025-10-31 13:50:33 +02:00
  • c22473b580 server : don't print user inputs to console (#16871) b6900 Georgi Gerganov 2025-10-31 10:54:19 +02:00
  • 0f715b4e75 server : fix typos in server.cpp comments [no ci] (#16883) Daniel Bevenius 2025-10-31 09:51:26 +01:00
  • d2d931f173 vulkan: disable spirv-opt for rope shaders (#16872) b6898 Jeff Bolz 2025-10-31 02:34:47 -05:00
  • 2976b0374d vulkan: Fix crash when FP16 mul_mat accumulation is not supported (#16796) b6897 Masato Nakasaka 2025-10-31 16:18:59 +09:00
  • d2a2673dd1 vulkan: fix shmem overrun in mmq id shader (#16873) b6896 Ruben Ortlam 2025-10-31 08:14:49 +01:00
  • 13002a0896 ggml-hexagon: respect input size when getting/setting tensor data (#16836) b6895 l3utterfly 2025-10-31 12:46:31 +08:00
  • 6eb208d17e ci : enable free-disk-space on cuda docker build (#16877) b6894 Sigbjørn Skjæret 2025-10-31 00:34:27 +01:00
  • 9984cbb61d opencl: fix boundary handling for mul_mm (#16875) lhez 2025-10-30 16:00:20 -07:00
  • ce18efeaf1 convert : update transformers requirements (#16866) RodriMora 2025-10-30 23:15:03 +01:00
  • 16724b5b68 server : bump request URI max length to 32768 (#16862) b6891 chansikpark 2025-10-30 14:22:23 -04:00
  • b52edd2558 server : remove n_past (#16818) b6890 Georgi Gerganov 2025-10-30 18:42:57 +02:00
  • a4b54f2697 cont : add warning about unsupported ops Georgi Gerganov 2025-10-30 18:15:34 +02:00
  • 3aa835bfe6 clip : use FA Georgi Gerganov 2025-10-29 11:28:45 +02:00
  • 517b7170e1 cpu: introduce chunking for repack matmuls and enable matmul-id chunking on ARM64 (#16833) b6889 Max Krasnyansky 2025-10-30 09:06:13 -07:00
  • 835e918d84 common: fix typo in cli help text (#16864) b6888 Shagun Bera 2025-10-30 21:17:31 +05:30
  • d261223d24 model: add support for qwen3vl series (#16780) b6887 JJJYmmm 2025-10-30 23:19:14 +08:00
  • dcca0d3ab8 cpu: introduce chunking for flash attention (#16829) b6886 Max Krasnyansky 2025-10-30 05:26:05 -07:00
  • bacddc049a model: Add support for CogVLM model (#15002) b6885 Tianyue-Zhao 2025-10-30 07:18:50 -04:00
  • 229bf68628 cuda : fix argsort with 64k+ rows (#16849) b6884 Sigbjørn Skjæret 2025-10-30 08:56:28 +01:00
  • d7395115ba llama : use std::abs instead of abs (#16853) b6883 Jan Boon 2025-10-30 14:30:58 +08:00
  • 052df28b0e vulkan: Handle argsort with a large number of rows (#16851) b6882 Jeff Bolz 2025-10-30 01:27:41 -05:00
  • 8b11deea46 Hide latency of bias and gate-loading (#16847) b6881 Oliver Simons 2025-10-30 04:34:15 +01:00
  • b9ce940177 vulkan: Fuse rope+set_rows (#16769) b6880 Jeff Bolz 2025-10-29 15:13:10 -05:00
  • 3464bdac37 llama: fix ASAN error with M-RoPE (#16848) b6879 Xuan-Son Nguyen 2025-10-29 20:11:39 +01:00
  • e3af5563bd llama: store mrope data in KV cell (#16825) b6878 Xuan-Son Nguyen 2025-10-29 18:09:18 +01:00
  • 10fcc41290 vulkan: Update topk_moe fusion to handle gpt's late softmax (#16656) b6877 Jeff Bolz 2025-10-29 08:44:29 -05:00
  • bcf5bda6f5 Vulkan MMQ Integer Dot Refactor and K-Quant support (#16536) b6876 Ruben Ortlam 2025-10-29 14:39:03 +01:00
  • 3eb2be1ca5 Hexagon Op queue & dispatch optimizations (#16820) b6875 Max Krasnyansky 2025-10-29 06:29:12 -07:00
  • e41bcce8f0 CUDA: use fastdiv in set-rows (#16834) b6874 Aman Gupta 2025-10-29 21:11:53 +08:00
  • 144a4ce824 vendor : sync minja (#16500) b6873 Sigbjørn Skjæret 2025-10-29 14:09:50 +01:00
  • f549b0007d vulkan: Call ggml_vk_buffer_write_2d from ggml_vk_buffer_copy (#16793) b6872 Jeff Bolz 2025-10-29 03:53:04 -05:00
  • 9a3ea685b9 CUDA: Fix bug in topk-moe for gpt-oss (#16821) b6871 Aman Gupta 2025-10-29 15:55:06 +08:00
  • 338074c383 sycl: add RMS_NORM_BACK operation support (#16808) b6870 YaelLogic 2025-10-29 08:14:39 +02:00
  • 851553ea6b cuda: add SET operation support (#16804) b6869 YaelGitAccount 2025-10-28 21:10:28 +02:00
  • 85a7d8677b memory : remove KV cache size padding (#16812) b6868 Georgi Gerganov 2025-10-28 20:19:44 +02:00
  • a8ca18b4b8 llama-bench : clarify benchmarked parts of the computation (#16823) Georgi Gerganov 2025-10-28 19:41:43 +02:00
  • 8284efc35c initialise buffer.device in ggml_hexagon_session (#16816) b6866 l3utterfly 2025-10-28 23:16:20 +08:00
  • 1c1409e131 embedding: add raw option for --embd-output-format (#16541) b6865 Sam Malayek 2025-10-28 03:51:41 -07:00
  • 7a0e900e36 llama: consistent ctx <-> buf order for KV cache (#16746) b6864 Johannes Gäßler 2025-10-28 11:23:54 +01:00
  • 280d97be96 grammar : support array references in json schema (#16792) b6863 Aldehir Rojas 2025-10-28 03:37:52 -05:00
  • 3479efd112 CANN: Improve device ID handling and aclnnArange checks (#16752) b6862 Chenguang Li 2025-10-28 10:54:53 +08:00
  • 463bbf20bf CUDA: add unused vars to mmvf and mmvq (#16807) b6861 Aman Gupta 2025-10-28 10:31:21 +08:00
  • ad8d36beff sycl: add SSM_CONV operation support (#16800) b6860 tamarPal 2025-10-28 03:50:33 +02:00
  • c053e18a66 chat: Add LFM2 tool handling (#16763) b6859 Yuri Khrustalev 2025-10-27 18:54:01 -04:00
  • e1ab084803 mtmd : fix idefics3 preprocessing (#16806) b6858 Xuan-Son Nguyen 2025-10-27 23:12:16 +01:00
  • 5a4ff43e7d llama : disable pipeline parallelism if compute buffer allocation fails (#16748) b6857 Diego Devesa 2025-10-27 13:51:28 -07:00
  • 10640e31aa ggml : fix interpolate with align-corners and ne=1 (#16700) b6856 Acly 2025-10-27 21:50:22 +01:00
  • 80d28f104c HIP: fix AMDGPU_TARGETS, update documentation (#16803) b6855 Johannes Gäßler 2025-10-27 21:39:49 +01:00
  • c55d53acec model : add LightOnOCR-1B model (#16764) b6854 Xuan-Son Nguyen 2025-10-27 16:02:58 +01:00
  • 945501f5ea llama: fix leaked buffers for mmap + split files (#16765) b6853 Johannes Gäßler 2025-10-27 09:17:31 +01:00
  • 75cbdd3fce test-backend-ops: print failed tests at the end (#16785) b6852 Aman Gupta 2025-10-27 09:25:10 +08:00
  • 2b9bd9bf4e sycl: add ROLL operation support (#16665) b6851 tamarPal 2025-10-27 03:20:24 +02:00
  • 59fc1ec8e8 sycl: add REPEAT_BACK operation support (#16734) b6850 shani-f 2025-10-27 03:19:50 +02:00
  • 75d33b9302 CUDA: support for weight clamp in top-k norm (#16702) b6849 Aman Gupta 2025-10-27 09:06:16 +08:00
  • 3470a5c891 ggml-alloc : make gallocr prefer chunks that allow memory reuse (#16788) b6848 Acly 2025-10-26 23:19:03 +01:00
  • bd562fe4f7 cuda : use fast copy when src and dst are of different type and contiguous (#16789) b6847 Sigbjørn Skjæret 2025-10-26 21:31:41 +01:00
  • bbac6a26b2 ggml: fix cuda kernel launch configuration for k_compute_batched_ptrs to support large batch (#16744) b6846 leejet 2025-10-27 02:13:31 +08:00
  • 73a48c9790 convert : enable expert group selection for all models with it (#16691) b6845 Sigbjørn Skjæret 2025-10-26 17:21:23 +01:00
  • f696428ce8 graph : add clamping to ffn_moe_weights_sum to avoid div-by-zero (#16655) b6844 Sigbjørn Skjæret 2025-10-26 17:20:32 +01:00
  • 7cce4f8158 model : set res->t_embd in SmallThinker models (#16782) b6843 Sigbjørn Skjæret 2025-10-26 16:08:52 +01:00
  • 8d8862829c docs : add Jamba to Text-only models list (#16778) amirai21 2025-10-26 14:01:20 +02:00
  • f77c13b91f CUDA: General GEMV fusion (#16715) b6841 Aman Gupta 2025-10-26 19:28:04 +08:00
  • 3cfa9c3f12 vulkan: deduplicate Microsoft Direct3D12 devices (#16689) b6840 Gilad S. 2025-10-26 06:37:38 +02:00
  • 5d195f17bc convert : handle mmproj filename/path properly (#16760) b6839 Galunid 2025-10-25 20:41:36 +02:00
  • 226f295f4d model : set res->t_embd in PLaMo2 models (#16766) b6838 Shunta Saito 2025-10-25 19:26:27 +09:00
  • f90b4a8efe vulkan: delete dead code (#16732) b6837 Giuseppe Scrivano 2025-10-25 10:59:54 +02:00
  • 8423d01931 vulkan: Optimize SSM_SCAN (#16645) b6836 Jeff Bolz 2025-10-25 00:04:12 -05:00
  • 5cca2542ac convert : avoid dequantizing mxfp4 for GPT-OSS (#16756) b6835 compilade 2025-10-24 20:52:00 -04:00
  • 55945d2ef5 ggml: fix CUDA grid launch condition for large block_nums.y in binbcast (#16742) b6834 leejet 2025-10-25 03:39:37 +08:00
  • 0bcb40b48c CUDA: use CUB for arbitary size argsort (#16754) b6833 Aman Gupta 2025-10-24 20:46:19 +08:00
  • 69e9ff0103 webui: support q URL parameter (#16728) Florian Badie 2025-10-24 14:10:29 +02:00
  • d7f794eadb convert : avoid dequantizing mxfp4 for GPT-OSS compilade/fix-prequant-mxfp4-gpt-oss Francis Couture-Harpin 2025-10-24 07:46:34 -04:00
  • 5a91109a5d model-conversion : add trust_remote_code for orig model run [no ci] (#16751) Daniel Bevenius 2025-10-24 12:02:02 +02:00
  • f8f071fadd convert : handle pre-quantized models (#14810) b6830 compilade 2025-10-23 16:31:41 -04:00
  • 0bf47a1dbb server: add memory breakdown print (#16740) b6829 Johannes Gäßler 2025-10-23 21:30:17 +02:00
  • 93fbd407f3 Merge branch 'master' into compilade/convert-prequant compilade/convert-prequant Francis Couture-Harpin 2025-10-23 14:23:12 -04:00
  • dd62dcfab9 convert : Make mistral-common dependency optional (#16738) Julien Denize 2025-10-23 15:54:46 +02:00
  • d0660f237a mtmd-cli : allow using --jinja (#16718) b6827 Xuan-Son Nguyen 2025-10-23 15:00:49 +02:00
  • fe6a9882ac Manually link -lbsd to resolve flock symbol on AIX (#16610) b6826 Prajwal B Mehendarkar 2025-10-23 17:07:31 +05:30
  • 061f0eff02 ggml-cuda: use passed ops instead of hardcoded ops (#16712) b6825 Aman Gupta 2025-10-23 19:14:06 +08:00
  • 8cf6b42d46 server : send partial stop string when <EOG> is reached (#15007) b6824 matteo 2025-10-23 11:32:24 +02:00
  • 9de9672adb sycl: use async memory allocation to fix crashes during graph recording (#16644) b6823 Matthew Michel 2025-10-22 20:05:15 -05:00
  • 63d2fc46e1 Add experimental ggml-hexagon backend for the Hexagon NPU (#16547) b6822 Max Krasnyansky 2025-10-22 13:47:09 -07:00
  • a2e0088d92 Revert "ggml : Leverage the existing GGML_F32_VEC helpers to vectorize ggml_v…" (#16723) b6821 Diego Devesa 2025-10-22 11:20:55 -07:00
  • 9b9201f65a webui: introduce OpenAI-compatible model selector in JSON payload (#16562) Pascal 2025-10-22 16:58:23 +02:00
  • 19a5a3edfd ggml : Leverage the existing GGML_F32_VEC helpers to vectorize ggml_vec_set_f32 for faster fills (#16522) sirus20x6 2025-10-22 05:14:14 -05:00
  • d8eaa26e4d tests : fix test-thread-safety when compiling with multiple backends (#16699) b6818 Acly 2025-10-22 12:01:22 +02:00
  • 9285325ce0 CUDA: fix bug in topk-moe softmax (#16711) b6817 Aman Gupta 2025-10-22 12:33:08 +08:00
  • 03792ad936 CUDA: topk-moe: add optional parameter for gpt-oss (#16649) b6816 Aman Gupta 2025-10-21 22:40:38 +08:00
  • 51d1a8c997 CUDA: better error for FA kernel with 0 occupancy (#16643) b6815 Johannes Gäßler 2025-10-21 15:27:53 +02:00
  • 4926419c4d ggml: add ggml_can_fuse_subgraph (#16662) b6814 Aman Gupta 2025-10-21 16:43:14 +08:00