Commit Graph

  • 2a85f720b8 server : handle closed connection for tasks (#18459) b7571 Georgi Gerganov 2025-12-29 15:34:41 +02:00
  • 7cbec34a63 model-conversion : add device option to embd run orig model (#18386) Daniel Bevenius 2025-12-29 13:37:02 +01:00
  • 0c8986403b retrieval : use at most n_seq_max chunks (#18400) b7569 Héctor Estrada Moreno 2025-12-29 05:21:13 -06:00
  • daa242dfc8 common: fix return value check for setpriority (#18412) b7568 o7si 2025-12-29 17:07:49 +08:00
  • e70e640db3 CUDA: Blackwell features for non-native builds (#18436) b7567 Johannes Gäßler 2025-12-29 09:35:42 +01:00
  • 5fa66c6e67 cuda: fix race condition in cumsum (#18448) b7566 Aman Gupta 2025-12-29 14:07:17 +08:00
  • 382808c14b ci : re-enable rocm build on amd64 (#18439) b7565 Tim Neumann 2025-12-29 00:29:23 +01:00
  • 4ffc47cb20 HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (#18202) b7564 uvos 2025-12-28 20:12:55 +01:00
  • 9c675c7140 model : Plamo3 support (#17304) b7563 momonga 2025-12-29 01:28:31 +09:00
  • 07a0c4ba92 Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)" (#18426) b7562 Aman Gupta 2025-12-28 20:53:36 +08:00
  • 60f17f56da rpc: fix segfault on invalid endpoint format (#18387) b7561 o7si 2025-12-28 18:34:41 +08:00
  • f8d561eb87 llama-fit-params: fix step size for last device (#18415) b7560 Johannes Gäßler 2025-12-28 10:52:09 +01:00
  • e59efe6a78 github: update issue templates [no ci] (#18410) Johannes Gäßler 2025-12-28 10:50:56 +01:00
  • cffa5c46ea mtmd: clarify that we no longer accept AI-generated PRs (#18406) b7558 Xuan-Son Nguyen 2025-12-28 09:57:04 +01:00
  • 94de74e7b1 cmake: Added more x86_64 CPU backends when building with GGML_CPU_ALL_VARIANTS=On (#18186) b7557 Boian Berberov 2025-12-28 07:33:29 +00:00
  • 3b54531ead ci : disable mmap gg/test-mmap Georgi Gerganov 2025-12-28 09:26:51 +02:00
  • 060c0a585e ggml : include cub/cub.cuh instead of block_scan.cuh Daniel Bevenius 2025-12-28 07:49:14 +01:00
  • 82c2600585 Merge remote-tracking branch 'upstream/master' into backend-sampling Daniel Bevenius 2025-12-28 07:34:17 +01:00
  • 4fd59e8427 ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413) b7556 QDelta 2025-12-27 20:33:14 -05:00
  • 08566977a7 opencl: allow resizing transpose buffers (#18384) b7555 lhez 2025-12-27 15:51:14 -08:00
  • a4bf35889e llama-fit-params: fix overflow check (#18354) b7554 Johannes Gäßler 2025-12-27 20:20:45 +01:00
  • 026d2ad472 llama: fix magic number of 999 for GPU layers (#18266) b7553 Johannes Gäßler 2025-12-27 20:18:35 +01:00
  • 06705fdcb3 ggml-cuda: Use same regex for GGML_NATIVE=OFF (#18407) b7552 Aman Gupta 2025-12-27 19:56:27 +08:00
  • a52dc60ba3 llama_fit_params: return enum for fail vs. error (#18374) b7551 Johannes Gäßler 2025-12-27 09:59:19 +01:00
  • 9045c9afe5 llama-fit-params: fix Gemma 3 calculation (#18372) b7550 Johannes Gäßler 2025-12-27 09:56:04 +01:00
  • c9ced4910b vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (#18352) b7549 Jeff Bolz 2025-12-26 16:12:58 -06:00
  • 7ac8902133 vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (#18349) b7548 Jeff Bolz 2025-12-26 11:15:50 -06:00
  • 9bf20d8ac3 vulkan: Use BK=32 for coopmat2 mul_mat_id (#18332) b7547 Jeff Bolz 2025-12-26 11:15:02 -06:00
  • cb999704fb vulkan: small dequantization improvements (#18380) Eve 2025-12-26 17:12:11 +00:00
  • b96b82fc85 vulkan: Support UPSCALE w/antialias (#18327) b7545 Jeff Bolz 2025-12-26 10:00:57 -06:00
  • 10dc500bdb vulkan: handle rope with large number of rows (#18306) b7544 Jeff Bolz 2025-12-26 09:53:46 -06:00
  • 4893cc07bb server : fix crash when seq_rm fails for hybrid/recurrent models (#18391) b7543 o7si 2025-12-26 23:35:29 +08:00
  • af3be131c0 docs: added note for pre SYCL Intel hardware (#18016) b7542 Francisco Herrera 2025-12-25 21:34:30 -05:00
  • b07cda687c CANN: implement the SSM_CONV operator (#17737) b7541 0Marble 2025-12-26 09:12:04 +08:00
  • 85c40c9b02 ggml-cuda: fix regex for arch list (#18371) b7540 Aman Gupta 2025-12-26 01:35:14 +08:00
  • 83b3b1c271 cuda: optimize cumsum cub path (#18362) b7539 Aman Gupta 2025-12-25 23:55:38 +08:00
  • b0fb0f0aee ggml-cuda: fix blackwell native builds (#18361) b7538 Aman Gupta 2025-12-25 22:12:11 +08:00
  • e68c19b0fd CANN: Add support for CONV_TRANSPOSE_1D when kernel size > 255 (#17934) Penglin Cai 2025-12-25 16:46:09 +08:00
  • c54bba869d ggml : optimize cuda cumsum fallback kernel (#18343) b7536 Aadeshveer Singh 2025-12-25 09:41:13 +05:30
  • f5acfb2ffa server: (router) add stop-timeout option (#18350) Xuan-Son Nguyen 2025-12-24 23:47:49 +01:00
  • 4cbafad4f0 model: support MiMo-V2-Flash (#18328) Xuan-Son Nguyen 2025-12-24 23:07:08 +01:00
  • c184284230 fit-params : fix race condition in fit-params output (#18276) Aadeshveer Singh 2025-12-24 20:27:38 +05:30
  • c8a2417d7b CUDA: experimental native mxfp4 support for blackwell (#17906) Aman Gupta 2025-12-24 22:28:26 +08:00
  • 54132f1b1f model : support for LlamaBidirectionalModel architecture (#18220) b7531 Saba Fallah 2025-12-24 14:02:36 +01:00
  • 2a9ea2020c vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (#18302) b7530 Jeff Bolz 2025-12-24 05:36:34 -06:00
  • ce7a6dc0fc CANN : refactor ACL graph cache (#17752) b7529 Wang Weixuan 2025-12-24 17:50:24 +08:00
  • 1ce0126b18 docs: Fix typos in SYCL documentation (#18269) Jesse Ikonen 2025-12-24 11:19:47 +02:00
  • c0a351cc3b tests : revert server test changes (no longer needed) Georgi Gerganov 2025-12-24 10:45:58 +02:00
  • 0ce03597e8 Merge branch 'master' into HEAD Georgi Gerganov 2025-12-24 10:33:21 +02:00
  • 7f459c98e7 vulkan: use fewer FA rows for small cache runs (#18280) b7527 Ruben Ortlam 2025-12-24 08:59:14 +01:00
  • cf2ffc02bc CANN: Uses yarn_ramp cache in ROPE (#17725) b7526 TianHao324 2025-12-24 14:55:33 +08:00
  • 10355dc7d0 common: add LLAMA_ARG_OVERRIDE_TENSOR env var for -ot arg (#18267) b7525 ddh0 2025-12-24 00:19:12 -06:00
  • 5ee4e43f26 server: return_progress to also report 0% processing state (#18305) b7524 Xuan-Son Nguyen 2025-12-23 21:49:05 +01:00
  • 5b6c9bc0f3 webui: apply webui_settings on first load (#18223) Pascal 2025-12-23 15:48:03 +01:00
  • 849d021104 server: fix crash with model not having BOS/EOS (#18321) b7522 Xuan-Son Nguyen 2025-12-23 14:39:36 +01:00
  • 8e3ead6e4d model-conversion : add device option to run-org-model.py (#18318) Daniel Bevenius 2025-12-23 14:07:25 +01:00
  • 12ee1763a6 rpc : add check for rpc buffer type (#18242) b7520 Chris Rohlf 2025-12-23 04:56:49 -05:00
  • ed75977717 ggml-hexagon: create generalized functions for cpu side op (#17500) b7519 nullname 2025-12-23 15:13:24 +08:00
  • 847c35f7d5 model-conversion : add trust_remote_code for embedding scripts (#18288) Daniel Bevenius 2025-12-23 07:27:37 +01:00
  • a6a552e4ec [SYCL] replace llama-cli by llama-completion to rm the impact to test script (#18290) Neo Zhang 2025-12-23 12:59:12 +08:00
  • 96e33a814e model : fix div-by-zero for Nemotron V2 (#18309) b7516 Alessandro98-git 2025-12-23 03:04:57 +01:00
  • dfc959b886 model : Granite Embedding support (#15641) b7515 Ryan Mangeno 2025-12-22 18:28:19 -05:00
  • 8f48807380 gguf-py : do not align the data start offset (#18291) compilade 2025-12-22 14:25:16 -05:00
  • bf6bc3c155 ggml-hexagon: gelu optimization (#18151) b7513 Shouyu 2025-12-22 13:56:52 -05:00
  • 179fd82a72 gen-docs: automatically update markdown file (#18294) b7512 Xuan-Son Nguyen 2025-12-22 19:30:19 +01:00
  • d34d5ca1e9 llamafile: add rvv support for sgemm kernels (#18199) b7511 Taimur Ahmad 2025-12-22 23:20:23 +05:00
  • eb492bf43f opencl: unpack q4_0 for adreno in get_tensor (#18278) b7510 lhez 2025-12-22 10:19:01 -08:00
  • e3b35ddf1c vulkan: Extend rope fusions to allow mrope (#18264) b7509 Jeff Bolz 2025-12-22 11:03:13 -06:00
  • 5f14aa8e43 gguf-py : do not align the data start offset compilade/fix-safetensors-unaligned Francis Couture-Harpin 2025-12-22 09:49:54 -05:00
  • 6ce863c803 server: prevent data race from HTTP threads (#18263) b7508 Xuan-Son Nguyen 2025-12-22 14:23:34 +01:00
  • 3997c78e33 server: fix data race in to_json_anthropic (#18283) b7507 Xuan-Son Nguyen 2025-12-22 13:21:43 +01:00
  • ee74642982 release: update release workflow to store XCFramework as Zip file (#18284) b7506 Mattt 2025-12-22 04:11:46 -08:00
  • a28310488c convert: rework ftype heuristics (#18214) Aaron Teo 2025-12-22 20:03:49 +08:00
  • 86af848153 server: (docs) remove mention about extra_args (#18262) Xuan-Son Nguyen 2025-12-22 12:22:01 +01:00
  • 147a521636 tool/ex/tests: consistently free ctx, then model (#18168) b7503 Johannes Gäßler 2025-12-22 11:00:37 +01:00
  • f1310ab904 Merge remote-tracking branch 'upstream/master' into backend-sampling Daniel Bevenius 2025-12-22 06:46:54 +01:00
  • e1f15b454f vulkan: Implement set_tensor_async and the event interfaces (#18047) b7502 Jeff Bolz 2025-12-21 14:52:09 -06:00
  • 0e1ccf15c7 llama: fix RPC for -fit on (#18233) b7501 Johannes Gäßler 2025-12-21 19:33:08 +01:00
  • 5e25ddebff move copilot instructions to AGENTS.md (#18259) Xuan-Son Nguyen 2025-12-21 19:09:21 +01:00
  • fd05c51cec vulkan: fix im2col overflowing maxworkgroupcount (#18180) b7499 Jeff Bolz 2025-12-21 03:32:58 -06:00
  • b365c3ff01 vulkan/cuda: fix topk_moe with exp_probs_b (#18071) b7498 Jeff Bolz 2025-12-21 03:27:34 -06:00
  • cb64222b0c vulkan: support GGML_UNARY_OP_XIELU (#18062) b7497 Jeff Bolz 2025-12-21 03:17:58 -06:00
  • 6eb7081860 vulkan: in graph_optimize, try to group ADD operations (#18060) b7496 Jeff Bolz 2025-12-21 03:05:08 -06:00
  • 4117ae5557 Vulkan: some improvement on mul_mat_iq2_xs (#18031) b7495 lovedheart 2025-12-21 09:59:52 +01:00
  • 65e96a2464 docs : fix links in parsing.md (#18245) Daniel Bevenius 2025-12-21 09:35:40 +01:00
  • 9496bbb808 common : reorganize includes to prioritize vendored deps (#18222) b7493 Aldehir Rojas 2025-12-20 21:43:21 -06:00
  • ddcb75dd8a server: add auto-sleep after N seconds of idle (#18228) b7492 Xuan-Son Nguyen 2025-12-21 02:24:42 +01:00
  • 52ab19df63 tests: Avoid floating point precision false positives in SUM (#17471) b7491 Jeff Bolz 2025-12-20 13:46:46 -06:00
  • 5182dd64cd test-backend-ops: improve msvc build time (#18209) b7490 Jeff Bolz 2025-12-20 13:45:45 -06:00
  • 10b4f82d44 Added comments explaining thread block size selection logic based on row count and column size, derived from historical commit context (#18212) b7489 Aadeshveer Singh 2025-12-20 16:58:57 +05:30
  • 408616adbd server : [easy] fix per round speculative decode logging (#18211) b7488 Oleksandr Kuvshynov 2025-12-20 04:57:40 -05:00
  • 9e39a1e6a9 server: support load model on startup, support preset-only options (#18206) b7487 Xuan-Son Nguyen 2025-12-20 09:25:27 +01:00
  • 74e05131e9 ci : remove non-windows zip artifacts (#18201) b7486 Sigbjørn Skjæret 2025-12-19 22:29:46 +01:00
  • f74747d886 ci : only save ccache on master (#18207) Sigbjørn Skjæret 2025-12-19 22:29:37 +01:00
  • ce734a8a2f ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations (#17977) b7484 Alfred 2025-12-19 12:42:28 -05:00
  • 14931a826e arg: fix order to use short form before long form (#18196) b7483 Pascal 2025-12-19 18:01:56 +01:00
  • 1da013c66e Build with CCCL 3.2 for CUDA backends Oliver Simons 2025-12-19 16:10:51 +01:00
  • f99ef53d2a llama : Changing off_t to size_t for Windows (#18204) b7482 Julius Tischbein 2025-12-19 15:42:46 +01:00
  • b5ec0fd76c Update CCCL version to v3.2.0-rc2 Oliver Simons 2025-12-19 13:42:27 +01:00
  • cc0a04343e server: friendlier error msg when ctx < input (#18174) b7481 Aman Gupta 2025-12-19 19:10:00 +08:00