Commit Graph

  • 84de01a1f1 llama : use LLM_KV for quantization_version & file_type (#24802) b9741 Adrien Gallouët 2026-06-20 20:07:01 +02:00
  • ea65a4b1c8 small nits Xuan Son Nguyen 2026-06-20 19:54:31 +02:00
  • b28e3682e5 Merge branch 'master' into xsn/server_refactor_batch Xuan Son Nguyen 2026-06-20 19:48:36 +02:00
  • 53763db789 rm debug log Xuan Son Nguyen 2026-06-20 19:48:14 +02:00
  • 75f460ac28 arg: try fixing test-args-parser randomly fails (#24826) b9740 Xuan-Son Nguyen 2026-06-20 19:45:27 +02:00
  • bf36838ebd fix assert Xuan Son Nguyen 2026-06-20 19:32:47 +02:00
  • 64ec03d10b handle batch full more carefully Xuan Son Nguyen 2026-06-20 19:30:59 +02:00
  • d704c7929b add abort_all_slots Xuan Son Nguyen 2026-06-20 19:20:12 +02:00
  • af583e3ed3 wip 4 Xuan Son Nguyen 2026-06-20 19:18:05 +02:00
  • b786bb2e60 wip 3 Xuan Son Nguyen 2026-06-20 18:56:58 +02:00
  • 2b2eed8fd7 wip 2 Xuan Son Nguyen 2026-06-20 18:41:56 +02:00
  • 8452824611 release: add missing link for win opencl adreno arm64 (#24809) b9739 Muhammad Salem 2026-06-20 18:08:59 +03:00
  • 6c5c5a29d6 wip Xuan Son Nguyen 2026-06-20 16:48:12 +02:00
  • d5037c508a server: refactor batch construction Xuan Son Nguyen 2026-06-20 16:35:57 +02:00
  • e27f308597 server: avoid forwarding auth headers in CORS proxy (#24373) b9738 Matti4 2026-06-20 15:34:47 +02:00
  • 67e9fd3b74 docker : prebuild web UI for s390x build [no release] (#24829) b9737 Aldehir Rojas 2026-06-20 05:54:42 -05:00
  • 796f41bedc model : glm-dsa load DSA indexer tensors as optional (#24770) b9736 davidrhodus 2026-06-20 03:48:24 -07:00
  • 37a77fb057 ggml : optimize AMX (#24806) b9735 Adrien Gallouët 2026-06-20 12:43:06 +02:00
  • f4043fec01 convert : more consistent handling of rope_parameters (#24833) Sigbjørn Skjæret 2026-06-20 12:42:36 +02:00
  • f449e05537 ggml-webgpu: add adapter toggles for F16 on Vulkan + NVIDIA b9733 Masashi Yoshimura 2026-06-20 08:12:32 +09:00
  • 2b686a9120 server: refactor child --> router communication (#24821) b9732 Xuan-Son Nguyen 2026-06-20 01:02:26 +02:00
  • 4b48a53b6c server : optimize get_token_probabilities (#24796) b9731 Adrien Gallouët 2026-06-19 23:26:54 +02:00
  • e475fa2b5f mtmd, arg: fix utf8 handling on windows (#24779) b9730 Xuan-Son Nguyen 2026-06-19 22:28:38 +02:00
  • 175147e8f6 server: remove all internal mentions about "webui" (#24817) b9729 Xuan-Son Nguyen 2026-06-19 22:12:46 +02:00
  • fabde3bf51 arg: Add comment line support to --api-key-file (#23168) b9728 Mikolaj Kucharski 2026-06-19 15:33:54 +00:00
  • 0d2d9ccbf6 vendor : update cpp-httplib to 0.48.0 (#24787) b9727 Alessandro de Oliveira Faria (A.K.A.CABELO) 2026-06-19 11:16:35 -03:00
  • 8c2d6f6475 server: add --agent arg, remove redundant webui naming compat (#24801) b9726 Xuan-Son Nguyen 2026-06-19 16:06:13 +02:00
  • 38724ab593 docker : build the UI (#24794) b9725 Aldehir Rojas 2026-06-19 08:32:31 -05:00
  • e2e7a9b2d0 mtmd: several bug fixes (#24784) b9724 Xuan-Son Nguyen 2026-06-19 12:18:36 +02:00
  • b14e3fb90c spec: support eagle3 for qwen3.5 & 3.6 (#24593) b9723 Ruixiang Wang 2026-06-19 12:08:50 +02:00
  • 5a7462237e remove duplicated init calls 0cc4m/server-memory-limit Ruben Ortlam 2026-05-24 10:15:09 +02:00
  • 79210e3046 cleanup unused variable Ruben Ortlam 2026-05-21 11:03:16 +02:00
  • 84c4214b39 precompute name->buft map, map GPU host types to CPU buft Ruben Ortlam 2026-05-18 14:22:21 +02:00
  • dbc5f7ec82 move model memory estimation to subprocess Ruben Ortlam 2026-05-13 17:50:11 +02:00
  • 384a495a00 extract duplicated check into helper function Ruben Ortlam 2026-05-13 15:29:24 +02:00
  • 997491a644 replace device memory map with buft memory map. Use llama_get_memory_breakdown Ruben Ortlam 2026-05-13 15:13:13 +02:00
  • a35afd504f cont : clean-up Georgi Gerganov 2026-04-16 14:32:47 +03:00
  • 3046b8853a also strip models memory margin from child processes Ruben Ortlam 2026-04-13 10:14:53 +02:00
  • 216aaf1ad6 improve variable naming, fix style Ruben Ortlam 2026-04-07 13:35:02 +02:00
  • ff41b3dbf7 improve memory_per_device map naming Ruben Ortlam 2026-04-07 13:28:49 +02:00
  • 0e2f08a535 fix model count exceeded check Ruben Ortlam 2026-04-02 11:39:36 +02:00
  • 669948ce12 move llama_context_device_memory function to llama-ext.h Ruben Ortlam 2026-04-02 11:39:07 +02:00
  • 09d8eb95a4 add server memory debug logging Ruben Ortlam 2026-04-02 10:07:04 +02:00
  • c749b6882c use memory margin instead of total size limit, apply to each device separately Ruben Ortlam 2026-04-02 09:24:53 +02:00
  • 4ed48154b0 only set model memory_mb if not previously calculated Ruben Ortlam 2026-03-31 17:37:16 +02:00
  • 6178b8755d use no_alloc to get memory requirements for model load Ruben Ortlam 2026-03-31 16:18:03 +02:00
  • 340c867179 estimate with to-be-loaded model size included Ruben Ortlam 2026-03-29 12:18:51 +02:00
  • f38c4f9419 server: add --models-memory-max parameter to allow dynamically unloading models when they exceed a memory size threshold Ruben Ortlam 2026-03-29 10:00:49 +02:00
  • 159d093a43 server: fix non-bound n_discard value (ctx shifting) (#24786) b9722 Xuan-Son Nguyen 2026-06-19 10:53:44 +02:00
  • 5fd2dc2c41 sync : ggml b9721 Georgi Gerganov 2026-06-19 10:18:14 +03:00
  • 1868af13ac ggml : bump version to 0.15.2 (ggml/1548) Georgi Gerganov 2026-06-19 10:14:26 +03:00
  • 5bd21b8555 pi : remove docs from system prompt (#24791) Georgi Gerganov 2026-06-19 09:34:00 +03:00
  • 80452d65b9 server : consolidate slot selection into get_available_slot (#24755) b9718 Georgi Gerganov 2026-06-19 09:22:34 +03:00
  • 8141e730f1 ggml-cpu: support K tails in power10 Q8/Q4 MMA matmul (#24753) b9717 shalinib-ibm 2026-06-19 11:25:38 +05:30
  • db52540f73 mtmd: add batching support for internvl (#24775) b9716 Xuan-Son Nguyen 2026-06-19 01:16:16 +02:00
  • 3a3edc9ac6 Ggml/cuda col2im 1d (#24417) b9715 Pascal 2026-06-18 22:23:01 +02:00
  • 40f3aafc45 server: add "X-Accel-Buffering": "no" header to streaming endpoints (#24774) b9714 Reguna 2026-06-19 04:01:24 +08:00
  • a6b3260a42 mtmd: add batching for mtmd-cli, add video tests (#24778) b9713 Xuan-Son Nguyen 2026-06-18 21:55:04 +02:00
  • 959ce58197 improve Xuan Son Nguyen 2026-06-18 19:29:43 +02:00
  • 39fffcda7b Merge branch 'master' into xsn/mtmd_ds_ocr_tiles Xuan Son Nguyen 2026-06-18 18:59:26 +02:00
  • 32eddaf2ea cmake : fix ui build with read-only source (#24752) b9712 o7si 2026-06-19 00:59:18 +08:00
  • 060ce1bf72 mtmd: refactor llava-uhd overview image handling (always use ov_img_first) (#24769) b9711 Xuan-Son Nguyen 2026-06-18 18:53:49 +02:00
  • d2c67959b3 hexagon: support for op-trace (fine-grain tracing of HVX/HMX/DMA events) (#24592) Max Krasnyansky 2026-06-18 08:35:02 -07:00
  • 7b6c5a2aed docs: fix export-lora --lora-scaled syntax [no release] (#24703) Kangjia Gao 2026-06-18 22:46:17 +08:00
  • 4ea849efc7 rm debugging printf Xuan Son Nguyen 2026-06-18 16:41:12 +02:00
  • 9158400c42 adapt to new preprocessor api Xuan Son Nguyen 2026-06-18 16:32:08 +02:00
  • ea4d61c4f5 Merge branch 'master' into xsn/mtmd_ds_ocr_tiles Xuan Son Nguyen 2026-06-18 16:20:15 +02:00
  • fe7c8b2414 server: (router) fix stopping_thread potentially hang (#24728) Xuan-Son Nguyen 2026-06-18 15:41:09 +02:00
  • e1efd0991d server: add "schema" and validation (#24150) b9707 Xuan-Son Nguyen 2026-06-18 15:40:58 +02:00
  • 08023072ef server : add last-5-seconds generation speed display (#24291) Aarni Koskela 2026-06-18 15:02:20 +03:00
  • 20832179e2 ui: provide touch accessible model selection UI (#24604) Amos Wong 2026-06-18 19:14:20 +08:00
  • 10786217e9 server : return HTTP 400 on invalid grammar (#24144) (#24154) b9704 Anuj Attri 2026-06-18 06:49:14 -04:00
  • 552258c535 server: (router) rework -hf preset repo (#24739) b9703 Xuan-Son Nguyen 2026-06-18 12:45:23 +02:00
  • 968c43891a server: fix router args not being forwarded to child instances (#24760) b9702 Xuan-Son Nguyen 2026-06-18 12:15:46 +02:00
  • 24bba7b98e mtmd: refactor preprocessor, add mtmd_image_preproc_out (#24736) b9701 Xuan-Son Nguyen 2026-06-18 12:04:39 +02:00
  • 9724f664e8 [SYCL] rename GGML_SYCL_SUPPORT_LEVEL_ZERO (#24719) b9700 Neo Zhang 2026-06-18 16:18:26 +08:00
  • dd69db2924 sycl : support MUL_MAT and OUT_PROD with Q1_0 (#24721) b9699 Neo Zhang 2026-06-18 16:17:37 +08:00
  • 6ec59ddaea app : enable self-update only when built with llama-install.sh (#24754) b9698 Adrien Gallouët 2026-06-18 09:57:59 +02:00
  • 32e806b9c1 ci : fix check-release message parsing (#24751) b9697 Sigbjørn Skjæret 2026-06-18 09:32:56 +02:00
  • 6f1034b32a [SYCL] support OPs: conv_2d, conv_2d_dw, conv2d_transpose (#24600) Neo Zhang 2026-06-18 14:40:03 +08:00
  • 0b73fc79fe ui: Update code formatting command in pre-commit hook (#24685) Aleksander Grygier 2026-06-18 08:33:50 +02:00
  • 4a79037b8b ci : fix Windows x64 (OpenVINO) release link (#24731) b9694 Ravi Panchumarthy 2026-06-17 23:30:08 -07:00
  • cae0a3b0b0 metal : check for BF16 support in concat kernel (#24747) b9693 Georgi Gerganov 2026-06-18 09:16:06 +03:00
  • f3e1828164 mtmd: llava_uhd should no longer use batch dim (#24732) b9692 Xuan-Son Nguyen 2026-06-17 22:40:50 +02:00
  • 2e88c49c90 ggml-cpu: Conditionally enable power11 backend based on compiler support (#24687) b9691 shalinib-ibm 2026-06-18 00:15:19 +05:30
  • 0843245cb1 metal : implement rope_back operator (#24725) b9690 Georgi Gerganov 2026-06-17 20:36:05 +03:00
  • 8d2e580632 metal : add f16 and bf16 support for concat operator (#24724) b9689 Georgi Gerganov 2026-06-17 19:38:55 +03:00
  • 4b4d13ae72 server: (router) add model management API (#23976) b9688 Xuan-Son Nguyen 2026-06-17 18:04:58 +02:00
  • 37db4fa4be improve test 0cc4m/test-backend-copy Ruben Ortlam 2026-06-17 17:42:56 +02:00
  • b4024af6c2 llama : skip main_gpu validation when no devices are available (#23405) b9687 Dev-iL 2026-06-17 17:30:26 +03:00
  • 1a2dea29b9 spec: fix segfault error on long prompts for eagle3 (#24707) b9686 Ruixiang Wang 2026-06-17 16:29:49 +02:00
  • 74a80dd9c0 [SYCL] add dev2dev memcpy by SYCL API (#24476) b9685 Neo Zhang 2026-06-17 22:21:34 +08:00
  • d1759e4156 [SYCL] Add conv_3d (#24691) b9684 Neo Zhang 2026-06-17 22:20:01 +08:00
  • e804ed3fbe tests: add backend copy test Ruben Ortlam 2026-05-25 11:07:11 +02:00
  • 42874dfd8f clean up logging and timing 0cc4m/vulkan-graph-reuse Ruben Ortlam 2026-06-17 13:47:53 +02:00
  • 71d9373b82 simplify replay submission Ruben Ortlam 2026-06-17 13:32:30 +02:00
  • 8086439a4c webui: export conversations as jsonl (#24688) Julien Chaumond 2026-06-17 13:25:47 +02:00
  • f10a92dd17 fix queue debug utils label Ruben Ortlam 2026-06-17 13:20:52 +02:00
  • 558e221b70 vulkan: record actual memory properties during buffer creation (#24326) b9682 Winston Ma 2026-06-17 17:14:48 +08:00
  • ea21e03955 Revert "cuda: reset cuda context after reading memory size (#23935)" (#24715) Ruben Ortlam 2026-06-17 10:59:35 +02:00