Commit Graph

  • f28bc4c286 llama : make loras compatible with repacking (#12593) Georgi Gerganov 2025-03-27 08:24:10 +02:00
  • f17a3bb4e8 SYCL: implement memset ggml backend buffer interface (#12580) b4967 Akarshan Biswas 2025-03-27 07:16:00 +05:30
  • bd40678df7 HIP: Add support for RDNA4 targets (#12372) b4966 Slobodan Josic 2025-03-26 23:46:30 +01:00
  • b3298fa47a metal : refactor mat-vec code (#12569) Georgi Gerganov 2025-03-26 21:38:38 +02:00
  • 70b063a550 metal : reduce register pressure gg/metal-refactor-mv-2 Georgi Gerganov 2025-03-26 21:24:28 +02:00
  • 2447ad8a98 upgrade to llguidance 0.7.10 (#12576) b4964 Michał Moskal 2025-03-26 11:06:09 -07:00
  • 02082f1519 clip: Fix llama-llava-clip-quantize-cli quantization error under CUDA backend (#12566) b4963 Ivy233 2025-03-26 22:06:04 +08:00
  • df4d20cd53 convert : fix squeeze for ssm_conv tensors (#12573) Georgi Gerganov 2025-03-26 14:21:05 +02:00
  • 5ed38b6852 ggml : fix MUL_MAT_ID repack with Q8_K (#12544) b4961 Georgi Gerganov 2025-03-26 13:02:00 +02:00
  • fd7855f8f5 doc: [MUSA] minor changes (#12583) R0CKSTAR 2025-03-26 15:09:48 +08:00
  • 53af4dba42 convert: fix Mistral3/Gemma3 model hparams init (#12571) Sigbjørn Skjæret 2025-03-25 23:03:10 +01:00
  • c6a1be6c0b metal : fix typo [no ci] Georgi Gerganov 2025-03-25 22:37:56 +02:00
  • e7d14ab26c metal : reduce register pressure Georgi Gerganov 2025-03-25 21:55:15 +02:00
  • 20b256e0fd convert : match ssm_conv tensors by type gg/mamba-fix-squeeze Francis Couture-Harpin 2025-03-25 14:29:22 -04:00
  • 9c60fc4c78 convert : fix squeeze for ssm_conv tensors Georgi Gerganov 2025-03-25 19:54:18 +02:00
  • ef19c71769 run: de-duplicate fmt and format functions and optimize (#11596) b4958 Eric Curtin 2025-03-25 17:46:11 +00:00
  • fe12e20a7f metal : mv q6_K support nr0 > 1 Georgi Gerganov 2025-03-25 17:48:43 +02:00
  • 51dea76888 metal : fix nr constant [no ci] Georgi Gerganov 2025-03-25 14:40:01 +02:00
  • 982c82f1e6 metal : fix comments [no ci] Georgi Gerganov 2025-03-25 14:36:22 +02:00
  • 24a9ea8b44 metal : rename all_sum -> sum_all Georgi Gerganov 2025-03-25 14:34:38 +02:00
  • fcca45c027 metal : refactor mat-vec code Georgi Gerganov 2025-03-24 16:51:30 +02:00
  • 053b3f9aae ggml-cpu : update KleidiAI to v1.5.0 (#12568) b4957 Dan Johansson 2025-03-25 12:10:18 +01:00
  • e2f560175a SYCL: disable Q4_0 reorder optimization (#12560) b4956 Akarshan Biswas 2025-03-25 16:10:18 +05:30
  • 36ee06dd2d docs : add build instructions for KleidiAI (#12563) Dan Johansson 2025-03-25 10:35:20 +01:00
  • 3cd3a39532 ci: [MUSA] add CI and update doc (#12562) R0CKSTAR 2025-03-25 15:45:08 +08:00
  • e94c2bd360 ggml : improve repack templates gg/repack-fix-mul-mat-id Georgi Gerganov 2025-03-25 09:42:31 +02:00
  • 2d77d88e70 context : fix worst-case reserve outputs (#12545) b4953 Georgi Gerganov 2025-03-25 09:19:23 +02:00
  • b8b7885484 SYCL: disable Q4_0 reorder optimization sycl/disable_reorder_opt Akarshan Biswas 2025-03-25 10:14:11 +05:30
  • c95fa362b3 ci: [SYCL] ggml-ci Use main GPU and enable sysman (#12547) Akarshan Biswas 2025-03-24 23:05:38 +05:30
  • 2b65ae3029 opencl: simplify kernel embedding logic in cmakefile (#12503) b4951 lhez 2025-03-24 09:20:47 -07:00
  • 48d7021c61 CI: fix SYCL build (#12546) Akarshan Biswas 2025-03-24 18:28:32 +05:30
  • 87cd537a29 ggml : fix MUL_MAT_ID repack with Q8_K Georgi Gerganov 2025-03-24 13:07:10 +02:00
  • 3361e2deba docs: update: improve the Fedoa CUDA guide (#12536) Tei Home 2025-03-24 19:02:26 +08:00
  • 00d53800e0 llama-vocab : add SuperBPE pre-tokenizer (#12532) b4948 compilade 2025-03-24 06:47:24 -04:00
  • 7ea75035b6 CUDA: Fix clang warnings (#12540) b4947 R0CKSTAR 2025-03-24 18:28:34 +08:00
  • c54f6b7988 mmap : skip resource limit checks on AIX (#12541) b4946 Prajwal B Mehendarkar 2025-03-24 15:47:10 +05:30
  • 9b169a4d4e vulkan: fix mul_mat_vec failure in backend tests (#12529) b4945 Jeff Bolz 2025-03-24 01:56:17 -05:00
  • a5b1943912 ggml-quants : fix some edge cases in make_qkxh_nl_quants compilade/optimal-rounding Francis Couture-Harpin 2025-03-23 17:59:37 -04:00
  • 35c2f8b9ff llama-vocab : add SuperBPE pre-tokenizer compilade/superbpe Francis Couture-Harpin 2025-03-23 16:19:03 -04:00
  • 77f9c6bbe5 server : Add verbose output to OAI compatible chat endpoint. (#12246) b4944 Marius Gerdes 2025-03-23 19:30:26 +01:00
  • 18b663d8e4 install : add macports (#12518) Lars Sonchocky-Helldorf 2025-03-23 09:21:48 +01:00
  • 8b8b88f3de ggml-quants : restore Q2_K use of make_qp_quants Francis Couture-Harpin 2025-03-22 18:47:56 -04:00
  • fbdfefe74e llama : gemma3 : use output tensor if it exists in model weight (#12506) b4942 Xuan-Son Nguyen 2025-03-22 23:28:19 +01:00
  • a41139723d Merge branch 'master' into compilade/optimal-rounding Francis Couture-Harpin 2025-03-22 15:05:11 -04:00
  • af23abd3cb ggml-quants : remove slower qsort-based cumulative search Francis Couture-Harpin 2025-03-22 12:07:28 -04:00
  • 3e4b675c9f ggml-quants : use a max-heap for TQ1_0 and TQ2_0 quantization Francis Couture-Harpin 2025-03-22 12:03:26 -04:00
  • ba932dfb50 ggml : fix quantized cpy op (#12310) Georgi Gerganov 2025-03-22 16:23:26 +02:00
  • fac63a3d78 musa: refine compute capability (#12493) b4940 R0CKSTAR 2025-03-22 17:11:37 +08:00
  • eddfb43850 vulkan: Optimize mul_mat_vec p021 and nc shaders (#12505) b4939 Jeff Bolz 2025-03-22 03:40:11 -05:00
  • 4375415b4a Vulkan: RTE rounding for cpy to quant (#12480) b4938 stduhpf 2025-03-21 20:34:50 +01:00
  • 30c42ef5cb vulkan: workaround for AMD Windows driver 16 bit unpack8 bug (#12472) b4937 Eve 2025-03-21 19:27:47 +00:00
  • f86b8ff210 ggml-quants : use qkxh in more places Francis Couture-Harpin 2025-03-21 14:05:58 -04:00
  • af04481e6b model : do not repack if a GPU device is present (#12498) b4936 Georgi Gerganov 2025-03-21 16:14:29 +02:00
  • 960e726077 chore : cleanup llama_model_loader::TENSOR_ usage (#12492) b4935 Sigbjørn Skjæret 2025-03-21 10:21:36 +01:00
  • ea1518e839 llama-tts : avoid crashes related to bad model file paths (#12482) b4934 marcoStocchi 2025-03-21 10:12:45 +01:00
  • 1aa87ee53d [SYCL] Fix build on Windows when ccache enabled (#9954) (#9976) b4933 蕭澧邦 2025-03-21 14:58:47 +08:00
  • 9ffcc9e374 sycl: cleanup oneDNN related code (#12097) b4932 Svetlozar Georgiev 2025-03-21 02:15:56 +00:00
  • 3be115100f ggml-quants : use a max-heap for linear quants like Q3_K Francis Couture-Harpin 2025-03-20 19:21:45 -04:00
  • b8b173274d server : remove old commented code [no ci] xsn/private_batch_api_pooling_none Georgi Gerganov 2025-03-20 18:19:55 +02:00
  • e04643063b webui : Prevent rerendering on textarea input (#12299) Woof Dog 2025-03-20 14:57:43 +00:00
  • 8a23b4a54a server : avoid common_batch Georgi Gerganov 2025-03-20 16:52:24 +02:00
  • dbb3a4739e llama : make Qwen2MoE QKV bias optional (#12477) b4930 Sigbjørn Skjæret 2025-03-20 12:49:59 +01:00
  • 3d82dbcbce ggml : block interleaving support for Q4_K quantization for x86 AVX2 architecture (#12332) b4929 Srihari-mcw 2025-03-20 17:05:34 +05:30
  • 76fd7d6f5b perplexity : avoid common_batch Georgi Gerganov 2025-03-20 12:21:40 +02:00
  • 732b5fbf5e convert : avoid calls to tokenizer.added_tokens_decoder (#12473) Bartowski 2025-03-20 02:36:37 -04:00
  • 568013d0cd context : clear sets containing encoder output sequence ids before storing new values (#12470) b4927 fairydreaming 2025-03-19 21:01:57 +01:00
  • 517b5ddbf0 CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183) b4926 Gaurav Garg 2025-03-20 01:22:06 +05:30
  • a9b59288e2 vulkan: optimize iq1 coopmat2 dequant functions (#12427) b4925 Jeff Bolz 2025-03-19 13:56:23 -05:00
  • 8b80d68338 embedding : avoid common_batch Georgi Gerganov 2025-03-19 14:29:04 +02:00
  • 6f54ee660c retrieval : avoid common_batch Georgi Gerganov 2025-03-19 13:50:15 +02:00
  • 0fd8487b14 Fix visionOS build and add CI (#12415) b4924 Guus Waals 2025-03-19 10:15:23 +00:00
  • 32c2c41d5e android : fix permission Xuan Son Nguyen 2025-03-19 10:49:30 +01:00
  • 96ca6e8d23 swift : adapt to new API Georgi Gerganov 2025-03-19 10:48:42 +02:00
  • b0db7fc2c6 android : adapt to new API Georgi Gerganov 2025-03-19 10:16:55 +02:00
  • 23d7407314 Merge pull request #15 from ggml-org/xsn/private_batch_api Xuan-Son Nguyen 2025-03-19 09:15:09 +01:00
  • 108e53c2f1 llama : add support for GPT2, Bloom and CodeShell tied word embeddings (#12456) b4923 Sigbjørn Skjæret 2025-03-19 09:08:49 +01:00
  • a686171ea7 convert : Support chat_template.json (#12460) Sigbjørn Skjæret 2025-03-19 08:58:13 +01:00
  • c446b2edd2 vulkan: Submit once enough matmul work has been recorded (#12406) b4921 Jeff Bolz 2025-03-19 02:26:26 -05:00
  • 7a3c178d78 speculative : adapt to new llama API xsn/private_batch_api Georgi Gerganov 2025-03-18 16:10:26 +02:00
  • d84635b1b0 opencl: improve profiling (#12442) b4920 lhez 2025-03-18 12:54:55 -07:00
  • 75422e8bc4 graph : normalize Q, K, V shapes + sync cross attention (#12449) b4919 Georgi Gerganov 2025-03-18 21:35:19 +02:00
  • bb115d2bf7 musa: override warp_size of musa device to 32 (#12445) R0CKSTAR 2025-03-19 02:28:26 +08:00
  • 29fff308c7 llama : support converting Mistral Small text-only (#12450) Xuan-Son Nguyen 2025-03-18 19:16:19 +01:00
  • c6af2161b2 speculative : fix seg fault in certain cases (#12454) b4916 Georgi Gerganov 2025-03-18 19:35:11 +02:00
  • 99aa304fb9 llama : add support for EXAONE tied word embeddings (#12451) b4915 Xuan-Son Nguyen 2025-03-18 17:24:33 +01:00
  • dc4bb64290 Merge branch 'master' into xsn/private_batch_api Xuan Son Nguyen 2025-03-18 15:45:22 +01:00
  • 8551c44d84 context : always use non-causal attention for encoder graphs (#12447) b4914 Georgi Gerganov 2025-03-18 13:05:49 +02:00
  • 35cae5ba05 SYCL: using graphs is configurable by environment variable and compile option (#12371) b4913 Łukasz Ślusarczyk 2025-03-18 11:16:31 +01:00
  • 810e0af3f5 server : fix warmup draft cache type (#12446) b4912 Georgi Gerganov 2025-03-18 12:05:42 +02:00
  • 29acf2cf05 context : move the change to llama_context::encode() gg/context-fix-enc-attn-type Georgi Gerganov 2025-03-18 11:55:19 +02:00
  • eba92d64c3 cmake : fix PowerPC build (#12241) b4911 Prajwal B Mehendarkar 2025-03-18 15:07:33 +05:30
  • a0554c3cdc context : always use non-causal attention for encoder graphs Georgi Gerganov 2025-03-18 11:14:48 +02:00
  • d9a14523bb ggml : add SVE support for q6_K_q8_K (#12361) b4910 fj-y-saito 2025-03-18 17:14:39 +09:00
  • fd123cfead Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (#12434) b4909 0cc4m 2025-03-18 07:21:40 +01:00
  • a53f7f7b88 fixed compilation warnings in ggml-sycl (#12424) b4908 Łukasz Ślusarczyk 2025-03-18 01:51:25 +01:00
  • 7dfad387e3 llama: Add support for RWKV v7 architecture (#12412) b4907 Molly Sophia 2025-03-18 07:27:50 +08:00
  • 60c902926c docs : bring llama-cli conversation/template docs up-to-date (#12426) Sigbjørn Skjæret 2025-03-17 21:14:32 +01:00
  • b1b132efcb cuda : enable CUDA Graph on CUDA Toolkit < 12.x (#12394) b4905 Gaurav Garg 2025-03-17 23:55:13 +05:30
  • 01e8f2138b ggml-vulkan: remove unused find_program(glslc) (#12416) Guus Waals 2025-03-18 00:35:43 +08:00
  • 484a8ab513 vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (#12312) b4903 Jeff Bolz 2025-03-17 09:26:18 -05:00