Commit Graph

  • 317709b2a8 make portability_enumeration_ext apple only (#5757) b2294 Eve 2024-02-28 19:33:37 +00:00
  • 08c5ee87e4 llama : remove deprecated API (#5770) b2293 Georgi Gerganov 2024-02-28 18:43:38 +02:00
  • 78aacf3634 awq-py : remove (#5768) Georgi Gerganov 2024-02-28 17:36:53 +02:00
  • 8c0e8f4e73 sync : ggml b2291 Georgi Gerganov 2024-02-28 11:17:32 +02:00
  • 2774b0c974 add google magika inference example (ggml/748) slaren 2024-02-25 20:41:35 +01:00
  • 5f70671856 Introduce backend GUIDs (ggml/743) UEXTM.com 2024-02-24 11:27:36 -05:00
  • a693bea1e6 server : hit Ctrl+C twice to exit (#5734) b2288 Xuan Son Nguyen 2024-02-28 09:55:37 +01:00
  • adcb12a9ba llama : fix non-quantization of expert gating tensors (#5754) b2287 compilade 2024-02-28 03:52:56 -05:00
  • 177628bfd8 llama : improve BERT tokenization (#5740) b2286 Douglas Hanley 2024-02-28 02:51:11 -06:00
  • 6c4416868d readme : add link to LLaVA 1.6 models (#5758) Daniel Bevenius 2024-02-28 09:39:39 +01:00
  • efc72253f7 server : add "/chat/completions" alias for "/v1/...` (#5722) b2284 Jorge A 2024-02-28 01:39:15 -07:00
  • 7c4263d426 ggml : make i-quants work with super-blocks of 64 (CPU,Metal) (#5760) b2283 Kawrakow 2024-02-28 10:37:02 +02:00
  • f0cbb6ddf6 iq1_s: turn off SIMD implementation for QK_K = 64 (it does not work) ik/i-quants-64 Iwan Kawrakow 2024-02-28 08:28:10 +02:00
  • 47d52b2b24 Q2_K: fixed bug in imatrix quantization for QK_K = 64 Iwan Kawrakow 2024-02-28 08:15:52 +02:00
  • 2540a290ed Make CUDA compile with QK_K = 64 Iwan Kawrakow 2024-02-27 21:35:11 +02:00
  • de64e061da QK_K = 64 tests pass on ARM_NEON and Metal Iwan Kawrakow 2024-02-27 20:12:54 +02:00
  • cb49e0f8c9 Attempt to fix android build (#5752) b2282 Kawrakow 2024-02-27 19:16:49 +02:00
  • 28e6146c11 iq2_xs: attempt to fix AVX dot product for QK_K = 64 Iwan Kawrakow 2024-02-27 18:41:31 +02:00
  • 13ba37f1aa WIP: make i-quants work for QK_K = 64 Iwan Kawrakow 2024-02-27 17:30:11 +02:00
  • 0becb22ac0 IQ4_XS: a 4.25 bpw quantization (#5747) b2281 Kawrakow 2024-02-27 16:34:24 +02:00
  • 14d757066b llama : add llama_kv_cache_compress (EXPERIMENTAL) gg/kv-compress Georgi Gerganov 2024-02-25 22:16:13 +02:00
  • c24a2a6e60 cuda : replace remaining shfl_xor with calls to warp_reduce functions (#5744) b2280 Engininja2 2024-02-27 07:22:45 -06:00
  • 1f30b7a9f1 ggml-quants : fix avx2 iq1_s vec_dot when compiled with gcc (#5742) b2279 Engininja2 2024-02-27 06:50:18 -06:00
  • 9d533a77d0 llama : fix defrag bugs + add parameter (#5735) b2278 Georgi Gerganov 2024-02-27 14:35:51 +02:00
  • cbbd1efa06 Makefile: use variables for cublas (#5689) b2277 le.chang 2024-02-27 10:03:06 +08:00
  • b11a93df41 fix server hangs on empty prompt (#5733) b2276 Xuan Son Nguyen 2024-02-26 23:15:48 +01:00
  • a33e6a0d2a Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range (#5721) b2275 Kawrakow 2024-02-26 18:28:38 +02:00
  • 47bb7b48c7 CUDA: fix DEBUG_CUDA_MALLOC (#5729) b2274 Johannes Gäßler 2024-02-26 15:36:38 +01:00
  • c4d7f81786 readme : update ui list (#5731) Artem 2024-02-26 17:15:28 +03:00
  • e849078c6e [SYCL] Add support for soft_max ALiBi (#5639) b2272 AidanBeltonS 2024-02-26 14:02:11 +00:00
  • 67fd33132f unicode : reuse iterator (#5726) b2271 Georgi Gerganov 2024-02-26 14:02:12 +02:00
  • 4804215cb8 server: CI fix trailing space (#5728) b2270 Pierrick Hymbert 2024-02-26 11:41:34 +01:00
  • 8a533f0d90 server: CI tests reduce build matrix (#5725) b2269 Pierrick Hymbert 2024-02-26 09:56:10 +01:00
  • 269de86ba0 llama : fix Gemma rope type (#5691) b2268 Georgi Gerganov 2024-02-26 08:30:17 +02:00
  • c393733988 flake.lock: Update b2267 github-actions[bot] 2024-02-25 00:17:11 +00:00
  • e3965cf35a server: tests - slow inference causes timeout on the CI (#5715) b2266 Pierrick Hymbert 2024-02-25 22:48:33 +01:00
  • 8b350356b2 server: docs - refresh and tease a little bit more the http server (#5718) Pierrick Hymbert 2024-02-25 21:46:29 +01:00
  • bf08e00643 llama : refactor k-shift implementation + KV defragmentation (#5691) b2264 Georgi Gerganov 2024-02-25 22:12:24 +02:00
  • f7625019c5 server : fix crash when system prompt is bigger than batch size (#5714) b2263 compilade 2024-02-25 13:43:50 -05:00
  • abbabc5e51 ggml-quants : provide ggml_vqtbl1q_u8 for 64bit compatibility (#5711) b2262 Radosław Gryta 2024-02-25 19:43:00 +01:00
  • f1a98c5254 make : fix nvcc version is empty (#5713) b2261 kwin1412 2024-02-26 00:46:49 +08:00
  • 7d548a1827 readme : add Msty to UI list (#5618) Ashok Gelal 2024-02-25 10:57:34 -05:00
  • 930b178026 server: logs - unified format and --log-format option (#5700) b2259 Pierrick Hymbert 2024-02-25 13:50:32 +01:00
  • d52d7819b8 server: concurrency fix + monitoring - add /metrics prometheus compatible endpoint (#5708) b2258 Pierrick Hymbert 2024-02-25 13:49:43 +01:00
  • 1289408817 cmake : fix compilation for Android armeabi-v7a (#5702) b2257 Radosław Gryta 2024-02-25 11:53:11 +01:00
  • ab336a9d5e code : normalize enum names (#5697) b2256 Georgi Gerganov 2024-02-25 12:09:09 +02:00
  • 69917dfa55 py : fix StableLM conversion after config.json changes (#5703) Anas Ahouzi 2024-02-25 10:54:04 +01:00
  • 9e359a4f47 server: continue to update other slots on embedding concurrent request (#5699) b2254 Pierrick Hymbert 2024-02-24 19:16:04 +01:00
  • 4c4cb30736 IQ3_S: a much better alternative to Q3_K (#5676) b2253 Kawrakow 2024-02-24 16:23:52 +02:00
  • 525213d2f5 server: init functional tests (#5566) b2252 Pierrick Hymbert 2024-02-24 12:28:55 +01:00
  • fd43d66f46 server : add KV cache quantization options (#5684) b2251 AlpinDale 2024-02-23 19:31:54 +00:00
  • 54fbcd2ce6 convert : fix missing ftype for gemma (#5690) Jared Van Bortel 2024-02-23 13:39:14 -05:00
  • 608f449880 swift : fix build gg/float-pos Georgi Gerganov 2024-02-23 19:02:09 +02:00
  • fff1e8a54a batched.swift : fix build Georgi Gerganov 2024-02-23 16:15:37 +02:00
  • 8772658b11 ggml : add I32 <-> F32 conversion Georgi Gerganov 2024-02-23 14:14:49 +02:00
  • fc775366f1 llama : switch to floating-point token positions Georgi Gerganov 2024-02-23 12:18:30 +02:00
  • 15499eb942 mpt : do not duplicate token_embd.weight on disk (#5670) b2249 Jared Van Bortel 2024-02-22 17:05:23 -05:00
  • 96633eeca1 gemma : use more bits for the token_embd.weight tensor (#5650) b2248 Georgi Gerganov 2024-02-22 23:23:46 +02:00
  • 847eedbdb2 py : add Gemma conversion from HF models (#5647) b2247 Georgi Gerganov 2024-02-22 23:22:48 +02:00
  • 7e4f339c40 ggml : always define ggml_fp16_t as uint16_t (#5666) b2246 Georgi Gerganov 2024-02-22 23:21:39 +02:00
  • 334f76fa38 sync : ggml b2245 Georgi Gerganov 2024-02-22 23:21:05 +02:00
  • efd56b1c21 ggml : 32-bit arm compat (whisper/1891) Georgi Gerganov 2024-02-22 18:31:40 +02:00
  • 201294ae17 nix: init singularity and docker images (#5056) b2243 Someone 2024-02-22 19:44:10 +00:00
  • 5a9e2f60ba py : minor fixes (#5668) Georgi Gerganov 2024-02-22 20:13:25 +02:00
  • 373ee3fbba Add Gemma chat template (#5665) b2241 Xuan Son Nguyen 2024-02-22 19:10:21 +01:00
  • 56c047156a py : minor fixes gg/py-minor-fixes Georgi Gerganov 2024-02-22 19:22:56 +02:00
  • 4cb4d8b22d workflows: nix: hardcode cachix ids, build unconditionally (#5663) b2240 Someone 2024-02-22 16:32:09 +00:00
  • 3a03541ced minor : fix trailing whitespace (#5638) b2239 Georgi Gerganov 2024-02-22 13:54:03 +02:00
  • 56d03d92be readme : update hot topics Georgi Gerganov 2024-02-22 10:35:54 +02:00
  • a46f50747b server : fallback to chatml, add AlphaMonarch chat template (#5628) b2237 Xuan Son Nguyen 2024-02-22 09:33:24 +01:00
  • c5688c6250 server : clarify some params in the docs (#5640) Alexey Parfenov 2024-02-22 08:27:32 +00:00
  • 4ef245a92a mpt : add optional bias tensors (#5638) b2235 Dat Quoc Nguyen 2024-02-22 18:15:13 +10:00
  • 973053d8b0 llama : fix loading models with shared tok_embd and output (#5651) b2234 slaren 2024-02-22 00:42:09 +01:00
  • 7c8bcc11dc Add docs for llama_chat_apply_template (#5645) b2233 Xuan Son Nguyen 2024-02-22 00:31:00 +01:00
  • 5271c75666 llama : fix K-shift with quantized K (wip) sl/fix-quant-kv-shift slaren 2024-02-22 00:28:39 +01:00
  • 7fe4678b02 llama : fix session save/load with quantized KV (#5649) b2232 slaren 2024-02-21 22:52:39 +01:00
  • ba2135ccae gemma : allow offloading the output tensor (#5646) b2231 slaren 2024-02-21 22:18:23 +01:00
  • 89febfed93 examples : do not assume BOS when shifting context (#5622) b2230 Jared Van Bortel 2024-02-21 10:33:54 -05:00
  • 5022cf242d sync : ggml Georgi Gerganov 2024-02-21 16:52:39 +02:00
  • 1ecea255eb server: health: fix race condition on slots data using tasks queue (#5634) b2228 Pierrick Hymbert 2024-02-21 15:47:48 +01:00
  • a00a35cef9 readme : add LocalAI to the availables UI (#5629) Ettore Di Giacinto 2024-02-21 15:39:10 +01:00
  • eccd7a26dd sync : ggml (#5633) b2226 Georgi Gerganov 2024-02-21 16:17:10 +02:00
  • c14f72db9c readme : update hot topics Georgi Gerganov 2024-02-21 15:39:54 +02:00
  • cc6cac08e3 llava : add --skip-unknown to 1.6 convert.py (#5632) Daniel Bevenius 2024-02-21 14:36:57 +01:00
  • 580111d42b llama : add gemma model (#5631) b2223 postmasters 2024-02-21 05:08:22 -08:00
  • 88c46cbdac [SYCL] conext add name (#5624) b2222 Meng, Hengyu 2024-02-21 17:52:06 +08:00
  • a14679cc30 IQ4_NL: 4-bit non-linear quants with blocks of 32 (#5590) b2221 Kawrakow 2024-02-21 11:39:52 +02:00
  • 6560bed3f0 server : support llava 1.6 (#5553) b2220 CJ Pais 2024-02-20 11:07:22 -08:00
  • 06bf2cf8c4 make : fix debug build with CUDA (#5616) b2219 slaren 2024-02-20 20:06:17 +01:00
  • 4ed8e4fbef llava : add explicit instructions for llava-1.6 (#5611) Daniel Bevenius 2024-02-20 18:30:27 +01:00
  • 941de11759 convert : get general.name from model dir, not its parent Jared Van Bortel 2024-02-20 11:16:54 -05:00
  • 9c405c9f9a Server: use llama_chat_apply_template (#5593) b2217 Xuan Son Nguyen 2024-02-20 15:58:27 +01:00
  • 5207b3fbc5 readme : update UI list (#5605) Dane Madsen 2024-02-20 21:00:23 +11:00
  • 8dbbd75754 metal : add build system support for embedded metal library (#5604) b2215 Haoxiang Fei 2024-02-19 22:58:36 -11:00
  • c0a8c6db37 server : health endpoint configurable failure on no slot (#5594) b2214 Pierrick Hymbert 2024-02-20 08:48:19 +01:00
  • b9111bd209 Update ggml_sycl_op_mul_mat_vec_q (#5502) b2213 AidanBeltonS 2024-02-20 07:01:25 +00:00
  • 633782b8d9 nix: now that we can do so, allow MacOS to build Vulkan binaries b2212 Mathijs de Bruin 2024-02-13 20:28:02 +00:00
  • 22f83f0c38 Enable Vulkan MacOS CI 0cc4m 2024-02-10 22:18:33 +01:00
  • bb9dcd560a Refactor validation and enumeration platform checks into functions to clean up ggml_vk_instance_init() 0cc4m 2024-02-14 20:57:17 +01:00
  • f50db6ae0b Add check for VK_KHR_portability_enumeration for MoltenVK support 0cc4m 2024-02-10 22:14:52 +01:00