Commit Graph

  • a95631ee97 readme : update API notes Georgi Gerganov 2024-06-26 19:26:13 +03:00
  • 65f9293d14 devops : remove clblast + LLAMA_CUDA -> GGML_CUDA gg/fix-devops Georgi Gerganov 2024-06-26 18:37:55 +03:00
  • c4ded1a8fb llama : make pos_bias contiguous for CUDA Stanisław Szymczyk 2024-06-26 17:46:39 +02:00
  • bad0cafee9 llama : updated llm_build_ffn() calls to new API in build_t5() Stanisław Szymczyk 2024-06-26 17:38:13 +02:00
  • f3f65429c4 llama : reorganize source code + improve CMake (#8006) Georgi Gerganov 2024-06-26 18:33:02 +03:00
  • 1c8d37a267 Merge branch 'ggerganov:master' into t5-clean-3 fairydreaming 2024-06-26 17:31:15 +02:00
  • 1e6e363d7f test zero max buffer size sl/zero-max-size slaren 2024-06-26 17:11:09 +02:00
  • 45681a57dd llama : add inference support and model types for T5 and FLAN-T5 model families Stanisław Szymczyk 2024-06-26 15:03:01 +02:00
  • 8854044561 Clarify default MMQ for CUDA and LLAMA_CUDA_FORCE_MMQ flag (#8115) Isaac McFadyen 2024-06-26 02:29:28 -04:00
  • c8771ab5f8 CUDA: fix misaligned shared memory read (#8123) Johannes Gäßler 2024-06-26 08:28:02 +02:00
  • 494165f3b6 llama : extend llm_build_ffn() to support _scale tensors (#8103) b3233 Eddie-Wang 2024-06-26 14:27:46 +08:00
  • 9b2f16f805 json: better support for "type" unions (e.g. nullable arrays w/ typed items) (#7863) b3232 Olivier Chafik 2024-06-26 01:46:35 +01:00
  • 6777c544bd json: fix additionalProperties, allow space after enum/const (#7840) b3231 Olivier Chafik 2024-06-26 01:45:58 +01:00
  • 163d50adaf fixes #7999 (adds control vectors to all build_XXX() functions in llama.cpp [needs testing] (#8060) b3230 jukofyork 2024-06-25 21:47:40 +01:00
  • 6fcbf68235 llama : implement Unigram tokenizer needed by T5 and FLAN-T5 model families (#5763) b3229 fairydreaming 2024-06-25 21:14:35 +02:00
  • e6bf007744 llama : return nullptr from llama_grammar_init (#8093) b3228 Daniel Bevenius 2024-06-25 21:07:28 +02:00
  • 84631fe150 json: support integer minimum, maximum, exclusiveMinimum, exclusiveMaximum (#7797) b3227 Olivier Chafik 2024-06-25 20:06:20 +01:00
  • dd047b476c disable docker CI on pull requests (#8110) b3226 slaren 2024-06-25 19:20:06 +02:00
  • 925c30956d Add healthchecks to llama-server containers (#8081) joecryptotoo 2024-06-25 08:13:27 -07:00
  • 4c67d7cef5 add space in "-1" caitianchi 2024-06-25 20:06:55 +08:00
  • e68c8bc1e3 change n_layer caitianchi 2024-06-25 20:05:52 +08:00
  • c8ad35955a Gguf dump start data offset via --data-offset and some extra refactor (#8054) Brian 2024-06-25 22:03:25 +10:00
  • 49c03c79cd cvector: better prompt handling, add "mean vector" method (#8069) b3223 Xuan Son Nguyen 2024-06-25 13:59:54 +02:00
  • 48e6b92cc3 Add chat template support for llama-cli (#8068) b3222 Xuan Son Nguyen 2024-06-25 13:56:49 +02:00
  • 3791ad2193 SimpleChat v3.1: Boolean chat request options in Settings UI, cache_prompt (#7950) HanishKVC 2024-06-25 16:57:35 +05:30
  • 8f0350578d fix quality problem in pr code caitianchi 2024-06-25 18:51:06 +08:00
  • f702a90e24 Update control vector help (#8104) b3220 HatsuneMikuUwU33 2024-06-25 10:44:48 +02:00
  • 083bacce14 [SYCL] Re-enabled mul_mat_batched_sycl (#8095) b3219 Meng, Hengyu 2024-06-25 10:19:20 +08:00
  • 2df373ac40 CUDA: fix matrix multiplication algorithm choice (#8102) b3218 Johannes Gäßler 2024-06-25 01:22:33 +02:00
  • 3b099bcd9c CUDA: fix MMQ writeback for int8 tensor cores (#8100) Johannes Gäßler 2024-06-24 22:15:33 +02:00
  • a818f3028d CUDA: use MMQ instead of cuBLAS by default (#8075) b3216 Johannes Gäßler 2024-06-24 17:43:42 +02:00
  • d62e4aaa02 gguf-py : fix tensor groups for encoder-decoder models in gguf-dump.py (#8090) fairydreaming 2024-06-24 14:13:39 +02:00
  • 9a590c8226 CUDA: optimize MMQ int8 tensor core performance (#8062) Johannes Gäßler 2024-06-24 12:41:23 +02:00
  • 52fc8705a0 Option to split during conversion (#6942) Christian Zhou-Zheng 2024-06-24 05:42:03 -04:00
  • 8cb508d0d5 disable publishing the full-rocm docker image (#8083) b3212 slaren 2024-06-24 07:36:11 +02:00
  • 646ef4a9cf embedding : more cli arguments (#7458) b3211 Yann Follet 2024-06-24 13:30:24 +08:00
  • de0d6a68ac gguf-py, convert-hf : model conversion support for T5 and FLAN-T5 model variants (#5763) fairydreaming 2024-06-24 07:06:05 +02:00
  • cb8cfb9d4d Merge pull request #15 from OpenBMB/master tc-mb 2024-06-24 11:29:30 +08:00
  • 77beb4d153 Merge branch 'prepare-PR-of-minicpm-v2.5' into master tc-mb 2024-06-24 11:29:17 +08:00
  • 95f57bb5d5 ggml : remove ggml_task_type and GGML_PERF (#8017) b3209 slaren 2024-06-24 03:07:59 +02:00
  • e112b610a1 llama : add support for BitnetForCausalLM (#7931) b3208 Eddie-Wang 2024-06-24 02:27:57 +08:00
  • 6a2f298bd7 server : fix JSON-Scheme typo (#7975) Aarni Koskela 2024-06-23 18:03:08 +03:00
  • 11318d9aa1 Fix typo in llama_set_embeddings comment (#8077) b3206 Daniel Bevenius 2024-06-23 15:39:45 +02:00
  • b6b9a8e606 fix CI failures (#8066) b3205 slaren 2024-06-23 13:14:45 +02:00
  • 45c0e2e4c1 Refactor Vulkan backend to allow multiple contexts (#7961) b3204 0cc4m 2024-06-23 10:21:25 +02:00
  • b5a5f34efa Removing extra blank lines that were breaking Lint. (#8067) b3203 Clint Herron 2024-06-22 14:28:18 -04:00
  • 3e58b0ee35 cvector: fix CI + correct help message (#8064) b3202 Xuan Son Nguyen 2024-06-22 18:11:30 +02:00
  • adf480c3ab cvector-generator: Moe Moe Fixie-Fixie for Lots of Formats~! ♡(ᐢ ᴥ ᐢ)♡ (#8052) b3201 HatsuneMikuUwU33 2024-06-22 17:19:37 +02:00
  • 3aa184a8c7 convert-hf : change assert to exception (#8015) 0xspringtime 2024-06-22 09:37:41 -04:00
  • 5b48cd53a8 Update llama-quantize ppl/file size output from LLaMA-v1 to Llama-3 values (#8058) b3199 ddh0 2024-06-22 07:16:10 -06:00
  • c5a8d4b749 JSON Schema to GBNF integration tests (#7790) Clint Herron 2024-06-21 23:18:36 -04:00
  • 557b653dc9 vulkan: detect multiple devices by deviceUUID instead of deviceID (#8022) b3197 k.h.lai 2024-06-21 16:28:20 +08:00
  • 7d5e8777ae ggml : AVX IQ quants (#7845) b3196 Eve 2024-06-21 05:57:36 +00:00
  • a927b0f3dd llama : optimize long word tokenization with WPM (#8034) b3195 Georgi Gerganov 2024-06-21 08:51:28 +03:00
  • 80ea089d77 llama : allow pooled embeddings on any model (#7477) b3194 Douglas Hanley 2024-06-21 00:38:22 -05:00
  • 0e64591e82 swiftui : enable stream updating (#7754) b3193 Shuichi Tsutsumi 2024-06-21 14:30:58 +09:00
  • ff0aa3abd1 fix part of mul_mat_id sycl-mul-mat-id Meng, Hengyu 2024-06-21 03:38:00 +00:00
  • b1ef562bc1 requirements : Bump torch and numpy for python3.12 (#8041) Hamdoud Hakem 2024-06-20 21:01:15 +01:00
  • 17b291a6a5 convert-hf : Fix the encoding in the convert-hf-to-gguf-update.py (#8040) Hamdoud Hakem 2024-06-20 20:59:59 +01:00
  • abd894ad96 common: fix warning (#8036) b3190 Johannes Gäßler 2024-06-20 16:40:13 +02:00
  • de391e4c80 [SYCL] Fix windows build and inference (#8003) b3189 luoyu-intel 2024-06-20 13:19:05 +00:00
  • d50f8897a7 CUDA: stream-k decomposition for MMQ (#8018) b3188 Johannes Gäßler 2024-06-20 14:39:21 +02:00
  • 2075a66a96 metal : fix ggml_metal_supports_op for BF16 (#8021) b3187 Michael de Gans 2024-06-19 22:32:01 -07:00
  • ba58993152 server : fix smart slot selection (#8020) b3186 sasha0552 2024-06-19 23:57:10 +00:00
  • a7854743c5 un-ignore build-info.cmake and build-info.sh (#7996) Michael de Gans 2024-06-19 13:10:42 -07:00
  • 9c77ec1d74 ggml : synchronize threads using barriers (#7993) b3184 slaren 2024-06-19 15:04:15 +02:00
  • a04a953cab codecov : remove (#8004) b3183 Georgi Gerganov 2024-06-19 13:04:36 +03:00
  • 623494a478 [SYCL] refactor (#6408) b3182 Meng, Hengyu 2024-06-19 09:11:51 +08:00
  • 37bef89433 tokenizer : BPE fixes (#7530) b3181 jaime-m-p 2024-06-18 18:40:52 +02:00
  • 91c188d6c2 Only use FIM middle token if it exists (#7648) b3180 Sigbjørn Skjæret 2024-06-18 14:19:45 +02:00
  • 84f6de17f6 Fix no gcc pragma on Windows (#7751) b3179 jojorne 2024-06-18 09:18:32 -03:00
  • 61665277af Allow compiling with CUDA without CUDA runtime installed (#7989) b3178 Ulrich Drepper 2024-06-18 14:00:14 +02:00
  • f3974cabac all matrix multiplication backend sl/test-mul-mat-backend slaren 2024-06-14 21:44:55 +02:00
  • ce6e28cc23 Update ggml-sycl.cpp codeplay/fix-matmul-arith Joe Todd 2024-06-18 09:57:14 +01:00
  • b96f9afb0d chore: clean useless beam search param (#7985) b3177 Frank Mai 2024-06-18 15:11:40 +08:00
  • 1193778105 readme : update UI list (#7943) Abheek Gulati 2024-06-17 23:57:41 -07:00
  • 5326bcceeb ggml : sync b3175 Georgi Gerganov 2024-06-18 09:50:45 +03:00
  • e6ecc2be47 whisper : use ggml_backend_sched (whisper/2239) Georgi Gerganov 2024-06-18 09:37:20 +03:00
  • a94e6ff877 update: support Qwen2-57B-A14B (#7835) b3173 Ștefan-Gabriel Muscalu 2024-06-17 22:08:46 +03:00
  • 5b6da18750 Make updates to type cast based on compiler instead of OS (#7851) b3172 Srihari-mcw 2024-06-17 23:53:17 +05:30
  • 7c26775adb llama : disable FA if KV head size do not match (#7982) b3171 Georgi Gerganov 2024-06-17 19:40:01 +03:00
  • ef79941ac9 llama : disable FA if KV head size do not match gg/fa-req-kq-hs Georgi Gerganov 2024-06-17 19:20:24 +03:00
  • b473e95084 Add Nix and Flox install instructions (#7899) Bryan Honof 2024-06-17 17:37:55 +02:00
  • 99052cd227 sched : offload_op also requires supports_op (#7977) b3169 slaren 2024-06-17 16:51:42 +02:00
  • c637fcd34d fix: divide 0 exception in mamba (#7932) b3168 Frank Mai 2024-06-17 22:11:08 +08:00
  • 6a2f0b3474 Implement non-mapped async IO for CUDA on Windows. (#7896) b3167 Markus Tavenrath 2024-06-17 16:10:15 +02:00
  • a235b7c532 Vectorize q load codeplay/dequant_q4_K_improvements Aidan 2024-06-17 10:30:40 +01:00
  • 604ef6bf15 Store scales in local mem Aidan 2024-06-17 10:26:18 +01:00
  • cb3fb42046 Single load for half2 Aidan 2024-06-17 10:21:16 +01:00
  • 4a481556e6 Remove double lines Aidan 2024-06-17 10:16:10 +01:00
  • 21be9cab94 rpc : fix load/store misaligned addresses (#7948) b3166 Georgi Gerganov 2024-06-17 11:09:20 +03:00
  • 006167aaf6 gguf-dump.py: add --markdown dump output (#7853) Brian 2024-06-17 15:25:20 +10:00
  • df68d4fa5d [SYCL] Update README-sycl.md for Chapter "Recommended release" and "News" (#7946) b3164 Neo Zhang 2024-06-17 11:17:07 +08:00
  • 43b35e38ba Add support for sqrt on CUDA (#7953) b3163 Calvin Laurenson 2024-06-16 15:23:04 -07:00
  • 19b7a836f6 cuda : fix bounds check for src0 rows in MMVQ kernel (whisper/2231) b3162 Georgi Gerganov 2024-06-11 17:39:01 +03:00
  • b5fcf8ef5c ggml : fix and optimize ppc64le (ggml/849) Hong Bo PENG 2024-06-16 16:53:11 +08:00
  • 398105ff43 ggml : remove duplicate include of ggml-common.h (ggml/853) Daniel Bevenius 2024-06-16 10:51:18 +02:00
  • bc6c457fa3 flake.lock: Update (#7951) b3159 Georgi Gerganov 2024-06-16 19:16:21 +03:00
  • 52399254b3 unicode : avoid char32_t (#7957) b3158 Georgi Gerganov 2024-06-16 14:51:40 +03:00
  • 6fe1c62741 readme : update UI list [no ci] (#7958) hopkins385 2024-06-16 13:51:18 +02:00