Commit Graph

  • e5edb210cd server : update deps (#9183) Georgi Gerganov 2024-08-26 12:16:57 +03:00
  • 0c41e03ceb metal : gemma2 flash attention support (#9159) b3625 slaren 2024-08-26 11:08:59 +02:00
  • f12ceaca0c ggml-ci : try to improve build time (#9160) slaren 2024-08-26 11:03:30 +02:00
  • 6494509801 backup sycl-onednn-convolution Meng, Hengyu 2024-08-26 08:58:54 +00:00
  • ccb45186d0 docs : remove references gg/remove-k-quants-per-iter Georgi Gerganov 2024-08-26 09:52:02 +03:00
  • e48fd74b45 ggml : remove k_quants_per_iteration macro Georgi Gerganov 2024-07-04 21:19:09 +03:00
  • 436787f170 llama : fix time complexity of string replacement (#9163) b3623 Justine Tunney 2024-08-25 23:09:53 -07:00
  • 93bc3839f9 common: fixed not working find argument --n-gpu-layers-draft (#9175) b3622 Herman Semenov 2024-08-25 22:54:37 +00:00
  • f91fc5639b CUDA: fix Gemma 2 numerical issues for FA (#9166) b3621 Johannes Gäßler 2024-08-25 22:11:48 +02:00
  • e11bd856d5 CPU/CUDA: Gemma 2 FlashAttention support (#8542) b3620 Johannes Gäßler 2024-08-24 21:34:59 +02:00
  • 8f824ffe8e quantize : fix typo in usage help of quantize.cpp (#9145) b3619 João Dinis Ferreira 2024-08-24 07:22:45 +01:00
  • 3ba780e2a8 lora : fix llama conversion script with ROPE_FREQS (#9117) b3618 Xuan Son Nguyen 2024-08-23 12:58:53 +02:00
  • b180cb352b backup Meng, Hengyu 2024-08-20 08:23:41 +00:00
  • a07c32ea54 llama : use F32 precision in GLM4 attention and no FA (#9130) gguf-v0.10.0 b3617 piDack 2024-08-23 15:27:17 +08:00
  • cb6d9962c4 Merge branch 'master' into compilade/bitnet-ternary Francis Couture-Harpin 2024-08-22 16:42:24 -04:00
  • 38913dc8dd convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present Francis Couture-Harpin 2024-08-22 14:31:12 -04:00
  • 11b84eb457 [SYCL] Add a space to supress a cmake warning (#9133) b3616 Akarshan Biswas 2024-08-22 19:39:47 +05:30
  • fa358e7071 llama : add missing break Francis Couture-Harpin 2024-08-22 01:13:43 -04:00
  • 1731d4238f [SYCL] Add oneDNN primitive support (#9091) b3615 luoyu-intel 2024-08-22 12:50:10 +08:00
  • e04910dc48 llama : remove unused variable Francis Couture-Harpin 2024-08-21 23:06:22 -04:00
  • aff96920f9 llama : fix Mamba-2 conv state saving Francis Couture-Harpin 2024-08-21 16:28:07 -04:00
  • 2bfe9de6d3 llama : support running Mamba-Codestral-7B-v0.1 Francis Couture-Harpin 2024-08-18 22:43:39 -04:00
  • dceff23fae ggml : SIMD ggml_ssm_scan for Mamba-2 Francis Couture-Harpin 2024-08-18 21:49:39 -04:00
  • 1f0fea70fb llama : initial Mamba-2 support Francis Couture-Harpin 2024-08-01 10:43:42 -04:00
  • a1631e53f6 llama : simplify Mamba with advanced batch splits (#8526) b3614 compilade 2024-08-21 17:58:11 -04:00
  • 8062650343 llama : fix simple splits when the batch contains embeddings compilade/batch-splits Francis Couture-Harpin 2024-08-21 15:09:03 -04:00
  • fc54ef0d1c server : support reading arguments from environment variables (#9105) b3613 Xuan Son Nguyen 2024-08-21 11:04:34 +02:00
  • 80d9d2a551 Merge branch 'master' into compilade/batch-splits Francis Couture-Harpin 2024-08-21 04:17:29 -04:00
  • b40eb84895 llama : support for falcon-mamba architecture (#9074) b3612 Younes Belkada 2024-08-21 12:06:36 +04:00
  • f63f603c87 llava : zero-initialize clip_ctx structure fields with aggregate initialization 908) b3611 fairydreaming 2024-08-21 09:45:49 +02:00
  • 8455340b87 llama : std::move llm_bigram_bpe from work_queue (#9062) b3610 Daniel Bevenius 2024-08-21 09:32:58 +02:00
  • 1be5ea7d97 llama : add llama_model_is_recurrent to simplify figuring that out Francis Couture-Harpin 2024-08-20 23:55:14 -04:00
  • b264eddbb2 llama : fix Mamba pooled embeddings with multiple sequences Francis Couture-Harpin 2024-08-20 23:29:48 -04:00
  • 652e9b0d61 llama : fix T5 segfault again Francis Couture-Harpin 2024-08-20 21:37:43 -04:00
  • 347247a24e imatrix : fix segfault when using a single chunk per batch Francis Couture-Harpin 2024-08-20 15:35:56 -04:00
  • bce54642c8 imatrix : allow processing multiple chunks per batch Francis Couture-Harpin 2024-08-20 15:17:24 -04:00
  • 2f3c1466ff llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model. (#8984) b3609 Changyeon Kim 2024-08-21 04:00:00 +09:00
  • 50addec9a5 [SYCL] fallback mmvq (#9088) b3608 Meng, Hengyu 2024-08-20 23:50:17 +08:00
  • 4f8d19ff17 [SYCL] Fix SYCL im2col and convert Overflow with Large Dims (#9052) b3607 zhentaoyu 2024-08-20 23:06:51 +08:00
  • 90db8146d5 tests : add missing comma in grammar integration tests (#9099) b3606 fairydreaming 2024-08-20 11:09:55 +02:00
  • cfac111e2b cann: add doc for cann backend (#8867) wangshuai09 2024-08-19 16:46:38 +08:00
  • 1b6ff90ff8 rpc : print error message when failed to connect endpoint (#9042) b3604 Radoslav Gerganov 2024-08-19 10:11:45 +03:00
  • 18eaf29f4c rpc : prevent crashes on invalid input (#9040) b3603 Radoslav Gerganov 2024-08-19 10:10:21 +03:00
  • 554b049068 flake.lock: Update (#9068) Georgi Gerganov 2024-08-18 17:43:32 +03:00
  • 2339a0be1c tests : add integration test for lora adapters (#8957) ltoniazzi 2024-08-18 10:58:04 +01:00
  • 2fb9267887 Fix incorrect use of ctx_split for bias tensors (#9063) b3600 Yoshi Suhara 2024-08-17 06:34:21 -07:00
  • 9127800d83 wip sl/prepare-next-graph slaren 2024-08-17 01:51:06 +02:00
  • 8b3befc0e2 server : refactor middleware and /health endpoint (#9056) b3599 Xuan Son Nguyen 2024-08-16 17:19:05 +02:00
  • d565bb2fd5 llava : support MiniCPM-V-2.6 (#8967) b3598 tc-mb 2024-08-16 21:34:41 +08:00
  • ee2984bdaf py : fix wrong input type for raw_dtype in ggml to gguf scripts (#8928) Farbod Bijary 2024-08-16 14:06:30 +03:30
  • c8ddce8560 Fix inference example lacks required parameters (#9035) Aisuko 2024-08-16 19:08:59 +10:00
  • 23fd453544 gguf-py : bump version from 0.9.1 to 0.10.0 (#9051) b3595 compilade 2024-08-16 02:36:11 -04:00
  • c679e0cb5c llama : add EXAONE model support (#9025) Minsoo Cheong 2024-08-16 15:35:18 +09:00
  • fb487bb567 common : add support for cpu_get_num_physical_cores() on Windows (#8771) b3593 Liu Jia 2024-08-16 14:23:12 +08:00
  • 2a24c8caa6 Add Nemotron/Minitron GGUF Conversion & Inference Support (#8922) b3592 Yoshi Suhara 2024-08-15 19:23:33 -07:00
  • e3f6fd56b1 ggml : dynamic ggml_sched_max_splits based on graph_size (#9047) b3591 Nico Bosshard 2024-08-16 04:22:55 +02:00
  • 4b9afbbe90 retrieval : fix memory leak in retrieval query handling (#8955) b3590 gtygo 2024-08-15 15:40:12 +08:00
  • 37501d9c79 server : fix duplicated n_predict key in the generation_settings (#8994) b3589 Riceball LEE 2024-08-15 15:28:05 +08:00
  • 4af8420afb common : remove duplicate function llama_should_add_bos_token (#8778) b3588 Zhenwei Jin 2024-08-15 15:23:23 +08:00
  • 6bda7ce6c3 llama : add pre-tokenizer regexes for BLOOM and gpt3-finnish (#8850) b3587 Esko Toivonen 2024-08-15 10:17:12 +03:00
  • d5492f0525 ci : disable bench workflow (#9010) Georgi Gerganov 2024-08-15 10:11:11 +03:00
  • 234b30676a server : init stop and error fields of the result struct (#9026) b3585 Jiří Podivín 2024-08-15 08:21:57 +02:00
  • 702e1995a1 Merge branch 'master' into compilade/batch-splits Francis Couture-Harpin 2024-08-14 20:46:28 -04:00
  • 5fd89a70ea Vulkan Optimizations and Fixes (#8959) b3584 0cc4m 2024-08-14 18:32:53 +02:00
  • 62d7b6c87f cuda : re-add q4_0 gg/hf-test Georgi Gerganov 2024-08-14 13:37:03 +03:00
  • 503983a69a cuda : build only necessary templates Georgi Gerganov 2024-08-14 10:29:23 +03:00
  • ae41fd2e65 make : force CPU extensions [no ci] Georgi Gerganov 2024-08-13 16:59:12 +03:00
  • 98a532d474 server : fix segfault on long system prompt (#8987) b3583 compilade 2024-08-14 02:51:02 -04:00
  • 43bdd3ce18 cmake : remove unused option GGML_CURL (#9011) b3582 Georgi Gerganov 2024-08-14 09:14:49 +03:00
  • 93ec58b932 server : fix typo in comment compilade/fix-server-long-system-prompt Francis Couture-Harpin 2024-08-13 22:12:26 -04:00
  • af2f84c964 Merge branch 'master' into compilade/fix-server-long-system-prompt Francis Couture-Harpin 2024-08-13 22:06:11 -04:00
  • c1b738ef43 server : fix parallel generation with very small batch sizes Francis Couture-Harpin 2024-08-13 22:03:57 -04:00
  • 35cc5567c8 ggml-quants : deduplicate TQ1_0 and TQ2_0 __ARM_FEATURE_DOTPROD support Francis Couture-Harpin 2024-08-13 18:00:06 -04:00
  • 82b240406d Merge branch 'master' into compilade/bitnet-ternary Francis Couture-Harpin 2024-08-13 17:36:09 -04:00
  • 69f772682e ggml-quants : allow using ARM dot product instructions for TQ1_0 Francis Couture-Harpin 2024-08-13 17:21:19 -04:00
  • 895004f3f8 convert : allow direct conversion to TQ1_0 and TQ2_0 Francis Couture-Harpin 2024-08-13 17:17:43 -04:00
  • 06943a69f6 ggml : move rope type enum to ggml.h (#8949) b3581 Daniel Bevenius 2024-08-13 21:13:15 +02:00
  • 828d6ff7d7 export-lora : throw error if lora is quantized (#9002) b3580 Xuan Son Nguyen 2024-08-13 11:41:14 +02:00
  • 33a5c8e37c llama : prepare next graph while the current one is being evaluated slaren 2024-08-13 02:39:52 +02:00
  • fc4ca27b25 ci : fix github workflow vulnerable to script injection (#9008) b3579 Diogo Teles Sant'Anna 2024-08-12 13:28:23 -03:00
  • 1f67436c5e ci : enable RPC in all of the released builds (#9006) b3578 Radoslav Gerganov 2024-08-12 19:17:03 +03:00
  • 0fd93cdef5 llama : model-based max number of graph nodes calculation (#8970) b3577 Nico Bosshard 2024-08-12 17:13:59 +02:00
  • 84eb2f4fad docs: introduce gpustack and gguf-parser (#8873) b3576 Frank Mai 2024-08-12 20:45:50 +08:00
  • 1262e7ed13 grammar-parser : fix possible null-deref (#9004) b3575 DavidKorczynski 2024-08-12 13:36:41 +01:00
  • df5478fbea ggml: fix div-by-zero (#9003) b3574 DavidKorczynski 2024-08-12 13:21:41 +01:00
  • 2589292cde Fix a spelling mistake (#9001) b3573 Liu Jia 2024-08-12 17:46:03 +08:00
  • d3ae0ee8d7 py : fix requirements check '==' -> '~=' (#8982) Georgi Gerganov 2024-08-12 11:02:01 +03:00
  • 5ef07e25ac server : handle models with missing EOS token (#8997) b3571 Georgi Gerganov 2024-08-12 10:21:50 +03:00
  • 3a0bf17d57 gguf-py : Numpy (de)quantization for TQ1_0 and TQ2_0 Francis Couture-Harpin 2024-08-12 00:06:48 -04:00
  • faaac59d16 llama : support NUL bytes in tokens compilade/nul-str-token Francis Couture-Harpin 2024-08-11 21:00:03 -04:00
  • d911cd1f13 Merge branch 'master' into compilade/bitnet-ternary Francis Couture-Harpin 2024-08-11 15:52:29 -04:00
  • 4134999e01 gguf-py : Numpy dequantization for most types (#8939) b3570 compilade 2024-08-11 14:45:41 -04:00
  • 7eda5583fa server : fix segfault on long system prompt Francis Couture-Harpin 2024-08-11 14:18:17 -04:00
  • 8cd1bcfd3f flake.lock: Update (#8979) Georgi Gerganov 2024-08-11 16:58:58 +03:00
  • a21c6fd450 update guide (#8909) b3568 Neo Zhang 2024-08-11 16:37:43 +08:00
  • 33309f661a llama : check all graph nodes when searching for result_embd_pooled (#8956) b3567 fairydreaming 2024-08-11 10:35:26 +02:00
  • 7c5bfd57f8 Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. (#8943) b3566 Markus Tavenrath 2024-08-11 10:09:09 +02:00
  • 6e02327e8b metal : fix uninitialized abort_callback (#8968) b3565 slaren 2024-08-10 15:42:10 +02:00
  • 7eb23840ed llama : default n_swa for phi-3 (#8931) b3564 Xuan Son Nguyen 2024-08-10 13:04:40 +02:00
  • 7c3f55c100 Add support for encoder-only T5 models (#8900) b3563 fairydreaming 2024-08-10 11:43:26 +02:00