Commit Graph

  • e11b2e6e1e Qwen2 : assume tied weights if lm_head/output weights is missing (#6738) b2692 Ren Xuancheng 2024-04-18 19:38:04 +08:00
  • 105332cc17 metal : add BS=1 kernel for flash attention (#6508) Georgi Gerganov 2024-04-18 14:33:07 +03:00
  • 260cdb2d08 llama-bench : add -fa,--flash-attn arg Georgi Gerganov 2024-04-18 14:28:19 +03:00
  • 87968de9a9 fix KQ FP32 precision fpr parallel_blocks > 1 Johannes Gäßler 2024-04-17 17:31:03 +02:00
  • 2f538b9547 Add __hgt2_mask implementation for CUDA 11 Johannes Gäßler 2024-04-17 16:29:28 +02:00
  • 0bc67dd1c8 Calculate KQ as FP32 if KQV has GGML_PREC_F32 Johannes Gäßler 2024-04-16 16:22:29 +02:00
  • a5b0e2dea0 store temp KQ in registers Johannes Gäßler 2024-04-16 15:58:21 +02:00
  • ef9e1593f3 flush softmax exp below threshold to 0 Johannes Gäßler 2024-04-15 16:05:07 +02:00
  • 6a3b84236d fix flash_attn_vec_f16 race condition Johannes Gäßler 2024-04-13 22:05:43 +02:00
  • 34f93bbb39 CUDA: refactor host code, dyn. par. blocks Johannes Gäßler 2024-04-09 11:39:16 +02:00
  • c71bfd736e llama : fix compatibility with old 2 expert models (#6735) b2691 slaren 2024-04-18 09:04:47 +02:00
  • 5668c79ea0 server: bench: enable flash_attn param Pierrick HYMBERT 2024-04-17 23:26:29 +02:00
  • 3b8f1ec4b1 llamafile : tmp disable + build sgemm.o when needed (#6716) b2690 Georgi Gerganov 2024-04-17 23:58:26 +03:00
  • 8dd1ec8b3f readme : add UI (#6724) Yaroslav 2024-04-17 14:47:50 +02:00
  • 405385726e server: support flash_attn param Pierrick HYMBERT 2024-04-17 14:05:02 +02:00
  • 599ce84a71 llama : flash_attn cparam + fix defrag Georgi Gerganov 2024-04-17 12:00:35 +03:00
  • 2c41180e88 Merge branch 'master' into gg/flash-attn Georgi Gerganov 2024-04-17 10:13:09 +03:00
  • facb8b56f8 convert : fix autoawq gemma (#6704) b2688 Zheng.Deng 2024-04-17 04:51:07 +08:00
  • 532c1737a1 llama : make general.name optional (#6709) b2687 Georgi Gerganov 2024-04-16 23:50:38 +03:00
  • 666867b799 ggml : fix llamafile sgemm wdata offsets (#6710) b2686 Georgi Gerganov 2024-04-16 23:50:22 +03:00
  • f02ea667c1 ggml : temporary disable llamafile sgemm until fixed gg/disable-sgemm Georgi Gerganov 2024-04-16 22:41:03 +03:00
  • 8cc91dc63c ggml : add llamafile sgemm (#6414) b2685 Justine Tunney 2024-04-16 14:55:30 -04:00
  • dbceec87c0 llama : add StableLM2 12B (#6635) b2684 Ashish 2024-04-16 08:48:35 -07:00
  • f4dea7da18 llama : add qwen2moe (#6074) b2683 Shijie 2024-04-16 23:40:48 +08:00
  • eedd42e376 KV Cache defrag hash overflow - TMP Fix by @slaren #6685 hp/tmp/kv-cache-defrag Pierrick HYMBERT 2024-04-16 10:24:34 +02:00
  • 8a56075b07 gritlm : add --outdir option to hf.sh script (#6699) Daniel Bevenius 2024-04-16 08:34:06 +02:00
  • 58227ffdeb perplexity : require positive --ctx-size arg (#6695) b2681 Georgi Gerganov 2024-04-16 09:28:33 +03:00
  • 4fbd8098e6 gguf : add special tokens metadata for FIM/Infill (#6689) b2680 Daniel Bevenius 2024-04-16 08:13:13 +02:00
  • 7593639ce3 main: add --json-schema / -j flag (#6659) b2679 Olivier Chafik 2024-04-15 18:35:21 +01:00
  • 132f55795e llama : fix restoring the number of outputs from state files (#6687) b2678 compilade 2024-04-15 08:56:55 -04:00
  • 3272896d79 server : revert "minor layout improvements" (#6684) Pierrick Hymbert 2024-04-15 14:18:47 +02:00
  • 7fc16a2c32 swift : linux support (#6590) b2676 Steven Prichard 2024-04-15 05:14:46 -05:00
  • 17e98d4c96 fix mul_mat_id() for new input, make the ut pass (#6682) b2675 Neo Zhang Jianyu 2024-04-15 17:12:26 +08:00
  • 1958f7e06c llama : add missing kv clear in llama_beam_search (#6664) b2674 David Renshaw 2024-04-14 15:24:15 -04:00
  • 04fbc5f23e Add Command R chat template (#6650) b2673 Chao Jiang 2024-04-15 00:16:34 +08:00
  • f184dd9208 flake.lock: Update (#6669) Georgi Gerganov 2024-04-14 16:55:30 +03:00
  • 422c2aff1c Added support for GGML_OP_CLAMP in Metal (#6662) b2671 Dave 2024-04-14 07:14:19 -04:00
  • 8800226d65 Fix --split-max-size (#6655) b2670 Sigbjørn Skjæret 2024-04-14 13:12:59 +02:00
  • e689fc4e91 [bug fix] convert github repository_owner to lowercase (#6673) b2669 Jaemin Son 2024-04-14 20:12:36 +09:00
  • a4ec34e1cd convert : enable the --use-temp-file cli flag (#6645) James A Capozzoli 2024-04-14 04:40:18 -04:00
  • de17e3f745 fix memcpy() crash, add missed cmd in guide, fix softmax (#6622) b2667 Neo Zhang Jianyu 2024-04-14 10:42:29 +08:00
  • b5e7285baf CUDA: fix matrix multiplication logic for tests (#6667) b2666 Johannes Gäßler 2024-04-14 00:21:55 +02:00
  • 4bd0f93e4a model: support arch DbrxForCausalLM (#6515) b2665 Pierrick Hymbert 2024-04-13 11:33:52 +02:00
  • ab9a3240a9 JSON schema conversion: faster repetitions, min/maxLength for strings, cap number length (#6555) b2664 Olivier Chafik 2024-04-12 19:43:38 +01:00
  • fbbc030ba9 metal : unify mul_mv_id kernels (#6556) b2663 slaren 2024-04-12 18:13:20 +02:00
  • 4cc120c744 infill : add download instructions for model (#6626) Daniel Bevenius 2024-04-12 14:11:46 +02:00
  • 24ee66ed0d server : coherent log output for KV cache full (#6637) b2661 Pierrick Hymbert 2024-04-12 13:49:21 +02:00
  • 91c736015b llama : add gguf_remove_key + remove split meta during quantize (#6591) b2660 jiez 2024-04-12 18:45:06 +08:00
  • 907df4459c Hack test-bench Aidan 2024-04-11 17:20:32 +01:00
  • 5c4d767ac0 chore: Fix markdown warnings (#6625) Rene Leonhardt 2024-04-12 10:52:36 +02:00
  • ef21ce4ccb imatrix : remove invalid assert (#6632) b2658 Georgi Gerganov 2024-04-12 11:49:58 +03:00
  • 8b495540fa imatrix : remove invalid assert gg/imatrix-remove-assert Georgi Gerganov 2024-04-12 11:45:12 +03:00
  • dee7f8d692 Correct free memory and total memory. (#6630) b2657 MasterYi1024 2024-04-12 16:28:12 +08:00
  • 81da18e71c eval-callback: use ggml_op_desc to pretty print unary operator name (#6631) b2656 Pierrick Hymbert 2024-04-12 10:26:47 +02:00
  • 9ed2737acc ci : disable Metal for macOS-latest-cmake-x64 (#6628) b2655 Georgi Gerganov 2024-04-12 11:15:05 +03:00
  • 04a5ac211e Optimization: eliminate addition of redundant stacks when advancing grammar. (#6616) Clint Herron 2024-04-11 21:44:50 -04:00
  • f7001ccc5a As suggested by @slaren, disabling Metal for test to fix CI build on OSX from #6576 (#6619) Clint Herron 2024-04-11 17:44:48 -04:00
  • a474f50ebb Refactor Error Handling for CUDA (#6575) Nikolas 2024-04-11 21:56:29 +02:00
  • cbaadc9294 grammars: 1.5x faster inference w/ complex grammars (vector reserves / reuses) (#6609) Olivier Chafik 2024-04-11 19:47:34 +01:00
  • 1bbdaf6ecd ci: download artifacts to release directory (#6612) Hugo Roussel 2024-04-11 19:52:21 +02:00
  • f4183afe6a scripts : add --outdir option to hf.sh (#6600) Daniel Bevenius 2024-04-11 15:22:47 +02:00
  • b804b1ef77 eval-callback: Example how to use eval callback for debugging (#6576) Pierrick Hymbert 2024-04-11 14:51:07 +02:00
  • 8228b66dbc gguf : add option to not check tensor data (#6582) b2647 Daniel Bevenius 2024-04-10 20:16:48 +02:00
  • b3a96f27f0 minor layout improvements (#6572) b2646 Ralph Soika 2024-04-10 19:18:25 +02:00
  • 4f407a0a35 llama : add model types for mixtral (#6589) b2645 slaren 2024-04-10 17:24:14 +02:00
  • 65c64dc36f convert.py : add consolidated.safetensors for mixtral 8x22b (#6587) slaren 2024-04-10 15:23:12 +02:00
  • 67fac4b95f docs : how to add a model (#6565) Pierrick Hymbert 2024-04-10 08:58:48 +02:00
  • 29122d32ac readme : fix ROCm link (#6579) Artem Zinnatullin 2024-04-10 00:49:12 -06:00
  • b231b37b09 readme : update UI list (#6560) sjxx 2024-04-10 14:34:00 +08:00
  • d66849f628 Merge branch 'master' into compilade/refactor-kv-cache Francis Couture-Harpin 2024-04-09 20:22:19 -04:00
  • ba5e134e07 readme: fix typo in amdgpu target name (#6573) Jiří Sejkora 2024-04-10 00:23:02 +02:00
  • 0c8b3b2095 llama : correctly handle more edge cases for the rs cache Francis Couture-Harpin 2024-04-09 17:35:22 -04:00
  • 1b67731e18 BERT tokenizer fixes (#6498) Jared Van Bortel 2024-04-09 13:44:08 -04:00
  • c4a3a4ff47 sync : ggml b2638 Georgi Gerganov 2024-04-09 20:29:06 +03:00
  • 400d5d722d server : detect search query to start webchat (#6554) Ed Lee 2024-04-09 01:31:47 -07:00
  • 5dc9dd7152 llama : add Command R Plus support (#6491) b2636 Carolinabanana 2024-04-09 09:16:13 +01:00
  • e11a8999b5 license : update copyright notice + add AUTHORS (#6405) Georgi Gerganov 2024-04-09 09:23:19 +03:00
  • 072e0a4d3b scipts : add LICENSE and gen-authors.sh to sync gg/authors Georgi Gerganov 2024-04-09 09:19:33 +03:00
  • 0e0d4e821f authors : update Georgi Gerganov 2024-04-09 09:14:03 +03:00
  • cc4a95426d llama : fix attention layer count sanity check (#6550) Georgi Gerganov 2024-04-08 22:25:49 +03:00
  • cecd8d3c98 Comment explaining a decision (#6531) b2633 kunnis 2024-04-08 10:44:19 -05:00
  • 0028010d01 llama : state checkpoints for recurrent models Francis Couture-Harpin 2024-04-08 09:54:35 -04:00
  • b73e564b16 quantize : fix precedence of cli args (#6541) b2632 Georgi Gerganov 2024-04-08 16:23:01 +03:00
  • e3c337d87c llama : support negative ith in llama_get_ API (#6519) Rick G 2024-04-08 06:02:30 -07:00
  • beea6e1b16 llama : save and restore kv cache for single seq id (#6341) b2630 Jan Boon 2024-04-08 20:43:30 +08:00
  • 87fb5b4234 remove row=1 cond (#6532) b2629 Abhilash Majumder 2024-04-08 13:56:01 +05:30
  • d752327c33 Adding KodiBot to UI list (#6535) Firat 2024-04-08 00:48:29 -07:00
  • 855f54402e Change Windows AMD example to release build to make inference much faster. (#6525) Mark Fairbairn 2024-04-07 19:52:19 +01:00
  • b909236c0b flake.lock: Update (#6517) Georgi Gerganov 2024-04-07 21:25:30 +03:00
  • e0717e751e Add GritLM as supported models. (#6513) DAN™ 2024-04-07 13:33:59 -04:00
  • c37247796b sync : ggml Georgi Gerganov 2024-04-07 17:05:51 +03:00
  • f77261a7c5 ggml: bypass code incompatible with CUDA < 11.1 (whisper/2020) Slava Primenko 2024-04-04 14:49:24 +02:00
  • 43e8995e75 scripts : sync ggml-cuda folder Georgi Gerganov 2024-04-07 16:08:12 +03:00
  • 9472bce308 Run make to build the project (#6457) limitedAtonement 2024-04-07 07:05:40 -04:00
  • d4f220a5cc support/fix OPs GGML_TYPE_IQ4_NL, GGML_TYPE_IQ4_XS, GGML_TYPE_IQ3_XXS, GGML_TYPE_IQ3_S, GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS, GGML_TYPE_IQ2_S, GGML_TYPE_IQ1_S, GGML_TYPE_IQ1_M (#6521) b2620 Neo Zhang Jianyu 2024-04-07 10:55:59 +08:00
  • 54ea0698fb sync : ggml b2619 Georgi Gerganov 2024-04-06 17:43:15 +03:00
  • b66aec675c backend : fix typo in scheduler documentation (ggml/781) Daniel Bevenius 2024-04-03 22:57:20 +02:00
  • 57dd02c44b Tests: Added integration tests for GBNF parser (#6472) Clint Herron 2024-04-06 10:31:33 -04:00
  • 75cd4c7729 ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (#6495) Pierrick Hymbert 2024-04-06 05:40:47 +02:00
  • a8bd14d557 gguf.py : add licence and version to gguf writer (#6504) b2615 Brian 2024-04-06 05:41:38 +11:00