Commit Graph

  • da0400344b ggml-cuda : perform cublas fp16 matrix multiplication as fp16 (#3370) [b1280] slaren 2023-09-28 12:08:28 +02:00
  • e519621010 convert : remove bug in convert.py permute function (#3364) Zhang Peiyuan 2023-09-28 02:45:20 +08:00
  • ac43576124 make-ggml.py : compatibility with more models and GGUF (#3290) Richard Roberson 2023-09-27 10:25:12 -06:00
  • 20c7e1e804 gguf : fix a few general keys (#3341) [b1277] Cebtenzzre 2023-09-27 12:18:07 -04:00
  • dc6897404e metal : reusing llama.cpp logging (#3152) [b1276] Rickard Hallerbäck 2023-09-27 17:48:33 +02:00
  • 527e57cfd8 build : add ACCELERATE_NEW_LAPACK to fix warning on macOS Sonoma (#3342) [b1275] Jag Chadha 2023-09-27 11:34:32 -04:00
  • ffe88a36a9 readme : add some recent perplexity and bpw measurements to READMES, link for k-quants (#3340) BarfingLemurs 2023-09-27 11:30:36 -04:00
  • c1596f633f llama : fix kv cache heuristic when context is less than 32 Georgi Gerganov 2023-09-27 18:12:43 +03:00
  • 72e7ef4e53 simple : fixes [cam-simple-fix] slaren 2023-09-26 23:19:36 +02:00
  • 99c5c9a0d8 Upload immediately to device. [master-99c5c9a] Adam Treat 2023-09-26 11:58:39 -04:00
  • 99115f3fa6 cmake : fix build-info.h on MSVC (#3309) [b1273] DAN™ 2023-09-25 18:45:33 -04:00
  • 1726f9626f docs: Fix typo CLBlast_DIR var. (#3330) 2f38b454 2023-09-26 02:24:52 +08:00
  • a98b1633d5 nix : add cuda, use a symlinked toolkit for cmake (#3202) Erik Scholz 2023-09-25 13:48:30 +02:00
  • c091cdfb24 llama-bench : add README (#3317) slaren 2023-09-23 21:48:24 +02:00
  • 51a7cf5c6e examples : fix RoPE defaults to match PR #3240 (#3315) [b1269] Cebtenzzre 2023-09-23 05:28:50 -04:00
  • bedb92b603 scripts : use /usr/bin/env in shebang (#3313) Kevin Ji 2023-09-22 23:52:23 -04:00
  • bc9d3e3971 Update README.md (#3289) Lee Drake 2023-09-21 13:00:24 -06:00
  • 36b904e200 ggml-opencl.cpp: Make private functions static (#3300) [b1266] shibe2 2023-09-21 22:10:26 +04:00
  • 8845160058 simple : add README.md Georgi Gerganov 2023-09-21 20:10:14 +02:00
  • 5a3369d8e8 llama : llama.h formatting + comments Georgi Gerganov 2023-09-21 19:51:32 +02:00
  • 324f3403d5 zig : fix for updated c lib (#3259) Edward Taylor 2023-09-21 21:08:20 +12:00
  • f56c418ab0 embedding : update README.md (#3224) yuiseki 2023-09-21 17:57:40 +09:00
  • 8185710a80 CUDA: use only 1 thread if fully offloaded (#2915) [b1263] Johannes Gäßler 2023-09-21 10:43:53 +02:00
  • 7eb41179ed readme : update hot topics Georgi Gerganov 2023-09-20 20:48:22 +03:00
  • b2debf65f2 parallel : add disabled experimental batch chunking in powers of two Georgi Gerganov 2023-09-20 20:14:05 +03:00
  • a5661d7e71 llama : allow gguf RoPE keys to be overridden with defaults (#3240) [b1261] Cebtenzzre 2023-09-20 12:12:47 -04:00
  • ded9b43cad parallel : fix cases where the input prompts can overflow the batch Georgi Gerganov 2023-09-20 19:09:25 +03:00
  • 65c2c1c5ab benchmark-matmult : do not use integer abs() on a float (#3277) [b1260] Cebtenzzre 2023-09-20 12:06:08 -04:00
  • ee1d670cc6 parallel : fix bug (extra BOS) + smaller token_prev array Georgi Gerganov 2023-09-20 17:32:21 +03:00
  • 80834daecf flake : Restore default package's buildInputs (#3262) kang 2023-09-20 22:48:22 +09:00
  • 1be2b8c19b ggml : revert change to ggml_cpy, add ggml_cont_Nd instead (#3275) slaren 2023-09-20 15:12:51 +02:00
  • a40f2b656f CI: FreeBSD fix (#3258) [b1258] Alon 2023-09-20 15:06:36 +03:00
  • 2f3a46fccf train : make KQ_pos memory buffer permanent via dummy scale op Georgi Gerganov 2023-09-20 14:14:50 +03:00
  • 54206962c7 llama : disable MPI for now Georgi Gerganov 2023-09-20 14:06:41 +03:00
  • e04dc51988 ggml-cuda : add rope f16, restore performance with parallel decoding (#3272) slaren 2023-09-20 13:00:28 +02:00
  • db0fc2da06 simple : improve comments + free batch Georgi Gerganov 2023-09-20 13:54:20 +03:00
  • b377bf2266 simple : add parallel decoding support Georgi Gerganov 2023-09-20 13:06:34 +03:00
  • addae65fd4 llama : improve llama_batch API + simplify parallel example Georgi Gerganov 2023-09-20 10:46:18 +03:00
  • d119c04c15 examples : fix benchmark-matmult (#1554) [b1257] Georgi Gerganov 2023-09-20 10:02:39 +03:00
  • a1327c71c6 parallel : rename hot-plug to continuous-batching Georgi Gerganov 2023-09-20 09:24:02 +03:00
  • e1067efbfa llama : fix n_kv to never become 0 Georgi Gerganov 2023-09-20 09:17:05 +03:00
  • 7b7472ee26 parallel : minor Georgi Gerganov 2023-09-20 00:35:10 +03:00
  • 6028879f56 parallel : print misses on each request Georgi Gerganov 2023-09-19 23:50:05 +03:00
  • eed3fd4234 parallel : count cache misses Georgi Gerganov 2023-09-19 23:47:47 +03:00
  • 8a9aca37c1 parallel : remove question with short answers Georgi Gerganov 2023-09-19 23:34:30 +03:00
  • 4b5f3cd6bf parallel : process system prompt once + configurable paramters + llama API Georgi Gerganov 2023-09-19 17:00:42 +03:00
  • 82e20e9ba0 parallel : remove new line from prompt Georgi Gerganov 2023-09-19 13:54:41 +03:00
  • d37081ae5d llama : silence errors KV cache errors Georgi Gerganov 2023-09-19 13:39:52 +03:00
  • 16090a5dde parallel : fix sequence termination criteria Georgi Gerganov 2023-09-19 13:29:29 +03:00
  • 806d397c1a parallel : try smaller batches when the KV cache is fragmented Georgi Gerganov 2023-09-19 13:21:36 +03:00
  • ddad227782 llama : fix cell_max logic + rename functions Georgi Gerganov 2023-09-19 13:21:12 +03:00
  • 36714e16d0 parallel : various improvements Georgi Gerganov 2023-09-19 12:29:37 +03:00
  • 467e307931 simple : fix token counting Georgi Gerganov 2023-09-19 11:45:33 +03:00
  • 25bd254089 make : add parallel to build + fix static functions in llama.cpp Georgi Gerganov 2023-09-19 11:37:02 +03:00
  • 7e2b9974d1 ggml-cuda : update rope implementation for parallel decoding (#3254) slaren 2023-09-19 10:31:36 +02:00
  • daf4c6d360 llama : fix worst case graph build Georgi Gerganov 2023-09-19 11:05:08 +03:00
  • fa0e677820 llama : extend batch API to select which logits to output Georgi Gerganov 2023-09-19 00:24:13 +03:00
  • 897caccdf4 fixes : speculative KV cache + llama worst-case graph Georgi Gerganov 2023-09-18 22:00:02 +03:00
  • 466b513851 parallel : disable hot-plug to avoid cache fragmentation Georgi Gerganov 2023-09-18 21:34:20 +03:00
  • 0161372b9a parallel : example for serving multiple users in parallel Georgi Gerganov 2023-09-18 20:30:05 +03:00
  • 1f17ea631c speculative : fix KV cache management Georgi Gerganov 2023-09-18 19:01:20 +03:00
  • 7c1bdd0e8a llama : apply K-cache roping for Falcon and Baichuan Georgi Gerganov 2023-09-18 18:26:05 +03:00
  • 0cbf3bfef8 llama : add llama_kv_cache_shift_seq + no more context swaps Georgi Gerganov 2023-09-18 18:00:25 +03:00
  • 86c90e34f5 metal : disable concurrency optimization Georgi Gerganov 2023-09-18 18:00:01 +03:00
  • f015b26689 llama : more robust cell_max heuristic + wip shift Georgi Gerganov 2023-09-18 17:15:25 +03:00
  • 8781013ef6 make : restore build-info.h dependency for several targets (#3205) [b1256] Cebtenzzre 2023-09-18 10:03:53 -04:00
  • 4d76d762ef llama : extend llama_kv_cache API Georgi Gerganov 2023-09-18 15:53:03 +03:00
  • 6952a460b9 llama : add cell_max heuristic for more efficient kv_cache Georgi Gerganov 2023-09-18 15:31:24 +03:00
  • 9f42e75489 llama : add new llama_decode() API that works with llama_batch Georgi Gerganov 2023-09-18 14:23:52 +03:00
  • 58bb5110ca Merge branch 'master' into custom-attention-mask Georgi Gerganov 2023-09-18 11:15:18 +03:00
  • d29e76937c llama : unified KV cache + batch inference API Georgi Gerganov 2023-09-18 10:08:22 +03:00
  • 7ddf185537 ci : switch cudatoolkit install on windows to networked (#3236) [b1255] Erik Scholz 2023-09-18 02:21:47 +02:00
  • ee66942d7e CUDA: fix peer access logic (#3231) [b1254] Johannes Gäßler 2023-09-17 23:35:20 +02:00
  • 784d14ed31 llama : store non-RoPEd K cache (WIP) [custom-attention-mask-no-roped-cache] Georgi Gerganov 2023-09-17 23:12:28 +03:00
  • fad56936d4 metal : add rope_f16 kernel + optimize cpy kernels Georgi Gerganov 2023-09-17 23:09:48 +03:00
  • 1fb033fd85 ggml : ggml_rope now takes a vector with positions instead of n_past Georgi Gerganov 2023-09-17 21:12:51 +03:00
  • 3b4bab6a38 llama : replace ggml_diag_mask_inf with ggml_add (custom -inf mask) Georgi Gerganov 2023-09-17 19:42:39 +03:00
  • c5df72e848 tests : verify that RoPE is "additive" Georgi Gerganov 2023-09-17 17:54:14 +03:00
  • 111163e246 CUDA: enable peer access between devices (#2470) [b1253] Johannes Gäßler 2023-09-17 16:37:53 +02:00
  • 8b428c9bc8 llama.cpp : show model size and BPW on load (#3223) [b1252] slaren 2023-09-17 14:33:28 +02:00
  • 578d8c8f5c CUDA: fix scratch malloced on non-main device (#3220) [b1251] Johannes Gäßler 2023-09-17 14:16:22 +02:00
  • b541b4f0b1 Enable BUILD_SHARED_LIBS=ON on all Windows builds (#3215) [b1250] IsaacDynamo 2023-09-16 19:35:25 +02:00
  • 0631ea363c Don't crash on available devices if we can't even create an instance. [master-0631ea3] Adam Treat 2023-09-16 12:17:29 -04:00
  • 5dbc2b3213 Enable build with CUDA 11.0 (make) (#3132) [b1249] Vlad 2023-09-16 17:55:43 +03:00
  • b08e75baea Fixing the last deviations from sentencepiece indicated by test-tokenizer-1 (#3170) [b1248] goerch 2023-09-16 13:41:33 +02:00
  • e6616cf0db examples : add compiler version and target to build info (#2998) [b1247] Cebtenzzre 2023-09-15 16:59:49 -04:00
  • 3aefaab9e5 check C++ code with -Wmissing-declarations (#3184) [b1246] Cebtenzzre 2023-09-15 15:38:27 -04:00
  • 69eb67e282 fix build numbers by setting fetch-depth=0 (#3197) [b1245] Cebtenzzre 2023-09-15 15:18:15 -04:00
  • 4fe09dfe66 llama : add support for StarCoder model architectures (#3187) Meng Zhang 2023-09-16 03:02:13 +08:00
  • 80291a1d02 common : do not use GNU zero-length __VA_ARGS__ extension (#3195) Cebtenzzre 2023-09-15 14:02:01 -04:00
  • c6f1491da0 metal : fix bug in soft_max kernels (out-of-bounds access) (#3194) Georgi Gerganov 2023-09-15 20:17:24 +03:00
  • e3d87a6c36 convert : make ftype optional in simple scripts (#3185) Cebtenzzre 2023-09-15 12:29:02 -04:00
  • 8c00b7a6ff sync : ggml (Metal F32 support + reduce ggml-alloc size) (#3192) Georgi Gerganov 2023-09-15 19:06:03 +03:00
  • 92a4f86879 llama : make starcoder graph build more consistent with others [support-starcoder-fix] Georgi Gerganov 2023-09-15 17:57:10 +03:00
  • f82328ab65 metal : fix out-of-bounds access in soft_max kernels Georgi Gerganov 2023-09-15 17:56:49 +03:00
  • 7e50d34be6 cmake : fix building shared libs for clang (rocm) on windows (#3176) Engininja2 2023-09-15 06:24:30 -06:00
  • 6c353dc7c2 cleanup useless code Meng Zhang 2023-09-15 19:00:14 +08:00
  • a1cf66ea94 working in cpu, metal buggy Meng Zhang 2023-09-15 16:56:50 +08:00
  • 235f7c193b flake : use pkg-config instead of pkgconfig (#3188) Evgeny Kurnevsky 2023-09-15 10:10:22 +02:00
  • a51b687657 metal : relax conditions on fast matrix multiplication kernel (#3168) Georgi Gerganov 2023-09-15 11:09:24 +03:00