Commit Graph

  • ef68fac2a8 cuda : fix matrix names Georgi Gerganov 2024-02-03 18:36:58 +02:00
  • cfd9732b2e cuda : simplify softmax Georgi Gerganov 2024-02-03 18:31:55 +02:00
  • e04ff39181 cuda : fix -INF block check Georgi Gerganov 2024-02-03 16:57:46 +02:00
  • 5b263dd83a cuda : unroll Q*K^T loop Georgi Gerganov 2024-02-03 16:12:20 +02:00
  • 3b1c4e7673 cuda : speed-up reduce part of the kernel Georgi Gerganov 2024-02-03 15:36:05 +02:00
  • a7b471569b cuda : switch to 1 warp for bs > 16 Georgi Gerganov 2024-02-03 15:17:49 +02:00
  • b958151e3f cuda : use half2 in softmax Georgi Gerganov 2024-02-03 15:00:25 +02:00
  • c51f27c0db cuda : avoid __hisinf branches Georgi Gerganov 2024-02-03 14:27:36 +02:00
  • 92472ea22c cuda : unroll some of the loops Georgi Gerganov 2024-02-03 14:10:01 +02:00
  • 1f8a592482 cuda : make loops use the same loop values Georgi Gerganov 2024-02-03 14:01:32 +02:00
  • 7c34655b36 cuda : use int instead of int64_t Georgi Gerganov 2024-02-03 13:39:46 +02:00
  • 52bb63c708 refactor : switch to emplace_back to avoid extra object (#5291) b2054 Michael Klimenko 2024-02-03 12:23:37 +01:00
  • 1ec3332ade YaRN : store rope scaling type as int32_t in memory (#5285) b2053 Jared Van Bortel 2024-02-03 06:22:06 -05:00
  • 6a66c5071a readme : add tenere in the ui tools list (#5284) BADR 2024-02-03 12:20:26 +01:00
  • b150abe83e cuda : avoid warp_reduce for smax Georgi Gerganov 2024-02-03 13:17:47 +02:00
  • a305dba8ff Fix im2col with 32fp (#5286) b2051 AidanBeltonS 2024-02-03 08:11:37 +00:00
  • 191221178f perplexity : fix KL divergence calculations on Windows (#5273) b2050 kalomaze 2024-02-02 08:15:30 -06:00
  • b68a112204 cuda : fix __hisinf() result check Georgi Gerganov 2024-02-02 15:12:28 +02:00
  • e437b37fd0 scripts : parse wtype in server-llm.sh (#5167) Georgi Gerganov 2024-02-02 14:23:40 +02:00
  • 2d40085c26 py : add check for '.attn.masked_bias' layers to GPT2model (#5281) Mirror Azure 2024-02-02 14:39:09 +03:00
  • 12eaa22628 tests : update dims Georgi Gerganov 2024-02-02 11:55:38 +02:00
  • b05102fe8c Tidy ggml-sycl (#5261) b2047 AidanBeltonS 2024-02-02 08:39:48 +00:00
  • 6b91b1e0a9 docker : add build for SYCL, Vulkan + update readme (#5228) Xuan Son Nguyen 2024-02-02 08:56:31 +01:00
  • e805f0fa99 [SYCL] get MAX_MEM_ALLOC from device property (#5270) b2045 Meng, Hengyu 2024-02-02 15:54:14 +08:00
  • af3ba5d946 [SYCL] update guide of SYCL backend (#5254) Neo Zhang Jianyu 2024-02-02 15:53:27 +08:00
  • e1e721094d llama : fix memory leak in llama_batch_free (#5252) b2043 Ian Bull 2024-02-01 23:20:13 -08:00
  • db1f3c482e cuda : avoid zeroing fragments Georgi Gerganov 2024-02-01 22:08:37 +02:00
  • 128dcbd3c9 add --no-mmap in llama-bench (#5257) b2042 Neo Zhang Jianyu 2024-02-02 03:48:53 +08:00
  • c6769b9422 tests : minor fix Georgi Gerganov 2024-02-01 21:24:26 +02:00
  • cda5a60a41 metal : optimize softmax Georgi Gerganov 2024-02-01 20:53:29 +02:00
  • 4d0924a890 Vulkan Phi Fix for AMD Proprietary Drivers (#5260) b2041 0cc4m 2024-02-01 19:25:24 +01:00
  • 56e45a239e metal : optimize softmax for C > 32 Georgi Gerganov 2024-02-01 20:16:32 +02:00
  • 41d136b602 Merge branch 'master' into gg/flash-attn Georgi Gerganov 2024-02-01 19:51:41 +02:00
  • 5a19a9f6d0 cuda : add flash_attn kernel (wip) Georgi Gerganov 2024-02-01 19:47:11 +02:00
  • b957b8f5f6 cuda : add flash_attn kernel (wip) gg/flash-attn-cuda Georgi Gerganov 2024-02-01 19:47:11 +02:00
  • 8ca511cade cuda : fix LLAMA_CUDA_F16 (#5262) b2040 slaren 2024-02-01 18:30:17 +01:00
  • d71ac90985 make : generate .a library for static linking (#5205) b2039 Ali Nehzat 2024-02-02 02:18:53 +11:00
  • ac26f27028 cuda : increase C to 128 for better performance flash-attn-cuda Georgi Gerganov 2024-02-01 16:12:56 +02:00
  • 2e46013749 cuda : fix soft_max to use correct mask size Georgi Gerganov 2024-02-01 16:47:20 +02:00
  • 910b15bb40 ggml : fix ggml_soft_max mask requirement Georgi Gerganov 2024-02-01 16:41:02 +02:00
  • 9a5c2a1681 cuda : switch to F16 scalars + tune warps for RTX 2060 Georgi Gerganov 2024-02-01 15:00:47 +02:00
  • 2c04beeb81 cuda : avoid extra QxQ matrix in shared memory Georgi Gerganov 2024-02-01 14:03:03 +02:00
  • ce32060198 llama : support InternLM2 (#5184) b2038 Guoteng 2024-02-01 17:19:51 +08:00
  • 71b69aa7fd cuda : fix flash_attn kernel to produce same results as CPU Georgi Gerganov 2024-02-01 09:40:56 +02:00
  • fd878f71ed cuda: mask as fp16 FSSRepo 2024-01-31 16:22:11 -05:00
  • 3df0b8d47c Merge branch 'gg/flash-attn' of https://github.com/ggerganov/llama.cpp into flash-attn-cuda FSSRepo 2024-01-31 16:09:34 -05:00
  • 0afe47fa5f fix naive implementation FSSRepo 2024-01-31 15:43:42 -05:00
  • 1cfb5372cf Fix broken Vulkan Cmake (properly) (#5230) b2037 Eve 2024-01-31 19:21:55 +00:00
  • 8ad92dc1ec ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext Georgi Gerganov 2024-01-31 19:17:16 +02:00
  • 1ad42b1f1e ggml : ggml_soft_max uses F16 mask gg/flash-attn-mask-f16 Georgi Gerganov 2024-01-31 19:17:16 +02:00
  • b1479dfbc5 fix kernel FSSRepo 2024-01-31 12:28:48 -05:00
  • 2ddc9bbef1 Merge branch 'master' into gg/flash-attn Georgi Gerganov 2024-01-31 18:49:43 +02:00
  • d3bac7d584 llama : reorder build_orion() at correct place (#5118) b2036 Georgi Gerganov 2024-01-31 18:47:10 +02:00
  • 5cb04dbc16 llama : remove LLAMA_MAX_DEVICES and LLAMA_SUPPORTS_GPU_OFFLOAD (#5240) b2035 Georgi Gerganov 2024-01-31 17:30:17 +02:00
  • efb7bdbbd0 metal : add im2col F32 dst support (#5132) b2034 Georgi Gerganov 2024-01-31 15:35:41 +02:00
  • 15606309a0 llava : add MobileVLM support (#5132) b2033 JidongZhang-THU 2024-01-31 21:10:15 +08:00
  • b2b9f025e7 format license text, restore apache license by legal suggestion (#5233) b2032 Neo Zhang Jianyu 2024-01-31 21:04:46 +08:00
  • dabcc5b471 ggml : limit n_threads to the max n_tasks (#5238) b2031 slaren 2024-01-31 13:43:03 +01:00
  • f8e9140cb4 Vulkan Fixes (#5223) b2030 0cc4m 2024-01-31 11:44:19 +01:00
  • d62520eb2c Fix typos of IQ2_XXS and IQ3_XXS in llama.cpp (#5231) b2029 Yiming Cui 2024-01-31 11:04:21 +08:00
  • 01684139c3 support SYCL backend windows build (#5208) b2028 Neo Zhang Jianyu 2024-01-31 10:38:07 +08:00
  • e8dc55d006 kompute : llama-bench support and ggml_cpu_has_kompute() (#5226) b2027 Jared Van Bortel 2024-01-30 19:04:37 -05:00
  • 3b0f74b428 latest kernel update, wrong values FSSRepo 2024-01-30 14:57:12 -05:00
  • 3d03bcb7af Merge branch 'master' into gg/flash-attn Georgi Gerganov 2024-01-30 21:49:13 +02:00
  • 78df5527e4 tests : ifdef Georgi Gerganov 2024-01-30 21:46:49 +02:00
  • d073e4f933 metal : fix array initialization Georgi Gerganov 2024-01-30 21:45:32 +02:00
  • e0085fdf7c Revert "server : change deps.sh xxd files to string literals (#5221)" b2026 Georgi Gerganov 2024-01-30 21:19:26 +02:00
  • e6f291d158 server : fix context shift (#5195) Georgi Gerganov 2024-01-30 20:17:30 +02:00
  • 4003be0e5f server : change deps.sh xxd files to string literals (#5221) JohnnyB 2024-01-30 12:15:05 -06:00
  • fea4fd4ba7 ggml : fix IQ3_XXS on Metal (#5219) Kawrakow 2024-01-30 19:15:28 +02:00
  • 719a087138 iq3_xxs: forgotten update of the grid points ik/fix_iq3xxs_metal Iwan Kawrakow 2024-01-30 18:39:07 +02:00
  • 8f8ddfcfad sync : ggml (#0) b2022 Georgi Gerganov 2024-01-30 16:21:57 +02:00
  • 6fb50ebbf0 gguf : fix comparison (ggml/715) Georgi Gerganov 2024-01-29 21:08:18 +02:00
  • 625a699b54 ggml_cuda_cpy support for 4d tensors and float16->float32 upcasting (ggml/686) John Balis 2024-01-29 06:37:33 -06:00
  • a4b07c057a gguf : add input validation, prevent integer overflows (ggml/709) Georgi Gerganov 2024-01-29 14:00:10 +02:00
  • 549a1e6cd5 ci : fix yolo URLs + fix metal capture (ggml/712) Georgi Gerganov 2024-01-29 13:29:46 +02:00
  • 5f14ee0b0c metal : add debug capture backend function (ggml/694) Jack Mousseau 2024-01-29 01:22:23 -08:00
  • 8e14e3ddb3 Faster AVX2 dot product for IQ2_XS (#5187) b2016 Kawrakow 2024-01-30 15:15:07 +02:00
  • f4d7e54974 SOTA 3-bit quants (#5196) b2015 Kawrakow 2024-01-30 15:14:12 +02:00
  • 2256f36b79 Vulkan Windows APU Memory Handling (#5199) b2014 0cc4m 2024-01-30 13:59:30 +01:00
  • 7359016c7c quantize : fix typo (#5211) b2013 Vladimir Malyutin 2024-01-30 17:57:07 +07:00
  • 813416991a main : allow empty --prompt-cache file (#5176) b2012 divinity76 2024-01-30 10:18:02 +01:00
  • 5589921ef8 readme : minor (#5204) Romain Neutron 2024-01-30 10:16:38 +01:00
  • 49f44b5c55 readme : update hot topics Georgi Gerganov 2024-01-30 11:14:44 +02:00
  • 6685cc41c2 server : improve README (#5209) Wu Jian Ping 2024-01-30 17:11:46 +08:00
  • ceebbb5b21 ggml alloc: Fix for null dereference on alloc failure (#5200) b2008 Paul Tsochantaris 2024-01-29 22:19:29 +00:00
  • 6daa69ee81 kompute : fix fallback to CPU (#5201) b2007 Jared Van Bortel 2024-01-29 17:11:27 -05:00
  • fbf1ddec69 Nomic Vulkan backend (#4456) b2006 Jared Van Bortel 2024-01-29 15:50:50 -05:00
  • 7980178a17 Merge branch 'gg/flash-attn' of https://github.com/ggerganov/llama.cpp into flash-attn-cuda FSSRepo 2024-01-29 13:17:39 -05:00
  • a1d5a12bc5 fix compiler error FSSRepo 2024-01-29 13:15:33 -05:00
  • 5fcb9c1c5a metal : faster inner loop for C == 32 Georgi Gerganov 2024-01-29 19:46:22 +02:00
  • c6c1132e5e tests : more Georgi Gerganov 2024-01-29 18:22:28 +02:00
  • abeaf0d90e metal : disable buffer allocation logs Georgi Gerganov 2024-01-29 18:12:24 +02:00
  • 2aed77eb06 fix typo "RLIMIT_MLOCK" (#5175) b2005 divinity76 2024-01-29 15:45:41 +01:00
  • 4794821a31 tests : add ATTN tests Georgi Gerganov 2024-01-29 16:44:55 +02:00
  • c82d18e863 server : embeddings compatibility for OpenAI (#5190) b2004 Wu Jian Ping 2024-01-29 21:48:10 +08:00
  • 14fef85e2d py : fix except (#5194) Georgi Gerganov 2024-01-29 15:35:54 +02:00
  • e76627bcce py : improve BPE tokenizer support (#5189) b2002 Sang-Kil Park 2024-01-29 18:24:19 +09:00
  • fbe7dfa53c ggml : add max buffer sizes to opencl and metal backends (#5181) b2001 slaren 2024-01-29 09:05:13 +01:00
  • 172ac82629 cmake : fix Vulkan build (#5182) b2000 Eve 2024-01-29 08:04:47 +00:00