Commit Graph

  • 172ac82629 cmake : fix Vulkan build (#5182) b2000 Eve 2024-01-29 08:04:47 +00:00
  • 1db22d7032 metal : support Q > 8 Georgi Gerganov 2024-01-28 23:08:31 +02:00
  • 134c81c78d metal : minor Georgi Gerganov 2024-01-28 22:23:40 +02:00
  • 0ad44baf33 Merge branch 'master' into gg/flash-attn Georgi Gerganov 2024-01-28 21:53:51 +02:00
  • d2f650cb5b metal : free metal objects (#5161) b1999 Paul Tsochantaris 2024-01-28 19:50:16 +00:00
  • 35dec26cc2 sync : ggml b1998 Georgi Gerganov 2024-01-28 19:48:05 +02:00
  • d460510c72 ggml : minor type fix (int64_t -> size_t) Georgi Gerganov 2024-01-28 18:44:58 +02:00
  • 2307523d32 ggml : add Vulkan backend (#2059) b1996 0cc4m 2024-01-28 18:03:59 +01:00
  • 8612864108 ggml : fix f16 mad Georgi Gerganov 2024-01-28 18:10:16 +02:00
  • 0f648573dd ggml : add unified SYCL backend for Intel GPUs (#2690) b1995 Abhilash Majumder 2024-01-28 21:26:23 +05:30
  • 3a428a1097 metal : improve precision Georgi Gerganov 2024-01-28 17:47:22 +02:00
  • b764b8f1d0 flake.lock: Update (#5162) Georgi Gerganov 2024-01-28 16:54:54 +02:00
  • ecc466a460 metal : add tests, fix scaling, support C > 32 Georgi Gerganov 2024-01-28 15:42:57 +02:00
  • 77f6976a87 metal : move output into local memory + optimize Georgi Gerganov 2024-01-28 13:15:00 +02:00
  • 9241c3a2ac Apply min_p to unsorted tokens (#5115) b1993 Johannes Gäßler 2024-01-28 09:59:49 +01:00
  • b3dd7d975f Merge branch 'master' into gg/flash-attn Georgi Gerganov 2024-01-28 10:53:16 +02:00
  • b2b2bf988c Tests for min_p, sampling queue (#5147) b1992 Johannes Gäßler 2024-01-28 09:35:14 +01:00
  • af4980bfed readme : add link to rust bindings (#5148) Marcus Dunn 2024-01-28 00:30:44 -08:00
  • f2e69d28c0 llama : add support for Orion-14B (#5118) b1990 sharpHL 2024-01-28 16:00:30 +08:00
  • 39baaf55a1 docker : add server-first container images (#5157) b1989 Kyle Mistele 2024-01-28 01:55:31 -06:00
  • 2455a8d6c3 update impl FSSRepo 2024-01-27 12:23:40 -05:00
  • 7cea9735ab Merge branch 'gg/flash-attn' of https://github.com/ggerganov/llama.cpp into flash-attn-cuda FSSRepo 2024-01-27 11:38:20 -05:00
  • 6db2b41a76 llava : support for Yi-VL and fix for mobileVLM (#5093) b1988 John 2024-01-27 16:09:18 +01:00
  • 753eafed0e sync : ggml b1987 Georgi Gerganov 2024-01-27 16:59:20 +02:00
  • e976423005 ggml : check ggml_add src1 type (ggml/708) Judd 2024-01-26 21:04:01 +08:00
  • 35a2ee9143 Remove unused data and add fixes (#5154) b1985 Michael Klimenko 2024-01-27 15:25:55 +01:00
  • ec903c0341 server : add self-extend support (#5104) b1984 Maximilian Winter 2024-01-27 14:38:05 +01:00
  • 0a481fe1a9 integrate tensor cores FSSRepo 2024-01-26 20:14:02 -05:00
  • a1d6df129b Add OpenCL add kernel (#5151) b1983 0cc4m 2024-01-26 23:07:32 +01:00
  • bbe7c56c99 cmake : pass CPU architecture flags to nvcc (#5146) b1982 Jared Van Bortel 2024-01-26 15:34:06 -05:00
  • 62fead3ea0 cuda : fix tensor size calculation for non-split buffer (#5145) b1981 slaren 2024-01-26 18:59:43 +01:00
  • 15b4538ff2 ggml-alloc : add 10% margin to the buffer sizes (#5149) b1980 slaren 2024-01-26 18:18:26 +01:00
  • 7032f4f634 ggml : update softmax n_task calculation (#5126) b1979 snadampal 2024-01-26 11:17:59 -06:00
  • 5f1925a8ce scripts : move run-with-preset.py from root to scripts folder Georgi Gerganov 2024-01-26 17:09:44 +02:00
  • 3b7c914de2 tests : gitignore test-c.o Georgi Gerganov 2024-01-26 14:48:15 +02:00
  • 48c857aa10 server : refactored the task processing logic (#5065) b1976 Xuan Son Nguyen 2024-01-26 13:42:20 +01:00
  • 413e7b0559 ci : add model tests + script wrapper (#4586) b1975 crasm 2024-01-26 07:18:00 -05:00
  • 6dd3c28c9c metal : remove unused n_buffers and buffers (#5129) b1974 Paul Tsochantaris 2024-01-26 12:16:07 +00:00
  • 38b431de23 gguf : fix "general.alignment" type in gguf_reader.py (#5136) Riceball LEE 2024-01-26 17:10:28 +08:00
  • aad0b01d73 readme : update hot topics Georgi Gerganov 2024-01-26 10:52:33 +02:00
  • 1182cf4d4f Another bucket sort (#5109) b1971 Kawrakow 2024-01-26 09:14:39 +02:00
  • fe54033b69 readme : add MobileVLM 1.7B/3B to the supported models list (#5107) XiaotaoChen 2024-01-26 04:14:32 +08:00
  • 5eaf9964fc llama : dynamic temperature sampling (#4972) b1969 l3utterfly 2024-01-26 05:06:22 +09:00
  • d292f4f204 examples : make pydantic scripts pass mypy and support py3.8 (#5099) Jared Van Bortel 2024-01-25 14:51:24 -05:00
  • 256d1bb0dd android : use release cmake build type by default (#5123) Valentin Konovalov 2024-01-25 12:05:51 -05:00
  • 6fea843b24 metal : add parallel reduce version (disabled) Georgi Gerganov 2024-01-25 17:59:41 +02:00
  • 6e7cb0eeaf update implementation FSSRepo 2024-01-25 11:04:51 -05:00
  • faa3526a1e Fix Q3_K_XS for MoE models (#5113) b1966 Kawrakow 2024-01-25 17:58:53 +02:00
  • f9ca5dcbe8 llama : avoid ggml_cast, use F32 query Georgi Gerganov 2024-01-25 17:46:07 +02:00
  • 78da3387a8 Merge branch 'gg/flash-attn' of https://github.com/ggerganov/llama.cpp into flash-attn-cuda FSSRepo 2024-01-25 09:48:37 -05:00
  • 40ea8cd1ac metal : fix comment Georgi Gerganov 2024-01-25 16:31:39 +02:00
  • 432ad04ffa metal : scale and mask in matrix form Georgi Gerganov 2024-01-25 15:47:52 +02:00
  • d917746ddb metal : avoid redundant loads of the attention Georgi Gerganov 2024-01-25 15:00:49 +02:00
  • 1446a12b29 metal : efficient flash_attn_f16 implementation Georgi Gerganov 2024-01-23 18:27:54 +02:00
  • 2bf91c5306 metal : clean up gg/flash-attn-simd Georgi Gerganov 2024-01-25 13:29:45 +02:00
  • f6416d4493 wip : good version 8x32 Georgi Gerganov 2024-01-25 12:59:59 +02:00
  • ddc5a5033f metal : show compile log messages b1965 Georgi Gerganov 2024-01-25 11:26:17 +02:00
  • eb12e3c391 wip : disable skip Georgi Gerganov 2024-01-25 11:25:07 +02:00
  • 806382a3a6 wip : simdify ms, vs Georgi Gerganov 2024-01-25 09:39:22 +02:00
  • cd4fddb29f cuda : fix 2-bit quants on amd hip (#5105) b1964 Engininja2 2024-01-24 16:18:15 -06:00
  • 0fc36d872c match to metal impl FSSRepo 2024-01-24 16:45:30 -05:00
  • 972c2adc15 use half2 instead half4 FSSRepo 2024-01-24 16:41:57 -05:00
  • f2efa6cd98 wip : simd Georgi Gerganov 2024-01-24 17:06:48 +02:00
  • 6416821499 fix equivalent fp16 math functions, compiler error 'undefined' FSSRepo 2024-01-24 10:57:05 -05:00
  • 6ccbd1777a wip gg/flash-attn-wip3 Georgi Gerganov 2024-01-24 15:45:04 +02:00
  • c9b316c78f nix-shell: use addToSearchPath b1963 Michael Hueschen 2024-01-22 16:44:10 -07:00
  • bf63d695b8 nix: add cc to devShell LD_LIBRARY_PATH Michael Hueschen 2024-01-22 03:17:05 -07:00
  • 1387ea2117 llama : pre-allocate input tensors in a separate buffer (#5100) b1961 slaren 2024-01-24 12:48:14 +01:00
  • da23b56f25 wip : no ic 8 step gg/flash-attn-wip4 Georgi Gerganov 2024-01-24 13:25:34 +02:00
  • af3eda9c77 wip Georgi Gerganov 2024-01-24 11:18:24 +02:00
  • 5cbdba693d wip Georgi Gerganov 2024-01-24 10:16:05 +02:00
  • 035c4f01e6 wip Georgi Gerganov 2024-01-24 00:01:54 +02:00
  • 6374bc5779 cuda: port metal version flash_attn_ext FSSRepo 2024-01-23 16:42:53 -05:00
  • 06c2d0d117 wip gg/flash-attn-wip2 Georgi Gerganov 2024-01-23 18:27:54 +02:00
  • a689b02ad3 Merge branch 'gg/flash-attn' of https://github.com/ggerganov/llama.cpp into flash-attn-cuda FSSRepo 2024-01-23 13:51:59 -05:00
  • 26d607608d metal : disable support for MUL_MAT F32 x F16 b1960 Georgi Gerganov 2024-01-23 15:50:56 +02:00
  • 44879ee885 Additional KL-divergence statistics (#5081) b1959 Kawrakow 2024-01-23 15:17:20 +02:00
  • 9ecdd12e95 CUDA: more info when no device code (#5088) b1958 Johannes Gäßler 2024-01-23 13:31:56 +01:00
  • 89758723c7 minor : clean-up some warnings and style (#5094) b1957 Georgi Gerganov 2024-01-23 14:12:57 +02:00
  • 2bed4aa3f3 devops : add intel oneapi dockerfile (#5068) b1956 Xuan Son Nguyen 2024-01-23 08:11:39 +01:00
  • 125d03a503 llama.vim : added api key support (#5090) Michael Coppola 2024-01-23 01:51:27 -05:00
  • 011e8ec577 llama : fix not enough space in buffer with Qwen (#5086) b1954 slaren 2024-01-22 23:42:41 +01:00
  • 6f9939d119 KL-divergence (#5076) b1953 Kawrakow 2024-01-22 16:10:14 +02:00
  • 780e24a22e ggml : parallelize FP32 conversion when using BLAS (#5045) b1952 Reinforce-II 2024-01-22 21:15:08 +08:00
  • 3ce7e8f8e7 llava : MobileVLM support (#4954) b1951 XiaotaoChen 2024-01-22 21:09:35 +08:00
  • b2d80e105a flake.nix: add a comment about flakes vs nix b1950 Someone Serge 2024-01-21 03:41:37 +00:00
  • 28603cd283 nix: add a comment on the many nixpkgs-with-cuda instances Someone Serge 2024-01-21 03:29:38 +00:00
  • 5e97ec91ae nix: add a comment about makeScope Someone Serge 2024-01-21 03:15:13 +00:00
  • 7251870780 nix: refactor the cleanSource rules Someone Serge 2024-01-13 17:45:01 +00:00
  • fe8b3c0d4b workflows: nix-ci: drop the redundant "paths" filter Someone Serge 2024-01-13 17:38:32 +00:00
  • f4dd059259 workflows: nix-build-aarch64: rate limit Someone Serge 2024-01-13 17:16:54 +00:00
  • f7276f7500 workflows: nix-ci: rebuild on flake.lock updates Someone Serge 2024-01-13 17:10:19 +00:00
  • 15bceec2d7 imatrix : keep intermediate imatrix results (#5077) b1943 Kawrakow 2024-01-22 14:18:43 +02:00
  • d6bd4d46dd llama : support StableLM 2 1.6B (#5052) b1942 compilade 2024-01-22 06:21:52 -05:00
  • 152d9d05e0 finetune : print sample-start/include-sample-start (#5072) b1941 Daniel Bevenius 2024-01-22 12:11:01 +01:00
  • 66d575c45c llama : add Q3_K_XS (#5060) b1940 Kawrakow 2024-01-22 12:43:33 +02:00
  • 57744932c6 ci : fix Windows CI by updating Intel SDE version (#5053) b1939 bobqianic 2024-01-22 08:55:05 +00:00
  • 3466c6ebcf llama : add more qwen2 models (#5071) Shijie 2024-01-22 15:33:19 +08:00
  • 504dc37be8 Revert LLAMA_NATIVE to OFF in flake.nix (#5066) iSma 2024-01-21 22:37:13 +01:00
  • 17720fad66 metal : parallel reduce across heads Georgi Gerganov 2024-01-21 22:44:41 +02:00