Commit Graph

  • 78b226a959 gguf : initial model loading - not tested Georgi Gerganov 2023-07-26 16:32:05 +03:00
  • d91b985d2d gguf : read tensor info Georgi Gerganov 2023-07-26 14:58:35 +03:00
  • 8d6acfec12 gguf : read header + meta data Georgi Gerganov 2023-07-26 14:33:53 +03:00
  • 6873148771 gguf : first API pass Georgi Gerganov 2023-07-26 13:24:20 +03:00
  • 7e82d25f40 ci : disable CI temporary to not waste energy Georgi Gerganov 2023-07-26 11:26:14 +03:00
  • bae6b125f6 wip : implement GGUF (#2397) M. Yusuf Sarıgöz 2023-07-26 11:17:05 +03:00
  • 4d698495ea gguf : init Georgi Gerganov 2023-07-26 11:16:07 +03:00
  • 5488fb789e ggml : allocate graphs in a context (#2392) master-5488fb7 slaren 2023-07-26 15:56:53 +02:00
  • eb542d3932 Add LLAMA_DEFAULT_RMS_EPS so we can change the default (#2384) master-eb542d3 Kawrakow 2023-07-25 18:35:53 +03:00
  • 07aaa0f63f ggml : fix ggml_flash_attn to use op_params (#2387) master-07aaa0f slaren 2023-07-25 16:20:12 +02:00
  • fce48caf9a convert.py : support bpe tokenizer (#2228) ldwang 2023-07-25 21:22:09 +08:00
  • 875086bdb9 ggml : relax contiguous constraints in activation function (#2371) master-875086b Jiahao Li 2023-07-25 20:58:32 +08:00
  • da1889834a ggml : improve graph build time via hash table lookup (#2329) master-da18898 slaren 2023-07-25 14:32:20 +02:00
  • 82552b7f54 build : fix line breaking error in build-info.sh (#2349) Hesen Peng 2023-07-25 05:24:09 -07:00
  • 0c06204fb3 main : add --in-prefix-bos to prefix BOS to user inputs; keep EOS (#2304) master-0c06204 Xiao-Yong Jin 2023-07-25 07:19:11 -05:00
  • 1fed755b1f ci : add non-AVX scalar build/test (#2356) master-1fed755 Eve 2023-07-25 08:16:13 -04:00
  • be2301bcda k_quants : add AVX support to dot functions with QK_K as 64 (#2339) master-be2301b katsu560 2023-07-25 21:13:41 +09:00
  • 1aa18ef994 metal : concurrently dispatch commands (#2358) master-1aa18ef Shouzheng Liu 2023-07-25 08:00:19 -04:00
  • 9a08eaf3c4 Another speed gain for Q4_0 and Q4_1 on Metal (#2375) Kawrakow 2023-07-25 13:48:29 +03:00
  • 129d844c87 Fix Q4_K and Q5_K for QK_K = 64 on CUDA (#2359) master-129d844 Kawrakow 2023-07-25 13:48:04 +03:00
  • d5512b782b server: add rms_norm_eps parameter (#2380) master-d5512b7 slaren 2023-07-25 11:36:17 +02:00
  • c798308e3a [Server] Escape HTML in webchat (#2368) master-c798308 Henri Vasserman 2023-07-25 10:27:34 +03:00
  • 41c674161f make rms_norm_eps a parameter (#2374) master-41c6741 slaren 2023-07-24 17:57:12 +02:00
  • b3f138d058 Chat UI extras (#2366) master-b3f138d Aarni Koskela 2023-07-24 17:54:22 +03:00
  • 5b2b2dc6ae ggml : sync (unary ops refactor, static-correctness) (#2370) master-5b2b2dc Georgi Gerganov 2023-07-24 14:46:21 +03:00
  • ca2467d12c chat css Henri Vasserman 2023-07-24 14:09:05 +03:00
  • f77972f9af Merge remote-tracking branch 'origin/master' into server-cfg Henri Vasserman 2023-07-24 14:08:40 +03:00
  • 42f70cb2f6 Fix scalar version of Q5_K when QK_K = 64 (#2362) master-42f70cb Kawrakow 2023-07-24 12:55:02 +03:00
  • 84e09a7d8b llama : add grammar-based sampling (#1773) master-84e09a7 Evan Jones 2023-07-23 23:58:10 -04:00
  • 2f9cf974a0 Some more Q4_K and Q5_K speedup on CUDA (#2346) master-2f9cf97 Kawrakow 2023-07-24 00:19:47 +03:00
  • 4f06592cc6 Add gqa parameter support to the server (#2351) master-4f06592 IgnacioFDM 2023-07-23 17:31:17 -03:00
  • 70d26ac388 Fix __dp4a documentation (#2348) Johannes Gäßler 2023-07-23 17:49:06 +02:00
  • 57921ca6db common : n_threads == -1 uses std::thread::hardware_concurrency() (#2347) master-57921ca wzy 2023-07-23 21:33:02 +08:00
  • 3602ac4255 fix n_tasks (#2342) master-3602ac4 slaren 2023-07-23 15:19:39 +02:00
  • 95a6c595e7 ggml: move op parameters from tensors to ggml_tensor::op_params (#2333) master-95a6c59 slaren 2023-07-23 14:36:02 +02:00
  • e76d630df1 llama : grouped-query attention + LLaMAv2 70B support (#2276) master-e76d630 Georgi Gerganov 2023-07-23 15:09:47 +03:00
  • 1d0824b247 llama : print help to stdout (#2338) master-1d0824b maddes8cht 2023-07-23 13:59:48 +02:00
  • bc3ec2cdc9 flake : support nix build '.#opencl' (#2337) wzy 2023-07-23 19:57:02 +08:00
  • a940458e48 llama : print max tensor size to stderr (#2336) master-a940458 Christian Demsar 2023-07-23 07:56:34 -04:00
  • 91171b8072 make : fix CLBLAST compile support in FreeBSD (#2331) master-91171b8 Jose Maldonado 2023-07-23 07:52:08 -04:00
  • 355c80f49e examples : simplify vim plugin (#2327) AustinMroz 2023-07-23 06:16:48 -05:00
  • 83a00ce69b metal : support bcast add & dup & cont op (#2323) Jiahao Li 2023-07-23 19:00:37 +08:00
  • d2a43664f9 Speed up Q4_K (#2322) master-d2a4366 Kawrakow 2023-07-23 08:49:20 +03:00
  • b9b7d94fc1 CUDA: Fixed 7b q3_K_S with mul_mat_vec_q (#2313) master-b9b7d94 Johannes Gäßler 2023-07-22 21:27:34 +02:00
  • b47b8a9cfe llama : optimize memory buffers (#2325) master-b47b8a9 Georgi Gerganov 2023-07-22 21:17:57 +03:00
  • d273bfd2c9 allocator: cleanup, more comments ggml-backends slaren 2023-07-22 15:05:24 +02:00
  • b5fe67f8c6 Perplexity: Compute scores correlated to HellaSwag (#2312) master-b5fe67f klosax 2023-07-22 14:21:24 +02:00
  • 5141472e2b llama.cpp: print input/output buffers size slaren 2023-07-22 13:31:06 +02:00
  • e2b9575951 allocator cleanup slaren 2023-07-22 13:29:44 +02:00
  • 24baa54ac1 examples : basic VIM plugin whoreson 2023-07-22 12:34:51 +02:00
  • dd6c67d3cb ci : fix args Georgi Gerganov 2023-07-22 12:00:56 +03:00
  • 5d500e8ccf ci : add 7B CUDA tests (#2319) Georgi Gerganov 2023-07-22 11:48:22 +03:00
  • 7de7882537 allocator: fix partial offloading slaren 2023-07-22 01:46:49 +02:00
  • 7d5f18468c examples : add easy python script to create quantized (k-bit support) GGML models from local HF Transformer models (#2311) Richard Roberson 2023-07-21 13:01:10 -06:00
  • e87840f9fd allocator: automatic inplace operations slaren 2023-07-21 16:51:50 +02:00
  • d924522a46 Custom RoPE + bettter memory management for CUDA (#2295) master-d924522 Kawrakow 2023-07-21 17:27:51 +03:00
  • 4d76a5f49b Faster Q3_K implementation on Metal (#2307) Kawrakow 2023-07-21 17:05:30 +03:00
  • 0db14fef06 ggml : fix the rope fix (513f861953) master-0db14fe Georgi Gerganov 2023-07-21 15:16:55 +03:00
  • 03e566977b examples : fix typo in minigpt4.py (#2298) Ikko Eltociear Ashimine 2023-07-21 20:53:07 +09:00
  • 513f861953 ggml : fix rope args order + assert (#2054) master-513f861 Georgi Gerganov 2023-07-21 14:51:34 +03:00
  • 3973b25a64 gitignore : fix final newline Georgi Gerganov 2023-07-21 14:42:41 +03:00
  • 3d679827e7 improved memory management fixes slaren 2023-07-21 12:41:46 +02:00
  • ab0e26bdfb llama : remove cfg smooth factor as it is only a reparameterization of the guidance scale (#2280) master-ab0e26b Guillaume "Vermeille" Sanchez 2023-07-21 12:58:36 +02:00
  • 73643f5fb1 gitignore : changes for Poetry users + chat examples (#2284) master-73643f5 Jose Maldonado 2023-07-21 06:53:27 -04:00
  • a814d04f81 make : fix indentation master-a814d04 Georgi Gerganov 2023-07-21 13:50:55 +03:00
  • 4c013bb738 ci : fix MNT realpath usage (#2250) Georgi Gerganov 2023-07-21 13:48:18 +03:00
  • 56e9ae062c llama.cpp: partially restore state support, graph export slaren 2023-07-21 12:39:51 +02:00
  • 42c7c2e2e9 make : support customized LLAMA_CUDA_NVCC and LLAMA_CUDA_CCBIN (#2275) master-42c7c2e Sky Yan 2023-07-21 18:38:57 +08:00
  • 78a3d13424 flake : remove intel mkl from flake.nix due to missing files (#2277) master-78a3d13 wzy 2023-07-21 18:26:34 +08:00
  • ae178ab46b llama : make tensor_split ptr instead of array (#2272) master-ae178ab Georgi Gerganov 2023-07-21 13:10:51 +03:00
  • 54e3bc76fe make : add new target for test binaries (#2244) master-54e3bc7 Jiří Podivín 2023-07-21 12:09:16 +02:00
  • 019fe257bb MIKU MAYHEM: Upgrading the Default Model for Maximum Fun 🎉 (#2287) Hatsune Miku 2023-07-21 08:13:18 +00:00
  • e68c96f7fe Faster Q2_K on Metal (#2297) Kawrakow 2023-07-21 10:44:40 +03:00
  • 9cf022a188 make : fix embdinput library and server examples building on MSYS2 (#2235) master-9cf022a Przemysław Pawełczyk 2023-07-21 09:42:21 +02:00
  • 37d3f6a260 remove unused code slaren 2023-07-21 02:33:06 +02:00
  • cd6f5dec92 improved memory management slaren 2023-07-21 00:28:49 +02:00
  • d45c1631bc metal : rewrite to fit new backend interface correctly (WIP) ggml-backends-metal Georgi Gerganov 2023-07-20 16:36:33 +03:00
  • e782c9e735 Faster Q5_K and Q6_K on Metal (#2294) Kawrakow 2023-07-20 18:19:45 +03:00
  • de69f8f20d initial implementation of delayed graph allocation slaren 2023-07-20 15:57:48 +02:00
  • e4db70720d [wip] chat now has parameter and cfg Henri Vasserman 2023-07-20 15:37:31 +03:00
  • 785829dfe8 Faster Q4_K on Metal (#2290) Kawrakow 2023-07-20 15:18:43 +03:00
  • cb82adadb8 metal : first working version of the inference without prompt processing Georgi Gerganov 2023-07-20 14:56:29 +03:00
  • 290cb700bf metal : map the CPU buffers to Metal buffers (WIP) Georgi Gerganov 2023-07-20 14:30:34 +03:00
  • fff0e0eafe llama : fix regression from #2000 - could not load no-mmap models master-fff0e0e Georgi Gerganov 2023-07-20 13:47:26 +03:00
  • 417a85a001 metal: minor q4 optimization and reduce code size (#2248) Shouzheng Liu 2023-07-20 06:32:22 -04:00
  • 082dd81286 [wip] chat improvements Henri Vasserman 2023-07-20 03:48:48 +03:00
  • cb205c0d13 automatically calculate compute buffer sizes (without graph allocator) slaren 2023-07-20 02:22:54 +02:00
  • 77ac8deaf1 llama.cpp: remove backend-specific code where possible slaren 2023-07-20 00:59:26 +02:00
  • 43694ca867 consistent semicolons Henri Vasserman 2023-07-20 00:58:16 +03:00
  • 890d1b8446 Merge master into server-cfg Henri Vasserman 2023-07-20 00:48:03 +03:00
  • dd3cf5760a last n tokens done Henri Vasserman 2023-07-20 00:36:36 +03:00
  • 42591a0acd remove "smooth factor" Henri Vasserman 2023-07-20 00:02:13 +03:00
  • 2cb8469e7f refactor evaluation logic Henri Vasserman 2023-07-19 23:45:40 +03:00
  • f38433ef5d Merge remote-tracking branch 'origin/ggml-backends' into ggml-backends-metal Georgi Gerganov 2023-07-19 17:45:45 +03:00
  • 70c55c17c7 metal : create backend, mostly reuse CPU backend interface Georgi Gerganov 2023-07-19 16:47:43 +03:00
  • 294f424554 llama : extend API to get max devices at runtime (#2253) master-294f424 Rinne 2023-07-19 15:06:40 +08:00
  • 45a1b07e9b flake : update flake.nix (#2270) master-45a1b07 wzy 2023-07-19 15:01:55 +08:00
  • b1f4290953 cmake : install targets (#2256) master-b1f4290 wzy 2023-07-19 15:01:11 +08:00
  • 295f85654a allocators wip; renamed ggml_backend functions; changed ggml_buffer and ggml_backend to always be used as pointers; rename ggml_tensor::params -> op_params slaren 2023-07-17 19:03:51 +02:00
  • ed960fa1ab llama : separate compute buffer for metal Georgi Gerganov 2023-07-18 19:19:59 +03:00