Commit Graph

  • ea4402bb0e test-backend-ops : add one more sum_rows test Georgi Gerganov 2023-12-12 17:03:38 +02:00
  • a51bc0c1c0 metal : fix binary ops for ne10 % 4 != 0 Georgi Gerganov 2023-12-12 15:55:42 +02:00
  • 08eb99179a metal : add cpy f16 -> f32 kernel Georgi Gerganov 2023-12-12 14:14:15 +02:00
  • a742d9f9b7 gguf-py : bump version slaren 2023-12-12 12:46:33 +01:00
  • 6a419f4d19 convert : support safetensors format Georgi Gerganov 2023-12-12 13:04:33 +02:00
  • fecac45658 server : tweak default sampling parameters (#4367) kalomaze 2023-12-12 04:12:35 -06:00
  • 9494d7c477 english : use typos to fix comments and logs (#4354) b1627 Richard Kiss 2023-12-12 01:53:36 -08:00
  • 6138963fb2 build : target Windows 8 for standard mingw-w64 (#4405) b1626 Jared Van Bortel 2023-12-12 04:27:26 -05:00
  • 6391817cd1 llama : document logits_all deprecation (#4418) b1625 crasm 2023-12-12 04:25:57 -05:00
  • d9d4cfef64 server : fix local model name in server (#4420) b1624 Vladimir Zorin 2023-12-12 11:25:29 +02:00
  • 41a11aaf99 ggml : increased GGML_MAX_PARAMS to allow finetuning of 70b models (#4424) b1623 Taikono-Himazin 2023-12-12 18:24:32 +09:00
  • a81a34add0 cmake : detect host compiler and cuda compiler separately Jared Van Bortel 2023-12-11 17:12:37 -05:00
  • abacb27868 cmake : silence linker check stdout Jared Van Bortel 2023-12-11 17:13:19 -05:00
  • 88781479f1 make : honor NVCC, LLAMA_CUDA_CCBIN, NVCCFLAGS Jared Van Bortel 2023-12-11 16:42:22 -05:00
  • 93ca80fa3a make editorconfig checker happy Jared Van Bortel 2023-12-11 15:17:07 -05:00
  • 91df2623d7 make : detect host compiler and cuda compiler separately Jared Van Bortel 2023-12-11 15:09:56 -05:00
  • 9b28f3413b make : simplify nvcc flags Jared Van Bortel 2023-12-11 14:14:48 -05:00
  • f1cbfabd64 convert : fix style slaren 2023-12-11 20:02:55 +01:00
  • 7dc75e3923 convert : use 1e6 rope_freq_base for mixtral slaren 2023-12-11 20:00:28 +01:00
  • 296c945de5 cuda : fix mul_mat_id with multi gpu slaren 2023-12-11 16:53:25 +01:00
  • 33e50f1b53 test-backend-ops : disable MOE test with thread sanitizer slaren 2023-12-11 12:27:48 +01:00
  • ffda94c87f test-backend-ops : simplify and disable slow tests to avoid CI timeout slaren 2023-12-11 12:15:31 +01:00
  • 8cbaed1d9a llama : fix hard-coded number of experts Georgi Gerganov 2023-12-11 08:55:16 +02:00
  • b0029815e4 test-backend-ops : fix dequantize block offset slaren 2023-12-11 02:43:52 +01:00
  • 8a7b2fa528 Update README.md (#4388) Yueh-Po Peng 2023-12-11 06:27:38 +08:00
  • f1380d7897 test-backend-ops : add cpy from f32 -> all types test slaren 2023-12-10 22:58:31 +01:00
  • 54d254bbed test-backend-ops : cleanup, add moe test for batches slaren 2023-12-10 21:52:11 +01:00
  • 0ec5fdb5ce main loop finished, starting to debug Leon Ericsson 2023-12-10 20:20:01 +01:00
  • 54ba263410 test-backend-ops : make experts more evenly probable (test_moe) Georgi Gerganov 2023-12-10 15:27:41 +02:00
  • b0b83dd9e2 metal : fix ggml_mul_mat_id for F32 Georgi Gerganov 2023-12-10 14:30:38 +02:00
  • 65923a8ede convert : determine n_ctx correctly Georgi Gerganov 2023-12-10 14:17:46 +02:00
  • 8614aa736d cuda : fix get_rows when ncols is odd slaren 2023-12-10 13:12:11 +01:00
  • cefebb3660 test-backend-ops : add moe test slaren 2023-12-10 13:11:39 +01:00
  • e640cbe055 llama : add n_expert and n_expert_used to hparams + change quants Georgi Gerganov 2023-12-10 13:57:54 +02:00
  • d1259b7b35 llama : do not quantize expert gating tensors Georgi Gerganov 2023-12-10 13:00:13 +02:00
  • 6cfb31f9ea metal : add indirect mat-vec kernels for all quantization types Georgi Gerganov 2023-12-10 10:59:13 +02:00
  • 016f9bb55a metal : fix ggml_get_rows to work with non-cont src1 Georgi Gerganov 2023-12-10 09:38:21 +02:00
  • 0710b0f726 llama : offload missing ffn_moe_silu slaren 2023-12-09 23:29:47 +01:00
  • 62b95f93d0 cuda : support non-contiguous src1 in get_rows slaren 2023-12-09 22:39:34 +01:00
  • 2e4db48291 ggml : update get_rows f16 and q slaren 2023-12-09 22:38:22 +01:00
  • e18f7345a3 grammar : revert the replacement of llama_token_to_piece with id_to_token (#4396) b1621 Xiang (Kevin) Li 2023-12-09 16:29:27 -05:00
  • ac3f7d8e23 ggml : get_rows : support non-contiguos tensors with gaps, generalize up to 3D slaren 2023-12-09 19:19:03 +01:00
  • 8c5b66eeaa metal : reduce the kernel launches for ggml_mul_mat_id Georgi Gerganov 2023-12-09 15:30:34 +02:00
  • 7e2006b0c0 metal : add/mul/div use general kernel when src1 not cont Georgi Gerganov 2023-12-09 14:24:58 +02:00
  • 06dfde3e94 llama : add basic support for offloading moe with CUDA slaren 2023-12-09 13:21:09 +01:00
  • 2cbcba829f metal : add more general support for ggml_get_rows + tests Georgi Gerganov 2023-12-09 14:18:42 +02:00
  • 9064b1ca05 ggml : fix ggml_get_rows to take into account ne02 / ne11 Georgi Gerganov 2023-12-09 14:04:54 +02:00
  • ee8fb399aa ggml : add n_as argument to ggml_mul_mat_id slaren 2023-12-09 12:42:25 +01:00
  • 7372b62271 ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only) Georgi Gerganov 2023-12-09 13:18:58 +02:00
  • 8b185b7030 llama : fix expert weighting in the FFN Georgi Gerganov 2023-12-09 13:01:42 +02:00
  • 7ea36953ba llama : first working version Georgi Gerganov 2023-12-09 12:45:15 +02:00
  • af1a096bf8 llama : fix cur -> cur_expert Georgi Gerganov 2023-12-09 12:07:39 +02:00
  • aedfad120a llama : update graph to support MoE Georgi Gerganov 2023-12-09 11:47:40 +02:00
  • 861cd67899 ggml : sync latest ggml_mul_mat_id Georgi Gerganov 2023-12-09 11:19:46 +02:00
  • a3eefe95a8 llama : model loading Georgi Gerganov 2023-12-09 11:14:03 +02:00
  • d38e41ee69 convert : fix n_ff typo Georgi Gerganov 2023-12-09 10:59:37 +02:00
  • dff8cbeb39 convert : support Mixtral as LLAMA arch Georgi Gerganov 2023-12-09 10:51:58 +02:00
  • fe680e3d10 sync : ggml (new ops, tests, backend, etc.) (#4359) b1620 Georgi Gerganov 2023-12-07 22:26:54 +02:00
  • bcc0eb4591 llama : per-layer KV cache + quantum K cache (#4309) b1619 Georgi Gerganov 2023-12-07 13:03:17 +02:00
  • fc5f334689 readme : add API change notice gg/per-layer-kv Georgi Gerganov 2023-12-07 12:35:02 +02:00
  • 680a99e792 Merge branch 'master' into gg/per-layer-kv Georgi Gerganov 2023-12-07 12:33:11 +02:00
  • 81bc9214a3 train : fix #4227 (double free in examples/train-text-from-scratch/train-text-from-scratch.cpp) (#4351) b1618 Hongyu Ouyang 2023-12-07 02:25:22 -08:00
  • 05cd6e5036 server : recognize cache_prompt parameter in OAI API (#4347) b1617 Georgi Gerganov 2023-12-06 20:21:59 +02:00
  • 1a1a1c3845 llama : support quantum K cache (#4312) Georgi Gerganov 2023-12-06 13:30:20 +02:00
  • caa9249217 common : fix compile warning b1616 Georgi Gerganov 2023-12-06 10:41:03 +02:00
  • da5eaef1f3 speculative : support --color (#4343) b1615 stduhpf 2023-12-06 09:08:17 +01:00
  • 5f6e0c0dff grammar : pre-computed pieces + reserve mem + less string copies (#4330) b1614 Marcus Dunn 2023-12-05 10:55:12 -10:00
  • 66a8dd35a0 Merge branch 'master' into cuda-cublas-opts Georgi Gerganov 2023-12-05 20:54:33 +02:00
  • 5aa365d88f llama : allow overriding GGUF metadata when loading model (#4092) b1613 Kerfuffle 2023-12-05 10:19:18 -07:00
  • af99c6fbfc llama : remove memory_f16 and kv_f16 flags gg/quantum-k-cache Georgi Gerganov 2023-12-05 18:18:16 +02:00
  • 4adb1d69d9 cuda : add comment Georgi Gerganov 2023-12-05 18:15:51 +02:00
  • dd86df82e6 metal : use mm kernel only for quantum KV cache Georgi Gerganov 2023-12-05 18:14:04 +02:00
  • 903167a777 llama-bench : support type_k/type_v slaren 2023-12-05 16:32:53 +01:00
  • b2acedeb1a cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels Georgi Gerganov 2023-12-05 16:47:34 +02:00
  • e8457c90a0 cuda : wip Georgi Gerganov 2023-12-05 16:29:52 +02:00
  • 6b58ae9892 metal : add F32 -> Q4_1 copy kernel Georgi Gerganov 2023-12-05 16:09:16 +02:00
  • 9d69ecc0c9 metal : add F32 -> Q4_0 copy kernel Georgi Gerganov 2023-12-05 16:01:50 +02:00
  • 7864a2cd9b llama : fix build Georgi Gerganov 2023-12-05 15:43:25 +02:00
  • 3ce30e07c9 llama : pass KV cache type through API Georgi Gerganov 2023-12-05 15:40:23 +02:00
  • 52c8bc3cf3 sampling : custom samplers order (#4285) b1612 MaggotHATE 2023-12-05 15:05:51 +05:00
  • e4b76bbe31 swift : revert compiler checks for swift package (#4332) b1611 kchro3 2023-12-04 23:29:46 -08:00
  • cae8f50b1a initial commit, going through initializations Leon Ericsson 2023-12-04 21:52:17 +01:00
  • 23b5e12eb5 simple : update error message for KV cache check (#4324) b1610 Daniel Bevenius 2023-12-04 17:04:21 +01:00
  • d208995c6d swift : fix concatenation method to avoid invalid UTF8 stringfication (#4325) b1609 Miwa / Ensan 2023-12-05 01:03:49 +09:00
  • 5c9f90cba1 swift : fix prompt tokenization logic (#4321) b1608 Miwa / Ensan 2023-12-04 22:43:45 +09:00
  • b881f630ca cuda : use mmv kernel for quantum cache ops Georgi Gerganov 2023-12-04 15:41:20 +02:00
  • a1bf6c09f8 cuda : add F32 -> Q8_0 copy kernel Georgi Gerganov 2023-12-04 15:08:36 +02:00
  • bcfebf241d metal : add F32 -> Q8_0 copy kernel Georgi Gerganov 2023-12-04 10:42:10 +02:00
  • 4fa44e84ad grammar-parser : fix typo (#4318) b1607 Ikko Eltociear Ashimine 2023-12-04 16:57:35 +09:00
  • d04ee928a2 llama : support quantum K cache (wip) Georgi Gerganov 2023-12-03 21:31:05 +02:00
  • 66aaac9867 llama : update session save/load Georgi Gerganov 2023-12-03 21:10:16 +02:00
  • e262947d43 common : add command-line arg to disable KV cache offloading Georgi Gerganov 2023-12-03 20:31:01 +02:00
  • c80b8a2bff llama : remove mirrors, perform Device -> Host when partial offload Georgi Gerganov 2023-12-03 19:46:06 +02:00
  • c44bc1ee00 llama : keep the KV related layers on the device Georgi Gerganov 2023-12-03 19:22:47 +02:00
  • 1fa91a4833 llama : enable offload debug temporarily Georgi Gerganov 2023-12-03 18:36:02 +02:00
  • 3d3e6bd0e4 llama : offload for rest of the model arches Georgi Gerganov 2023-12-03 17:52:23 +02:00
  • f3dbfb9f60 llama : offload K shift tensors Georgi Gerganov 2023-12-03 17:43:04 +02:00
  • 986b3da76a llama : offload KV cache per-layer Georgi Gerganov 2023-12-03 17:18:15 +02:00
  • c294c78eb7 Merge branch 'master' into per-layer-kv Georgi Gerganov 2023-12-03 16:18:21 +02:00
  • fbbc42827b ggml : reuse ggml_get_n_tasks() in ggml_graph_plan() (#4308) b1606 Georgi Gerganov 2023-12-03 15:56:35 +02:00