Commit Graph

  • 29eee40474 fix speculative decoding build on windows (#5874) b2343 Jeffrey Quesnelle 2024-03-04 19:23:06 -08:00
  • 1d41d6f7c2 nix: static build (#5814) hutli 2024-03-05 02:33:08 +01:00
  • 29ae62d2ae llama : fix embeddings (#5796) Georgi Gerganov 2024-03-04 22:31:20 +02:00
  • e0843afe1b flake : fix Georgi Gerganov 2024-03-04 21:50:50 +02:00
  • a1c6d96ed8 ggml : fix unknown status (#0) Georgi Gerganov 2024-03-04 20:53:27 +02:00
  • efd8533ef8 sync : ggml Georgi Gerganov 2024-03-04 11:06:39 +02:00
  • 9fa2627347 ggml : introduce ggml_status (ggml/750) Michael Podvitskiy 2024-03-04 10:05:42 +01:00
  • 58c7f6167c ggml : fix F16 store (ARM NEON) Georgi Gerganov 2024-03-04 20:44:57 +02:00
  • e307882c34 Merge branch 'master' into gg/flash-attn Georgi Gerganov 2024-03-04 20:42:48 +02:00
  • fe52be11e3 cmake : handle cases where git index is not found in .git (#5844) Dane Madsen 2024-03-05 05:26:55 +11:00
  • 6d341ab6c5 speculative : implement stochastic speculative sampling (#5625) Minsoo Cheong 2024-03-05 03:24:00 +09:00
  • a6a263b919 iq3_s_mult_shuffle: works on ARM_NEON and Metal Iwan Kawrakow 2024-03-04 20:10:36 +02:00
  • b587482287 iq3_s_mult_shuffle: mult + shuffle based codebook Iwan Kawrakow 2024-03-04 19:43:22 +02:00
  • 4ec0e9abbf wip gg/fix-embeddings-wip Georgi Gerganov 2024-03-04 17:07:12 +02:00
  • e66da356a4 llama : add pooling switch Georgi Gerganov 2024-03-04 14:06:33 +02:00
  • 9bbeb0f110 embeddings : fix llama_batch_init arg Georgi Gerganov 2024-03-04 14:06:00 +02:00
  • eb42596277 llama : do not use KV cache for non-causal models Georgi Gerganov 2024-03-04 13:31:03 +02:00
  • 4ffcdce2ff add alias for chat template (#5858) b2334 Xuan Son Nguyen 2024-03-04 12:22:08 +01:00
  • d0347840c1 llama : fix embeddings Georgi Gerganov 2024-02-29 15:39:10 +02:00
  • a0fc62661f sync : ggml b2333 Georgi Gerganov 2024-03-04 10:40:04 +02:00
  • 7d43c585dc add some new ops, fix some operators and add batch operations to certain operators. (ggml/747) leejet 2024-03-03 20:23:52 +08:00
  • 82f3e668ad common : use LLAMA_DEFAULT_SEED (#5855) b2331 DAN™ 2024-03-04 03:08:19 -05:00
  • 5a51cc1bb4 main : support special tokens as reverse/anti prompt (#5847) b2330 DAN™ 2024-03-04 02:57:20 -05:00
  • b48bf8b411 iq3_s_mult: scalar dot product Iwan Kawrakow 2024-03-03 11:55:31 +02:00
  • f2c2bd6b26 iq3_s_mult: also CUDA Iwan Kawrakow 2024-03-03 19:12:05 +02:00
  • e5e72562c5 iq3_s_mult: back to blocks of 32 Iwan Kawrakow 2024-03-03 18:50:26 +02:00
  • f4cb4eac45 iq3_s_mult: play with blocks of 16 Iwan Kawrakow 2024-03-03 16:43:00 +02:00
  • 67be2ce101 cuda : fix data race in soft max (#5853) b2329 slaren 2024-03-03 14:26:18 +01:00
  • 6aefd11204 llama : adapt new models to F16 KQ_mask Georgi Gerganov 2024-03-03 13:50:54 +02:00
  • 02a645e7b7 Merge branch 'master' into gg/flash-attn Georgi Gerganov 2024-03-03 13:44:11 +02:00
  • dbe98dfe70 iq3_s_mult: another alternative multiplier Iwan Kawrakow 2024-03-03 13:13:52 +02:00
  • 231ae28f07 readme : add API changes section Georgi Gerganov 2024-03-03 12:44:03 +02:00
  • 475df1d6cf llama : allow for user specified embedding pooling type (#5849) b2327 Douglas Hanley 2024-03-03 04:40:27 -06:00
  • 8b713a987e iq3s_mult: quantization tuning Iwan Kawrakow 2024-03-03 11:32:53 +02:00
  • 5b9c8785fa iq3s_mult: ARM and Metal Iwan Kawrakow 2024-03-03 11:30:01 +02:00
  • b6402fa757 iq3_s_mult: ifdef'd slow / fast versions Iwan Kawrakow 2024-03-03 10:43:53 +02:00
  • 87c2e8b279 gguf-dump : support i-quants (#5841) Nindaleth 2024-03-03 09:43:42 +01:00
  • de9692a7d2 llama : fix llama_copy_state_data with fragmented KV cache (#5840) b2325 compilade 2024-03-03 03:41:55 -05:00
  • e6029348e8 ci : schedule slow server tests only on Release or on demand (#5839) b2324 Pierrick Hymbert 2024-03-03 09:35:23 +01:00
  • 8ef969afce server : init http requests thread pool with --parallel if set (#5836) b2323 Pierrick Hymbert 2024-03-03 08:48:36 +01:00
  • 726aed307a iq3_s_mult: alternative multiplier / bit twidling Iwan Kawrakow 2024-03-03 08:51:28 +02:00
  • fe3c20b251 iq3_s_mult: quantization tuning Iwan Kawrakow 2024-03-03 07:51:20 +02:00
  • 3000e0ac9e iq3_s_mult: Metal works - slower than lookup Iwan Kawrakow 2024-03-03 06:41:58 +02:00
  • fa974646e1 flake.lock: Update (#5842) Georgi Gerganov 2024-03-03 06:11:31 +02:00
  • eb0bf32caf server: tests: schedule slow dispatch only on release or on demand ci/server/fix-slow-test Pierrick HYMBERT 2024-03-02 23:18:31 +01:00
  • 9731134296 server: tests: passkey challenge / self-extend with context shift demo (#5832) b2321 Pierrick Hymbert 2024-03-02 22:00:14 +01:00
  • 4a6e2d6142 llama : add abort_callback to interrupt computation (#5409) b2320 Michael Podvitskiy 2024-03-02 20:52:25 +01:00
  • 494c870326 ggml : fix IQ3_S AVX implementation (#5834) b2319 Georgi Gerganov 2024-03-02 20:00:49 +02:00
  • 4d4d2366fc convert : automatically fall back to HfVocab if tokenizer.model doesn't exist (#5821) b2318 Jared Van Bortel 2024-03-02 12:27:26 -05:00
  • c7a0ad8ec9 convert-hf : make model class definitions self-contained (#5825) Jared Van Bortel 2024-03-02 12:21:47 -05:00
  • bf90920fb2 iq3_s_mult: ARM_NEON works - 13 t/s Iwan Kawrakow 2024-03-02 19:17:27 +02:00
  • 0b673ca187 s/_MODEL_CLASSES/_model_classes/ ceb/convert-hf-refactor Jared Van Bortel 2024-03-02 12:14:37 -05:00
  • 0fe9cd488f WIP Iwan Kawrakow 2024-03-02 17:56:16 +02:00
  • bbde6eb256 ggml : IQ3_S improvements (#5829) b2316 Kawrakow 2024-03-02 17:00:51 +02:00
  • ef2cd694c4 scripts : add pod-llama.sh Georgi Gerganov 2024-03-02 16:54:08 +02:00
  • 6c32d8c7ad llama : refactor internal quantization functions (#5830) b2314 Xuan Son Nguyen 2024-03-02 15:19:09 +01:00
  • 802da0091b llama : fix segfault from unknown model arch name (#5820) b2313 compilade 2024-03-02 08:42:56 -05:00
  • 715641391d Support multiple GPUs (split mode) on SYCL backend (#5806) b2312 Neo Zhang Jianyu 2024-03-02 19:49:30 +08:00
  • d4dfc250cc Fix ARM_NEON ik/iq3_s_faster Iwan Kawrakow 2024-03-02 10:12:02 +02:00
  • 93bce3c909 iq3_s: use new grid everywhere Iwan Kawrakow 2024-03-02 07:57:39 +02:00
  • 9bf297a02b workflows : remove nocleanup arg for check-requirements.sh (#5826) b2311 crasm 2024-03-02 00:11:06 -05:00
  • cb5e8f7fc4 build(nix): Introduce flake.formatter for nix fmt (#5687) Tushar 2024-03-02 04:48:26 +05:30
  • da3b9ba2b7 convert-hf-to-gguf : require einops for InternLM2ForCausalLM (#5792) b2309 nold 2024-03-01 22:51:12 +01:00
  • 7f0a1d66b5 convert-hf : make model class definitions self-contained Jared Van Bortel 2024-03-01 15:52:37 -05:00
  • 95845d17ec convert-hf : make actual types match annotations Jared Van Bortel 2024-03-01 15:19:59 -05:00
  • c29af7e225 llama : add StarCoder2 support (#5795) b2308 Sourab Mangrulkar 2024-03-02 01:00:46 +05:30
  • 11d4e099b4 iq3_s: PPL improvement Iwan Kawrakow 2024-03-01 20:01:30 +02:00
  • 38d16b1426 server : remove api_like_OAI.py proxy script (#5808) Georgi Gerganov 2024-03-01 20:00:58 +02:00
  • f8ab539190 convert : update help string ceb/convert-vocab-fallback Jared Van Bortel 2024-03-01 12:29:34 -05:00
  • 767aef90be docs : s/LLaMa/LLaMA/ Jared Van Bortel 2024-03-01 12:22:59 -05:00
  • 17d22efa40 convert : automatically fall back to HfVocab if needed Jared Van Bortel 2024-03-01 12:08:54 -05:00
  • c2224f003b ggml-vulkan: fix VULKAN_CHECK_RESULTS flag, which was previously broken (#5813) b2306 ddpasa 2024-03-01 18:00:00 +01:00
  • e43e81a5d7 WIP Iwan Kawrakow 2024-03-01 18:48:08 +02:00
  • 7b629c3b65 iq3_s: minor improvement on Metal Iwan Kawrakow 2024-03-01 17:46:33 +02:00
  • 9c5b594cde iq3_s: another small ARM_NEON improvement Iwan Kawrakow 2024-03-01 16:53:21 +02:00
  • 1e94989156 iq3_s: somewhat faster ARM_NEON dot product Iwan Kawrakow 2024-03-01 16:22:33 +02:00
  • e743386728 gemma : fix bfloat16 -> float16 conversion issue (#5810) b2305 kunal-vaishnavi 2024-03-01 06:08:08 -08:00
  • 39e3a429c8 iq3_s: somewhat faster AVX2 dot product Iwan Kawrakow 2024-03-01 15:58:08 +02:00
  • f49a535686 common : fix flag --logits-all to --all-logits (#5805) b2304 Miwa / Ensan 2024-03-01 22:48:56 +09:00
  • 9862d59c05 llama : change starcoder2 rope type gg/fix-starcoder2 Georgi Gerganov 2024-03-01 15:10:31 +02:00
  • 160acecaba iq3_s_multiplier: CUDA and AVX2 works Iwan Kawrakow 2024-03-01 13:44:06 +02:00
  • 3ab8b3a92e llama : cleanup unused mmq flags (#5772) b2303 Pierrick Hymbert 2024-03-01 12:39:06 +01:00
  • 4c21c826e1 WIP Iwan Kawrakow 2024-03-01 13:28:20 +02:00
  • 1cc7cb2b46 iq3_s(multiplier): use SIMD also in dequantize Iwan Kawrakow 2024-03-01 12:02:39 +02:00
  • b67b8f6451 handle rope-theta Sourab Mangrulkar 2024-03-01 15:29:36 +05:30
  • 9c752ff0d3 Trying IQ3_S without a lookup table Iwan Kawrakow 2024-03-01 11:52:17 +02:00
  • fdd886f7b4 remove redundant changes Sourab Mangrulkar 2024-03-01 15:14:26 +05:30
  • 9600d59e01 unicode : switch to multimap based nfd_map (#5799) b2302 Douglas Hanley 2024-03-01 03:15:36 -06:00
  • 5cb02b4a01 server: allow to override threads server pool with --threads-http (#5794) b2301 Pierrick Hymbert 2024-03-01 10:08:08 +01:00
  • 6ea0f010ff ci : add Ubuntu 22 Vulkan CI run (#5789) b2300 Eve 2024-03-01 08:54:53 +00:00
  • f105471ef6 server : fix newlines in help (#5785) b2299 Georgi Gerganov 2024-03-01 09:59:43 +02:00
  • 38d1521608 [SYCL] Use batched mul_mat pathway (#5591) b2298 AidanBeltonS 2024-03-01 07:36:47 +00:00
  • 5c06625f58 Update llama.cpp Sourab Mangrulkar 2024-03-01 12:35:18 +05:30
  • 10aa6e927e resolve comments Sourab Mangrulkar 2024-03-01 11:09:35 +05:30
  • 052051d8ae Server: normalize naming (#5779) b2297 Xuan Son Nguyen 2024-02-29 21:42:11 +01:00
  • d62ce1c6b4 skip rope freq and rotary embeddings from being serialized Sourab Mangrulkar 2024-02-29 19:32:04 +05:30
  • 6c108068b1 handle rope type Sourab Mangrulkar 2024-02-29 17:56:32 +05:30
  • ab4eab3a82 Add support for starcoder2 Sourab Mangrulkar 2024-02-29 17:31:25 +05:30
  • d5ab29757e llama : constified llama_set_state_data's src (#5774) b2296 Marcus Dunn 2024-02-29 00:17:23 -08:00
  • 87c91c0766 ci : reduce 3b ppl chunks to 1 to avoid timeout (#5771) b2295 Georgi Gerganov 2024-02-28 21:44:21 +02:00