Commit Graph

  • 82380acf10 iq1_s: we can do even better Iwan Kawrakow 2024-03-11 13:12:33 +02:00
  • caa106d4e0 Server: format error to json (#5961) b2400 Xuan Son Nguyen 2024-03-11 10:56:41 +01:00
  • 3202361c5b ggml, ci : Windows ARM runner and build fixes (#5979) b2399 Michael Podvitskiy 2024-03-11 10:28:51 +01:00
  • 332bdfd798 server : maintain chat completion id for streaming responses (#5988) b2398 Minsoo Cheong 2024-03-11 17:09:32 +09:00
  • ecab1c75de cmake : fix subdir for LLAMA_METAL_EMBED_LIBRARY (#5985) b2397 Gilad S 2024-03-11 10:00:08 +02:00
  • ee35600b90 llama : fix F16/F32 downcast + improve names (#5980) b2396 Georgi Gerganov 2024-03-11 09:56:47 +02:00
  • be858f6205 Better 1.5 bit quantization (#5971) b2395 Kawrakow 2024-03-11 07:51:49 +01:00
  • ef3ced26a3 [SYCL] Add q3_s and q1_s (#5886) b2394 Abhilash Majumder 2024-03-11 10:27:56 +05:30
  • 989e15b3c1 Merge branch 'master' into sycl_q3s_q1s sycl_q3s_q1s Abhilash Majumder 2024-03-11 08:41:35 +05:30
  • 3814a07392 [SYCL] Add support for SYCL Nvidia target (#5738) b2393 AidanBeltonS 2024-03-11 01:13:57 +00:00
  • bb6d00bbf9 metal : move mm_id indices to shared mem (#5982) b2392 Georgi Gerganov 2024-03-10 23:12:48 +02:00
  • 7ab7b733bb android : fix utf8 decoding error (#5935) b2391 Dean 2024-03-11 04:03:17 +08:00
  • d9f65c97c3 readme : update hot topics Georgi Gerganov 2024-03-10 20:58:26 +02:00
  • b838b53ad6 sync : ggml b2389 Georgi Gerganov 2024-03-10 20:10:46 +02:00
  • df4dc3e7cb ggml : try fix 32-bit arm compat (whisper/1938) Georgi Gerganov 2024-03-08 23:45:07 +02:00
  • bf47a5eefc ggml : remove __constant__ specifier for CUDA tables (#5940) b2387 Georgi Gerganov 2024-03-10 20:09:24 +02:00
  • fa8a809a91 server: ci: windows build and tests (#5968) b2386 Pierrick Hymbert 2024-03-10 18:17:47 +01:00
  • bcebd7dbf6 llama : add support for GritLM (#5959) b2385 DAN™ 2024-03-10 11:56:30 -04:00
  • 2960eae847 grammar : verify parsed state (#5950) b2384 Clint Herron 2024-03-10 11:17:43 -04:00
  • c78541479c nix: update flake.lock (#5969) Georgi Gerganov 2024-03-10 16:43:08 +02:00
  • 621e86b331 server: benchmark: chat/completions scenario and other llm servers comparison (#5941) b2382 Pierrick Hymbert 2024-03-09 23:41:49 +01:00
  • 77d1ac7e00 server : print chat template info b2381 Georgi Gerganov 2024-03-09 22:04:00 +02:00
  • b54afce9f4 mostly style fixes; fix KQ_mask comment gritlm-pr Douglas Hanley 2024-03-09 13:03:46 -06:00
  • d894f352bf perplexity : support using multiple sequences to allow larger batch sizes (#5946) b2380 slaren 2024-03-09 19:55:54 +01:00
  • 098dbaab44 readme : update hot topics Georgi Gerganov 2024-03-09 18:14:13 +02:00
  • 8380ecfb21 ggml : fix unnecessary f32 -> f16 -> f32 casts (mmla) (#5951) b2378 Georgi Gerganov 2024-03-09 17:36:20 +02:00
  • 58308a0ecc server : fix metrics init (#5964) b2377 Georgi Gerganov 2024-03-09 17:34:15 +02:00
  • 5b09797321 ggml : remove old quantization functions (#5942) b2376 Georgi Gerganov 2024-03-09 15:53:59 +02:00
  • 97c09585d6 server : clarify some items in the readme (#5957) Georgi Gerganov 2024-03-09 15:47:47 +02:00
  • 03acc82a85 Clean-up GritLM sample code. DAN™ 2024-03-09 07:44:25 -05:00
  • fb215c3832 server : normalize embeddings (#5956) b2374 SeungWon Jeong 2024-03-09 21:27:58 +09:00
  • 2c4f566c88 tests : gitignore ggml-common.h Georgi Gerganov 2024-03-09 14:17:11 +02:00
  • 0db32beaf0 server : fix passing prompt as tokens (#5955) b2372 Alexey Parfenov 2024-03-09 11:16:53 +00:00
  • 8a3012a4ad ggml : add ggml-common.h to deduplicate shared code (#5940) b2371 Georgi Gerganov 2024-03-09 12:47:57 +02:00
  • 9674aaf35c server : simplify logic for empty prompts (#5953) b2370 Georgi Gerganov 2024-03-09 12:34:18 +02:00
  • 950ba1ab84 Server: reorganize some http logic (#5939) b2369 Xuan Son Nguyen 2024-03-09 11:27:53 +01:00
  • e1fa9569ba server : add SSL support (#5926) b2368 Gabe Goodhart 2024-03-09 02:57:09 -07:00
  • fd72d2d2a5 server: tests: add truncated prompt tests, better kv cache size (#5933) b2367 Pierrick Hymbert 2024-03-09 10:30:04 +01:00
  • c2101a2e90 llama : support Mamba Selective State Space Models (#5328) b2366 compilade 2024-03-08 17:31:00 -05:00
  • 7bb531421f increase grid space Abhilash Majumder 2024-03-08 21:45:57 +05:30
  • 515f7d0d4f llama : fix quantization of shared token_embd (#5944) b2365 compilade 2024-03-08 10:53:37 -05:00
  • 76e868821a server: metrics: add llamacpp:prompt_seconds_total and llamacpp:tokens_predicted_seconds_total, reset bucket only on /metrics. Fix values cast to int. Add Process-Start-Time-Unix header. (#5937) b2364 Pierrick Hymbert 2024-03-08 12:25:04 +01:00
  • e457fb3540 llama : assume tied weights if lm_head/output weights is missing (#5824) b2363 Don Mahurin 2024-03-08 02:41:50 -08:00
  • af37fd8b30 server : fix EOS token detection with disabled cache (#5938) b2362 Georgi Gerganov 2024-03-08 12:40:02 +02:00
  • 581ed5c4fe log : fix MSVC compile errors (#5643) b2361 UEXTM.com 2024-03-08 04:35:04 -05:00
  • bd3d9fbfed allow to toggle embedding mode Douglas Hanley 2024-03-07 11:55:27 -06:00
  • 0ba20ed97a llama : compute BERT graph with F16 K, V gg/bert-f16 Georgi Gerganov 2024-03-05 21:22:20 +02:00
  • 6cdabe6526 llama-bench : add embeddings option (#5924) b2360 Georgi Gerganov 2024-03-07 16:32:38 +02:00
  • 89fb735fcf Revert "[SYCL] fix error when set main gpu to non-zero (#5901)" (#5918) b2359 Neo Zhang Jianyu 2024-03-07 19:14:49 +08:00
  • 55a2a900ff server : add /v1/completions endpoint (#5914) b2358 Minsoo Cheong 2024-03-07 19:42:39 +09:00
  • 94f33d7ae3 rm macro abhilash1910 2024-03-07 01:54:26 -08:00
  • 2002bc96bf server : refactor (#5882) b2357 Georgi Gerganov 2024-03-07 11:41:53 +02:00
  • b5b0270372 Revert "[SYCL] fix error when set main gpu to non-zero (#5901)" revert-5901-fix_set_gpu Neo Zhang Jianyu 2024-03-07 17:11:18 +08:00
  • ceca1aef07 [SYCL] fix error when set main gpu to non-zero (#5901) b2356 Neo Zhang Jianyu 2024-03-07 16:34:31 +08:00
  • f618e5060a add to gitignore Douglas Hanley 2024-03-07 01:38:30 -06:00
  • 1ab6aeeeee gritlm embeddings are back babeee Douglas Hanley 2024-03-07 01:37:08 -06:00
  • e04e04f8fa ggml : use SYS_get_cpu if SYS_getcpu is not defined (#5906) b2355 Jared Van Bortel 2024-03-06 15:42:23 -05:00
  • c810047b53 enable ops abhilash1910 2024-03-06 03:06:25 -08:00
  • 50f9ba353c fix build abhilash1910 2024-03-05 23:37:27 -08:00
  • e25fb4b18f ggml : use uint8x16_t return type for ggml_vqtbl1q_u8 (#5894) b2354 bobqianic 2024-03-06 07:35:07 +00:00
  • 1e35d619a6 convert : remove AWQ remnants (#5768) Georgi Gerganov 2024-03-06 09:12:25 +02:00
  • 600193ca9a fix build abhilash1910 2024-03-05 23:13:32 -08:00
  • 97936078b7 rebase to new embed Douglas Hanley 2024-03-05 23:23:17 -06:00
  • 8ced9f7e32 add wait() to make code stable (#5895) b2352 Neo Zhang Jianyu 2024-03-06 12:08:32 +08:00
  • 652ca2bded compare-llama-bench.py : remove mul_mat_q (#5892) slaren 2024-03-05 22:27:29 +01:00
  • 805ae529c4 comment out debug printing Douglas Hanley 2024-03-04 00:18:41 -06:00
  • a71842d7ef tabs to spaces Douglas Hanley 2024-03-04 00:16:29 -06:00
  • e79195fc53 gritlm results match Douglas Hanley 2024-03-03 23:59:28 -06:00
  • 4be8fb18ed add gritlm example Douglas Hanley 2024-02-29 09:50:41 -06:00
  • bd836944f8 quants : use MM256_SET_M128I consistently to fix gcc 7 build (#5889) b2350 Jared Van Bortel 2024-03-05 11:56:37 -05:00
  • 3de31677d3 grammars : blacklists character control set (#5888) ExtReMLapin 2024-03-05 17:33:08 +01:00
  • 82cb31eb93 Revert "grammars : don't allow to output unescaped new line in string (#5885)" Georgi Gerganov 2024-03-05 15:56:24 +02:00
  • b1a4e994fd grammars : don't allow to output unescaped new line in string (#5885) ExtReMLapin 2024-03-05 14:44:29 +01:00
  • c999536320 fix build Abhilash Majumder 2024-03-05 18:17:33 +05:30
  • 6fd581e075 fix compilation Abhilash Majumder 2024-03-05 18:09:00 +05:30
  • 61d1c88e15 Vulkan Improvements (#5835) b2346 0cc4m 2024-03-05 13:33:42 +01:00
  • ad251954eb Add q3_s and q1_s Abhilash Majumder 2024-03-05 17:51:29 +05:30
  • 31cecc8734 iq3_s_mult_shuffle: use lookup table on Metal ik/iq3_s_multiplier Iwan Kawrakow 2024-03-05 10:19:44 +02:00
  • 21b0867433 [SYCL] fix mul_mat fault in CI/unit-test (#5862) b2345 Neo Zhang Jianyu 2024-03-05 16:08:35 +08:00
  • 93034df760 iq3_s_mult_shuffle: use lookup table on CUDA Iwan Kawrakow 2024-03-05 10:06:07 +02:00
  • 6d15da1ec0 iq3_s_mult_shuffle: use new multiplier and cleanup Iwan Kawrakow 2024-03-05 08:36:57 +02:00
  • b1d753be34 iq3_s_mult: remove SLOW_MULT option Iwan Kawrakow 2024-03-05 08:23:37 +02:00
  • 6a87ac3a52 fix editorconfig check break (#5879) Minsoo Cheong 2024-03-05 15:12:23 +09:00
  • 29eee40474 fix speculative decoding build on windows (#5874) b2343 Jeffrey Quesnelle 2024-03-04 19:23:06 -08:00
  • 1d41d6f7c2 nix: static build (#5814) hutli 2024-03-05 02:33:08 +01:00
  • 29ae62d2ae llama : fix embeddings (#5796) Georgi Gerganov 2024-03-04 22:31:20 +02:00
  • e0843afe1b flake : fix Georgi Gerganov 2024-03-04 21:50:50 +02:00
  • a1c6d96ed8 ggml : fix unknown status (#0) Georgi Gerganov 2024-03-04 20:53:27 +02:00
  • efd8533ef8 sync : ggml Georgi Gerganov 2024-03-04 11:06:39 +02:00
  • 9fa2627347 ggml : introduce ggml_status (ggml/750) Michael Podvitskiy 2024-03-04 10:05:42 +01:00
  • 58c7f6167c ggml : fix F16 store (ARM NEON) Georgi Gerganov 2024-03-04 20:44:57 +02:00
  • e307882c34 Merge branch 'master' into gg/flash-attn Georgi Gerganov 2024-03-04 20:42:48 +02:00
  • fe52be11e3 cmake : handle cases where git index is not found in .git (#5844) Dane Madsen 2024-03-05 05:26:55 +11:00
  • 6d341ab6c5 speculative : implement stochastic speculative sampling (#5625) Minsoo Cheong 2024-03-05 03:24:00 +09:00
  • a6a263b919 iq3_s_mult_shuffle: works on ARM_NEON and Metal Iwan Kawrakow 2024-03-04 20:10:36 +02:00
  • b587482287 iq3_s_mult_shuffle: mult + shuffle based codebook Iwan Kawrakow 2024-03-04 19:43:22 +02:00
  • 4ec0e9abbf wip gg/fix-embeddings-wip Georgi Gerganov 2024-03-04 17:07:12 +02:00
  • e66da356a4 llama : add pooling switch Georgi Gerganov 2024-03-04 14:06:33 +02:00
  • 9bbeb0f110 embeddings : fix llama_batch_init arg Georgi Gerganov 2024-03-04 14:06:00 +02:00
  • eb42596277 llama : do not use KV cache for non-causal models Georgi Gerganov 2024-03-04 13:31:03 +02:00