Commit Graph

  • d70917f4b2 ggml : prefix lookup tables with ggml_ Georgi Gerganov 2023-10-30 18:38:11 +02:00
  • 1039a16ce2 ggml : remove duplicate static assert macros Georgi Gerganov 2023-10-30 18:35:03 +02:00
  • 223696c9f9 ggml : add math.h to ggml-impl.h Georgi Gerganov 2023-10-30 17:12:27 +02:00
  • 334984e457 ggml : explicitly initialize deprecated type traits Georgi Gerganov 2023-10-30 17:09:37 +02:00
  • a1c3ff68cd tests : fix ARM build Georgi Gerganov 2023-10-30 16:53:34 +02:00
  • d3e2cedb79 ggml : move FP16 <-> FP32 stuff to ggml-impl.h Georgi Gerganov 2023-10-30 16:35:17 +02:00
  • bc28aaa8c2 make : use -lfto=auto to avoid warnings and maintain perf lto Georgi Gerganov 2023-10-30 16:00:53 +02:00
  • 57c4296cf0 ci : fix focal build Georgi Gerganov 2023-10-30 15:58:40 +02:00
  • a6aba2c85c ci : try to fix code coverage build Georgi Gerganov 2023-10-30 15:43:05 +02:00
  • 6f6b0db6d1 build : disable lto for C++ (make) and enable existing LTO flag (cmake) Georgi Gerganov 2023-10-30 15:40:01 +02:00
  • 1206b5f3be build : enable link-time optimizations Georgi Gerganov 2023-10-30 15:12:54 +02:00
  • a3f80013ad llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading Georgi Gerganov 2023-10-30 12:14:23 +02:00
  • 792d1a1b16 llama : minor Georgi Gerganov 2023-10-30 11:34:47 +02:00
  • f39e6075cf llama : add llm_build_kqv helper Georgi Gerganov 2023-10-29 22:26:36 +02:00
  • c9121fdd0f llama : remove obsolete comments in build graphs Georgi Gerganov 2023-10-29 21:44:19 +02:00
  • a104abea48 llama : simplify falcon Q, K, V computation Georgi Gerganov 2023-10-29 21:24:25 +02:00
  • 31a12f3d03 llama : fix llm_build_k_shift to use n_head_kv instead of n_head Georgi Gerganov 2023-10-29 21:17:46 +02:00
  • 5990861938 llama : remove obsolete offload names Georgi Gerganov 2023-10-29 21:11:20 +02:00
  • 3e0462594b llama : add llm_build_kv_store helper Georgi Gerganov 2023-10-29 20:35:20 +02:00
  • 909d64471b llama : fix offloading after recent changes Georgi Gerganov 2023-10-29 19:45:27 +02:00
  • 6e08281e58 Extend llama_kv_cache_seq_rm to allow matching any sequence (#3843) b1445 Kerfuffle 2023-10-29 11:31:40 -06:00
  • 38728a0be0 llama : add llm_build_k_shift helper Georgi Gerganov 2023-10-29 19:22:54 +02:00
  • dbf836bb64 llama : add llm_build_ffn helper function (#3849) Georgi Gerganov 2023-10-29 18:47:46 +02:00
  • 2046eb4345 make : remove unnecessary dependency on build-info.h (#3842) b1444 cebtenzzre 2023-10-29 12:33:47 -04:00
  • 71a09da301 llama : fix kv shift bug (#3835) b1443 Georgi Gerganov 2023-10-29 18:32:51 +02:00
  • d69d777c02 ggml : quantization refactoring (#3833) b1442 Georgi Gerganov 2023-10-29 18:32:28 +02:00
  • 7db9c96d8a llama : add llm_build_norm helper function Georgi Gerganov 2023-10-29 15:39:58 +02:00
  • 210e6e5d02 llama : remove obsolete map for layer counting Georgi Gerganov 2023-10-29 13:39:04 +02:00
  • 79ad734417 llama : comment Georgi Gerganov 2023-10-29 13:27:53 +02:00
  • 761087932b llama : add functional header Georgi Gerganov 2023-10-29 13:26:23 +02:00
  • 8925cf9ef8 llama : add layer index to all tensor names Georgi Gerganov 2023-10-29 13:22:15 +02:00
  • 1e9c5443c2 llama : refactor tensor offloading as callback Georgi Gerganov 2023-10-29 12:35:07 +02:00
  • 15267192c0 llama : refactor tensor offloading as callback scratch Georgi Gerganov 2023-10-29 12:35:07 +02:00
  • da936188d8 llama : move refact in correct place + optimize graph input Georgi Gerganov 2023-10-29 11:48:24 +02:00
  • 739b85c985 llama : try to fix build Georgi Gerganov 2023-10-29 11:25:32 +02:00
  • 25cfbf6776 llama : fix non-CUDA build Georgi Gerganov 2023-10-29 11:12:03 +02:00
  • b4ad03b3a7 llama : try to optimize offloading code Georgi Gerganov 2023-10-29 10:33:11 +02:00
  • 79617902ea llama : fix res_norm offloading Georgi Gerganov 2023-10-29 09:20:35 +02:00
  • e14aa46151 llama : do tensor offload only with CUDA Georgi Gerganov 2023-10-29 08:03:46 +02:00
  • 0dc05b8433 llama : factor graph input into a function Georgi Gerganov 2023-10-29 07:52:43 +02:00
  • 4e98897ede llama : support offloading result_norm + comments Georgi Gerganov 2023-10-29 07:36:07 +02:00
  • 8a86b95e87 quantize : --pure option for disabling k-quant mixtures ggml-quants cebtenzzre 2023-10-28 16:32:49 -04:00
  • 51c4f9ee9f llama : comments Georgi Gerganov 2023-10-28 22:50:08 +03:00
  • 3af8771389 llama : update offload log messages to print node index Georgi Gerganov 2023-10-28 22:36:44 +03:00
  • 83d2c43791 llama : offload rest of the models Georgi Gerganov 2023-10-28 21:45:03 +03:00
  • 38aca9e1ab llama : factor out tensor offloading outside the build call (wip) Georgi Gerganov 2023-10-28 21:22:31 +03:00
  • 5946d98fc8 metal : disable kernel load log Georgi Gerganov 2023-10-28 21:22:01 +03:00
  • 8b2420d249 llama : factor out ggml-alloc from graph graph build functions Georgi Gerganov 2023-10-28 19:54:28 +03:00
  • ff3bad83e2 flake : update flake.lock for newer transformers version + provide extra dev shell (#3797) Erik Scholz 2023-10-28 16:41:07 +02:00
  • ee37e35dc5 ggml-quants : fix Zig and Swift builds + quantize tool Georgi Gerganov 2023-10-28 17:21:36 +03:00
  • 3412be728b ggml : factor all quantization code in ggml-quants Georgi Gerganov 2023-10-28 17:05:07 +03:00
  • 82a6646e02 metal : try cwd for ggml-metal.metal if bundle lookup fails (#3793) b1440 Aarni Koskela 2023-10-28 15:43:01 +03:00
  • ba231e8a6d issues : change label from bug to bug-unconfirmed (#3748) Georgi Gerganov 2023-10-28 15:25:33 +03:00
  • 8a2f2fea29 convert : ignore tokens if their IDs are within [0, vocab_size) (#3831) Georgi Gerganov 2023-10-28 15:25:15 +03:00
  • de7e0912b6 convert : ignore tokens if their IDs are within [0, vocab_size) apply-3585 Georgi Gerganov 2023-10-28 15:01:36 +03:00
  • bd6d9e2059 llama : allow quantizing k-quants to fall back when tensor size incompatible (#3747) b1437 Kerfuffle 2023-10-28 05:54:24 -06:00
  • ee1a0ec9cb llama : add option for greedy sampling with probs (#3813) b1436 Georgi Gerganov 2023-10-28 14:23:11 +03:00
  • bbfc62ac2f sampling : temp == 0.0 -> no probs, temp < 0.0 -> probs sampling-greedy-with-probs Georgi Gerganov 2023-10-28 14:04:57 +03:00
  • c86cca8061 llama : add comment about llama_sample_token_greedy() missing probs Georgi Gerganov 2023-10-28 13:21:29 +03:00
  • 177461104b common : print that one line of the syntax help *also* to standard output (#3823) b1435 Henk Poley 2023-10-28 12:16:33 +02:00
  • e374227221 Revert "cuda : use CUBLAS_COMPUTE_16F for non-attention ops" Georgi Gerganov 2023-10-28 12:20:08 +03:00
  • fdee152e4e starcoder : add GPU offloading (#3827) b1434 Georgi Gerganov 2023-10-28 12:06:08 +03:00
  • 41aee4df82 speculative : ensure draft and target model vocab matches (#3812) b1433 Kerfuffle 2023-10-27 15:40:07 -06:00
  • 6d459cbfbe llama : correctly report GGUFv3 format (#3818) b1432 cebtenzzre 2023-10-27 17:33:53 -04:00
  • cd3e20fb50 cuda : fix multi-gpu with tensor cores cuda-multi-gpu Georgi Gerganov 2023-10-27 23:11:50 +03:00
  • 706ff4c2e0 cuda : try to fix main device write Georgi Gerganov 2023-10-27 22:17:47 +03:00
  • 0f2498f25d cuda : use CUBLAS_COMPUTE_16F for non-attention ops Georgi Gerganov 2023-10-27 20:15:21 +03:00
  • 3b9ea655d4 cuda : use CUBLAS_COMPUTE_32F to speed-up and avoid dst cpy Georgi Gerganov 2023-10-27 18:13:54 +03:00
  • c8d6a1f34a simple : fix batch handling (#3803) b1431 Thibault Terrasson 2023-10-27 16:37:41 +02:00
  • 1a0843c493 cuda : utilize tensor cores with multiple GPU devices Georgi Gerganov 2023-10-27 13:05:33 +03:00
  • 2f9ec7e271 cuda : improve text-generation and batched decoding performance (#3776) b1430 Georgi Gerganov 2023-10-27 17:01:23 +03:00
  • 4aa1fb0d38 llama : add option for greedy sampling with probs Georgi Gerganov 2023-10-27 16:12:01 +03:00
  • 49af767fad build : add compile option to force use of MMQ kernels cuda-quantum-batch Georgi Gerganov 2023-10-27 13:21:04 +03:00
  • 34b2a5e1ee server : do not release slot on image input (#3798) b1429 Georgi Gerganov 2023-10-26 22:53:37 +03:00
  • a4e15a36e4 cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros Georgi Gerganov 2023-10-25 18:48:36 +03:00
  • 4c6744b526 cuda : remove duplicated cuBLAS GEMM code Georgi Gerganov 2023-10-25 18:25:13 +03:00
  • a3c28439d3 cuda : fine-tune >= VOLTA params + use MMQ only for small batches Georgi Gerganov 2023-10-25 15:07:34 +03:00
  • 16b60dd75c cuda : add F32 sgemm branch Georgi Gerganov 2023-10-25 14:00:21 +03:00
  • 52af782608 cuda : new cublas gemm branch for multi-batch quantized src0 Georgi Gerganov 2023-10-25 13:14:24 +03:00
  • 59d1232ea7 cuda : prints wip Georgi Gerganov 2023-10-25 10:26:58 +03:00
  • 6961c4bd0b batched-bench : print params at start b1428 Georgi Gerganov 2023-10-25 10:26:27 +03:00
  • cc44877486 log : disable pid in log filenames b1427 Georgi Gerganov 2023-10-25 10:09:16 +03:00
  • ad93962657 server : add parameter -tb N, --threads-batch N (#3584) (#3768) b1426 cebtenzzre 2023-10-24 16:10:43 -04:00
  • 1717521cdb server : do not block system prompt update (#3767) b1425 Georgi Gerganov 2023-10-24 23:08:20 +03:00
  • b2f7e04bd3 sync : ggml (conv ops + cuda MSVC fixes) (#3765) b1424 Georgi Gerganov 2023-10-24 21:51:20 +03:00
  • abd21fc99f cmake : add missed dependencies (#3763) b1423 John Smith 2023-10-25 01:48:45 +08:00
  • 2b4ea35e56 cuda : add batched cuBLAS GEMM for faster attention (#3749) b1422 Georgi Gerganov 2023-10-24 16:48:37 +03:00
  • d798a17c34 cuda : add TODO for calling cublas from kernel + using mem pool cuda-batched-gemm Georgi Gerganov 2023-10-24 16:33:24 +03:00
  • 27c34c0112 cuda : reduce mallocs in cublasGemmBatchedEx branch Georgi Gerganov 2023-10-24 15:06:02 +03:00
  • 3d297c1a30 cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases Georgi Gerganov 2023-10-24 13:34:54 +03:00
  • 6966474928 cuda : play with faster Q4_0 dequantization cuda-batched-gemm-deq Georgi Gerganov 2023-10-24 10:29:40 +03:00
  • daab3d7f45 Add more tokenizer tests (#3742) b1421 Galunid 2023-10-24 09:17:17 +02:00
  • 469c9addef metal : handle ggml_scale for n%4 != 0 (close #3754) b1420 Georgi Gerganov 2023-10-24 09:46:50 +03:00
  • d415669087 cuda : add ROCm / hipBLAS cublasGemmBatchedEx define Georgi Gerganov 2023-10-24 00:18:49 +03:00
  • 878aa4f209 Apply suggestions from code review Kerfuffle 2023-10-23 15:09:50 -06:00
  • e3932593d4 Revert "make : add optional CUDA_NATIVE_ARCH (#2482)" b1419 Georgi Gerganov 2023-10-23 23:46:05 +03:00
  • 9d02956443 issues : separate bug and enhancement template + no default title (#3748) M. Yusuf Sarıgöz 2023-10-23 22:57:16 +03:00
  • 69a6735087 Update special token handling in conversion scripts for gpt2 derived tokenizers (#3746) Galunid 2023-10-23 21:46:00 +02:00
  • 5be6c803fa llama : remove token functions with context args in favor of model (#3720) b1416 Marcus Dunn 2023-10-23 12:40:03 -07:00
  • c13fcfbfc0 cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops) Georgi Gerganov 2023-10-23 20:37:04 +03:00