Commit Graph

84 Commits

Author SHA1 Message Date
Francis Couture-Harpin 1be5ea7d97 llama : add llama_model_is_recurrent to simplify figuring that out
This will make it easier to more cleanly support RWKV-v6 and Mamba-2.
2024-08-20 23:55:14 -04:00
Francis Couture-Harpin b264eddbb2 llama : fix Mamba pooled embeddings with multiple sequences
Until the pooled embeddings are refactored to allow splitting
across ubatches for causal embeddings,
recurrent models can only process a single sequence per ubatch
when calculating pooled embeddings.
2024-08-20 23:29:48 -04:00
Francis Couture-Harpin 652e9b0d61 llama : fix T5 segfault again 2024-08-20 21:37:43 -04:00
Francis Couture-Harpin 702e1995a1 Merge branch 'master' into compilade/batch-splits 2024-08-14 20:46:28 -04:00
Nico Bosshard 0fd93cdef5 llama : model-based max number of graph nodes calculation (#8970)
* llama : model-based max number of graph nodes calculation

* Update src/llama.cpp

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-08-12 17:13:59 +02:00
Liu Jia 2589292cde Fix a spelling mistake (#9001) 2024-08-12 11:46:03 +02:00
fairydreaming 33309f661a llama : check all graph nodes when searching for result_embd_pooled (#8956)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-11 10:35:26 +02:00
Xuan Son Nguyen 7eb23840ed llama : default n_swa for phi-3 (#8931)
* default n_swa for phi-3

* fix

* double check swa
2024-08-10 13:04:40 +02:00
fairydreaming 7c3f55c100 Add support for encoder-only T5 models (#8900)
* gguf-py : add T5ENCODER model architecture

* common : call llama_decode() during warmup only if the model has decoder

* convert-hf : add T5EncoderModel

* llama : add llama_model_has_decoder() API function

* llama : split build_t5() into build_t5_encoder() and build_t5_decoder()

* llama : add support for LLM_ARCH_T5ENCODER

* llama-embedding : add support for LLAMA_POOLING_TYPE_NONE

* llama-embedding : add support for encoder-only models

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-10 11:43:26 +02:00
fairydreaming 6afd1a99dc llama : add support for lora adapters in T5 model (#8938)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-09 18:53:09 +02:00
Georgi Gerganov 45a55b91aa llama : better replace_all (cont) (#8926)
* llama : better replace_all (cont)

ggml-ci

* code : deduplicate replace_all

ggml-ci
2024-08-09 18:23:52 +03:00
Georgi Gerganov 0596a99f09 minor : add struct members for clarity
ggml-ci
2024-08-09 14:38:29 +03:00
Daniel Bevenius 6f6496bb09 llama : fix typo in llama_tensor_get_type comment [no ci] (#8937) 2024-08-09 09:32:23 +03:00
compilade 345a686d82 llama : reduce useless copies when saving session (#8916)
* llama : avoid useless copies in dummy session writer

* llama : avoid double tensor copy when saving session to buffer
2024-08-08 23:54:00 -04:00
Francis Couture-Harpin cfd5a113e1 llama : rename llama_reorder_outputs to llama_output_reorder
Also move it closer to llama_output_reserve.

* llama : fix pooled embeddings when using batches with equal_seqs
2024-08-06 11:50:35 -04:00
Douglas Hanley cdd1889de6 convert : add support for XLMRoberta embedding models (#8658)
* add conversion for bge-m3; small fix in unigram tokenizer

* clean up and simplify XLMRoberta conversion
2024-08-06 10:20:54 +03:00
fairydreaming d3f0c7166a Stop the generation when <|eom_id|> token is encountered - needed for Llama 3.1 tool call support (#8858)
* gguf-py, llama : add constants and methods related to Llama-3.1 <|eom_id|> token

* llama : find Llama-3.1 <|eom_id|> token id during vocab loading

* llama-vocab : add Llama-3.1 <|eom_id|> token to the set of tokens stopping the generation

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-05 09:38:01 +02:00
Georgi Gerganov f1ea5146d7 llama : better replace_all (#8852) 2024-08-05 08:53:39 +03:00
Francis Couture-Harpin 5679a3bdbb Merge branch 'master' into compilade/batch-splits 2024-08-04 17:24:14 -04:00
Francis Couture-Harpin 952ed35ba8 llama : minor cosmetic changes 2024-08-04 17:23:44 -04:00
pculliton 398ede5efe Adding Gemma 2 2B configs (#8784)
* Adding Gemma 2 2B configs

Updates to Q scaling and Gemma 2 model sizes to match v2 2B model.

* Update src/llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-07-31 17:12:10 +02:00
Francis Couture-Harpin 704a303323 llama : fix Mamba session save and restore 2024-07-28 01:59:10 -04:00
Francis Couture-Harpin 0dea4263aa Merge branch 'master' into compilade/batch-splits 2024-07-28 01:20:13 -04:00
compilade 4c676c85e5 llama : refactor session file management (#8699)
* llama : refactor session file management

* llama : saving and restoring state checks for overflow

The size of the buffers should now be given to the functions working
with them, otherwise a truncated file could cause out of bound reads.

* llama : stream from session file instead of copying into a big buffer

Loading session files should no longer cause a memory usage spike.

* llama : llama_state_get_size returns the actual size instead of max

This is a breaking change, but makes that function *much* easier
to keep up to date, and it also makes it reflect the behavior
of llama_state_seq_get_size.

* llama : share code between whole and seq_id-specific state saving

Both session file types now use a more similar format.

* llama : no longer store all hparams in session files

Instead, the model arch name is stored.
The layer count and the embedding dimensions of the KV cache
are still verified when loading.
Storing all the hparams is not necessary.

* llama : fix uint64_t format type

* llama : various integer type cast and format string fixes

Some platforms use "%lu" and others "%llu" for uint64_t.
Not sure how to handle that, so casting to size_t when displaying errors.

* llama : remove _context suffix for llama_data_context

* llama : fix session file loading

llama_state_get_size cannot be used to get the max size anymore.

* llama : more graceful error handling of invalid session files

* llama : remove LLAMA_MAX_RNG_STATE

It's no longer necessary to limit the size of the RNG state,
because the max size of session files is not estimated anymore.

* llama : cast seq_id in comparison with unsigned n_seq_max
2024-07-28 00:42:05 -04:00
Jeffrey Morgan b5e95468b1 llama : add support for llama 3.1 rope scaling factors (#8676)
* Add llama 3.1 rope scaling factors to llama conversion and inference

This commit generates the rope factors on conversion and adds them to the resulting model as a tensor. At inference time, these factors are passed to the `ggml_rope_ext` rope oepration, improving results for context windows above 8192

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* address comments

* address comments

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: compilade <git@compilade.net>
2024-07-27 15:03:45 +03:00
Georgi Gerganov 92090eca21 llama : add function for model-based max number of graph nodes (#8622)
* llama : model-based max number of graph nodes

ggml-ci

* llama : disable 405B max_nodes path due to lack of complaints

ggml-ci
2024-07-27 14:59:29 +03:00
slaren 2b1f616b20 ggml : reduce hash table reset cost (#8698)
* ggml : reduce hash table reset cost

* fix unreachable code warnings after GGML_ASSERT(false)

* GGML_ASSERT(false) -> GGML_ABORT("fatal error")

* GGML_ABORT use format string
2024-07-27 04:41:55 +02:00
Judd 01245f5b16 llama : fix order of parameters (#8706)
usage of `aclrtGetMemInfo` is correct:

https://www.hiascend.com/doc_center/source/zh/canncommercial/63RC2/inferapplicationdev/aclcppdevg/aclcppdevg_03_0103.html

Co-authored-by: Judd <foldl@boxvest.com>
2024-07-26 11:38:12 +03:00
Georgi Gerganov 4226a8d10e llama : fix build + fix fabs compile warnings (#8683)
ggml-ci
2024-07-25 19:57:31 +03:00
Chen Xi ed67bcb24f [SYCL] fix multi-gpu issue on sycl (#8554)
---------

Signed-off-by: Chen Xi <xi2chen@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
2024-07-25 19:45:18 +08:00
Georgi Gerganov eddcb5238b ggml : add and use ggml_cpu_has_llamafile() (#8664) 2024-07-25 12:37:42 +03:00
Fan Shupei 8a4bad50a8 llama: use sliding window for phi3 (#8627)
* use sliding window for phi3

* fix typo, "data_swa" -> "data"

* [conver_hf_to_gguf.py] add phi3 sliding window
2024-07-25 10:21:09 +03:00
Xuan Son Nguyen b115105f05 add llama_lora_adapter_clear (#8653) 2024-07-24 11:25:19 +02:00
Francis Couture-Harpin 9c0a61f8c3 Merge branch 'master' into compilade/batch-splits 2024-07-23 13:37:09 -04:00
Georgi Gerganov 938943cdbf llama : move vocab, grammar and sampling into separate files (#8508)
* llama : move sampling code into llama-sampling

ggml-ci

* llama : move grammar code into llama-grammar

ggml-ci

* cont

ggml-ci

* cont : pre-fetch rules

* cont

ggml-ci

* llama : deprecate llama_sample_grammar

* llama : move tokenizers into llama-vocab

ggml-ci

* make : update llama.cpp deps [no ci]

* llama : redirect external API to internal APIs

ggml-ci

* llama : suffix the internal APIs with "_impl"

ggml-ci

* llama : clean-up
2024-07-23 13:10:17 +03:00
Keke Han 081fe431aa llama : fix codeshell support (#8599)
* llama : fix codeshell support

* llama : move codeshell after smollm below to respect the enum order
2024-07-22 19:43:43 +03:00
Jason Stillerman d94c6e0ccb llama : add support for SmolLm pre-tokenizer (#8609)
* Adding SmolLM Pre Tokenizer

* Update convert_hf_to_gguf_update.py

Co-authored-by: compilade <git@compilade.net>

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* handle regex

* removed .inp and out .out ggufs

---------

Co-authored-by: compilade <git@compilade.net>
2024-07-22 17:43:01 +03:00
Georgi Gerganov 6f11a83e4e llama : allow overrides for tokenizer flags (#8614)
ggml-ci
2024-07-22 13:33:22 +03:00
Douglas Hanley 50e05353e8 llama : add Mistral Nemo inference support (#8604) 2024-07-22 11:06:17 +03:00
Michael Coppola 940362224d llama : add support for Tekken pre-tokenizer (#8579)
* llama : Added support for Tekken pre-tokenizer (#8577)

Removed uneeded `vocab.tokenizer_clean_spaces` assignment

* llama : fix order of pre-tokenizers

* * Tekken pre-tokenizer no longer uses clean_up_tokenization_spaces
* Updated chkhsh for Tekken tokenizer

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-20 16:43:51 +03:00
Georgi Gerganov d197545530 llama : bump max layers from 256 to 512 (#8530)
* llama : bump max layers from 256 to 512

* llama : replace asserts with exceptions
2024-07-19 16:50:47 +03:00
Frank Mai f299aa98ec fix: typo of chatglm4 chat tmpl (#8586)
Signed-off-by: thxCode <thxcode0824@gmail.com>
2024-07-19 11:44:41 +02:00
Francis Couture-Harpin 1725de768e llama : fix t5 segfault 2024-07-17 15:36:56 -04:00
Francis Couture-Harpin 1fb5d4fdee llama : apply suggestions
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-17 14:48:09 -04:00
hipudding 1bdd8ae19f [CANN] Add Ascend NPU backend (#6035)
* [CANN] Add Ascend NPU backend

Ascend is a full-stack AI computing infrastructure for industry
applications and services based on Huawei Ascend processors and
software.

CANN (Compute Architecture of Neural Networks), developped by
Huawei, is a heterogeneous computing architecture for AI.

Co-authored-by: wangshuai09 <391746016@qq.com>

* delete trailing whitespaces

* Modify the code based on review comment

* Rename LLAMA_CANN to GGML_CANN

* Make ggml-common.h private

* add ggml_cann prefix for acl funcs

* Add logging for CANN backend

* Delete Trailing whitespace

---------

Co-authored-by: wangshuai09 <391746016@qq.com>
2024-07-17 14:23:50 +03:00
Georgi Gerganov d65a8361fe llama : disable context-shift for DeepSeek v2 (#8501) 2024-07-17 10:32:59 +03:00
Francis Couture-Harpin 7b7db0bbee llama : logits_all has priority over batch->logits
Otherwise, the server embeddings tests failed.
This was likely an existing problem but was only detected here
because of an additional assertion.
2024-07-17 01:14:26 -04:00
Francis Couture-Harpin 2e4adb47ec llama : fix integer signedness mixing 2024-07-16 22:12:47 -04:00
Francis Couture-Harpin 22504ec67e Merge branch 'master' into compilade/batch-splits 2024-07-16 20:54:39 -04:00
Francis Couture-Harpin c51daefc32 llama : advanced batch splits
This includes equal-sequence-length batch splits which are useful
to simplify recurrent model operators.

* llama : always make recurrent state slots contiguous

* ggml : simplify mamba operators
2024-07-16 20:38:48 -04:00