mirror of
https://github.com/ggml-org/llama.cpp.git
synced 2026-06-28 15:20:20 +00:00
64086f2b2f
* feat(convert): Get language model conversion working for 4.1 vision Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(convert): Skip multimodal tensors for GraniteMoeHybrid (vision 4.0) Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Disable vocab padding for non-hybrid models that use GraniteMoeHybrid Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Plumb python-side vision projector names and mappings There are several awkward things here: 1. Most of these are essentially identical to the audio qformer tensors. On the c++ side, that's mapped using the prefix, so the rest of the GGUF name needs to align, but on the python side there's no prefix notion, so they all get duplicated. 2. There are a couple of net-new tensors for vision, in particular PROJ_NORM. In both speech and vision, the QF_PROJ_NORM is qualified as belonging to the qformer portion, but the GGUF name is simply proj_norm which conflicts with the ideal name for this new PROJ_NORM that is not qualified as part of the qformer. To get around this, I used "proj_layernorm" as the GGUF name. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add python side architecture name Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add python-side plumbing for setting FEATURE_LAYERS hparam Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side tensor naming defines NOTE: Usage of these hasn't been updated to include prefix yet Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(mtmd): Convert vision_feature_layer to an ordered vector We need to preserve the ordering of these feature index values so that they can be mapped to the sub-tensors within the stacked projectors. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(mtmd): Add architecture label plumbing Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(wip): Add partial conversion for mmproj This handles stacking the projector tensors and setting the new hparams Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add gguf_writer and constant support for new hparams and deepstack layer arr Branch: Granite4Vision AI-usage: draft (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Full conversion for mmproj w/ tensor mappings Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add lm_head skip for mmproj for 4.0 Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: De-alias text_config architecture in convert_lora_to_gguf.py Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add --trust-remote-code arg to convert_lora_to_gguf.py This defaults to False, but allows a user to enable it programmatically instead of using the interactive prompt. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: De-alias model.language_model. -> model.
for lora adapters Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Extend language model tensor dealiasing in adapters Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary registration for GraniteSpeech in language model Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Plumb through mm prefix formatting for qformer tensors Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Refactor vision projector tensors to use predictor ID as the block This is cleaner than stacking them. The modeling file hard-codes single-layer qformers, so we can punt on the multiple multi-layer projectors problem. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add spatial offsets array hparam conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add stub plumbing for granite vision in mtmd Branch: Granite4Vision AI-usage: draft (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add new hparam and tensor naming in clip-impl.h New hparams: - KEY_PROJ_SAMPLE_QUERY_SIDE - KEY_PROJ_SAMPLE_WINDOW_SIDE - KEY_PROJ_SPATIAL_OFFSETS New tensors: - TN_MULTI_PROJ_IMG_POS - TN_MULTI_PROJ_QUERY - TN_MULTI_PROJ_LAYERNORM - TN_MULTI_PROJ_LINEAR - TN_MULTI_PROJ_NORM Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Move deepstack_layer_arr to llm hparam instead of mmproj Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove IS_DEEPSTACK_LAYERS This appears to have been added during Qwen3 VL (https://github.com/ggml-org/llama.cpp/pull/16780), but it was never actually used.
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: n_deepstack_layers -> deepstack_layer_arr The old logic hard coded a correspondence between the first N layers of the LLM and the 1->N entries in the input embeddings. Now, that relationship is maintained at loading time if the GGUF value is single-valued. If it is multi-valued, it loads directly allowing for deepstack layers to be spaced out throughout the model. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use try/catch for single/multi valued deepstack info The alternative would be to use get_key_or_arr, but then the single value would be populated through the entire array and we'd need to detect that and update it with the right correspondence. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add deepstack injection point for granite LLM The use of ggml_add here assumes that the elements of inp_embd will be pre- arranged to be the full embedding length with only the vision-mask'ed portions non-zero from the projector. This matches how Qwen3VL does it. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: add missing vision attn layernorm eps Branch: Granite4Vision AI-usage: full (OpenCode + Qwen 3.6-35B) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Hoist qformer tensors into qf_block and hold a vector for multi-proj Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix missing prefix template for TN_QF_PROJ_LINEAR It's not strictly necessary since vision uses the blockwise version, but it makes the loading consistent. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add embedding scale and image grid pinpoints hparams in conversion Also remove dead parsing for self._deepstack_layer_arr Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add mtmd KEY_ section for hparams shared with the LLM In this case, we need the EMBEDDING_SCALE so we can unscale the image embeddings to compensate for applying embedding scale to the input embeddings Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Implement c++ hparam parsing Branch: Granite4Vision AI-usage: draft (Claude Code) Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Flatten pinpoints in conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing break Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: No reason to have modality prefix for img_pos Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add tensor loading Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(convert): Fix confusion between proj.norm and proj.qformer.layernorm Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use the right portion of speech for tensor loading! Also plumb through the layernorm -> post_norm naming change Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add logging of deepstack_layers_arr if set I also changed the print_f output type to int32_t to avoid printing overflow values for -1. This could cause overflows on the other side, but I can't imagine a value for any of the current array hparams that would trigger that. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Make sure input embeddings are cont before f_embedding_scale Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add init and mmproj_embd cases for g4v The n_mmproj_embd is 1+ to make space for the text embedding and all 8 projectors Branch: Granite4Vision AI-usage: draft (Bob) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Invert (h, w) -> (w, h) pinpoints Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Reorder projectors based on llm index and skip the first injection The multi-projector stack has a strange asymmetry based on how it's currently implemented for qwen3vl: on the mmproj side, it's all N projectors, but the output of the "first" (by inp_embd index) projector is automatically consumed as if it were a standard single-projector mmproj, so the deepstack portion needs to only contain the 1-N entries. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix mmproj hparams in conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix ordering/logic for deepstack injection in granite Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix preprocessing config to match what the model needs Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * wip: Partial port of Eli's implementation This is still pretty broken, but it's getting closer. It now happily generates tokens, but the values are quite incorrect still. 
I suspect it's caused by the mapping of projectors from safetensors to their respective orders here. Also, this implementation breaks encapsulation pretty badly in mtmd_encode. This will need a big refactor to put the G4V-specific encoding logic somewhere more appropriate. Branch: Granite4Vision AI-usage: draft (Claude Code, Bob) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix the pre-scaling on the input embeddings to correctly invert the scale We've got tokens! They still don't line up quite right, so something's a little off, but we're getting much closer now. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: invert embedding multiplier -> base_scale at load Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix setting image_resize_pad after new enum introduced Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add G4V to mmproj mapping in conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Re-add padding disable for non-hybrid hybrid models Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Simplify G4V n_tokens computation This is slightly more efficient and flexible for when we implement the unpad cropping. IMO, it's also clearer that it is adding the number of image_newline tokens (embeddings) to the grid, rather than recomputing the entire count. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add new clip APIs for post-tile-encoding assembly Granite 4 Vision uses llava-next style pack-and-unpad which requires injecting the learned newline after each row of the tile grid. 
A row here is a single row of the grid which is composed of (grid_x * cols_per_tile) * (grid_y * rows_per_tile), so the result is newlines injected in between individual tile rows, thus not something that can be handled with the standard llava-uhd block-wise encoding. Branch: Granite4Vision AI-usage: draft (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add model interfaces for granite 4 vision assembler I'm on the fence about the best organization of this. These free functions allow the per-architecture logic in clip.cpp to access the model-specific graph building, but they still require a fair bit of model-specific logic in clip.cpp which is not ideal. I think a better approach may be to replicate what is done with the graph builders themselves (and possibly even make the assembler part of the model's existing graph builder). Branch: Granite4Vision AI-usage: full (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove all g4v-specific branching from mtmd.cpp in favor of clip assembler Branch: Granite4Vision AI-usage: full (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor(mtmd): Consolidate assembler logic into clip_assembler class family Just like `clip_graph` is the base class for building the model-specific encoder graphs, `clip_assembler` will be the base class for building the model-specific assembler graphs. This allows the assembly pattern to follow how the encoder pattern is implemented where the model-specific logic lives in a subclass co-located with the encoder graph builder that gets constructed by a simple factory method.
Branch: Granite4Vision AI-usage: full (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Comment improvement Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: granite_vision -> granite4_vision Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove dead codepath for Qwen3VL add_vision_is_deepstack These pieces were never used on the c++ side (removed there in an earlier commit), so this is just cleanup that I missed before. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Oops! I did not mean to commit one of my prompt files But now it's too far back in history to effectively rebase out, even with interactive and --rebase-merges :( Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing <algorithm> include for std::find It seems that this was already pulled in on some platforms, but not on others Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix Flake8 warnings in granite conversion module Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove clip_assembler in favor of clip_image_f32.append_token Per conversation in the PR, the clip_assembler pattern was too invasive. This is a compromise that limits model-specific blocks to add_media where each preprocessed tile is annotated with an injection type, after which all the token counting logic is generic and the newline injection itself is handled in the graph based on the value for the given tile image. 
Branch: Granite4Vision AI-usage: draft (Bob, OpenCode + Qwen 3.6 35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor(convert): Split n_deepstack_layers and deepstack_layers (array) Branch: Granite4Vision AI-usage: full (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor(src): Handle n_deepstack_layers and deepstack_layers GGUF keys Branch: Granite4Vision AI-usage: draft (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix GGUF key for deepstack_layers_arr Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove pre-scaling embeddings and skip scaling for raw embd inputs This follows how gemma3 and gemma4 handle embedding scaling by skipping the multiplier for raw input embeddings. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: deepstack_layers(_arr) -> deepstack_mapping(_arr) Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Fully revert changes to n_deepstack_layers and qwen3vl* Since we're going to keep the GGUF KVs separate, it makes sense to just keep the hparams separate too to limit the scope of this branch. The down side is that n_deepstack_layers and deepstack_mapping_arr are potentially conflicting. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Revert removal of "is_deepstack_layers" GGUF KV This KV is not used at all on the c++ side, so it's fully dead, but there's also no need to conflate this cleanup with the addition of G4V. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary ggml_cont and build_forward_expand in cbx Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Clean up comments Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Tighter and more flexible code for g4v_build_block This could be refactored to look a lot more like granite-speech, but the overall block constructs before/after the qformer are pretty different, so for now I'm going to leave it as is and just tighten a bit. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary `unordered_set` include Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add architecture guard on deepstack_mapping_arr printout Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary AI-gen comment Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Always initialize deepstack_mapping_arr with -1 values This was causing `test-llama-archs` to fail, likely due to trying to save the uninitialized values, then re-loading them. It's safer to always initialize so that other models don't forget and end up with undefined behavior. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Remove TODO about block/vs non-block tensor mapping Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Move is_vision_feature_layer logic into clip_hparams Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a bool for append_token Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Remove unnecessary comment Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unused get_model api yikes! Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rearrange helpers for g4v to be private members and use build_attn Branch: Granite4Vision AI-usage: full (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix off-by-one in vision layer index This was inherited from the Claude Code implementation that pushed the negative index inversion down into the model file. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix norm/post_norm mixup in conversion face. palm. 
:( Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: More descriptive tensor names Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Apply PR cleanup for new conversion changes AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix(convert): Remove duplicate V_ENC_EMBD_IMGNL Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: append_token -> add_newline Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Comment cleanup Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Cleaner error handling/checking NOTE: format_string is not available in granite.cpp (and including clip-impl.h to get it doesn't compile, so I think it violates the intended encapsulation), so std::stringstream is the simplest answer. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
1704 lines
70 KiB
C++
#include "llama-model-loader.h"

#include "ggml-alloc.h"
#include "ggml.h"
#include "gguf.h"
#include "llama-hparams.h"

#include <algorithm>
#include <array>
#include <cinttypes>
#include <cstdint>
#include <cstring>
#include <future>
#include <regex>

static const size_t kiB = 1024;
static const size_t MiB = 1024*kiB;
static const size_t GiB = 1024*MiB;

const char * llama_file_version_name(llama_fver version) {
    switch (version) {
        case GGUF_FILE_VERSION_V1: return "GGUF V1 (support until nov 2023)";
        case GGUF_FILE_VERSION_V2: return "GGUF V2";
        case GGUF_FILE_VERSION_V3: return "GGUF V3 (latest)";
    }

    return "unknown";
}

static std::string llama_model_ftype_name(llama_ftype ftype) {
    if (ftype & LLAMA_FTYPE_GUESSED) {
        return llama_model_ftype_name((enum llama_ftype) (ftype & ~LLAMA_FTYPE_GUESSED)) + " (guessed)";
    }

    switch (ftype) {
        case LLAMA_FTYPE_ALL_F32:          return "all F32";
        case LLAMA_FTYPE_MOSTLY_F16:       return "F16";
        case LLAMA_FTYPE_MOSTLY_BF16:      return "BF16";
        case LLAMA_FTYPE_MOSTLY_Q1_0:      return "Q1_0";
        case LLAMA_FTYPE_MOSTLY_Q4_0:      return "Q4_0";
        case LLAMA_FTYPE_MOSTLY_Q4_1:      return "Q4_1";
        case LLAMA_FTYPE_MOSTLY_Q5_0:      return "Q5_0";
        case LLAMA_FTYPE_MOSTLY_Q5_1:      return "Q5_1";
        case LLAMA_FTYPE_MOSTLY_Q8_0:      return "Q8_0";
        case LLAMA_FTYPE_MOSTLY_MXFP4_MOE: return "MXFP4 MoE";
        case LLAMA_FTYPE_MOSTLY_NVFP4:     return "NVFP4";
        case LLAMA_FTYPE_MOSTLY_Q2_K:      return "Q2_K - Medium";
        case LLAMA_FTYPE_MOSTLY_Q2_K_S:    return "Q2_K - Small";
        case LLAMA_FTYPE_MOSTLY_Q3_K_S:    return "Q3_K - Small";
        case LLAMA_FTYPE_MOSTLY_Q3_K_M:    return "Q3_K - Medium";
        case LLAMA_FTYPE_MOSTLY_Q3_K_L:    return "Q3_K - Large";
        case LLAMA_FTYPE_MOSTLY_Q4_K_S:    return "Q4_K - Small";
        case LLAMA_FTYPE_MOSTLY_Q4_K_M:    return "Q4_K - Medium";
        case LLAMA_FTYPE_MOSTLY_Q5_K_S:    return "Q5_K - Small";
        case LLAMA_FTYPE_MOSTLY_Q5_K_M:    return "Q5_K - Medium";
        case LLAMA_FTYPE_MOSTLY_Q6_K:      return "Q6_K";
        case LLAMA_FTYPE_MOSTLY_TQ1_0:     return "TQ1_0 - 1.69 bpw ternary";
        case LLAMA_FTYPE_MOSTLY_TQ2_0:     return "TQ2_0 - 2.06 bpw ternary";
        case LLAMA_FTYPE_MOSTLY_IQ2_XXS:   return "IQ2_XXS - 2.0625 bpw";
        case LLAMA_FTYPE_MOSTLY_IQ2_XS:    return "IQ2_XS - 2.3125 bpw";
        case LLAMA_FTYPE_MOSTLY_IQ2_S:     return "IQ2_S - 2.5 bpw";
        case LLAMA_FTYPE_MOSTLY_IQ2_M:     return "IQ2_M - 2.7 bpw";
        case LLAMA_FTYPE_MOSTLY_IQ3_XS:    return "IQ3_XS - 3.3 bpw";
        case LLAMA_FTYPE_MOSTLY_IQ3_XXS:   return "IQ3_XXS - 3.0625 bpw";
        case LLAMA_FTYPE_MOSTLY_IQ1_S:     return "IQ1_S - 1.5625 bpw";
        case LLAMA_FTYPE_MOSTLY_IQ1_M:     return "IQ1_M - 1.75 bpw";
        case LLAMA_FTYPE_MOSTLY_IQ4_NL:    return "IQ4_NL - 4.5 bpw";
        case LLAMA_FTYPE_MOSTLY_IQ4_XS:    return "IQ4_XS - 4.25 bpw";
        case LLAMA_FTYPE_MOSTLY_IQ3_S:     return "IQ3_S - 3.4375 bpw";
        case LLAMA_FTYPE_MOSTLY_IQ3_M:     return "IQ3_S mix - 3.66 bpw";

        default: return "unknown, may not work";
    }
}
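The "guessed" handling above strips a flag bit, names the remaining base type, and appends a marker. A minimal standalone sketch of that pattern follows. The `demo_*` names and the numeric enum values are hypothetical and are not the real `llama_ftype` constants.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for llama_ftype; the numeric values are illustrative only.
enum demo_ftype : int {
    DEMO_FTYPE_F16     = 1,
    DEMO_FTYPE_Q4_0    = 2,
    DEMO_FTYPE_GUESSED = 1024, // flag bit OR-ed onto a base type
};

static std::string demo_ftype_name(int ftype) {
    // When the flag is set, clear it, name the base type, then append a marker.
    if (ftype & DEMO_FTYPE_GUESSED) {
        return demo_ftype_name(ftype & ~DEMO_FTYPE_GUESSED) + " (guessed)";
    }

    switch (ftype) {
        case DEMO_FTYPE_F16:  return "F16";
        case DEMO_FTYPE_Q4_0: return "Q4_0";
        default:              return "unknown, may not work";
    }
}
```

Because the flag is cleared before the recursive call, the recursion runs at most one level deep.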

// return a list of splits for a given path
// for example, given "<name>-00002-of-00004.gguf", returns list of all 4 splits
static std::vector<std::string> llama_get_list_splits(const std::string & path, const int idx, const int n_split) {
    std::vector<std::string> paths;
    std::string split_prefix;
    std::vector<char> buf(llama_path_max(), 0);

    {
        int ret = llama_split_prefix(buf.data(), buf.size(), path.c_str(), idx, n_split);
        if (!ret) {
            throw std::runtime_error(format("invalid split file name: %s", path.c_str()));
        }
        split_prefix = std::string(buf.data(), ret);
    }

    if (split_prefix.empty()) {
        throw std::runtime_error(format("invalid split file: %s", path.c_str()));
    }

    for (int idx = 0; idx < n_split; ++idx) {
        int ret = llama_split_path(buf.data(), buf.size(), split_prefix.c_str(), idx, n_split);
        paths.push_back(std::string(buf.data(), ret));
    }

    return paths;
}
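To make the split naming concrete, here is a small sketch that builds the list of split file names from a prefix. The pattern is inferred from the `"<name>-00002-of-00004.gguf"` example in the comment above, and the helper `demo_list_splits` is hypothetical. The real code delegates to `llama_split_prefix` and `llama_split_path`, whose exact behavior is not shown in this file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: builds "<prefix>-NNNNN-of-NNNNN.gguf" for every split.
// Assumption: file names are 1-based while the loop index is 0-based.
static std::vector<std::string> demo_list_splits(const std::string & prefix, int n_split) {
    std::vector<std::string> paths;
    char buf[512];

    for (int i = 0; i < n_split; ++i) {
        const int len = std::snprintf(buf, sizeof(buf), "%s-%05d-of-%05d.gguf",
                                      prefix.c_str(), i + 1, n_split);
        if (len < 0 || (size_t) len >= sizeof(buf)) {
            throw std::runtime_error("split path too long: " + prefix);
        }
        paths.emplace_back(buf, (size_t) len);
    }

    return paths;
}
```

Calling `demo_list_splits("model", 4)` yields four names, the second being `model-00002-of-00004.gguf`.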

namespace GGUFMeta {
    template <typename T, gguf_type gt_, T (*gfun)(const gguf_context *, const int64_t)>
    struct GKV_Base_Type {
        static constexpr gguf_type gt = gt_;

        static T getter(const gguf_context * ctx, const int kid) {
            return gfun(ctx, kid);
        }
    };

    template<typename T> struct GKV_Base;

    template<> struct GKV_Base<bool        >: GKV_Base_Type<bool,         GGUF_TYPE_BOOL,    gguf_get_val_bool> {};
    template<> struct GKV_Base<uint8_t     >: GKV_Base_Type<uint8_t,      GGUF_TYPE_UINT8,   gguf_get_val_u8  > {};
    template<> struct GKV_Base<uint16_t    >: GKV_Base_Type<uint16_t,     GGUF_TYPE_UINT16,  gguf_get_val_u16 > {};
    template<> struct GKV_Base<uint32_t    >: GKV_Base_Type<uint32_t,     GGUF_TYPE_UINT32,  gguf_get_val_u32 > {};
    template<> struct GKV_Base<uint64_t    >: GKV_Base_Type<uint64_t,     GGUF_TYPE_UINT64,  gguf_get_val_u64 > {};
    template<> struct GKV_Base<int8_t      >: GKV_Base_Type<int8_t,       GGUF_TYPE_INT8,    gguf_get_val_i8  > {};
    template<> struct GKV_Base<int16_t     >: GKV_Base_Type<int16_t,      GGUF_TYPE_INT16,   gguf_get_val_i16 > {};
    template<> struct GKV_Base<int32_t     >: GKV_Base_Type<int32_t,      GGUF_TYPE_INT32,   gguf_get_val_i32 > {};
    template<> struct GKV_Base<int64_t     >: GKV_Base_Type<int64_t,      GGUF_TYPE_INT64,   gguf_get_val_i64 > {};
    template<> struct GKV_Base<float       >: GKV_Base_Type<float,        GGUF_TYPE_FLOAT32, gguf_get_val_f32 > {};
    template<> struct GKV_Base<double      >: GKV_Base_Type<double,       GGUF_TYPE_FLOAT64, gguf_get_val_f64 > {};
    template<> struct GKV_Base<const char *>: GKV_Base_Type<const char *, GGUF_TYPE_STRING,  gguf_get_val_str > {};

    template<> struct GKV_Base<std::string> {
        static constexpr gguf_type gt = GGUF_TYPE_STRING;

        static std::string getter(const gguf_context * ctx, const int kid) {
            return gguf_get_val_str(ctx, kid);
        }
    };

    struct ArrayInfo {
        const gguf_type gt;
        const size_t length;
        const void * data;
    };

    template<> struct GKV_Base<ArrayInfo> {
        public:
        static constexpr gguf_type gt = GGUF_TYPE_ARRAY;
        static ArrayInfo getter(const gguf_context *ctx, const int k) {
            const enum gguf_type arr_type = gguf_get_arr_type(ctx, k);
            return ArrayInfo {
                arr_type,
                gguf_get_arr_n(ctx, k),
                arr_type == GGUF_TYPE_STRING ? nullptr : gguf_get_arr_data(ctx, k),
            };
        }
    };

    template<typename T>
    class GKV : public GKV_Base<T> {
        GKV() = delete;

        public:
        static T get_kv(const gguf_context * ctx, const int k) {
            const enum gguf_type kt = gguf_get_kv_type(ctx, k);

            if (kt != GKV::gt) {
                throw std::runtime_error(format("key %s has wrong type %s but expected type %s",
                    gguf_get_key(ctx, k), gguf_type_name(kt), gguf_type_name(GKV::gt)));
            }
            return GKV::getter(ctx, k);
        }

        static const char * override_type_to_str(const llama_model_kv_override_type ty) {
            switch (ty) {
                case LLAMA_KV_OVERRIDE_TYPE_BOOL:  return "bool";
                case LLAMA_KV_OVERRIDE_TYPE_INT:   return "int";
                case LLAMA_KV_OVERRIDE_TYPE_FLOAT: return "float";
                case LLAMA_KV_OVERRIDE_TYPE_STR:   return "str";
            }
            return "unknown";
        }

        static bool validate_override(const llama_model_kv_override_type expected_type, const struct llama_model_kv_override * ovrd) {
            if (!ovrd) { return false; }
            if (ovrd->tag == expected_type) {
                LLAMA_LOG_INFO("%s: Using metadata override (%5s) '%s' = ",
                    __func__, override_type_to_str(ovrd->tag), ovrd->key);
                switch (ovrd->tag) {
                    case LLAMA_KV_OVERRIDE_TYPE_BOOL: {
                        LLAMA_LOG_INFO("%s\n", ovrd->val_bool ? "true" : "false");
                    } break;
                    case LLAMA_KV_OVERRIDE_TYPE_INT: {
                        LLAMA_LOG_INFO("%" PRId64 "\n", ovrd->val_i64);
                    } break;
                    case LLAMA_KV_OVERRIDE_TYPE_FLOAT: {
                        LLAMA_LOG_INFO("%.6f\n", ovrd->val_f64);
                    } break;
                    case LLAMA_KV_OVERRIDE_TYPE_STR: {
                        LLAMA_LOG_INFO("%s\n", ovrd->val_str);
                    } break;
                    default:
                        // Shouldn't be possible to end up here, but just in case...
                        throw std::runtime_error(
                            format("Unsupported attempt to override %s type for metadata key %s\n",
                                override_type_to_str(ovrd->tag), ovrd->key));
                }
                return true;
            }
            LLAMA_LOG_WARN("%s: Warning: Bad metadata override type for key '%s', expected %s but got %s\n",
                __func__, ovrd->key, override_type_to_str(expected_type), override_type_to_str(ovrd->tag));
            return false;
        }

        template<typename OT>
        static typename std::enable_if<std::is_same<OT, bool>::value, bool>::type
        try_override(OT & target, const struct llama_model_kv_override * ovrd) {
            if (validate_override(LLAMA_KV_OVERRIDE_TYPE_BOOL, ovrd)) {
                target = ovrd->val_bool;
                return true;
            }
            return false;
        }

        template<typename OT>
        static typename std::enable_if<!std::is_same<OT, bool>::value && std::is_integral<OT>::value, bool>::type
        try_override(OT & target, const struct llama_model_kv_override * ovrd) {
            if (validate_override(LLAMA_KV_OVERRIDE_TYPE_INT, ovrd)) {
                target = ovrd->val_i64;
                return true;
            }
            return false;
        }

        template<typename OT>
        static typename std::enable_if<std::is_floating_point<OT>::value, bool>::type
        try_override(T & target, const struct llama_model_kv_override * ovrd) {
            if (validate_override(LLAMA_KV_OVERRIDE_TYPE_FLOAT, ovrd)) {
                target = ovrd->val_f64;
                return true;
            }
            return false;
        }

        template<typename OT>
        static typename std::enable_if<std::is_same<OT, std::string>::value, bool>::type
        try_override(T & target, const struct llama_model_kv_override * ovrd) {
            if (validate_override(LLAMA_KV_OVERRIDE_TYPE_STR, ovrd)) {
                target = ovrd->val_str;
                return true;
            }
            return false;
        }

        static bool set(const gguf_context * ctx, const int k, T & target, const struct llama_model_kv_override * ovrd = nullptr) {
            if (try_override<T>(target, ovrd)) {
                return true;
            }
            if (k < 0) { return false; }
            target = get_kv(ctx, k);
            return true;
        }

        static bool set(const gguf_context * ctx, const char * key, T & target, const struct llama_model_kv_override * ovrd = nullptr) {
            return set(ctx, gguf_find_key(ctx, key), target, ovrd);
        }

        static bool set(const gguf_context * ctx, const std::string & key, T & target, const struct llama_model_kv_override * ovrd = nullptr) {
            return set(ctx, key.c_str(), target, ovrd);
        }
    };
}
|
|
|
|
template<typename T>
typename std::enable_if<std::is_integral<T>::value, bool>::type
llama_model_loader::get_arr_n(const std::string & key, T & result, bool required) {
    const int kid = gguf_find_key(metadata, key.c_str());

    if (kid < 0) {
        if (required) {
            throw std::runtime_error(format("key not found in model: %s", key.c_str()));
        }
        return false;
    }

    struct GGUFMeta::ArrayInfo arr_info =
        GGUFMeta::GKV<GGUFMeta::ArrayInfo>::get_kv(metadata, kid);


    result = arr_info.length;
    return true;
}

template<typename T>
typename std::enable_if<std::is_integral<T>::value, bool>::type
llama_model_loader::get_arr_n(enum llm_kv kid, T & result, bool required) {
    return get_arr_n(llm_kv(kid), result, required);
}

template bool llama_model_loader::get_arr_n(enum llm_kv kid, uint32_t & result, bool required);

template<typename T>
bool llama_model_loader::get_arr(const std::string & key, std::vector<T> & result, bool required) {
    const gguf_context * ctx = metadata;
    const int kid = gguf_find_key(ctx, key.c_str());

    if (kid < 0 || gguf_get_kv_type(ctx, kid) != GGUF_TYPE_ARRAY) {
        if (required) {
            throw std::runtime_error(format("array key not found in model: %s", key.c_str()));
        }
        return false;
    }

    struct GGUFMeta::ArrayInfo arr_info =
        GGUFMeta::GKV<GGUFMeta::ArrayInfo>::get_kv(ctx, kid);

    switch (arr_info.gt) {
        case GGUF_TYPE_UINT32:
        case GGUF_TYPE_INT32: GGML_ASSERT((std::is_same<T, int32_t>::value) ||
                                          (std::is_same<T, uint32_t>::value)); break;
        case GGUF_TYPE_FLOAT32: GGML_ASSERT((std::is_same<T, float>::value)); break;
        case GGUF_TYPE_STRING: GGML_ASSERT((std::is_same<T, std::string>::value)); break;
        default:
            throw std::runtime_error(format("%s is not a string/float32/uint32/int32 array", key.c_str()));
    }

    if constexpr (std::is_same<T, std::string>::value) {
        const size_t n_items = gguf_get_arr_n(ctx, kid);
        result.clear();

        for (size_t i = 0; i < n_items; i++) {
            const T value = gguf_get_arr_str(ctx, kid, i);
            result.emplace_back(value);
        }
    } else {
        result.resize(arr_info.length);
        result.assign((const T*)arr_info.data, (const T *)arr_info.data + arr_info.length);
    }

    return true;
}

template<typename T, size_t N_MAX>
bool llama_model_loader::get_arr(const std::string & key, std::array<T, N_MAX> & result, bool required) {
    const gguf_context * ctx = metadata;
    const int kid = gguf_find_key(ctx, key.c_str());

    if (kid < 0 || gguf_get_kv_type(ctx, kid) != GGUF_TYPE_ARRAY) {
        if (required) {
            throw std::runtime_error(format("array key not found in model: %s", key.c_str()));
        }
        return false;
    }

    struct GGUFMeta::ArrayInfo arr_info =
        GGUFMeta::GKV<GGUFMeta::ArrayInfo>::get_kv(ctx, kid);

    switch (arr_info.gt) {
        case GGUF_TYPE_BOOL:
        case GGUF_TYPE_UINT32:
        case GGUF_TYPE_INT32: GGML_ASSERT((std::is_same<T, int32_t>::value) ||
                                          (std::is_same<T, uint32_t>::value)); break;
        case GGUF_TYPE_FLOAT32: GGML_ASSERT((std::is_same<T, float>::value)); break;
        case GGUF_TYPE_STRING: GGML_ASSERT((std::is_same<T, std::string>::value)); break;
        default:
            throw std::runtime_error(format("%s is not a string/float32/uint32/int32 array", key.c_str()));
    }

    if (arr_info.length > N_MAX) {
        throw std::runtime_error(format("array length %u for key %s exceeds max %u", (uint32_t) arr_info.length, key.c_str(), (uint32_t) N_MAX));
    }

    if constexpr (std::is_same<T, std::string>::value) {
        const size_t n_items = gguf_get_arr_n(ctx, kid);

        for (size_t i = 0; i < n_items; i++) {
            const T value = gguf_get_arr_str(ctx, kid, i);
            result[i] = value;
        }
    } else {
        if (arr_info.gt == GGUF_TYPE_BOOL) {
            const int8_t * values = (const int8_t *) arr_info.data;
            std::transform(values, values + arr_info.length, result.begin(), [](int8_t x) {
                return static_cast<T>(x != 0);
            });
        } else {
            std::copy((const T*)arr_info.data, (const T *)arr_info.data + arr_info.length, result.begin());
        }
    }

    return true;
}

template<typename T>
bool llama_model_loader::get_arr(enum llm_kv kid, T & result, bool required) {
    return get_arr(llm_kv(kid), result, required);
}

template bool llama_model_loader::get_arr<std::vector<std::string>>(enum llm_kv kid, std::vector<std::string> & result, bool required);
template bool llama_model_loader::get_arr<std::array<int32_t, 512>>(enum llm_kv kid, std::array<int32_t, 512> & result, bool required);

template<typename T>
bool llama_model_loader::get_key(const std::string & key, T & result, bool required) {
    auto it = kv_overrides.find(key);

    const struct llama_model_kv_override * override =
        it != kv_overrides.end() ? &it->second : nullptr;

    const bool found = GGUFMeta::GKV<T>::set(metadata, key, result, override);

    if (required && !found) {
        throw std::runtime_error(format("key not found in model: %s", key.c_str()));
    }

    return found;
}

template<typename T>
bool llama_model_loader::get_key(enum llm_kv kid, T & result, bool required) {
    return get_key(llm_kv(kid), result, required);
}

template bool llama_model_loader::get_key<bool> (enum llm_kv kid, bool & result, bool required);
template bool llama_model_loader::get_key<float> (enum llm_kv kid, float & result, bool required);
template bool llama_model_loader::get_key<uint32_t> (enum llm_kv kid, uint32_t & result, bool required);
template bool llama_model_loader::get_key<std::string>(enum llm_kv kid, std::string & result, bool required);

template<>
bool llama_model_loader::get_key(enum llm_kv kid, enum llama_pooling_type & result, bool required) {
    uint32_t tmp;
    const bool found = get_key(kid, tmp, required);
    if (found) {
        result = (enum llama_pooling_type) tmp;
    } else {
        result = LLAMA_POOLING_TYPE_UNSPECIFIED;
    }
    return found;
}

// get array of n <= N_MAX elements, or a single element repeated n times
template<typename T, size_t N_MAX>
bool llama_model_loader::get_key_or_arr(const std::string & key, std::array<T, N_MAX> & result, uint32_t n, bool required) {
    const int kid = gguf_find_key(metadata, key.c_str());

    if (kid < 0) {
        if (required) {
            throw std::runtime_error(format("key not found in model: %s", key.c_str()));
        }
        return false;
    }

    if (n > N_MAX) {
        throw std::runtime_error(format("n > N_MAX: %u > %u for key %s", n, (uint32_t) N_MAX, key.c_str()));
    }

    if (gguf_get_kv_type(metadata, kid) == GGUF_TYPE_ARRAY) {
        struct GGUFMeta::ArrayInfo arr_info =
            GGUFMeta::GKV<GGUFMeta::ArrayInfo>::get_kv(metadata, kid);

        if (n != arr_info.length) {
            throw std::runtime_error(format("key %s has wrong array length; expected %u, got %u", key.c_str(), n, (uint32_t) arr_info.length));
        }

        return get_arr(key, result, required);
    }

    T value;

    bool ok = get_key(key, value, required);
    if (!ok) {
        return false;
    }

    for (uint32_t i = 0; i < n; i++) {
        result[i] = value;
    }

    return true;
}

template<typename T>
bool llama_model_loader::get_key_or_arr(enum llm_kv kid, T & result, uint32_t n, bool required) {
    return get_key_or_arr(llm_kv(kid), result, n, required);
}

bool llama_model_loader::get_key_or_arr(enum llm_kv kid, uint32_t & result, bool required) {
    const std::string key = llm_kv(kid);

    const int id = gguf_find_key(metadata, key.c_str());

    if (id < 0) {
        if (required) {
            throw std::runtime_error(format("key not found in model: %s", key.c_str()));
        }
        return false;
    }

    // throw an error if the type is an array
    if (gguf_get_kv_type(metadata, id) == GGUF_TYPE_ARRAY) {
        if (required) {
            throw std::runtime_error(format("expected scalar, found array for key: %s", key.c_str()));
        }
        return false;
    }

    return get_key(key, result, required);
}

// TODO: this is not very clever - figure out something better
template bool llama_model_loader::get_key_or_arr<std::array<int, 4>> (enum llm_kv kid, std::array<int, 4> & result, uint32_t n, bool required);
template bool llama_model_loader::get_key_or_arr<std::array<uint32_t, 512>>(enum llm_kv kid, std::array<uint32_t, 512> & result, uint32_t n, bool required);
template bool llama_model_loader::get_key_or_arr<std::array<float, 512>>(enum llm_kv kid, std::array<float, 512> & result, uint32_t n, bool required);


llama_model_loader::llama_model_loader(
        struct gguf_context * meta,
        llama_model_set_tensor_data_t set_tensor_data,
        void * set_tensor_data_ud,
        const std::string & fname,
        std::vector<std::string> & splits,
        FILE * file,
        bool use_mmap,
        bool use_direct_io,
        bool check_tensors,
        bool no_alloc,
        const llama_model_kv_override * param_overrides_p,
        const llama_model_tensor_buft_override * param_tensor_buft_overrides_p)
    : metadata(meta), set_tensor_data(set_tensor_data), set_tensor_data_ud(set_tensor_data_ud) {
    int trace = 0;
    if (getenv("LLAMA_TRACE")) {
        trace = atoi(getenv("LLAMA_TRACE"));
    }

    if (param_overrides_p != nullptr) {
        for (const struct llama_model_kv_override * p = param_overrides_p; p->key[0] != 0; p++) {
            kv_overrides.insert({std::string(p->key), *p});
        }
    }

    tensor_buft_overrides = param_tensor_buft_overrides_p;

    if (!fname.empty()) {
        // Load the main GGUF
        struct ggml_context * ctx = NULL;
        struct gguf_init_params params = {
            /*.no_alloc = */ true,
            /*.ctx = */ &ctx,
        };

        metadata_ptr.reset(gguf_init_from_file(fname.c_str(), params));
        metadata = metadata_ptr.get();
        if (metadata == nullptr) {
            throw std::runtime_error(format("%s: failed to load model from %s", __func__, fname.c_str()));
        }

        get_key(llm_kv(LLM_KV_GENERAL_ARCHITECTURE), arch_name, false);
        llm_kv = LLM_KV(llm_arch_from_string(arch_name));

        files.emplace_back(new llama_file(fname.c_str(), "rb", use_direct_io));
        contexts.emplace_back(ctx);

        if (use_mmap && use_direct_io) {
            if (files.back()->has_direct_io()) {
                LLAMA_LOG_WARN("%s: direct I/O is enabled, disabling mmap\n", __func__);
                use_mmap = false;
            } else {
                LLAMA_LOG_WARN("%s: direct I/O is not available, using mmap\n", __func__);
                use_direct_io = false;

                // reopen file using std::fopen for mmap
                files.pop_back();
                files.emplace_back(new llama_file(fname.c_str(), "rb", false));
            }
        }

        // Save tensors data offset of the main file.
        // For subsidiary files, `meta` tensor data offset must not be used,
        // so we build a unified tensors index for weights.
        for (ggml_tensor * cur = ggml_get_first_tensor(ctx); cur; cur = ggml_get_next_tensor(ctx, cur)) {
            std::string tensor_name = std::string(cur->name);
            // make sure there are no duplicated tensor names
            if (weights_map.find(tensor_name) != weights_map.end()) {
                throw std::runtime_error(format("invalid model: tensor '%s' is duplicated", ggml_get_name(cur)));
            }
            n_elements += ggml_nelements(cur);
            n_bytes += ggml_nbytes(cur);
            weights_map.emplace(tensor_name, llama_tensor_weight(files.back().get(), 0, metadata, cur));
        }
        uint16_t n_split = 0;
        get_key(llm_kv(LLM_KV_SPLIT_COUNT), n_split, false);

        // Load additional GGML contexts
        if (n_split > 1) {
            // make sure the main file is loaded first
            uint16_t idx = 0;
            const std::string kv_split_no = llm_kv(LLM_KV_SPLIT_NO);
            get_key(kv_split_no, idx);
            if (idx != 0) {
                throw std::runtime_error(format("illegal split file idx: %d (file: %s), model must be loaded with the first split", idx, fname.c_str()));
            }

            // generate list of splits if needed
            if (splits.empty()) {
                splits = llama_get_list_splits(fname, idx, n_split);
            }

            // in case the user gives a custom list of splits, check that it matches the expected number
            if (n_split != (uint16_t)splits.size()) {
                throw std::runtime_error(format("invalid split count, given: %zu splits, but expected %d", splits.size(), n_split));
            }

            if (trace > 0) {
                LLAMA_LOG_INFO("%s: loading additional %d GGUFs\n", __func__, n_split);
            }

            // load other splits
            for (idx = 1; idx < n_split; idx++) {
                const char * fname_split = splits[idx].c_str();

                struct gguf_init_params split_params = {
                    /*.no_alloc = */ true,
                    /*.ctx = */ &ctx,
                };
                gguf_context_ptr ctx_gguf { gguf_init_from_file(fname_split, split_params) };
                if (!ctx_gguf) {
                    throw std::runtime_error(format("%s: failed to load GGUF split from %s", __func__, fname_split));
                }

                // check idx
                {
                    const int kid = gguf_find_key(ctx_gguf.get(), kv_split_no.c_str());
                    if (kid < 0) {
                        throw std::runtime_error(format("missing key %s in GGUF split %s", kv_split_no.c_str(), fname_split));
                    }
                    int idx_gguf = gguf_get_val_u16(ctx_gguf.get(), kid);
                    if (idx_gguf != idx) {
                        throw std::runtime_error(format("invalid split file idx: %d (file: %s), expected %d", idx_gguf, fname_split, idx));
                    }
                }

                files.emplace_back(new llama_file(fname_split, "rb", use_direct_io));
                contexts.emplace_back(ctx);

                // Save tensors data offset info of the shard.
                for (ggml_tensor * cur = ggml_get_first_tensor(ctx); cur; cur = ggml_get_next_tensor(ctx, cur)) {
                    std::string tensor_name = std::string(cur->name);
                    // make sure there are no duplicated tensor names
                    if (weights_map.find(tensor_name) != weights_map.end()) {
                        throw std::runtime_error(format("invalid model: tensor '%s' is duplicated", ggml_get_name(cur)));
                    }
                    n_elements += ggml_nelements(cur);
                    n_bytes += ggml_nbytes(cur);
                    weights_map.emplace(tensor_name, llama_tensor_weight(files.back().get(), idx, ctx_gguf.get(), cur));
                }
            }

            get_key(llm_kv(LLM_KV_SPLIT_TENSORS_COUNT), n_tensors);

            // sanity check
            {
                const int n_tensors_loaded = (int) weights_map.size();
                if (n_tensors != n_tensors_loaded) {
                    throw std::runtime_error(format("corrupted model: %d tensors expected but %d found", n_tensors, n_tensors_loaded));
                }
            }

            LLAMA_LOG_INFO("%s: additional %d GGUFs metadata loaded.\n", __func__, n_split - 1);
        }
    } else if (file != nullptr) {
        struct ggml_context * ctx = NULL;
        struct gguf_init_params params = {
            /*.no_alloc = */ true,
            /*.ctx = */ &ctx,
        };

        metadata_ptr.reset(gguf_init_from_file_ptr(file, params));
        metadata = metadata_ptr.get();
        if (metadata == nullptr) {
            throw std::runtime_error(format("%s: failed to load model from file pointer", __func__));
        }

        get_key(llm_kv(LLM_KV_GENERAL_ARCHITECTURE), arch_name, false);
        llm_kv = LLM_KV(llm_arch_from_string(arch_name));

        files.emplace_back(new llama_file(file));
        contexts.emplace_back(ctx);

        // Save tensors data offset info of the main file.
        for (ggml_tensor * cur = ggml_get_first_tensor(ctx); cur; cur = ggml_get_next_tensor(ctx, cur)) {
            std::string tensor_name = std::string(cur->name);
            // make sure there are no duplicated tensor names
            if (weights_map.find(tensor_name) != weights_map.end()) {
                throw std::runtime_error(format("invalid model: tensor '%s' is duplicated", ggml_get_name(cur)));
            }
            n_elements += ggml_nelements(cur);
            n_bytes += ggml_nbytes(cur);
            weights_map.emplace(tensor_name, llama_tensor_weight(files.back().get(), 0, metadata, cur));
        }
    } else {
        get_key(llm_kv(LLM_KV_GENERAL_ARCHITECTURE), arch_name, false);
        llm_kv = LLM_KV(llm_arch_from_string(arch_name));
    }

    n_kv = gguf_get_n_kv(metadata);
    n_tensors = weights_map.size();

    fver = (enum llama_fver) gguf_get_version(metadata);

    LLAMA_LOG_INFO("%s: loaded meta data with %d key-value pairs and %d tensors from %s (version %s)\n",
            __func__, n_kv, n_tensors, fname.empty() ? "(file*)" : fname.c_str(), llama_file_version_name(fver));

    // determine file type based on the number of tensors for each quantization and print meta data
    // TODO: make optional
    {
        std::map<enum ggml_type, uint32_t> n_type;

        uint32_t n_type_max = 0;
        enum ggml_type type_max = GGML_TYPE_F32;

        for (const auto & it : weights_map) {
            const llama_tensor_weight & w = it.second;
            const ggml_tensor * tensor = w.tensor;

            enum ggml_type type = tensor->type;

            n_type[type]++;

            if (n_type_max < n_type[type]) {
                n_type_max = n_type[type];
                type_max = type;
            }

            if (trace > 0) {
                const uint16_t sid = w.idx;
                LLAMA_LOG_INFO("%s: - tensor split %2d: %32s %-8s [ %s ] %8.2f MiB\n", __func__,
                        sid, ggml_get_name(tensor), ggml_type_name(type), llama_format_tensor_shape(tensor).c_str(),
                        ggml_nbytes(tensor)/1024.0f/1024.0f);
            }
        }

        switch (type_max) {
            case GGML_TYPE_F32: ftype = LLAMA_FTYPE_ALL_F32; break;
            case GGML_TYPE_F16: ftype = LLAMA_FTYPE_MOSTLY_F16; break;
            case GGML_TYPE_BF16: ftype = LLAMA_FTYPE_MOSTLY_BF16; break;
            case GGML_TYPE_Q4_0: ftype = LLAMA_FTYPE_MOSTLY_Q4_0; break;
            case GGML_TYPE_Q4_1: ftype = LLAMA_FTYPE_MOSTLY_Q4_1; break;
            case GGML_TYPE_Q5_0: ftype = LLAMA_FTYPE_MOSTLY_Q5_0; break;
            case GGML_TYPE_Q5_1: ftype = LLAMA_FTYPE_MOSTLY_Q5_1; break;
            case GGML_TYPE_Q8_0: ftype = LLAMA_FTYPE_MOSTLY_Q8_0; break;
            case GGML_TYPE_Q2_K: ftype = LLAMA_FTYPE_MOSTLY_Q2_K; break;
            case GGML_TYPE_Q3_K: ftype = LLAMA_FTYPE_MOSTLY_Q3_K_M; break;
            case GGML_TYPE_Q4_K: ftype = LLAMA_FTYPE_MOSTLY_Q4_K_M; break;
            case GGML_TYPE_Q5_K: ftype = LLAMA_FTYPE_MOSTLY_Q5_K_M; break;
            case GGML_TYPE_Q6_K: ftype = LLAMA_FTYPE_MOSTLY_Q6_K; break;
            case GGML_TYPE_TQ1_0: ftype = LLAMA_FTYPE_MOSTLY_TQ1_0; break;
            case GGML_TYPE_TQ2_0: ftype = LLAMA_FTYPE_MOSTLY_TQ2_0; break;
            case GGML_TYPE_IQ2_XXS: ftype = LLAMA_FTYPE_MOSTLY_IQ2_XXS; break;
            case GGML_TYPE_IQ2_XS: ftype = LLAMA_FTYPE_MOSTLY_IQ2_XS; break;
            case GGML_TYPE_IQ2_S: ftype = LLAMA_FTYPE_MOSTLY_IQ2_S; break;
            case GGML_TYPE_IQ3_XXS: ftype = LLAMA_FTYPE_MOSTLY_IQ3_XXS; break;
            case GGML_TYPE_IQ1_S: ftype = LLAMA_FTYPE_MOSTLY_IQ1_S; break;
            case GGML_TYPE_IQ1_M: ftype = LLAMA_FTYPE_MOSTLY_IQ1_M; break;
            case GGML_TYPE_IQ4_NL: ftype = LLAMA_FTYPE_MOSTLY_IQ4_NL; break;
            case GGML_TYPE_IQ4_XS: ftype = LLAMA_FTYPE_MOSTLY_IQ4_XS; break;
            case GGML_TYPE_IQ3_S: ftype = LLAMA_FTYPE_MOSTLY_IQ3_S; break;
            case GGML_TYPE_NVFP4: ftype = LLAMA_FTYPE_MOSTLY_NVFP4; break;
            case GGML_TYPE_Q1_0: ftype = LLAMA_FTYPE_MOSTLY_Q1_0; break;
            default:
                {
                    LLAMA_LOG_WARN("%s: unknown type %s\n", __func__, ggml_type_name(type_max));
                    ftype = LLAMA_FTYPE_ALL_F32;
                } break;
        }

        // this is a way to mark that we have "guessed" the file type
        ftype = (llama_ftype) (ftype | LLAMA_FTYPE_GUESSED);

        {
            uint32_t ftype_val = 0;
            if (get_key(LLM_KV_GENERAL_FILE_TYPE, ftype_val, false)) {
                ftype = (llama_ftype) ftype_val;
            }
        }

        LLAMA_LOG_INFO("%s: Dumping metadata keys/values. Note: KV overrides do not apply in this output.\n", __func__);

        for (int i = 0; i < n_kv; i++) {
            const char * name = gguf_get_key(metadata, i);
            const enum gguf_type type = gguf_get_kv_type(metadata, i);
            const std::string type_name =
                type == GGUF_TYPE_ARRAY
                ? format("%s[%s,%zu]", gguf_type_name(type), gguf_type_name(gguf_get_arr_type(metadata, i)), gguf_get_arr_n(metadata, i))
                : gguf_type_name(type);

            std::string value = gguf_kv_to_str(metadata, i);
            const size_t MAX_VALUE_LEN = 40;
            if (value.size() > MAX_VALUE_LEN) {
                value = format("%s...", value.substr(0, MAX_VALUE_LEN - 3).c_str());
            }
            replace_all(value, "\n", "\\n");

            LLAMA_LOG_INFO("%s: - kv %3d: %42s %-16s = %s\n", __func__, i, name, type_name.c_str(), value.c_str());
        }

        // print type counts
        for (auto & kv : n_type) {
            if (kv.second == 0) {
                continue;
            }

            LLAMA_LOG_INFO("%s: - type %4s: %4d tensors\n", __func__, ggml_type_name(kv.first), kv.second);
        }
    }

    if (!llama_mmap::SUPPORTED) {
        LLAMA_LOG_WARN("%s: mmap is not supported on this platform\n", __func__);
        use_mmap = false;
    }

    this->use_mmap = use_mmap;
    this->use_direct_io = use_direct_io;
    this->check_tensors = check_tensors;
    this->no_alloc = no_alloc;
}

std::string llama_model_loader::get_arch_name() const {
    return arch_name;
}

enum llm_arch llama_model_loader::get_arch() const {
    return llm_kv.arch;
}

const llama_model_loader::llama_tensor_weight * llama_model_loader::get_weight(const char * name) const {
    auto pos = weights_map.find(name);
    if (pos != weights_map.end()) {
        return &pos->second;
    }

    return nullptr;
}

const llama_model_loader::llama_tensor_weight & llama_model_loader::require_weight(const char * name) const {
    const llama_tensor_weight * weight = get_weight(name);
    if (!weight) {
        throw std::runtime_error(format("%s: tensor '%s' not found", __func__, name));
    }
    return *weight;
}

struct ggml_tensor * llama_model_loader::get_tensor_meta(const char * name) const {
    const auto * weight = get_weight(name);
    if (!weight) {
        return nullptr;
    }
    return weight->tensor;
}

struct ggml_tensor * llama_model_loader::require_tensor_meta(const std::string & name) const {
    struct ggml_tensor * tensor = get_tensor_meta(name.c_str());
    if (!tensor) {
        throw std::runtime_error(format("%s: tensor '%s' not found", __func__, name.c_str()));
    }
    return tensor;
}

const struct ggml_tensor * llama_model_loader::check_tensor_dims(const std::string & name, const std::vector<int64_t> & ne, bool required) const {
    const struct ggml_tensor * cur = get_tensor_meta(name.c_str());

    if (cur == NULL) {
        if (!required) {
            return NULL;
        }
        throw std::runtime_error(format("%s: tensor '%s' not found", __func__, name.c_str()));
    }

    {
        bool is_ok = true;
        for (size_t i = 0; i < GGML_MAX_DIMS; ++i) {
            if ((i < ne.size() && ne[i] != cur->ne[i]) || (i >= ne.size() && cur->ne[i] != 1)) {
                is_ok = false;
                break;
            }
        }
        if (!is_ok) {
            throw std::runtime_error(
                    format("%s: tensor '%s' has wrong shape; expected %s, got %s",
                        __func__, name.c_str(),
                        llama_format_tensor_shape(ne).c_str(),
                        llama_format_tensor_shape(cur).c_str()));
        }
    }

    return cur;
}

// checks if the weight tensor can be used with the specified buffer type and device
static bool weight_buft_supported(const llama_hparams & hparams, ggml_tensor * w, ggml_op op, ggml_backend_buffer_type_t buft, ggml_backend_dev_t dev) {
    GGML_ASSERT(w != nullptr);

    if (op == GGML_OP_NONE) {
        return true;
    }

    ggml_init_params params = {
        /*.mem_size =*/ ggml_tensor_overhead()*8,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc =*/ true,
    };
    ggml_context_ptr ctx_ptr { ggml_init(params) };
    if (!ctx_ptr) {
        throw std::runtime_error(format("failed to create ggml context"));
    }
    ggml_context * ctx = ctx_ptr.get();

    ggml_tensor * op_tensor = nullptr;

    switch (op) {
        case GGML_OP_GET_ROWS:
            {
                ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 512);
                op_tensor = ggml_get_rows(ctx, w, b);
            } break;
        case GGML_OP_MUL_MAT:
            {
                ggml_tensor * b = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, w->ne[0], 512, w->ne[2], w->ne[3]);
                op_tensor = ggml_mul_mat(ctx, w, b);
            } break;
        case GGML_OP_MUL_MAT_ID:
            {
                const int n_expert_used = hparams.n_expert_used;
                GGML_ASSERT(n_expert_used > 0);
                ggml_tensor * b = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, w->ne[0], n_expert_used, 512);
                ggml_tensor * ids = ggml_new_tensor_2d(ctx, GGML_TYPE_I32, n_expert_used, 512);
                op_tensor = ggml_mul_mat_id(ctx, w, b, ids);
            } break;
        case GGML_OP_ADD:
            {
                ggml_tensor * a = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, w->ne[0], w->ne[1], w->ne[2], w->ne[3]);
                op_tensor = ggml_add(ctx, a, w);
            } break;
        case GGML_OP_ADD_ID:
            {
                const int n_expert_used = hparams.n_expert_used;
                GGML_ASSERT(n_expert_used > 0);
                ggml_tensor * a = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, w->ne[0], n_expert_used, 512);
                ggml_tensor * c = ggml_new_tensor_2d(ctx, GGML_TYPE_I32, n_expert_used, 512);
                op_tensor = ggml_add_id(ctx, a, w, c);
            } break;
        case GGML_OP_MUL:
            {
                ggml_tensor * a = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, w->ne[0], w->ne[1], w->ne[2], w->ne[3]);
                op_tensor = ggml_mul(ctx, a, w);
            } break;
        case GGML_OP_DIV:
            {
                ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, w->ne[0]);
                op_tensor = ggml_div(ctx, a, w);
            } break;
        case GGML_OP_ROPE:
            {
                const int n_embd_head = hparams.n_embd_head_v();
                const int n_head = hparams.n_head();
                ggml_tensor * a = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, n_embd_head, n_head, 512);
                ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 512);
                op_tensor = ggml_rope_ext(
                    ctx, a, b, w,
                    0, 0, 0, 0, 0,
                    0, 0, 0, 0
                );

            } break;
        case GGML_OP_SSM_CONV:
            {
                const int64_t n_seq_tokens = 512;
                const int64_t n_seqs = 3;
                ggml_tensor * conv_x = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, w->ne[0] - 1 + n_seq_tokens, w->ne[1], n_seqs);
                op_tensor = ggml_ssm_conv(ctx, conv_x, w);
            } break;
        case GGML_OP_SSM_SCAN:
            {
                // w is ssm_a, which is used to distinguish Mamba-1 and Mamba-2
                const int64_t d_state = w->ne[0] == 1 ? hparams.ssm_d_state : w->ne[0];
                const int64_t n_head = w->ne[1];
                const int64_t head_dim = hparams.ssm_d_inner / n_head;
                const int64_t n_group = hparams.ssm_n_group ? hparams.ssm_n_group : 1;
                const int64_t n_seq_tokens = 512;
                const int64_t n_seqs = 3;
                ggml_tensor * s = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, d_state, head_dim, n_head, n_seqs);
                ggml_tensor * x = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, head_dim, n_head, n_seq_tokens, n_seqs);
                ggml_tensor * dt = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, n_head, n_seq_tokens, n_seqs);
                ggml_tensor * B = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, d_state, n_group, n_seq_tokens, n_seqs);
                ggml_tensor * C = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, d_state, n_group, n_seq_tokens, n_seqs);
                ggml_tensor * ids = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n_seqs);
                op_tensor = ggml_ssm_scan(ctx, s, x, dt, w, B, C, ids);
            } break;
        case GGML_OP_RWKV_WKV6:
            {
                // FIXME
                const int64_t S = 123;
                const int64_t H = 123;
                const int64_t n_tokens = 123;
                const int64_t n_seqs = 123;
                ggml_tensor * k = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, S, H, n_tokens);
                ggml_tensor * v = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, S, H, n_tokens);
                ggml_tensor * r = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, S, H, n_tokens);
                ggml_tensor * tf = w;
                ggml_tensor * td = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, S, H, n_tokens);
                ggml_tensor * state = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, S, n_seqs, S, H);
                op_tensor = ggml_rwkv_wkv6(ctx, k, v, r, tf, td, state);
            } break;
        case GGML_OP_IM2COL:
            {
                const int n_embd_inp = hparams.n_embd_inp();
                ggml_tensor * b = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, n_embd_inp, w->ne[1], 1, 1);
                op_tensor = ggml_im2col(ctx, w, b, 1, 0, 0, 0, 1, 0, false, GGML_TYPE_F16);
            } break;
        case GGML_OP_SCALE:
            {
                op_tensor = ggml_scale(ctx, w, 1.0f);
            } break;
        default:
            GGML_ABORT("%s: missing test for op %s for tensor %s", __func__, ggml_op_name(op), w->name);
    }

    // create a temporary dummy buffer for the weight so that supports_op can check the buffer type
    GGML_ASSERT(w->buffer == nullptr);
    w->buffer = ggml_backend_buft_alloc_buffer(buft, 0);
    bool op_supported = ggml_backend_dev_supports_op(dev, op_tensor);
    ggml_backend_buffer_free(w->buffer);
    w->buffer = nullptr;

    return op_supported;
}

// find the first buffer type in the list that can use the tensor
static ggml_backend_buffer_type_t select_weight_buft(const llama_hparams & hparams, ggml_tensor * tensor, ggml_op op, const buft_list_t * buft_list) {
    GGML_ASSERT(!buft_list->empty());
    for (const auto & cur : *buft_list) {
        ggml_backend_dev_t cur_dev = cur.first;
        ggml_backend_buffer_type_t cur_buft = cur.second;
        if (weight_buft_supported(hparams, tensor, op, cur_buft, cur_dev)) {
            return cur_buft;
        }
    }

    return nullptr;
}

struct ggml_tensor * llama_model_loader::create_tensor(
|
|
const llama_hparams & hparams, const buft_list_t * buft_list_cpu, const buft_list_t * buft_list_input, const buft_list_t * buft_list_output,
|
|
const buft_list_t * buft_list_layer, const LLM_TN_IMPL & tn, const std::initializer_list<int64_t> & ne, int flags) {
|
|
auto ctx_for_buft = [&](ggml_backend_buffer_type_t buft) -> ggml_context * {
|
|
auto it = ctx_map.find(buft);
|
|
if (it == ctx_map.end()) {
|
|
// one ggml context per buffer type
|
|
int max_n_tensors = n_tensors;
|
|
max_n_tensors += 1; // duplicated output tensor
|
|
max_n_tensors += hparams.n_layer()*2; // duplicated rope freq tensors
|
|
if (files.empty()) {
|
|
max_n_tensors += hparams.n_layer()*256; // this should be well above what any model actually uses
|
|
}
|
|
const size_t ctx_size = ggml_tensor_overhead()*max_n_tensors;
|
|
|
|
ggml_init_params params = {
|
|
/*.mem_size =*/ ctx_size,
|
|
/*.mem_buffer =*/ NULL,
|
|
/*.no_alloc =*/ true,
|
|
};
|
|
|
|
ggml_context * ctx = ggml_init(params);
|
|
if (!ctx) {
|
|
throw std::runtime_error(format("failed to create ggml context"));
|
|
}
|
|
|
|
ctx_map.emplace(buft, ctx);
|
|
|
|
return ctx;
|
|
}
|
|
return it->second.get();
|
|
};

    auto buft_for_tensor = [&](ggml_tensor * t_meta) -> ggml_backend_buffer_type_t {
        if (!t_meta) {
            if (flags & TENSOR_NOT_REQUIRED) {
                return nullptr;
            }
            throw std::runtime_error(format("missing tensor '%s'", tn.str().c_str()));
        }

        // some models use the token embedding tensor as the output, but since these are used in different layers and with different ops
        // the tensor is duplicated
        // to handle this, we check if the tensor is duplicated, and if so, we assume that it is being loaded as the output tensor
        llm_tensor tn_tensor = tn.tensor;
        if (tn.tensor == LLM_TENSOR_TOKEN_EMBD && (flags & TENSOR_DUPLICATED)) {
            tn_tensor = LLM_TENSOR_OUTPUT;
        }

        llm_tensor_info info;
        try {
            info = llm_tensor_info_for(tn_tensor);
        } catch (const std::out_of_range & e) {
            throw std::runtime_error(format("missing tensor info mapping for %s", tn.str().c_str()));
        }

        // skip unused tensors
        if (info.op == GGML_OP_NONE || (flags & TENSOR_SKIP)) {
            const size_t nbytes = ggml_nbytes(t_meta);
            LLAMA_LOG_WARN("model has unused tensor %s (size = %zu bytes) -- ignoring\n", tn.str().c_str(), nbytes);

            size_data -= nbytes;
            n_created++;

            return nullptr;
        }

        // tensors with "bias" suffix are always used with GGML_OP_ADD or GGML_OP_ADD_ID
        ggml_op op;
        bool bias = tn.suffix != nullptr && strcmp(tn.suffix, "bias") == 0;
        if (bias) {
            if (info.op == GGML_OP_MUL_MAT_ID) {
                op = GGML_OP_ADD_ID;
            } else {
                op = GGML_OP_ADD;
            }
        } else {
            op = info.op;
        }

        // sanity checks
        if (info.layer == LLM_TENSOR_LAYER_INPUT || info.layer == LLM_TENSOR_LAYER_OUTPUT) {
            if (tn.bid != -1) {
                GGML_ABORT("input/output layer tensor %s used with a layer number", tn.str().c_str());
            }
        } else {
            if (tn.bid == -1) {
                GGML_ABORT("repeating layer tensor %s used without a layer number", tn.str().c_str());
            }
        }

        // select the buffer type for this tensor
        const buft_list_t * buft_list;
        switch (info.layer) {
            case LLM_TENSOR_LAYER_INPUT:
                buft_list = buft_list_input;
                break;
            case LLM_TENSOR_LAYER_OUTPUT:
                buft_list = buft_list_output;
                break;
            case LLM_TENSOR_LAYER_REPEATING:
                GGML_ASSERT(buft_list_layer != nullptr);
                buft_list = buft_list_layer;
                break;
            default:
                GGML_ABORT("invalid layer %d for tensor %s", info.layer, tn.str().c_str());
        }

        ggml_backend_buffer_type_t buft = nullptr;

        // check overrides
        if (tensor_buft_overrides) {
            std::string tensor_name = tn.str();
            for (const auto * overrides = tensor_buft_overrides; overrides->pattern != nullptr; ++overrides) {
                std::regex pattern(overrides->pattern);
                if (std::regex_search(tensor_name, pattern)) {
                    if (overrides->buft == ggml_backend_cpu_buffer_type()) {
                        // when overriding to a CPU buffer, consider the extra buffer types
                        buft = select_weight_buft(hparams, t_meta, op, buft_list_cpu);
                        if (use_mmap) {
                            static std::once_flag once;
                            std::call_once(once, [] {
                                LLAMA_LOG_WARN("llama_model_loader: tensor overrides to CPU are used with mmap enabled - consider using --no-mmap for better performance\n");
                            });
                        }
                    } else {
                        buft = overrides->buft;
                    }

                    LLAMA_LOG_DEBUG("tensor %s (%zu MiB %s) buffer type overridden to %s\n",
                            tensor_name.c_str(),
                            ggml_nbytes(t_meta) / 1024 / 1024, ggml_type_name(t_meta->type),
                            ggml_backend_buft_name(buft));
                    break;
                }
            }
        }

        if (!buft) {
            buft = select_weight_buft(hparams, t_meta, op, buft_list);
            if (!buft) {
                throw std::runtime_error(format("failed to find a compatible buffer type for tensor %s", tn.str().c_str()));
            }
        }

        // avoid using a host buffer when using mmap
        auto * buft_dev = ggml_backend_buft_get_device(buft);
        if (use_mmap && buft_dev && buft == ggml_backend_dev_host_buffer_type(buft_dev)) {
            auto * cpu_dev = ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_CPU);
            if (!cpu_dev) {
                throw std::runtime_error("no CPU backend found");
            }
            buft = ggml_backend_dev_buffer_type(cpu_dev);
        }

        if (buft != buft_list->front().second) {
            if (n_tensors_moved == 0) {
                first_tensor_moved_name = t_meta->name;
                first_tensor_moved_type_name = ggml_type_name(t_meta->type);
                first_moved_from_buft = buft_list->front().second;
                first_moved_to_buft = buft;
            }
            n_tensors_moved++;
        }

        return buft;
    };

    if (files.empty()) {
        if (flags & TENSOR_SKIP_IF_VIRTUAL) {
            return nullptr;
        }
        ggml_type type = GGML_TYPE_F32;
        const int64_t tid = gguf_find_tensor(metadata, tn.str().c_str());
        if (tid != -1) {
            type = gguf_get_tensor_type(metadata, tid);
        }

        // for tensors that are not required some of the dimensions can be invalid:
        if (flags & TENSOR_NOT_REQUIRED) {
            for (size_t dim = 0; dim < ne.size(); dim++) {
                if (ne.begin()[dim] <= 0) {
                    return nullptr;
                }
            }
        }

        ggml_tensor t_meta;
        memset(&t_meta, 0, sizeof(ggml_tensor));
        t_meta.type = type;
        for (size_t dim = 0; dim < GGML_MAX_DIMS; dim++) {
            t_meta.ne[dim] = dim < ne.size() ? ne.begin()[dim] : 1;
            GGML_ASSERT(t_meta.ne[dim] >= 1);
            t_meta.nb[dim] = dim == 0 ? ggml_type_size(type) : t_meta.ne[dim-1]*t_meta.nb[dim-1];
            GGML_ASSERT(t_meta.nb[dim] >= 1);
        }
        ggml_set_name(&t_meta, tn.str().c_str());

        ggml_backend_buffer_type_t buft = buft_for_tensor(&t_meta);
        GGML_ASSERT(buft != nullptr);
        ggml_context * ctx = ctx_for_buft(buft);
        ggml_tensor * ret = ggml_dup_tensor(ctx, &t_meta);
        ggml_set_name(ret, tn.str().c_str());
        return ret;
    }

    ggml_tensor * t_meta = get_tensor_meta(tn.str().c_str());
    ggml_backend_buffer_type_t buft = buft_for_tensor(t_meta);
    if (buft == nullptr) {
        return nullptr; // return type is ggml_tensor *
    }
    ggml_context * ctx = ctx_for_buft(buft);

    // if duplicated, check if the original tensor was allocated in the same buffer type context and avoid creating a new one
    if (flags & TENSOR_DUPLICATED) {
        ggml_tensor * t = ggml_get_tensor(ctx, tn.str().c_str());
        if (t) {
            return t;
        }
    }

    LLAMA_LOG_DEBUG("%s: loading tensor %s\n", __func__, tn.str().c_str());
    const struct ggml_tensor * cur = check_tensor_dims(tn.str(), ne, !(flags & TENSOR_NOT_REQUIRED));

    if (cur == NULL) {
        return NULL;
    }

    const bool duplicated = flags & TENSOR_DUPLICATED;

    struct ggml_tensor * tensor = ggml_dup_tensor(ctx, cur);
    ggml_set_name(tensor, ggml_get_name(cur));

    if (duplicated) {
        size_data += ggml_nbytes(cur);
    } else {
        n_created++;
    }

    return tensor;
}
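
// Illustrative sketch, not part of llama.cpp: when no files are loaded, create_tensor
// above builds placeholder metadata by padding the requested shape to the maximum
// number of dimensions and deriving byte strides from it. The helper below reproduces
// that arithmetic in isolation. The names shape_info, make_contiguous_shape and
// k_max_dims are hypothetical, and the stride formula assumes an element type whose
// block size is 1 (for example F32), which is what the loop above also assumes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>

// Hypothetical stand-in for GGML_MAX_DIMS.
constexpr size_t k_max_dims = 4;

// Hypothetical container for the two arrays the loader fills in.
struct shape_info {
    std::array<int64_t, k_max_dims> ne; // elements per dimension
    std::array<size_t,  k_max_dims> nb; // stride in bytes per dimension
};

// Pad `ne` with 1s up to k_max_dims and compute contiguous byte strides.
inline shape_info make_contiguous_shape(std::initializer_list<int64_t> ne, size_t type_size) {
    assert(type_size > 0);
    shape_info s{};
    for (size_t dim = 0; dim < k_max_dims; dim++) {
        s.ne[dim] = dim < ne.size() ? ne.begin()[dim] : 1;
        assert(s.ne[dim] >= 1);
        // The first stride is the element size; each later stride is the
        // previous stride times the previous dimension's extent.
        s.nb[dim] = dim == 0 ? type_size : (size_t) s.ne[dim - 1] * s.nb[dim - 1];
    }
    return s;
}
```

// For a {4096, 32} shape with 4-byte elements this yields ne = {4096, 32, 1, 1}
// and nb = {4, 16384, 524288, 524288}.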

struct ggml_tensor * llama_model_loader::create_tensor_as_view(struct ggml_context * ctx, struct ggml_tensor * base, const std::string & name, const std::initializer_list<int64_t> & ne, size_t offset, bool required) {
    const struct ggml_tensor * cur = check_tensor_dims(name, ne, required);

    if (cur == NULL) {
        return NULL;
    }

    if (cur->type != base->type) {
        throw std::runtime_error(format("%s: tensor '%s' has wrong type; expected %s, got %s", __func__, name.c_str(), ggml_type_name(base->type), ggml_type_name(cur->type)));
    }

    std::array<int64_t, GGML_MAX_DIMS> dims;
    for (size_t i = 0; i < GGML_MAX_DIMS; ++i) {
        dims[i] = i < ne.size() ? ne.begin()[i] : 1;
    }

    struct ggml_tensor * tensor = ggml_view_4d(ctx, base,
                                dims[0], dims[1], dims[2], dims[3],
                                cur->nb[1], cur->nb[2], cur->nb[3],
                                offset);

    ggml_set_name(tensor, name.c_str());

    n_created++;

    return tensor;
}

void llama_model_loader::done_getting_tensors(bool partial) const {
    if (n_created > n_tensors) {
        throw std::runtime_error(format("%s: too many tensors created; expected %d, got %d", __func__, n_tensors, n_created));
    }
    if (n_created < n_tensors) {
        if (!partial) {
            throw std::runtime_error(format("%s: wrong number of tensors; expected %d, got %d", __func__, n_tensors, n_created));
        }
        LLAMA_LOG_INFO("%s: partial load — used %d of %d tensors in the file (rest belong to a sibling model on the same .gguf)\n",
                __func__, n_created, n_tensors);
    }
    if (n_tensors_moved > 0) {
        LLAMA_LOG_DEBUG("%s: tensor '%s' (%s) (and %zu others) cannot be used with preferred buffer type %s, using %s instead\n",
            __func__, first_tensor_moved_name.c_str(), first_tensor_moved_type_name.c_str(), n_tensors_moved - 1,
            ggml_backend_buft_name(first_moved_from_buft), ggml_backend_buft_name(first_moved_to_buft));
    }
}

void llama_model_loader::init_mappings(bool prefetch, llama_mlocks * mlock_mmaps) {
    if (use_mmap) {
        mappings.reserve(files.size());
        mmaps_used.reserve(files.size());
        for (const auto & file : files) {
            bool is_numa = false;

            auto * dev = ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_CPU);
            if (dev) {
                auto * reg = ggml_backend_dev_backend_reg(dev);
                auto * is_numa_fn = (decltype(ggml_is_numa) *) ggml_backend_reg_get_proc_address(reg, "ggml_backend_cpu_is_numa");
                if (is_numa_fn) {
                    is_numa = is_numa_fn();
                }
            }

            std::unique_ptr<llama_mmap> mapping = std::make_unique<llama_mmap>(file.get(), prefetch ? -1 : 0, is_numa);
            mmaps_used.emplace_back(mapping->size(), 0);
            if (mlock_mmaps) {
                std::unique_ptr<llama_mlock> mlock_mmap(new llama_mlock());
                mlock_mmap->init(mapping->addr());
                mlock_mmaps->emplace_back(std::move(mlock_mmap));
            }
            mappings.emplace_back(std::move(mapping));
        }
    }

    // compute the total size of all tensors for progress reporting
    for (const auto & it : weights_map) {
        size_data += ggml_nbytes(it.second.tensor);
    }
}

void llama_model_loader::get_mapping_range(size_t * first, size_t * last, void ** addr, int idx, ggml_context * ctx) const {
    GGML_ASSERT(!mappings.empty());
    const auto & mapping = mappings.at(idx);

    *first = mapping->size();
    *last  = 0;
    *addr = mapping->addr();
    for (ggml_tensor * tensor = ggml_get_first_tensor(ctx); tensor; tensor = ggml_get_next_tensor(ctx, tensor)) {
        const auto * weight = get_weight(ggml_get_name(tensor));
        if (!weight || weight->idx != idx) {
            continue;
        }
        *first = std::min(*first, weight->offs);
        *last  = std::max(*last,  weight->offs + ggml_nbytes(tensor));
    }
}
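
// Illustrative sketch, not part of llama.cpp: get_mapping_range above finds the
// smallest byte range of one mapped file that covers every tensor stored in it,
// starting from the empty range (first = mapping size, last = 0) and widening it
// per tensor. The standalone helper below shows the same min/max sweep over plain
// data. The names weight_span and used_range are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical record: which file a tensor lives in, where, and how large it is.
struct weight_span {
    int    idx;    // index of the file / mapping
    size_t offs;   // byte offset of the tensor data inside that file
    size_t nbytes; // size of the tensor data in bytes
};

// Return [first, last) covering all spans that belong to mapping `idx`.
// If no span matches, the result is the empty range {mapping_size, 0}.
inline std::pair<size_t, size_t> used_range(const std::vector<weight_span> & spans, int idx, size_t mapping_size) {
    size_t first = mapping_size;
    size_t last  = 0;
    for (const auto & w : spans) {
        if (w.idx != idx) {
            continue; // tensor is stored in a different file
        }
        assert(w.offs + w.nbytes <= mapping_size);
        first = std::min(first, w.offs);
        last  = std::max(last,  w.offs + w.nbytes);
    }
    return {first, last};
}
```

// Starting from the empty range means a mapping with no matching tensors reports
// first > last, which a caller can use to detect that nothing in the file is used.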

void llama_model_loader::load_data_for(struct ggml_tensor * cur) const {
    const auto & w = require_weight(ggml_get_name(cur));

    if (use_mmap) {
        const auto & mapping = mappings.at(w.idx);
        if (cur->data == nullptr) {
            cur->data = (uint8_t *)mapping->addr() + w.offs;
        } else {
            memcpy(cur->data, (uint8_t *)mapping->addr() + w.offs, ggml_nbytes(cur));
        }
    } else {
        GGML_ASSERT(cur->data != nullptr);
        GGML_ASSERT(w.idx < files.size());
        const auto & file = files.at(w.idx);
        file->seek(w.offs, SEEK_SET);
        file->read_raw(cur->data, ggml_nbytes(cur));
    }

    if (check_tensors && !ggml_validate_row_data(cur->type, cur->data, ggml_nbytes(cur))) {
        throw std::runtime_error(format("tensor '%s' has invalid data", ggml_get_name(cur)));
    }
}

bool llama_model_loader::load_all_data(
        struct ggml_context * ctx,
        llama_buf_map & bufs,
        llama_mlocks * lmlocks,
        llama_progress_callback progress_callback,
        void * progress_callback_user_data) {
    if (files.empty()) {
        for (ggml_tensor * t = ggml_get_first_tensor(ctx); t != nullptr; t = ggml_get_next_tensor(ctx, t)) {
            set_tensor_data(t, set_tensor_data_ud);
        }
        return true;
    }
    GGML_ASSERT(size_data != 0 && "call init_mappings() first");

    std::vector<no_init<uint8_t>> read_buf;
    std::vector<std::future<std::pair<ggml_tensor *, bool>>> validation_result;

    // 4 staging buffers for async uploads, each sized 1MB seems to be a good default for single NVMe drives.
    // NVMe raid configurations might require more / larger buffers.
    constexpr size_t n_buffers = 4;

    size_t alignment = 1;
    for (const auto & file : files) {
        alignment = std::max(file->read_alignment(), alignment);
    }

    // Buffer size: balance between memory usage and I/O efficiency
    // 64MB works well for NVMe drives
    const size_t buffer_size = alignment != 1 ? 64 * 1024 * 1024 + 2 * alignment : 1 * 1024 * 1024;

    std::vector<ggml_backend_buffer_t> host_buffers;
    std::vector<ggml_backend_event_t> events;
    std::vector<void *> host_ptrs;
    size_t buffer_idx = 0; // buffer to use for async loads
    ggml_backend_t upload_backend = [&](const char * func) -> ggml_backend_t {
        if (use_mmap || check_tensors) {
            return nullptr;
        }
        // When not using mmaped io use async uploads from pinned memory to GPU memory.
        // First determine if the backend supports the necessary features for async uploads.
        auto * buf = bufs.count(0) ? bufs.at(0) : nullptr;
        if (!buf) {
            LLAMA_LOG_DEBUG("%s: no buffer found for async uploads\n", func);
            return nullptr;
        }

        auto * buft = ggml_backend_buffer_get_type(buf);
        auto * dev = ggml_backend_buft_get_device(buft);
        if (!dev) {
            LLAMA_LOG_DEBUG("%s: no device found for buffer type %s for async uploads\n", func,
                ggml_backend_buft_name(buft));
            return nullptr;
        }

        if (buft != ggml_backend_dev_buffer_type(dev)) {
            LLAMA_LOG_DEBUG("%s: buffer type %s is not the default buffer type for device %s for async uploads\n", func,
                ggml_backend_buft_name(buft), ggml_backend_dev_name(dev));
            return nullptr;
        }

        ggml_backend_dev_props props;
        ggml_backend_dev_get_props(dev, &props);
        if (!props.caps.async || !props.caps.host_buffer || !props.caps.events) {
            LLAMA_LOG_DEBUG("%s: device %s does not support async, host buffers or events\n", func,
                ggml_backend_dev_name(dev));
            return nullptr;
        }

        auto * host_buft = ggml_backend_dev_host_buffer_type(dev);
        if (!host_buft) {
            LLAMA_LOG_DEBUG("%s: no host buffer type found for device %s\n", func,
                ggml_backend_dev_name(dev));
            return nullptr;
        }

        // If the backend is supported, create pinned memory buffers and events for synchronisation.
        for (size_t idx = 0; idx < n_buffers; ++idx) {
            auto * buf = ggml_backend_buft_alloc_buffer(host_buft, buffer_size);

            if (!buf) {
                LLAMA_LOG_DEBUG("%s: failed to allocate host buffer for async uploads for device %s\n", func,
                    ggml_backend_dev_name(dev));
                return nullptr;
            }

            host_buffers.emplace_back(buf);
            host_ptrs.emplace_back(ggml_backend_buffer_get_base(buf));

            auto * event = ggml_backend_event_new(dev);
            if (!event) {
                LLAMA_LOG_DEBUG("%s: failed to create event for async uploads for device %s\n", func,
                    ggml_backend_dev_name(dev));
                return nullptr;
            }

            events.emplace_back(event);
        }

        ggml_backend_t backend = ggml_backend_dev_init(dev, nullptr);
        if (!backend) {
            LLAMA_LOG_DEBUG("%s: failed to initialize backend for device %s for async uploads\n", func,
                ggml_backend_dev_name(dev));
            return nullptr;
        }

        return backend;
    }(__func__);

    if (upload_backend) {
        LLAMA_LOG_DEBUG("%s: using async uploads for device %s, buffer type %s, backend %s\n", __func__,
            ggml_backend_dev_name(ggml_backend_get_device(upload_backend)),
            ggml_backend_buft_name(ggml_backend_buffer_get_type(bufs.at(0))),
            ggml_backend_name(upload_backend));
    }

    for (struct ggml_tensor * cur = ggml_get_first_tensor(ctx); cur != NULL; cur = ggml_get_next_tensor(ctx, cur)) {
        const auto * weight = get_weight(ggml_get_name(cur));
        if (weight == nullptr) {
            // this can happen with split experts models
            continue;
        }

        if (progress_callback) {
            if (!progress_callback((float) size_done / size_data, progress_callback_user_data)) {
                return false;
            }
        }

        size_t n_size = ggml_nbytes(cur);

        if (use_mmap) {
            const auto & mapping = mappings.at(weight->idx);
            ggml_backend_buffer_t buf_mmap = nullptr;
            if (bufs.count(weight->idx)) {
                buf_mmap = bufs.at(weight->idx);
            }
            uint8_t * data = (uint8_t *) mapping->addr() + weight->offs;

            if (check_tensors) {
                validation_result.emplace_back(std::async(std::launch::async, [cur, data, n_size] {
                    return std::make_pair(cur, ggml_validate_row_data(cur->type, data, n_size));
                }));
            }

            GGML_ASSERT(buf_mmap || cur->data); // either we have a buffer to allocate the tensor in, or it is already allocated
            if (buf_mmap && cur->data == nullptr) {
                ggml_backend_tensor_alloc(buf_mmap, cur, data);
                if (lmlocks) {
                    const auto & lmlock = lmlocks->at(weight->idx);
                    lmlock->grow_to(weight->offs + n_size);
                }

                auto & mmap_used = mmaps_used[weight->idx];
                mmap_used.first  = std::min(mmap_used.first,  weight->offs);
                mmap_used.second = std::max(mmap_used.second, weight->offs + n_size);
            } else {
                ggml_backend_tensor_set(cur, data, 0, n_size);
            }
        } else {
            const auto & file = files.at(weight->idx);

            if (ggml_backend_buffer_is_host(cur->buffer)) {
                file->seek(weight->offs, SEEK_SET);
                file->read_raw(cur->data, n_size);
                if (check_tensors) {
                    validation_result.emplace_back(std::async(std::launch::async, [cur, n_size] {
                        return std::make_pair(cur, ggml_validate_row_data(cur->type, cur->data, n_size));
                    }));
                }
            } else {
                // If upload_backend is valid load the tensor in chunks to pinned memory and upload the buffers asynchronously to the GPU.
                if (upload_backend) {
                    size_t offset = weight->offs;
                    alignment = file->read_alignment();
                    size_t aligned_offset = offset & ~(alignment - 1);
                    size_t offset_from_alignment = offset - aligned_offset;
                    file->seek(aligned_offset, SEEK_SET);

                    // Calculate aligned read boundaries
                    size_t read_start = aligned_offset;
                    size_t read_end = (offset + n_size + alignment - 1) & ~(alignment - 1);

                    size_t bytes_read = 0;
                    size_t data_read = 0;  // Actual tensor data copied (excluding padding)

                    while (bytes_read < read_end - read_start) {
                        size_t read_size = std::min<size_t>(buffer_size, read_end - read_start - bytes_read);

                        // Align the destination pointer within the pinned buffer
                        uintptr_t ptr_dest_aligned = (reinterpret_cast<uintptr_t>(host_ptrs[buffer_idx]) + alignment - 1) & ~(alignment - 1);

                        // Wait for previous upload to complete before reusing buffer
                        ggml_backend_event_synchronize(events[buffer_idx]);

                        // Read aligned chunk from file
                        file->read_raw_unsafe(reinterpret_cast<void *>(ptr_dest_aligned), read_size);

                        // Calculate actual data portion (excluding alignment padding)
                        uintptr_t ptr_data = ptr_dest_aligned;
                        size_t data_to_copy = read_size;

                        // Skip alignment padding at start of first chunk
                        if (bytes_read == 0) {
                            ptr_data += offset_from_alignment;
                            data_to_copy -= offset_from_alignment;
                        }

                        // Trim alignment padding at end of last chunk
                        if (aligned_offset + bytes_read + read_size > offset + n_size) {
                            data_to_copy -= (read_end - (offset + n_size));
                        }

                        // Async upload actual data to GPU
                        ggml_backend_tensor_set_async(upload_backend, cur,
                                                      reinterpret_cast<void *>(ptr_data), data_read, data_to_copy);
                        ggml_backend_event_record(events[buffer_idx], upload_backend);

                        data_read += data_to_copy;
                        bytes_read += read_size;

                        ++buffer_idx;
                        buffer_idx %= n_buffers;
                    }
                } else {
                    read_buf.resize(n_size);
                    file->seek(weight->offs, SEEK_SET);
                    file->read_raw(read_buf.data(), n_size);
                    ggml_backend_tensor_set(cur, read_buf.data(), 0, n_size);
                    if (check_tensors && !ggml_validate_row_data(cur->type, read_buf.data(), n_size)) {
                        throw std::runtime_error(format("tensor '%s' has invalid data", ggml_get_name(cur)));
                    }
                }
            }
        }

        size_done += n_size;
    }

    // free temporary resources used for async uploads
    for (auto * event : events) {
        ggml_backend_event_synchronize(event);
        ggml_backend_event_free(event);
    }
    for (auto * buf : host_buffers) {
        ggml_backend_buffer_free(buf);
    }
    ggml_backend_free(upload_backend);

    // check validation results
    bool validation_failed = false;
    for (auto & future : validation_result) {
        auto result = future.get();
        if (!result.second) {
            LLAMA_LOG_ERROR("%s: tensor '%s' has invalid data\n", __func__, ggml_get_name(result.first));
            validation_failed = true;
        }
    }
    if (validation_failed) {
        throw std::runtime_error("found tensors with invalid data");
    }

    // check if this is the last call and do final cleanup
    if (size_done >= size_data) {
        // unmap offloaded tensors and metadata
        if (use_mmap) {
            for (uint32_t idx = 0; idx < mappings.size(); idx++) {
                const auto & mmap_used = mmaps_used.at(idx);
                auto & mapping = mappings.at(idx);
                mapping->unmap_fragment(0, mmap_used.first);
                if (mmap_used.second != 0) {
                    mapping->unmap_fragment(mmap_used.second, mapping->size());
                }
            }
        }
        if (progress_callback) {
            // Even though the model is done loading, we still honor
            // cancellation since we need to free allocations.
            return progress_callback(1.0f, progress_callback_user_data);
        }
    }

    return true;
}
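
// Illustrative sketch, not part of llama.cpp: the async upload path in load_all_data
// above reads a tensor through aligned, fixed-size chunks and trims the alignment
// padding from the first and last chunk before uploading. The helper below replays
// only that offset arithmetic so it can be checked without files or a GPU. The names
// read_chunk and plan_aligned_reads are hypothetical. It assumes `alignment` is a
// power of two and `buffer_size` is a non-zero multiple of `alignment`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one staged read.
struct read_chunk {
    size_t read_size; // bytes read from the file into the staging buffer
    size_t skip;      // leading padding bytes to skip inside the staging buffer
    size_t copy;      // tensor bytes to upload from this chunk
};

// Plan the chunked reads for a tensor stored at [offset, offset + n_size).
inline std::vector<read_chunk> plan_aligned_reads(size_t offset, size_t n_size, size_t alignment, size_t buffer_size) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    assert(buffer_size != 0 && buffer_size % alignment == 0);

    const size_t read_start = offset & ~(alignment - 1);                           // round down
    const size_t read_end   = (offset + n_size + alignment - 1) & ~(alignment - 1); // round up
    const size_t head_pad   = offset - read_start;

    std::vector<read_chunk> chunks;
    size_t bytes_read = 0;
    while (bytes_read < read_end - read_start) {
        const size_t read_size = std::min(buffer_size, read_end - read_start - bytes_read);
        size_t skip = 0;
        size_t copy = read_size;
        if (bytes_read == 0) {
            // first chunk: drop the padding in front of the tensor data
            skip  = head_pad;
            copy -= head_pad;
        }
        if (read_start + bytes_read + read_size > offset + n_size) {
            // last chunk: drop the padding behind the tensor data
            copy -= read_end - (offset + n_size);
        }
        chunks.push_back({read_size, skip, copy});
        bytes_read += read_size;
    }
    return chunks;
}
```

// For offset = 100, n_size = 1000, alignment = 64 and buffer_size = 512 the plan is
// three reads of 512, 512 and 64 bytes that upload 476, 512 and 12 tensor bytes,
// which sums to the 1000 bytes of the tensor.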

std::string llama_model_loader::ftype_name() const {
    return llama_model_ftype_name(ftype);
}

void llama_model_loader::print_info() const {
    LLAMA_LOG_INFO("%s: file format = %s\n", __func__, llama_file_version_name(fver));
    LLAMA_LOG_INFO("%s: file type   = %s\n", __func__, llama_model_ftype_name(ftype).c_str());
    if (n_bytes < GiB) {
        LLAMA_LOG_INFO("%s: file size   = %.2f MiB (%.2f BPW) \n", __func__, n_bytes/1024.0/1024.0,        n_bytes*8.0/n_elements);
    } else {
        LLAMA_LOG_INFO("%s: file size   = %.2f GiB (%.2f BPW) \n", __func__, n_bytes/1024.0/1024.0/1024.0, n_bytes*8.0/n_elements);
    }
}