mirror of
https://github.com/ggml-org/llama.cpp.git
synced 2026-06-28 15:20:20 +00:00
64086f2b2f
* feat(convert): Get language model conversion working for 4.1 vision Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(convert): Skip multimodal tensors for GraniteMoeHybrid (vision 4.0) Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Disable vocab padding for non-hybrid models that use GraniteMoeHybrid Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Plumb python-side vision projector names and mappings There are several awkward things here: 1. Most of these are essentially identical to the audio qformer tensors. On the c++ side, that's mapped using the prefix, so the rest of the GGUF name needs to align, but on the python side there's no prefix notion, so they all get duplicated. 2. There are a couple of net-new tensors for vision, in particular PROJ_NORM. In both speech and vision, the QF_PROJ_NORM is qualified as belonging to the qformer portion, but the GGUF name is simply proj_norm which conflicts with the ideal name for this new PROJ_NORM that is not qualified as part of the qformer. To get around this, I used "proj_layernorm" as the GGUF name. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add python side architecture name Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add python-side plumbing for setting FEATURE_LAYERS hparam Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side tensor naming defines NOTE: Usage of these hasn't been updated to include prefix yet Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(mtmd): Convert vision_feature_layer to an ordered vector We need to preserve the ordering of these feature index values so that they can be mapped to the sub-tensors within the stacked projectors. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(mtmd): Add architecture label plumbing Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(wip): Add partial conversion for mmproj This handles stacking the projector tensors and setting the new harams Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add gguf_writer and constant support for new hparams and deepstack layer arr Branch: Granite4Vision AI-usage: draft (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Full conversion for mmproj w/ tensor mappings Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add lm_head skip for mmproj for 4.0 Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: De-alias text_config architecture in convert_lora_to_gguf.py Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add --trust-remote-code arg to convert_lora_to_gguf.py This defaults to False, but allows a user to enable it programmaticly instead of using the interactive prompt. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: De-alias model.language_model. -> model. 
for lora adapters Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Extend language model tensor dealiasing in adapters Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary registration for GraniteSpeech in language model Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Plumb through mm prefix formatting for qformer tensors Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Refactor vision projector tensors to use predictor ID as the block This is cleaner than stacking them. The modeling file hard-codes single-layer qformers, so we can punt on the multiipule multi-layer projectors problem. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add spatial offests array hparam conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add stub plumbing for granite vision in mtmd Branch: Granite4Vision AI-usage: draft (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add new hparam and tensor naming in clip-impl.h New hparams: - KEY_PROJ_SAMPLE_QUERY_SIDE - KEY_PROJ_SAMPLE_WINDOW_SIDE - KEY_PROJ_SPATIAL_OFFSETS New tensors: - TN_MULTI_PROJ_IMG_POS - TN_MULTI_PROJ_QUERY - TN_MULTI_PROJ_LAYERNORM - TN_MULTI_PROJ_LINEAR - TN_MULTI_PROJ_NORM Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Move deepstack_layer_arr to llm hparam instead of mmproj Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove IS_DEEPSTACK_LAYERS This appears to have been added during Qwen3 VL (https://github.com/ggml-org/llama.cpp/pull/16780), but it was never actually used. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: n_deepstack_layers -> deepstack_layer_arr The old logic hard coded a correspondence between the first N layers of the LLM and the 1->N entries in the input embeddings. Now, that relationship is maintained at loading time if the GGUF value is single-valued. If it is multi-valued, it loads directly allowing for deepstack layers to be spaced out throughout the model. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use try/catch for single/multi valued deepstack info The alternative would be to use get_key_or_arr, but then the single value would be populated through the entire array and we'd need to detect that and update it with the right correspondence. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add deepstack injection point for granite LLM The use of ggml_add here assumes that the elements of inp_embd will be pre- arranged to be the full embedding length with only the vision-mask'ed portions non-zero from the projector. This matches how Qwen3VL does it. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: add missing vision attn layernorm eps Branch: Granite4Vision AI-usage: full (OpenCode + Qwen 3.6-35B) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Hoist qformer tensors into qf_block and hold a vector for multi-proj Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix missing prefix template for TN_QF_PROJ_LINEAR It's not strictly necessary since vision uses the blockwise version, but it makes the loading consistent. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add embedding scale and image grid pinpoints hparams in conversion Also remove dead parsing for self._deepstack_layer_arr Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add mtmd KEY_ section for hparams shared with the LLM In this case, we need the EMBEDDING_SCALE so we can unscale the image embeddings to compensate for applying embedding scale to the input embeddings Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Implement c++ hparam parsing Branch: Granite4Vision AI-usage: draft (Claude Code) Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Flatten pinpoints in conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing break Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: No reason to have modality prefix for img_pos Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add tensor loading Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(convert): Fix confusion between proj.norm and proj.qformer.layernorm Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use the right portion of speech for tensor loading! Also plumb through the layernorm -> post_norm naming change Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add logging of deepstack_layers_arr if set I also changed the print_f output type to int32_t to avoid printing overflow values for -1. This could cause overflows on the other side, but I can't imagine a value for any of the current array hparams that would trigger that. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Make sure input embeddings are cont before f_embedding_scale Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add init and mmproj_embd cases for g4v The n_mmproj_embd is 1+ to make space for the text embedding and all 8 projectors Branch: Granite4Vision AI-usage: draft (Bob) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Invert (h, w) -> (w, h) pinpoints Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Reorder projectors based on llm index and skip the first injection The multi-projector stack has a strange asymmetry based on how it's currently implemented for qwen3vl: on the mmproj side, it's all N projectors, but the output of the "first" (by inp_embd index) projector is automatically consumed as if it were a standard single-projector mmproj, so the deepstack portion needs to only contain the 1-N entries. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix mmproj hparams in conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix ordering/logic for deepstack injection in granite Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix preprocessing config to match what the model needs Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * wip: Partial port of Eli's implementation This is still pretty broken, but it's getting closer. It now happily generates tokens, but the values are quite incorrect still. 
I suspect it's caused by the mapping of projectors from safetensors to their respective orders here. Also, this implementation breaks encapsulation pretty badly in mtmd_encode. This will need a big refactor to put the G4V-specific encoding logic somewhere more appropriate. Branch: Granite4Vision AI-usage: draft (Claude Code, Bob) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix the pre-scaling on the input embeddings to correctly invert the scale We've got tokens! They still don't line up quite right, so something's a little off, but we're getting much closer now. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: invert embedding multiplier -> base_scale at load Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix setting image_resize_pad after new enum introduced Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add G4V to mmproj mapping in conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Re-add padding disable for non-hybrid hybrid models Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Simplify G4V n_tokens computation This is slightly more efficient and flexible for when we implement the unpad cropping. IMO, it's also clearer that it is adding the number of image_newline tokens (embeddings) to the grid, rather than recomputing the entire count. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add new clip APIs for post-tile-encoding assembly Granite 4 Vision uses llava-next style pack-and-unpad which requires injecting the learned newline after each row of the tile grid. 
A row here is a single row of the grid which is composed of (grid_x * cols_per_tile) * (grid_y * rows_per_tile), so the result is newlines injected in between individual tile rows, thus not something that can be handled with the standard llava-uhd block-wise endcoding. Branch: Granite4Vision AI-usage: draft (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add model interfaces for granite 4 vision assembler I'm on the fence about the best organization of this. These free functions allow the per-architecture logic in clip.cpp to access the model-specific graph building, but they still require a fair bit of model-specific logic in clip.cpp which is not ideal. I think a better approach may be to replicate what is done with the graph builders themselves (and possibly even make the assembler part of the model's existing graph builder). Branch: Granite4Vision AI-usage: full (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove all g4v-specific branching from mtmd.cpp in favor of clip assembler Branch: Granite4Vision AI-usage: full (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor(mtmd): Consolidate assembler logic into clip_assembler class family Just like `clip_graph` is the base class for building the model-specific encoder graphs, `clip_assembler` will be the base class for building the model-specific assembler graphs. This allows the assembly pattern to follow how the encoder pattern is implemented where the model-specific logic lives in a subclass co-located with the encoder graph builder that gets constructed by a simple factory method. 
Branch: Granite4Vision AI-usage: full (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Comment improvement Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: granite_vision -> granite4_vision Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove dead codepath for Qwen3VL add_vision_is_deepstack These pieces were never used on the c++ side (removed there in an earlier commit), so this is just cleanup that I missed before. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Oops! I did not mean to commit one of my prompt files But now it's too far back in history to effectively rebase out, even with interactive and --rebase-merges :( Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing <algorithm> include for std::find It seems that this was already pulled in on some platforms, but not on others Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix Flake8 warnings in granite conversion module Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove clip_assembler in favor of clip_image_f32.append_token Per conversation in the PR, the clip_assembler pattern was too invasive. This is a compromise that limits model-specific blocks to add_media where each preprocessed tile is annotated with an injection type, after which all the token counting logic is generic and the newline injection itself is handled in the graph based on the value for the given tile image. 
Branch: Granite4Vision AI-usage: draft (Bob, OpenCode + Qwen 3.6 35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor(convert): Split n_deepstack_layers and deepstack_layers (array) Branch: Granite4Vision AI-usage: full (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor(src): Handle n_deepstack_layers and deepstack_layers GGUF keys Branch: Granite4Vision AI-usage: draft (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix GGUF key for deepstack_layers_arr Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove pre-scaling embeddings and skip scaling for raw embd inputs This follows how gemma3 and gemma4 handle embedding scaling by skipping the multiplier for raw input embeddings. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: deepstack_layers(_arr) -> deepstack_mapping(_arr) Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Fully revert changes to n_deepstack_layers and qwen3vl* Since we're going to keep the GGUF KVs separate, it makes sense to just keep the hparams separate too to limit the scope of this branch. The down side is that n_deepstack_layers and deepstack_mapping_arr are potentially conflicting. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Revert removal of "is_deepstack_layers" GGUF KV This KV is not used at all on the c++ side, so it's fully dead, but there's also no need to conflate this cleanup with the addition of G4V. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary ggml_cont and build_forward_expand in cbx Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Clean up comments Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Tighter and more flexible code for g4v_build_block This could be refactored to look a lot more like granite-speech, but the overall block constructs before/after the qformer are pretty different, so for now I'm going to leave it as is and just tighten a bit. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary `unordered_set` include Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add architecture guard on deepstack_mapping_arr printout Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary AI-gen comment Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Always initialize deepstack_mapping_arr with -1 values This was causing `test-llama-archs` to fail, likely due to trying to save the uninitialized values, then re-loading them. It's safer to always initialize so that other models don't forget and end up with undefined behavior. 
Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Remove TODO about block/vs non-block tensor mapping Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Move is_vision_feature_layer logic into clip_hparams Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a bool for append_token Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Remove unnecessary comment Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unused get_model api yikes! Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rearrange helpers for g4v to be private members and use build_attn Branch: Granite4Vision AI-usage: full (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix off-by-one in vision layer index This was inherited from the Claude Code implementation that pushed the negative index inversion down into the model file. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix norm/post_norm mixup in conversion face. palm. 
:( Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: More descriptive tensor names Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Apply PR cleanup for new conversion changes AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix(convert): Remove duplicate V_ENC_EMBD_IMGNL Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: append_token -> add_newline Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Comment cleanup Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Cleaner error handling/checking NOTE: format_string is not available in granite.cpp (and including clip-impl.h to get it doesn't compile, so I think it violates the intended encapsulation), so std::stringstream is the simplest answer. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
1648 lines
65 KiB
C++
#include "clip.h"
#include "clip-impl.h"
#include "mtmd.h"
#include "mtmd-audio.h"
#include "mtmd-image.h"
#include "debug/mtmd-debug.h"

#include "llama.h"

// fix problem with std::min and std::max
#if defined(_WIN32)
#define WIN32_LEAN_AND_MEAN
#ifndef NOMINMAX
#   define NOMINMAX
#endif
#include <windows.h>
#endif

#include <algorithm>
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <climits>
#include <vector>

// represents raw image data, layout is RGBRGBRGB...
// length of data must be nx * ny * 3
struct mtmd_bitmap {
    uint32_t nx;
    uint32_t ny;
    std::vector<unsigned char> data;
    std::string id; // optional user-defined id, for ex: can be set to image hash, useful for KV cache tracking
    bool is_audio = false; // true if the bitmap is audio
};

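// Illustration only (not part of mtmd.cpp): a minimal sketch of the interleaved
// RGB layout described above. The names rgb_bitmap_sketch, rgb_expected_size,
// rgb_offset and make_solid are hypothetical, and row-major pixel order is an
// assumption; the comment above only states the RGBRGB... interleaving and the
// nx * ny * 3 length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the image case of mtmd_bitmap (not the real type).
struct rgb_bitmap_sketch {
    uint32_t nx = 0;                 // width in pixels
    uint32_t ny = 0;                 // height in pixels
    std::vector<unsigned char> data; // interleaved RGBRGB..., nx * ny * 3 bytes
};

// Buffer length required for an nx-by-ny interleaved RGB image.
inline std::size_t rgb_expected_size(uint32_t nx, uint32_t ny) {
    return static_cast<std::size_t>(nx) * ny * 3;
}

// Byte offset of channel c (0 = R, 1 = G, 2 = B) of pixel (x, y).
// Assumes rows are stored one after another (row-major).
inline std::size_t rgb_offset(const rgb_bitmap_sketch & bmp, uint32_t x, uint32_t y, uint32_t c) {
    assert(x < bmp.nx && y < bmp.ny && c < 3);
    return (static_cast<std::size_t>(y) * bmp.nx + x) * 3 + c;
}

// Builds an image in which every pixel has the same (r, g, b) value.
inline rgb_bitmap_sketch make_solid(uint32_t nx, uint32_t ny,
                                    unsigned char r, unsigned char g, unsigned char b) {
    rgb_bitmap_sketch bmp;
    bmp.nx = nx;
    bmp.ny = ny;
    bmp.data.resize(rgb_expected_size(nx, ny));
    for (std::size_t i = 0; i < bmp.data.size(); i += 3) {
        bmp.data[i + 0] = r;
        bmp.data[i + 1] = g;
        bmp.data[i + 2] = b;
    }
    return bmp;
}
```

// Under these assumptions a 4x2 image needs 4 * 2 * 3 = 24 bytes, and the blue
// channel of the last pixel (x = 3, y = 1) sits at offset (1 * 4 + 3) * 3 + 2 = 23.
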
// position indexing for decoder model
enum mtmd_pos_type {
    MTMD_POS_TYPE_NORMAL,    // number of positions equals the number of tokens
    MTMD_POS_TYPE_MROPE,     // qwen-vl mrope style, each image takes max(t,h,w) position indexes
    MTMD_POS_TYPE_HUNYUANVL, // HunyuanVL mrope + BOI/EOI/newline layout with XD-RoPE dim-3
};

struct mtmd_image_tokens {
    uint32_t nx; // number of tokens in x direction
    uint32_t ny; // number of tokens in y direction
    mtmd_pos_type pos = MTMD_POS_TYPE_NORMAL;
    uint32_t image_idx = 0; // 0-based position of this image among image chunks in the prompt (used when pos == MTMD_POS_TYPE_HUNYUANVL)
    uint32_t n_tokens() const {
        if (pos == MTMD_POS_TYPE_HUNYUANVL) {
            // [BOI] [row0 tokens + newline] ... [row(ny-1) tokens + newline] [EOI]
            return (nx + 1) * ny + 2;
        }
        return nx * ny;
    }
    clip_image_f32_batch batch_f32; // preprocessed image patches
    std::string id; // optional user-defined ID, useful for KV cache tracking

    mtmd_image_tokens clone() {
        return mtmd_image_tokens{
            nx,
            ny,
            pos,
            image_idx,
            batch_f32.clone(),
            id
        };
    }
};
using mtmd_image_tokens_ptr = std::unique_ptr<mtmd_image_tokens>;

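// Illustration only (not part of mtmd.cpp): a worked example of the token
// counting in mtmd_image_tokens::n_tokens() and of the max(t,h,w) rule noted
// for MTMD_POS_TYPE_MROPE. All names ending in _sketch, the POS_* enumerators
// and n_tokens_walkthrough are hypothetical stand-ins; only the two formulas
// are taken from the code and comments above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical mirror of mtmd_pos_type, so the example compiles on its own.
enum pos_type_sketch { POS_NORMAL, POS_MROPE, POS_HUNYUANVL };

// Same arithmetic as n_tokens(): the HunyuanVL layout adds one newline token
// per row plus a leading BOI and a trailing EOI token.
inline uint32_t n_tokens_sketch(uint32_t nx, uint32_t ny, pos_type_sketch pos) {
    if (pos == POS_HUNYUANVL) {
        return (nx + 1) * ny + 2;
    }
    return nx * ny;
}

// Number of position indexes an image occupies under the M-RoPE rule,
// following the "max(t,h,w)" wording of the enum comment.
inline uint32_t mrope_n_pos_sketch(uint32_t t, uint32_t h, uint32_t w) {
    return std::max(t, std::max(h, w));
}

// Walks through a 4x3 token grid.
inline void n_tokens_walkthrough() {
    assert(n_tokens_sketch(4, 3, POS_NORMAL) == 12);    // 4 * 3
    assert(n_tokens_sketch(4, 3, POS_HUNYUANVL) == 17); // (4 + 1) * 3 + 2
    assert(mrope_n_pos_sketch(1, 3, 4) == 4);           // max(1, 3, 4)
}
```

// Note the gap this sketch makes visible: with M-RoPE the 4x3 grid still holds
// 12 tokens but, per the enum comment, advances the position counter by only 4.
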
struct mtmd_audio_tokens {
    uint32_t n_tokens; // number of tokens
    clip_image_f32_batch batch_f32; // preprocessed image patches
    std::string id; // optional user-defined ID, useful for KV cache tracking

    mtmd_audio_tokens clone() {
        return mtmd_audio_tokens{
            n_tokens,
            batch_f32.clone(),
            id
        };
    }
};
using mtmd_audio_tokens_ptr = std::unique_ptr<mtmd_audio_tokens>;

struct mtmd_input_chunk {
    mtmd_input_chunk_type type;
    std::vector<llama_token> tokens_text;
    mtmd_image_tokens_ptr tokens_image;
    mtmd_audio_tokens_ptr tokens_audio;
};

struct mtmd_input_chunks {
    std::vector<mtmd_input_chunk> entries;
};

// slice template, used by some llava-uhd models to correctly place the special tokens around image embeddings
// models not having it (llava-1.6) will process embeddings without any special tokens in-between
enum mtmd_slice_tmpl {
    MTMD_SLICE_TMPL_NONE,
    MTMD_SLICE_TMPL_MINICPMV_2_5,
    MTMD_SLICE_TMPL_MINICPMV_2_6,
    MTMD_SLICE_TMPL_LLAMA4,
    MTMD_SLICE_TMPL_IDEFICS3,
    MTMD_SLICE_TMPL_LFM2,
    MTMD_SLICE_TMPL_STEP3VL,
};

const char * mtmd_default_marker() {
    return "<__media__>";
}

static clip_flash_attn_type mtmd_get_clip_flash_attn_type(enum llama_flash_attn_type flash_attn_type) {
    switch (flash_attn_type) {
        case LLAMA_FLASH_ATTN_TYPE_AUTO:     return CLIP_FLASH_ATTN_TYPE_AUTO;
        case LLAMA_FLASH_ATTN_TYPE_DISABLED: return CLIP_FLASH_ATTN_TYPE_DISABLED;
        case LLAMA_FLASH_ATTN_TYPE_ENABLED:  return CLIP_FLASH_ATTN_TYPE_ENABLED;
    }
    return CLIP_FLASH_ATTN_TYPE_AUTO;
}

mtmd_context_params mtmd_context_params_default() {
    mtmd_context_params params {
        /* use_gpu           */ true,
        /* print_timings     */ true,
        /* n_threads         */ 4,
        /* image_marker      */ nullptr,
        /* media_marker      */ mtmd_default_marker(),
        /* flash_attn_type   */ LLAMA_FLASH_ATTN_TYPE_AUTO,
        /* warmup            */ true,
        /* image_min_tokens  */ -1,
        /* image_max_tokens  */ -1,
        /* cb_eval           */ nullptr,
        /* cb_eval_user_data */ nullptr,
    };
    return params;
}

struct mtmd_context {
    struct clip_ctx * ctx_v; // vision
    struct clip_ctx * ctx_a; // audio
    std::vector<float> image_embd_v; // image embedding vector

    bool print_timings;
    int n_threads;
    std::string media_marker;
    const int n_embd_text = -1; // -1 means llm context not provided, skip checking this
    const llama_vocab * vocab = nullptr; // can be nullptr if text_model is not provided
    mtmd_pos_type pos_type;

    // these are not tokens, but strings used to mark the beginning and end of image/audio embeddings
    std::string img_beg;
    std::string img_end;
    std::string aud_beg;
    std::string aud_end;

    // for llava-uhd style models, we need special tokens in-between slices
    // minicpmv calls them "slices", llama 4 calls them "tiles"
    mtmd_slice_tmpl slice_tmpl = MTMD_SLICE_TMPL_NONE;
    std::vector<llama_token> tok_ov_img_start;  // overview image
    std::vector<llama_token> tok_ov_img_end;    // overview image
    std::vector<llama_token> tok_slices_start;  // start of all slices
    std::vector<llama_token> tok_slices_end;    // end of all slices
    std::vector<llama_token> tok_sli_img_start; // single slice start
    std::vector<llama_token> tok_sli_img_end;   // single slice end
    std::vector<llama_token> tok_sli_img_mid;   // between 2 slices
    std::vector<llama_token> tok_row_end;       // end of row
    bool tok_row_end_trail = false;
    bool ov_img_first = false;

    // string template for slice image delimiters with row/col (idefics3)
    std::string sli_img_start_tmpl;

    std::unique_ptr<mtmd_audio_preprocessor> audio_preproc;
    std::unique_ptr<mtmd_image_preprocessor> image_preproc;

    // TODO @ngxson : add timings

    mtmd_context(const char * mmproj_fname,
                 const llama_model * text_model,
                 const mtmd_context_params & ctx_params,
                 bool no_alloc = false) :
        print_timings(ctx_params.print_timings),
        n_threads    (ctx_params.n_threads),
        media_marker (ctx_params.media_marker),
        n_embd_text  (text_model ? llama_model_n_embd_inp(text_model) : -1),
        vocab        (text_model ? llama_model_get_vocab(text_model) : nullptr)
    {
        if (ctx_params.image_marker != nullptr) {
            throw std::runtime_error("custom image_marker is not supported anymore, use media_marker instead");
        }

        if (media_marker.empty()) {
            throw std::runtime_error("media_marker must not be empty");
        }

        if (text_model) {
            auto decoder_rope_type = llama_model_rope_type(text_model);
            switch (decoder_rope_type) {
                case LLAMA_ROPE_TYPE_NONE:
                case LLAMA_ROPE_TYPE_NORM:
                case LLAMA_ROPE_TYPE_NEOX:
                    {
                        pos_type = MTMD_POS_TYPE_NORMAL;
                    } break;
                case LLAMA_ROPE_TYPE_MROPE:
                case LLAMA_ROPE_TYPE_IMROPE:
                    {
                        pos_type = MTMD_POS_TYPE_MROPE;
                    } break;
                default:
                    throw std::runtime_error(string_format("unsupported decoder rope type: %d\n", decoder_rope_type));
            }
        }

        clip_context_params ctx_clip_params {
            /* use_gpu           */ ctx_params.use_gpu,
            /* flash_attn_type   */ mtmd_get_clip_flash_attn_type(ctx_params.flash_attn_type),
            /* image_min_tokens  */ ctx_params.image_min_tokens,
            /* image_max_tokens  */ ctx_params.image_max_tokens,
            /* warmup            */ ctx_params.warmup,
            /* cb_eval           */ ctx_params.cb_eval,
            /* cb_eval_user_data */ ctx_params.cb_eval_user_data,
            /* no_alloc          */ no_alloc,
        };

        auto res = clip_init(mmproj_fname, ctx_clip_params);
        ctx_v = res.ctx_v;
        ctx_a = res.ctx_a;
        if (!ctx_v && !ctx_a) {
            throw std::runtime_error(string_format("Failed to load CLIP model from %s\n", mmproj_fname));
        }

        // if both vision and audio mmproj are present, we need to validate their n_embd
        if (ctx_v && ctx_a) {
            int n_embd_v = clip_n_mmproj_embd(ctx_v);
            int n_embd_a = clip_n_mmproj_embd(ctx_a);
            if (n_embd_v != n_embd_a) {
                throw std::runtime_error(string_format(
                    "mismatch between vision and audio mmproj (n_embd_v = %d, n_embd_a = %d)\n",
                    n_embd_v, n_embd_a));
            }
        }

        // since we already validate n_embd of vision and audio mmproj,
        // we can safely assume that they are the same
        int n_embd_clip = clip_n_mmproj_embd(ctx_v ? ctx_v : ctx_a);
        if (n_embd_text > 0 && n_embd_text != n_embd_clip) {
            throw std::runtime_error(string_format(
                "mismatch between text model (n_embd = %d) and mmproj (n_embd = %d)\n"
                "hint: you may be using wrong mmproj\n",
                n_embd_text, n_embd_clip));
        }
        if (ctx_v) {
            init_vision();
        }
        if (ctx_a) {
            init_audio();
        }
    }

    void init_vision() {
        GGML_ASSERT(ctx_v != nullptr);
        image_preproc.reset();

        projector_type proj = clip_get_projector_type(ctx_v);

        switch (proj) {
            case PROJECTOR_TYPE_MLP:
            case PROJECTOR_TYPE_MLP_NORM:
            case PROJECTOR_TYPE_LDP:
            case PROJECTOR_TYPE_LDPV2:
            case PROJECTOR_TYPE_COGVLM:
            case PROJECTOR_TYPE_JANUS_PRO:
            case PROJECTOR_TYPE_GLM_EDGE:
                {
                    bool has_pinpoints = !clip_get_hparams(ctx_v)->image_res_candidates.empty();
                    if (has_pinpoints) {
                        image_preproc = std::make_unique<mtmd_image_preprocessor_llava_uhd>(ctx_v);
                    } else {
                        image_preproc = std::make_unique<mtmd_image_preprocessor_fixed_size>(ctx_v);
                    }
                } break;
            case PROJECTOR_TYPE_MINICPMV:
                {
                    int minicpmv_version = clip_get_hparams(ctx_v)->minicpmv_version;
                    if (minicpmv_version == 2) {
                        // minicpmv 2.5 format:
                        // <image> (overview) </image><slice><image> (slice) </image><image> (slice) </image>\n ... </slice>
                        slice_tmpl        = MTMD_SLICE_TMPL_MINICPMV_2_5;
                        tok_ov_img_start  = {lookup_token("<image>")};
                        tok_ov_img_end    = {lookup_token("</image>")};
                        tok_slices_start  = {lookup_token("<slice>")};
                        tok_slices_end    = {lookup_token("</slice>")};
                        tok_sli_img_start = tok_ov_img_start;
                        tok_sli_img_end   = tok_ov_img_end;
                        tok_row_end       = {lookup_token("\n")};
                        tok_row_end_trail = false; // no trailing end-of-row token
                        ov_img_first      = true;
                    } else if (minicpmv_version == 3 || minicpmv_version == 4 || minicpmv_version == 5 || minicpmv_version == 6 || minicpmv_version == 100045) {
                        // minicpmv 2.6 format:
                        // <image> (overview) </image><slice> (slice) </slice><slice> (slice) </slice>\n ...
                        slice_tmpl        = MTMD_SLICE_TMPL_MINICPMV_2_6;
                        tok_ov_img_start  = {lookup_token("<image>")};
                        tok_ov_img_end    = {lookup_token("</image>")};
                        tok_sli_img_start = {lookup_token("<slice>")};
                        tok_sli_img_end   = {lookup_token("</slice>")};
                        tok_row_end       = {lookup_token("\n")};
                        tok_row_end_trail = false; // no trailing end-of-row token
                        ov_img_first      = true;

                    } else if (minicpmv_version != 0) {
                        throw std::runtime_error(string_format("unsupported minicpmv version: %d\n", minicpmv_version));
                    }
                    image_preproc = std::make_unique<mtmd_image_preprocessor_llava_uhd>(ctx_v);
                } break;
            case PROJECTOR_TYPE_MINICPMV4_6:
                {
                    slice_tmpl        = MTMD_SLICE_TMPL_MINICPMV_2_6;
                    tok_ov_img_start  = {lookup_token("<image>")};
                    tok_ov_img_end    = {lookup_token("</image>")};
                    tok_sli_img_start = {lookup_token("<slice>")};
                    tok_sli_img_end   = {lookup_token("</slice>")};
                    tok_row_end       = {lookup_token("\n")};
                    tok_row_end_trail = false; // no trailing end-of-row token
                    ov_img_first      = true;
                    image_preproc = std::make_unique<mtmd_image_preprocessor_llava_uhd>(ctx_v);
                } break;
            case PROJECTOR_TYPE_QWEN2VL:
            case PROJECTOR_TYPE_QWEN25VL:
            case PROJECTOR_TYPE_QWEN3VL:
            case PROJECTOR_TYPE_MIMOVL:
                {
                    // <|vision_start|> ... (image embeddings) ... <|vision_end|>
                    img_beg = "<|vision_start|>";
                    img_end = "<|vision_end|>";
                    image_preproc = std::make_unique<mtmd_image_preprocessor_dyn_size>(ctx_v);
                } break;
|
||
case PROJECTOR_TYPE_YOUTUVL:
|
||
{
|
||
// <|vision_start|> ... (image embeddings) ... <|vision_end|>
|
||
img_beg = "<|vision_start|>";
|
||
img_end = "<|vision_end|>";
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_youtuvl>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_YASA2:
|
||
{
|
||
img_beg = "<image>";
|
||
img_end = "</image>";
|
||
// Currently only supports single-tile preprocessing: any input is downscaled
// to one image_size x image_size tile (64 output tokens via 8x8 adaptive avg
// pool).
// However, the model itself supports llava-uhd multi-tile tiling for high-res
// images. This will be implemented in a future PR (dispatch on has_pinpoints
// - see LDP/COGVLM branch above), which will also emit image_grid_pinpoints
// in the conversion script.
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_fixed_size>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_GEMMA3:
|
||
case PROJECTOR_TYPE_GEMMA3NV:
|
||
{
|
||
// <start_of_image> ... (image embeddings) ... <end_of_image>
|
||
img_beg = "<start_of_image>";
|
||
img_end = "<end_of_image>";
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_fixed_size>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_IDEFICS3:
|
||
{
|
||
// https://github.com/huggingface/transformers/blob/a42ba80fa520c784c8f11a973ca9034e5f859b79/src/transformers/models/idefics3/processing_idefics3.py#L192-L215
|
||
slice_tmpl = MTMD_SLICE_TMPL_IDEFICS3;
|
||
tok_ov_img_start = {lookup_token("\n\n"), lookup_token("<fake_token_around_image>"), lookup_token("<global-img>")};
|
||
tok_ov_img_end = {lookup_token("<fake_token_around_image>")};
|
||
tok_row_end = {lookup_token("\n")};
|
||
sli_img_start_tmpl = "<fake_token_around_image><row_%d_col_%d>";
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_idefics3>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_PIXTRAL:
|
||
{
|
||
// https://github.com/huggingface/transformers/blob/1cd110c6cb6a6237614130c470e9a902dbc1a4bd/docs/source/en/model_doc/pixtral.md
|
||
img_end = "[IMG_END]";
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_dyn_size>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_PHI4:
|
||
{
|
||
// Phi-4 uses media marker insertion only. Keep image boundary text empty.
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_dyn_size>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_LLAMA4:
|
||
{
|
||
// (more details in mtmd_context constructor)
|
||
img_beg = "<|image_start|>";
|
||
img_end = "<|image_end|>";
|
||
LOG_WRN("%s: llama 4 vision is known to have degraded quality:\n"
|
||
" https://github.com/ggml-org/llama.cpp/pull/13282\n", __func__);
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_llava_uhd>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_STEP3VL:
|
||
{
|
||
// Step3 format:
|
||
// <patch_start> (patch) <patch_end> [<patch_newline>]
|
||
// ... (all patch rows)
|
||
// <im_start> (overview) <im_end>
|
||
slice_tmpl = MTMD_SLICE_TMPL_STEP3VL;
|
||
tok_ov_img_start = {lookup_token("<im_start>")};
|
||
tok_ov_img_end = {lookup_token("<im_end>")};
|
||
tok_sli_img_start = {lookup_token("<patch_start>")};
|
||
tok_sli_img_end = {lookup_token("<patch_end>")};
|
||
tok_row_end = {lookup_token("<patch_newline>")};
|
||
tok_row_end_trail = false;
|
||
ov_img_first = false; // patches first, overview last
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_step3vl>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_INTERNVL:
|
||
{
|
||
// <img> ... (image embeddings) ... </img>
|
||
img_beg = "<img>";
|
||
img_end = "</img>";
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_internvl>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_KIMIVL:
|
||
{
|
||
// <|media_start|> ... (image embeddings) ... <|media_end|>
|
||
img_beg = "<|media_start|>";
|
||
img_end = "<|media_end|>";
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_dyn_size>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_KIMIK25:
|
||
{
|
||
// <|media_begin|> ... (image embeddings) ... <|media_end|>
|
||
img_beg = "<|media_begin|>";
|
||
img_end = "<|media_end|>";
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_dyn_size>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_LIGHTONOCR:
|
||
{
|
||
// <|im_start|> ... (image embeddings) ... <|im_end|>
|
||
img_beg = "<|im_start|>";
|
||
img_end = "<|im_end|>";
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_longest_edge>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_DOTS_OCR:
|
||
{
|
||
// <|img|> ... (image embeddings) ... <|endofimg|>
|
||
img_beg = "<|img|>";
|
||
img_end = "<|endofimg|>";
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_dyn_size>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_NEMOTRON_V2_VL:
|
||
{
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_fixed_size>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_LFM2:
|
||
{
|
||
// multi-tile:
|
||
// <|image_start|>
|
||
// <|img_row_1_col_1|> (tile) <|img_row_1_col_2|> (tile) ...
|
||
// <|img_thumbnail|> (thumbnail)
|
||
// <|image_end|>
|
||
// single-tile:
|
||
// <|image_start|> (image) <|image_end|>
|
||
img_beg = "<|image_start|>";
|
||
img_end = "<|image_end|>";
|
||
slice_tmpl = MTMD_SLICE_TMPL_LFM2;
|
||
sli_img_start_tmpl = "<|img_row_%d_col_%d|>";
|
||
tok_ov_img_start = {lookup_token("<|img_thumbnail|>")};
|
||
ov_img_first = false;
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_lfm2>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_GLM4V:
|
||
{
|
||
// <|begin_of_image|> ... (image embeddings) ... <|end_of_image|>
|
||
img_beg = "<|begin_of_image|>";
|
||
img_end = "<|end_of_image|>";
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_dyn_size>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_PADDLEOCR:
|
||
{
|
||
// <|IMAGE_START|> ... (image embeddings) ... <|IMAGE_END|>
|
||
img_beg = "<|IMAGE_START|>";
|
||
img_end = "<|IMAGE_END|>";
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_dyn_size>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_GEMMA4V:
|
||
case PROJECTOR_TYPE_GEMMA4UV:
|
||
{
|
||
// <|image> ... (image embeddings) ... <image|>
|
||
img_beg = "<|image>";
|
||
img_end = "<image|>";
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_dyn_size>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_DEEPSEEKOCR:
|
||
{
|
||
img_end = "\n"; // prevent empty batch on llama-server
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_deepseekocr>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_DEEPSEEKOCR2:
|
||
{
|
||
img_end = "\n"; // prevent empty batch on llama-server
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_deepseekocr2>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_HUNYUANVL:
|
||
{
|
||
// note: these use fullwidth ｜ (U+FF5C) and ▁ (U+2581) to match the tokenizer vocabulary
img_beg = "<｜hy_place▁holder▁no▁100｜>";
img_end = "<｜hy_place▁holder▁no▁101｜>";
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_dyn_size>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_EXAONE4_5:
|
||
{
|
||
// <vision> ... (image embeddings) ... </vision>
|
||
img_beg = "<vision>";
|
||
img_end = "</vision>";
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_dyn_size>(ctx_v);
|
||
} break;
|
||
case PROJECTOR_TYPE_GRANITE4_VISION:
|
||
{
|
||
img_beg = "<image>";
|
||
img_end = "";
|
||
image_preproc = std::make_unique<mtmd_image_preprocessor_llava_uhd>(ctx_v);
|
||
} break;
|
||
default:
|
||
throw std::runtime_error(string_format("%s: unexpected vision projector type %d\n", __func__, proj));
|
||
}
|
||
|
||
GGML_ASSERT(image_preproc != nullptr);
|
||
}
|
||
|
||
void init_audio() {
|
||
GGML_ASSERT(ctx_a != nullptr);
|
||
audio_preproc.reset();
|
||
|
||
projector_type proj = clip_get_projector_type(ctx_a);
|
||
|
||
LOG_WRN("%s: audio input is in experimental stage and may have reduced quality:\n"
|
||
" https://github.com/ggml-org/llama.cpp/discussions/13759\n", __func__);
|
||
|
||
// set preprocessor
|
||
switch (proj) {
|
||
case PROJECTOR_TYPE_QWEN2A:
|
||
case PROJECTOR_TYPE_QWEN25O:
|
||
{
|
||
// <|audio_bos|> ... (embeddings) ... <|audio_eos|>
|
||
aud_beg = "<|audio_bos|>";
|
||
aud_end = "<|audio_eos|>";
|
||
audio_preproc = std::make_unique<mtmd_audio_preprocessor_whisper>(ctx_a);
|
||
} break;
|
||
case PROJECTOR_TYPE_QWEN3A:
|
||
{
|
||
aud_beg = "<|audio_start|>";
|
||
aud_end = "<|audio_end|>";
|
||
audio_preproc = std::make_unique<mtmd_audio_preprocessor_qwen3a>(ctx_a);
|
||
} break;
|
||
case PROJECTOR_TYPE_VOXTRAL:
|
||
{
|
||
// [BEGIN_AUDIO] ... (embeddings) ...
|
||
aud_beg = "[BEGIN_AUDIO]";
|
||
audio_preproc = std::make_unique<mtmd_audio_preprocessor_whisper>(ctx_a);
|
||
} break;
|
||
case PROJECTOR_TYPE_MUSIC_FLAMINGO:
|
||
{
|
||
// <sound> ... (embeddings) ...
|
||
aud_beg = "<sound>";
|
||
audio_preproc = std::make_unique<mtmd_audio_preprocessor_whisper>(ctx_a);
|
||
} break;
|
||
case PROJECTOR_TYPE_ULTRAVOX:
|
||
case PROJECTOR_TYPE_GLMA:
|
||
case PROJECTOR_TYPE_MERALION:
|
||
{
|
||
audio_preproc = std::make_unique<mtmd_audio_preprocessor_whisper>(ctx_a);
|
||
} break;
|
||
case PROJECTOR_TYPE_LFM2A:
|
||
{
|
||
audio_preproc = std::make_unique<mtmd_audio_preprocessor_conformer>(ctx_a);
|
||
} break;
|
||
case PROJECTOR_TYPE_GRANITE_SPEECH:
|
||
{
|
||
audio_preproc = std::make_unique<mtmd_audio_preprocessor_granite_speech>(ctx_a);
|
||
} break;
|
||
case PROJECTOR_TYPE_GEMMA4A:
|
||
{
|
||
aud_beg = "<|audio>";
|
||
aud_end = "<audio|>";
|
||
audio_preproc = std::make_unique<mtmd_audio_preprocessor_gemma4a>(ctx_a);
|
||
} break;
|
||
case PROJECTOR_TYPE_GEMMA4UA:
|
||
{
|
||
aud_beg = "<|audio>";
|
||
aud_end = "<audio|>";
|
||
audio_preproc = std::make_unique<mtmd_audio_preprocessor_gemma4ua>(ctx_a);
|
||
} break;
|
||
default:
|
||
throw std::runtime_error(string_format("%s: unexpected audio projector type %d\n", __func__, proj));
|
||
}
|
||
|
||
// initialize audio preprocessor
|
||
GGML_ASSERT(audio_preproc != nullptr);
|
||
audio_preproc->initialize();
|
||
}
|
||
|
||
// get clip ctx based on chunk type
|
||
clip_ctx * get_clip_ctx(const mtmd_input_chunk * chunk) const {
|
||
if (chunk->type == MTMD_INPUT_CHUNK_TYPE_IMAGE) {
|
||
return ctx_v;
|
||
} else if (chunk->type == MTMD_INPUT_CHUNK_TYPE_AUDIO) {
|
||
return ctx_a;
|
||
}
|
||
GGML_ABORT("unknown chunk type");
|
||
}
|
||
|
||
projector_type proj_type_v() const {
|
||
return ctx_v ? clip_get_projector_type(ctx_v) : PROJECTOR_TYPE_UNKNOWN;
|
||
}
|
||
|
||
projector_type proj_type_a() const {
|
||
return ctx_a ? clip_get_projector_type(ctx_a) : PROJECTOR_TYPE_UNKNOWN;
|
||
}
|
||
|
||
~mtmd_context() {
|
||
clip_free(ctx_a);
|
||
clip_free(ctx_v);
|
||
}
|
||
|
||
private:
|
||
llama_token lookup_token(const std::string & token_text) {
|
||
if (vocab == nullptr) {
|
||
// TODO @ngxson : this case is currently hit by mtmd_get_memory_usage
|
||
// but we should reconsider this if this case is needed in other places in the future
|
||
return LLAMA_TOKEN_NULL;
|
||
}
|
||
const int n_vocab = llama_vocab_n_tokens(vocab);
|
||
for (int i = 0; i < n_vocab; i++) {
|
||
if (token_to_piece(vocab, i, true) == token_text) {
|
||
return i;
|
||
}
|
||
}
|
||
return LLAMA_TOKEN_NULL;
|
||
}
|
||
|
||
std::string token_to_piece(const llama_vocab * vocab, llama_token token, bool special) {
|
||
if (vocab == nullptr) {
|
||
throw std::runtime_error("llama_vocab is not provided");
|
||
}
|
||
std::string piece;
|
||
piece.resize(piece.capacity()); // use the string's internal buffer first (typically 15 bytes + '\0')
|
||
const int n_chars = llama_token_to_piece(vocab, token, &piece[0], piece.size(), 0, special);
|
||
if (n_chars < 0) {
|
||
piece.resize(-n_chars);
|
||
int check = llama_token_to_piece(vocab, token, &piece[0], piece.size(), 0, special);
|
||
GGML_ASSERT(check == -n_chars);
|
||
} else {
|
||
piece.resize(n_chars);
|
||
}
|
||
return piece;
|
||
}
|
||
};
|
||
|
||
mtmd_context * mtmd_init_from_file(const char * mmproj_fname,
|
||
const struct llama_model * text_model,
|
||
const struct mtmd_context_params ctx_params) {
|
||
try {
|
||
return new mtmd_context(mmproj_fname, text_model, ctx_params);
|
||
} catch (const std::exception & e) {
|
||
LOG_ERR("%s: error: %s\n", __func__, e.what());
|
||
return nullptr;
|
||
}
|
||
}
|
||
|
||
void mtmd_free(mtmd_context * ctx) {
|
||
delete ctx;
|
||
}
|
||
|
||
struct mtmd_tokenizer {
|
||
mtmd_context * ctx;
|
||
std::vector<const mtmd_bitmap *> bitmaps;
|
||
|
||
std::string input_text;
|
||
bool add_special;
|
||
bool parse_special;
|
||
const llama_vocab * vocab;
|
||
|
||
mtmd_input_chunks cur;
|
||
uint32_t n_images_added = 0; // 0-based index assigned to the next image chunk
|
||
|
||
mtmd_tokenizer(mtmd_context * ctx,
|
||
const mtmd_input_text * text,
|
||
const mtmd_bitmap ** bitmaps,
|
||
size_t n_bitmaps) : ctx(ctx), bitmaps(bitmaps, bitmaps + n_bitmaps) {
|
||
add_special = text->add_special;
|
||
parse_special = text->parse_special;
|
||
input_text = text->text;
|
||
vocab = ctx->vocab;
|
||
}
|
||
|
||
int32_t tokenize(mtmd_input_chunks * output) {
|
||
cur.entries.clear();
|
||
std::vector<std::string> parts = split_text(input_text, ctx->media_marker);
|
||
size_t i_bm = 0; // index of the current bitmap
|
||
for (auto & part : parts) {
|
||
if (part == ctx->media_marker) {
|
||
// this is a marker, we should add the next bitmap
|
||
if (i_bm >= bitmaps.size()) {
|
||
LOG_ERR("%s: error: number of bitmaps (%zu) does not match number of markers (%zu)\n",
|
||
__func__, bitmaps.size(), parts.size() - 1);
|
||
return 1;
|
||
}
|
||
const mtmd_bitmap * bitmap = bitmaps[i_bm++];
|
||
int32_t res = add_media(bitmap);
|
||
if (res != 0) {
|
||
return res;
|
||
}
|
||
} else {
|
||
// this is a text part, we should add it as text
|
||
add_text(part, parse_special);
|
||
}
|
||
}
|
||
|
||
if (vocab != nullptr) {
|
||
if (add_special && llama_vocab_get_add_bos(vocab)) {
|
||
// if first chunk is text, we add BOS token to first text chunk
|
||
// otherwise, create a new text chunk with BOS token
|
||
if (!cur.entries.empty() && cur.entries[0].type == MTMD_INPUT_CHUNK_TYPE_TEXT) {
|
||
// add BOS token to the beginning of first text chunk
|
||
cur.entries[0].tokens_text.insert(cur.entries[0].tokens_text.begin(), llama_vocab_bos(vocab));
|
||
} else {
|
||
// create a new text chunk with BOS token at the beginning
|
||
mtmd_input_chunk bos_chunk{
|
||
MTMD_INPUT_CHUNK_TYPE_TEXT,
|
||
{llama_vocab_bos(vocab)},
|
||
nullptr, // image tokens
|
||
nullptr, // audio tokens
|
||
};
|
||
cur.entries.insert(cur.entries.begin(), std::move(bos_chunk));
|
||
}
|
||
}
|
||
|
||
if (add_special && llama_vocab_get_add_eos(vocab)) {
|
||
// if last chunk is text, we add EOS token to it
|
||
add_text({llama_vocab_eos(vocab)});
|
||
}
|
||
}
|
||
|
||
if (i_bm != bitmaps.size()) {
|
||
LOG_ERR("%s: error: number of bitmaps (%zu) does not match number of markers (%zu)\n",
|
||
__func__, bitmaps.size(), parts.size() - 1);
|
||
return 1;
|
||
}
|
||
|
||
*output = std::move(cur);
|
||
|
||
return 0;
|
||
}
|
||
|
||
void add_text(const std::string & txt, bool parse_special) {
|
||
if (vocab == nullptr) {
|
||
throw std::runtime_error("llama_vocab is not provided");
|
||
}
|
||
LOG_DBG("%s: %s\n", __func__, txt.c_str());
|
||
auto tokens = mtmd_tokenize_text_internal(vocab, txt, /* add_special */ false, parse_special);
|
||
add_text(tokens);
|
||
}
|
||
|
||
void add_text(const std::vector<llama_token> & tokens) {
|
||
if (tokens.empty()) {
|
||
return;
|
||
}
|
||
// if last entry is also a text chunk, add tokens to it instead of creating new chunk
|
||
if (!cur.entries.empty() && cur.entries.back().type == MTMD_INPUT_CHUNK_TYPE_TEXT) {
|
||
cur.entries.back().tokens_text.insert(
|
||
cur.entries.back().tokens_text.end(),
|
||
tokens.begin(),
|
||
tokens.end());
|
||
} else {
|
||
mtmd_input_chunk chunk{
|
||
MTMD_INPUT_CHUNK_TYPE_TEXT,
|
||
tokens,
|
||
nullptr, // image tokens
|
||
nullptr, // audio tokens
|
||
};
|
||
cur.entries.emplace_back(std::move(chunk));
|
||
}
|
||
}
|
||
|
||
int32_t add_media(const mtmd_bitmap * bitmap) {
|
||
if (!bitmap->is_audio) {
|
||
// handle image
|
||
|
||
if (!ctx->ctx_v) {
|
||
LOG_ERR("%s: error: model does not support vision input\n", __func__);
|
||
return 2;
|
||
}
|
||
|
||
if (!ctx->img_beg.empty()) {
|
||
add_text(ctx->img_beg, true); // add image begin token
|
||
}
|
||
|
||
// sanity check
|
||
GGML_ASSERT(bitmap->nx > 0 && bitmap->ny > 0);
|
||
GGML_ASSERT(bitmap->data.size() == (size_t)bitmap->nx * bitmap->ny * 3);
|
||
GGML_ASSERT(ctx->image_preproc != nullptr);
|
||
|
||
// convert mtmd_bitmap to clip_image_u8
|
||
clip_image_u8_ptr img_u8(clip_image_u8_init());
|
||
img_u8->nx = bitmap->nx;
|
||
img_u8->ny = bitmap->ny;
|
||
img_u8->buf.resize(bitmap->data.size());
|
||
std::memcpy(img_u8->buf.data(), bitmap->data.data(), img_u8->nx * img_u8->ny * 3);
|
||
|
||
// preprocess image
|
||
clip_image_f32_batch batch_f32;
|
||
bool ok = ctx->image_preproc->preprocess(*img_u8, batch_f32);
|
||
if (!ok) {
|
||
LOG_ERR("Unable to preprocess image\n");
|
||
return 2;
|
||
}
|
||
|
||
// Annotate llava-next style tiles so clip_n_output_tokens accounts
|
||
// for per-tile newline injection.
|
||
if (ctx->proj_type_v() == PROJECTOR_TYPE_GRANITE4_VISION) {
|
||
if (batch_f32.entries.size() == 1) {
|
||
// Single-tile (overview only): append one newline row.
|
||
batch_f32.entries[0]->add_newline = true;
|
||
} else {
|
||
// Multi-tile: overview gets no newline, grid tiles get one.
|
||
batch_f32.entries[0]->add_newline = false;
|
||
for (size_t i = 1; i < batch_f32.entries.size(); ++i) {
|
||
batch_f32.entries[i]->add_newline = true;
|
||
}
|
||
}
|
||
}
|
||
|
||
// handle llava-uhd style preprocessing
|
||
const bool has_tiling_grid = batch_f32.grid_x > 0 && batch_f32.grid_y > 0;
|
||
if (
|
||
ctx->slice_tmpl == MTMD_SLICE_TMPL_MINICPMV_2_5
|
||
|| ctx->slice_tmpl == MTMD_SLICE_TMPL_MINICPMV_2_6
|
||
|| ctx->slice_tmpl == MTMD_SLICE_TMPL_LLAMA4
|
||
|| ctx->slice_tmpl == MTMD_SLICE_TMPL_IDEFICS3
|
||
|| ctx->slice_tmpl == MTMD_SLICE_TMPL_STEP3VL
|
||
|| (ctx->slice_tmpl == MTMD_SLICE_TMPL_LFM2 && has_tiling_grid)
|
||
) {
|
||
const int n_col = batch_f32.grid_x;
|
||
const int n_row = batch_f32.grid_y;
|
||
// split batch into chunks of single images
|
||
// NOTE: batch_f32 will be invalidated after this call
|
||
auto chunks = split_batch_to_chunk(std::move(batch_f32), bitmap->id);
|
||
GGML_ASSERT(chunks.size() > 0);
|
||
|
||
auto ov_chunk = std::move(chunks.front());
|
||
chunks.erase(chunks.begin());
|
||
|
||
// add overview image (first)
|
||
if (ctx->ov_img_first) {
|
||
add_text(ctx->tok_ov_img_start);
|
||
cur.entries.emplace_back(std::move(ov_chunk));
|
||
add_text(ctx->tok_ov_img_end);
|
||
}
|
||
|
||
// add slices (or tiles)
|
||
if (!chunks.empty()) {
|
||
GGML_ASSERT((int)chunks.size() == n_row * n_col);
|
||
add_text(ctx->tok_slices_start);
|
||
for (int y = 0; y < n_row; y++) {
|
||
for (int x = 0; x < n_col; x++) {
|
||
const bool is_last_in_row = (x == n_col - 1);
|
||
if (!ctx->tok_sli_img_start.empty()) {
|
||
add_text(ctx->tok_sli_img_start);
|
||
} else if (!ctx->sli_img_start_tmpl.empty()) {
|
||
// If a template is used to precede a slice image
|
||
const size_t sz = std::snprintf(nullptr, 0, ctx->sli_img_start_tmpl.c_str(), y+1, x+1) + 1;
|
||
std::unique_ptr<char[]> buf(new char[sz]);
|
||
std::snprintf(buf.get(), sz, ctx->sli_img_start_tmpl.c_str(), y+1, x+1);
|
||
add_text(std::string(buf.get(), buf.get() + sz - 1), true);
|
||
}
|
||
cur.entries.emplace_back(std::move(chunks[y * n_col + x]));
|
||
add_text(ctx->tok_sli_img_end);
|
||
if (!is_last_in_row) {
|
||
add_text(ctx->tok_sli_img_mid);
|
||
}
|
||
}
|
||
if ((y != n_row - 1 || ctx->tok_row_end_trail)) {
|
||
add_text(ctx->tok_row_end);
|
||
}
|
||
}
|
||
add_text(ctx->tok_slices_end);
|
||
}
|
||
|
||
// add overview image (last)
|
||
if (!ctx->ov_img_first) {
|
||
add_text(ctx->tok_ov_img_start);
|
||
cur.entries.emplace_back(std::move(ov_chunk));
|
||
add_text(ctx->tok_ov_img_end);
|
||
}
|
||
|
||
} else {
|
||
|
||
size_t n_tokens = 0;
|
||
for (const auto & e : batch_f32.entries) {
|
||
n_tokens += clip_n_output_tokens(ctx->ctx_v, e.get());
|
||
}
|
||
|
||
mtmd_image_tokens_ptr image_tokens(new mtmd_image_tokens);
|
||
if (mtmd_decode_use_mrope(ctx)) {
|
||
// for Qwen2VL, we need this information for M-RoPE decoding positions
|
||
image_tokens->nx = clip_n_output_tokens_x(ctx->ctx_v, batch_f32.entries[0].get());
|
||
image_tokens->ny = clip_n_output_tokens_y(ctx->ctx_v, batch_f32.entries[0].get());
|
||
} else {
|
||
// other models, we only need the total number of tokens
|
||
image_tokens->nx = n_tokens;
|
||
image_tokens->ny = 1;
|
||
}
|
||
image_tokens->pos = ctx->pos_type;
|
||
// HunyuanVL wraps the image grid with BOI/EOI and adds one newline per row,
|
||
// and uses XD-RoPE (dim-3 = image index). Override the position type so that
|
||
// n_tokens() and mtmd_image_tokens_get_decoder_pos pick the HunyuanVL layout.
|
||
if (ctx->proj_type_v() == PROJECTOR_TYPE_HUNYUANVL) {
|
||
image_tokens->pos = MTMD_POS_TYPE_HUNYUANVL;
|
||
image_tokens->image_idx = n_images_added;
|
||
GGML_ASSERT(n_tokens == (size_t)image_tokens->n_tokens());
|
||
}
|
||
image_tokens->batch_f32 = std::move(batch_f32);
|
||
image_tokens->id = bitmap->id; // optional
|
||
|
||
LOG_DBG("image_tokens->nx = %d\n", image_tokens->nx);
|
||
LOG_DBG("image_tokens->ny = %d\n", image_tokens->ny);
|
||
LOG_DBG("batch_f32 size = %d\n", (int)image_tokens->batch_f32.entries.size());
|
||
|
||
mtmd_input_chunk chunk{
|
||
MTMD_INPUT_CHUNK_TYPE_IMAGE,
|
||
{}, // text tokens
|
||
std::move(image_tokens),
|
||
nullptr, // audio tokens
|
||
};
|
||
cur.entries.emplace_back(std::move(chunk));
|
||
}
|
||
|
||
if (!ctx->img_end.empty()) {
|
||
add_text(ctx->img_end, true); // add image end token
|
||
}
|
||
|
||
// advance image-chunk counter so the next image gets the next XD-RoPE dim-3 slot
|
||
n_images_added++;
|
||
|
||
} else {
|
||
// handle audio
|
||
|
||
if (!ctx->ctx_a) {
|
||
LOG_ERR("%s: error: model does not support audio input\n", __func__);
|
||
return 2;
|
||
}
|
||
|
||
if (bitmap->data.size() == 0) {
|
||
LOG_ERR("%s: error: empty audio data\n", __func__);
|
||
return 2;
|
||
}
|
||
|
||
if (!ctx->aud_beg.empty()) {
|
||
add_text(ctx->aud_beg, true); // add audio begin token
|
||
}
|
||
|
||
// sanity check
|
||
GGML_ASSERT(ctx->audio_preproc != nullptr);
|
||
GGML_ASSERT(bitmap->data.size() > sizeof(float));
|
||
GGML_ASSERT(bitmap->data.size() % sizeof(float) == 0);
|
||
|
||
// preprocess audio
|
||
std::vector<mtmd_audio_mel> mel_spec_chunks;
|
||
const float * samples = (const float *)bitmap->data.data();
|
||
size_t n_samples = bitmap->data.size() / sizeof(float);
|
||
bool ok = ctx->audio_preproc->preprocess(samples, n_samples, mel_spec_chunks);
|
||
if (!ok) {
|
||
LOG_ERR("Unable to preprocess audio\n");
|
||
return 2;
|
||
}
|
||
|
||
// consider each mel_spec as a separate audio chunk
|
||
// TODO: maybe support batching, but this may come with memory cost
|
||
for (auto & mel_spec : mel_spec_chunks) {
|
||
clip_image_f32_ptr mel_f32(clip_image_f32_init());
|
||
mel_f32->nx = mel_spec.n_len;
|
||
mel_f32->ny = mel_spec.n_mel;
|
||
mel_f32->buf = std::move(mel_spec.data);
|
||
size_t n_tokens = clip_n_output_tokens(ctx->ctx_a, mel_f32.get());
|
||
|
||
clip_image_f32_batch batch_f32;
|
||
batch_f32.is_audio = true;
|
||
batch_f32.entries.push_back(std::move(mel_f32));
|
||
|
||
mtmd_audio_tokens_ptr audio_tokens(new mtmd_audio_tokens);
|
||
audio_tokens->n_tokens = n_tokens;
|
||
audio_tokens->batch_f32 = std::move(batch_f32);
|
||
audio_tokens->id = bitmap->id; // optional
|
||
|
||
LOG_DBG("audio_tokens->n_tokens = %d\n", audio_tokens->n_tokens);
|
||
|
||
mtmd_input_chunk chunk{
|
||
MTMD_INPUT_CHUNK_TYPE_AUDIO,
|
||
{}, // text tokens
|
||
nullptr, // image tokens
|
||
std::move(audio_tokens),
|
||
};
|
||
cur.entries.emplace_back(std::move(chunk));
|
||
}
|
||
|
||
if (!ctx->aud_end.empty()) {
|
||
add_text(ctx->aud_end, true); // add audio end token
|
||
}
|
||
}
|
||
|
||
return 0;
|
||
}
|
||
|
||
std::vector<mtmd_input_chunk> split_batch_to_chunk(clip_image_f32_batch && batch_f32, const std::string & id) {
|
||
std::vector<mtmd_input_chunk> chunks;
|
||
|
||
for (auto & entry : batch_f32.entries) {
|
||
mtmd_image_tokens_ptr image_tokens(new mtmd_image_tokens);
|
||
image_tokens->nx = clip_n_output_tokens(ctx->ctx_v, entry.get());
|
||
image_tokens->ny = 1;
|
||
image_tokens->batch_f32.entries.push_back(std::move(entry));
|
||
image_tokens->id = id;
|
||
|
||
mtmd_input_chunk chunk{
|
||
MTMD_INPUT_CHUNK_TYPE_IMAGE,
|
||
{}, // text tokens
|
||
std::move(image_tokens),
|
||
nullptr, // audio tokens
|
||
};
|
||
chunks.emplace_back(std::move(chunk));
|
||
}
|
||
|
||
return chunks;
|
||
}
|
||
|
||
// for example: "a <__media__> b <__media__> c" --> "a", "<__media__>", "b", "<__media__>", "c"
|
||
static std::vector<std::string> split_text(const std::string & input, const std::string & delimiter) {
|
||
std::vector<std::string> result;
|
||
if (input.empty()) {
|
||
return result;
|
||
}
|
||
size_t start = 0;
|
||
size_t pos = 0;
|
||
while ((pos = input.find(delimiter, start)) != std::string::npos) {
|
||
if (pos > start) {
|
||
result.push_back(input.substr(start, pos - start));
|
||
}
|
||
result.push_back(delimiter);
|
||
start = pos + delimiter.length();
|
||
}
|
||
if (start < input.length()) {
|
||
result.push_back(input.substr(start));
|
||
}
|
||
return result;
|
||
}
|
||
|
||
// copied from common_tokenize
|
||
static std::vector<llama_token> mtmd_tokenize_text_internal(
|
||
const struct llama_vocab * vocab,
|
||
const std::string & text,
|
||
bool add_special,
|
||
bool parse_special) {
|
||
if (vocab == nullptr) {
|
||
throw std::runtime_error("llama_vocab is not provided");
|
||
}
|
||
// upper limit for the number of tokens
|
||
int n_tokens = text.length() + 2 * add_special;
|
||
std::vector<llama_token> result(n_tokens);
|
||
n_tokens = llama_tokenize(vocab, text.data(), text.length(), result.data(), result.size(), add_special, parse_special);
|
||
if (n_tokens == std::numeric_limits<int32_t>::min()) {
|
||
throw std::runtime_error("Tokenization failed: input text too large, tokenization result exceeds int32_t limit");
|
||
}
|
||
if (n_tokens < 0) {
|
||
result.resize(-n_tokens);
|
||
int check = llama_tokenize(vocab, text.data(), text.length(), result.data(), result.size(), add_special, parse_special);
|
||
GGML_ASSERT(check == -n_tokens);
|
||
} else {
|
||
result.resize(n_tokens);
|
||
}
|
||
return result;
|
||
}
|
||
};
|
||
|
||
int32_t mtmd_tokenize(mtmd_context * ctx,
|
||
mtmd_input_chunks * output,
|
||
const mtmd_input_text * text,
|
||
const mtmd_bitmap ** bitmaps,
|
||
size_t n_bitmaps) {
|
||
mtmd_tokenizer tokenizer(ctx, text, bitmaps, n_bitmaps);
|
||
return tokenizer.tokenize(output);
|
||
}
|
||
|
||
int32_t mtmd_encode_chunk(mtmd_context * ctx, const mtmd_input_chunk * chunk) {
|
||
if (chunk->type == MTMD_INPUT_CHUNK_TYPE_TEXT) {
|
||
LOG_WRN("mtmd_encode_chunk has no effect for text chunks\n");
|
||
return 0;
|
||
} else if (chunk->type == MTMD_INPUT_CHUNK_TYPE_IMAGE) {
|
||
if (!ctx->ctx_v) {
|
||
LOG_ERR("%s: model does not support vision input\n", __func__);
|
||
return 1;
|
||
}
|
||
return mtmd_encode(ctx, chunk->tokens_image.get());
|
||
} else if (chunk->type == MTMD_INPUT_CHUNK_TYPE_AUDIO) {
|
||
if (!ctx->ctx_a) {
|
||
LOG_ERR("%s: model does not support audio input\n", __func__);
|
||
return 1;
|
||
}
|
||
int n_mmproj_embd = ctx->n_embd_text;
|
||
ctx->image_embd_v.resize(chunk->tokens_audio->n_tokens * n_mmproj_embd);
|
||
bool ok = clip_image_batch_encode(
|
||
ctx->ctx_a,
|
||
ctx->n_threads,
|
||
&chunk->tokens_audio->batch_f32,
|
||
ctx->image_embd_v.data());
|
||
return ok ? 0 : 1;
|
||
}
|
||
|
||
LOG_ERR("%s: unknown chunk type %d\n", __func__, (int)chunk->type);
|
||
return 1;
|
||
}
|
||
|
||
int32_t mtmd_encode(mtmd_context * ctx, const mtmd_image_tokens * image_tokens) {
|
||
clip_ctx * ctx_clip = ctx->ctx_v;
|
||
if (!ctx_clip) {
|
||
LOG_ERR("%s: this API does not support non-vision input, please use mtmd_encode_chunk instead\n", __func__);
|
||
return 1;
|
||
}
|
||
auto proj_type = clip_get_projector_type(ctx_clip);
|
||
int n_mmproj_embd = clip_n_mmproj_embd(ctx_clip);
|
||
ctx->image_embd_v.resize(image_tokens->n_tokens() * n_mmproj_embd);
|
||
bool ok = false;
|
||
|
||
if (clip_is_llava(ctx_clip)
|
||
|| proj_type == PROJECTOR_TYPE_MINICPMV
|
||
|| proj_type == PROJECTOR_TYPE_GLM_EDGE
|
||
|| proj_type == PROJECTOR_TYPE_INTERNVL
|
||
|| proj_type == PROJECTOR_TYPE_DEEPSEEKOCR2
|
||
|| proj_type == PROJECTOR_TYPE_GRANITE4_VISION) {
|
||
// TODO @ngxson : llava does not support batched encoding ; this should be fixed inside clip_image_batch_encode()
|
||
const auto & entries = image_tokens->batch_f32.entries;
|
||
// entries may have different token counts
|
||
// e.g., DeepSeek-OCR-2: 144 tokens per tile view, 257 for the global view
|
||
size_t offset = 0;
|
||
for (size_t i = 0; i < entries.size(); i++) {
|
||
int n_tokens_per_image = clip_n_output_tokens(ctx_clip, entries[i].get());
|
||
ok = clip_image_encode(
ctx_clip,
ctx->n_threads,
entries[i].get(),
ctx->image_embd_v.data() + offset);
if (!ok) {
// stop at the first failed entry so a later success cannot mask the failure
break;
}
offset += static_cast<size_t>(n_mmproj_embd) * n_tokens_per_image;
|
||
}
|
||
} else {
|
||
ok = clip_image_batch_encode(
|
||
ctx_clip,
|
||
ctx->n_threads,
|
||
&image_tokens->batch_f32,
|
||
ctx->image_embd_v.data());
|
||
}
|
||
|
||
return ok ? 0 : 1;
|
||
}
|
||
|
||
float * mtmd_get_output_embd(mtmd_context * ctx) {
    return ctx->image_embd_v.data();
}

bool mtmd_decode_use_non_causal(const mtmd_context * ctx, const mtmd_input_chunk * chunk) {
    auto proj_type = ctx->proj_type_v();
    if (chunk && chunk->type == MTMD_INPUT_CHUNK_TYPE_AUDIO) {
        proj_type = ctx->proj_type_a();
    }
    switch (proj_type) {
        case PROJECTOR_TYPE_GEMMA3:
        case PROJECTOR_TYPE_GEMMA4V:
        case PROJECTOR_TYPE_GEMMA4UV:
            return true;
        default:
            return false;
    }
}

bool mtmd_decode_use_mrope(const mtmd_context * ctx) {
    return ctx->pos_type == MTMD_POS_TYPE_MROPE;
}

bool mtmd_support_vision(const mtmd_context * ctx) {
    return ctx->ctx_v != nullptr;
}

bool mtmd_support_audio(const mtmd_context * ctx) {
    return ctx->ctx_a != nullptr;
}

int mtmd_get_audio_sample_rate(const mtmd_context * ctx) {
    if (!ctx->ctx_a) {
        return -1;
    }
    return clip_get_hparams(ctx->ctx_a)->audio_sample_rate;
}

//
// public API functions
//

// mtmd_bitmap

mtmd_bitmap * mtmd_bitmap_init(uint32_t nx,
                               uint32_t ny,
                               const unsigned char * data) {
    mtmd_bitmap * bitmap = new mtmd_bitmap;
    bitmap->nx = nx;
    bitmap->ny = ny;
    size_t data_size = (size_t)nx * ny * 3;
    bitmap->data.resize(data_size);
    std::memcpy(bitmap->data.data(), data, data_size);
    return bitmap;
}

mtmd_bitmap * mtmd_bitmap_init_from_audio(size_t n_samples,
                                          const float * data) {
    mtmd_bitmap * bitmap = new mtmd_bitmap;
    bitmap->nx = n_samples;
    bitmap->ny = 1;
    bitmap->is_audio = true;
    size_t data_size = n_samples * sizeof(float);
    bitmap->data.resize(data_size);
    std::memcpy(bitmap->data.data(), data, data_size);
    return bitmap;
}

uint32_t mtmd_bitmap_get_nx(const mtmd_bitmap * bitmap) {
    return bitmap->nx;
}

uint32_t mtmd_bitmap_get_ny(const mtmd_bitmap * bitmap) {
    return bitmap->ny;
}

const unsigned char * mtmd_bitmap_get_data(const mtmd_bitmap * bitmap) {
    return bitmap->data.data();
}

size_t mtmd_bitmap_get_n_bytes(const mtmd_bitmap * bitmap) {
    return bitmap->data.size();
}

bool mtmd_bitmap_is_audio(const mtmd_bitmap * bitmap) {
    return bitmap->is_audio;
}

const char * mtmd_bitmap_get_id(const mtmd_bitmap * bitmap) {
    return bitmap->id.c_str();
}

void mtmd_bitmap_set_id(mtmd_bitmap * bitmap, const char * id) {
    if (id) {
        bitmap->id = std::string(id);
    } else {
        bitmap->id.clear();
    }
}

void mtmd_bitmap_free(mtmd_bitmap * bitmap) {
    if (bitmap) {
        delete bitmap;
    }
}

// mtmd_input_chunks

mtmd_input_chunks * mtmd_input_chunks_init() {
    return new mtmd_input_chunks;
}

size_t mtmd_input_chunks_size(const mtmd_input_chunks * chunks) {
    return chunks->entries.size();
}

const mtmd_input_chunk * mtmd_input_chunks_get(const mtmd_input_chunks * chunks, size_t idx) {
    if (idx >= chunks->entries.size()) {
        return nullptr;
    }
    return &chunks->entries[idx];
}

void mtmd_input_chunks_free(mtmd_input_chunks * chunks) {
    if (chunks) {
        delete chunks;
    }
}

// mtmd_input_chunk

enum mtmd_input_chunk_type mtmd_input_chunk_get_type(const mtmd_input_chunk * chunk) {
    return chunk->type;
}

const llama_token * mtmd_input_chunk_get_tokens_text(const mtmd_input_chunk * chunk, size_t * n_tokens_output) {
    if (chunk->type == MTMD_INPUT_CHUNK_TYPE_TEXT) {
        *n_tokens_output = chunk->tokens_text.size();
        return chunk->tokens_text.data();
    }
    *n_tokens_output = 0;
    return nullptr;
}

const mtmd_image_tokens * mtmd_input_chunk_get_tokens_image(const mtmd_input_chunk * chunk) {
    if (chunk->type == MTMD_INPUT_CHUNK_TYPE_IMAGE) {
        return chunk->tokens_image.get();
    }
    return nullptr;
}

size_t mtmd_input_chunk_get_n_tokens(const mtmd_input_chunk * chunk) {
    if (chunk->type == MTMD_INPUT_CHUNK_TYPE_TEXT) {
        return chunk->tokens_text.size();
    } else if (chunk->type == MTMD_INPUT_CHUNK_TYPE_IMAGE) {
        return mtmd_image_tokens_get_n_tokens(chunk->tokens_image.get());
    } else if (chunk->type == MTMD_INPUT_CHUNK_TYPE_AUDIO) {
        return chunk->tokens_audio->n_tokens;
    } else {
        GGML_ABORT("invalid chunk type");
    }
}

llama_pos mtmd_input_chunk_get_n_pos(const mtmd_input_chunk * chunk) {
    if (chunk->type == MTMD_INPUT_CHUNK_TYPE_TEXT) {
        return chunk->tokens_text.size();
    } else if (chunk->type == MTMD_INPUT_CHUNK_TYPE_IMAGE) {
        return mtmd_image_tokens_get_n_pos(chunk->tokens_image.get());
    } else if (chunk->type == MTMD_INPUT_CHUNK_TYPE_AUDIO) {
        return chunk->tokens_audio->n_tokens;
    } else {
        GGML_ABORT("invalid chunk type");
    }
}

const char * mtmd_input_chunk_get_id(const mtmd_input_chunk * chunk) {
    if (chunk->type == MTMD_INPUT_CHUNK_TYPE_IMAGE) {
        return chunk->tokens_image->id.c_str();
    } else if (chunk->type == MTMD_INPUT_CHUNK_TYPE_AUDIO) {
        return chunk->tokens_audio->id.c_str();
    }
    return nullptr;
}

mtmd_input_chunk * mtmd_input_chunk_copy(const mtmd_input_chunk * chunk) {
    mtmd_input_chunk * copy = new mtmd_input_chunk{
        chunk->type,
        chunk->tokens_text,
        nullptr,
        nullptr,
    };
    if (chunk->tokens_image) {
        // copy the image tokens
        copy->tokens_image = mtmd_image_tokens_ptr(new mtmd_image_tokens());
        *copy->tokens_image = chunk->tokens_image->clone();
    }
    if (chunk->tokens_audio) {
        // copy the audio tokens
        copy->tokens_audio = mtmd_audio_tokens_ptr(new mtmd_audio_tokens());
        *copy->tokens_audio = chunk->tokens_audio->clone();
    }
    return copy;
}

void mtmd_input_chunk_free(mtmd_input_chunk * chunk) {
    if (chunk) {
        delete chunk;
    }
}

// mtmd_image_tokens

size_t mtmd_image_tokens_get_n_tokens(const mtmd_image_tokens * image_tokens) {
    return image_tokens->n_tokens();
}

size_t mtmd_image_tokens_get_nx(const mtmd_image_tokens * image_tokens) {
    return image_tokens->nx;
}

size_t mtmd_image_tokens_get_ny(const mtmd_image_tokens * image_tokens) {
    return image_tokens->ny;
}

mtmd_decoder_pos mtmd_image_tokens_get_decoder_pos(const mtmd_image_tokens * image_tokens, llama_pos pos_0, size_t i) {
    mtmd_decoder_pos pos;
    switch (image_tokens->pos) {
        case MTMD_POS_TYPE_MROPE:
            {
                pos.t = pos_0;
                pos.x = pos_0 + (i % image_tokens->nx);
                pos.y = pos_0 + (i / image_tokens->nx);
                pos.z = 0; // unused for now
            } break;
        case MTMD_POS_TYPE_NORMAL:
            {
                pos.t = pos_0 + i;
                pos.x = pos_0 + i;
                pos.y = pos_0 + i;
                pos.z = pos_0 + i;
            } break;
        case MTMD_POS_TYPE_HUNYUANVL:
            {
                // HunyuanVL layout: [BOI] [row0 tokens + newline] ... [row(ny-1) tokens + newline] [EOI]
                // Total = 1 + ny*(nx+1) + 1. BOI and EOI use sequential positions in every dim;
                // content and row-newline tokens use (row, col) with XD-RoPE dim-3 = image_idx.
                const uint32_t nx = image_tokens->nx;
                const uint32_t n_total = image_tokens->n_tokens();
                if (i == 0) {
                    // BOI
                    pos.t = pos_0 + i;
                    pos.x = pos_0 + i;
                    pos.y = pos_0 + i;
                    pos.z = pos_0 + i;
                } else if (i == n_total - 1) {
                    // EOI
                    pos.t = pos_0 + i;
                    pos.x = pos_0 + i;
                    pos.y = pos_0 + i;
                    pos.z = pos_0 + i;
                } else {
                    // content token at (row, col), or the trailing newline of a row (col == nx)
                    // section 0 = sequential, section 1 = w(col), section 2 = h(row), section 3 = image_count.
                    // set_position_mrope_2d writes .y -> section 1 and .x -> section 2
                    const uint32_t offset = (uint32_t)i - 1;
                    const uint32_t row = offset / (nx + 1);
                    const uint32_t col = offset % (nx + 1);
                    pos.t = pos_0 + i;
                    pos.x = row;
                    pos.y = col;
                    pos.z = image_tokens->image_idx;
                }
            } break;
        default:
            GGML_ABORT("invalid position type");
    }
    return pos;
}

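/*
 * Illustrative sketch, not part of mtmd: the HunyuanVL branch of
 * mtmd_image_tokens_get_decoder_pos maps a flat token index to (row, col)
 * inside a grid whose rows each carry one trailing newline token. The struct
 * `demo_pos` and the function `demo_hunyuan_pos` are hypothetical stand-ins,
 * and the total token count is assumed to be 1 + ny*(nx+1) + 1 as stated in
 * the layout comment, rather than read from a real mtmd_image_tokens.
 *
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for mtmd_decoder_pos, for illustration only.
struct demo_pos {
    int32_t t, x, y, z;
};

// Assumed layout: [BOI] + ny rows of (nx content tokens + 1 newline) + [EOI].
demo_pos demo_hunyuan_pos(uint32_t nx, uint32_t ny, int32_t pos_0, size_t i, int32_t image_idx) {
    const size_t  n_total = 1 + static_cast<size_t>(ny) * (nx + 1) + 1;
    const int32_t seq     = pos_0 + static_cast<int32_t>(i);
    if (i == 0 || i == n_total - 1) {
        // BOI and EOI: sequential position in every dimension
        return { seq, seq, seq, seq };
    }
    const uint32_t offset = static_cast<uint32_t>(i) - 1;               // skip BOI
    const int32_t  row    = static_cast<int32_t>(offset / (nx + 1));
    const int32_t  col    = static_cast<int32_t>(offset % (nx + 1));    // col == nx is the row's newline
    return { seq, row, col, image_idx };
}
```
 *
 * For nx = 2, ny = 2 the sequence has 8 tokens. Index 3 is the newline of row 0
 * (row 0, col 2), index 4 starts row 1 (row 1, col 0), and index 7 is EOI.
 */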
const char * mtmd_image_tokens_get_id(const mtmd_image_tokens * image_tokens) {
    return image_tokens->id.c_str();
}

llama_pos mtmd_image_tokens_get_n_pos(const mtmd_image_tokens * image_tokens) {
    switch (image_tokens->pos) {
        case MTMD_POS_TYPE_MROPE:
            return std::max(image_tokens->nx, image_tokens->ny);
        case MTMD_POS_TYPE_NORMAL:
            return image_tokens->n_tokens();
        case MTMD_POS_TYPE_HUNYUANVL:
            // HunyuanVL: the sequential (dim-0) position advances by the full token count
            // (includes BOI/EOI and row newline tokens), not by max(nx, ny)
            return image_tokens->n_tokens();
        default:
            GGML_ABORT("invalid position type");
    }
}

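/*
 * Illustrative sketch, not part of mtmd: under the M-RoPE branch, token i of an
 * image gets x = pos_0 + (i % nx) and y = pos_0 + (i / nx), and
 * mtmd_image_tokens_get_n_pos returns max(nx, ny). The hypothetical helpers
 * below check that advancing by max(nx, ny) moves past every position the
 * image used. They assume the image holds exactly nx * ny tokens, which the
 * code above does not itself guarantee.
 *
```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: largest x/y position assigned to any token of an
// nx-by-ny grid that starts at pos_0 (mirrors the M-RoPE arithmetic).
int32_t demo_mrope_max_pos(uint32_t nx, uint32_t ny, int32_t pos_0) {
    int32_t max_pos = pos_0;
    const size_t n_tokens = static_cast<size_t>(nx) * ny; // assumed token count
    for (size_t i = 0; i < n_tokens; i++) {
        const int32_t x = pos_0 + static_cast<int32_t>(i % nx);
        const int32_t y = pos_0 + static_cast<int32_t>(i / nx);
        max_pos = std::max({ max_pos, x, y });
    }
    return max_pos;
}

// Hypothetical helper: number of positions consumed by the image.
int32_t demo_mrope_n_pos(uint32_t nx, uint32_t ny) {
    return static_cast<int32_t>(std::max(nx, ny));
}
```
 *
 * For a 4x3 grid starting at position 10, the largest position used is 13 and
 * the image consumes 4 positions, so the next chunk can start at 14.
 */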
// test function

mtmd_input_chunks * mtmd_test_create_input_chunks() {
    mtmd_input_chunks * chunks = mtmd_input_chunks_init();
    if (!chunks) {
        return nullptr;
    }

    // create a text chunk
    std::vector<llama_token> tokens_text = { 1, 2, 3, 4, 5 };
    mtmd_input_chunk chunk_text{
        MTMD_INPUT_CHUNK_TYPE_TEXT,
        std::move(tokens_text),
        nullptr, // image tokens
        nullptr, // audio tokens
    };
    chunks->entries.emplace_back(std::move(chunk_text));

    // create an image chunk
    mtmd_image_tokens_ptr image_tokens(new mtmd_image_tokens);
    image_tokens->nx = 4;
    image_tokens->ny = 4;
    image_tokens->batch_f32.entries.resize(16);
    image_tokens->id = "image_1";
    mtmd_input_chunk chunk_image{
        MTMD_INPUT_CHUNK_TYPE_IMAGE,
        {}, // text tokens
        std::move(image_tokens),
        nullptr, // audio tokens
    };
    chunks->entries.emplace_back(std::move(chunk_image));

    return chunks;
}

void mtmd_log_set(ggml_log_callback log_callback, void * user_data) {
    g_logger_state.log_callback = log_callback ? log_callback : clip_log_callback_default;
    g_logger_state.log_callback_user_data = user_data;
}

struct mtmd_caps mtmd_get_cap_from_file(const char * fname) {
    try {
        auto tmp = clip_get_cap(fname);
        mtmd_caps cap;
        cap.inp_audio  = tmp.has_audio;
        cap.inp_vision = tmp.has_vision;
        return cap;
    } catch (const std::exception & e) {
        LOG_ERR("%s: failed to get capabilities from file '%s': %s\n", __func__, fname, e.what());
        return mtmd_caps{ false, false };
    }
}

//
// Debugging API (NOT intended for public use)
//

static void mtmd_debug_encode_impl(mtmd_context * ctx, clip_ctx * ctx_clip, clip_image_f32 & image) {
    clip_set_debug_output_embeddings(ctx_clip, true);
    int n_mmproj_embd = clip_n_mmproj_embd(ctx_clip);
    int n_tokens = clip_n_output_tokens(ctx_clip, &image);
    std::vector<float> embd_output(n_tokens * n_mmproj_embd, 0.0f);
    bool ok = clip_image_encode(
        ctx_clip,
        ctx->n_threads,
        &image,
        embd_output.data());
    if (!ok) {
        LOG_ERR("%s: failed to encode image\n", __func__);
    }
}

void mtmd_debug_encode_image(mtmd_context * ctx, const std::vector<std::vector<float>> & image) {
    if (!ctx->ctx_v) {
        LOG_ERR("%s: model does not support vision input\n", __func__);
        return;
    }
    clip_image_f32 inp_image;
    inp_image.nx = image.size();
    inp_image.ny = inp_image.nx;
    inp_image.buf.reserve(inp_image.nx * inp_image.ny);
    for (const auto & row : image) {
        inp_image.buf.insert(inp_image.buf.end(), row.begin(), row.end());
    }
    LOG_INF("%s: created input image with nx=%d, ny=%d\n", __func__, inp_image.nx, inp_image.ny);
    mtmd_debug_encode_impl(ctx, ctx->ctx_v, inp_image);
}

void mtmd_debug_encode_audio(mtmd_context * ctx, const std::vector<float> & input) {
    if (!ctx->ctx_a) {
        LOG_ERR("%s: model does not support audio input\n", __func__);
        return;
    }
    int n_mel = clip_get_hparams(ctx->ctx_a)->n_mel_bins;
    clip_image_f32 inp_audio;
    inp_audio.nx = input.size();
    inp_audio.ny = n_mel;
    inp_audio.buf.resize(input.size() * n_mel);
    for (size_t i = 0; i < input.size(); i++) {
        for (int j = 0; j < n_mel; j++) {
            inp_audio.buf[j * inp_audio.nx + i] = input[i];
        }
    }
    LOG_INF("%s: created input audio with nx=%d, ny=%d\n", __func__, inp_audio.nx, inp_audio.ny);
    mtmd_debug_encode_impl(ctx, ctx->ctx_a, inp_audio);
}

void mtmd_debug_preprocess_image(mtmd_context * ctx, const std::vector<uint8_t> & rgb_values, int nx, int ny) {
    if (!ctx->ctx_v) {
        LOG_ERR("%s: model does not support vision input\n", __func__);
        return;
    }
    clip_image_u8 img_u8;
    img_u8.nx = nx;
    img_u8.ny = ny;
    img_u8.buf = rgb_values;
    clip_image_f32_batch batch_f32;
    GGML_ASSERT(ctx->image_preproc != nullptr);
    bool ok = ctx->image_preproc->preprocess(img_u8, batch_f32);
    if (!ok) {
        LOG_ERR("%s: failed to preprocess image\n", __func__);
        return;
    }
    LOG_INF("%s: preprocessed image to batch_f32 with %d entries\n", __func__, (int)batch_f32.entries.size());
    for (size_t i = 0; i < batch_f32.entries.size(); i++) {
        LOG_INF("%s: entry %zu has nx=%d, ny=%d\n", __func__, i, batch_f32.entries[i]->nx, batch_f32.entries[i]->ny);
        // TODO: better way to dump entry content?
    }
}

void mtmd_debug_preprocess_audio(mtmd_context * ctx, const std::vector<float> & samples) {
    if (!ctx->ctx_a) {
        LOG_ERR("%s: model does not support audio input\n", __func__);
        return;
    }
    std::vector<mtmd_audio_mel> mel_spec_chunks;
    bool ok = ctx->audio_preproc->preprocess(samples.data(), samples.size(), mel_spec_chunks);
    if (!ok) {
        LOG_ERR("%s: failed to preprocess audio\n", __func__);
        return;
    }
    LOG_INF("%s: preprocessed audio to %zu mel spec chunks\n", __func__, mel_spec_chunks.size());
    for (size_t i = 0; i < mel_spec_chunks.size(); i++) {
        LOG_INF("%s: mel spec chunk %zu has n_len=%d, n_mel=%d\n", __func__, i, mel_spec_chunks[i].n_len, mel_spec_chunks[i].n_mel);

        // dump mel entries: data is stored as [n_mel][n_len] (mel-major)
        const auto & mel = mel_spec_chunks[i];
        for (int m = 0; m < mel.n_mel; m++) {
            for (int t = 0; t < mel.n_len; t++) {
                LOG_INF("mel[%zu][m=%d][t=%d] = %f\n", i, m, t, mel.data[m * mel.n_len + t]);
            }
        }
    }
}

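/*
 * Illustrative sketch, not part of mtmd: the dump loop in
 * mtmd_debug_preprocess_audio reads a mel-major buffer, where all frames of
 * mel bin 0 come first, then all frames of bin 1, and so on. The helpers
 * `demo_mel_index` and `demo_make_mel` are hypothetical and only demonstrate
 * the m * n_len + t indexing on a plain std::vector<float>.
 *
```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: flat index of (mel bin m, frame t) in a mel-major buffer.
inline size_t demo_mel_index(int m, int t, int n_len) {
    return static_cast<size_t>(m) * n_len + t;
}

// Builds a small mel-major buffer in which each value encodes its own (m, t)
// as m * 100 + t, so the layout can be checked by inspection.
std::vector<float> demo_make_mel(int n_mel, int n_len) {
    std::vector<float> data(static_cast<size_t>(n_mel) * n_len);
    for (int m = 0; m < n_mel; m++) {
        for (int t = 0; t < n_len; t++) {
            data[demo_mel_index(m, t, n_len)] = static_cast<float>(m * 100 + t);
        }
    }
    return data;
}
```
 *
 * With n_mel = 2 and n_len = 3, element (m=1, t=2) sits at flat index 5 and the
 * first element of bin 1 sits at index 3, directly after the three frames of
 * bin 0.
 */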
static void stub_log_callback(enum ggml_log_level, const char *, void *) {
    // do nothing
}

std::map<ggml_backend_dev_t, size_t> mtmd_get_memory_usage(const char * mmproj_fname,
                                                           struct mtmd_context_params ctx_params) {
    mtmd::context_ptr ctx;
    auto saved_log_callback  = g_logger_state.log_callback;
    auto saved_log_user_data = g_logger_state.log_callback_user_data;
    try {
        mtmd_log_set(stub_log_callback, nullptr); // suppress logging
        ctx.reset(new mtmd_context(mmproj_fname, nullptr, ctx_params));
        mtmd_log_set(saved_log_callback, saved_log_user_data); // restore log callback
        std::map<ggml_backend_dev_t, size_t> total_mem;
        auto merge = [&](const struct clip_ctx * c) {
            for (auto & [dev, size] : clip_get_mem_usage(c)) {
                total_mem[dev] += size;
            }
        };
        if (ctx->ctx_v) {
            merge(ctx->ctx_v);
        }
        if (ctx->ctx_a) {
            merge(ctx->ctx_a);
        }
        return total_mem;
    } catch (const std::exception & e) {
        mtmd_log_set(saved_log_callback, saved_log_user_data); // restore log callback
        LOG_ERR("%s: error: %s\n", __func__, e.what());
        return {};
    }
}