Files
llama.cpp/tools/mtmd/README-dev.md
T
Xuan-Son Nguyen 24bba7b98e mtmd: refactor preprocessor, add mtmd_image_preproc_out (#24736)
* add mtmd_image_preproc_out

* add dev docs

* remove unused clip API

* rm unused clip_image_f32_batch::grid

* change preprocess() call signature
2026-06-18 12:04:39 +02:00

1.8 KiB

libmtmd dev guide

History

Please refer to multimodal.md for a broader context.

In short:

  • libmtmd started as a wrapper around libllava / clip.cpp
  • Various components that used to be in clip.cpp are moved progressively to mtmd. For example, preprocessor is now part of mtmd

Terminologies

  • mtmd: MulTiMoDal
  • bitmap: representing a raw input data, for example: RGB image, PCM audio
  • tiles / slices: for llava-uhd-style models, the preprocessor breaks a large input into smaller square images called tiles or slices
  • chunk: a mtmd_input_chunk represents a preprocessed input that can then be passed through mtmd_encode()

Pipeline

A typical pipeline of the core libmtmd is as follows:

  • A bitmap (RGB image or PCM audio) is created
  • Bitmap and the text prompt is provided to mtmd_tokenize() that breaks the input into chunks
    • The tokenizer function first expands a "lazy" bitmap if it finds one. Typically, this is used by video, so that one media token corresponds to one input bitmap
    • For models that support "fused" temporal frames like Qwen-VL, the tokenizer tries to merge pair of consecutive frames into one batch
    • The preprocessor will then be called, which produces a list of chunks
    • Depending on the model itself, special tokens will be injected to separate image chunks (i.e. llava-uhd-style models)
  • Multiple bitmaps may be batched together to form a larger mtmd_batch()
  • Single image or batch is encoded, via mtmd_encode() or mtmd_batch_encode()
  • Get the output embeddings

Helper

We provide a set of helper functions via mtmd_helper to make using libmtmd easier. The helper provides:

  • Image, audio and video file decoding (for example, decode raw JPEG into RGB bitmap)
  • Manage llama_batch and calls to llama_decode