11 April, 2025

My Journey Into Vision Models

Join my adventure into the world of llama.cpp and AI vision!

Table of contents
  • Overview of Vision Models
  • Components
  • Preprocessing
  • Split into patches
  • Vision Encoder
  • Projector
  • Language Decoder
  • Decoding Images with Multiple Slices
  • What the Future Holds
  • Conclusion

Computer vision has always been an intriguing field for me. I remember that the first time I dabbled in image recognition was back in high school - just a fun little project during a summer break 😂. Good times!

Then came the golden age of large language models (LLMs), and like many others, I was super excited to see the rapid progress. I started learning about the transformer architecture and how it revolutionized not just the field of NLP, but was starting to shake things up in computer vision too.

Luckily, I joined Hugging Face in August 2024. This happily meant I could finally dedicate more time to projects like llama.cpp. Back then, vision support was still in its early stages, but I was eager to contribute. Thanks to the amazing folks at Hugging Face, I learned a ton about the latest advancements in vision models.

In this article, I want to share my journey navigating the world of vision models and the adventure of integrating them into llama.cpp. I hope this might inspire others to explore the exciting realm of computer vision and maybe even contribute to the open-source community!

Yep, that was me in the cover image, with my trusty Fuji X-E1 📷
Check out my posts about photography here

Overview of Vision Models

Most vision models nowadays have two main parts: the vision encoder and the language decoder. Vision encoders are often based on the transformer architecture, hence the catchy name "Vision Transformer" (ViT).

To wrap your head around this, imagine two people working together:

  • One person can look at an image.
  • The other person can only read descriptions.

The first person looks at the image and describes what they see to the second person. Then, the second person uses that description to answer questions or generate text.

meme illustration

That's kind of how vision models work! The vision tower (or vision encoder) acts like the first person, "looking" at the image and compressing it into a set of intermediate representations. Then, the language model (or language decoder) acts like the second person, generating answers based on those representations.

These intermediate representations can take various forms (like KV vectors for cross-attention), but the most common form is a set of embedding vectors. We'll focus mostly on those in the next sections.

Components

Here's a simplified diagram of the whole pipeline:

Image --> Preprocessing --> Pre-processed image --> Split into patches
      --> Vision Encoder (transformer) --> Projection --> Embeddings

Text prompt --> Tokenize --> Tokens

(Embeddings + Tokens) --> Language Model (transformer) --> generate --> Answer

Preprocessing

Preprocessing is where the magic begins. It typically involves:

  1. Converting the image out of its comfy file format (e.g., JPEG, PNG) into a raw bitmap.
  2. Resizing the image to a fixed size (usually shrinking it down). This might require some padding or cropping if the aspect ratio doesn't match the target.
  3. For models that can handle different image sizes, we might need to slice the image up (more on this below).
  4. Converting the image into a tensor and normalizing its pixel values.
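
As a rough illustration of steps 1, 2, and 4 (slicing is skipped here), below is a minimal sketch using PIL and PyTorch. The target size and normalization constants are placeholders; real models ship their exact values in their preprocessor config.

from PIL import Image
import numpy as np
import torch

# Placeholder values: every model defines its own input size and
# normalization constants in its preprocessor config.
TARGET_SIZE = (336, 336)
MEAN = [0.5, 0.5, 0.5]
STD  = [0.5, 0.5, 0.5]

def preprocess(path: str) -> torch.Tensor:
    # 1. Decode the file (JPEG, PNG, ...) into a raw RGB bitmap
    img = Image.open(path).convert("RGB")
    # 2. Resize to the fixed size the vision encoder expects
    #    (aspect-ratio handling via padding/cropping is omitted here)
    img = img.resize(TARGET_SIZE)
    # 4. Convert to a float tensor of shape (3, H, W) and normalize
    x = torch.from_numpy(np.array(img)).float() / 255.0
    x = x.permute(2, 0, 1)  # HWC -> CHW
    mean = torch.tensor(MEAN).view(3, 1, 1)
    std = torch.tensor(STD).view(3, 1, 1)
    return (x - mean) / std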

The vision encoder usually expects a fixed, often quite small, input image size. This means precious details can get lost if the original image is too large. To tackle this:

  • Some models (like LLaVA-UHD, MiniCPM-V) get clever and split the large image into smaller slices, processing both the slices and the downscaled original.
  • Other models just bulk up and accept larger image inputs directly (e.g., Gemma 3).
  • Most notably, Qwen2-VL uses a special positional embedding technique called M-RoPE to keep track of where patches came from, even in different-sized images, without losing spatial context. Very cool!

One thing that made me scratch my head is that while slicing seems purely algorithmic (no fancy Machine Learning needed here), the specific slicing strategy can be surprisingly complex and vary wildly between models.

Example of LLaVA-UHD's Adaptive Slicing algorithm. Source: LLaVA-UHD Github repo

IMPORTANT: Each slice is then treated like a separate image when it goes into the encoder. So, when I talk about the encoder, I might use "slice" and "image" interchangeably.
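
Slicing itself is plain bitmap manipulation. The sketch below is a deliberately naive fixed 2x2 grid plus a downscaled overview (not LLaVA-UHD's adaptive algorithm), just to illustrate the general shape of the operation.

from PIL import Image

def naive_slices(img: Image.Image, grid=(2, 2), overview_size=(336, 336)):
    # Keep a downscaled copy of the whole image, then cut a fixed grid of crops.
    # Real models (LLaVA-UHD, MiniCPM-V, ...) choose the grid adaptively.
    w, h = img.size
    slices = [img.resize(overview_size)]  # the downscaled "overview"
    slice_w, slice_h = w // grid[0], h // grid[1]
    for row in range(grid[1]):
        for col in range(grid[0]):
            box = (col * slice_w, row * slice_h,
                   (col + 1) * slice_w, (row + 1) * slice_h)
            slices.append(img.crop(box))
    return slices  # each entry then goes through the encoder like a separate image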

Split into patches

Next up, we chop the image (or slice) into smaller, equal-sized patches. Think of it like cutting a photo into a grid of tiny squares. Each patch is then flattened into a single vector.

In many implementations, this chopping isn't done with scissors, but with a math operation called 2D convolution, using a kernel size that matches the desired patch size. This step also sneakily embeds some extra information into the patches thanks to the convolution's kernel and bias.

Positional embeddings are also added to these patch vectors. This is crucial because transformers, by themselves, have no inherent sense of space - these embeddings tell the model where each patch came from in the original picture.

Illustration of splitting an image or slice into patches. Source: ResearchGate

You might wonder, "Why not just feed the whole image in?" Good question! But flattening the whole image into a single input vector would make it enormous, and the model's very first layer would have to balloon to match. Patching keeps things manageable.
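
To make this concrete, here's a minimal sketch of patchification as a 2D convolution. The patch size and hidden dimension below are placeholders; real models define their own.

import torch
import torch.nn as nn

patch_size, hidden_size = 14, 1024   # placeholder values
patch_embed = nn.Conv2d(in_channels=3, out_channels=hidden_size,
                        kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 336, 336)           # a preprocessed image tensor
patches = patch_embed(image)                  # (1, hidden_size, 24, 24)
patches = patches.flatten(2).transpose(1, 2)  # (1, 576, hidden_size): one vector per patch

# Learned positional embeddings tell the transformer where each patch came from
pos_embed = nn.Parameter(torch.zeros(1, 576, hidden_size))
patch_input = patches + pos_embed

With a 336x336 input and 14-pixel patches, the encoder sees just 576 patch vectors instead of one gigantic flattened pixel vector.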

Vision Encoder

The vision encoder is typically a transformer-based model. It takes the patches (already infused with positional info) as input and outputs a set of embedding vectors.

This part is the core of the vision processing. It's often relatively straightforward to implement (phew!), partly because we don't need to worry about the KV cache complexity found in generative language models. The transformer processes all patches in a non-causal manner. This means it gets to peek at all the patches simultaneously, figuring out the relationships between them.

Illustration of Causal vs Non-Causal Attention. Source: ResearchGate

The underlying transformer extracts features from the patches. If the image has a cat and a dog, the transformer churns out embedding vectors that somehow represent "cat-ness" and "dog-ness" derived from the patches.

Using non-causal attention is important because an object might span multiple patches. For instance, one patch might just contain the cat's majestic snoot, while another contains an ear. Non-causal attention helps the model piece the whole cat together from these different parts.
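
In code, "non-causal" simply means no attention mask is applied, so every patch attends to every other patch. Here's a rough sketch using PyTorch's built-in encoder layer; the dimensions and layer count are placeholders, and real vision encoders implement their own blocks.

import torch
import torch.nn as nn

# One transformer encoder block; real vision encoders stack many of these
layer = nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

patch_input = torch.randn(1, 576, 1024)  # patch vectors with positional info added
embeddings = encoder(patch_input)        # no mask passed in -> non-causal attention
# embeddings: (1, 576, 1024), one output vector per patch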

Of course, trying to directly interpret what's in these embedding vectors is like trying to understand abstract art - meaningful, but tricky! Below is a highly simplified idea:

Simplified example of what embedding vectors might represent

Projector

Okay, so we have a nice set of embedding vectors representing the image. But wait! In most cases, the language model expects input vectors of a different size (dimension) than what the vision encoder spits out. Uh oh.

To bridge this gap, we need a projector. The simplest method is often an MLP (multilayer perceptron) involving a couple of matrix multiplications with an activation function (like GELU) sandwiched in between. Here’s a basic idea in code:

import torch
import torch.nn as nn

class Projector(nn.Module):
    # input_dim:   dimension of the vision encoder's output vectors
    # hidden_size: dimension of the MLP's hidden layer
    # output_dim:  dimension the language model expects
    def __init__(self, input_dim, hidden_size, output_dim):
        super(Projector, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_size)
        self.activation = nn.GELU()
        self.fc2 = nn.Linear(hidden_size, output_dim)

    def forward(self, x):
        # input: n vectors of size input_dim
        x = self.fc1(x) # Project from input_dim to hidden_size
        x = self.activation(x)
        x = self.fc2(x) # Project from hidden_size to output_dim
        return x # output: n vectors of size output_dim

But life isn't always simple! Some models get fancier:

  • MiniCPM-V throws another transformer 😂 into the mix just for projection!
  • Models like Gemma 3, Phi-4-multimodal, and MobileVLM use a Pool2D layer to reduce the number of output vectors, kind of like "summarizing" the image embeddings, so fewer tokens get fed into the language model.
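
For the pooling idea in the last bullet, a rough sketch (not any specific model's exact layer): reshape the patch embeddings back into their 2D grid and average neighboring positions, cutting the token count by 4x in this example.

import torch
import torch.nn as nn

embeddings = torch.randn(1, 576, 1024)        # a 24x24 patch grid, flattened
grid = embeddings.transpose(1, 2).reshape(1, 1024, 24, 24)
pooled = nn.AvgPool2d(kernel_size=2)(grid)    # (1, 1024, 12, 12)
reduced = pooled.flatten(2).transpose(1, 2)   # (1, 144, 1024): 4x fewer tokens for the LLM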

Yeah, because of these varying complexities, the projector is often one of the trickiest parts to reimplement accurately.

Language Decoder

"Hold on," you might be thinking, "language models work with tokens (which are basically just numbers), right? How can we feed them these continuous embedding vectors?"

Excellent question! Turns out, the text tokens also get converted into embedding vectors internally within the language model. Most (if not all) LLMs have a built-in lookup table (usually a tensor called embed_tokens.weight) that maps each token ID to its corresponding embedding vector.

So, what's actually happening looks more like this:

ViT --> Projection --> Image Embeddings

Text prompt --> Tokenize --> Tokens --> Lookup embeddings --> Text Embeddings

(Image Embeddings + Text Embeddings) --> LLM Transformer
    --> Output Embeddings --> Calculate logits & sampling --> Next tokens

See? From the language model's perspective, the image embeddings are just another set of input vectors, seamlessly concatenated with the text embeddings. The key difference is that image embeddings are dynamic (they change based on the input image), whereas text token embeddings are typically learned and fixed.
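
Conceptually, the merge looks something like this (a sketch with placeholder dimensions and made-up token IDs; embed_tokens plays the role of the lookup table mentioned above):

import torch
import torch.nn as nn

vocab_size, hidden_size = 32000, 4096                  # placeholder dimensions
embed_tokens = nn.Embedding(vocab_size, hidden_size)   # the LLM's token lookup table

text_token_ids = torch.tensor([[1, 2, 3]])             # tokenized text prompt (made-up IDs)
text_embeds = embed_tokens(text_token_ids)             # (1, 3, hidden_size)

image_embeds = torch.randn(1, 576, hidden_size)        # output of the projector

# From here on, the LLM just sees one long sequence of embedding vectors
inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)  # (1, 579, hidden_size)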

During training, the model learns to associate these image embeddings with the surrounding text context. If it sees the image embeddings corresponding to a cat, it learns that generating the word "cat" (or related concepts) is appropriate in that context.

Decoding Images with Multiple Slices

So far, we've mostly talked about single images or slices. But things get spicier when models handle multiple slices (like LLaVA-UHD, MiniCPM-V, Idefics). How does the language model know which embeddings belong to which slice, or where they fit spatially?

A common technique is to use special 'marker' tokens in the input sequence to delineate the embeddings from different slices or rows of slices. For example, with an image split into 4 slices plus the downscaled original:

[Downscaled Image] --> [Slice 1] --> [Slice 2]
                   |
                   +--> [Slice 3] --> [Slice 4]

The input embeddings fed to the LLM might be structured like this (using hypothetical special tokens):

<image>[Downscaled Image]</image>\n
<slice>[Slice 1]</slice><slice>[Slice 2]</slice>\n
<slice>[Slice 3]</slice><slice>[Slice 4]</slice>\n

Of course, the exact special tokens (<image>, <slice>, <row>, etc.) and structure vary significantly between models.
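
As a sketch of how such a layout might be assembled in practice (the marker tokens, their IDs, and the dimensions here are all hypothetical, not taken from any real model):

import torch
import torch.nn as nn

hidden_size = 4096                                # placeholder
embed_tokens = nn.Embedding(32000, hidden_size)   # the LLM's token lookup table

# Hypothetical marker-token IDs; real models define their own special tokens
IMG_START, IMG_END, SLICE_START, SLICE_END, NEWLINE = (
    torch.tensor([i]) for i in (10, 11, 12, 13, 14)
)

def build_sequence(overview, slice_rows):
    # overview:   (n, hidden_size) embeddings of the downscaled image
    # slice_rows: list of rows, each row a list of (n, hidden_size) slice embeddings
    parts = [embed_tokens(IMG_START), overview,
             embed_tokens(IMG_END), embed_tokens(NEWLINE)]
    for row in slice_rows:
        for s in row:
            parts += [embed_tokens(SLICE_START), s, embed_tokens(SLICE_END)]
        parts.append(embed_tokens(NEWLINE))
    return torch.cat(parts, dim=0)  # one flat embedding sequence for the language model

overview = torch.randn(64, hidden_size)
slices = [[torch.randn(64, hidden_size) for _ in range(2)] for _ in range(2)]
sequence = build_sequence(overview, slices)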

Some models, like Qwen2-VL, take a different path, using the M-RoPE technique mentioned earlier to implicitly encode the 2D position of each patch's embeddings, avoiding the need for explicit slice tokens.

Illustration of M-RoPE. Source: Qwen2-VL technical report

This wide variety means figuring out the correct way to handle multi-slice embeddings in downstream projects like llama.cpp can feel like assembling IKEA furniture without instructions - definitely challenging!

What the Future Holds

We've journeyed through the ideas and inner workings of vision models.

As you've seen, the vision encoder is essentially a way to translate images into a language (embeddings) that the LLM can understand. The cool part is, this 'encoder-decoder' concept isn't just for pictures! We can apply the same fundamental idea to other modalities like audio or video. The main difference would be swapping out the vision encoder for an encoder suited to that specific modality.

For instance, to process video input, a model like Qwen2.5-Omni uses its vision encoder frame-by-frame for the visual stream and another transformer to encode the audio stream. The outputs from both encoders are then fed into the language model to understand the combined audio-visual input.

Illustration of Qwen2.5-Omni video processing. Source: Qwen2.5-Omni-7B model card

The possibilities for multi-modal AI are vast and incredibly exciting!

Conclusion

In this article, I've shared a bit about my journey wrestling with vision models and hopefully shed some light on how these fascinating beasts work under the hood. From the high-level concepts to the nitty-gritty implementation details (and occasional headaches!), it's been quite an adventure.

I hope this glimpse into the world of computer vision inspires you to explore further and perhaps even dive into contributing to the vibrant open-source AI community. There's always more to learn and build!

Discussion
    MostHumble:
    Thanks for sharing,

    I'm somewhat familiar with the whole VLM pipeline but enjoyed reading.
    I was curious why you didn't speak about CLIP-style pretraining. As far as I understand, it's a way to "align" image embeddings with text embeddings. Is it because it's not common in current pretraining strategies (i.e., jumping directly into next-token prediction and learning a common representation in the batch), or just because it didn't fit the scope of the blog?
    ngxson:
    @MostHumble thanks for raising a very good point. Indeed, when writing the draft version, I wanted to include a section about pretraining. I do know about CLIP, but it's quite old tbh, and since I don't particularly work on pretraining (that's not my main domain at Hugging Face), I felt I should only focus on what I'm coding in llama.cpp.

    But yeah, it seems like it could be a good idea to talk about pretraining of multimodal models in the future. I think it would be fun to see how people pretrain all the different kinds of encoders/decoders. Audio input/output is something I've recently enjoyed playing with.
    MostHumble:
    @ngxson thanks for your reply. Indeed, the CLIP *model* is outdated, but I was referring to the paradigm (i.e., Contrastive Language-Image Pretraining, which AFAIK is the same with SigLIP(2)).

    Again, great blog! Keep them coming, ahah!
    Dampfinchen:
    Thank you for adding Gemma 3 with vision to llama.cpp; that is huge for a project that has been quite lacking in terms of multimodal input. Your work is very much appreciated! Are you planning to implement vision for Llama 4 as well, or is that harder to implement than Gemma 3's vision?

    The blog is a very nice read as well; even someone not so deep into LLMs like me can understand a lot. I do wonder, though, if there will be a point in time where we ditch the vision encoder and have the language model perceive images natively. I'm sure this would result in a much deeper understanding of images and videos. I think models like Gemma 3 are already pretrained with images.
    ngxson:
    @Dampfinchen Yes, I do plan to work on Llama 4 vision, but that will be at a later stage. AFAIK, Llama 4 is the same as Gemma 3 or any other model that uses embeddings as the intermediate representation between the encoder and decoder. This makes it fairly easy to reimplement, unlike Llama 3, where cross-attention is used.

    I don't have a good answer to your question about why a language model can't process vision data directly. But I think it's the same reason animal brains always have a dedicated part for vision perception: the input signals from vision and language are vastly different, so it makes more sense to have two different "circuits" to process them. However, I do think the vision encoder can be even smaller (using fewer parameters) than what we currently have.

    For Gemma 3, I think they refer to the fact that images appear in the pretraining stage of the model (i.e., are part of the dataset), but that doesn't mean Gemma 3 can understand images without an encoder.