The article is based on the following source:
We will focus on the LLaVA variant in this article.
Vision language models
By vision language models here we are referring to LLMs with vision capability. The LLaVA variant of VLM consists of a pretrained vision module and a pretrained LLM, combined through a module called the multimodal projector plus some fine-tuning. The following figure demonstrates the relations between the components.
Training process at a high level
LLaVA introduces a module called the multimodal projector to project the vision input into the embedding space of the LLM tokens.
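As a hedged illustration, the projector can be as simple as a small MLP mapping vision-encoder features into the LLM's token embedding space (LLaVA 1.5 uses a two-layer MLP; the dimensions below are illustrative, not nanoVLM's actual config values):

```python
import torch.nn as nn

# Illustrative dimensions only: vision feature width -> LLM hidden width
vision_dim, llm_dim = 1024, 4096

# A LLaVA-1.5-style multimodal projector: two linear layers with a GELU in between
projector = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)
```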
- Multimodal projector training. Use GPT-4 to generate questions whose answers are the captions. Use this QA set to train the multimodal projector while freezing the other modules (see the sketch after this list)
- Instruction fine-tuning. Unfreeze the text decoder and fine-tune the model with instructions and the corresponding answers
- The image encoder is frozen during the whole process
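A minimal PyTorch sketch of how the two stages can be configured, assuming hypothetical attribute names vision_encoder, projector, and text_decoder (the actual nanoVLM/LLaVA code organizes its modules differently):

```python
def set_requires_grad(module, flag: bool):
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage: str):
    # The image encoder stays frozen in both stages
    set_requires_grad(model.vision_encoder, False)
    if stage == "projector_pretraining":
        # Stage 1: train only the multimodal projector
        set_requires_grad(model.projector, True)
        set_requires_grad(model.text_decoder, False)
    elif stage == "instruction_finetuning":
        # Stage 2: unfreeze the text decoder as well
        set_requires_grad(model.projector, True)
        set_requires_grad(model.text_decoder, True)
```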
nanoVLM
Our Colab notebook is here.
We use nanoVLM to understand the details of training a VLM. We tweaked the Colab notebook shared by the author so that it runs against the implementation code while we were studying it.
When we list the elements in a batch of data, we see the following (inspected in the sketch below):
- input_ids
- attention_mask
- images
- labels
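A quick way to confirm this, assuming a DataLoader named train_loader built from the notebook's dataset (the variable name is ours):

```python
batch = next(iter(train_loader))
for key, value in batch.items():
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value)
    print(key, shape)
# Expected keys: input_ids, attention_mask, images, labels
```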
input_ids
When we print the content of an input_ids sequence, we see the following structure
<im_end><im_end>...<im_start>user<row1_col1><image><image>...<row1_col2><image>...<image>Question...<im_end><im_start>assistant Answer: A<im_end><im_start>user...
- It starts with a sequence of <im_end> tokens
- Then comes <im_start>user followed by a sequence of image-related tokens
- After the image token sequence comes the question, followed by an <im_end>
- Next comes <im_start>assistant, followed by the corresponding answer and another <im_end>
- The above is one back-and-forth between the user and the assistant. The following <im_start> kicks off another back-and-forth between the user and the assistant (the snippet below reproduces this decoded view)
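The decoded view above can be reproduced directly from the batch (again assuming the batch and tokenizer objects from the notebook):

```python
# Decode one training sequence back into text to inspect its structure
print(tokenizer.decode(batch["input_ids"][0]))
```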
Why are there multiple <im_end> tokens at the beginning?
- <im_end> is the padding token in this notebook
- We do left-side padding here
- However, if we do print(tokenizer.padding_side), we would see that the padding side is right
- The reason is that we don't use the tokenizer to do the padding. In the codebase we do the padding ourselves here (a sketch of manual left padding follows below)
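A minimal sketch of manual left padding, assuming a pad token id and variable-length lists of token ids; nanoVLM's actual collator differs in its details:

```python
import torch

def left_pad(sequences, pad_token_id):
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad_len = max_len - len(seq)
        input_ids.append([pad_token_id] * pad_len + list(seq))
        # Padding positions are masked out (0); real tokens are kept (1)
        attention_mask.append([0] * pad_len + [1] * len(seq))
    return torch.tensor(input_ids), torch.tensor(attention_mask)

ids, mask = left_pad([[5, 6, 7], [8, 9]], pad_token_id=0)
print(ids)   # tensor([[5, 6, 7], [0, 8, 9]])
print(mask)  # tensor([[1, 1, 1], [0, 1, 1]])
```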
What are user and assistant?
- The LLM training here follows a user-agent conversation-style chat template
- The dataset here is a VQA dataset and the questions would be treated as the users’ requests while the answers are the agent’s responses. [code reference1] [chat template]
- In this dataset, one image has multiple questions. Therefore, we see multiple turns of user and agent while there is only one image at the beginning (a chat-template sketch follows below).
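A sketch of how such a multi-turn VQA conversation maps onto a chat template, assuming a Hugging Face tokenizer with a chat template configured (the exact special tokens such as <im_start>/<im_end> depend on the template nanoVLM uses):

```python
messages = [
    {"role": "user", "content": "<image tokens go here> What color is the car?"},
    {"role": "assistant", "content": "Answer: A"},
    {"role": "user", "content": "How many people are in the image?"},
    {"role": "assistant", "content": "Answer: B"},
]
# apply_chat_template renders the turns with the template's role markers
print(tokenizer.apply_chat_template(messages, tokenize=False))
```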
What is <row_1_col_1><image><image>...?
- From the config we see that an input image will be split into at most 16 regions. Each region is then segmented by the image encoder module into patches (see the worked example after this list).
- <row_1_col_1><image><image>... indicates the region at row 1 and column 1, and the <image> tokens that follow are the representations for the patches generated by the image encoder.
- In the notebook we visualize an example where one input image is split into 4 regions, as illustrated below
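A small worked example of how many <image> tokens a single region could contribute, assuming a ViT-style encoder at 336-pixel resolution with 14-pixel patches (as in openai/clip-vit-large-patch14-336); nanoVLM's projector may further downsample these patch embeddings:

```python
image_size = 336   # input resolution of one region
patch_size = 14    # ViT patch size
num_patches = (image_size // patch_size) ** 2
print(num_patches)  # 576 patch embeddings -> 576 <image> tokens before any downsampling
```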
How do the images get processed?
- It's possible that each training instance contains more than one image
- The _process_images function uses the processor to convert each image into a list of regions. (It has DynamicResize and SplitImage capability, as here.)
- The get_image_string function obtains the corresponding image_string based on the processed image (a sketch of how such a string could be built follows below)
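A hedged sketch of what the image string could look like for an image split into a 2x2 grid of regions; the real get_image_string builds it from the processed regions and the configured number of image tokens per region:

```python
def build_image_string(rows: int, cols: int, tokens_per_region: int) -> str:
    parts = []
    for r in range(1, rows + 1):
        for c in range(1, cols + 1):
            # One grid-position marker followed by the patch placeholders for that region
            parts.append(f"<row_{r}_col_{c}>" + "<image>" * tokens_per_region)
    return "".join(parts)

print(build_image_string(rows=2, cols=2, tokens_per_region=3))
# <row_1_col_1><image><image><image><row_1_col_2><image><image><image>...
```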
How does the model resolve multiple images in one training instance?
- We see three sequences of <image> tokens distributed in the decoded input ids. It means that in this data instance there are three sets of VQA, each involving one image and multiple questions.
- The question is how the model knows which image to attend to when answering a question. We don't have an answer at the moment.
labels
When we print the label tokens, we see that only the assistant's responses are used for loss computation; the other tokens are ignored (see the masking sketch below).
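A minimal sketch of how this masking is commonly implemented, assuming the usual PyTorch convention of marking ignored positions with -100 (the sentinel value nanoVLM uses may differ):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 6, 32000)                        # (batch, seq_len, vocab)
labels = torch.tensor([[-100, -100, -100, 523, 87, 2]])  # only the assistant tokens are kept
loss = F.cross_entropy(
    logits.view(-1, logits.size(-1)),
    labels.view(-1),
    ignore_index=-100,  # positions labeled -100 contribute nothing to the loss
)
print(loss)
```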
attention_mask
When we print out the attention masks, we see that only the padding <im_end> tokens are masked out. An <im_end> that is not padding is not masked.
Image encoder
In this section we introduce the image encoder used in LLaVA. The LLaVA 1.5 paper suggests that it uses openai/clip-vit-large-patch14-336.
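For reference, the vision tower can be loaded on its own with transformers (this loads only the encoder, not a full LLaVA pipeline):

```python
from transformers import CLIPVisionModel, CLIPImageProcessor

model_id = "openai/clip-vit-large-patch14-336"
vision_tower = CLIPVisionModel.from_pretrained(model_id)
processor = CLIPImageProcessor.from_pretrained(model_id)
print(vision_tower.config.hidden_size)  # 1024 for ViT-L/14
```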
Regarding CLIP, the reference sources are
Key description from the paper
- We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet
The description above indicates that the CLIP embedding was the SOTA image representation at the time, and the primary benchmark it compares against in the paper is ImageNet.
Background
- Previously, vision models were built on labeled datasets
- VGG, ImageNet
- This requires specifying the visual concepts in advance
- Directly learning the image representation from raw text is a promising alternative; it enables zero-shot transfer
Key results
- The embeddings can perform zero-shot transfer tasks
- They match the accuracy of ResNet-50 on ImageNet without using any of its 1.28M training examples
Components
Natural language supervision
- Natural language is used as the training signal
- It is much easier to scale natural-language supervision than crowd-sourced image labeling
- It connects the representation to language and enables flexible zero-shot transfer
Large datasets
- The dataset is composed from the internet
- 400M pairs
- The dataset is named WIT, for WebImageText
- The details of constructing this dataset are missing. Here is the discussion thread and Meta's effort to reproduce how the dataset was constructed
- It blends in the YFCC100M dataset
Efficient pre-training
- They tried using a CNN and a text transformer to generate the caption of a given image (a Transformer language model)
- Generating the exact caption is a very hard task
- It did not work well. On the other hand, just predicting a BOW encoding speeds up reaching the same ImageNet performance by 3x
Projection (alignment)
- The paper tried both linear and non-linear projections (see the sketch after this list)
- No big difference
- One hypothesis is that the language-model signal here is strong enough to guide the projection
- Another hypothesis is that non-linear projections mainly help image-only self-supervised representations
- Other alignment example
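A hedged sketch of the linear projection heads that map each encoder's output into the shared embedding space; the widths below are illustrative rather than the paper's exact values:

```python
import torch.nn as nn

embed_dim = 512                                      # shared image-text embedding space
image_proj = nn.Linear(1024, embed_dim, bias=False)  # vision encoder width -> joint space
text_proj = nn.Linear(768, embed_dim, bias=False)    # text encoder width -> joint space
```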
Model training
- InfoNCE loss. The loss is symmetric, as it is computed in both directions: image -> text and text -> image (see the sketch below)
- The batch size here is 32768
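A minimal sketch of the symmetric InfoNCE loss, assuming L2-normalized image and text embeddings of shape (batch, dim); the temperature is fixed here for illustration, whereas CLIP learns it:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature               # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # matching pairs sit on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)                   # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)               # text -> image direction
    return (loss_i2t + loss_t2i) / 2

print(clip_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512)))
```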