Inside Multimodal LLaMA 3.2: Understanding Meta’s Vision-Language Model Architecture

Jianing Qi
Nov 28, 2024


In September 2024, Meta released Llama 3.2 with multimodal support (MLLaMA), its latest advancement in multimodal AI integrating vision and language capabilities. While Meta's blog post highlights the model's impressive performance, the technical details of how MLLaMA actually bridges visual and language understanding remain largely unexplored. Most discussions focus on benchmarks and capabilities, leaving a gap in understanding its architectural innovations.

As researchers and practitioners in the field of multimodal AI, we've seen various approaches to combining vision and language models (VLMs). Models like LLaVA take the vision encoder's output and project it into the language model's input space, as shown in the following diagram.

Typical VLM structure. Taken from Huggingface “Vision Language Models Explained” blog

MLLaMA takes a particularly interesting approach with its two-stage vision processing and strategic cross-attention integration.

Through examination of the model architecture and source code, this technical deep dive aims to unpack MLLaMA's architectural decisions and their implications. The focus is on the Llama 3.2 11B Vision model, which can be found on Huggingface. We'll explore how its vision encoder differs from traditional Vision Transformers (ViT) despite similar foundations, and how it strategically preserves multi-level visual features through intermediate layer outputs. Of particular interest is its integration strategy, which uses cross-attention at specific intervals rather than the more common approaches of early fusion or late fusion.

The model consists of three main components:

  1. A vision encoder with a unique two-stage architecture: a 32-layer encoder followed by an 8-layer global encoder
  2. A language model based on the LLaMA 3.1 architecture with 40 transformer layers (32 self-attention layers from the base LLM interleaved with 8 cross-attention layers for integration)
  3. An integration mechanism using projected concatenated features and strategically placed cross-attention layers

In this post, we'll do a detailed technical analysis of each component, examining their architectures, interconnections, and the reasoning behind these design choices. We'll look at specific implementation details from the source code and discuss how these architectural decisions enable effective multimodal understanding. The code to load the model and run inference can be found on Huggingface.

The Big Picture: MLLaMA’s Architecture Overview

Core Architecture

Like many other VLMs, MLLaMA's core architecture can be broken down into three primary components: (1) a vision encoder, (2) a language model, and (3) an integration mechanism.

  1. Vision Encoder: The vision processing pipeline diverges from traditional ViT implementations by introducing a two-stage process. The first stage consists of a 32-layer transformer that processes patched image inputs while preserving intermediate representations. This is followed by an 8-layer global encoder with gated attention mechanisms. The key innovation here is the concatenation of intermediate features (1280 dimensions) with the final output, creating a rich visual representation that captures multiple levels of visual understanding.
  2. Language Model: Building on LLaMA 3.1’s architecture, the language component implements a 40-layer decoder-only transformer with a 4096-dimensional hidden size. What makes this implementation particularly interesting is its integration of cross-attention layers at regular intervals (every 5th layer), allowing for systematic visual grounding throughout the text generation process.
  3. Integration Mechanism: The integration strategy employs a projection layer that maps the concatenated vision features to the language model’s 4096-dimensional space. This projection serves as more than just a dimensionality reduction — it learns to align visual and language semantic spaces.
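
If you want to verify these numbers yourself, they can be read directly from the released config. The short sketch below does this with Hugging Face transformers; the attribute names follow the Mllama config classes at the time of writing, so treat them as assumptions and check your installed version (access to the gated repository is required).

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("meta-llama/Llama-3.2-11B-Vision")

print(cfg.vision_config.num_hidden_layers)            # 32-layer first-stage encoder
print(cfg.vision_config.num_global_layers)            # 8-layer global encoder
print(cfg.vision_config.intermediate_layers_indices)  # which intermediate outputs are kept
print(cfg.vision_config.vision_output_dim)            # 7680 = 1280 x 6 concatenated features
print(cfg.text_config.num_hidden_layers)              # 40 decoder layers
print(cfg.text_config.hidden_size)                    # 4096
print(cfg.text_config.cross_attention_layers)         # [3, 8, 13, 18, 23, 28, 33, 38]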

The following graph gives an overview of the high-level relationships between these modules.

An overview of the MLLaMA integration

The Vision Encoder: Two Stages of Understanding

Examining MLLaMA's vision encoder implementation makes clear that it departs from the standard Vision Transformer approach by introducing a two-stage architecture. Let's dive into how this two-stage vision processing system works and why its design choices matter.

First, here is the printed module tree of the vision model so we have a clear reference.

(vision_model): MllamaVisionModel(
  (patch_embedding): Conv2d(3, 1280, kernel_size=(14, 14), stride=(14, 14), padding=valid, bias=False)
  (gated_positional_embedding): MllamaPrecomputedPositionEmbedding(
    (tile_embedding): Embedding(9, 8197120)
  )
  (pre_tile_positional_embedding): MllamaPrecomputedAspectRatioEmbedding(
    (embedding): Embedding(9, 5120)
  )
  (post_tile_positional_embedding): MllamaPrecomputedAspectRatioEmbedding(
    (embedding): Embedding(9, 5120)
  )
  (layernorm_pre): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
  (layernorm_post): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
  (transformer): MllamaVisionEncoder(
    (layers): ModuleList(
      (0-31): 32 x MllamaVisionEncoderLayer(
        (self_attn): MllamaVisionSdpaAttention(
          (q_proj): Linear(in_features=1280, out_features=1280, bias=False)
          (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
          (v_proj): Linear(in_features=1280, out_features=1280, bias=False)
          (o_proj): Linear(in_features=1280, out_features=1280, bias=False)
        )
        (mlp): MllamaVisionMLP(
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1280, out_features=5120, bias=True)
          (fc2): Linear(in_features=5120, out_features=1280, bias=True)
        )
        (input_layernorm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      )
    )
  )
  (global_transformer): MllamaVisionEncoder(
    (layers): ModuleList(
      (0-7): 8 x MllamaVisionEncoderLayer(
        (self_attn): MllamaVisionSdpaAttention(
          (q_proj): Linear(in_features=1280, out_features=1280, bias=False)
          (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
          (v_proj): Linear(in_features=1280, out_features=1280, bias=False)
          (o_proj): Linear(in_features=1280, out_features=1280, bias=False)
        )
        (mlp): MllamaVisionMLP(
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1280, out_features=5120, bias=True)
          (fc2): Linear(in_features=5120, out_features=1280, bias=True)
        )
        (input_layernorm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      )
    )
  )
)

Before we discuss the encoders, it's crucial to understand how MLLaMA preprocesses images. The vision config specifies "image_size": 560 and "patch_size": 14, so each 560×560 image tile is split into a 40×40 grid of patches, each encoded as a 1280-dimensional vector (1600 patches plus a class token gives the 1601 positions per tile implied by the 4 × 1601 × 1280 = 8197120-wide tile_embedding above; larger images are split into up to four tiles). This is a higher input resolution than most VLMs: many use clip-vit-large-patch14-336, which produces only a 24×24 grid of patches.
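
As a quick sanity check, the patch arithmetic can be reproduced in a few lines. The numbers are read off the config and the module tree above and should be treated as assumptions to verify against your local checkpoint.

# Patch math for one image tile.
image_size, patch_size, hidden_dim, max_tiles = 560, 14, 1280, 4

grid = image_size // patch_size                    # 40 patches per side
patches_per_tile = grid * grid + 1                 # 1600 patches + 1 class token = 1601
print(grid, patches_per_tile)                      # 40 1601
print(max_tiles * patches_per_tile * hidden_dim)   # 8197120 -- the tile_embedding width above

# For comparison, the clip-vit-large-patch14-336 encoder used by many VLMs:
print((336 // 14) ** 2)                            # 576 patches (24 x 24 grid)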

First Stage Vision Encoder

self.transformer = MllamaVisionEncoder(config, config.num_hidden_layers, is_gated=False)

This 32-layer encoder processes all patches globally, similar to a standard ViT. The key innovation lies in how it preserves information throughout the processing pipeline: the encoder saves the outputs of selected intermediate layers.

# Collect intermediate layer outputs from encoder output
all_intermediate_hidden_states = output[1]
intermediate_hidden_states = torch.stack(all_intermediate_hidden_states, dim=-1)
intermediate_hidden_states = intermediate_hidden_states[..., self.intermediate_layers_indices]

From the config we can find intermediate_layers_indices. The model specifically preserves the outputs of layers 3, 7, 15, 23, and 30, creating a multi-scale feature hierarchy. This design choice lets the model retain access to different levels of visual abstraction.

The Global Encoder: Integration and Gating

self.global_transformer = MllamaVisionEncoder(config, config.num_global_layers, is_gated=True)

The 8-layer global encoder introduces two key architectural innovations:

Gated Attention Mechanisms:

import math

import torch
from torch import nn


class MllamaVisionEncoderLayer(nn.Module):
    def __init__(self, config: MllamaVisionConfig, is_gated: bool = False):
        super().__init__()

        self.hidden_size = config.hidden_size
        self.num_attention_heads = config.attention_heads
        self.is_gated = is_gated
        self.intermediate_size = config.intermediate_size

        self.self_attn = MLLAMA_VISION_ATTENTION_CLASSES[config._attn_implementation](config)
        self.mlp = MllamaVisionMLP(config)

        self.input_layernorm = nn.LayerNorm(self.hidden_size, eps=config.norm_eps)
        self.post_attention_layernorm = nn.LayerNorm(self.hidden_size, eps=config.norm_eps)

        if is_gated:
            # Scalar gates for the attention and FFN residual branches,
            # initialized to pi/4 and passed through tanh() in the forward pass.
            self.gate_attn = nn.Parameter(torch.ones(1) * math.pi / 4)
            self.gate_ffn = nn.Parameter(torch.ones(1) * math.pi / 4)

These learnable gates provide fine-grained control over information flow, allowing the model to selectively emphasize or suppress different aspects of visual features.
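To make the gating concrete, here is a minimal, self-contained sketch of the gated-residual pattern. It paraphrases the idea rather than the exact Transformers forward pass; the stand-in sublayer and dimensions are for illustration only.

import torch
from torch import nn

def gated_residual(x, sublayer, gate):
    """Residual connection whose sublayer branch is scaled by tanh(gate).

    Mirrors the pattern in MllamaVisionEncoderLayer: the gate is a scalar
    parameter initialized to pi/4, so tanh(pi/4) ~= 0.66 at the start of training.
    """
    return x + gate.tanh() * sublayer(x)

# Toy usage with a stand-in sublayer (illustrative dimensions only).
gate_attn = nn.Parameter(torch.ones(1) * torch.pi / 4)
attn_stub = nn.Linear(1280, 1280)                       # stands in for layernorm + self-attention
x = torch.randn(1, 1601, 1280)                          # [batch*tiles, patches, hidden]
print(gated_residual(x, attn_stub, gate_attn).shape)    # torch.Size([1, 1601, 1280])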

Feature Integration:

The global encoder processes the outputs while maintaining connections to the saved intermediate features. This creates a rich visual representation that preserves both high-level understanding and detailed visual information.

The Feature Concatenation Strategy

Perhaps the most interesting aspect is how MLLaMA handles the final feature representation:

# Concatenate final hidden state and intermediate hidden states
hidden_state = torch.cat([hidden_state, intermediate_hidden_states], dim=-1)

This concatenation results in a 7680-dimensional vector (1280 × 6), combining:

  • The final global encoder output
  • Five sets of intermediate features
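
As a toy shape walk-through (random tensors, not the exact reshaping used in the Transformers code), the stack, select, and concatenate sequence looks roughly like this:

import torch

num_layers, seq_len, dim = 32, 1601, 1280
intermediate_layers_indices = [3, 7, 15, 23, 30]

# Pretend these are the per-layer hidden states returned by the 32-layer encoder.
all_hidden_states = [torch.randn(1, seq_len, dim) for _ in range(num_layers)]

intermediate = torch.stack(all_hidden_states, dim=-1)           # [1, 1601, 1280, 32]
intermediate = intermediate[..., intermediate_layers_indices]   # [1, 1601, 1280, 5]
intermediate = intermediate.flatten(-2)                         # [1, 1601, 6400]

final_hidden = torch.randn(1, seq_len, dim)                     # global encoder output
fused = torch.cat([final_hidden, intermediate], dim=-1)
print(fused.shape)                                              # torch.Size([1, 1601, 7680])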

This high-dimensional representation carries multiple levels of visual understanding, from low-level features to high-level concepts, providing the language model with a rich set of visual features to draw from during text generation. Here is an illustration of this two-stage feature extraction:

Illustration of two stage feature extraction

Architectural Implications

This two-stage design with intermediate feature preservation offers several advantages:

  1. Multi-scale Understanding: The model maintains access to different levels of visual abstraction, useful for various types of visual reasoning.
  2. Information Preservation: Important visual details that might be lost in deep processing are preserved through intermediate features.
  3. Controlled Integration: The gated mechanisms in the global encoder provide learnable control over how different levels of visual information are combined.

MLLaMA's vision encoder thus delivers a multi-level integration of visual features, and its enlarged input resolution likely helps it capture finer visual details.

The Language Model: LLaMA 3.1 as the Foundation

While MLLaMA's vision processing introduces notable innovations, its language model component builds upon the established LLaMA 3.1 architecture. However, the implementation includes specific modifications and design choices that make it particularly well-suited for multimodal processing. Let's examine how this language model component is structured and adapted for vision-language tasks. Here is the language model overview; only the first MllamaSelfAttentionDecoderLayer and the first MllamaCrossAttentionDecoderLayer are shown in full detail.

(language_model): MllamaForCausalLM(
  (model): MllamaTextModel(
    (embed_tokens): Embedding(128264, 4096, padding_idx=128004)
    (layers): ModuleList(
      (0-2): 3 x MllamaSelfAttentionDecoderLayer(
        (self_attn): MllamaTextSelfSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MllamaTextMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
      )
      (3): MllamaCrossAttentionDecoderLayer(
        (cross_attn): MllamaTextCrossSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (q_norm): MllamaTextRMSNorm((128,), eps=1e-05)
          (k_norm): MllamaTextRMSNorm((128,), eps=1e-05)
        )
        (input_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
        (mlp): MllamaTextMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (post_attention_layernorm): MllamaTextRMSNorm((4096,), eps=1e-05)
      )
      (4-7): 4 x MllamaSelfAttentionDecoderLayer
      (8): MllamaCrossAttentionDecoderLayer
      (9-12): 4 x MllamaSelfAttentionDecoderLayer
      (13): MllamaCrossAttentionDecoderLayer
      (14-17): 4 x MllamaSelfAttentionDecoderLayer
      (18): MllamaCrossAttentionDecoderLayer
      (19-22): 4 x MllamaSelfAttentionDecoderLayer
      (23): MllamaCrossAttentionDecoderLayer
      (24-27): 4 x MllamaSelfAttentionDecoderLayer
      (28): MllamaCrossAttentionDecoderLayer
      (29-32): 4 x MllamaSelfAttentionDecoderLayer
      (33): MllamaCrossAttentionDecoderLayer
      (34-37): 4 x MllamaSelfAttentionDecoderLayer
      (38): MllamaCrossAttentionDecoderLayer
      (39): MllamaSelfAttentionDecoderLayer
    )
    (norm): MllamaTextRMSNorm((4096,), eps=1e-05)
    (rotary_emb): MllamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
)

Core Architecture Overview

The language model consists of 40 transformer layers with a hidden size of 4096 dimensions. What makes it particularly interesting is how it alternates between standard self-attention layers and cross-attention layers. Here’s the key architectural implementation:

class MllamaTextModel(MllamaPreTrainedModel):
    config_class = MllamaTextConfig
    base_model_prefix = "language_model.model"

    def __init__(self, config: MllamaTextConfig):
        super().__init__(config)
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size
        # 8 extra embedding rows beyond the vocabulary for multimodal special
        # tokens such as the <|image|> placeholder.
        self.embed_tokens = nn.Embedding(config.vocab_size + 8, config.hidden_size, self.padding_idx)
        self.cross_attention_layers = config.cross_attention_layers

        layers = []
        for layer_idx in range(config.num_hidden_layers):
            if layer_idx in self.cross_attention_layers:
                layers.append(MllamaCrossAttentionDecoderLayer(config, layer_idx))
            else:
                layers.append(MllamaSelfAttentionDecoderLayer(config, layer_idx))

        self.layers = nn.ModuleList(layers)
        self.norm = MllamaTextRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.rotary_emb = MllamaRotaryEmbedding(config=config)
        self.gradient_checkpointing = False
        self.post_init()
This alternating structure places a cross-attention layer every 5 layers (at layers 3, 8, 13, 18, 23, 28, 33, and 38), creating regular checkpoints where visual information can influence text generation. We can also see the design principles in the Llama 3.2 blog post from Meta:

To add image input support, we trained a set of adapter weights that integrate the pre-trained image encoder into the pre-trained language model. The adapter consists of a series of cross-attention layers that feed image encoder representations into the language model. We trained the adapter on text-image pairs to align the image representations with the language representations. During adapter training, we also updated the parameters of the image encoder, but intentionally did not update the language-model parameters. By doing that, we keep all the text-only capabilities intact, providing developers a drop-in replacement for Llama 3.1 models.

We can infer that the self-attention layers of the language model are loaded directly from the Llama 3.1 models, and that the main training effort is focused on the vision encoder and the cross-attention layers. The cross-attention layers are discussed in more detail in the next section.

Possible LLM modifications

One modification is visible directly in the module tree above: embed_tokens has 128264 rows (config.vocab_size + 8) while lm_head outputs 128256 logits. The extra eight embedding rows hold special tokens such as the <|image|> placeholder, which the model can consume as input but whose logits the lm_head never produces.

Bridging Vision and Language: The Integration Strategy

The integration of vision and language features represents one of the most crucial architectural decisions in multimodal models. MLLaMA’s approach is particularly interesting as it implements a multi-stage integration strategy that goes beyond simple feature concatenation or single-point fusion. Let’s examine how MLLaMA bridges these two modalities effectively.

The Multimodal Projection Layer

At the heart of MLLaMA’s integration strategy lies the multimodal projector, which serves as the first bridge between visual and linguistic features:

self.multi_modal_projector = nn.Linear(
    config.vision_config.vision_output_dim,
    config.text_config.hidden_size,
    bias=True,
)

This projection is more sophisticated than it might appear at first glance. The 7680-dimensional input carries multiple levels of visual understanding:

  • Final outputs from the global encoder
  • Saved intermediate features from different processing stages
  • Combined low-level and global-level visual context

The projection to 4096 dimensions isn’t just dimensionality reduction — it learns to align visual semantic spaces with the language model’s understanding, creating a unified semantic space for cross-modal reasoning.
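
Here is a minimal, shapes-only sketch of the projection step (the real module is the nn.Linear shown above, loaded with trained weights; the batch and token counts here are illustrative):

import torch
from torch import nn

vision_output_dim, text_hidden_size = 7680, 4096
multi_modal_projector = nn.Linear(vision_output_dim, text_hidden_size, bias=True)

vision_features = torch.randn(1, 4 * 1601, vision_output_dim)   # [batch, tiles*patches, 7680]
cross_attention_states = multi_modal_projector(vision_features)
print(cross_attention_states.shape)                             # torch.Size([1, 6404, 4096])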

Strategic Cross-Attention Integration

The cross-attention layers are where the multimodal integration actually happens: they are specifically designed to consume visual information. The mechanism is implemented through a specialized decoder layer, MllamaCrossAttentionDecoderLayer.

The output of the vision model, after projection, becomes cross_attention_states, which supplies the keys and values computed in the MllamaTextCrossAttention module.

The cross-attention operation itself is carefully designed:

  1. Queries come from the language model’s text processing
  2. Keys and values are derived from the projected visual features
  3. Gating mechanisms control information flow
  4. Separate normalization ensures stable training

Cross Attention Illustration
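
To make these four points concrete, here is a simplified, self-contained sketch of a gated cross-attention layer. It is not the exact Mllama implementation (which adds per-head q/k normalization, key-value caching, and other details); dimensions are shrunk for illustration.

import torch
from torch import nn

class ToyGatedCrossAttentionLayer(nn.Module):
    def __init__(self, hidden_size=256, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.input_layernorm = nn.LayerNorm(hidden_size)
        self.post_attention_layernorm = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size), nn.SiLU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )
        # Scalar gates squashed by tanh; starting at zero means the layer is
        # initially a no-op, leaving the pretrained text stack undisturbed.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.mlp_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, vision_states):
        # Queries come from text tokens; keys/values come from projected vision features.
        residual = text_hidden
        q = self.input_layernorm(text_hidden)
        attn_out, _ = self.cross_attn(q, vision_states, vision_states)
        hidden = residual + self.attn_gate.tanh() * attn_out

        residual = hidden
        mlp_out = self.mlp(self.post_attention_layernorm(hidden))
        return residual + self.mlp_gate.tanh() * mlp_out

layer = ToyGatedCrossAttentionLayer()
text = torch.randn(1, 16, 256)       # [batch, text tokens, hidden]
vision = torch.randn(1, 1601, 256)   # [batch, vision tokens, hidden] after projection
print(layer(text, vision).shape)     # torch.Size([1, 16, 256])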

Multi-Point Integration Strategy

The placement of cross-attention layers at regular intervals (every 5 layers) creates a multi-point integration strategy that offers several advantages:

Progressive Refinement:

  • Early cross-attention layers can ground initial text understanding in visual context
  • Middle layers can refine multimodal understanding
  • Later layers can ensure generation remains visually grounded

Feature Reuse:

The model can efficiently reuse projected visual features across different cross-attention layers, maintaining computational efficiency.

Architectural Implications

This integration strategy results in several important capabilities:

Flexible Attention:

  • The model can attend to different aspects of visual information at different generation stages
  • Gating mechanisms learn optimal integration patterns during training

Information Preservation:

  • The high-dimensional visual features preserve multiple levels of visual understanding
  • Regular integration points prevent loss of visual context during generation

Controlled Generation:

  • The model can modulate visual influence based on the generation task
  • Cross-attention masks enable selective use of visual information

The success of MLLaMA’s multimodal capabilities largely stems from this sophisticated integration strategy, which maintains both the richness of visual features and the fluidity of language generation while enabling meaningful cross-modal interactions.

Putting It All Together

Having examined each component of MLLaMA in detail, let’s walk through how the entire architecture works together to process an image and generate text. Understanding this end-to-end flow helps illuminate why certain architectural decisions were made and how they contribute to the model’s capabilities.

When you input an image and ask “What’s in this image?”, here’s what happens:

1. Vision Processing Stage: First, the image goes through the vision processing pipeline:

The image is:

  • Split into patches (a 40×40 grid per 560×560 tile)
  • Processed through the 32-layer local encoder
  • Key intermediate features are saved (layers 3, 7, 15, 23, and 30)
  • Processed through the 8-layer global encoder
  • Features are concatenated into a 7680-dimensional representation

2. Feature Projection Stage: The rich visual features are then projected into the language model’s semantic space

3. Language Processing Stage: As the model processes your text prompt “What’s in this image?”, several interesting things happen:

  • The text is tokenized and embedded
  • Each transformer layer processes the text
  • At cross-attention layers (every 5th layer), the model can “look” at the visual features
  • Gating mechanisms control how much visual information influences the text generation

Putting everything together
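
For completeness, here is a minimal inference sketch following the Hugging Face model card for the instruct variant. It assumes transformers >= 4.45 and access to the gated checkpoint; the image URL is a placeholder and prompt formatting may differ slightly across library versions.

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder URL -- substitute any image you want to describe.
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What's in this image?"},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))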

A Concrete Example

Let’s see how this works with a specific example. Imagine processing an image of a cat sitting on a windowsill:

1. Vision Understanding:

  • Early local encoder layers detect edges, textures, and basic shapes
  • Middle layers identify the cat’s features, the window frame, light patterns
  • Later layers understand the spatial relationship between cat and windowsill
  • Global encoder integrates these features into a coherent scene understanding

2. Feature Integration: The projected features maintain information about:

  • The cat’s appearance and pose
  • The window’s structure and lighting
  • The spatial layout of the scene
  • Abstract concepts like “relaxation” or “observation”

3. Text Generation: When generating the description, the model can:

  • Access low-level visual details when describing specific features
  • Use high-level understanding for overall scene interpretation
  • Maintain consistency through regular visual grounding
  • Generate contextually appropriate descriptions by selectively attending to relevant visual features

Conclusion: MLLaMA’s Architectural Innovations and Future Implications

After diving deep into MLLaMA’s architecture, we can appreciate how Meta’s approach to multimodal AI represents a thoughtful evolution in vision-language model design.

The key insights from MLLaMA’s design teach us several important lessons about multimodal architecture design. First, the preservation of multi-level visual features shows us that maintaining access to different levels of understanding is crucial for robust visual reasoning. The model’s ability to reference both low-level details and high-level concepts allows it to generate more nuanced and accurate descriptions than architectures that rely solely on final-layer features.

Second, MLLaMA’s integration strategy demonstrates the importance of controlled, strategic information flow. Rather than trying to merge visual and linguistic information all at once or waiting until the final stages, the regular cross-attention checkpoints with learnable gates allow the model to develop sophisticated patterns of multimodal reasoning.

For practitioners working on multimodal AI systems, MLLaMA's design offers practical insights beyond early-fusion model designs. Its strong performance reveals the potential of multi-level integration between a vision encoder and a language model. Furthermore, the language model can be frozen entirely during training, preserving its text-only capabilities.

As we continue to push the boundaries of multimodal AI, the community needs more open-source efforts: not just releasing models and code, but also sharing architectural details, design principles, training recipes, and data usage. That is the key to including and inspiring more people in this research.

Citation

Cited as:

Qi, Jianing. (Nov 2024). Inside MLLaMA 3.2: Understanding Meta’s Vision-Language Model Architecture. Medium. https://j-qi.medium.com/inside-mllama-3-2-understanding-metas-vision-language-model-architecture-ae12ad24dcbf

@article{qi24insidemllama,
  title   = "Inside MLLaMA 3.2: Understanding Meta's Vision-Language Model Architecture",
  author  = "Qi, Jianing",
  journal = "Medium",
  year    = "2024",
  month   = "Nov",
  url     = "https://j-qi.medium.com/inside-mllama-3-2-understanding-metas-vision-language-model-architecture-ae12ad24dcbf"
}
