nanoVLM implements a compact Vision-Language Model architecture designed for educational purposes and efficient training. The system integrates a Vision Transformer (ViT) encoder, a Grouped Query Attention (GQA) based language model, and a modality projector that bridges the two modalities through pixel shuffle operations and linear projections.
The architecture follows a three-stage pipeline pattern: visual feature extraction, cross-modal projection, and autoregressive text generation. The system is designed to process interleaved image-text inputs and generate coherent multimodal responses.
Key Architecture Points:
Vision Encoder: Uses SigLIP2-base-patch16-512 with 12 transformer blocks, processing 512×512 images into 1024 patches (a 32×32 grid, plus an optional CLS token) with 768-dimensional features (models/config.py:6-15).
Modality Projector: Implements pixel shuffle with factor 4, reducing 1024 visual tokens to 64 image tokens (a 16× reduction) through spatial rearrangement followed by a linear projection (models/modality_projector.py:23-38).
Language Model: SmolLM2-360M backbone with 32 blocks, 15 query heads and 5 KV heads (3:1 grouping ratio), supporting up to 8192 position embeddings via RoPE (models/config.py:17-34).
Token Fusion: Image embeddings replace special placeholder tokens (<|image|>) in the text embedding sequence, enabling seamless multimodal context processing (models/vision_language_model.py:85-92).
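The token-fusion step can be sketched as follows. This is a minimal illustration of the replace-placeholder pattern; the placeholder id and the `fuse_embeddings` helper are hypothetical, not nanoVLM's actual code.

```python
import torch

IMAGE_TOKEN_ID = 49152  # assumed id of the <|image|> placeholder

def fuse_embeddings(input_ids, text_embeds, image_embeds):
    # input_ids: (B, T); text_embeds: (B, T, D); image_embeds: (B, N_img, D)
    fused = text_embeds.clone()
    mask = input_ids == IMAGE_TOKEN_ID  # positions of placeholder tokens
    # Overwrite placeholder positions with the projected image embeddings
    fused[mask] = image_embeds.reshape(-1, image_embeds.size(-1)).to(fused.dtype)
    return fused

# Toy example: 6 text tokens, 2 of which are image placeholders
input_ids = torch.tensor([[1, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 2, 3, 4]])
text_embeds = torch.zeros(1, 6, 8)
image_embeds = torch.ones(1, 2, 8)
fused = fuse_embeddings(input_ids, text_embeds, image_embeds)
assert fused[0, 1].sum() == 8 and fused[0, 0].sum() == 0
```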
The Vision Transformer encoder implements a patch-based architecture with multi-head self-attention, designed to extract visual features from input images.
The ViTPatchEmbeddings class converts input images into a sequence of patch embeddings using a convolutional projection:
```python
# Patch extraction via 2D convolution
self.conv = nn.Conv2d(
    in_channels=3,
    out_channels=self.embd_dim,  # 768
    kernel_size=self.patch_size, # 16
    stride=self.patch_size,
    padding="valid",
)
```
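A quick shape walk-through of this convolution, assuming the config values quoted here (512×512 input, 16×16 patches, 768-dimensional embeddings):

```python
import torch
import torch.nn as nn

# Standalone sketch of the patch-embedding shapes (not the model's code)
conv = nn.Conv2d(in_channels=3, out_channels=768,
                 kernel_size=16, stride=16, padding="valid")

img = torch.randn(1, 3, 512, 512)         # one RGB image
feat = conv(img)                          # (1, 768, 32, 32) spatial grid
tokens = feat.flatten(2).transpose(1, 2)  # (1, 1024, 768) patch sequence
assert tokens.shape == (1, 1024, 768)
```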
Implementation Details:
The CLS token is disabled by default (`vit_cls_flag=False`), so the output sequence consists of patch tokens only (models/config.py:14).

The ViTMultiHeadAttention module implements bidirectional self-attention with 12 heads:
```python
# Combined QKV projection
self.qkv_proj = nn.Linear(self.embd_dim, 3 * self.embd_dim, bias=True)
# Output projection
self.out_proj = nn.Linear(self.embd_dim, self.embd_dim, bias=True)
```
Key Features:
- Uses `torch.nn.functional.scaled_dot_product_attention` when available for fused attention kernels (models/vision_transformer.py:66-86).
- Attention is non-causal (`is_causal=False`), allowing each patch to attend to all other patches (models/vision_transformer.py:85).

Data Flow:
The language model implements a decoder-only Transformer with Grouped Query Attention (GQA) and Rotary Position Embeddings (RoPE) for efficient autoregressive generation.
GQA reduces memory and computation by sharing key-value heads across multiple query heads:
```python
# GQA configuration
self.n_heads = 15     # Query heads
self.n_kv_heads = 5   # Key-Value heads
self.n_kv_groups = 3  # Queries per KV head
```
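The grouping can be sketched in isolation: the 5 KV heads are shared across the 15 query heads, so each KV head is repeated 3 times before standard attention is applied. This is an illustrative sketch, not nanoVLM's exact code.

```python
import torch

n_heads, n_kv_heads = 15, 5
n_kv_groups = n_heads // n_kv_heads  # 3 queries per KV head

B, T, head_dim = 2, 10, 64
q = torch.randn(B, n_heads, T, head_dim)
k = torch.randn(B, n_kv_heads, T, head_dim)

# Expand KV heads to match the query-head count
k_expanded = k.repeat_interleave(n_kv_groups, dim=1)
assert k_expanded.shape == (B, n_heads, T, head_dim)

# Only the 5 un-expanded heads are cached: a 3x smaller KV-cache than MHA
assert k.numel() / q.numel() == 1 / 3
```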
Architecture Benefits:
KV-Cache Implementation:
```python
# Cache management during decode
if not is_prefill and block_kv_cache['key'] is not None:
    k = torch.cat([block_kv_cache['key'], k_rotated], dim=2)
    v = torch.cat([block_kv_cache['value'], v_curr], dim=2)
    block_kv_cache['key'] = k
    block_kv_cache['value'] = v
```
The cache stores historical key-value states, so each decode step processes only the single new token rather than re-running the forward pass over the entire sequence of length T (models/language_model.py:239-252).
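A toy illustration of this cache update, using the GQA shapes quoted above (values are illustrative): each decode step appends one new key/value along the sequence dimension.

```python
import torch

B, n_kv_heads, head_dim = 1, 5, 64
cache = {"key": torch.randn(B, n_kv_heads, 7, head_dim),
         "value": torch.randn(B, n_kv_heads, 7, head_dim)}

# A single newly generated token contributes one key and one value
k_new = torch.randn(B, n_kv_heads, 1, head_dim)
v_new = torch.randn(B, n_kv_heads, 1, head_dim)

cache["key"] = torch.cat([cache["key"], k_new], dim=2)
cache["value"] = torch.cat([cache["value"], v_new], dim=2)
assert cache["key"].shape[2] == 8  # sequence length grew by one
```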
RoPE encodes positional information through rotation matrices applied to query and key vectors:
```python
def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)
```
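`rotate_half` is typically combined with precomputed cos/sin tables to rotate queries and keys per position. The sketch below illustrates the standard pattern with the `lm_re_base` value from the config; the table construction is illustrative, not necessarily nanoVLM's exact code.

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

head_dim, T = 64, 10
base = 100000  # lm_re_base from the config
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
pos = torch.arange(T).float()
angles = torch.outer(pos, inv_freq)                    # (T, head_dim/2)
cos = torch.cat([angles.cos(), angles.cos()], dim=-1)  # (T, head_dim)
sin = torch.cat([angles.sin(), angles.sin()], dim=-1)

q = torch.randn(1, 1, T, head_dim)
q_rotated = q * cos + rotate_half(q) * sin  # per-position rotation
assert q_rotated.shape == q.shape
```

Because each position is a pure rotation, vector norms are preserved, which is part of why RoPE extrapolates better than learned absolute embeddings.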
Implementation Details:
The generation loop demonstrates KV-cache usage:
```python
# Prefill phase
prompt_output, kv_cache_list = self.forward(
    generated_outputs,
    attention_mask=None,
    kv_cache=None,
    start_pos=0
)

# Decode phase with cache
for i in range(max_new_tokens):
    decode_step_output, kv_cache_list = self.forward(
        next_output,
        attention_mask=None,
        kv_cache=kv_cache_list,
        start_pos=current_token_start_pos
    )
```
Generation Flow:
The modality projector bridges the vision and language modalities through a two-stage process: pixel shuffle for token reduction followed by linear projection.
Pixel shuffle (also known as space-to-depth) reduces the number of visual tokens by spatially rearranging features:
```python
def pixel_shuffle(self, x):
    bsz, seq, embed_dim = x.size()
    seq_root = int(seq**0.5)  # 32 for 1024 patches
    height = width = seq_root
    h_out = height // self.scale_factor  # 8
    w_out = width // self.scale_factor   # 8

    # Reshape to spatial grid
    x = x.view(bsz, height, width, embed_dim)

    # Rearrange: (H, W, C) -> (H/4, W/4, C*16)
    x = x.reshape(bsz, h_out, self.scale_factor,
                  w_out, self.scale_factor, embed_dim)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    x = x.reshape(bsz, h_out * w_out, embed_dim * self.scale_factor**2)
    return x
```
Token Reduction:
After pixel shuffle, a linear layer projects the expanded features to the language model embedding dimension:
```python
self.input_dim = cfg.vit_hidden_dim * (cfg.mp_pixel_shuffle_factor**2)  # 768 * 16 = 12288
self.output_dim = cfg.lm_hidden_dim  # 960
self.proj = nn.Linear(self.input_dim, self.output_dim, bias=False)
```
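The full projector shape flow can be verified end to end. This standalone sketch (not the model's code) uses the quoted config values: 1024 patches of dim 768 → 64 tokens of dim 12288 → 64 tokens of dim 960.

```python
import torch
import torch.nn as nn

scale = 4
x = torch.randn(1, 1024, 768)  # ViT output: 1024 patches, 768-dim

# Pixel shuffle: fold 4x4 neighborhoods into the channel dimension
bsz, seq, dim = x.size()
side = int(seq ** 0.5)        # 32
h_out = w_out = side // scale  # 8
x = x.view(bsz, side, side, dim)
x = x.reshape(bsz, h_out, scale, w_out, scale, dim)
x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
x = x.reshape(bsz, h_out * w_out, dim * scale ** 2)  # (1, 64, 12288)

# Linear projection to the LM embedding dimension
proj = nn.Linear(768 * scale ** 2, 960, bias=False)
out = proj(x)
assert out.shape == (1, 64, 960)
```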
Projection Details:
The VisionLanguageModel class orchestrates the three components in a unified forward pass, handling multimodal input processing and autoregressive generation.
Pipeline Stages:
Image Processing: Images are resized, normalized, and converted to tensors via get_image_processor (train.py:117).
Vision Encoding: The ViT encoder processes images through 12 transformer blocks, outputting 1024 patch features (models/vision_language_model.py:89).
Modality Projection: Pixel shuffle reduces tokens to 64, then linear projection aligns to LM dimension (models/vision_language_model.py:90).
Embedding Fusion: Image embeddings replace placeholder tokens in the text embedding sequence (models/vision_language_model.py:92).
Prefill Phase: Full sequence processed through LM, KV-cache initialized (models/vision_language_model.py:98-103).
Decode Phase: Autoregressive token generation with KV-cache optimization (models/vision_language_model.py:116-151).
The generation method supports both greedy and nucleus sampling:
```python
if greedy:
    next_token_id = torch.argmax(current_logits, dim=-1, keepdim=True)
else:
    filtered_logits = top_k_top_p_filtering(current_logits, top_k=top_k, top_p=top_p)
    probs = torch.softmax(filtered_logits / temperature, dim=-1)
    next_token_id = torch.multinomial(probs, num_samples=1)
```
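The `top_k_top_p_filtering` helper is referenced but not shown; the sketch below is a common implementation of the technique (not necessarily nanoVLM's exact code): logits outside the top-k, or past the top-p cumulative-probability cutoff, are set to negative infinity so they receive zero sampling probability.

```python
import torch

def top_k_top_p_filtering(logits, top_k=0, top_p=1.0,
                          filter_value=float("-inf")):
    if top_k > 0:
        # Remove tokens below the k-th highest logit
        threshold = torch.topk(logits, top_k)[0][..., -1, None]
        logits = logits.masked_fill(logits < threshold, filter_value)
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        # Mask tokens once cumulative probability exceeds top_p,
        # always keeping at least the most likely token
        remove = cum_probs > top_p
        remove[..., 1:] = remove[..., :-1].clone()
        remove[..., 0] = False
        logits = logits.masked_fill(
            remove.scatter(-1, sorted_idx, remove), filter_value)
    return logits

logits = torch.tensor([[2.0, 1.0, 0.5, -1.0]])
filtered = top_k_top_p_filtering(logits, top_k=2)  # keeps only 2.0 and 1.0
```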
Sampling Parameters:
Post-generation processing ensures clean output termination:
```python
# Find first EOS token in each sequence
eos_mask = (generated_ids == self.tokenizer.eos_token_id)
first_eos_indices = torch.min(masked_col_indices, dim=1).values

# Replace all tokens after first EOS with EOS
replace_mask = col_indices > first_eos_indices.unsqueeze(1)
generated_ids[replace_mask] = self.tokenizer.eos_token_id
```
This ensures that once the model generates an EOS token, all subsequent positions are also marked as EOS, preventing garbage output (models/vision_language_model.py:159-181).
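A self-contained version of this post-processing, with an illustrative `eos_token_id` and explicit construction of the helper index tensors the excerpt above assumes:

```python
import torch

eos_token_id = 2  # illustrative value
generated_ids = torch.tensor([[5, 7, 2, 9, 4],
                              [6, 6, 6, 2, 8]])

T = generated_ids.size(1)
col_indices = torch.arange(T).expand_as(generated_ids)
eos_mask = generated_ids == eos_token_id
# Non-EOS positions get a sentinel index past the end of the sequence
masked_cols = torch.where(eos_mask, col_indices,
                          torch.full_like(col_indices, T))
first_eos = masked_cols.min(dim=1).values  # first EOS index per row

# Everything after the first EOS becomes EOS
replace_mask = col_indices > first_eos.unsqueeze(1)
generated_ids[replace_mask] = eos_token_id
assert generated_ids.tolist() == [[5, 7, 2, 2, 2], [6, 6, 6, 2, 2]]
```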
The training configuration uses different learning rates for each component:
This prevents catastrophic forgetting in pre-trained components while allowing rapid adaptation of the projector (models/config.py:59-61).
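In PyTorch this pattern is expressed with optimizer parameter groups. The sketch below uses the configured defaults (lr_mp=0.00512, 5e-5 for both backbones) with placeholder modules standing in for the real components:

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the three components
vision_backbone = nn.Linear(8, 8)
language_backbone = nn.Linear(8, 8)
projector = nn.Linear(8, 8)

# One parameter group per component, each with its own learning rate
optimizer = torch.optim.AdamW([
    {"params": projector.parameters(), "lr": 0.00512},       # lr_mp
    {"params": vision_backbone.parameters(), "lr": 5e-5},    # lr_vision_backbone
    {"params": language_backbone.parameters(), "lr": 5e-5},  # lr_language_backbone
])
assert optimizer.param_groups[0]["lr"] == 0.00512
```

The randomly initialized projector gets a learning rate roughly 100× higher than the pre-trained backbones.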
The architecture uses pixel shuffle instead of a multi-layer perceptron for token reduction:
Advantages:
Trade-offs:
GQA with 15:5 query:KV ratio balances efficiency and quality:
Rationale:
The configuration disables CLS token (vit_cls_flag=False):
Reasoning:
The vocabulary is extended by 66 tokens for VLM-specific needs:
```python
extra_token_amount: int = 66  # Image tokens, special markers
lm_vocab_size: int = lm_base_vocab_size + extra_token_amount  # 49152 + 66 = 49218
```
This includes image placeholder tokens, global image tokens, and grid position tokens (r1c1 through r8c8) for potential multi-image or high-resolution support (models/config.py:23-51).
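Extending the vocabulary implies resizing the embedding table while preserving the pre-trained rows. This is an illustrative sketch (with a reduced embedding dim for brevity; the real `lm_hidden_dim` is 960), not nanoVLM's loading code:

```python
import torch
import torch.nn as nn

lm_base_vocab_size = 49152
extra_token_amount = 66
lm_vocab_size = lm_base_vocab_size + extra_token_amount
assert lm_vocab_size == 49218

dim = 16  # reduced for illustration; the real lm_hidden_dim is 960
old_embed = nn.Embedding(lm_base_vocab_size, dim)
new_embed = nn.Embedding(lm_vocab_size, dim)
# Pre-trained rows are copied over; the 66 new rows stay freshly initialized
with torch.no_grad():
    new_embed.weight[:lm_base_vocab_size] = old_embed.weight
```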
| Component | Technology | Purpose | Selection Rationale | Alternative |
|---|---|---|---|---|
| Vision Encoder | SigLIP2-base-patch16-512 | Image feature extraction | Strong zero-shot performance, 512px native resolution | CLIP, EVA-CLIP |
| Language Model | SmolLM2-360M-Instruct | Text generation | Compact size, instruction-tuned, GQA support | Qwen2, Gemma2 |
| Attention Mechanism | Grouped Query Attention | Efficient inference | 67% KV-cache reduction, minimal quality loss | Multi-Head Attention |
| Position Encoding | RoPE | Sequence position encoding | Relative position preservation, length extrapolation | ALiBi, Absolute PE |
| Token Reduction | Pixel Shuffle | Visual token compression | Parameter-efficient, spatial preservation | Average pooling, MLP |
| Framework | PyTorch | Deep learning framework | Wide adoption, SDPA support, distributed training | JAX, TensorFlow |
| Tokenizer | HuggingFace Tokenizers | Text processing | Fast, compatible with SmolLM2 | SentencePiece |
| Training Data | FineVision Dataset | Multimodal instruction tuning | Diverse tasks, quality ratings | LLaVA, ShareGPT4V |
Dependency Highlights:
Configuration Centralization: VLMConfig serves as the single source of truth for all architectural hyperparameters, ensuring consistency across components (models/config.py:5-54).
Loose Coupling: Vision encoder, language model, and projector can be independently modified by changing configuration values without code changes.
Data-Model Separation: Dataset and collator components depend on configuration but not on model implementations, enabling independent testing and optimization.
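The centralized-configuration pattern can be sketched as a minimal dataclass. `MiniVLMConfig` is hypothetical (a reduction of the real `VLMConfig`), showing how derived quantities such as patch count and image-token count stay mutually consistent when every component reads from one config object:

```python
from dataclasses import dataclass

@dataclass
class MiniVLMConfig:
    vit_img_size: int = 512
    vit_patch_size: int = 16
    mp_pixel_shuffle_factor: int = 4

    @property
    def num_patches(self) -> int:
        # (512 / 16)^2 = 1024 patch tokens from the ViT
        return (self.vit_img_size // self.vit_patch_size) ** 2

    @property
    def image_token_length(self) -> int:
        # (32 / 4)^2 = 64 tokens after pixel shuffle
        side = self.vit_img_size // self.vit_patch_size
        return (side // self.mp_pixel_shuffle_factor) ** 2

cfg = MiniVLMConfig()
assert cfg.num_patches == 1024 and cfg.image_token_length == 64
```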
| Parameter | Default | Description |
|---|---|---|
| `vit_hidden_dim` | 768 | Embedding dimension for patches |
| `vit_patch_size` | 16 | Size of each image patch (16×16) |
| `vit_img_size` | 512 | Input image resolution (512×512) |
| `vit_n_heads` | 12 | Number of attention heads |
| `vit_n_blocks` | 12 | Number of transformer layers |
| `vit_cls_flag` | False | Whether to prepend CLS token |
| `vit_model_type` | google/siglip2-base-patch16-512 | Pre-trained weights identifier |
| Parameter | Default | Description |
|---|---|---|
| `lm_hidden_dim` | 960 | Embedding dimension |
| `lm_n_heads` | 15 | Number of query heads |
| `lm_n_kv_heads` | 5 | Number of key-value heads |
| `lm_n_blocks` | 32 | Number of transformer layers |
| `lm_max_position_embeddings` | 8192 | Maximum sequence length |
| `lm_vocab_size` | 49218 | Vocabulary size (base + extra) |
| `lm_re_base` | 100000 | RoPE base frequency |
| Parameter | Default | Description |
|---|---|---|
| `mp_pixel_shuffle_factor` | 4 | Spatial reduction factor |
| `mp_image_token_length` | 64 | Number of output image tokens |
| Parameter | Default | Description |
|---|---|---|
| `lr_mp` | 0.00512 | Learning rate for projector |
| `lr_vision_backbone` | 5e-5 | Learning rate for ViT |
| `lr_language_backbone` | 5e-5 | Learning rate for LM |
| `batch_size` | 2 | Per-device batch size |
| `gradient_accumulation_steps` | 8 | Effective batch size multiplier |
| `max_training_steps` | 40000 | Total training iterations |
The model initialization follows a structured sequence to properly load pre-trained weights and configure components:
Configuration Loading: VLMConfig and TrainConfig are instantiated with default or provided parameters (models/config.py:5-87).
Tokenizer Initialization: Tokenizer is loaded from SmolLM2 with extra VLM tokens added to vocabulary (train.py:118).
Image Processor Setup: Image processor configured with max size and resize settings (train.py:117).
Model Construction: VisionLanguageModel instantiates ViT, LM, and projector submodules.
Weight Loading: If vlm_load_backbone_weights=True, pre-trained weights are loaded from HuggingFace Hub for ViT and LM backbones (models/config.py:52).
DataLoader Creation: Datasets are loaded, wrapped with collators, and batched for training (train.py:213-235).
Training Loop: Gradient accumulation, logging, and checkpoint saving are configured based on TrainConfig.