nanoVLM implements a compact Vision-Language Model architecture designed for educational purposes and efficient training. The system integrates a Vision Transformer (ViT) encoder, a Grouped Query Attention (GQA) based language model, and a modality projector that bridges the two modalities through pixel shuffle operations and linear projections.
The architecture follows a three-stage pipeline pattern: visual feature extraction, cross-modal projection, and autoregressive text generation. The system is designed to process interleaved image-text inputs and generate coherent multimodal responses.
Key Architecture Points:
Vision Encoder: Uses SigLIP2-base-patch16-512 with 12 transformer blocks, processing 512×512 images into 1024 patches (a 32×32 grid, plus an optional CLS token) with 768-dimensional features (models/config.py:6-15).
Modality Projector: Implements pixel shuffle with factor 4, reducing 1024 visual tokens to 64 image tokens (a 16× reduction) through spatial rearrangement followed by a linear projection (models/modality_projector.py:23-38).
Language Model: SmolLM2-360M backbone with 32 blocks, 15 query heads and 5 KV heads (3:1 grouping ratio), supporting up to 8192 position embeddings via RoPE (models/config.py:17-34).
Token Fusion: Image embeddings replace special placeholder tokens (<|image|>) in the text embedding sequence, enabling seamless multimodal context processing (models/vision_language_model.py:85-92).
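The token-fusion step can be sketched as follows. This is a minimal illustration of the replace-placeholder pattern; the placeholder id and the `fuse_embeddings` helper are hypothetical, not nanoVLM's actual code.

```python
import torch

IMAGE_TOKEN_ID = 49152  # assumed id of the <|image|> placeholder

def fuse_embeddings(input_ids, text_embeds, image_embeds):
    # input_ids: (B, T); text_embeds: (B, T, D); image_embeds: (B, N_img, D)
    fused = text_embeds.clone()
    mask = input_ids == IMAGE_TOKEN_ID  # positions of placeholder tokens
    # Overwrite placeholder positions with the projected image embeddings
    fused[mask] = image_embeds.reshape(-1, image_embeds.size(-1)).to(fused.dtype)
    return fused

# Toy example: 6 text tokens, 2 of which are image placeholders
input_ids = torch.tensor([[1, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 2, 3, 4]])
text_embeds = torch.zeros(1, 6, 8)
image_embeds = torch.ones(1, 2, 8)
fused = fuse_embeddings(input_ids, text_embeds, image_embeds)
assert fused[0, 1].sum() == 8 and fused[0, 0].sum() == 0
```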
The Vision Transformer encoder implements a patch-based architecture with multi-head self-attention, designed to extract visual features from input images.
The ViTPatchEmbeddings class converts input images into a sequence of patch embeddings using a convolutional projection:
```python
# Patch extraction via 2D convolution
self.conv = nn.Conv2d(
    in_channels=3,
    out_channels=self.embd_dim,  # 768
    kernel_size=self.patch_size, # 16
    stride=self.patch_size,
    padding="valid",
)
```
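A quick shape walk-through of this convolution, assuming the config values quoted here (512×512 input, 16×16 patches, 768-dimensional embeddings):

```python
import torch
import torch.nn as nn

# Standalone sketch of the patch-embedding shapes (not the model's code)
conv = nn.Conv2d(in_channels=3, out_channels=768,
                 kernel_size=16, stride=16, padding="valid")

img = torch.randn(1, 3, 512, 512)         # one RGB image
feat = conv(img)                          # (1, 768, 32, 32) spatial grid
tokens = feat.flatten(2).transpose(1, 2)  # (1, 1024, 768) patch sequence
assert tokens.shape == (1, 1024, 768)
```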
Implementation Details:
The CLS token is disabled by default (`vit_cls_flag=False`), so the output sequence consists of patch tokens only (models/config.py:14).

The ViTMultiHeadAttention module implements bidirectional self-attention with 12 heads:
```python
# Combined QKV projection
self.qkv_proj = nn.Linear(self.embd_dim, 3 * self.embd_dim, bias=True)
# Output projection
self.out_proj = nn.Linear(self.embd_dim, self.embd_dim, bias=True)
```
Key Features:
- Uses `torch.nn.functional.scaled_dot_product_attention` when available for fused attention kernels (models/vision_transformer.py:66-86).
- Attention is non-causal (`is_causal=False`), allowing each patch to attend to all other patches (models/vision_transformer.py:85).

Data Flow:
The language model implements a decoder-only Transformer with Grouped Query Attention (GQA) and Rotary Position Embeddings (RoPE) for efficient autoregressive generation.
GQA reduces memory and computation by sharing key-value heads across multiple query heads:
```python
# GQA configuration
self.n_heads = 15     # Query heads
self.n_kv_heads = 5   # Key-Value heads
self.n_kv_groups = 3  # Queries per KV head
```
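The grouping can be sketched in isolation: the 5 KV heads are shared across the 15 query heads, so each KV head is repeated 3 times before standard attention is applied. This is an illustrative sketch, not nanoVLM's exact code.

```python
import torch

n_heads, n_kv_heads = 15, 5
n_kv_groups = n_heads // n_kv_heads  # 3 queries per KV head

B, T, head_dim = 2, 10, 64
q = torch.randn(B, n_heads, T, head_dim)
k = torch.randn(B, n_kv_heads, T, head_dim)

# Expand KV heads to match the query-head count
k_expanded = k.repeat_interleave(n_kv_groups, dim=1)
assert k_expanded.shape == (B, n_heads, T, head_dim)

# Only the 5 un-expanded heads are cached: a 3x smaller KV-cache than MHA
assert k.numel() / q.numel() == 1 / 3
```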
Architecture Benefits:
KV-Cache Implementation:
```python
# Cache management during decode
if not is_prefill and block_kv_cache['key'] is not None:
    k = torch.cat([block_kv_cache['key'], k_rotated], dim=2)
    v = torch.cat([block_kv_cache['value'], v_curr], dim=2)
    block_kv_cache['key'] = k
    block_kv_cache['value'] = v
```
The cache stores historical key-value states, so each decode step processes only the single new token rather than re-running the forward pass over the entire sequence of length T (models/language_model.py:239-252).
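A toy illustration of this cache update, using the GQA shapes quoted above (values are illustrative): each decode step appends one new key/value along the sequence dimension.

```python
import torch

B, n_kv_heads, head_dim = 1, 5, 64
cache = {"key": torch.randn(B, n_kv_heads, 7, head_dim),
         "value": torch.randn(B, n_kv_heads, 7, head_dim)}

# A single newly generated token contributes one key and one value
k_new = torch.randn(B, n_kv_heads, 1, head_dim)
v_new = torch.randn(B, n_kv_heads, 1, head_dim)

cache["key"] = torch.cat([cache["key"], k_new], dim=2)
cache["value"] = torch.cat([cache["value"], v_new], dim=2)
assert cache["key"].shape[2] == 8  # sequence length grew by one
```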
RoPE encodes positional information through rotation matrices applied to query and key vectors:
```python
def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)
```
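`rotate_half` is typically combined with precomputed cos/sin tables to rotate queries and keys per position. The sketch below illustrates the standard pattern with the `lm_re_base` value from the config; the table construction is illustrative, not necessarily nanoVLM's exact code.

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

head_dim, T = 64, 10
base = 100000  # lm_re_base from the config
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
pos = torch.arange(T).float()
angles = torch.outer(pos, inv_freq)                    # (T, head_dim/2)
cos = torch.cat([angles.cos(), angles.cos()], dim=-1)  # (T, head_dim)
sin = torch.cat([angles.sin(), angles.sin()], dim=-1)

q = torch.randn(1, 1, T, head_dim)
q_rotated = q * cos + rotate_half(q) * sin  # per-position rotation
assert q_rotated.shape == q.shape
```

Because each position is a pure rotation, vector norms are preserved, which is part of why RoPE extrapolates better than learned absolute embeddings.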
Implementation Details:
The generation loop demonstrates KV-cache usage:
```python
# Prefill phase
prompt_output, kv_cache_list = self.forward(
    generated_outputs,
    attention_mask=None,
    kv_cache=None,
    start_pos=0
)

# Decode phase with cache
for i in range(max_new_tokens):
    decode_step_output, kv_cache_list = self.forward(
        next_output,
        attention_mask=None,
        kv_cache=kv_cache_list,
        start_pos=current_token_start_pos
    )
```
Generation Flow:
The modality projector bridges the vision and language modalities through a two-stage process: pixel shuffle for token reduction followed by linear projection.
Pixel shuffle (also known as space-to-depth) reduces the number of visual tokens by spatially rearranging features:
```python
def pixel_shuffle(self, x):
    bsz, seq, embed_dim = x.size()
    seq_root = int(seq**0.5)  # 32 for 1024 patches
    height = width = seq_root
    h_out = height // self.scale_factor  # 8
    w_out = width // self.scale_factor   # 8

    # Reshape to spatial grid
    x = x.view(bsz, height, width, embed_dim)

    # Rearrange: (H, W, C) -> (H/4, W/4, C*16)
    x = x.reshape(bsz, h_out, self.scale_factor,
                  w_out, self.scale_factor, embed_dim)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    x = x.reshape(bsz, h_out * w_out, embed_dim * self.scale_factor**2)
    return x
```
Token Reduction:
After pixel shuffle, a linear layer projects the expanded features to the language model embedding dimension:
```python
self.input_dim = cfg.vit_hidden_dim * (cfg.mp_pixel_shuffle_factor**2)  # 768 * 16 = 12288
self.output_dim = cfg.lm_hidden_dim  # 960
self.proj = nn.Linear(self.input_dim, self.output_dim, bias=False)
```
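The full projector shape flow can be verified end to end. This standalone sketch (not the model's code) uses the quoted config values: 1024 patches of dim 768 → 64 tokens of dim 12288 → 64 tokens of dim 960.

```python
import torch
import torch.nn as nn

scale = 4
x = torch.randn(1, 1024, 768)  # ViT output: 1024 patches, 768-dim

# Pixel shuffle: fold 4x4 neighborhoods into the channel dimension
bsz, seq, dim = x.size()
side = int(seq ** 0.5)        # 32
h_out = w_out = side // scale  # 8
x = x.view(bsz, side, side, dim)
x = x.reshape(bsz, h_out, scale, w_out, scale, dim)
x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
x = x.reshape(bsz, h_out * w_out, dim * scale ** 2)  # (1, 64, 12288)

# Linear projection to the LM embedding dimension
proj = nn.Linear(768 * scale ** 2, 960, bias=False)
out = proj(x)
assert out.shape == (1, 64, 960)
```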
Projection Details:
The VisionLanguageModel class orchestrates the three components in a unified forward pass, handling multimodal input processing and autoregressive generation.
Pipeline Stages:
Image Processing: Images are resized, normalized, and converted to tensors via get_image_processor (train.py:117).
Vision Encoding: The ViT encoder processes images through 12 transformer blocks, outputting 1024 patch features (models/vision_language_model.py:89).
Modality Projection: Pixel shuffle reduces tokens to 64, then linear projection aligns to LM dimension (models/vision_language_model.py:90).
Embedding Fusion: Image embeddings replace placeholder tokens in the text embedding sequence (models/vision_language_model.py:92).
Prefill Phase: Full sequence processed through LM, KV-cache initialized (models/vision_language_model.py:98-103).
Decode Phase: Autoregressive token generation with KV-cache optimization (models/vision_language_model.py:116-151).
The generation method supports both greedy and nucleus sampling:
```python
if greedy:
    next_token_id = torch.argmax(current_logits, dim=-1, keepdim=True)
else:
    filtered_logits = top_k_top_p_filtering(current_logits, top_k=top_k, top_p=top_p)
    probs = torch.softmax(filtered_logits / temperature, dim=-1)
    next_token_id = torch.multinomial(probs, num_samples=1)
```
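The `top_k_top_p_filtering` helper is referenced but not shown; the sketch below is a common implementation of the technique (not necessarily nanoVLM's exact code): logits outside the top-k, or past the top-p cumulative-probability cutoff, are set to negative infinity so they receive zero sampling probability.

```python
import torch

def top_k_top_p_filtering(logits, top_k=0, top_p=1.0,
                          filter_value=float("-inf")):
    if top_k > 0:
        # Remove tokens below the k-th highest logit
        threshold = torch.topk(logits, top_k)[0][..., -1, None]
        logits = logits.masked_fill(logits < threshold, filter_value)
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        # Mask tokens once cumulative probability exceeds top_p,
        # always keeping at least the most likely token
        remove = cum_probs > top_p
        remove[..., 1:] = remove[..., :-1].clone()
        remove[..., 0] = False
        logits = logits.masked_fill(
            remove.scatter(-1, sorted_idx, remove), filter_value)
    return logits

logits = torch.tensor([[2.0, 1.0, 0.5, -1.0]])
filtered = top_k_top_p_filtering(logits, top_k=2)  # keeps only 2.0 and 1.0
```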
Sampling Parameters:
Post-generation processing ensures clean output termination:
```python
# Find first EOS token in each sequence
eos_mask = (generated_ids == self.tokenizer.eos_token_id)
first_eos_indices = torch.min(masked_col_indices, dim=1).values

# Replace all tokens after first EOS with EOS
replace_mask = col_indices > first_eos_indices.unsqueeze(1)
generated_ids[replace_mask] = self.tokenizer.eos_token_id
```
This ensures that once the model generates an EOS token, all subsequent positions are also marked as EOS, preventing garbage output (models/vision_language_model.py:159-181).
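A self-contained version of this post-processing, with an illustrative `eos_token_id` and explicit construction of the helper index tensors the excerpt above assumes:

```python
import torch

eos_token_id = 2  # illustrative value
generated_ids = torch.tensor([[5, 7, 2, 9, 4],
                              [6, 6, 6, 2, 8]])

T = generated_ids.size(1)
col_indices = torch.arange(T).expand_as(generated_ids)
eos_mask = generated_ids == eos_token_id
# Non-EOS positions get a sentinel index past the end of the sequence
masked_cols = torch.where(eos_mask, col_indices,
                          torch.full_like(col_indices, T))
first_eos = masked_cols.min(dim=1).values  # first EOS index per row

# Everything after the first EOS becomes EOS
replace_mask = col_indices > first_eos.unsqueeze(1)
generated_ids[replace_mask] = eos_token_id
assert generated_ids.tolist() == [[5, 7, 2, 2, 2], [6, 6, 6, 2, 2]]
```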
The training configuration uses different learning rates for each component:
This prevents catastrophic forgetting in pre-trained components while allowing rapid adaptation of the projector (models/config.py:59-61).
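In PyTorch this pattern is expressed with optimizer parameter groups. The sketch below uses the configured defaults (lr_mp=0.00512, 5e-5 for both backbones) with placeholder modules standing in for the real components:

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the three components
vision_backbone = nn.Linear(8, 8)
language_backbone = nn.Linear(8, 8)
projector = nn.Linear(8, 8)

# One parameter group per component, each with its own learning rate
optimizer = torch.optim.AdamW([
    {"params": projector.parameters(), "lr": 0.00512},       # lr_mp
    {"params": vision_backbone.parameters(), "lr": 5e-5},    # lr_vision_backbone
    {"params": language_backbone.parameters(), "lr": 5e-5},  # lr_language_backbone
])
assert optimizer.param_groups[0]["lr"] == 0.00512
```

The randomly initialized projector gets a learning rate roughly 100× higher than the pre-trained backbones.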
The architecture uses pixel shuffle instead of a multi-layer perceptron for token reduction:
Advantages:
Trade-offs:
GQA with 15:5 query:KV ratio balances efficiency and quality:
Rationale:
The configuration disables CLS token (vit_cls_flag=False):
Reasoning:
The vocabulary is extended by 66 tokens for VLM-specific needs:
```python
extra_token_amount: int = 66  # Image tokens, special markers
lm_vocab_size: int = lm_base_vocab_size + extra_token_amount  # 49152 + 66 = 49218
```
This includes image placeholder tokens, global image tokens, and grid position tokens (r1c1 through r8c8) for potential multi-image or high-resolution support (models/config.py:23-51).
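Extending the vocabulary implies resizing the embedding table while preserving the pre-trained rows. This is an illustrative sketch (with a reduced embedding dim for brevity; the real `lm_hidden_dim` is 960), not nanoVLM's loading code:

```python
import torch
import torch.nn as nn

lm_base_vocab_size = 49152
extra_token_amount = 66
lm_vocab_size = lm_base_vocab_size + extra_token_amount
assert lm_vocab_size == 49218

dim = 16  # reduced for illustration; the real lm_hidden_dim is 960
old_embed = nn.Embedding(lm_base_vocab_size, dim)
new_embed = nn.Embedding(lm_vocab_size, dim)
# Pre-trained rows are copied over; the 66 new rows stay freshly initialized
with torch.no_grad():
    new_embed.weight[:lm_base_vocab_size] = old_embed.weight
```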
| Component | Technology | Purpose | Selection Rationale | Alternative |
|---|---|---|---|---|
| Vision Encoder | SigLIP2-base-patch16-512 | Image feature extraction | Strong zero-shot performance, 512px native resolution | CLIP, EVA-CLIP |
| Language Model | SmolLM2-360M-Instruct | Text generation | Compact size, instruction-tuned, GQA support | Qwen2, Gemma2 |
| Attention Mechanism | Grouped Query Attention | Efficient inference | 67% KV-cache reduction, minimal quality loss | Multi-Head Attention |
| Position Encoding | RoPE | Sequence position encoding | Relative position preservation, length extrapolation | ALiBi, Absolute PE |
| Token Reduction | Pixel Shuffle | Visual token compression | Parameter-efficient, spatial preservation | Average pooling, MLP |
| Framework | PyTorch | Deep learning framework | Wide adoption, SDPA support, distributed training | JAX, TensorFlow |
| Tokenizer | HuggingFace Tokenizers | Text processing | Fast, compatible with SmolLM2 | SentencePiece |
| Training Data | FineVision Dataset | Multimodal instruction tuning | Diverse tasks, quality ratings | LLaVA, ShareGPT4V |
Dependency Highlights:
Configuration Centralization: VLMConfig serves as the single source of truth for all architectural hyperparameters, ensuring consistency across components (models/config.py:5-54).
Loose Coupling: Vision encoder, language model, and projector can be independently modified by changing configuration values without code changes.
Data-Model Separation: Dataset and collator components depend on configuration but not on model implementations, enabling independent testing and optimization.
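The centralized-configuration pattern can be sketched as a minimal dataclass. `MiniVLMConfig` is hypothetical (a reduction of the real `VLMConfig`), showing how derived quantities such as patch count and image-token count stay mutually consistent when every component reads from one config object:

```python
from dataclasses import dataclass

@dataclass
class MiniVLMConfig:
    vit_img_size: int = 512
    vit_patch_size: int = 16
    mp_pixel_shuffle_factor: int = 4

    @property
    def num_patches(self) -> int:
        # (512 / 16)^2 = 1024 patch tokens from the ViT
        return (self.vit_img_size // self.vit_patch_size) ** 2

    @property
    def image_token_length(self) -> int:
        # (32 / 4)^2 = 64 tokens after pixel shuffle
        side = self.vit_img_size // self.vit_patch_size
        return (side // self.mp_pixel_shuffle_factor) ** 2

cfg = MiniVLMConfig()
assert cfg.num_patches == 1024 and cfg.image_token_length == 64
```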
| Parameter | Default | Description |
|---|---|---|
| `vit_hidden_dim` | 768 | Embedding dimension for patches |
| `vit_patch_size` | 16 | Size of each image patch (16×16) |
| `vit_img_size` | 512 | Input image resolution (512×512) |
| `vit_n_heads` | 12 | Number of attention heads |
| `vit_n_blocks` | 12 | Number of transformer layers |
| `vit_cls_flag` | False | Whether to prepend CLS token |
| `vit_model_type` | google/siglip2-base-patch16-512 | Pre-trained weights identifier |
| Parameter | Default | Description |
|---|---|---|
| `lm_hidden_dim` | 960 | Embedding dimension |
| `lm_n_heads` | 15 | Number of query heads |
| `lm_n_kv_heads` | 5 | Number of key-value heads |
| `lm_n_blocks` | 32 | Number of transformer layers |
| `lm_max_position_embeddings` | 8192 | Maximum sequence length |
| `lm_vocab_size` | 49218 | Vocabulary size (base + extra) |
| `lm_re_base` | 100000 | RoPE base frequency |
| Parameter | Default | Description |
|---|---|---|
| `mp_pixel_shuffle_factor` | 4 | Spatial reduction factor |
| `mp_image_token_length` | 64 | Number of output image tokens |
| Parameter | Default | Description |
|---|---|---|
| `lr_mp` | 0.00512 | Learning rate for projector |
| `lr_vision_backbone` | 5e-5 | Learning rate for ViT |
| `lr_language_backbone` | 5e-5 | Learning rate for LM |
| `batch_size` | 2 | Per-device batch size |
| `gradient_accumulation_steps` | 8 | Effective batch size multiplier |
| `max_training_steps` | 40000 | Total training iterations |
The model initialization follows a structured sequence to properly load pre-trained weights and configure components:
Configuration Loading: VLMConfig and TrainConfig are instantiated with default or provided parameters (models/config.py:5-87).
Tokenizer Initialization: Tokenizer is loaded from SmolLM2 with extra VLM tokens added to vocabulary (train.py:118).
Image Processor Setup: Image processor configured with max size and resize settings (train.py:117).
Model Construction: VisionLanguageModel instantiates ViT, LM, and projector submodules.
Weight Loading: If vlm_load_backbone_weights=True, pre-trained weights are loaded from HuggingFace Hub for ViT and LM backbones (models/config.py:52).
DataLoader Creation: Datasets are loaded, wrapped with collators, and batched for training (train.py:213-235).
Training Loop: Gradient accumulation, logging, and checkpoint saving are configured based on TrainConfig.