nanoVLM is a lightweight, educational Vision-Language Model (VLM) implementation designed to provide the simplest possible codebase for training and fine-tuning small-scale multimodal models. The project emphasizes code readability and approachability, with the entire model definition and training logic fitting in approximately 750 lines of pure PyTorch code. Inspired by Andrej Karpathy's nanoGPT philosophy, nanoVLM serves as an educational platform for understanding VLM architecture and training pipelines without the complexity of production frameworks (README.md:26-28).
The project demonstrates that a 222M parameter model—combining SigLIP-B/16-224-85M as the vision backbone and SmolLM2-135M as the language decoder—can achieve 35.3% accuracy on the MMStar benchmark after training on approximately 1.7M samples from the Cauldron dataset for 6 hours on a single H100 GPU. This makes nanoVLM an accessible platform for researchers and practitioners to experiment with vision-language architectures, explore training configurations, and understand the interplay between visual and textual modalities (README.md:31-38).
| Component | Technology | Version/Specification |
|---|---|---|
| Core Framework | PyTorch | Pure implementation, no external trainers |
| Vision Encoder | SigLIP-B/16-224 | 85M parameters, patch16-224 resolution |
| Language Decoder | SmolLM2-135M | 135M parameters |
| Model Hub | Hugging Face Hub | Model sharing and distribution |
| Tokenizer | Custom tokenizer | With image token support |
| Training Data | The Cauldron | ~1.7M samples |
| Evaluation | MMStar | Benchmark for VLM capabilities |
nanoVLM/
├── models/
│ ├── vision_language_model.py # Main VLM orchestration (~100 lines)
│ ├── vision_transformer.py # Vision backbone (~150 lines)
│ ├── language_model.py # Language decoder (~250 lines)
│ ├── modality_projector.py # Cross-modal projection (~50 lines)
│ ├── config.py # Configuration dataclasses
│ └── utils.py # Helper functions
├── data/
│ ├── processors.py # Data preprocessing logic
│ └── collators.py # Batch collation functions
├── eval/
│ └── measure_vram.py # VRAM profiling utilities
├── train.py # Training loop (~200 lines)
├── generate.py # Inference utilities
└── README.md # Documentation
Educational Codebase Design: The entire implementation spans approximately 750 lines across four core modules, making it feasible to read and understand the complete VLM architecture in a single session. Each component is self-contained and clearly documented (README.md:26-26).
Pure PyTorch Implementation: The project deliberately avoids dependencies on high-level training frameworks like transformers.Trainer, accelerate, or deepspeed. This design choice ensures maximum transparency into the training process and allows developers to modify any aspect of the pipeline without navigating abstraction layers.
Flexible Multimodal Fusion: The architecture supports arbitrary numbers of images per sample, with the _replace_img_tokens_with_embd method handling variable image counts within a single batch. This flexibility enables training on diverse multimodal datasets without rigid preprocessing requirements (models/vision_language_model.py:36-49).
Efficient Inference with KV-Cache: The generate method implements autoregressive token sampling with KV-cache optimization, reducing computational overhead during inference by caching key-value states across decoding steps (models/vision_language_model.py:82-118).
Hugging Face Hub Integration: Trained models can be directly pushed to the Hugging Face Hub with automatic generation of config files, safetensors weights, and model cards. This integration facilitates model sharing and reproducibility (README.md:155-169).
Hardware Accessibility: The default 222M parameter model requires approximately 4.5GB VRAM for training with batch size 1, making it accessible on consumer-grade GPUs. VRAM requirements scale predictably with batch size, reaching ~38GB for batch size 128 (README.md:191-218).
Configurable Architecture: The VLMConfig system enables rapid experimentation with different backbone models, hidden dimensions, and architectural parameters without code modifications.
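As an illustration of this pattern, a configuration dataclass along these lines makes backbone and dimension swaps a matter of changing field values. The field names below are assumptions for illustration, not nanoVLM's exact schema:

```python
from dataclasses import dataclass

@dataclass
class VLMConfig:
    # Illustrative fields; names and defaults are assumptions, not nanoVLM's exact schema
    vit_model_type: str = "google/siglip-base-patch16-224"
    lm_model_type: str = "HuggingFaceTB/SmolLM2-135M"
    vit_hidden_dim: int = 768
    lm_hidden_dim: int = 576
    mp_image_token_length: int = 49

# Swap architectural parameters without touching model code
cfg = VLMConfig(lm_hidden_dim=1024)
```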
Comprehensive VRAM Profiling: The included measure_vram.py script allows developers to benchmark memory requirements for specific configurations and batch sizes on their target hardware.
The nanoVLM architecture follows a modular design pattern with clear separation of concerns between visual encoding, cross-modal projection, and language generation. The VisionLanguageModel class serves as the central orchestrator, managing data flow between components during both training and inference.
Architecture Evidence: The VisionLanguageModel.__init__ method instantiates the three core components: vision_encoder (ViT), decoder (LanguageModel), and MP (ModalityProjector). The load_backbone parameter controls whether to load pretrained weights or initialize from scratch (models/vision_language_model.py:21-34).
Data Flow Evidence: The forward method demonstrates the complete pipeline: image tensors pass through the vision encoder and modality projector, then replace image token placeholders in the text embeddings before language model processing (models/vision_language_model.py:62-80).
The VisionLanguageModel class (models/vision_language_model.py) serves as the central coordinator for all multimodal operations. It manages the lifecycle of three sub-components and provides unified interfaces for both training (via forward) and inference (via generate).
Responsibility Boundary: The orchestrator handles image preprocessing, embedding fusion, and loss computation but delegates actual encoding/decoding to specialized modules. It does not implement attention mechanisms or layer normalization directly.
Key APIs:
__init__(cfg: VLMConfig, load_backbone=True): Initializes components with optional pretrained loading
forward(input_ids, images, attention_mask, targets): Training forward pass returning logits and loss
generate(input_ids, images, attention_mask, max_new_tokens, ...): Autoregressive generation with sampling
from_pretrained(repo_id_or_path): Class method for loading saved models
Critical Data Structures:
input_ids (Tensor[B, T_seq]): Tokenized text with image token placeholders
images (List[Tensor] or Tensor[B, C, H, W]): Raw or preprocessed images
token_embd (Tensor[B, T_seq, D_lm]): Text embeddings before fusion
image_embd (Tensor[num_images, mp_token_length, D_lm]): Projected visual features
Error Handling: The _process_images method gracefully handles empty image lists by returning None, allowing text-only inference without special casing. The from_pretrained method validates the existence of both config.json and model.safetensors before attempting to load, raising descriptive ValueError exceptions for missing files (models/vision_language_model.py:51-60, models/vision_language_model.py:185-210).
The _replace_img_tokens_with_embd method implements the core multimodal fusion strategy. Rather than concatenating image and text embeddings along the sequence dimension, nanoVLM uses a placeholder token approach where image embeddings replace designated image token positions in the text embedding sequence.
Implementation Details:
This approach supports variable numbers of images per sample and maintains temporal alignment between visual and textual features. The method assumes that the number of image token placeholders in input_ids matches the number of processed image embeddings (models/vision_language_model.py:36-49).
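A minimal sketch of this placeholder-replacement strategy is shown below. The function name and arguments are hypothetical, and the use of masked_scatter is one way to implement the mechanism described, not necessarily nanoVLM's exact code:

```python
import torch

def replace_img_tokens_with_embd(token_embd, input_ids, image_token_id, image_embd):
    # token_embd: [B, T, D] text embeddings; image_embd: [N_img, mp_len, D]
    # Positions holding the image placeholder token get overwritten, in order,
    # by the rows of the flattened image embeddings.
    mask = (input_ids == image_token_id).unsqueeze(-1)      # [B, T, 1], broadcasts over D
    flat = image_embd.reshape(-1, image_embd.size(-1))      # [N_img * mp_len, D]
    # masked_scatter fills masked positions in row-major order with elements of `flat`;
    # it assumes the number of placeholder tokens equals N_img * mp_len.
    return token_embd.masked_scatter(mask, flat)
```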
The generate method implements a two-phase inference process: multimodal prefill followed by autoregressive decoding with KV-cache.
Prefill Phase: The initial prompt (text + image embeddings) is processed through the language model once, generating the first logits and populating the KV-cache with attention key-value pairs for all prompt positions.
Decode Phase: Each subsequent token is generated by sampling from the previous step's logits, embedding the sampled token, and running a single-position forward pass that attends over the cached key-value states, appending the new keys and values to the cache.
EOS Handling: Post-generation processing identifies the first EOS token in each sequence and replaces all subsequent tokens with EOS, ensuring clean output boundaries (models/vision_language_model.py:82-118).
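The EOS cleanup described above can be expressed compactly with a cumulative-sum trick. This is an illustrative sketch with hypothetical names, not nanoVLM's actual implementation:

```python
import torch

def mask_after_first_eos(tokens, eos_token_id):
    # tokens: [B, T] generated token ids.
    # seen_eos is True at the first EOS in each row and everywhere after it,
    # so every position from the first EOS onward is replaced with EOS.
    seen_eos = torch.cumsum((tokens == eos_token_id).int(), dim=1) > 0
    return torch.where(seen_eos, torch.full_like(tokens, eos_token_id), tokens)
```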
The Modality Projector (models/modality_projector.py) bridges the dimensionality gap between the vision encoder's output space and the language model's embedding space. This component transforms visual features into a format compatible with the language decoder's input expectations.
Design Pattern: The projector typically implements a linear transformation (optionally with activation) that maps from the vision encoder's hidden dimension to the language model's embedding dimension. The output is reshaped to span multiple token positions (mp_image_token_length), allowing visual information to occupy a variable number of sequence positions.
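As a concrete illustration of this design pattern, the sketch below maps vision features to the language model's dimension and reshapes them to a fixed number of token positions. The adaptive-pooling step is one simple choice and an assumption for illustration; nanoVLM's actual projector may use a different reshaping mechanism:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityProjector(nn.Module):
    """Illustrative projector: vision hidden dim -> LM embedding dim,
    pooled to mp_image_token_length sequence positions."""

    def __init__(self, vit_dim, lm_dim, mp_image_token_length):
        super().__init__()
        self.mp_len = mp_image_token_length
        self.proj = nn.Linear(vit_dim, lm_dim)

    def forward(self, x):
        # x: [N_img, num_patches, vit_dim]
        x = self.proj(x)                           # [N_img, num_patches, lm_dim]
        x = x.transpose(1, 2)                      # [N_img, lm_dim, num_patches]
        x = F.adaptive_avg_pool1d(x, self.mp_len)  # pool patches to mp_len positions
        return x.transpose(1, 2)                   # [N_img, mp_len, lm_dim]
```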
The train.py script (~200 lines) implements a complete training pipeline including data loading, optimization, and checkpointing.
Key Characteristics: The training loop follows standard PyTorch patterns, handling gradient accumulation, learning rate scheduling, periodic evaluation, and checkpointing directly in the script rather than through an external trainer abstraction.
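A minimal sketch of what such a loop can look like in pure PyTorch is shown below. The function and argument names (train_steps, accum_steps) are illustrative and do not reproduce nanoVLM's actual train.py:

```python
import torch

def train_steps(model, loader, optimizer, scheduler, accum_steps=4):
    """Gradient-accumulation training loop sketch. Assumes the model follows
    the forward(input_ids, images, attention_mask, targets) -> (logits, loss)
    interface described for VisionLanguageModel."""
    model.train()
    for step, batch in enumerate(loader):
        logits, loss = model(batch["input_ids"], batch["images"],
                             batch["attention_mask"], batch["targets"])
        # Scale so accumulated gradients match a full-batch update
        (loss / accum_steps).backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```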
The data/processors.py module handles preprocessing of images and text, including tokenization, image normalization, and formatting for the model's expected input structure. The data/collators.py module implements batch collation functions that pad sequences and stack images for efficient batched processing.
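A collator along these lines pads token sequences and stacks image tensors into a batch. This is a hedged sketch of the general pattern, with hypothetical key names, not nanoVLM's actual collators.py:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate(batch, pad_token_id):
    # batch: list of dicts with "input_ids" (1-D LongTensor) and "image" ([C, H, W])
    input_ids = pad_sequence([b["input_ids"] for b in batch],
                             batch_first=True, padding_value=pad_token_id)
    # Mask out padding positions (assumes pad_token_id does not occur in content)
    attention_mask = (input_ids != pad_token_id).long()
    images = torch.stack([b["image"] for b in batch])
    return {"input_ids": input_ids,
            "attention_mask": attention_mask,
            "images": images}
```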
| Metric | Value | Notes |
|---|---|---|
| Total Parameters (Default) | 222M | SigLIP (85M) + SmolLM2 (135M) |
| Code Lines (Core Modules) | ~750 | Excluding boilerplate |
| MMStar Accuracy | 35.3% | After 6h training on H100 |
| Minimum VRAM (Training) | ~4.5 GB | Batch size 1 |
| Maximum Batch Size (80GB GPU) | ~256 | Before OOM |
| Supported Image Inputs | Variable | Per-sample flexibility |
| Generation Strategies | 3 | Greedy, top-k, top-p |
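The three generation strategies listed above (greedy, top-k, top-p) can be sketched as a single sampling helper. The function below is an illustrative implementation of these standard techniques, not nanoVLM's exact code; greedy decoding corresponds to top_k=1:

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    # logits: [B, vocab_size] for the next position
    logits = logits / max(temperature, 1e-8)
    if top_k is not None:
        # Mask everything below the k-th largest logit
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1:]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p is not None:
        # Nucleus sampling: keep the smallest set of tokens whose
        # cumulative probability exceeds top_p (always keeping the top-1)
        sorted_logits, idx = torch.sort(logits, descending=True, dim=-1)
        probs = torch.softmax(sorted_logits, dim=-1)
        cum = torch.cumsum(probs, dim=-1)
        drop = cum - probs > top_p
        sorted_logits = sorted_logits.masked_fill(drop, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(-1, idx, sorted_logits)
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # [B, 1]
```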
Educational Research: nanoVLM provides an ideal platform for understanding VLM internals without the complexity of production systems. Students and researchers can trace data flow from raw images to generated text through a manageable codebase.
Architecture Experimentation: The modular design enables rapid prototyping of alternative vision encoders, projection strategies, or language decoders. Researchers can swap individual components while keeping the rest of the pipeline constant.
Resource-Constrained Development: With VRAM requirements accessible on consumer GPUs, nanoVLM enables VLM development and experimentation without access to large-scale computing infrastructure.
Fine-tuning Studies: The simple training loop facilitates experiments with learning rate schedules, data augmentation strategies, and regularization techniques specific to multimodal models.
Benchmark Development: The clean implementation serves as a reference for developing new VLM benchmarks or evaluation methodologies, ensuring that results reflect model capabilities rather than implementation quirks.
Recommended Reading Path: this report is best read from high-level concepts to implementation details, with later sections building on earlier ones. (Diagram: section dependency graph, not rendered.)