nanoVLM is a lightweight, educational repository for training and fine-tuning Vision-Language Models (VLMs) implemented in pure PyTorch. The codebase emphasizes readability and simplicity, with the core model definition and training logic fitting in approximately 750 lines of code. The architecture consists of a Vision Backbone, Language Decoder, Modality Projection layer, and the VLM itself, along with a straightforward training loop README.md:26-34.
The repository provides multiple entry points for getting started, including an interactive Colab notebook and a local development environment setup. Users can choose between cloning the repository for local development or directly opening the project in Google Colab for immediate experimentation README.md:40-42.
nanoVLM is compatible with standard Linux distributions and macOS environments that support Python 3.12. The project requires a CUDA-capable GPU for training, with VRAM requirements varying based on batch size and model configuration.
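Since the project targets Python 3.12, a quick interpreter check before installing dependencies can save a failed setup. This is a small convenience sketch, not part of the repository:

```python
import sys

def version_ok(info=None):
    """Return True if the interpreter meets nanoVLM's Python 3.12 target."""
    info = sys.version_info if info is None else info
    return info >= (3, 12)

if not version_ok():
    print(f"Python 3.12 required, found {sys.version_info.major}.{sys.version_info.minor}")
```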
The following table summarizes the required Python packages and their purposes:
| Package | Purpose |
|---|---|
| torch | Core deep learning framework |
| numpy | Numerical computations |
| torchvision | Image processing utilities |
| pillow | Image loading and manipulation |
| datasets | Training dataset management from Hugging Face |
| huggingface-hub | Model hub integration |
| transformers | Pretrained backbone loading |
| wandb | Training logging and monitoring |
Based on benchmarks for the default nanoVLM model (222M parameters) on a single NVIDIA H100 GPU, the following VRAM usage was observed:
| Batch Size | Peak VRAM Usage |
|---|---|
| 1 | 4,448.58 MB |
| 2 | 4,465.39 MB |
| 4 | 4,532.29 MB |
| 8 | 5,373.46 MB |
| 16 | 7,604.36 MB |
| 32 | 12,074.31 MB |
Minimum requirements include approximately 4.5 GB of VRAM for batch size 1, and approximately 8 GB of VRAM for batch sizes up to 16 README.md:193-218.
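The benchmark table above can be turned into a rough batch-size picker. The figures below are copied from that table and apply only to the default 222M model on comparable hardware; treat this as an illustrative sketch, not a guarantee:

```python
# Peak VRAM in MB for the default nanoVLM (222M) on an H100, per the table above.
PEAK_VRAM_MB = {1: 4449, 2: 4466, 4: 4533, 8: 5374, 16: 7605, 32: 12075}

def max_batch_size(available_mb):
    """Largest benchmarked batch size whose observed peak usage fits in memory."""
    fitting = [bs for bs, peak in PEAK_VRAM_MB.items() if peak <= available_mb]
    return max(fitting) if fitting else None

print(max_batch_size(8192))  # an 8 GB card fits batch size 16, as noted above
```

In practice, leave headroom beyond the observed peak, since fragmentation and other processes on the GPU consume additional memory.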
Begin by cloning the nanoVLM repository to the local machine:
```bash
git clone https://github.com/huggingface/nanoVLM.git
cd nanoVLM
```
This creates a local copy of the project with all necessary source files README.md:49-53.
The project maintainers recommend using uv as the package manager. To set up the environment with uv:
```bash
uv init --bare --python 3.12
uv sync --python 3.12
source .venv/bin/activate
uv add torch numpy torchvision pillow datasets huggingface-hub transformers wandb
```
This approach creates an isolated virtual environment with Python 3.12 and installs all required dependencies README.md:55-61.
For users preferring traditional package management, pip can be used to install dependencies directly:
```bash
pip install torch numpy torchvision pillow datasets huggingface-hub transformers wandb
```
This method provides the same functionality without the uv package manager README.md:64-67.
For integration with the lmms-eval evaluation toolkit, additional installation from source is required:
```bash
uv pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
```
This enables comprehensive benchmark evaluation capabilities README.md:118-119.
The fastest way to start experimenting with nanoVLM is through the pre-configured Google Colab notebook linked from the README, which requires no local installation.
This approach provides access to free GPU resources and eliminates environment setup complexity README.md:40-42.
For local inference using a pretrained model, execute the following minimal commands:
```bash
# Clone and setup (one-time)
git clone https://github.com/huggingface/nanoVLM.git
cd nanoVLM
pip install torch numpy torchvision pillow datasets huggingface-hub transformers

# Run inference
python generate.py
```
This downloads and uses the default pretrained model from the Hugging Face Hub README.md:89-94.
To train a model locally, authentication with both Weights & Biases and Hugging Face is required:
```bash
# Authenticate with services
wandb login --relogin
huggingface-cli login

# Start training
python train.py
```
The training script uses default configurations from models/config.py and uploads the trained model to the Hugging Face Hub upon completion README.md:79-87.
After completing the installation steps, verify the environment by importing the core modules:
```python
from models.vision_language_model import VisionLanguageModel
from data.processors import get_image_processor, get_tokenizer
```
Successful import without errors indicates proper dependency installation.
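Before importing the repository modules, a standard-library-only scan can confirm that every required package is installed. Note that pillow imports as `PIL` and huggingface-hub as `huggingface_hub`. This is a convenience sketch, not part of nanoVLM:

```python
import importlib.util

# Import names (not PyPI names) of the packages listed in the dependency table.
REQUIRED = ["torch", "numpy", "torchvision", "PIL", "datasets",
            "huggingface_hub", "transformers", "wandb"]

def find_missing(module_names):
    """Return the subset of module_names that cannot be located for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED)
print("All dependencies found" if not missing else f"Missing: {missing}")
```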
Running the generation script with the default pretrained model produces output describing the contents of the example image (assets/image.png). The expected output format demonstrates the model's ability to recognize and describe visual content:
```
Input: Image + 'What is this?'

Outputs:
Generation 1: This is a cat sitting on the ground. I think this is a cat sitting on the ground.
Generation 2: This picture is clicked outside. In the center there is a brown color cat seems to be sitting on
Generation 3: This is a cat sitting on the ground, which is of white and brown in color. This cat
Generation 4: This is a cat sitting on the ground. I think this is a cat sitting on the ground.
Generation 5: This is a cat sitting on the ground, which is covered with a mat. I think this is
```
This output confirms the model is correctly loaded and generating coherent descriptions README.md:100-111.
When training starts, the console output shows the script loading datasets from the configured path and creating dataloaders for both the training and validation splits; this indicates successful initialization train.py:114-116.
Symptom: Warning messages about tokenizer parallelism appear during training or inference.
Solution: The training script automatically sets the environment variable to disable this warning:
```python
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"
```
If running custom scripts, add this line at the beginning of the script or set it in the shell environment train.py:37-39.
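Equivalently, the variable can be exported at the shell level so that every script launched from that shell inherits it:

```shell
# Disable the tokenizers parallelism warning for all scripts in this shell
export TOKENIZERS_PARALLELISM=false
```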
Symptom: Training fails with "CUDA out of memory" errors.
Solution: The training script configures expandable segments to help manage memory:
```python
import os

os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
```
Additionally, reduce the batch size in the configuration file. Refer to the VRAM usage table in the Environment Requirements section to select an appropriate batch size for available GPU memory train.py:40.
Symptom: "Decompressed data too large" error when loading certain PNG images.
Solution: The training script includes a fix for large PNG metadata chunks:
```python
import PIL.PngImagePlugin

# Allow PNG text chunks of up to 100 MB
PIL.PngImagePlugin.MAX_TEXT_CHUNK = 100 * 1024 * 1024
```
This increases the maximum text chunk size to 100MB, accommodating images with extensive metadata train.py:45-47.
Symptom: Warnings about failed dataset shard loading or "No valid datasets were loaded" error.
Solution: Verify the dataset path and configuration names are correct. The training script attempts to load datasets and continues with available data if some shards fail. Check that the Hugging Face Hub credentials are properly configured using huggingface-cli login train.py:139-155.
Symptom: Errors related to process group initialization or rank identification.
Solution: Ensure the LOCAL_RANK environment variable is set correctly when launching distributed training. The script expects this variable to determine the local rank for each process train.py:54-57.
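`torchrun` exports LOCAL_RANK automatically for each process it spawns (e.g. `torchrun --nproc_per_node=2 train.py`). A defensive pattern for custom scripts is to fall back to rank 0 when the variable is absent; this is an illustrative sketch, not a quote of train.py:

```python
import os

# torchrun sets LOCAL_RANK per process; defaulting to 0 lets the same
# script also run single-process without the launcher.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
print(f"running with local rank {local_rank}")
```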
After successfully setting up the environment and running basic inference or training, explore the following advanced topics:
The VisionLanguageModel.from_pretrained() method supports loading models from both the Hugging Face Hub and local paths. This enables experimentation with different model checkpoints and fine-tuned variants.
Trained models can be shared on the Hugging Face Hub using the model.push_to_hub() method. This facilitates collaboration and reproducibility of research results.
Comprehensive benchmark evaluation is available through the lmms-eval integration. This supports multiple benchmarks including MMStar, MME, MMMU, and OCRBench for thorough model assessment.
The default training and model configurations can be modified in models/config.py to experiment with different architectures, learning rates, and training parameters. The modular design allows easy customization of individual components.