nanoVLM is a lightweight, educational repository for training and fine-tuning Vision-Language Models (VLMs) implemented in pure PyTorch. The codebase emphasizes readability and simplicity, with the core model definition and training logic fitting in approximately 750 lines of code. The architecture consists of a Vision Backbone, Language Decoder, Modality Projection layer, and the VLM itself, along with a straightforward training loop README.md:26-34.
The repository provides multiple entry points for getting started, including an interactive Colab notebook and a local development environment setup. Users can choose between cloning the repository for local development or directly opening the project in Google Colab for immediate experimentation README.md:40-42.
nanoVLM is compatible with standard Linux distributions and macOS environments that support Python 3.12. The project requires a CUDA-capable GPU for training, with VRAM requirements varying based on batch size and model configuration.
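Since the project targets Python 3.12, a quick interpreter check before installing dependencies can save a failed setup. This is a small convenience sketch, not part of the repository:

```python
import sys

def version_ok(info=None):
    """Return True if the interpreter meets nanoVLM's Python 3.12 target."""
    info = sys.version_info if info is None else info
    return info >= (3, 12)

if not version_ok():
    print(f"Python 3.12 required, found {sys.version_info.major}.{sys.version_info.minor}")
```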
The following table summarizes the required Python packages and their purposes:
| Package | Purpose |
|---|---|
| torch | Core deep learning framework |
| numpy | Numerical computations |
| torchvision | Image processing utilities |
| pillow | Image loading and manipulation |
| datasets | Training dataset management from Hugging Face |
| huggingface-hub | Model hub integration |
| transformers | Pretrained backbone loading |
| wandb | Training logging and monitoring |
Based on benchmarks for the default nanoVLM model (222M parameters) on a single NVIDIA H100 GPU, the following VRAM usage was observed:
| Batch Size | Peak VRAM Usage |
|---|---|
| 1 | 4,448.58 MB |
| 2 | 4,465.39 MB |
| 4 | 4,532.29 MB |
| 8 | 5,373.46 MB |
| 16 | 7,604.36 MB |
| 32 | 12,074.31 MB |
Minimum requirements include approximately 4.5 GB of VRAM for batch size 1, and approximately 8 GB of VRAM for batch sizes up to 16 README.md:193-218.
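The benchmark table above can be turned into a rough batch-size picker. The figures below are copied from that table and apply only to the default 222M model on comparable hardware; treat this as an illustrative sketch, not a guarantee:

```python
# Peak VRAM in MB for the default nanoVLM (222M) on an H100, per the table above.
PEAK_VRAM_MB = {1: 4449, 2: 4466, 4: 4533, 8: 5374, 16: 7605, 32: 12075}

def max_batch_size(available_mb):
    """Largest benchmarked batch size whose observed peak usage fits in memory."""
    fitting = [bs for bs, peak in PEAK_VRAM_MB.items() if peak <= available_mb]
    return max(fitting) if fitting else None

print(max_batch_size(8192))  # an 8 GB card fits batch size 16, as noted above
```

In practice, leave headroom beyond the observed peak, since fragmentation and other processes on the GPU consume additional memory.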
Begin by cloning the nanoVLM repository to the local machine:
```bash
git clone https://github.com/huggingface/nanoVLM.git
cd nanoVLM
```
This creates a local copy of the project with all necessary source files README.md:49-53.
The project maintainers recommend using uv as the package manager. To set up the environment with uv:
```bash
uv init --bare --python 3.12
uv sync --python 3.12
source .venv/bin/activate
uv add torch numpy torchvision pillow datasets huggingface-hub transformers wandb
```
This approach creates an isolated virtual environment with Python 3.12 and installs all required dependencies README.md:55-61.
For users preferring traditional package management, pip can be used to install dependencies directly:
```bash
pip install torch numpy torchvision pillow datasets huggingface-hub transformers wandb
```
This method provides the same functionality without the uv package manager README.md:64-67.
For integration with the lmms-eval evaluation toolkit, additional installation from source is required:
```bash
uv pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
```
This enables comprehensive benchmark evaluation capabilities README.md:118-119.
The fastest way to start experimenting with nanoVLM is through the pre-configured Google Colab notebook linked from the README, which requires no local installation.
This approach provides access to free GPU resources and eliminates environment setup complexity README.md:40-42.
For local inference using a pretrained model, execute the following minimal commands:
```bash
# Clone and setup (one-time)
git clone https://github.com/huggingface/nanoVLM.git
cd nanoVLM
pip install torch numpy torchvision pillow datasets huggingface-hub transformers

# Run inference
python generate.py
```
This downloads and uses the default pretrained model from the Hugging Face Hub README.md:89-94.
To train a model locally, authentication with both Weights & Biases and Hugging Face is required:
```bash
# Authenticate with services
wandb login --relogin
huggingface-cli login

# Start training
python train.py
```
The training script uses default configurations from models/config.py and uploads the trained model to the Hugging Face Hub upon completion README.md:79-87.
After completing the installation steps, verify the environment by importing the core modules:
```python
from models.vision_language_model import VisionLanguageModel
from data.processors import get_image_processor, get_tokenizer
```
Successful import without errors indicates proper dependency installation.
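Before importing the repository modules, a standard-library-only scan can confirm that every required package is installed. Note that pillow imports as `PIL` and huggingface-hub as `huggingface_hub`. This is a convenience sketch, not part of nanoVLM:

```python
import importlib.util

# Import names (not PyPI names) of the packages listed in the dependency table.
REQUIRED = ["torch", "numpy", "torchvision", "PIL", "datasets",
            "huggingface_hub", "transformers", "wandb"]

def find_missing(module_names):
    """Return the subset of module_names that cannot be located for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED)
print("All dependencies found" if not missing else f"Missing: {missing}")
```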
Running the generation script with the default pretrained model produces output describing the contents of the example image (assets/image.png). The expected output format demonstrates the model's ability to recognize and describe visual content:
```
Input: Image + 'What is this?'

Outputs:
Generation 1: This is a cat sitting on the ground. I think this is a cat sitting on the ground.
Generation 2: This picture is clicked outside. In the center there is a brown color cat seems to be sitting on
Generation 3: This is a cat sitting on the ground, which is of white and brown in color. This cat
Generation 4: This is a cat sitting on the ground. I think this is a cat sitting on the ground.
Generation 5: This is a cat sitting on the ground, which is covered with a mat. I think this is
```
This output confirms the model is correctly loaded and generating coherent descriptions README.md:100-111.
When training starts, the console output shows the script loading datasets from the configured path and creating dataloaders for both the training and validation splits; this indicates successful initialization train.py:114-116.
Symptom: Warning messages about tokenizer parallelism appear during training or inference.
Solution: The training script automatically sets the environment variable to disable this warning:
```python
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"
```
If running custom scripts, add this line at the beginning of the script or set it in the shell environment train.py:37-39.
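Equivalently, the variable can be exported at the shell level so that every script launched from that shell inherits it:

```shell
# Disable the tokenizers parallelism warning for all scripts in this shell
export TOKENIZERS_PARALLELISM=false
```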
Symptom: Training fails with "CUDA out of memory" errors.
Solution: The training script configures expandable segments to help manage memory:
```python
import os

os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
```
Additionally, reduce the batch size in the configuration file. Refer to the VRAM usage table in the Environment Requirements section to select an appropriate batch size for available GPU memory train.py:40.
Symptom: "Decompressed data too large" error when loading certain PNG images.
Solution: The training script includes a fix for large PNG metadata chunks:
```python
import PIL.PngImagePlugin

# Allow PNG text chunks of up to 100 MB
PIL.PngImagePlugin.MAX_TEXT_CHUNK = 100 * 1024 * 1024
```
This increases the maximum text chunk size to 100MB, accommodating images with extensive metadata train.py:45-47.
Symptom: Warnings about failed dataset shard loading or "No valid datasets were loaded" error.
Solution: Verify the dataset path and configuration names are correct. The training script attempts to load datasets and continues with available data if some shards fail. Check that the Hugging Face Hub credentials are properly configured using huggingface-cli login train.py:139-155.
Symptom: Errors related to process group initialization or rank identification.
Solution: Ensure the LOCAL_RANK environment variable is set correctly when launching distributed training. The script expects this variable to determine the local rank for each process train.py:54-57.
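`torchrun` exports LOCAL_RANK automatically for each process it spawns (e.g. `torchrun --nproc_per_node=2 train.py`). A defensive pattern for custom scripts is to fall back to rank 0 when the variable is absent; this is an illustrative sketch, not a quote of train.py:

```python
import os

# torchrun sets LOCAL_RANK per process; defaulting to 0 lets the same
# script also run single-process without the launcher.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
print(f"running with local rank {local_rank}")
```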
After successfully setting up the environment and running basic inference or training, explore the following advanced topics:
The VisionLanguageModel.from_pretrained() method supports loading models from both the Hugging Face Hub and local paths. This enables experimentation with different model checkpoints and fine-tuned variants.
Trained models can be shared on the Hugging Face Hub using the model.push_to_hub() method. This facilitates collaboration and reproducibility of research results.
Comprehensive benchmark evaluation is available through the lmms-eval integration. This supports multiple benchmarks including MMStar, MME, MMMU, and OCRBench for thorough model assessment.
The default training and model configurations can be modified in models/config.py to experiment with different architectures, learning rates, and training parameters. The modular design allows easy customization of individual components.