Best Vision-Language Models
Vision-language models have become genuinely useful for OCR, document understanding, and UI automation. These are the strongest openly licensed options for local image+text reasoning, listed in ranked order.
- 1
Moondream 2
1.8B params. Ultra-compact vision model. Only 1GB. Answers questions about images.
Min VRAM: 1.5GB | Quant: Q4_K_M | Size: 1GB | License: apache-2.0
- 2
Gemma 3 12B
12B params. High-quality 12B model. Excellent for iPad Pro and Mac.
Min VRAM: 7.3GB | Quant: Q4_K_M | Size: 6.799GB | License: gemma
- 3
Qwen2-VL 2B
2.2B params. Compact vision-language model. Default multimodal model. Can understand images and answer questions about them.
Min VRAM: 1.42GB | Quant: Q4_K_M | Size: 0.918GB | License: apache-2.0
- 4
Gemma 3 4B
4B params. Balanced 4B model with strong reasoning. Great for iPhones.
Min VRAM: 2.82GB | Quant: Q4_K_M | Size: 2.319GB | License: gemma
- 5
Phi-3.5 Vision
4.2B params. Vision-language model from Microsoft. Can understand images and documents.
Min VRAM: 3.2GB | Quant: Q4_K_M | Size: 2.5GB | License: mit
- 6
Gemma 3 27B
27B params. Google's flagship open model. Near GPT-4 quality. Needs 20GB+ RAM.
Min VRAM: 15.91GB | Quant: Q4_K_M | Size: 15.41GB | License: gemma
- 7
LLaVA 1.6 7B
7B params. Multimodal vision-language model. Understands images and answers questions about them.
Min VRAM: 5GB | Quant: Q4_K_M | Size: 4.4GB | License: apache-2.0
- 8
PaliGemma 3B
3B params. Google's vision model. Strong at visual QA, captioning, and OCR.
Min VRAM: 2.5GB | Quant: Q4_K_M | Size: 2GB | License: gemma
- 9
MiniCPM-V 2.6
2B params. Efficient multimodal model with strong image understanding. Optimized for edge devices.
Min VRAM: 2.1GB | Quant: Q4_K_M | Size: 1.6GB | License: apache-2.0
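
The Min VRAM figures above can be used to shortlist models for a particular device. The following is a minimal sketch in Python, not part of any tool's API: the `Model` record, the `MODELS` table, and the `models_that_fit` helper are hypothetical names introduced here for illustration, and the numbers are copied directly from the list. The `headroom_gb` parameter is an assumption that you may want to reserve some memory for the rest of the system.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Model:
    """One entry from the ranked list (hypothetical record type)."""
    rank: int
    name: str
    min_vram_gb: float
    size_gb: float
    license: str


# Values transcribed from the list above (all at Q4_K_M quantization).
MODELS = [
    Model(1, "Moondream 2", 1.5, 1.0, "apache-2.0"),
    Model(2, "Gemma 3 12B", 7.3, 6.799, "gemma"),
    Model(3, "Qwen2-VL 2B", 1.42, 0.918, "apache-2.0"),
    Model(4, "Gemma 3 4B", 2.82, 2.319, "gemma"),
    Model(5, "Phi-3.5 Vision", 3.2, 2.5, "mit"),
    Model(6, "Gemma 3 27B", 15.91, 15.41, "gemma"),
    Model(7, "LLaVA 1.6 7B", 5.0, 4.4, "apache-2.0"),
    Model(8, "PaliGemma 3B", 2.5, 2.0, "gemma"),
    Model(9, "MiniCPM-V 2.6", 2.1, 1.6, "apache-2.0"),
]


def models_that_fit(
    vram_gb: float,
    headroom_gb: float = 0.0,
    licenses: Optional[Iterable[str]] = None,
) -> List[Model]:
    """Return models whose Min VRAM fits the budget, in ranked order.

    headroom_gb is memory held back for other uses (an assumption of this
    sketch); licenses, if given, restricts results to those license tags.
    """
    budget = vram_gb - headroom_gb
    allowed = set(licenses) if licenses is not None else None
    return [
        m
        for m in MODELS
        if m.min_vram_gb <= budget and (allowed is None or m.license in allowed)
    ]


if __name__ == "__main__":
    # Example: a 4GB device, restricted to Apache-2.0 or MIT licensed models.
    for m in models_that_fit(4.0, licenses={"apache-2.0", "mit"}):
        print(f"{m.rank}. {m.name} (min {m.min_vram_gb}GB)")
```

Filtering on Min VRAM rather than on file size is deliberate in this sketch: the listed Min VRAM is always somewhat larger than the download size, so it is the more conservative bound.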