Best Vision-Language Models

Vision-language models have become genuinely useful for OCR, document understanding, and UI automation. These are the strongest open-weight options for local image+text reasoning; licenses vary, so check the License field on each entry.

  1. Moondream 2

    1.8B params

    Ultra-compact vision model. Only 1GB. Answers questions about images.

    Min VRAM: 1.5GB | Quant: Q4_K_M | Size: 1GB | License: apache-2.0
  2. Gemma 3 12B

    12B params

    High quality 12B model. Excellent for iPad Pro and Mac.

    Min VRAM: 7.3GB | Quant: Q4_K_M | Size: 6.799GB | License: gemma
  3. Qwen2-VL 2B

    2.2B params

    Compact vision-language model. Default multimodal model. Can understand images and answer questions about them.

    Min VRAM: 1.42GB | Quant: Q4_K_M | Size: 0.918GB | License: apache-2.0
  4. Gemma 3 4B

    4B params

    Balanced 4B model with strong reasoning. Great for iPhones.

    Min VRAM: 2.82GB | Quant: Q4_K_M | Size: 2.319GB | License: gemma
  5. Phi-3.5 Vision

    4.2B params

    Vision-language model from Microsoft. Can understand images and documents.

    Min VRAM: 3.2GB | Quant: Q4_K_M | Size: 2.5GB | License: mit
  6. Gemma 3 27B

    27B params

    Google's flagship open model. Near GPT-4 quality. Needs 20GB+ RAM.

    Min VRAM: 15.91GB | Quant: Q4_K_M | Size: 15.41GB | License: gemma
  7. LLaVA 1.6 7B

    7B params

    Multimodal vision-language model. Understands images and answers questions about them.

    Min VRAM: 5GB | Quant: Q4_K_M | Size: 4.4GB | License: apache-2.0
  8. PaliGemma 3B

    3B params

    Google's vision model. Strong at visual QA, captioning, and OCR.

    Min VRAM: 2.5GB | Quant: Q4_K_M | Size: 2GB | License: gemma
  9. MiniCPM-V 2.6

    2B params

    Efficient multimodal model with strong image understanding. Optimized for edge devices.

    Min VRAM: 2.1GB | Quant: Q4_K_M | Size: 1.6GB | License: apache-2.0
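
To narrow the list to models that fit a given device, the Min VRAM figures above can be filtered programmatically. The following is a minimal sketch, not an official tool: the `Model` class, the `models_that_fit` function, and the `headroom_gb` parameter are hypothetical names introduced here for illustration, and the numbers are copied as-is from the listing (all at Q4_K_M), not independently measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    params_b: float     # parameter count in billions, as listed above
    min_vram_gb: float  # "Min VRAM" from the listing
    size_gb: float      # "Size" from the listing
    license: str


# Values transcribed from the listing above; treat them as the listing's claims.
MODELS = [
    Model("Moondream 2", 1.8, 1.5, 1.0, "apache-2.0"),
    Model("Gemma 3 12B", 12, 7.3, 6.799, "gemma"),
    Model("Qwen2-VL 2B", 2.2, 1.42, 0.918, "apache-2.0"),
    Model("Gemma 3 4B", 4, 2.82, 2.319, "gemma"),
    Model("Phi-3.5 Vision", 4.2, 3.2, 2.5, "mit"),
    Model("Gemma 3 27B", 27, 15.91, 15.41, "gemma"),
    Model("LLaVA 1.6 7B", 7, 5.0, 4.4, "apache-2.0"),
    Model("PaliGemma 3B", 3, 2.5, 2.0, "gemma"),
    Model("MiniCPM-V 2.6", 2, 2.1, 1.6, "apache-2.0"),
]


def models_that_fit(vram_gb, headroom_gb=0.0):
    """Return listed models whose Min VRAM fits the budget, largest first.

    headroom_gb is an assumed safety margin reserved for the OS and
    other applications; it is not part of the listing.
    """
    budget = vram_gb - headroom_gb
    fits = [m for m in MODELS if m.min_vram_gb <= budget]
    return sorted(fits, key=lambda m: m.params_b, reverse=True)


if __name__ == "__main__":
    for m in models_that_fit(8.0, headroom_gb=1.0):
        print(f"{m.name}: needs {m.min_vram_gb}GB, license {m.license}")
```

Sorting by parameter count is only a rough proxy for quality; the ranking order of the list itself may be a better guide when two models of similar size both fit.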
