The open-source VLM lineup
VLMs moved faster than any other modality in 2024. We benchmark candidates on each project and pick the model that fits, rather than defaulting to a favourite.
Qwen-VL 2.5 (7B / 72B)
Alibaba: Best open VLM lineup. Strong OCR, document parsing, table extraction, multilingual.
InternVL 2.5
Shanghai AI Lab: Frontier-quality open VLM. Excellent on charts, diagrams, and dense documents.
LLaVA-NeXT
LLaVA team: Strong general-purpose VLM. Good baseline for visual QA and screen understanding.
Pixtral
Mistral: Multi-image reasoning, native multimodal. Good for comparative visual tasks.
Phi-3.5 Vision
Microsoft: Tiny VLM (4B) for edge / on-device document AI. Surprisingly capable for the size.
What customers actually build with VLMs
Document AI
Invoices, contracts, forms, KYC docs. Extract structured data with sub-1% field error rates.
OCR + reasoning
Beyond OCR - the model answers questions about the document instead of only transcribing it.
Screen / UI understanding
Agents that look at screenshots, understand layouts, and decide next actions.
Visual QA
Open-ended questions about images for content moderation, e-commerce, and accessibility.
Production document extraction in four stages
A demo of Qwen-VL answering questions about a PDF is fun. A pipeline that handles 50k documents a day with audit trails is different work.
Page-level layout parse
Detect blocks, tables, figures, and headers, then route each region to the right downstream extractor.
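A minimal sketch of the routing step, assuming the layout parser has already produced typed regions. The `Region` type, the region labels, and the three extractor functions are hypothetical placeholders, not part of any specific layout library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Region:
    kind: str                          # assumed labels: "table", "figure", "header", "text"
    bbox: Tuple[int, int, int, int]    # (x0, y0, x1, y1) in page pixels
    page: int


# Hypothetical extractors. In a real pipeline each would call a model or parser.
def extract_table(region: Region) -> dict:
    return {"extractor": "table", "page": region.page, "bbox": region.bbox}


def extract_figure(region: Region) -> dict:
    return {"extractor": "figure", "page": region.page, "bbox": region.bbox}


def extract_text(region: Region) -> dict:
    return {"extractor": "text", "page": region.page, "bbox": region.bbox}


# Region kind -> extractor. Anything unlisted falls back to plain text extraction.
ROUTES: Dict[str, Callable[[Region], dict]] = {
    "table": extract_table,
    "figure": extract_figure,
}


def route(regions: List[Region]) -> List[dict]:
    """Send each detected region to its extractor, preserving page order."""
    return [ROUTES.get(r.kind, extract_text)(r) for r in regions]
```

Keeping the routing table as plain data makes it easy to add a new region type without touching the dispatch logic.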
VLM extraction
Qwen-VL or InternVL extracts structured fields per region. JSON-mode output enforced via schema.
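Even when the serving layer constrains decoding, it is worth re-checking the model's reply before it moves downstream. The sketch below shows one way to do that with the standard library only. The `INVOICE_SCHEMA` field names and the `parse_fields` helper are illustrative assumptions, not a fixed contract.

```python
import json

# Hypothetical schema for one region type: field name -> expected Python type.
INVOICE_SCHEMA = {"invoice_number": str, "total": float, "currency": str}


def parse_fields(raw: str, schema: dict) -> dict:
    """Parse a model reply and reject anything that does not match the schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    missing = sorted(k for k in schema if k not in data)
    extra = sorted(k for k in data if k not in schema)
    if missing or extra:
        raise ValueError(f"missing={missing} unexpected={extra}")

    clean = {}
    for key, expected in schema.items():
        value = data[key]
        # JSON has a single number type, so accept ints where floats are expected.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected):
            raise ValueError(f"{key}: expected {expected.__name__}")
        clean[key] = value
    return clean
```

A reply that fails here can be retried or sent straight to human review rather than silently passed along.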
Validation + reconcile
Cross-check extracted values against business rules. Flag anything that fails for human review.
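As an illustration, two typical business rules for an invoice might look like the sketch below. The field names, the allowed currency set, and the one-cent tolerance are assumptions chosen for the example. Real rules come from the customer.

```python
from decimal import Decimal

ALLOWED_CURRENCIES = {"EUR", "USD", "GBP"}   # assumed whitelist
TOLERANCE = Decimal("0.01")                  # assumed rounding tolerance


def check_invoice(fields: dict) -> list:
    """Return the names of the rules that failed. An empty list means pass."""
    failures = []

    # Rule 1: line items must add up to the stated total.
    # Decimal(str(x)) avoids binary floating-point drift on currency amounts.
    line_sum = sum((Decimal(str(x)) for x in fields["line_items"]), Decimal("0"))
    if abs(line_sum - Decimal(str(fields["total"]))) > TOLERANCE:
        failures.append("total_mismatch")

    # Rule 2: currency must be one the customer system accepts.
    if fields["currency"] not in ALLOWED_CURRENCIES:
        failures.append("unknown_currency")

    return failures


def needs_human_review(fields: dict) -> bool:
    """Any failed rule sends the document to the review queue."""
    return bool(check_invoice(fields))
```

Returning rule names instead of a bare boolean gives the reviewer a reason for each flag.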
Delivery
Structured output pushed into the customer system - ERP, CRM, doc store. Audit trail preserved.
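One way to preserve an audit trail is to store a small record next to every delivered payload that ties the output back to the exact input bytes. The sketch below is an assumption about what such a record could contain. The `build_audit_record` helper and its field names are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_record(doc_bytes: bytes, fields: dict, model: str) -> dict:
    """Link a delivered payload to its source document and extraction model."""
    # Canonical JSON (sorted keys, fixed separators) so the same fields
    # always hash to the same value regardless of dict ordering.
    payload = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return {
        "doc_sha256": hashlib.sha256(doc_bytes).hexdigest(),
        "payload_sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "model": model,
        "delivered_at": datetime.now(timezone.utc).isoformat(),
    }
```

With both hashes stored, a later audit can confirm that neither the source document nor the delivered fields changed after delivery.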