Vision-Language Models · Document AI & visual reasoning

Models that see, read, and reason about images and documents.

Qwen-VL, InternVL, LLaVA-NeXT, Pixtral, Phi-3.5 Vision - deployed for document AI, visual QA, screen understanding, and OCR + reasoning. On your hardware, on your data, with structured output you can wire into a real system.

<1% · field error on production invoice extraction
JSON · structured output enforced via schema
Edge · Phi-3.5 Vision (4B) on Jetson
100% · docs stay in your VPC

Models we deploy

The open-source VLM lineup

VLMs moved faster than any other modality in 2024. We benchmark per project and pick the right model rather than defaulting to a favourite.

Qwen

Qwen2.5-VL (7B / 72B)

Alibaba

Best open VLM lineup. Strong OCR, document parsing, table extraction, multilingual.

OCR · Tables · Multilingual
InternLM

InternVL 2.5

Shanghai AI Lab

Frontier-quality open VLM. Excellent on charts, diagrams, and dense documents.

Charts · Diagrams
LLaVA

LLaVA-NeXT

LLaVA team

Strong general-purpose VLM. Good baseline for visual QA and screen understanding.

Visual QA · Screen UI
Mistral

Pixtral

Mistral

Multi-image reasoning, native multimodal. Good for comparative visual tasks.

Multi-image
Azure

Phi-3.5 Vision

Microsoft

Tiny VLM (4B) for edge / on-device document AI. Surprisingly capable for the size.

Edge · Tiny
Use cases

What customers actually build with VLMs

Document AI

Invoices, contracts, forms, KYC docs. Extract structured data with sub-1% field error rates.

OCR + reasoning

Beyond OCR - answer questions about the document, not just transcribe it.

Screen / UI understanding

Agents that look at screenshots, understand layouts, and decide next actions.

Visual QA

Open-ended questions about images for content moderation, e-commerce, accessibility.
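
A field error rate like the sub-1% figure quoted for Document AI only means something once the metric is pinned down. The sketch below shows one plausible definition: wrong or missing fields divided by all labelled fields. The function names and the trim-and-case-fold normalisation are assumptions for illustration, not a description of the production benchmark.

```python
from typing import Mapping, Optional, Sequence


def _normalise(value: Optional[str]) -> Optional[str]:
    # Assumed normalisation: trim whitespace and ignore case.
    return None if value is None else str(value).strip().casefold()


def field_error_rate(
    predictions: Sequence[Mapping[str, str]],
    ground_truth: Sequence[Mapping[str, str]],
) -> float:
    """Fraction of labelled fields whose extracted value differs from the label."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must align one-to-one")
    total = 0
    wrong = 0
    for pred, truth in zip(predictions, ground_truth):
        for name, expected in truth.items():
            total += 1
            # A missing field counts as an error, same as a wrong value.
            if _normalise(pred.get(name)) != _normalise(expected):
                wrong += 1
    return wrong / total if total else 0.0


# Two labelled invoices, four fields in total, one wrong total.
truth = [
    {"invoice_no": "INV-001", "total": "120.00"},
    {"invoice_no": "INV-002", "total": "75.50"},
]
preds = [
    {"invoice_no": "inv-001", "total": "120.00"},
    {"invoice_no": "INV-002", "total": "75.80"},
]
rate = field_error_rate(preds, truth)
print(f"field error rate: {rate:.2%}")
```

Running the same function over each candidate model's output on the same labelled sample gives a like-for-like number to compare.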

Document AI pipeline

Production document extraction in four stages

A demo of Qwen-VL answering questions about a PDF is fun. A pipeline that handles 50k documents a day with audit trails is different work.

STAGE 01

Page-level layout parse

Detect blocks, tables, figures, and headers, then route each region to the right downstream extractor.
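
The routing step can be pictured as a lookup from region type to extractor. This is a minimal sketch: the `Region` class, the extractor functions, and the fallback to a text extractor are all hypothetical placeholders, and a real layout parser would supply the regions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Region:
    kind: str                        # "table", "figure", "header", "text", ...
    page: int
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels


# Placeholder extractors: in a real pipeline each would call a model or parser.
def extract_table(region: Region) -> str:
    return f"table extractor, page {region.page}"


def extract_figure(region: Region) -> str:
    return f"figure extractor, page {region.page}"


def extract_text(region: Region) -> str:
    return f"text extractor, page {region.page}"


EXTRACTORS: Dict[str, Callable[[Region], str]] = {
    "table": extract_table,
    "figure": extract_figure,
    "header": extract_text,
    "text": extract_text,
}


def route(regions: List[Region]) -> List[str]:
    # Unknown region kinds fall back to the plain-text extractor.
    return [EXTRACTORS.get(r.kind, extract_text)(r) for r in regions]


regions = [
    Region("header", 1, (0, 0, 800, 60)),
    Region("table", 1, (40, 200, 760, 520)),
    Region("stamp", 2, (600, 900, 760, 1000)),  # not in the table above
]
results = route(regions)
for line in results:
    print(line)
```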

STAGE 02

VLM extraction

Qwen-VL or InternVL extracts structured fields per region. JSON-mode output enforced via schema.
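
Whatever mechanism constrains the model's output, it is worth checking the returned JSON before it moves on. The sketch below uses only the standard library and a hand-rolled field-to-type map; the invoice fields are hypothetical, and a production system would more likely use a full JSON Schema validator or constrained decoding.

```python
import json

# Hypothetical invoice schema: field name -> accepted Python type(s).
INVOICE_SCHEMA = {
    "invoice_no": str,
    "issue_date": str,
    "total": (int, float),
    "currency": str,
}


def parse_extraction(raw: str, schema=INVOICE_SCHEMA) -> dict:
    """Parse model output and reject anything that does not match the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = schema.keys() - data.keys()
    extra = data.keys() - schema.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for name, expected in schema.items():
        value = data[name]
        # No field here is boolean; bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"field {name!r} has type {type(value).__name__}")
    return data


raw_output = (
    '{"invoice_no": "INV-001", "issue_date": "2024-05-01", '
    '"total": 120.0, "currency": "EUR"}'
)
fields = parse_extraction(raw_output)
print(fields["invoice_no"], fields["total"])
```

Rejecting unknown keys as well as missing ones keeps hallucinated fields out of downstream systems.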

STAGE 03

Validation + reconcile

Cross-check extracted values against business rules. Flag anything that fails for human review.
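
As an illustration of business-rule checks, the sketch below applies two assumed rules to an invoice: line items plus tax must equal the total, and the due date must not precede the issue date. The field names and rules are hypothetical; real rules come from the customer's domain.

```python
from datetime import date
from decimal import Decimal
from typing import List


def validate_invoice(inv: dict) -> List[str]:
    """Return a list of rule violations; an empty list means the invoice passes."""
    problems = []

    # Decimal avoids float rounding when comparing monetary amounts.
    line_sum = sum((Decimal(item["amount"]) for item in inv["line_items"]), Decimal("0"))
    expected_total = line_sum + Decimal(inv["tax"])
    if expected_total != Decimal(inv["total"]):
        problems.append(
            f"total {inv['total']} does not match line items + tax ({expected_total})"
        )

    if date.fromisoformat(inv["due_date"]) < date.fromisoformat(inv["issue_date"]):
        problems.append("due_date precedes issue_date")

    return problems


invoice = {
    "line_items": [{"amount": "100.00"}, {"amount": "15.50"}],
    "tax": "23.10",
    "total": "138.60",
    "issue_date": "2024-05-01",
    "due_date": "2024-05-31",
}
tampered = dict(invoice, total="140.00", due_date="2024-04-01")

ok_problems = validate_invoice(invoice)
bad_problems = validate_invoice(tampered)
needs_review = bool(bad_problems)  # anything non-empty goes to a human reviewer
print(ok_problems, bad_problems, sep="\n")
```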

STAGE 04

Delivery

Structured output pushed into the customer system - ERP, CRM, doc store. Audit trail preserved.
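
One way to preserve an audit trail is to write an append-only record next to each delivery, tying the extracted fields to a hash of the source document. The record layout, function names, and the model label used here are assumptions for illustration.

```python
import hashlib
import io
import json
from datetime import datetime, timezone
from typing import Optional, TextIO


def build_audit_record(
    document_bytes: bytes,
    fields: dict,
    model_name: str,
    reviewed_by: Optional[str] = None,
) -> dict:
    return {
        # The hash ties the record to the exact bytes that were processed.
        "document_sha256": hashlib.sha256(document_bytes).hexdigest(),
        "model": model_name,
        "fields": fields,
        "reviewed_by": reviewed_by,  # None when no human review was needed
        "extracted_at": datetime.now(timezone.utc).isoformat(),
    }


def write_audit_line(stream: TextIO, record: dict) -> None:
    # One JSON object per line (JSON Lines) keeps the log append-only and greppable.
    stream.write(json.dumps(record, sort_keys=True) + "\n")


log = io.StringIO()  # stands in for a file or log sink
record = build_audit_record(
    b"%PDF-1.7 example bytes",
    {"invoice_no": "INV-001", "total": "138.60"},
    model_name="example-vlm",  # hypothetical model label
)
write_audit_line(log, record)
print(log.getvalue().strip())
```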

Have a document or vision workload?

Bring sample documents and the fields you want extracted. We'll come back with a model + accuracy benchmark.

Talk to us about VLMs
See case studies