Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents

Today we’re excited to announce Granite 4.0 3B Vision, a compact vision-language model (VLM) designed for enterprise document understanding. It’s purpose-built for reliable information extraction from complex documents, forms, and structured visuals. Granite 4.0 3B Vision excels at the following tasks:

Table Extraction: Accurately parsing complex table structures (e.g., multi-row and multi-column layouts) from document images
Chart Understanding: Converting charts and figures into structured machine-readable formats, summaries, or executable code
Semantic Key-Value Pair (KVP) Extraction: Identifying and grounding semantically meaningful key-value field pairs across diverse document layouts

The model ships as a LoRA adapter on top of Granite 4.0 Micro, our dense language model, keeping vision and language modular for text-only fallbacks and seamless integration into mixed pipelines. It continues to support vision-language tasks such as producing detailed natural-language descriptions from images (e.g., “Describe this image in detail”). The model can be used standalone or in tandem with Docling to enhance document processing pipelines with deep visual understanding capabilities.
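To make the modularity point concrete, here is a minimal sketch of the general LoRA idea: a frozen base weight plus a low-rank update that can be switched off to recover the base model's behavior. It is not Granite's actual implementation. The class name `LoRALinear`, the `enabled` flag, and the toy matrix values are all hypothetical, chosen only for illustration; the announcement does not state ranks, scaling factors, or which layers the adapter targets.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


@dataclass
class LoRALinear:
    """Hypothetical linear layer with an optional low-rank (LoRA) update.

    Computes y = W x + (alpha / r) * B (A x) when the adapter is enabled,
    and y = W x when it is disabled (the "text-only fallback" path).
    """
    weight: Matrix        # frozen base weight, shape (out, in)
    lora_a: Matrix        # adapter down-projection, shape (r, in)
    lora_b: Matrix        # adapter up-projection, shape (out, r)
    alpha: float = 1.0    # assumed scaling hyperparameter
    enabled: bool = True  # toggle the adapter without touching the base weight

    def forward(self, x: Vector) -> Vector:
        base = matvec(self.weight, x)
        if not self.enabled:
            return base
        rank = len(self.lora_a)
        scale = self.alpha / rank
        delta = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + scale * d for b, d in zip(base, delta)]


# Toy values (illustrative only): 2x2 identity base weight, rank-1 adapter.
layer = LoRALinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_a=[[1.0, 1.0]],
    lora_b=[[0.5], [0.0]],
    alpha=2.0,
)
x = [1.0, 2.0]

print(layer.forward(x))   # adapter on: [4.0, 2.0]
layer.enabled = False
print(layer.forward(x))   # adapter off, base weights only: [1.0, 2.0]
```

Because the base weight is never modified, disabling the adapter returns exactly the base layer's output, which is the property that makes a text-only fallback cheap in an adapter-based design.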

How Granite 4.0 3B Vision Was Built

Granite 4.0 3B Vision’s performance is the result
