Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents
Today we’re excited to announce Granite 4.0 3B Vision, a compact vision-language model (VLM) designed for enterprise document understanding. It’s purpose-built for reliable information extraction from complex documents, forms, and structured visuals. Granite 4.0 3B Vision excels at the following capabilities:
- Table Extraction: Accurately parsing complex table structures (e.g., multi-row, multi-column) from document images
- Chart Understanding: Converting charts and figures into structured machine-readable formats, summaries, or executable code
- Semantic Key-Value Pair (KVP) Extraction: Identifying and grounding semantically meaningful key-value field pairs across diverse document layouts
The model ships as a LoRA adapter on top of Granite 4.0 Micro, our dense language model. This keeps vision and language modular, which allows text-only fallbacks and seamless integration into mixed pipelines. Beyond document extraction, the model continues to support general vision-language tasks such as producing detailed natural-language descriptions from images (e.g., “Describe this image in detail”). It can be used standalone or in tandem with Docling to enhance document processing pipelines with deep visual understanding capabilities.
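To make the modularity point concrete, here is a minimal sketch of how a LoRA adapter works in general: the base weight matrix stays frozen, the adapter contributes a low-rank update scaled by `alpha / r`, and switching the adapter off recovers the base model's output exactly. This illustrates the standard LoRA technique only, not Granite's actual implementation; the class `LoRALinear`, its `adapter_enabled` flag, and the toy matrices are all hypothetical names and values chosen for the example.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x (cols)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


class LoRALinear:
    """Toy linear layer with a switchable low-rank (LoRA) update.

    Hypothetical illustration: y = W x + (alpha / r) * B (A x),
    where W is frozen and only A and B belong to the adapter.
    """

    def __init__(self, weight: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float):
        self.weight = weight            # frozen base weights, shape (d_out, d_in)
        self.lora_a = lora_a            # adapter down-projection, shape (r, d_in)
        self.lora_b = lora_b            # adapter up-projection, shape (d_out, r)
        self.scale = alpha / len(lora_a)  # standard LoRA scaling alpha / r
        self.adapter_enabled = True

    def forward(self, x: Vector) -> Vector:
        base = matvec(self.weight, x)
        if not self.adapter_enabled:
            # Fallback path: identical to the base model's layer.
            return base
        delta = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scale * d for b, d in zip(base, delta)]


# Toy values: identity base layer with a rank-1 adapter.
layer = LoRALinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_a=[[1.0, 1.0]],
    lora_b=[[0.5], [0.0]],
    alpha=2.0,
)
x = [1.0, 2.0]

with_adapter = layer.forward(x)      # base output plus low-rank update
layer.adapter_enabled = False
without_adapter = layer.forward(x)   # base output only
```

Because the base weights are never modified, a pipeline can in principle route text-only requests through the base path and enable the adapter only when an input needs the adapted behavior.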
How Granite 4.0 3B Vision Was Built
Granite 4.0 3B Vision’s performance is the result