Introducing ConTextual: How well can your Multimodal model jointly reason over text and image in text-rich scenes?
Models have become quite good at understanding standalone text, but what about text embedded in images, which provides important contextual information? Think of navigating a map or understanding a meme. The ability to reason about the interaction between text and the visual context in images can power many real-world applications, such as AI assistants or tools to assist the visually impaired.
We refer to these tasks as “context-sensitive text-rich visual reasoning tasks”.
At the moment, most evaluations of instruction-tuned large multimodal models (LMMs) focus on testing how well models can respond to human instructions posed as questions or