Multimodal Embedding & Reranker Models with Sentence Transformers

Sentence Transformers is a Python library for using and training embedding and reranker models for applications like retrieval-augmented generation, semantic search, and more. With the v5.4 update, you can now encode and compare texts, images, audio, and videos using the same familiar API. In this blog post, I’ll show you how to use these new multimodal capabilities for both embedding and reranking.

Multimodal embedding models map inputs from different modalities into a shared embedding space, while multimodal reranker models score the relevance of mixed-modality pairs. This opens up use cases like visual document retrieval, cross-modal search, and multimodal RAG pipelines.
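To make the idea of a shared embedding space concrete, here is a minimal sketch in plain Python. It does not call Sentence Transformers: the three-dimensional vectors, the file names, and the `cosine_similarity` and `rank` helpers are all made up for illustration, standing in for what an embedding model might return for a text query and two images.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query, candidates):
    """Sort candidate names by similarity to the query, best match first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings (assumption): real models produce vectors with
# hundreds or thousands of dimensions, not three.
text_query = [0.9, 0.1, 0.2]  # stand-in for an embedded text query
image_candidates = {
    "photo_of_cat.jpg": [0.8, 0.2, 0.1],  # stand-in for an embedded image
    "invoice_scan.png": [0.1, 0.9, 0.3],  # stand-in for another embedded image
}

ranking = rank(text_query, image_candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the text and the images live in the same vector space, one similarity function can compare them directly. A reranker works differently: rather than comparing two precomputed vectors, it takes each query and candidate pair together and outputs a relevance score for that pair.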

What are Multimodal Models?

Traditional embedding models convert text into fixed-size vectors. Multimodal embedding models extend
