Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers

By Tom Aarsen

Sentence Transformers is a Python library for using and training embedding and reranker models for applications like retrieval augmented generation, semantic search, and more. In my previous blog post, I introduced the library's new multimodal capabilities, showing how to use embedding and reranker models that handle text, images, audio, and video. In this blog post, I'll show you how to train or finetune these multimodal models on your own data.
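To ground the retrieval setting: an embedding model maps queries and documents into the same vector space, and semantic search ranks documents by cosine similarity to the query. The sketch below uses made-up toy vectors purely for illustration; in practice they would come from a Sentence Transformers model's `encode` method.

```python
import numpy as np

# Hypothetical embeddings: in a real pipeline these would be produced by
# model.encode(...) with a Sentence Transformers model. The numbers below
# are invented purely to illustrate the ranking step.
query_emb = np.array([0.1, 0.9, 0.2])
doc_embs = np.array([
    [0.1, 0.8, 0.3],  # doc 0: semantically close to the query
    [0.9, 0.1, 0.0],  # doc 1: unrelated
    [0.2, 0.9, 0.1],  # doc 2: also close
])

def cosine_similarities(query, docs):
    """Cosine similarity between one query vector and each document vector."""
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return docs @ query

scores = cosine_similarities(query_emb, doc_embs)
ranking = np.argsort(-scores)  # indices of documents, best match first
```

The unrelated document (doc 1) ends up last in the ranking; a reranker model would refine the top of such a list by scoring each (query, document) pair jointly.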

As a practical example, I'll walk through finetuning Qwen/Qwen3-VL-Embedding-2B for Visual Document Retrieval (VDR): the task of retrieving relevant document pages (as images, with charts, tables, and layout intact) for a given text query. The resulting model, tomaarsen/Qwen3-VL-Embedding-2B-vdr, demonstrates how much performance you can gain by finetuning on your own domain. On my evaluation data, the finetuned model achieves an NDCG@10 of 0.947 compared to the base model's 0.888, and it outperforms all existing VDR models I tested against, including models up to 4x its size.
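For readers unfamiliar with the metric quoted above: NDCG@10 measures how well the top 10 retrieved results are ordered, with relevant documents ranked earlier scoring higher, normalized so a perfect ranking scores 1.0. A minimal implementation from the standard definition:

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for a single query.

    `relevances` lists the graded relevance of each retrieved document,
    in the order the model ranked them (e.g. 1 = relevant, 0 = not).
    """
    ranked = relevances[:k]
    dcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ranked))
    ideal = sorted(relevances, reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: two relevant pages were ranked 1st and 3rd out of three.
score = ndcg_at_k([1, 0, 1])  # ≈ 0.92, since one relevant page sits below an irrelevant one
```

Averaging this per-query score over an evaluation set gives figures like the 0.947 vs. 0.888 comparison above.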

*Figure: Model size vs. NDCG*

