Fine-tuning Florence-2 – Microsoft’s Cutting-edge Vision Language Models

Florence-2, released by Microsoft in June 2024, is a foundation vision-language model. Its appeal lies in its small size (0.2B and 0.7B parameters) combined with strong performance on a variety of computer vision and vision-language tasks.

Florence-2 supports many tasks out of the box: captioning, object detection, OCR, and more. However, your task or domain might not be supported, or you may want tighter control over the model’s output for your task. That’s when you will need to fine-tune.
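To make the out-of-the-box tasks above concrete, here is a minimal sketch of how a task is selected with a prompt token. It assumes the `microsoft/Florence-2-base-ft` checkpoint on the Hugging Face Hub and the task tokens listed on its model card. The names `TASK_PROMPTS`, `task_prompt`, and `run_task` are illustrative helpers, not part of any library.

```python
# Task prompt tokens as listed on the Florence-2 model card (assumption: only
# three of the supported tasks are shown here; the dict itself is hypothetical).
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
}


def task_prompt(task: str) -> str:
    """Return the prompt token for a supported task, or raise ValueError."""
    try:
        return TASK_PROMPTS[task]
    except KeyError:
        raise ValueError(
            f"unsupported task {task!r}; a new task like this is a candidate for fine-tuning"
        ) from None


def run_task(image, task: str, checkpoint: str = "microsoft/Florence-2-base-ft"):
    """Run one pre-trained task on a PIL image and return the parsed result.

    Imports are local so the helpers above stay usable without transformers
    installed. The checkpoint name is an assumption; swap in the one you use.
    """
    from transformers import AutoModelForCausalLM, AutoProcessor

    prompt = task_prompt(task)
    model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)

    # The processor tokenizes the task prompt and preprocesses the image together.
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    # Keep special tokens: the post-processor needs them to parse boxes and labels.
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(
        text, task=prompt, image_size=(image.width, image.height)
    )
```

With a PIL image loaded, a call such as `run_task(image, "ocr")` would download the checkpoint and return the parsed OCR output. Asking for a task that is not in the table raises an error, which is the situation where fine-tuning becomes relevant.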

In this post, we show an example of fine-tuning Florence-2 on DocVQA. The authors report that Florence-2 can
