SigLIP 2: A better multilingual vision language encoder
Today Google releases a new and better family of multilingual vision-language encoders, SigLIP 2. The authors have extended the training objective of SigLIP (sigmoid loss) with additional objectives for improved semantic understanding, localization, and dense features.
SigLIP 2 models outperform the older SigLIP ones at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs).
A cherry on top is the dynamic resolution (naflex) variant. This is useful for downstream tasks sensitive to aspect ratio and resolution.
Here is a list of all the models released: