Vision Language Model Alignment in TRL ⚡️

Vision Language Models (VLMs) are getting stronger, but aligning them to human preferences still matters. In TRL, we already showed how to post-train VLMs with Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). This time, we’re going further.

TL;DR: here’s what’s new in TRL:

  • Mixed Preference Optimization (MPO)
  • Group Relative Policy Optimization (GRPO)
  • Group Sequence Policy Optimization (GSPO) (a variant of GRPO)

These go beyond pairwise DPO, extracting richer signals from preference data and scaling better with modern VLMs.
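To give an intuition for the "group" in GRPO/GSPO: several completions are sampled for the same prompt, and each completion's advantage is computed relative to the group's own reward statistics rather than a learned value baseline. The sketch below is an illustrative simplification, not TRL's implementation (function name and epsilon value are our own):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its group's mean and std (GRPO-style).

    `rewards` holds the scores of completions sampled for one prompt.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions sampled for the same prompt, scored by a reward function:
advs = group_relative_advantages([0.1, 0.4, 0.4, 0.9])
# Completions above the group mean get positive advantages; below, negative.
```

Because the baseline comes from sibling completions, the signal per prompt is richer than a single pairwise preference: every completion in the group contributes to the gradient.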

We’ve also extended existing methods to support VLMs: