Direct Preference Optimization Beyond Chatbots

In April, we released DharmaOCR, our specialized structured OCR model (available on Hugging Face) along with a paper detailing the methodology behind it and a benchmark demonstrating its superior quality and cost efficiency.
The paper benchmarked leading vision-language model families - both open-source and commercial - on a structured document extraction task: OCR on Brazilian Portuguese text. Among the reported metrics was text degeneration rate: the frequency with which a model produces a repetition loop instead of a transcription.

Across the tested open-source families, vanilla degeneration rates ranged from below 1% to above 33%. Supervised fine-tuning reduced those rates for most models - but rarely to production-acceptable levels. The pattern points to a structural limitation: SFT optimizes for correct outputs, but does not explicitly penalize degeneration. There appears to be a ceiling on how much task-focused fine-tuning alone can reduce this failure mode (Text Degeneration Article).

A second training stage - applied after supervised fine-tuning (SFT), on

To finish reading, please visit source site