Fixing Gradient Accumulation

Our friends at Unsloth shared an issue yesterday regarding gradient accumulation that affects the transformers Trainer. The initial report came from @bnjmn_marie (kudos to him!).

Gradient accumulation is supposed to be mathematically equivalent to full-batch training; however, losses did not match between training runs with the setting toggled on and off.
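To see why the two can diverge, here is a minimal sketch with hypothetical per-token loss values: averaging each micro-batch's mean loss is only equivalent to one full-batch mean when every micro-batch contains the same number of valid tokens, which is not the case once padding or label masking enters the picture.

```python
# Hypothetical per-token losses for two micro-batches of unequal length
# (e.g. after padding/label masking removes some tokens).
batch_a = [2.0, 2.0, 2.0, 2.0]   # 4 valid tokens
batch_b = [4.0, 4.0]             # 2 valid tokens

# Naive gradient accumulation: mean each micro-batch, then average the means.
naive = (sum(batch_a) / len(batch_a) + sum(batch_b) / len(batch_b)) / 2
# (2.0 + 4.0) / 2 = 3.0

# Full-batch training: a single mean over all valid tokens.
full = (sum(batch_a) + sum(batch_b)) / (len(batch_a) + len(batch_b))
# 16.0 / 6 ≈ 2.667

print(naive, full)  # the two disagree whenever token counts differ
```

The fix, conceptually, is to sum the unreduced losses across accumulation steps and divide once by the total number of valid tokens, which reproduces `full` exactly.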

Where does it stem from?

Inside the modeling code of each model, transformers offers a “default” loss function that’s the most typically used one for the model’s task.
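For causal language modeling, that default is token-level cross entropy with mean reduction over the non-masked tokens. The sketch below is illustrative only: the function name, arguments, and numbers are hypothetical, not the library's actual API, but the mean reduction it performs is the behavior that interacts badly with gradient accumulation.

```python
import math

def default_causal_lm_loss(logits, labels, ignore_index=-100):
    """Pure-Python sketch: mean cross entropy over non-ignored tokens."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked tokens (e.g. padding) do not contribute
        log_z = math.log(sum(math.exp(x) for x in row))
        total += log_z - row[label]  # -log softmax(row)[label]
        count += 1
    # Mean reduction happens per call, i.e. per micro-batch, which is
    # where the discrepancy with full-batch training creeps in.
    return total / count

logits = [[2.0, 0.5], [0.1, 1.5], [1.0, 1.0]]
labels = [0, 1, -100]  # last token is masked out
print(default_causal_lm_loss(logits, labels))
```

Because each call divides by its own token count, calling this once per accumulation step and averaging the results bakes in the mismatch shown above.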
