Highlights from Machine Translation and Multilinguality in March 2022

Here is my monthly summary of what I found most interesting on arXiv in
machine translation and multilinguality. This month was the camera-ready
deadline for ACL 2022, so many of the interesting papers are accepted to ACL.

Overlapping BPE

When training BPE, the merges do not actually have to follow the simple
objective of merging the most frequent token pair. In massively multilingual
models, there is an imbalance between languages, and some of them get segmented
almost down to characters. Therefore, we might want to have a higher vocabulary
overlap between languages. A paper from IIT Bombay and Google that will appear
at ACL suggests interpolating the bigram frequency with a factor telling in how
many languages the particular merge would appear. This leads to
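
To make the idea concrete, here is a minimal sketch of how such a merge
criterion could look. It is not the formula from the paper: the linear
interpolation, the normalization, and all names (`pair_stats`, `best_merge`,
`alpha`) are my assumptions, chosen only to illustrate mixing pair frequency
with language coverage.

```python
from collections import Counter


def pair_stats(corpora):
    """Count adjacent symbol pairs and record which languages contain them.

    corpora: dict mapping a language code to a list of words, where each
    word is a tuple of symbols (characters or already merged units).
    """
    freq = Counter()
    langs = {}
    for lang, words in corpora.items():
        for word in words:
            for pair in zip(word, word[1:]):
                freq[pair] += 1
                langs.setdefault(pair, set()).add(lang)
    return freq, langs


def best_merge(corpora, alpha=0.5):
    """Pick the next merge by interpolating frequency and language coverage.

    alpha = 0 gives the standard BPE choice (most frequent pair);
    alpha = 1 only looks at how many languages share the pair.
    This scoring is an illustrative assumption, not the paper's objective.
    """
    freq, langs = pair_stats(corpora)
    if not freq:
        return None
    max_freq = max(freq.values())
    n_langs = len(corpora)

    def score(pair):
        rel_freq = freq[pair] / max_freq          # scaled to (0, 1]
        coverage = len(langs[pair]) / n_langs     # share of languages
        return (1 - alpha) * rel_freq + alpha * coverage

    # Sorting first makes tie-breaking deterministic.
    return max(sorted(freq), key=score)


# Toy data: ("a", "b") is frequent but appears in one language only,
# ("c", "d") is rarer but shared by both languages.
toy = {
    "en": [("a", "b"), ("a", "b"), ("a", "b"), ("c", "d")],
    "de": [("c", "d")],
}
standard_choice = best_merge(toy, alpha=0.0)
overlap_choice = best_merge(toy, alpha=0.5)
```

With the toy data, the frequency-only setting prefers the pair that is common
in a single language, whereas the interpolated score prefers the pair that both
languages share.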

 

 
