Machine Translation Weekly 88: Text Segmentation and Multilinguality

With the start of the semester, it is also time to renew MT Weekly. My new year’s
resolution was to make it to 100 issues, so let’s see if I can keep it. Today,
I will talk about a paper by my colleagues from LMU Munich that will appear in
the Findings of EMNLP 2021 and deals with a perpetual problem of NLP: input
text segmentation. The title of the paper is “Wine is Not v i n. On the
Compatibility of Tokenizations Across Languages”, and it discusses the role of
subword segmentation in multilingual representation models and shows that
segmenting different languages at different granularities might be problematic.
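
To make the granularity mismatch concrete, here is a small sketch of my own; it
is not the method from the paper. It uses a greedy longest-match segmenter
rather than BPE, and the vocabulary `toy_vocab` is made up for illustration: it
pretends that the English word made it into the vocabulary as a whole, while
vin, the French word for wine, did not.

```python
def segment(word, vocab):
    """Greedy longest-match segmentation of `word` into pieces from `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink it until it is in the vocabulary.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


# Hypothetical toy vocabulary (an assumption, not taken from any real model).
toy_vocab = {"wine", "v", "i", "n"}

english = segment("wine", toy_vocab)  # stays a single unit
french = segment("vin", toy_vocab)    # falls apart into characters

print(english, french)
```

With this vocabulary, the two translations of the same word end up with one
subword in one language and three in the other, which is the kind of
incompatibility the title alludes to.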

Subwords were originally introduced in machine translation, where
segmentation incompatibility was probably never a big deal. BPE
segmentation keeps

 

 
