Issue #2 – Data Cleaning for Neural MT

25 Jul 2018

Author: Dr. Patrik Lambert, Machine Translation Scientist @ Iconic

“Garbage in, garbage out”: noisy data is a big problem for all machine learning tasks, and MT is no different. By noisy data, we mean bad alignments, poor translations, misspellings, and other inconsistencies in the data used to train the systems. Statistical MT systems are fairly robust to such noise, and can cope with up to 10% noise in the training data without significant impact on translation quality. Thus, in many cases, more data is better, even if it is a bit noisy. According to a recent paper (Khayrallah and Koehn, 2018), the same cannot be said for Neural MT, which is much more sensitive to noise.

Where is the problem?

Let’s look at their comparison of how several types of noise in the training data affect the BLEU score for German-to-English machine translation.

The most harmful type of noise is source-language segments copied untranslated into the target, e.g. German aligned with German. With only 5% of this type of noise, the BLEU score drops from 27 points to less than 18.
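As a rough illustration of how such copied segments might be filtered out of a parallel corpus before training, here is a minimal Python sketch. It is not the method used by Khayrallah and Koehn (2018); the function names `is_copied` and `drop_copied_pairs` and the 0.9 similarity threshold are assumptions made for this example.

```python
from difflib import SequenceMatcher


def is_copied(source: str, target: str, threshold: float = 0.9) -> bool:
    """Return True when the target looks like an untranslated copy of the source.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    # Lowercase and collapse whitespace so trivial differences do not hide a copy.
    src = " ".join(source.lower().split())
    tgt = " ".join(target.lower().split())
    if not src or not tgt:
        return False
    # Character-level similarity in [0, 1]; 1.0 means the strings are identical.
    return SequenceMatcher(None, src, tgt).ratio() >= threshold


def drop_copied_pairs(pairs, threshold: float = 0.9):
    """Keep only the (source, target) pairs whose target is not a near-copy."""
    return [(s, t) for s, t in pairs if not is_copied(s, t, threshold)]


if __name__ == "__main__":
    corpus = [
        ("Das ist ein Haus.", "This is a house."),
        ("Das ist ein Haus.", "Das ist ein Haus."),  # German aligned with German
    ]
    for src, tgt in drop_copied_pairs(corpus):
        print(src, "|||", tgt)
```

A filter like this will also discard segments that are legitimately identical in both languages, such as names, numbers, or product codes, so in practice it may be worth exempting very short segments or combining it with language identification.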