Issue #121 – Finding the Optimal Vocabulary Size for Neural Machine Translation

11 Mar 2021 · Author: Akshai Ramesh, Machine Translation Scientist @ Iconic

Introduction: Sennrich et al. (2016) introduced a variant of byte pair encoding (BPE) (Gage, 1994) for word segmentation, capable of encoding open vocabularies with a compact symbol vocabulary of variable-length subword units. With BPE, Neural Machine Translation (NMT) systems become capable of open-vocabulary translation by representing rare and unseen words as a […]


What is Tokenization in NLP? Here’s All You Need To Know

Highlights:
- Tokenization is a key (and mandatory) aspect of working with text data
- We'll discuss the various nuances of tokenization, including how to handle out-of-vocabulary (OOV) words

Introduction: Language is a thing of beauty. But mastering a new language from scratch is quite a daunting prospect. If you've ever picked up a language that wasn't your mother tongue, you'll relate to this! There are so many layers to peel off and syntaxes to consider – it's quite a challenge. […]
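As a concrete illustration of the OOV problem the highlights mention, here is a hedged sketch: a naive word-level tokenizer paired with a fixed vocabulary, where any word not in the vocabulary falls back to an `<unk>` id. The tokenizer, vocabulary, and example sentence are assumptions for demonstration, not the article's own pipeline.

```python
import re

def tokenize(text):
    """Naive word/punctuation tokenizer (a simplification; production
    tokenizers handle contractions, Unicode, etc.)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def encode(tokens, vocab, unk="<unk>"):
    """Map tokens to integer ids, sending out-of-vocabulary words to <unk>."""
    return [vocab.get(t, vocab[unk]) for t in tokens]

# A tiny hypothetical vocabulary; id 0 is reserved for unknown words.
vocab = {"<unk>": 0, "language": 1, "is": 2, "a": 3,
         "thing": 4, "of": 5, "beauty": 6}

tokens = tokenize("Language is a thing of wonder")
ids = encode(tokens, vocab)  # "wonder" is OOV, so it maps to the <unk> id
```

Subword schemes such as BPE exist precisely to avoid this information loss: instead of collapsing "wonder" to `<unk>`, they would segment it into smaller known units.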
