Highlights from Machine Translation and Multilinguality in May 2024

Here are short summaries of three pre-prints that I enjoyed reading in May.

Zero-Shot Tokenizer Transfer

Folks from the University of Cambridge and the University of Edinburgh propose a nice trick for changing the vocabulary of an already trained language model. They train a hypernetwork (a neural network that predicts the parameters of another neural network) that predicts what embeddings a token would have if it had been trained with the rest of the model. For each training batch, they build an ad-hoc vocabulary using a simplified and randomized version of the SentencePiece Unigram tokenizer (they just count substring frequencies and add some random noise). They run the hypernetwork to generate the embeddings for this new tokenizer and compute the language-modeling cross-entropy as if the model had been trained with that vocabulary.
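To make the training loop more concrete, here is a minimal PyTorch sketch of the idea, not the paper's implementation: the vocabulary sampler, the longest-match tokenizer, the hypernetwork architecture, and the frozen toy model (`sample_adhoc_vocab`, `greedy_tokenize`, `Hypernet`, `frozen_body`) are simplified stand-ins invented for illustration; only the overall structure (sample an ad-hoc vocabulary, predict its embeddings with the hypernetwork, score the batch with the frozen model's cross-entropy) follows the description above.

```python
import random
from collections import Counter

import torch
import torch.nn as nn
import torch.nn.functional as F


def sample_adhoc_vocab(texts, vocab_size=512, max_piece_len=6, noise=0.5):
    # Count substring frequencies and perturb them with random noise --
    # a crude stand-in for the simplified, randomized Unigram step.
    counts = Counter()
    for t in texts:
        for i in range(len(t)):
            for j in range(i + 1, min(i + 1 + max_piece_len, len(t) + 1)):
                counts[t[i:j]] += 1
    scored = {p: c * (1 + noise * random.random()) for p, c in counts.items()}
    return sorted(scored, key=scored.get, reverse=True)[:vocab_size]


def greedy_tokenize(text, vocab, max_piece_len=6):
    # Longest-match segmentation against the ad-hoc vocabulary.
    pieces, i, vocab_set = [], 0, set(vocab)
    while i < len(text):
        for j in range(min(len(text), i + max_piece_len), i, -1):
            if text[i:j] in vocab_set:
                pieces.append(text[i:j])
                i = j
                break
        else:
            i += 1  # no piece matches here, skip the character
    return pieces


class Hypernet(nn.Module):
    # Predicts, from a token's characters, the embedding the token would
    # have had if it had been trained together with the rest of the model.
    def __init__(self, char_dim=32, hidden=64, emb_dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(256, char_dim)
        self.rnn = nn.GRU(char_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, emb_dim)

    def forward(self, tokens):
        embs = []
        for tok in tokens:
            char_ids = torch.tensor([[min(ord(c), 255) for c in tok]])
            states, _ = self.rnn(self.char_emb(char_ids))
            embs.append(self.out(states[0, -1]))
        return torch.stack(embs)  # (vocabulary size, embedding dim)


# One training step of the hypernetwork against a frozen toy "language model".
texts = ["zero-shot tokenizer transfer", "hypernetworks predict embeddings"]
hypernet = Hypernet()
optimizer = torch.optim.Adam(hypernet.parameters(), lr=1e-3)
frozen_body = nn.GRU(64, 64, batch_first=True)  # stands in for the frozen LM body
for p in frozen_body.parameters():
    p.requires_grad_(False)

vocab = sample_adhoc_vocab(texts)             # fresh ad-hoc vocabulary
embeddings = hypernet(vocab)                  # predicted embeddings for it
pieces = greedy_tokenize(texts[0], vocab)
ids = torch.tensor([[vocab.index(p) for p in pieces]])
hidden, _ = frozen_body(embeddings[ids[:, :-1]])
logits = hidden @ embeddings.T                # output layer tied to the embeddings
loss = F.cross_entropy(logits.reshape(-1, len(vocab)), ids[:, 1:].reshape(-1))
loss.backward()                               # gradients reach only the hypernetwork
optimizer.step()
```

In each step a new vocabulary is sampled, so the hypernetwork never overfits to one tokenizer and can later produce embeddings for an unseen one.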
