Highlights from Machine Translation and Multilinguality in November 2024

Mitigating Metric Bias in Minimum Bayes Risk Decoding Minimum Bayes Risk Decoding tries to get the most typical output from a language or machine translation model rather than the most probable one. The main idea is that the probability scores do not consider how semantically similar sentences are. Therefore, the most probable sequence might not be the most typical from a meaning perspective. The weak point is that we have to decide what metric to use to estimate the similarity, […]

Read more

Notes from EMNLP 2024

Last week, I was at EMNLP in Miami, and here are a few notes about what I saw at the conference. Keynotes The conference had three keynotes: two good and one amazing. In the first keynote, Percy Liang talked about research on LLMs that they do at Stanford. One topic was LLM-based agents: Percy Liang predicts that LLMs are awaiting their AlphaGo moment so that we will move from coded agents; soon, the big topic will be agents trained with […]

Read more

Highlights from Machine Translation and Multilinguality in October 2024

Here are summaries of a few pre-preprints that I noticed on arXiv during October. LangSAMP: Language-Script Aware Multilingual Pretraining Folks from LMU Munich try a relatively simple trick to improve multilingual encoder models, particularly non-Latin-script and low-resource languages. They use additional information about the language identity and the script, but only during training, so at the inference, we can still use the model without caring about what language we feed in. They add static language and script embeddings before the […]

Read more

Highlights from Machine Translation and Multilinguality in Summer 2024

Here are summaries of a few papers that I liked during the (long academic) summer. BertaQA: How Much Do Language Models Know About Local Culture? People from the University of the Basque Country prepared a QA dataset consisting of local knowledge about the Basque Country, hopefully including facts that might now exist on the English-speaking Internet and contrast that with global (but it probably means Western) facts. The questions are in the multiple-choice style. Then, they asked professional translators to […]

Read more

Lessons learned from analyzing values in multilingual encoders and what it means for LLMs

This post is a hindsight on two studies on multilingual sentence embeddings we published a year ago and comments on what I think people analyzing LLMs today should take away from them. In late 2022, we (which mainly was the work of Kathy Hämmerl from Munich and Björn Diesenroth and Patrick Schramowski from Darmstadt) finished a paper called Speaking Multiple Languages Affects the Moral Bias of Language Models (later published in Findings of ACL 2023) where we tried to compare […]

Read more

Highlights from Machine Translation and Multilinguality in May 2024

Here are short summaries of three pre-prints that I enjoyed reading in May. Zero-Shot Tokenizer Transfer Folks from the University of Cambridge and the Univerisity of Edinburgh propose a nice trick for changing the vocabulary of an already trained language model. They train a hyper-network (a neural network that predicts parameters of a different neural network) that predicts what embeddings a token would have if it were trained with the rest of the model. For each training batch, they build […]

Read more

Highlights from Machine Translation and Multilinguality in April 2024

Meta4XNLI: A Crosslingual Parallel Corpus for Metaphor Detection and Interpretation Folks from the University of the Basque Country prepared an English-Spanish dataset for natural langauge inference (i.e., deciding if sentences follow from each other, are in contradiction, or have nothing to do with each other) with metaphorical expressions. Unlike the standard version of this task (XNLI), which does not use figurative language, there is a large gap between in-language training and language transfer. (Transfer means that we finetune a multilingual […]

Read more

Highlights from Machine Translation and Multilinguality in March 2024

Did Translation Models Get More Robust Without Anyone Even Noticing? Folks from Lisbon study how robust the newest MT systems are against source-side noise. Machine translation using large models, including translation-specific NLLB or via LLMs (such as Tower or GPT-3.5), is much more robust both towards synthetic noise (the nice feature of synthetic noise is that you can check the translation quality for different noise levels) and also real-world noisy data from social networks. Tracing the Roots of Facts in […]

Read more

Highlights from Machine Translation and Multilinguality in February 2024

With a new month, here are a few papers that I noticed on arXiv in February. Linear-time Minimum Bayes Risk Decoding with Reference Aggregation A preprint from the University of Zurich proposes a linear time version of Minimum Bayes Risk (MBR) decoding in machine translation. This decoding algorithm does not aim to generate the most probable sequence given the model but the most typical one. This is typically done by sampling dozens of candidate output sentences, from which we select […]

Read more

Highlights from Machine Translation and Multilinguality in December 2023 and January 2024

Many things happened in the field in December: EMNLP, Google released Gemini, and Mixtral appeared. January was seemingly not that packed with new events, but plenty of new interesting work popped up on arXiv. Predicting Human Translation Difficulty with Neural Machine Translation Folks from the University of Melbourne found out that features from NMT, most notably the target sentence perplexity and something they call flow features, are a good predictor of human translation time. Turning English-centric LLMs Into Polyglots: How […]

Read more
1 2 3 11