Issue #46 – Augmenting Self-attention with Persistent Memory

18 Jul 2019


Author: Dr. Rohit Gupta, Sr. Machine Translation Scientist @ Iconic

In Issue #32 we introduced the Transformer model as the new state-of-the-art in Neural Machine Translation. Subsequently, in Issue #41 we looked at some approaches aiming to improve upon it. In this post, we take a look at a significant change to the Transformer model, proposed by Sukhbaatar et al. (2019), which further improves its performance.

Each Transformer layer consists of two types of sub-layers: 1) attention sub-layer(s) and 2) a feedforward sub-layer. Because of their novelty, the attention sub-layers tend to get most of the discussion when we talk about Transformers. However, the feedforward sub-layer is the other major component of the architecture and should not be ignored: the Transformer needs both to function properly.
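
To make the two sub-layers concrete, here is a minimal sketch of a standard Transformer layer in PyTorch. The dimensions (d_model, n_heads, d_ff) and the class name are illustrative defaults, not taken from the paper.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One standard Transformer layer: a self-attention sub-layer followed by
    a position-wise feedforward sub-layer, each with a residual connection
    and layer normalization."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Sub-layer 1: multi-head self-attention over the input positions.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: position-wise feedforward network.
        ff_out = self.feed_forward(x)
        return self.norm2(x + self.dropout(ff_out))
```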

Sukhbaatar et al. introduce a new layer that merges the attention and feedforward sub-layers into a single unified attention layer, as illustrated in the figure below. 

A single unified attention layer in the Transformer model

This unified layer augments self-attention with a set of trainable "persistent" memory vectors that are added to the keys and values of the attention, playing a role similar to the feedforward sub-layer, so a separate feedforward block is no longer required.
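
The following is a minimal, single-head sketch of this idea, assuming PyTorch: learned persistent key/value vectors are concatenated to the context keys and values, and the layer has no separate feedforward block. The class name, n_persistent, and the single-head simplification are illustrative; the paper's full formulation (per-head memories and related details) is not reproduced here.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PersistentMemoryAttention(nn.Module):
    """Self-attention whose keys and values are extended with trainable
    persistent vectors, replacing the separate feedforward sub-layer
    (single-head sketch)."""

    def __init__(self, d_model=512, n_persistent=16):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Learned persistent key/value vectors, shared across all positions.
        self.persistent_k = nn.Parameter(torch.randn(n_persistent, d_model) / math.sqrt(d_model))
        self.persistent_v = nn.Parameter(torch.randn(n_persistent, d_model) / math.sqrt(d_model))
        self.norm = nn.LayerNorm(d_model)
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        batch = x.size(0)
        q = self.q_proj(x)
        # Concatenate persistent vectors to the context keys and values.
        pk = self.persistent_k.unsqueeze(0).expand(batch, -1, -1)
        pv = self.persistent_v.unsqueeze(0).expand(batch, -1, -1)
        k = torch.cat([self.k_proj(x), pk], dim=1)
        v = torch.cat([self.v_proj(x), pv], dim=1)
        # Standard scaled dot-product attention over the extended key/value set.
        scores = torch.matmul(q, k.transpose(1, 2)) * self.scale
        attn = F.softmax(scores, dim=-1)
        out = self.out_proj(torch.matmul(attn, v))
        # Residual connection and layer norm; no separate feedforward block follows.
        return self.norm(x + out)
```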
