Issue #46 – Augmenting Self-attention with Persistent Memory
18 Jul 2019
Author: Dr. Rohit Gupta, Sr. Machine Translation Scientist @ Iconic
In Issue #32 we introduced the Transformer model as the new state-of-the-art in Neural Machine Translation. Subsequently, in Issue #41 we looked at some approaches that aimed to improve upon it. In this post, we take a look at a significant change to the Transformer model, proposed by Sukhbaatar et al. (2019), which further improves its performance.
Each Transformer layer consists of two types of sub-layers: 1) attention sub-layer(s) and 2) a feedforward sub-layer. Because of their novelty, the attention sub-layers tend to get most of the discussion when we talk about Transformers. However, the feedforward sub-layer is another major component of the architecture and should not be ignored. The Transformer model requires both types of sub-layers to function properly.
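To make the two sub-layers concrete, here is a minimal sketch of one Transformer layer in plain Python. It is a simplification under stated assumptions, not the reference implementation: it uses a single attention head, omits biases and layer normalisation, and all function and parameter names (`self_attention`, `feedforward`, `w_q`, `w_1`, and so on) are our own illustrative choices.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def self_attention(xs, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a list of vectors."""
    queries = [matvec(w_q, x) for x in xs]
    keys = [matvec(w_k, x) for x in xs]
    values = [matvec(w_v, x) for x in xs]
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        # Each output is the attention-weighted sum of the value vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

def feedforward(x, w_1, w_2):
    """Position-wise feedforward sub-layer: W2 * ReLU(W1 * x), biases omitted."""
    hidden = [max(0.0, h) for h in matvec(w_1, x)]
    return matvec(w_2, hidden)

def transformer_layer(xs, w_q, w_k, w_v, w_1, w_2):
    """Attention sub-layer, then feedforward sub-layer, each with a residual."""
    attended = [add(x, a) for x, a in zip(xs, self_attention(xs, w_q, w_k, w_v))]
    return [add(h, feedforward(h, w_1, w_2)) for h in attended]
```

Note that the attention sub-layer mixes information across positions, while the feedforward sub-layer is applied to each position independently.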
Sukhbaatar et al. introduce a new layer that merges the attention and feedforward sub-layers into a single unified attention layer, as illustrated in the figure below.
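As a rough sketch of the idea, based on our reading of the paper: the unified layer attends over the usual context keys and values plus a set of learned "persistent" key and value vectors that do not depend on the input, and the separate feedforward sub-layer is dropped. The code below is a single-head simplification in plain Python; the names `all_attention`, `persistent_keys` and `persistent_values` are illustrative assumptions, and details such as multiple heads, positional information and normalisation are left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]

def all_attention(xs, w_q, w_k, w_v, persistent_keys, persistent_values):
    """Self-attention whose keys/values are extended with persistent vectors.

    persistent_keys and persistent_values stand in for learned parameters;
    they are shared by every position and do not depend on the input.
    """
    queries = [matvec(w_q, x) for x in xs]
    # Context keys/values come from the input; persistent ones are appended.
    keys = [matvec(w_k, x) for x in xs] + persistent_keys
    values = [matvec(w_v, x) for x in xs] + persistent_values
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # One softmax over context and persistent entries together.
        weights = softmax([dot(q, k) / scale for k in keys])
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs
```

With empty persistent lists, this reduces to ordinary self-attention; with them, each position can also draw on input-independent vectors, which is the role the feedforward sub-layer's weights would otherwise play.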

This unified layer,