Issue #36 – Average Attention Network for Neural MT

09 May 2019

Author: Dr. Rohit Gupta, Sr. Machine Translation Scientist @ Iconic

In Issue #32, we covered the Transformer model for neural machine translation, which is the state of the art in neural MT. In this post, we explore a technique presented by Zhang et al. (2018), which modifies the Transformer model and speeds up the translation process by 4-7 times across a range of different engines.

Where is the bottleneck?

In the Transformer, the decoder generates each translated word by attending to all of the previously generated target words. The longer the sentence being translated, the more computation each decoding step requires. This essentially means that translating long sentences can be quite slow.
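
As a rough illustration (not code from the article), the sketch below shows one decoding step of single-head scaled dot-product self-attention: step j has to score all j previously generated positions, so the total decoding cost grows quadratically with sentence length.

```python
import numpy as np

def decoder_self_attention_step(query, prev_keys, prev_values):
    """One decoding step: the current query attends to all previous target states.

    query:       (d,)   representation of the word currently being generated
    prev_keys:   (j, d) keys of the j already-generated target words
    prev_values: (j, d) values of the j already-generated target words
    """
    d = query.shape[-1]
    scores = prev_keys @ query / np.sqrt(d)   # j dot products at decoding step j
    weights = np.exp(scores - scores.max())   # softmax over the j previous positions
    weights /= weights.sum()
    return weights @ prev_values              # weighted sum over j value vectors
```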

Average Attention Network (AAN)

The AAN replaces the dynamically computed attention weights used by the self-attention network in the decoder of the neural Transformer with simple, fixed average weights.
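
A minimal sketch of why this helps at decoding time (my own illustration, not code from the paper): because each of the j words generated so far receives the same fixed weight 1/j, the average can be maintained with a running sum, so the cost of a decoding step no longer depends on how many words have already been produced.

```python
import numpy as np

class CumulativeAverage:
    """Keeps the running average of all target states seen so far, in O(1) per step."""

    def __init__(self, d):
        self.running_sum = np.zeros(d)
        self.count = 0

    def step(self, y_j):
        """Add the current target state y_j (shape (d,)) and return the cumulative average."""
        self.running_sum = self.running_sum + y_j
        self.count += 1
        return self.running_sum / self.count   # each previous word contributes weight 1/j
```

The full AAN, described below, adds a feed-forward network and a gating layer on top of this cumulative average.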

Given an input layer y = {y1, y2, …, ym}, the AAN employs a cumulative-average operation (equation 1, Average Layer), followed by a feed-forward network and layer normalisation (equations 2 and 3, Gating Layer), as follows:

Average Attention Layer
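
The equation image from the original post is not reproduced here. As a sketch, following the notation of Zhang et al. (2018), the operations take roughly the form below, where FFN is a position-wise feed-forward network, σ is the logistic sigmoid, W is a learned parameter matrix, [· ; ·] denotes concatenation and ⊙ element-wise multiplication:

```latex
% Average layer: cumulative average of the inputs, passed through a feed-forward network
g_j = \mathrm{FFN}\!\left(\frac{1}{j}\sum_{k=1}^{j} y_k\right)

% Gating layer: input and forget gates, followed by a residual connection and layer normalisation
i_j,\, f_j = \sigma\!\left(W\,[y_j ; g_j]\right)
h_j = \mathrm{LayerNorm}\!\left(y_j + i_j \odot y_j + f_j \odot g_j\right)
```

During training, the cumulative averages can still be computed in parallel with a simple masking trick, so the decoding speed-up does not come at the cost of slower training.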

