Attention
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html#self-attention
I mainly borrowed this content from Lil'Log; the original link is above. There are a lot of treasures in her blog. These are just notes for my own interest.

A Family of Attention Mechanisms
For a traditional seq2seq model, memorizing the important information of a long source sequence in one fixed-length vector remains an issue. The attention mechanism was invented to help memorize long source sequences: rather than building a single context vector out of the encoder's last hidden state, the key of attention is to create shortcuts between the context vector and the entire source input. The weights of these shortcut connections are customizable for each output element.
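As a rough sketch of this idea (NumPy, with made-up shapes and a plain dot-product score, so not any one paper's exact formulation), the context vector for a single decoding step is a weighted sum over all encoder hidden states:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical shapes: 5 source positions, hidden size 8.
encoder_states = np.random.randn(5, 8)  # h_1 ... h_5, one per source token
decoder_state = np.random.randn(8)      # s_t, the current target hidden state

# Score every source position against the current target state
# (a plain dot-product score, chosen here purely for illustration).
scores = encoder_states @ decoder_state  # shape (5,)
alignment = softmax(scores)              # attention weights over the source
context = alignment @ encoder_states     # weighted sum over ALL encoder states
```

Every encoder state contributes to `context`, weighted by how well it matches the current decoder state; these weights are exactly the customizable shortcut connections.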
The core of an attention mechanism is the alignment score function, i.e., how to measure the similarity between the current target state and each position of the source sequence. There are several popular attention mechanisms, all sketched in code after the list:
Content-based Attention: $\text{score}(s_t, h_i) = \text{cosine}[s_t, h_i]$
Additive Attention:
Luong (concat): $\text{score}(s_t, h_i) = v_a^\top \tanh(W_a [s_t; h_i])$
Bahdanau: $\text{score}(s_t, h_i) = v_a^\top \tanh(W_a s_t + U_a h_i)$
Location-based Attention:
It is an attention mechanism in which the alignment scores are computed solely from the target hidden state as follows: $\alpha_{t,i} = \text{softmax}(W_a s_t)$
General Attention: $\text{score}(s_t, h_i) = s_t^\top W_a h_i$
where $W_a$ is a trainable weight matrix in the attention layer.
Dot-product: $\text{score}(s_t, h_i) = s_t^\top h_i$
Scaled Dot-product: $\text{score}(s_t, h_i) = \frac{s_t^\top h_i}{\sqrt{n}}$
Very similar to the dot-product attention except for a scaling factor $\frac{1}{\sqrt{n}}$, where $n$ is the dimension of the source hidden state. The scaling is motivated by the concern that when $n$ is large, the dot products grow large in magnitude, which can push the softmax into regions with extremely small gradients and make learning inefficient.
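With the same toy shapes as before, the score functions above might be written as follows (the weight matrices here are random stand-ins for what would be trainable parameters):

```python
import numpy as np

n = 8                              # dimension of the source hidden state
s_t = np.random.randn(n)           # target hidden state
h_i = np.random.randn(n)           # one source hidden state
W_a = np.random.randn(n, n)        # "trainable" matrices (random stand-ins)
U_a = np.random.randn(n, n)
W_cat = np.random.randn(n, 2 * n)
v_a = np.random.randn(n)

content_based = s_t @ h_i / (np.linalg.norm(s_t) * np.linalg.norm(h_i))  # cosine
additive_luong = v_a @ np.tanh(W_cat @ np.concatenate([s_t, h_i]))       # concat
additive_bahdanau = v_a @ np.tanh(W_a @ s_t + U_a @ h_i)
general = s_t @ W_a @ h_i
dot_product = s_t @ h_i
scaled_dot_product = dot_product / np.sqrt(n)
```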
Self-Attention
Self-attention, also known as intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the same sequence.
For example, when processing a sentence, the self-attention mechanism enables us to learn the correlation between the current word and the previous part of the sentence.
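A minimal single-head sketch (NumPy; the projection matrices and shapes are illustrative choices, loosely following the Transformer's scaled dot-product form): queries, keys, and values all come from the same sequence, so every position attends over the whole sequence, including itself.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d = 4, 8
X = np.random.randn(seq_len, d)  # embeddings of one sequence

# Q, K, V are projections of the SAME sequence -- that is what
# makes this "self" (intra-) attention.
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

weights = softmax(Q @ K.T / np.sqrt(d))  # (seq_len, seq_len) attention matrix
output = weights @ V                     # new representation of every position
```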

Soft vs. Hard Attention
Soft attention: the alignment weights are learned and placed “softly” over ALL patches in the source image; essentially the same type of attention as in Bahdanau Attention
Pro: The model is smooth and differentiable
Con: Expensive when the source input is large
Hard attention: selects only one patch of the image to attend to at a time (both variants are contrasted in the sketch after this list).
Pro: less computation at inference time
Con: the model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train
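A toy contrast of the two (NumPy; the image patches are stand-in feature vectors, and a single `np.random.choice` sample stands in for the stochastic attention policies actually used in training):

```python
import numpy as np

patches = np.random.randn(5, 8)                    # 5 image patches as features
scores = np.random.randn(5)                        # one alignment score per patch
alignment = np.exp(scores) / np.exp(scores).sum()  # softmax weights

# Soft attention: a differentiable weighted average over ALL patches.
soft_context = alignment @ patches

# Hard attention: sample ONE patch; the sampling step breaks
# differentiability, hence variance reduction / RL during training.
idx = np.random.choice(len(patches), p=alignment)
hard_context = patches[idx]
```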
Global vs. Local Attention
Global attention: similar to soft attention above
Local attention: a blend between hard and soft attention.
It makes hard attention differentiable in two steps (see the sketch after this list):
First, the model predicts a single aligned position for the current target word.
Then, a window centered around that source position is used to compute the context vector.
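A sketch of this predictive local attention in NumPy (the parameter names and the placeholder alignment scores are illustrative; the Gaussian window with sigma = D/2 follows Luong et al.'s local-p variant):

```python
import numpy as np

S, n, D = 10, 8, 2               # source length, hidden size, half window width
h_t = np.random.randn(n)         # current target hidden state
W_p = np.random.randn(n, n)      # "trainable" position-prediction parameters
v_p = np.random.randn(n)

# Step 1: predict a single aligned source position p_t in [0, S].
p_t = S / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))

# Step 2: weight the alignment by a Gaussian window centered at p_t,
# so only source positions near p_t contribute to the context vector.
scores = np.random.randn(S)                        # placeholder alignment scores
alignment = np.exp(scores) / np.exp(scores).sum()
sigma = D / 2.0
positions = np.arange(S)
local_weights = alignment * np.exp(-((positions - p_t) ** 2) / (2 * sigma**2))

encoder_states = np.random.randn(S, n)
context = local_weights @ encoder_states
```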
