Attention

https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html#self-attention

I mainly borrowed the content from Lil'log; the original link is above. There are a lot of treasures in her blog. I made these notes here for my own interest.

A Family of Attention Mechanisms

For a traditional seq2seq model, memorizing all the important information in a long input remains difficult, so the attention mechanism was introduced to help a neural network memorize long source sequences. Rather than building a single context vector out of the encoder’s last hidden state, the key idea of attention is to create shortcuts between the context vector and the entire source input. The weights of these shortcut connections are customizable for each output element.
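As a minimal NumPy sketch of this idea (function names and shapes are illustrative, not from the original post): given one encoder hidden state per source position and a vector of alignment scores, the context vector is simply a score-weighted average of all the source states.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(scores, encoder_states):
    """Turn per-position alignment scores into one context vector.

    scores:         (src_len,)   one alignment score per source position
    encoder_states: (src_len, d) one hidden state per source position
    """
    alpha = softmax(scores)        # the "shortcut" weights, summing to 1
    return alpha @ encoder_states  # weighted sum over the entire source
```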

The core of an attention mechanism is how to calculate the similarity between the current target state and every position of the input sequence. There are several popular attention mechanisms; their score functions are sketched in code after this list.

  • Content-based Attention

    $\text{score}(\boldsymbol{s}_t, \boldsymbol{h}_i) = \text{cosine}[\boldsymbol{s}_t, \boldsymbol{h}_i]$

  • Additive Attention

    • Luong (concat): $\text{score}(\boldsymbol{s}_t, \boldsymbol{h}_i) = \mathbf{v}_a^\top \tanh(\mathbf{W}_a[\boldsymbol{s}_t; \boldsymbol{h}_i])$

    • Bahdanau: $\text{score}(\boldsymbol{s}_t, \boldsymbol{h}_i) = \mathbf{v}_a^\top \tanh(\mathbf{W}_1\boldsymbol{s}_t + \mathbf{W}_2\boldsymbol{h}_i)$

  • Location-based Attention:

    • The alignment scores are computed solely from the target hidden state $\boldsymbol{s}_t$:

    $\alpha_{t,i} = \text{softmax}(\mathbf{W}_a \boldsymbol{s}_t)$

  • General Attention:

    $\text{score}(\boldsymbol{s}_t, \boldsymbol{h}_i) = \boldsymbol{s}_t^\top\mathbf{W}_a\boldsymbol{h}_i$

    • where $\mathbf{W}_a$ is a trainable weight matrix in the attention layer.

  • Dot-product

    $\text{score}(\boldsymbol{s}_t, \boldsymbol{h}_i) = \boldsymbol{s}_t^\top\boldsymbol{h}_i$

  • Scaled Dot-product

    $\text{score}(\boldsymbol{s}_t, \boldsymbol{h}_i) = \frac{\boldsymbol{s}_t^\top\boldsymbol{h}_i}{\sqrt{n}}$

    • Very similar to dot-product attention except for a scaling factor, where $n$ is the dimension of the source hidden state.

    • The scaling factor $\frac{1}{\sqrt{n}}$ is motivated by the concern that when the input is large, the softmax function may have an extremely small gradient, making efficient learning hard.
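The score functions above differ only in how they combine $\boldsymbol{s}_t$ and $\boldsymbol{h}_i$. Here is a minimal NumPy sketch, assuming random stand-ins for the trainable parameters $\mathbf{W}_a$, $\mathbf{W}_1$, $\mathbf{W}_2$, and $\mathbf{v}_a$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
s_t = rng.standard_normal(d)        # target (decoder) hidden state
h_i = rng.standard_normal(d)        # source (encoder) hidden state
W_a = rng.standard_normal((d, d))   # trainable parameters (random stand-ins here)
W_1 = rng.standard_normal((d, d))
W_2 = rng.standard_normal((d, d))
v_a = rng.standard_normal(d)

# Content-based: cosine similarity between the two states
content = (s_t @ h_i) / (np.linalg.norm(s_t) * np.linalg.norm(h_i))

# Additive (Bahdanau)
additive = v_a @ np.tanh(W_1 @ s_t + W_2 @ h_i)

# General (Luong, multiplicative)
general = s_t @ W_a @ h_i

# Dot-product and its scaled variant (n = dimension of the source hidden state)
dot = s_t @ h_i
scaled_dot = dot / np.sqrt(d)
```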

Self Attention

Self-attention, also known as intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the same sequence.

When reading a sentence, for example, the self-attention mechanism enables us to learn the correlation between the current word and the previous part of the sentence.
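A minimal sketch of self-attention in the scaled dot-product form (the projection matrices are assumed inputs here; in practice they are learned):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a single sequence.

    X: (seq_len, d_model); the same sequence supplies queries, keys, and values.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # three projections of one input
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (seq_len, seq_len) pairwise scores
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)  # each row attends over all positions
    return weights @ V                           # every position mixes the whole sequence
```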

Soft vs. Hard Attention

  • Soft attention: the alignment weights are learned and placed “softly” over ALL patches in the source image; essentially the same type of attention as in Bahdanau Attention

    • Pro: The model is smooth and differentiable

    • Con: Expensive when the source input is large

  • Hard Attention: only selects one patch of the image to attend at a time.

    • Pro: less computation at inference time

    • Con: the model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train
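The contrast is easy to see in code; a toy sketch with made-up attention weights over four image patches:

```python
import numpy as np

rng = np.random.default_rng(0)
patches = rng.standard_normal((4, 8))   # one feature vector per image patch
alpha = np.array([0.1, 0.6, 0.2, 0.1])  # attention weights over the 4 patches

# Soft attention: a differentiable weighted average over ALL patches.
soft_context = alpha @ patches

# Hard attention: sample ONE patch to attend to. The sampling step is
# non-differentiable, hence the need for variance reduction or RL-style training.
idx = rng.choice(len(alpha), p=alpha)
hard_context = patches[idx]
```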

Global vs. Local Attention

  • Global attention: similar to soft attention above

  • Local attention: a blend between hard and soft attention.

    • It makes hard attention differentiable.

    • First, the model predicts a single aligned position for the current target word.

    • Then, a window centered around that source position is used to compute a context vector.
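Luong's "local-p" formulation is one concrete version of these two steps. Below is a sketch, assuming the global alignment scores are already computed (variable names are illustrative):

```python
import numpy as np

def local_attention_weights(scores, s_t, W_p, v_p, D=4):
    """Luong-style local-p attention weights.

    scores: (S,) global alignment scores over S source positions
    s_t:    target hidden state, used to predict the window center
    D:      half-width of the attention window
    """
    S = len(scores)
    # Step 1: predict the aligned source position p_t in [0, S).
    p_t = S * (1.0 / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ s_t)))))
    # Step 2: softmax the scores, then down-weight positions far from p_t with a
    # Gaussian centered at p_t (sigma = D / 2), so the whole step stays differentiable.
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()
    positions = np.arange(S)
    sigma = D / 2.0
    return alpha * np.exp(-((positions - p_t) ** 2) / (2 * sigma ** 2))
```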
