Transformer
A follow-up to the notes on the attention mechanism.
High-level look
The Transformer uses an encoder-decoder structure. The encoder is a stack of 6 encoder blocks, and the decoder is a stack of 6 decoder blocks.

Detailed Encoder-Decoder blocks


One encoder block contains two sub-layers:
Self-attention sub-layer: helps the current word incorporate the context of the rest of the sequence
Feed-forward sub-layer: a position-wise fully connected network applied to each position
After each sub-layer, there is a residual connection followed by layer normalization.
One decoder block contains three sub-layers:
Masked self-attention sub-layer: the mask hides positions later than the current one
Encoder-decoder attention sub-layer: computes attention scores over the encoder output
Feed-forward sub-layer: same as in the encoder
Final linear and softmax layer
Linear layer: projects the decoder output to a vector of vocabulary size (the logits)
Softmax layer: turns the logits into probabilities over the vocabulary; the word with the highest probability is the output at this step
Position-wise feed-forward network:
Applied to each position separately and identically: $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$, with inner dimension $d_{ff} = 2048$ and model dimension $d_{model} = 512$ in the paper.
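A minimal NumPy sketch of this position-wise feed-forward network; the dimensions follow the paper, while the variable names and random weights are only illustrative:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Toy shapes following the paper: d_model = 512, d_ff = 2048.
d_model, d_ff, seq_len = 512, 2048, 10
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))            # one sequence of 10 positions
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```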
Self-Attention
Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.
A high-level view of self-attention:

In the attention mechanism, the source (the embedded input sequence) can be regarded as a list of <Key, Value> pairs. Given a new element, the Query, we obtain the relevance of this element to every item in the source by computing the similarity between the query and each key. We then use these relevance values (weights) to take a weighted sum of the values and obtain the final attention value.
Essentially, the attention mechanism is a weighted sum of the values in the source, where the query and the keys are used to compute the corresponding weights of those values.
Specifically, there are three stages to calculate the attention: 1) compute similarity scores between the query and each key; 2) normalize the scores with softmax to obtain the weights; 3) compute the weighted sum of the values.

In the first stage, there are different ways to calculate the similarity between the query and a key:
Dot product: $\mathrm{sim}(Q, K_i) = Q \cdot K_i$
Cosine similarity: $\mathrm{sim}(Q, K_i) = \dfrac{Q \cdot K_i}{\lVert Q \rVert \, \lVert K_i \rVert}$
MLP similarity: $\mathrm{sim}(Q, K_i) = \mathrm{MLP}(Q, K_i)$
In the second stage, we use softmax to normalize the result: $a_i = \mathrm{softmax}(\mathrm{sim}_i) = \dfrac{e^{\mathrm{sim}_i}}{\sum_j e^{\mathrm{sim}_j}}$
In the third stage, we calculate the attention value: $\mathrm{Attention}(Q, \mathrm{Source}) = \sum_i a_i V_i$
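A small NumPy sketch of the three stages above, using dot-product similarity (the shapes and variable names are only illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_value(query, keys, values):
    """Stage 1: similarity; stage 2: softmax weights; stage 3: weighted sum of values."""
    sims = keys @ query            # dot-product similarity of the query with every key
    weights = softmax(sims)        # normalized weights
    return weights @ values        # weighted sum of the value vectors

rng = np.random.default_rng(0)
query = rng.normal(size=4)
keys = rng.normal(size=(6, 4))     # 6 source elements, key dimension 4
values = rng.normal(size=(6, 8))   # value dimension 8
print(attention_value(query, keys, values).shape)  # (8,)
```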
Procedure of self-attention:
Overall steps:
Convert input words to embedding vectors
Calculate q, k, v for each word from its embedding: $q_i = x_i W^Q$, $k_i = x_i W^K$, $v_i = x_i W^V$
Calculate a score: $\mathrm{score}_{ij} = q_i \cdot k_j$
Scale the score by $\sqrt{d_k}$: $\mathrm{score}_{ij} / \sqrt{d_k}$
Apply the softmax function to obtain the weights
Take the weighted sum of the values: $z_i = \sum_j \mathrm{softmax}_j \cdot v_j$
Concrete Steps:
1. Self-attention first computes three vectors for each word: a query, a key, and a value. These vectors have a smaller dimension than the embedding vector: the embedding dimension is 512, while the dimension of the query, key, and value vectors is 64. --> The smaller dimension is an architecture choice that keeps the overall computation of multi-headed attention (mostly) constant.
Multiplying x1 by the WQ weight matrix produces q1, the "query" vector associated with that word. We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.

What are the "Query", "Key", and "Value"?
Query: represents the information of the word at the current position
Key: represents the information of the words at the other positions in the sequence
Value: represents the content vector at each position that will be summed into the output
2. Calculate a score for the current word against every other word in the sentence. We compute the score by taking the dot product of the current word's query vector with each word's key vector.
--> This score determines how much focus to place on other parts of the input sentence as we encode the word at this position.

3. Divide the scores by $\sqrt{d_k}$, where $d_k = 64$ is the dimension of the key vectors used in the paper, so the divisor is 8.
--> This step leads to more stable gradients.
Then, pass the result through a softmax operation so that the scores are positive and sum to 1.
--> The softmax score determines how much each word is expressed at this position; it makes it possible to attend to other words that are relevant to the current word.

4. Multiply each value vector by the softmax score.
--> This keeps the values of the words we want to focus on intact and drowns out irrelevant words (by multiplying them by tiny scores).
5. Lastly, sum up the weighted value vectors.
--> It produces the output of the self-attention layer at the current position
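The five steps above, written out for a single position in NumPy; the dimensions follow the text (embedding size 512, q/k/v size 64), while the weights are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, n_words = 512, 64, 5

X = rng.normal(size=(n_words, d_model))          # word embeddings, one row per word
WQ, WK, WV = (rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(3))

Q, K, V = X @ WQ, X @ WK, X @ WV                 # step 1: query/key/value vectors
scores = Q[0] @ K.T                              # step 2: score word 0 against every word
scores = scores / np.sqrt(d_k)                   # step 3: scale by sqrt(d_k) = 8
weights = np.exp(scores) / np.exp(scores).sum()  # step 3 (cont.): softmax
z0 = weights @ V                                 # steps 4-5: weighted sum of value vectors
print(z0.shape)                                  # (64,) -- self-attention output for position 0
```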

Matrix Calculation of Self Attention
1. Calculate the Query, Key, and Value matrices in a single matrix multiplication.
Each row of X is a word in the input sentence. By multiplying X with the weight matrices $W^Q$, $W^K$, $W^V$, we obtain the smaller matrices Q, K, V.

2. We can merge steps 2 to 5 of the previous section into one formula: $Z = \mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$
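The same computation for all positions at once, as a sketch of the single formula above:

```python
import numpy as np

def self_attention(Q, K, V):
    """Z = softmax(Q K^T / sqrt(d_k)) V, with the softmax taken row-wise."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 64)) for _ in range(3))
print(self_attention(Q, K, V).shape)  # (5, 64): one output row per input position
```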

Multi-headed Attention
Another improvement to self-attention is a mechanism called multi-headed attention. Specifically, instead of initializing a single set of weight matrices $W^Q$, $W^K$, $W^V$, we initialize 8 sets of these projection matrices, one per head.
It helps improve the performance of the attention layer in two ways:
It expands the model's ability to focus on different positions.
It gives the attention layer multiple "representation subspaces"

Since the feed-forward layer expects a single matrix instead of 8, we need to condense the 8 outputs into a single matrix: concatenate the 8 head outputs and multiply the result by an additional weight matrix $W^O$.
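A sketch of multi-headed attention with 8 heads: each head has its own projection matrices, the head outputs are concatenated, and a final matrix (WO) maps back to the model dimension. The shapes and variable names here are illustrative placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, WQ, WK, WV, WO):
    """Run one attention per head, concatenate the head outputs, project with WO."""
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO   # (seq_len, d_model)

rng = np.random.default_rng(0)
n_heads, d_model, d_k, seq_len = 8, 512, 64, 5
X = rng.normal(size=(seq_len, d_model))
WQ = rng.normal(size=(n_heads, d_model, d_k)) * 0.02
WK = rng.normal(size=(n_heads, d_model, d_k)) * 0.02
WV = rng.normal(size=(n_heads, d_model, d_k)) * 0.02
WO = rng.normal(size=(n_heads * d_k, d_model)) * 0.02
print(multi_head_attention(X, WQ, WK, WV, WO).shape)  # (5, 512)
```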

Summary of multi-headed self-attention

Positional Encoding
To account for the order of words in the input sequence, the Transformer adds a vector to each input embedding. These vectors follow a specific pattern that helps the model determine the position of each word, or the distance between different words in the sequence.
The positional encoding is added to the word embedding, producing a new embedding with a time signal; this new embedding vector is sent to the next layer.
--> The positional encoding provides meaningful distances between embedding vectors once they are projected into Q/K/V vectors and during dot-product attention.

Example of Positional Encoding
In the Transformer paper, the positional encoding is computed as $PE_{(pos, 2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right)$ and $PE_{(pos, 2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)$, where $pos$ is the position of the current word and $i$ indexes the dimensions of the vector. Even dimensions are encoded with sine, odd dimensions with cosine.
For a sequence of 20 words with a 512-dimensional embedding, the positional encoding looks like the figure below. Each row corresponds to the positional encoding of one position, so the first row is the vector we would add to the embedding of the first word in the input sequence. Each row contains 512 values, each between -1 and 1.
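A sketch of the sinusoidal positional encoding for 20 positions and a 512-dimensional embedding, following the formulas above:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # even dimension indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cosine
    return pe

pe = positional_encoding(20, 512)
print(pe.shape, pe.min(), pe.max())   # (20, 512), all values within [-1, 1]
```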

Layer Normalization
After each sub-layer (self-attention and feed-forward), there is a residual connection followed by a layer normalization operation.

The residual connection adds the sub-layer's input back to its output, so information from the input is not lost.
The idea of normalization is to transform the inputs so that they approximately follow N(0, 1), which keeps them away from the flat (saturated) regions of the activation function.
Batch normalization is one of the most popular methods; it normalizes each feature across the different samples in a batch, as in the left figure below.
Layer normalization, in contrast, normalizes across all the values within the same sample.
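A minimal sketch contrasting the two: batch norm normalizes each feature across the batch dimension, layer norm normalizes across the features of each sample (the learnable gain and bias terms are omitted for brevity):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature (column) across the samples in the batch.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def layer_norm(x, eps=1e-5):
    # Normalize each sample (row) across all of its features.
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(4, 6))  # 4 samples, 6 features
print(layer_norm(x).mean(axis=-1))  # ~0 per sample
print(batch_norm(x).mean(axis=0))   # ~0 per feature
```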
Decoder Side

The decoder side is very similar to the encoder side. The main difference is that the decoder uses masked multi-head attention for its self-attention sub-layer (plus the encoder-decoder attention sub-layer described above).
Mask
A mask is used to hide information so that it has no impact on the results. There are two types of masks in the Transformer's attention:
Padding mask: used in every scaled dot-product attention
Sequence mask: used only in the decoder's self-attention
Padding Mask
It deals with batches of sequences of different lengths. If a sequence is too long, the excess part is cut off; if it is too short, it is padded with zeros so that all sequences have the same length. The padding mask then marks the padded positions so that their attention scores can be set to a very large negative value and they receive (almost) zero weight after the softmax.
Sequence Mask
The sequence mask hides future information that the decoder should not see: at time step t, the decoder can only see (and use as input) the information up to time t.
Specifically, we only need a triangular matrix whose lower triangle (including the diagonal) is 1 and whose upper-triangle elements are 0; positions where the mask is 0 have their scores pushed to a very large negative value before the softmax. Therefore:
For the self-attention in the decoder, we need both the padding mask and the sequence mask as the attention mask, so we simply combine the two masks.
For every other attention layer, the attention mask is just the padding mask.
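A sketch of how the two masks could be built and combined, under the convention that a mask value of 0 means "blocked" and blocked scores are pushed to a large negative number before the softmax (conventions vary between implementations; the function names here are placeholders):

```python
import numpy as np

def padding_mask(lengths, max_len):
    """1 where a position holds a real token, 0 where it is padding."""
    return (np.arange(max_len)[None, :] < np.asarray(lengths)[:, None]).astype(int)

def sequence_mask(max_len):
    """Lower-triangular matrix: position t may only attend to positions <= t."""
    return np.tril(np.ones((max_len, max_len), dtype=int))

def apply_mask(scores, mask):
    """Set blocked positions to a very large negative value before the softmax."""
    return np.where(mask == 1, scores, -1e9)

lengths, max_len = [3, 5], 5
pad = padding_mask(lengths, max_len)             # (batch, max_len)
seq = sequence_mask(max_len)                     # (max_len, max_len)
# Decoder self-attention: combine both masks; elsewhere only the padding mask is used.
combined = pad[:, None, :] * seq[None, :, :]     # (batch, max_len, max_len)
scores = np.random.default_rng(0).normal(size=(len(lengths), max_len, max_len))
print(apply_mask(scores, combined)[0].round(1))  # -1e9 marks blocked positions for sequence 0
```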
Output Layer
To convert the output of the decoder into real words, we only need a fully connected linear layer followed by a softmax. Both produce vectors of vocabulary size: the linear layer outputs the logits, and the softmax turns them into probabilities over the vocabulary.
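A sketch of this final projection to vocabulary probabilities; the vocabulary size and weights below are placeholders:

```python
import numpy as np

def output_layer(decoder_out, W_vocab, b_vocab):
    """Linear projection to vocabulary size, then softmax over the vocabulary."""
    logits = decoder_out @ W_vocab + b_vocab                    # (seq_len, vocab_size)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return probs / probs.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, vocab_size, seq_len = 512, 10000, 5
decoder_out = rng.normal(size=(seq_len, d_model))
W, b = rng.normal(size=(d_model, vocab_size)) * 0.02, np.zeros(vocab_size)
probs = output_layer(decoder_out, W, b)
print(probs.shape, probs.sum(axis=-1))   # (5, 10000), each row sums to 1
print(probs.argmax(axis=-1))             # index of the highest-probability word per position
```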

Reference:
Vaswani et al., "Attention Is All You Need" (2017). https://arxiv.org/pdf/1706.03762.pdf