Transformers Illustrated!
I was greatly inspired by Jay Alammar’s explanation of transformers. Later, I decided to explain transformers the way I understood them, and after giving a session at a Meetup, the feedback further motivated me to write it down on Medium.
Most of the image credits go to Jay Alammar.
1. Introduction
🤗 Transformers provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBERT, XLNet, etc.) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with over 2,000 pre-trained models in more than 100 languages, available in TensorFlow 2.0 and PyTorch with seamless integration between them, allowing you to train your models with one and then load them for inference with the other.
🤗 Transformers provides APIs to quickly download and use those pre-trained models on a given text, fine-tune them on your own datasets, and then share them with the community on the model hub.
1.1 Why should I use 🤗 Transformers?
- Easy-to-use state-of-the-art models
- Lower compute costs, smaller carbon footprint
- Choose the right framework for every part of a model’s lifetime
- Easily customize a model or an example to your needs
🤗 Transformers provides the following tasks out of the box:
- Sentiment analysis
- Text generation (in English)
- Named entity recognition (NER)
- Question answering
- Filling masked text
- Summarization
- Translation
And by translation, I didn’t mean this…
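Before moving on, here is a minimal sketch of running one of these tasks (sentiment analysis) with the library’s pipeline API; the input text and the printed score are illustrative:

```python
from transformers import pipeline

# Downloads a default pre-trained model and runs it on a piece of text.
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers make NLP a lot easier!"))
# Expected output looks something like: [{'label': 'POSITIVE', 'score': 0.99...}]
```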
1.2 What was the need for 🤗 Transformers?
Recurrent neural networks (RNNs) are capable of looking at previous inputs to predict the next possible word. But RNNs suffer from a short window of reference, a result of the vanishing gradient problem, which makes it difficult to capture the context of a story once the story gets longer. This is still true for Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTM) networks, although they do have a bigger capacity to achieve longer-term memory compared to plain RNNs.
Not only that, RNNs are slow to train. Such a recurrent process cannot take advantage of modern graphics processing units (GPUs), which were designed for parallel computation. And what’s worse, LSTMs are even slower to train.
The attention mechanism, in theory, has an infinite window to reference from, and is therefore capable of using the entire context of the story. In terms of training, Transformers are definitely faster because of their parallel processing capability. Let’s find out more!
2 🤗 Transformers Architecture
2.1 High-level look
I believe the following one looks familiar and “professional” to you!
2.2 Encoder — in depth!
Now, we will deep dive into the Encoder section. This is the “professional” view.
We pass a sentence as an input (of course), but a machine can only understand 0s and 1s (again, of course). So we need to translate the words in the sentence into a matrix.
2.2.1 Inputs and Input Embedding
This model is trained on a corpus with a vocabulary of ~30,000 unique words. Each of these words has a unique ID, known as its vocabulary index.
The next step is to convert each input word into its corresponding word embedding. A word embedding is the vector representation of a word in the vocabulary.
For simplicity of explanation, I used d = 3 dimensions here, but in reality it is 512, 768 or even 1024. The more, the better.
Each of these dimensions captures “some” linguistic feature of that word. Since the model decides these features itself during training, it can be non-trivial to find out what exactly each dimension represents.
These vectors are randomly initialized, and it is these vectors that get fine-tuned during model training; they ultimately produce the contextual representations of the words that are leveraged at inference time.
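To make this concrete, here is a minimal PyTorch sketch of the embedding lookup; the vocabulary size, dimension and token IDs are illustrative, not the actual model’s:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 30000, 512                # ~30,000 words, 512-dimensional embeddings
embedding = nn.Embedding(vocab_size, d_model)   # randomly initialized, fine-tuned during training

token_ids = torch.tensor([[12, 487, 2901]])     # hypothetical vocabulary indices for a 3-word sentence
word_vectors = embedding(token_ids)             # shape: (1, 3, 512)
print(word_vectors.shape)
```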
2.2.2 Positional Encoding
In recurrent networks like LSTMs and GRUs, the network processes the input sequentially, token after token. The hidden state at position t+1 depends on the hidden state from position t. This way, the network can identify the relative position of each token by accumulating information. Transformers, however, have no notion of word order. That is why they are faster, but it also means we need to supply information about each word’s position.
Hence the need for positional encoding. So how do we do it?
Strategy 1: Add a vector of position IDs (0, 1, 2, ..., (N-1)) to the word vectors.
But there is a problem. Adding numbers like these will distort the word embedding values, especially for the words appearing later in the text.
Strategy 2: Add a vector of fractional position IDs (0*1/(N-1), 1*1/(N-1), 2*1/(N-1), ..., (N-1)*1/(N-1)) to the word vectors.
But there is another problem. Different sentences have different numbers of words, so even with fractional values, the same position ends up with a different value in different sentences. These positional vectors need to be constant for their corresponding positions.
Implemented Strategy
Hence, the authors propose a cyclic (dynamic) solution where sine and cosine functions with different frequencies are added to each word embedding. The formula goes like this:
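PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))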
Let us try understanding the sin part of the formula used to compute the position embeddings:
Here pos refers to the position of the word in the sequence, P0 refers to the position embedding of the first word, and d is the size of the word/token embedding (here d = 5). Finally, i refers to each of the 5 individual dimensions of the embedding (i.e. 0, 1, 2, 3, 4).
While d is fixed, pos and i vary. Let us try understanding the latter two.
1. pos
If we plot a sin curve and vary pos (on the x-axis), you will end up with different position values on the y-axis. Therefore, words at different positions will have different position embedding values.
There is a problem though. Since the sin curve repeats in intervals, you can see in the figure above that P0 and P6 have the same position embedding values, despite being at two very different (word) positions. This is where the i part of the equation comes into play.
2. i
If you vary i in the equation above, you will get a bunch of curves with varying frequencies. Reading off the position embedding values against these different frequencies ends up giving different values at different embedding dimensions for P0 and P6.
For every odd index of the position vector we use the cosine function, and for every even index, the sine function.
Finally, we add these positional vectors to their corresponding input embedding vectors. This successfully gives the network information about the position of each word.
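Here is a minimal NumPy sketch of this sinusoidal positional encoding; the sequence length and dimensions are illustrative:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]                       # word positions 0 .. max_len-1
    freq = 10000 ** (np.arange(0, d_model, 2) / d_model)    # one frequency per dimension pair
    pe[:, 0::2] = np.sin(pos / freq)                        # even indices -> sine
    pe[:, 1::2] = np.cos(pos / freq)                        # odd indices  -> cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
# pe[pos] is then added to the word embedding of the token at that position.
```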
2.2.3 Multi-Headed Self-Attention Mechanism
2.2.3.1 Self-Attention
Self-attention allows the model to associate each word in the input with other words.
Example #1
Example #2
Say the following sentence is an input sentence we want to translate:
The animal didn't cross the street because it was too tired
What does “it” in this sentence refer to? Is it referring to “street” or “animal”? As the model processes each word, self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding of this word.
So how is it working?
What motivated this architecture? The idea is partially inspired by the way a retrieval system works, where a query is matched against keys to fetch the corresponding values.
Step 1: 3 Linear Components
We feed the positional embedding input into 3 distinct linear layers, each comprising a randomly initialized weight matrix, to create 3 vectors: Query, Key and Value.
Multiplying 𝑋1 by the 𝑊𝑄 weight matrix produces q1, the “query” vector associated with that word. Likewise, we end up creating a “key” and a “value” projection of each word in the input sentence.
NOTE: These new vectors are smaller in dimension than the positional embedding vector. We will come back to this.
Step 2: Getting the Attention Weights
Now, the Query and the transpose of the Key undergo a dot-product matrix multiplication to generate a score matrix, which determines how much focus a word should put on the other words. A higher score means more focus. This is how the queries are mapped to the keys.
Then, the scores get scaled down by dividing them by the square root of the dimension of the key vector. This allows for more stable gradients, as multiplying values can have exploding effects. Next, you take the softmax of the scaled scores to get the attention weights (or filters), which gives you probability values between 0 and 1.
By taking the softmax, the higher scores get enhanced and the lower scores are depressed. This allows the model to be more confident about which words to attend to.
Step 3: Mapping the Attention Weights to the Original Matrix
Then you take the attention weights and multiply them by your Value vectors to get an output matrix Z. The higher softmax scores will keep the values of the words the model learns are more important, while the lower scores will drown out the irrelevant words.
Why is this multiplication done? The best way to explain the reason for implementing this technique is in the context of computer vision.
Imagine encountering Yahiko from the Six Paths of Pain. In reality, the entire view is like this.
But you need to focus on Yahiko.
This is achieved in the following way.
Final Step/Summary
So, this is how self-attention works! The following formula gives you the summary:
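The formula being referred to is Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k)·V. Below is a minimal NumPy sketch of this scaled dot-product attention; the shapes are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # Step 2: query-key dot products, scaled by sqrt(d_k)
    weights = softmax(scores)            # attention weights between 0 and 1
    return weights @ V                   # Step 3: weighted sum of the value vectors (the Z matrix)

# Hypothetical example: 4 tokens, 64-dimensional queries/keys/values
Q, K, V = (np.random.randn(4, 64) for _ in range(3))
Z = scaled_dot_product_attention(Q, K, V)   # shape: (4, 64)
```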
2.2.3.2 Multi-Headed
The paper further refined the self-attention layer by adding a mechanism called multi-headed attention.
Each self-attention process we learned about above is called a head. Stacking up multiple self-attention heads gives us multi-headed attention. In the paper, there are 8 heads, so the 512-dimensional input gets segmented into 8 vectors of 64 dimensions each. In the case of BERT, there are 12 such 64-dimensional vectors, resulting in (12*64 =) 768-dimensional vectors.
Why is this technique implemented?
In theory, each head learns something different, therefore giving the encoder model more representational power. Another visual example will help.
You are now encountering all the members of the Six Paths of Pain.
Now, we decide to process 2 individuals at a time to cover the entire scenario, thereby keeping an eye on everyone.
In this multi-headed attention computation, each head has its own Query, Key and Value weight matrices, which are randomly initialized and mutually exclusive, and which help project the positional embeddings into a different representation subspace.
Now, if we perform the same self-attention calculation as outlined in the previous section, 8 different times with different weight matrices, we end up with 8 different Z matrices.
An example to clearly understand it.
However, this leaves us with a bit of a challenge. The upcoming feed-forward neural network (FFNN) layer is not expecting 8 matrices; it’s expecting a single matrix (a vector for each word). Hence, we concatenate the matrices and then pass them through another linear layer (again comprising a weight matrix) 𝑊𝑂 to get back the original vector dimension (for example, 512).
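Putting the pieces together, here is a minimal NumPy sketch of the whole multi-headed computation: split into heads, attend, concatenate, and project back with 𝑊𝑂 (weights and sizes here are random placeholders):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=8):
    d_head = X.shape[-1] // num_heads                   # e.g. 512 / 8 = 64
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                 # the 3 linear projections
    heads = [attention(Q[:, h*d_head:(h+1)*d_head],     # each head attends over its own 64-d slice
                       K[:, h*d_head:(h+1)*d_head],
                       V[:, h*d_head:(h+1)*d_head]) for h in range(num_heads)]
    return np.concatenate(heads, axis=-1) @ W_o         # concatenate the 8 Z matrices, project with W_O

d_model, n_tokens = 512, 4
X = np.random.randn(n_tokens, d_model)                  # positional embeddings
W_q, W_k, W_v, W_o = (np.random.randn(d_model, d_model) for _ in range(4))
Z = multi_head_attention(X, W_q, W_k, W_v, W_o)         # shape: (4, 512), back to the original dimension
```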
Recap
A quick recap of the operations and steps performed in the multi-headed self-attention mechanism.
2.2.4 Residual Connections and Layer Normalization
The multi-headed attention output vector is added to the original positional input embedding. This is called a residual connection. The purpose of this component is to preserve the original context, thereby also helping to tackle the vanishing gradient problem.
The output of the residual connection goes through layer normalization. This is placed after each sub-layer (self-attention, FFNN) in each encoder.
Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons using batch normalization. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. So, researchers transposed batch normalization into layer normalization.
Just to have a better understanding of this, take a look at this visual.
In batch normalization, the statistics are computed across the batch. In contrast, in layer normalization, the statistics are computed across each feature and are independent of other examples.
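Here is a minimal NumPy sketch of this “add & norm” step; the statistics are computed across the features of each token independently (the learnable scale and shift parameters of layer normalization are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)    # statistics per token, across its features
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(residual_input, sublayer_output):
    # Residual connection (preserve the original context), then layer normalization
    return layer_norm(residual_input + sublayer_output)
```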
2.2.5 Feed-Forward Neural Network
The penultimate layer in the block is a position-wise feed-forward network. The same feed-forward network is applied to each word vector in the sentence (up to the capped sentence length) independently, so each position is processed independently of every other position. This network consists of 2 linear layers (2 1D convolutions with kernel size 1) with a ReLU activation in between.
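A minimal NumPy sketch of this position-wise feed-forward network; d_model = 512 and the inner dimension d_ff = 2048 follow the paper, while the weights here are random placeholders:

```python
import numpy as np

d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

def position_wise_ffn(x):
    hidden = np.maximum(0.0, x @ W1 + b1)   # first linear layer + ReLU
    return hidden @ W2 + b2                 # second linear layer, back to d_model
```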
The output of this network then goes through another residual connection followed by layer normalization.
But, WHY do we need this layer?
Its main purpose is to process the output from one attention layer in a way that better fits the input for the next attention layer.
This is the kind of layer that usually appears near the end of a network.
After the attention layer, the latent representation of each word contains information from other words. However, we want to consolidate a unique representation for each word. This is done using a localized layer, which does not consider neighbours or other positions and simply transforms each local representation on its own.
Encoder — Wrap up!
That wraps up the encoder layer. All of these operations encode the input into a continuous representation with attention information. This will help the decoder focus on the appropriate words in the input during the decoding process. You can stack the encoder N times to further encode the information, where each layer has the opportunity to learn different attention representations, therefore potentially boosting the predictive power of the transformer network.
Based on this, BERT came into the picture.
Not this BERT actually…
A simple example of a BERT implementation is shown in the image below:
In September 2020, Google published BigBird (again inspired by Sesame Street).
This Transformer-based model is designed to handle much larger input sequences. It incorporates a sparse attention mechanism which enables it to process sequences up to 8 times longer than what was possible with BERT. Using this, researchers reduced BERT’s O(n²) attention complexity to just O(n).
Link to the paper can be found here.
2.3 Decoder — in depth!
Now, we will deep dive into the Decoder section. This is the “professional” view.
The decoder’s job is to generate text sequences. The decoder has sub-layers similar to the encoder’s: it has 2 multi-headed attention layers, a pointwise feed-forward layer, residual connections, and layer normalization. These sub-layers behave similarly to the layers in the encoder, but each multi-headed attention layer has a different job. The decoder is capped off with a linear layer that acts as a classifier, and a softmax to get the word probabilities.
The decoder is autoregressive. This is how it operates (see the sketch after the list):
- It begins with a special <start> token.
- This token’s corresponding vector is processed together with the encoder outputs, which contain the attention information, and the decoder generates a possible word.
- Then it takes the previous output(s) as input(s), again along with the encoder outputs.
- Then it generates the next possible word, and this process goes on.
- The decoder stops decoding when it generates the <eos> (short for end-of-sentence) token as an output.
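Here is a minimal sketch of that autoregressive loop. `decoder_step` is a hypothetical function standing in for the whole decoder stack: given the tokens generated so far and the encoder outputs, it returns a probability for every word in the vocabulary.

```python
def greedy_decode(decoder_step, encoder_outputs, start_id, eos_id, max_len=50):
    output_ids = [start_id]                                    # begin with the <start> token
    for _ in range(max_len):
        probs = decoder_step(output_ids, encoder_outputs)      # attend to previous outputs and the encoder
        next_id = max(range(len(probs)), key=probs.__getitem__)   # pick the most probable word
        output_ids.append(next_id)
        if next_id == eos_id:                                  # stop once <eos> is generated
            break
    return output_ids
```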
2.3.1 Output Embedding and Positional Encoding
The beginning of the decoder is pretty much the same as the encoder. The input goes through an embedding layer and positional encoding layer to get positional embeddings.
2.3.2 First Multi-Headed Self Attention Mechanism
This multi-headed attention layer operates slightly differently from the one in the encoder. Since the decoder is autoregressive and generates the sequence word by word, we need to prevent it from conditioning on future tokens. For example, when computing attention scores for the word “am”, it should not have access to the word “fine”, because that is a future word that gets generated afterwards. The word “am” should only have access to itself and the words before it.
This is true for all the other words as well: they can only attend to previous words.
So, when Ross says…
… he IS fine!
So, how do we prevent computing attention scores for future words?
This is done using a Look-Ahead Mask. The mask is a matrix of the same size as the attention scores, filled with 0s and negative infinities. When you add the mask to the scaled attention scores, you get a matrix of scores with the top-right triangle filled with negative infinities.
Once you take the softmax of the masked scores, the negative infinities become 0, leaving zero attention scores for the future tokens.
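A minimal NumPy sketch of the look-ahead mask; the sequence length and the scores are illustrative:

```python
import numpy as np

def look_ahead_mask(seq_len):
    # 0s on and below the diagonal, a large negative number (~ -inf) above it
    return np.triu(np.full((seq_len, seq_len), -1e9), k=1)

scores = np.random.randn(4, 4)                 # hypothetical (already scaled) attention scores
masked = scores + look_ahead_mask(4)
e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)    # softmax: future positions get ~0 attention weight
```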
This component also has multiple heads; the mask is applied in each of them before the heads get concatenated. Again, an example for clear understanding.
Then, similar to the encoder, the model employs residual connections followed by layer normalization.
2.3.3 Second Multi-Headed Attention Mechanism
For this layer, the inputs are:
- Query — output of the masked multi-headed attention layer of decoder
- Key — Encoder’s output
- Value — Encoder’s output
This process matches the decoder’s representation against the encoder’s output, allowing the decoder to decide which parts of the encoder output are relevant to focus on. In other words, the decoder predicts the next word by looking at the encoder output and self-attending to its own output.
Hence, this layer is also called encoder-decoder attention or source-target attention. The following picture will help you understand this.
An example to understand it better.
Then, again, the model performs a residual connection followed by layer normalization.
2.3.4 Feed-Forward Neural Network
Just like in the encoder, the output of the encoder-decoder attention is fed to a FFNN to process it into a form acceptable to the final layer.
2.3.5 Linear Classifier and Softmax Function
The decoder stack outputs a vector of floats. How do we turn that into a word? That is the job of the final Linear layer, which is followed by a softmax layer.
- The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders into a much, much larger vector called a logits vector. So this layer is basically a classifier, and the classifier is as big as the number of classes you have. With respect to this paper, this layer has ~30,000 classes for ~30,000 words. This makes the logits vector ~30,000 cells wide, each cell corresponding to the score of a unique word.
- The softmax layer then turns those scores into probabilities (all positive, all adding up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
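A minimal NumPy sketch of this final step; the weight matrix here is a random placeholder for the learned classifier:

```python
import numpy as np

d_model, vocab_size = 512, 30000
W_classifier = np.random.randn(d_model, vocab_size)   # learned in the real model

decoder_output = np.random.randn(d_model)             # decoder vector for the current time step
logits = decoder_output @ W_classifier                # ~30,000 scores, one per word in the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                  # softmax: all positive, sums to 1.0
predicted_id = int(np.argmax(probs))                  # vocabulary index of the word produced at this step
```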
2.3.6 Optimizer and Loss Function
The authors of the paper used the Adam optimizer with a custom learning rate that varies over the course of training. This is achieved using the following formula:
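lrate = d_model^(-0.5) * min(step_num^(-0.5), step_num * warmup_steps^(-1.5))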
where warmup_steps = 4000.
As for the loss function, the paper uses categorical cross-entropy.
2.3.7 Final view of the decoder output
If we go back to the translation example, the output from the decoder will be as follows:
Ground Truth
Predicted Answer
Decoder — Wrap up!
That wraps up the decoder layer. Now, the decoder will be able to map the relevant information from the encoder output, capturing the context and generating the result. You can stack the decoder N times, just like the encoder, to further process and decode the information, where each layer has the opportunity to learn different attention representations, therefore potentially boosting the predictive power of the transformer network.
The OpenAI GPT-2 model uses these decoder-only blocks. Here is a sample output of GPT-2.
3 🤗 Transformers — Wrap up!
So, we have covered how each component in the Encoder and the Decoder works individually. Let’s take a look at how they work together.
For simplicity, we take a stack of 2 encoders and 2 decoders and perform French-to-English translation.
After finishing the encoding phase, we begin the decoding phase. Each step in the decoding phase outputs an element from the output sequence.
The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.
This step is repeated until the model produces the <eos> token, indicating the end of the process (here, translation).
That’s it! This is the entire mechanics of the transformers.
Now it should be easier for you to go through the original paper, Attention Is All You Need.
Take a look at how TensorFlow has implemented it, with code snippets, for an even more detailed understanding of the model.
But for Joey…
Looks like we can train a model to translate but Joey is impossible. So all we can say is…
La Fin!