Transformers are a neural network architecture popularly used for text generation, machine translation, and similar tasks. Before transformers, models like RNNs and LSTMs were the standard for sequence data. Unfortunately, they struggled with long-range dependencies: their sequential processing made them hard to parallelize and made it difficult to capture context from far away in the sequence.
Transformers solved this by using an attention mechanism to weigh the importance of different parts of the input, allowing the model to focus on the relevant parts regardless of their position.
This is what attention looks like under the hood; we will deep-dive into every component one by one.
Coming back to our topic: attention is a way to compute a weighted sum of values (from the input), where the weights are determined by the relevance between the query and the keys.
In the context of a sentence, each word (or token) has query, key, and value vectors. The query from a word is compared with the keys of all the other words to determine how much attention to pay to each one. The values are then aggregated based on these attention weights.
They usually come from linear transformations of the input embeddings. Each input token is embedded into a vector, which is then multiplied by weight matrices \(W_q, W_k, W_v\) to get the query, key, and value vectors. The attention scores between tokens are then calculated as the dot product of the query and key vectors. These scores are scaled by the square root of the dimension of the key vectors to stop the dot products from growing too large, which would saturate the softmax and lead to vanishing gradients.
After scaling, a softmax is applied to convert the scores into probabilities (weights) that sum to 1. These weights are then used to take a weighted average of the value vectors. The result is the output of the attention mechanism for each position. This process is called self-attention because each position attends to all positions in the same sequence.
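To make those two steps concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function names and shapes are my own choices for illustration, not from any particular library:

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q, K, V: arrays of shape (n_tokens, d).
    Returns the attention output (n_tokens, d) and the weights (n_tokens, n_tokens).
    """
    d = Q.shape[-1]
    # Dot-product similarity between every query and every key,
    # scaled by sqrt(d) so the scores don't grow with the dimension.
    scores = Q @ K.T / np.sqrt(d)
    # Softmax turns each row of scores into weights that sum to 1.
    weights = softmax(scores, axis=-1)
    # Each output position is a weighted average of the value vectors.
    return weights @ V, weights
```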
Let \( \mathbf{x} \in \mathbb{R}^{d} \) be the input embedding for a token, where \( d \) is the dimensionality of the embedding space.
The query, key, and value vectors are generated by applying linear transformations to the input embedding:
\[ \mathbf{q} = \mathbf{W}_q \mathbf{x}, \qquad \mathbf{k} = \mathbf{W}_k \mathbf{x}, \qquad \mathbf{v} = \mathbf{W}_v \mathbf{x} \]
where \( \mathbf{W}_q, \mathbf{W}_k, \mathbf{W}_v \in \mathbb{R}^{d \times d} \) are learnable weight matrices.
The attention scores between tokens are calculated using the dot product of the query and key vectors:
\[ \mathbf{A} = \frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}} \]
where \( \mathbf{Q}, \mathbf{K} \in \mathbb{R}^{n \times d} \) stack the query and key vectors of all tokens row-wise, \( \mathbf{A} \in \mathbb{R}^{n \times n} \) is the attention score matrix, \( n \) is the number of tokens, and \( \sqrt{d} \) is the scaling factor that keeps the scores from growing too large and saturating the softmax (which would lead to vanishing gradients).
Note: \( \top \) denotes the transpose operation, and \( \mathbb{R}^{d \times d} \) represents the set of \( d \times d \) matrices with real-valued entries.
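In code, the same matrix form looks like the sketch below, where the \( n \) token embeddings are stacked into a matrix \( \mathbf{X} \in \mathbb{R}^{n \times d} \). The weight matrices here are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 5, 8                    # n = 5 tokens, embedding dimension d = 8
X = rng.normal(size=(n, d))    # input embeddings, one row per token

# The learnable projection matrices (random stand-ins for trained weights).
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

Q, K, V = X @ W_q, X @ W_k, X @ W_v   # queries, keys, values for all tokens

A = Q @ K.T / np.sqrt(d)              # attention score matrix, shape (n, n)
print(A.shape)                        # (5, 5)
```

Feeding `A` row-wise through a softmax and multiplying by `V` gives the attention output, exactly as in the earlier sketch.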
Transformers don't stop at one attention mechanism: they use Multi-Head Attention. Instead of computing a single attention function, the model splits the input into multiple heads, each computing its own attention. This allows the model to focus on different parts of the input in different ways. The outputs from each head are concatenated and then linearly transformed again to get the final output.
Why do this? To capture different types of relationships or dependencies in the data: it lets the model attend to multiple relationships at once, making it incredibly flexible and expressive.
We will discuss more about Multi-Head Attention in the upcoming blog :)
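In the meantime, here is a rough preview sketch of the split-compute-concatenate pattern in NumPy. The function name and signature are illustrative, not from any library:

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Rough sketch of multi-head attention for one sequence.

    X: (n_tokens, d_model); W_q/W_k/W_v/W_o: (d_model, d_model) learned matrices.
    d_model must be divisible by n_heads.
    """
    n, d_model = X.shape
    d_head = d_model // n_heads

    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    outputs = []
    for h in range(n_heads):
        # Each head works on its own d_head-sized slice of the projections.
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        scores -= scores.max(axis=-1, keepdims=True)   # stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ V[:, sl])             # (n, d_head)

    # Concatenate the heads and mix them with a final linear layer.
    return np.concatenate(outputs, axis=-1) @ W_o

# Example call with random stand-in weights:
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 16))
Ws = [rng.normal(size=(16, 16)) for _ in range(4)]
out = multi_head_attention(X, *Ws, n_heads=4)          # shape (5, 16)
```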
In Transformers, the most common type of attention is Self-Attention, where Q, K, and V all come from the same input sequence. This means every token in the sequence can attend to every other token, including itself. For example:
The cat slept because it was tired
Self-Attention lets "it" focus heavily on "cat" to understand the reference.
This bidirectional awareness is what makes Transformers so good at capturing context, unlike RNNs, which can only look backward or forward.
\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \]
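To see the "same sequence" point concretely, here is a tiny self-contained demo on the example sentence. The embeddings and weights are random and untrained, so the printed weights are meaningless; in a trained model, the row for "it" would put high weight on "cat":

```python
import numpy as np

rng = np.random.default_rng(42)
tokens = ["The", "cat", "slept", "because", "it", "was", "tired"]

d = 16
X = rng.normal(size=(len(tokens), d))   # stand-in embeddings (random, untrained)

# Self-attention: Q, K, and V are all projections of the SAME sequence X.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Attention distribution for "it" over every token (itself included).
print(dict(zip(tokens, weights[tokens.index("it")].round(2))))
```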
The Transformer consists of an Encoder and a Decoder, both stacked with multiple layers (e.g., 6 or 12). Attention plays a different role in each: the encoder uses self-attention over the input sequence, while the decoder uses masked self-attention over the tokens generated so far, plus cross-attention, where the queries come from the decoder and the keys and values come from the encoder's output.
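As an illustration of the decoder side, masked self-attention is the same score computation with future positions blanked out before the softmax; a minimal sketch (the helper name is my own):

```python
import numpy as np

def causal_mask(scores):
    """Mask for the decoder's self-attention: position i may only attend
    to positions <= i, hiding future tokens during generation."""
    n = scores.shape[-1]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    # -inf scores become 0 after the softmax, so future tokens get no weight.
    return np.where(future, -np.inf, scores)
```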
The single phrase that made me understand attention is:
Think of Attention as a spotlight operator in a theater. The Query is the director saying, “Focus on the lead actor!” The Keys are all the actors on stage, and the Values are their lines. The spotlight (Attention) adjusts its beam based on who’s most relevant, illuminating the scene dynamically as the play unfolds.