Prerequisites to Know

Multi-Head Attention from a Beginner's POV

In Transformers, Multi-Head Attention takes the same Query (Q), Key (K), and Value (V) inputs as single-head attention but splits the work across multiple “heads.” Each head focuses on different aspects of the input—like grammar, meaning, or context—and together, they give the model a more nuanced understanding. It's the secret sauce behind why Transformers are so good at everything from translation to generating text like this!

This is how Multi-Head Attention works inside; we will discuss everything about this diagram later in this blog. For now, let's cover a main topic, which is

Why "Scale" the Dot Product ?

In the original dot-product attention, the raw scores can become very large when the dimensionality of the query and key vectors, \(d_k\), is high. Large scores push the softmax into a saturated regime where its gradients become extremely small, leading to numerical instability during training.
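To see this saturation in action, here is a minimal NumPy sketch; the key dimension, the random vectors, and the five keys are purely illustrative assumptions. With the raw (unscaled) scores the softmax collapses to a near one-hot distribution, while the scaled version stays much softer.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512  # an illustrative key dimension

# Dot products of unit-variance random vectors have variance ~ d_k,
# so the raw scores are spread over a wide range.
q = rng.normal(size=d_k)
K = rng.normal(size=(5, d_k))   # 5 keys
scores = K @ q

print(softmax(scores))                 # unscaled: nearly one-hot
print(softmax(scores / np.sqrt(d_k)))  # scaled: a much softer distribution
```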

To address this issue, scaled dot-product attention introduces a scaling factor:

\[\text{Attention}(Q, K, V) = \text{softmax} \left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) \cdot V\]

Here, \(d_k\) is the dimensionality of the key vectors. Dividing by \(\sqrt{d_k}\) keeps the dot products in a reasonable range, preventing the softmax function from saturating.
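Putting the formula into code, here is a minimal NumPy sketch of scaled dot-product attention; the function name, variable names, and toy shapes are my own choices for illustration, not from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (seq_len_q, d_k), K: (seq_len_k, d_k), V: (seq_len_k, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # scale the raw dot products
    weights = softmax(scores, axis=-1)  # one attention distribution per query
    return weights @ V                  # weighted sum of the values

# Toy example: 4 tokens with d_k = d_v = 8 (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(X, X, X).shape)  # (4, 8)
```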

After all this, you might end up with a question:

Why Multi-Head Attention?

Single-head attention is great, but it's limited. It computes one set of attention weights and mixes all the information into a single output. That's like trying to hear every instrument in a symphony with one ear: it works, but you miss the layers. Multi-Head Attention says, "Why settle for one perspective?" By running attention multiple times in parallel, each with its own lens (or "head"), the model captures diverse relationships in the data, like how "it" refers to "cat" in one head, while another head notices the verb tense.

Plus, it's still parallelizable (unlike RNNs), so it's fast. More heads = more insights, without slowing things down. Genius, right?

Now one would wonder: if Multi-Head Attention helps so much, how does it work under the hood? So let's start with our second most important topic, which is

How Does Multi-Head Attention Work?

Let’s get into the nitty-gritty. Here’s the step-by-step breakdown, with all the crazy details:

The Math, Condensed

Here’s the full formula:

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h) \cdot W_O\]

where:

\[\text{head}_i = \text{softmax}\left(\frac{(X \cdot W_Q^i) \cdot (X \cdot W_K^i)^T}{\sqrt{d_k}}\right) \cdot (X \cdot W_V^i)\]
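And here is a minimal NumPy sketch that mirrors this formula head by head; the per-head projection matrices and the toy shapes below are illustrative assumptions, not tied to any specific framework.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    # X: (seq_len, d_model); W_Q/W_K/W_V: lists of per-head (d_model, d_k) matrices
    # W_O: (num_heads * d_k, d_model)
    heads = []
    for i in range(num_heads):
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
        d_k = K.shape[-1]
        weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
        heads.append(weights @ V)                    # head_i: (seq_len, d_k)
    return np.concatenate(heads, axis=-1) @ W_O      # Concat(head_1..head_h) · W_O

# Toy setup: 4 tokens, d_model = 16, 4 heads of size 4 (purely illustrative)
rng = np.random.default_rng(0)
seq_len, d_model, h = 4, 16, 4
d_k = d_model // h
X = rng.normal(size=(seq_len, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))

print(multi_head_attention(X, W_Q, W_K, W_V, W_O, h).shape)  # (4, 16)
```

In practice, frameworks usually fuse the per-head projections into one big matrix multiply and then reshape, but the explicit loop above keeps the correspondence with the formula easy to follow.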

Okay, now we are heading to the part which everyone liked in the previous blog. Yup, the INTUITION. Sorry if I can't match the hype, but I will try my best <3

Intuition: (Heist Edition)

You are a Detective | 刑事

There has been a recent heist at your nearby bank, and you are piecing together the details: "The thief escaped after the guard dozed off." With single-head attention, you are like one detective with a flashlight, sweeping the crime scene and connecting clues. You might lock onto "thief" and "escaped" but miss how "guard" and "dozed off" set the stage for the getaway!

Now Multi-Head Attention steps in like a team of detectives, each with their own flashlight and their own specialty.

The result? A richer, multi-layered picture of the heist—way more detailed than what one lone detective could crack on their own.