- What is Self-Attention?
- How Does Self-Attention Work?
- Query, Key, and Value Explained
- Self-Attention Step by Step
- Implementing Self-Attention in Python
- Self-Attention with PyTorch
- Conclusion
- Source Code Listing
Let's get started.
What is Self-Attention?
Self-attention is a mechanism that allows a model to look at all other words in a sequence when encoding a particular word. Instead of reading a sentence left-to-right one word at a time, self-attention lets every word "attend to" every other word simultaneously — capturing context from the entire sequence at once. It is the core building block of the Transformer architecture, which powers modern LLMs like BERT, GPT, and LLaMA.
How Does Self-Attention Work?
Consider the sentence: "The cat sat on the mat because it was tired." When encoding the word "it", the model needs to figure out that "it" refers to "cat". Self-attention solves this by computing a score between every pair of words and using those scores to build a weighted representation of the full sequence.
| Step | What Happens |
|---|---|
| 1. Project inputs | Each word is transformed into three vectors: Query (Q), Key (K), Value (V) |
| 2. Compute scores | Dot product of Q with every K to get attention scores |
| 3. Scale | Divide scores by √d_k to stabilize gradients |
| 4. Softmax | Convert scores to probabilities (attention weights) |
| 5. Weighted sum | Multiply weights by V and sum — this is the output |
Query, Key, and Value Explained
The Q, K, V vectors are produced by learned linear projections of the input embeddings:
- Query (Q) — what this token is looking for
- Key (K) — what each token advertises about itself
- Value (V) — the actual content each token contributes
The attention score between two tokens is the dot product of one token's Query with another token's Key.
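As a toy illustration (the vectors below are made up for this example, not taken from the demo later in the post), the raw score is just a dot product between one token's Query and another token's Key:

```python
import numpy as np

q_it = np.array([0.9, 0.1, 0.0, 0.3])   # Query for the token "it"
k_cat = np.array([1.0, 0.2, 0.1, 0.4])  # Key for the token "cat"
k_mat = np.array([0.1, 0.9, 0.8, 0.0])  # Key for the token "mat"

# Raw attention scores: Query . Key
print(np.dot(q_it, k_cat))  # higher score -> "it" attends more to "cat"
print(np.dot(q_it, k_mat))
```

A higher dot product means the Query and Key point in similar directions, so after the softmax that token receives more of the attention weight.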
Self-Attention Step by Step
The formula for self-attention is:

Attention(Q, K, V) = softmax( Q · Kᵀ / √d_k ) · V
Where d_k is the dimension of the key vectors. The scaling factor √d_k prevents dot products from growing too large, which would push softmax into regions with very small gradients.
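A quick numerical sketch (random unit-variance vectors, an assumption for illustration) shows what the scaling fixes: for d-dimensional random vectors, the dot product has variance proportional to d, so unscaled scores saturate the softmax while scaled scores stay in a well-behaved range.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q, k1, k2 = rng.standard_normal((3, d_k))

raw = np.array([q @ k1, q @ k2])   # large-magnitude scores for big d_k
scaled = raw / np.sqrt(d_k)        # variance brought back near 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

print("raw softmax   :", softmax(raw))     # nearly one-hot: tiny gradients
print("scaled softmax:", softmax(scaled))  # smoother distribution
```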
Implementing Self-Attention in Python
Let's implement self-attention from scratch using NumPy.

import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """
    X   : input embeddings (seq_len, d_model)
    W_q : query weight (d_model, d_k)
    W_k : key weight (d_model, d_k)
    W_v : value weight (d_model, d_v)
    """
    Q = X @ W_q                      # (seq_len, d_k)
    K = X @ W_k                      # (seq_len, d_k)
    V = X @ W_v                      # (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len)
    weights = softmax(scores)        # attention weights
    output = weights @ V             # (seq_len, d_v)
    return output, weights
# --- Demo ---
np.random.seed(42)
seq_len, d_model, d_k, d_v = 4, 8, 4, 4
X = np.random.randn(seq_len, d_model)
W_q = np.random.randn(d_model, d_k)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_v)
output, weights = self_attention(X, W_q, W_k, W_v)
print("Attention Weights:\n", np.round(weights, 3))
print("\nOutput Shape:", output.shape)
Output:
Attention Weights:
[[0.309 0.216 0.271 0.204]
[0.198 0.341 0.189 0.272]
[0.251 0.204 0.338 0.207]
[0.214 0.268 0.223 0.295]]
Output Shape: (4, 4)

Each row in the attention weights matrix shows how much one token attends to every other token. The rows sum to 1.0.
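The row-sum property is easy to verify: each row of the weights matrix is the output of a softmax, so it sums to 1 for any input scores (toy random scores below, not the demo's values):

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

scores = np.random.randn(4, 4)   # any scores work
weights = softmax(scores)
print(weights.sum(axis=-1))      # each row sums to 1 (up to float rounding)
```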
Self-Attention with PyTorch
PyTorch provides a built-in multi-head attention module. Here is a minimal example using nn.MultiheadAttention.
import torch
import torch.nn as nn
# Parameters
seq_len = 4
d_model = 8
n_heads = 2
# Create the attention layer
attention = nn.MultiheadAttention(embed_dim=d_model,
                                  num_heads=n_heads,
                                  batch_first=True)
# Random input: (batch=1, seq_len, d_model)
X = torch.randn(1, seq_len, d_model)
# Self-attention: Q = K = V = X
output, weights = attention(X, X, X)
print("Output shape :", output.shape)
print("Weights shape:", weights.shape)
print("\nAttention Weights:\n", weights.detach().numpy().round(3))
Output:
Output shape : torch.Size([1, 4, 8])
Weights shape: torch.Size([1, 4, 4])
Attention Weights:
[[[0.261 0.245 0.252 0.242]
[0.259 0.248 0.249 0.244]
[0.252 0.249 0.253 0.246]
[0.254 0.247 0.252 0.247]]]
The output shape (1, 4, 8) means 1 batch, 4 tokens, and 8-dimensional representations — the same shape as the input, but now enriched with contextual information from the whole sequence.
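Note that the weights returned above are averaged across the two heads. If you want each head's weights separately, recent PyTorch versions (1.11+) accept an average_attn_weights flag; this is a small sketch alongside the demo, not part of the original example:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d_model, n_heads = 4, 8, 2
attention = nn.MultiheadAttention(embed_dim=d_model,
                                  num_heads=n_heads,
                                  batch_first=True)
X = torch.randn(1, seq_len, d_model)

# average_attn_weights=False returns one weight matrix per head,
# shaped (batch, n_heads, seq_len, seq_len) instead of (batch, seq_len, seq_len).
_, per_head = attention(X, X, X, average_attn_weights=False)
print(per_head.shape)
```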
Conclusion
In this post, we briefly learned what the self-attention mechanism is, how it works through the Q, K, V formulation, and how to implement it in both NumPy and PyTorch. Self-attention is the key innovation behind Transformer-based LLMs — it lets every token in a sequence directly attend to every other token in a single step. In the next post, we will explore multi-head attention and see how stacking multiple attention heads helps the model capture richer relationships.
Source Code Listing
import numpy as np
import torch
import torch.nn as nn
# ----- NumPy self-attention -----
def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores)
    output = weights @ V
    return output, weights
np.random.seed(42)
seq_len, d_model, d_k, d_v = 4, 8, 4, 4
X = np.random.randn(seq_len, d_model)
W_q = np.random.randn(d_model, d_k)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_v)
output, weights = self_attention(X, W_q, W_k, W_v)
print("Attention Weights:\n", np.round(weights, 3))
print("Output Shape:", output.shape)
# ----- PyTorch multi-head attention -----
seq_len, d_model, n_heads = 4, 8, 2
attention = nn.MultiheadAttention(embed_dim=d_model,
                                  num_heads=n_heads,
                                  batch_first=True)
X = torch.randn(1, seq_len, d_model)
output, weights = attention(X, X, X)
print("Output shape :", output.shape)
print("Weights shape:", weights.shape)