- What is Self-Attention?
- How Does Self-Attention Work?
- Query, Key, and Value Explained
- Self-Attention Step by Step
- Implementing Self-Attention in Python
- Self-Attention with PyTorch
- Conclusion
- Source Code Listing
Let's get started.
What is Self-Attention?
Self-attention is a mechanism that allows a model to look at all other words in a sequence when encoding a particular word. Instead of reading a sentence left-to-right one word at a time, self-attention lets every word "attend to" every other word simultaneously — capturing context from the entire sequence at once. It is the core building block of the Transformer architecture, which powers modern LLMs like BERT, GPT, and LLaMA.
How Does Self-Attention Work?
Consider the sentence: "The cat sat on the mat because it was tired." When encoding the word "it", the model needs to figure out that "it" refers to "cat". Self-attention solves this by computing a score between every pair of words and using those scores to build a weighted representation of the full sequence.
| Step | What Happens |
|---|---|
| 1. Project inputs | Each word is transformed into three vectors: Query (Q), Key (K), Value (V) |
| 2. Compute scores | Dot product of Q with every K to get attention scores |
| 3. Scale | Divide scores by √d_k to stabilize gradients |
| 4. Softmax | Convert scores to probabilities (attention weights) |
| 5. Weighted sum | Multiply weights by V and sum — this is the output |
Query, Key, and Value Explained
The Q, K, V vectors are produced by learned linear projections of the input embeddings:
- Query (Q) — what this token is looking for
- Key (K) — what each token advertises about itself
- Value (V) — the actual content each token contributes
The attention score between two tokens is the dot product of one token's Query with another token's Key.
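As a toy illustration (the vectors below are made up for this example, not taken from the demo later in the post), the raw score is just a dot product between one token's Query and another token's Key:

```python
import numpy as np

q_it = np.array([0.9, 0.1, 0.0, 0.3])   # Query for the token "it"
k_cat = np.array([1.0, 0.2, 0.1, 0.4])  # Key for the token "cat"
k_mat = np.array([0.1, 0.9, 0.8, 0.0])  # Key for the token "mat"

# Raw attention scores: Query . Key
print(np.dot(q_it, k_cat))  # higher score -> "it" attends more to "cat"
print(np.dot(q_it, k_mat))
```

A higher dot product means the Query and Key point in similar directions, so after the softmax that token receives more of the attention weight.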
Self-Attention Step by Step
The formula for self-attention is:

Attention(Q, K, V) = softmax( Q · Kᵀ / √d_k ) · V
Where d_k is the dimension of the key vectors. The scaling factor √d_k prevents dot products from growing too large, which would push softmax into regions with very small gradients.
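A quick numerical sketch (random unit-variance vectors, an assumption for illustration) shows what the scaling fixes: for d-dimensional random vectors, the dot product has variance proportional to d, so unscaled scores saturate the softmax while scaled scores stay in a well-behaved range.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q, k1, k2 = rng.standard_normal((3, d_k))

raw = np.array([q @ k1, q @ k2])   # large-magnitude scores for big d_k
scaled = raw / np.sqrt(d_k)        # variance brought back near 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

print("raw softmax   :", softmax(raw))     # nearly one-hot: tiny gradients
print("scaled softmax:", softmax(scaled))  # smoother distribution
```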
Implementing Self-Attention in Python
Let's implement self-attention from scratch using NumPy.

import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """
    X   : input embeddings (seq_len, d_model)
    W_q : query weight (d_model, d_k)
    W_k : key weight (d_model, d_k)
    W_v : value weight (d_model, d_v)
    """
    Q = X @ W_q                      # (seq_len, d_k)
    K = X @ W_k                      # (seq_len, d_k)
    V = X @ W_v                      # (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len)
    weights = softmax(scores)        # attention weights
    output = weights @ V             # (seq_len, d_v)
    return output, weights
# --- Demo ---
np.random.seed(42)
seq_len, d_model, d_k, d_v = 4, 8, 4, 4
X = np.random.randn(seq_len, d_model)
W_q = np.random.randn(d_model, d_k)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_v)
output, weights = self_attention(X, W_q, W_k, W_v)
print("Attention Weights:\n", np.round(weights, 3))
print("\nOutput Shape:", output.shape)
Output:
Attention Weights:
[[0.309 0.216 0.271 0.204]
[0.198 0.341 0.189 0.272]
[0.251 0.204 0.338 0.207]
[0.214 0.268 0.223 0.295]]
Output Shape: (4, 4)

Each row in the attention weights matrix shows how much one token attends to every other token. The rows sum to 1.0.
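The row-sum property is easy to verify: each row of the weights matrix is the output of a softmax, so it sums to 1 for any input scores (toy random scores below, not the demo's values):

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

scores = np.random.randn(4, 4)   # any scores work
weights = softmax(scores)
print(weights.sum(axis=-1))      # each row sums to 1 (up to float rounding)
```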
Self-Attention with PyTorch
PyTorch provides a built-in multi-head attention module. Here is a minimal example using nn.MultiheadAttention.
import torch
import torch.nn as nn
# Parameters
seq_len = 4
d_model = 8
n_heads = 2
# Create the attention layer
attention = nn.MultiheadAttention(embed_dim=d_model,
                                  num_heads=n_heads,
                                  batch_first=True)
# Random input: (batch=1, seq_len, d_model)
X = torch.randn(1, seq_len, d_model)
# Self-attention: Q = K = V = X
output, weights = attention(X, X, X)
print("Output shape :", output.shape)
print("Weights shape:", weights.shape)
print("\nAttention Weights:\n", weights.detach().numpy().round(3))
Output:
Output shape : torch.Size([1, 4, 8])
Weights shape: torch.Size([1, 4, 4])
Attention Weights:
[[[0.261 0.245 0.252 0.242]
[0.259 0.248 0.249 0.244]
[0.252 0.249 0.253 0.246]
[0.254 0.247 0.252 0.247]]]
The output shape (1, 4, 8) means 1 batch, 4 tokens, and 8-dimensional representations — the same shape as the input, but now enriched with contextual information from the whole sequence.
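Note that the weights returned above are averaged across the two heads. If you want each head's weights separately, recent PyTorch versions (1.11+) accept an average_attn_weights flag; this is a small sketch alongside the demo, not part of the original example:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d_model, n_heads = 4, 8, 2
attention = nn.MultiheadAttention(embed_dim=d_model,
                                  num_heads=n_heads,
                                  batch_first=True)
X = torch.randn(1, seq_len, d_model)

# average_attn_weights=False returns one weight matrix per head,
# shaped (batch, n_heads, seq_len, seq_len) instead of (batch, seq_len, seq_len).
_, per_head = attention(X, X, X, average_attn_weights=False)
print(per_head.shape)
```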
Conclusion
In this post, we briefly learned what the self-attention mechanism is, how it works through the Q, K, V formulation, and how to implement it in both NumPy and PyTorch. Self-attention is the key innovation behind Transformer-based LLMs — it lets every token in a sequence directly attend to every other token in a single step. In the next post, we will explore multi-head attention and see how stacking multiple attention heads helps the model capture richer relationships.
Source Code Listing
import numpy as np
import torch
import torch.nn as nn
# ----- NumPy self-attention -----
def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores)
    output = weights @ V
    return output, weights
np.random.seed(42)
seq_len, d_model, d_k, d_v = 4, 8, 4, 4
X = np.random.randn(seq_len, d_model)
W_q = np.random.randn(d_model, d_k)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_v)
output, weights = self_attention(X, W_q, W_k, W_v)
print("Attention Weights:\n", np.round(weights, 3))
print("Output Shape:", output.shape)
# ----- PyTorch multi-head attention -----
seq_len, d_model, n_heads = 4, 8, 2
attention = nn.MultiheadAttention(embed_dim=d_model,
                                  num_heads=n_heads,
                                  batch_first=True)
X = torch.randn(1, seq_len, d_model)
output, weights = attention(X, X, X)
print("Output shape :", output.shape)
print("Weights shape:", weights.shape)