In this post, we'll briefly learn what LLM embeddings are, how they work, and how to generate and use them in Python. The tutorial covers:
- What are Embeddings?
- How LLMs Generate Embeddings
- Types of Embeddings
- Generating Embeddings with Sentence Transformers
- Generating Embeddings with OpenAI API
- Measuring Semantic Similarity
- Visualizing Embeddings with t-SNE
- Conclusion
- Source Code Listing
Let's get started.
What are Embeddings?
Embeddings are dense numerical vectors that represent the meaning of text. Instead of working with raw strings, LLMs convert words, sentences, or documents into fixed-size arrays of floating-point numbers that capture semantic relationships. The key idea is that similar meanings produce similar vectors. For example, the embeddings for "king" and "queen" will be closer together in vector space than the embeddings for "king" and "bicycle". This geometric property makes embeddings extremely useful for search, clustering, classification, and retrieval-augmented generation (RAG).
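To make the geometric intuition concrete, here is a toy sketch using made-up 3-dimensional vectors. The numbers are purely illustrative stand-ins (real embeddings have hundreds or thousands of dimensions), but the cosine similarity computation is the same one used on real embeddings:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-D vectors standing in for real embeddings (illustrative only)
king    = np.array([0.90, 0.80, 0.10])
queen   = np.array([0.85, 0.75, 0.20])
bicycle = np.array([0.10, 0.20, 0.90])

print(cosine(king, queen))    # high, close to 1
print(cosine(king, bicycle))  # much lower
```

Vectors pointing in nearly the same direction score close to 1; unrelated directions score near 0.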
| Use Case | What Embeddings Enable |
|---|---|
| Semantic search | Find documents by meaning, not just keyword match |
| RAG systems | Retrieve relevant chunks to inject into LLM context |
| Text clustering | Group similar documents without labels |
| Classification | Use embeddings as features for a classifier |
| Duplicate detection | Find near-identical texts even when worded differently |
How LLMs Generate Embeddings
When text is fed into an LLM, every token is first converted into a vector by an embedding layer: a learned lookup table that maps each token ID to a high-dimensional vector (e.g., 768 or 4096 dimensions). As the text passes through the Transformer layers, these vectors are updated by the self-attention mechanism to incorporate context from the entire sequence. The final embedding for a sentence is typically produced in one of two ways:
- [CLS] token pooling: BERT-style models prepend a special [CLS] token and use its final hidden state as the sentence embedding.
- Mean pooling: average the final hidden states of all tokens. Used by most sentence embedding models.
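As a rough sketch of mean pooling, the snippet below averages hypothetical final hidden states over the non-padding tokens, using an attention mask the way most sentence embedding models do. The hidden states here are random stand-ins, not real Transformer outputs:

```python
import numpy as np

# Hypothetical final hidden states for a 4-token sequence, hidden size 6
# (random stand-ins for what a Transformer layer would actually produce)
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(4, 6))

# Attention mask: 1 = real token, 0 = padding
attention_mask = np.array([1, 1, 1, 0])

# Mean pooling: sum hidden states of real tokens, divide by their count
mask = attention_mask[:, None].astype(float)   # shape (4, 1), broadcasts over hidden dim
sentence_embedding = (hidden_states * mask).sum(axis=0) / mask.sum()

print(sentence_embedding.shape)  # (6,)
```

Masking matters: without it, padding tokens would drag the average toward meaningless positions.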
Embedding models are separate from generative models: they are optimized specifically for producing high-quality representations, not for generating text.
Types of Embeddings
| Type | Description | Example Models |
|---|---|---|
| Word embeddings | One vector per word, context-free | Word2Vec, GloVe |
| Contextual embeddings | Token vectors depend on surrounding context | BERT, RoBERTa |
| Sentence embeddings | One vector per sentence or paragraph | all-MiniLM, text-embedding-3 |
| Document embeddings | One vector per long document | Longformer, BigBird |
Generating Embeddings with Sentence Transformers
The sentence-transformers library is the easiest way to
generate high-quality sentence embeddings locally. It wraps Hugging
Face models with a simple API.
Installation
% pip install sentence-transformers
Example — Generate sentence embeddings
```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a lightweight sentence embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A dog rested on the rug.",
    "LLMs are trained on large text datasets.",
]

# Generate embeddings
embeddings = model.encode(sentences)

print("Shape     :", embeddings.shape)
print("First vec :", np.round(embeddings[0][:6], 4))
```

Output:
```
Shape     : (3, 384)
First vec : [ 0.0231 -0.0412  0.0553  0.0187 -0.0329  0.0614]
```
Each sentence is represented as a 384-dimensional vector. The all-MiniLM-L6-v2 model is fast, small, and works well for most semantic similarity tasks.
Generating Embeddings with OpenAI API
OpenAI's text-embedding-3-small model produces 1536-dimensional embeddings and is one of the most widely used embedding APIs in production systems.
Installation
% pip install openai
Example — OpenAI Embeddings
```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from environment

texts = [
    "The cat sat on the mat.",
    "A dog rested on the rug.",
    "LLMs are trained on large text datasets.",
]

response = client.embeddings.create(
    input=texts,
    model="text-embedding-3-small"
)

# Extract vectors
vectors = np.array([d.embedding for d in response.data])

print("Shape     :", vectors.shape)
print("First vec :", np.round(vectors[0][:6], 4))
```

Output:
```
Shape     : (3, 1536)
First vec : [ 0.0142 -0.0381  0.0204  0.0519 -0.0173  0.0037]
```
The output shape (3, 1536) means 3 sentences, each
represented as a 1536-dimensional vector. OpenAI embeddings are already
L2-normalised, so cosine similarity can be computed with a simple dot
product.
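To illustrate why unit-length vectors make this shortcut work, the sketch below L2-normalizes synthetic stand-in vectors and shows that their pairwise dot products then behave like cosine similarities. These are random vectors, not real API output:

```python
import numpy as np

# Synthetic stand-ins for API output; real OpenAI vectors are already unit-length
rng = np.random.default_rng(42)
vectors = rng.normal(size=(3, 1536))

# L2-normalise each row to unit length
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
print("Norms:", np.linalg.norm(vectors, axis=1).round(3))  # all 1.0

# For unit vectors, cosine similarity reduces to a plain dot product
sim = vectors @ vectors.T
print("Self-similarity (diagonal):", np.diag(sim).round(3))  # all 1.0
```

Skipping the division by vector norms saves a little work per comparison, which adds up when ranking thousands of documents.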
Measuring Semantic Similarity
The most common way to compare two embeddings is cosine similarity: it measures the angle between two vectors, returning a value between -1 and 1. A score close to 1 means the sentences are semantically similar.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",                   # sentence A
    "A dog rested on the rug.",                  # sentence B (similar to A)
    "LLMs are trained on large text datasets.",  # sentence C (unrelated)
]

embeddings = model.encode(sentences)

# Compute cosine similarity matrix
sim_matrix = cosine_similarity(embeddings)

print("Similarity Matrix:\n", sim_matrix.round(3))
print("\nA vs B (similar)   :", round(sim_matrix[0, 1], 3))
print("A vs C (unrelated) :", round(sim_matrix[0, 2], 3))
```

Output:
```
Similarity Matrix:
 [[1.    0.734 0.051]
 [0.734 1.    0.083]
 [0.051 0.083 1.   ]]

A vs B (similar)   : 0.734
A vs C (unrelated) : 0.051
```
The results are exactly what we expect — sentences A and B (both about a pet resting on a surface) score 0.734, while A and C (completely different topics) score only 0.051. The diagonal is always 1.0 since each sentence is perfectly similar to itself.
Visualizing Embeddings with t-SNE
Since embeddings are high-dimensional, we use t-SNE to reduce them to 2D for visualization. Points that appear close together in the plot are semantically similar in the original embedding space.

```python
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    # Animals
    "The cat sat on the mat.",
    "A dog rested on the rug.",
    "The parrot repeated the word.",
    # AI / ML
    "LLMs are trained on large text datasets.",
    "Transformers use self-attention mechanisms.",
    "Embeddings capture semantic meaning as vectors.",
    # Food
    "Pizza is topped with cheese and tomato sauce.",
    "Sushi is a traditional Japanese dish.",
    "Pasta is a staple of Italian cuisine.",
]

labels = ["Animals"] * 3 + ["AI/ML"] * 3 + ["Food"] * 3
colors = {"Animals": "#58A6FF", "AI/ML": "#BC8CFF", "Food": "#3FB950"}

embeddings = model.encode(sentences)

# Reduce to 2D (perplexity must be smaller than the number of samples)
tsne = TSNE(n_components=2, random_state=42, perplexity=3)
coords = tsne.fit_transform(embeddings)

# Plot
fig, ax = plt.subplots(figsize=(7, 5))
for i, (x, y) in enumerate(coords):
    ax.scatter(x, y, color=colors[labels[i]], s=120)
    ax.annotate(labels[i], (x, y), fontsize=9,
                xytext=(5, 5), textcoords="offset points")
ax.set_title("Sentence Embeddings – t-SNE Visualization")
plt.tight_layout()
plt.savefig("embeddings_tsne.png", dpi=150)
plt.show()
```
The resulting plot will show three visible clusters — Animals, AI/ML, and Food — grouped together in 2D space, confirming that the embeddings successfully capture topic-level similarity even without any labels.
Conclusion
In this post, we briefly learned what LLM embeddings are, how they are generated through the Transformer's embedding layer and pooling, and how to use them in Python with both sentence-transformers
and the OpenAI API. We also measured semantic similarity using cosine
similarity and visualized the embedding space with t-SNE. Embeddings are
the foundation of modern semantic search and RAG pipelines —
understanding them is essential for building real-world LLM
applications. In the next post, we will build a simple semantic search engine using embeddings and a FAISS vector store.
Source Code Listing
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# ----- Basic embedding -----
sentences = [
    "The cat sat on the mat.",
    "A dog rested on the rug.",
    "LLMs are trained on large text datasets.",
]

embeddings = model.encode(sentences)
print("Shape     :", embeddings.shape)
print("First vec :", np.round(embeddings[0][:6], 4))

# ----- Cosine similarity -----
sim_matrix = cosine_similarity(embeddings)
print("Similarity Matrix:\n", sim_matrix.round(3))
print("A vs B (similar)   :", round(sim_matrix[0, 1], 3))
print("A vs C (unrelated) :", round(sim_matrix[0, 2], 3))

# ----- t-SNE visualization -----
all_sentences = [
    "The cat sat on the mat.",
    "A dog rested on the rug.",
    "The parrot repeated the word.",
    "LLMs are trained on large text datasets.",
    "Transformers use self-attention mechanisms.",
    "Embeddings capture semantic meaning as vectors.",
    "Pizza is topped with cheese and tomato sauce.",
    "Sushi is a traditional Japanese dish.",
    "Pasta is a staple of Italian cuisine.",
]

labels = ["Animals"] * 3 + ["AI/ML"] * 3 + ["Food"] * 3
colors = {"Animals": "#58A6FF", "AI/ML": "#BC8CFF", "Food": "#3FB950"}

all_emb = model.encode(all_sentences)
coords = TSNE(n_components=2, random_state=42,
              perplexity=3).fit_transform(all_emb)

fig, ax = plt.subplots(figsize=(7, 5))
for i, (x, y) in enumerate(coords):
    ax.scatter(x, y, color=colors[labels[i]], s=120)
    ax.annotate(labels[i], (x, y), fontsize=9,
                xytext=(5, 5), textcoords="offset points")
ax.set_title("Sentence Embeddings – t-SNE Visualization")
plt.tight_layout()
plt.savefig("embeddings_tsne.png", dpi=150)
plt.show()
```