In this post, we'll briefly learn what Semantic Text Similarity is, how LLM Embeddings enable it, and how to measure the semantic closeness between sentences in Python. The tutorial covers:
- What is Semantic Text Similarity?
- What are LLM Embeddings?
- Installation
- Loading an Embedding Model
- Cosine Similarity Between Two Sentences
- Ranking Sentences by Similarity
- Batch Similarity with a Query
- Similarity Heatmap for a Sentence Set
- Conclusion
Let's get started.
What is Semantic Text Similarity?
Semantic Text Similarity (STS) is the task of determining how similar two pieces of text are in meaning, regardless of the exact words used. Unlike simple keyword matching, STS understands that "The dog chased the ball" and "A puppy ran after the sphere" convey nearly the same idea. It is a foundational capability in many NLP applications such as duplicate question detection, information retrieval, paraphrase identification, and recommendation systems.
Traditional approaches relied on word overlap metrics like Jaccard similarity or TF-IDF cosine distance, which fail to capture meaning. Modern approaches instead map each sentence into a dense numerical vector — an embedding — using a large language model, and then measure the geometric distance between those vectors.
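To make that gap concrete, here is a minimal sketch of word-level Jaccard similarity on the example pair above; only the word "the" overlaps, so the overlap score is tiny even though the meanings nearly match.
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity: |intersection| / |union|."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

# Only "the" is shared, so the score is ~0.11 despite near-identical meaning
print(jaccard("The dog chased the ball", "A puppy ran after the sphere"))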
What are LLM Embeddings?
An embedding is a fixed-length numerical vector that encodes the semantic content of a piece of text. Two texts with similar meaning produce vectors that point in nearly the same direction in the embedding space, while unrelated texts produce vectors that are far apart. LLMs trained on large corpora learn rich representations that capture nuance, context, and domain knowledge.
The most common metric for comparing embedding vectors is cosine similarity, which measures the angle between two vectors and returns a value between -1 and 1. A score close to 1 means the texts are semantically very similar; a score near 0 means they are unrelated; a score near -1 means they are semantically opposite. In practice, sentence embedding models tend to produce scores in the 0–1 range for natural language text.
| Cosine Score Range | Interpretation |
|---|---|
| 0.90 – 1.00 | Near-duplicate or paraphrase |
| 0.70 – 0.89 | Highly similar in topic and meaning |
| 0.50 – 0.69 | Somewhat related |
| 0.30 – 0.49 | Weakly related or loosely connected |
| 0.00 – 0.29 | Unrelated or different topics |
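For intuition, here is a minimal from-scratch sketch of the cosine formula using NumPy; the rest of this post uses the equivalent util.cos_sim helper from sentence-transformers instead.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a . b) / (||a|| * ||b||)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors; real sentence embeddings have hundreds of dimensions
a = np.array([0.2, 0.8, 0.1])
b = np.array([0.25, 0.75, 0.05])
print(f"{cosine_similarity(a, b):.4f}")  # close to 1: the vectors point the same way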
Installation
We will use the sentence-transformers library, which wraps Hugging Face Transformer models and provides convenient utilities for generating sentence embeddings and computing similarity scores. Install it along with numpy and matplotlib for numerical operations and visualisation.
pip install sentence-transformers numpy matplotlib
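A quick way to confirm the installation succeeded is to import the package and print its version (the exact number will vary with your environment):
import sentence_transformers
print(sentence_transformers.__version__)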
Loading an Embedding Model
We load the all-MiniLM-L6-v2 model, a compact but high-quality sentence embedding model available on the Hugging Face Model Hub. It maps sentences and short paragraphs into a 384-dimensional vector (inputs beyond the model's maximum sequence length are truncated) and runs efficiently on CPU. The SentenceTransformer class downloads and caches the model automatically on first use.
from sentence_transformers import SentenceTransformer
# Load a lightweight, high-quality sentence embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")
print("Model loaded successfully.")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")
Output:
Model loaded successfully.
Embedding dimension: 384
Each sentence will be encoded into a vector of 384 floating-point numbers. You can swap in any other model from the SBERT model list simply by changing the model name string — larger models produce higher-dimensional embeddings and often better accuracy at the cost of speed.
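For instance, swapping in the larger all-mpnet-base-v2 checkpoint from the same model list requires only the name change; it yields 768-dimensional embeddings and is slower but generally more accurate.
from sentence_transformers import SentenceTransformer

# Same API, different checkpoint; downloaded and cached on first use like MiniLM
larger_model = SentenceTransformer("all-mpnet-base-v2")
print(larger_model.get_sentence_embedding_dimension())  # 768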
Cosine Similarity Between Two Sentences
The most basic use case is comparing two individual sentences. We encode both sentences into embeddings and then compute their cosine similarity using the utility function provided by sentence_transformers.
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
# Define sentence pairs to compare
pairs = [
    ("A dog is playing in the park.", "A puppy is running outside."),
    ("The stock market crashed today.", "Investors lost billions overnight."),
    ("She loves reading mystery novels.", "The weather forecast shows heavy rain."),
    ("Python is a popular programming language.", "Many developers use Python for data science."),
]
print(f"{'Sentence A':<45} {'Sentence B':<45} {'Score':>6}")
print("-" * 100)
for sent_a, sent_b in pairs:
    emb_a = model.encode(sent_a, convert_to_tensor=True)
    emb_b = model.encode(sent_b, convert_to_tensor=True)
    score = util.cos_sim(emb_a, emb_b).item()
    print(f"{sent_a:<45} {sent_b:<45} {score:>6.4f}")
Output:
Sentence A Sentence B Score
----------------------------------------------------------------------------------------------------
A dog is playing in the park. A puppy is running outside. 0.4189
The stock market crashed today. Investors lost billions overnight. 0.5578
She loves reading mystery novels. The weather forecast shows heavy rain. 0.0182
Python is a popular programming language. Many developers use Python for data science. 0.7595
Notice that the pair about the novel and the weather scores near zero, meaning the model correctly identifies that they share no semantic content, while the dog/puppy pair still scores around 0.42 despite sharing almost no words, and the two Python sentences score above 0.75.
Ranking Sentences by Similarity
A common practical task is ranking a list of candidate sentences by their similarity to a fixed reference sentence. We encode all candidates at once using batch encoding for efficiency, then sort by descending cosine score.
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
reference = "How do I train a neural network?"
candidates = [
"Steps to build a deep learning model.",
"What is the capital of France?",
"Gradient descent optimises model weights during training.",
"I enjoy hiking on weekends.",
"Backpropagation computes gradients in neural networks.",
"Which framework is best for machine learning?",
"The recipe requires two cups of flour.",
]
ref_emb = model.encode(reference, convert_to_tensor=True)
cand_emb = model.encode(candidates, convert_to_tensor=True)
# cos_sim returns a 1 x N matrix; take row 0 to get one score per candidate
scores = util.cos_sim(ref_emb, cand_emb)[0].cpu().numpy()
ranked = sorted(zip(scores, candidates), reverse=True)
print(f"Reference: \"{reference}\"\n")
print(f"{'Rank':<6} {'Score':<8} Candidate Sentence")
print("-" * 72)
for rank, (score, sentence) in enumerate(ranked, start=1):
    print(f"{rank:<6} {score:<8.4f} {sentence}")
Output:
Reference: "How do I train a neural network?"
Rank Score Candidate Sentence
------------------------------------------------------------------------
1 0.5657 Steps to build a deep learning model.
2 0.4655 Backpropagation computes gradients in neural networks.
3 0.3500 Gradient descent optimises model weights during training.
4 0.2918 Which framework is best for machine learning?
5 0.0509 The recipe requires two cups of flour.
6 0.0474 What is the capital of France?
7 -0.1051 I enjoy hiking on weekends.
The model surfaces machine-learning-related sentences at the top even though none of them contains the phrase "train a neural network" verbatim. The bottom-ranked sentences are semantically unrelated and score below 0.15.
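As an optional refinement, the hedged sketch below encodes with normalize_embeddings=True so that every vector has unit length; cosine similarity then reduces to a plain dot product, and util.dot_score produces the same ranking with slightly less work.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
# Unit-length vectors: dot product and cosine similarity coincide
ref_emb = model.encode("How do I train a neural network?",
                       convert_to_tensor=True, normalize_embeddings=True)
cand_emb = model.encode(["Steps to build a deep learning model.",
                         "I enjoy hiking on weekends."],
                        convert_to_tensor=True, normalize_embeddings=True)
print(util.dot_score(ref_emb, cand_emb))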
Batch Similarity with a Query
The semantic_search helper function in sentence_transformers.util performs an efficient nearest-neighbour search and returns the top-k most similar documents for one or more query embeddings in a single call. This mirrors the core operation of a semantic search engine or a retrieval-augmented generation (RAG) system.
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
# A small document corpus
corpus = [
    "Machine learning models learn patterns from data.",
    "Deep neural networks have many hidden layers.",
    "Paris is the capital and largest city of France.",
    "The Eiffel Tower was built in 1889.",
    "Convolutional networks excel at image recognition tasks.",
    "Transfer learning reuses knowledge from pre-trained models.",
    "French cuisine is famous around the world.",
    "Recurrent networks are designed for sequential data.",
]
query = "What neural network architecture works well for images?"
# Encode corpus and query
corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
# Retrieve top-3 results
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
print(f"Query: \"{query}\"\n")
print(f"Top-3 Results:")
print("-" * 60)
for hit in hits:
    print(f"Score: {hit['score']:.4f} | {corpus[hit['corpus_id']]}")
Output:
Query: "What neural network architecture works well for images?"
Top-3 Results:
------------------------------------------------------------
Score: 0.4764 | Convolutional networks excel at image recognition tasks.
Score: 0.4165 | Deep neural networks have many hidden layers.
Score: 0.3095 | Recurrent networks are designed for sequential data.
The function returns a list of dictionaries, each with a corpus_id (the index of the matched document) and a score. In a production setting, the corpus embeddings would be pre-computed and stored in a vector database such as FAISS, ChromaDB, or Pinecone for millisecond-latency retrieval over millions of documents.
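As a rough sketch of that pre-computation pattern (assuming pip install faiss-cpu; the two-document corpus here is purely illustrative), normalized embeddings let an exact inner-product FAISS index reproduce cosine-similarity search:
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "Machine learning models learn patterns from data.",
    "Convolutional networks excel at image recognition tasks.",
]
# Pre-compute once; with unit-length vectors, inner product == cosine similarity
corpus_emb = model.encode(corpus, normalize_embeddings=True).astype(np.float32)
index = faiss.IndexFlatIP(corpus_emb.shape[1])  # exact inner-product index
index.add(corpus_emb)

query_emb = model.encode(["What works well for images?"],
                         normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(query_emb, 1)  # top-1 result
print(f"Score: {scores[0][0]:.4f} | {corpus[ids[0][0]]}")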
Similarity Heatmap for a Sentence Set
Visualising the pairwise similarity matrix as a heatmap is a quick way to understand the semantic structure of a sentence collection at a glance. We compute the full similarity matrix with util.cos_sim and render it using matplotlib.
from sentence_transformers import SentenceTransformer, util
import matplotlib.pyplot as plt
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "I love machine learning.",
    "Deep learning is a branch of AI.",
    "Neural networks mimic the human brain.",
    "I enjoy cooking Italian food.",
    "Pasta and pizza are Italian dishes.",
    "Cats and dogs are popular pets.",
]
# Compute all embeddings and full pairwise similarity matrix
embeddings = model.encode(sentences, convert_to_tensor=True)
sim_matrix = util.cos_sim(embeddings, embeddings).detach().cpu().numpy()
# Short labels for the axes
labels = [s[:28] + "…" if len(s) > 28 else s for s in sentences]
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(sim_matrix, cmap="YlOrRd", vmin=0, vmax=1)
ax.set_xticks(range(len(labels)))
ax.set_yticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=45, ha="right", fontsize=9)
ax.set_yticklabels(labels, fontsize=9)
# Annotate each cell with its score
for i in range(len(sentences)):
    for j in range(len(sentences)):
        ax.text(j, i, f"{sim_matrix[i, j]:.2f}",
                ha="center", va="center", fontsize=8,
                color="black" if sim_matrix[i, j] < 0.7 else "white")
plt.colorbar(im, ax=ax, label="Cosine Similarity")
ax.set_title("Pairwise Semantic Similarity Heatmap", fontsize=13, pad=14)
plt.tight_layout()
plt.savefig("similarity_heatmap.png", dpi=150)
plt.show()
Output:
The heatmap reveals natural clusters: the three machine-learning sentences form a high-similarity block in the top-left corner, the two food-related sentences group together in the middle, and the pet sentence stands apart. Diagonal entries are always 1.00 since every sentence is identical to itself.
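To recover those clusters programmatically rather than by eye, one option is the library's util.community_detection helper, sketched below; the threshold of 0.5 and minimum cluster size of 2 are illustrative choices for this small sentence set.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "I love machine learning.",
    "Deep learning is a branch of AI.",
    "Neural networks mimic the human brain.",
    "I enjoy cooking Italian food.",
    "Pasta and pizza are Italian dishes.",
    "Cats and dogs are popular pets.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Group sentence indices whose pairwise similarity exceeds the threshold
clusters = util.community_detection(embeddings, threshold=0.5, min_community_size=2)
for i, cluster in enumerate(clusters, start=1):
    print(f"Cluster {i}: {[sentences[idx] for idx in cluster]}")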
Conclusion
In this post, we briefly learned what Semantic Text Similarity is and how LLM embeddings make it possible to measure meaning rather than just word overlap. We used the sentence-transformers library to load the all-MiniLM-L6-v2 model, compute cosine similarity between sentence pairs, rank candidates against a reference query, perform batch semantic search, and visualise a pairwise similarity heatmap. These building blocks are the foundation of modern semantic search and retrieval-augmented generation (RAG) pipelines.
