In this post, we'll briefly learn what Semantic Text Similarity is, how LLM Embeddings enable it, and how to measure the semantic closeness between sentences in Python. The tutorial covers:
- What is Semantic Text Similarity?
- What are LLM Embeddings?
- Installation
- Loading an Embedding Model
- Cosine Similarity Between Two Sentences
- Ranking Sentences by Similarity
- Batch Similarity with a Query
- Similarity Heatmap for a Sentence Set
- Conclusion
Let's get started.
What is Semantic Text Similarity?
Semantic Text Similarity (STS) is the task of determining how similar two pieces of text are in meaning, regardless of the exact words used. Unlike simple keyword matching, STS understands that "The dog chased the ball" and "A puppy ran after the sphere" convey nearly the same idea. It is a foundational capability in many NLP applications such as duplicate question detection, information retrieval, paraphrase identification, and recommendation systems.
Traditional approaches relied on word overlap metrics like Jaccard similarity or TF-IDF cosine distance, which fail to capture meaning. Modern approaches instead map each sentence into a dense numerical vector — an embedding — using a large language model, and then measure the geometric distance between those vectors.
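To make that gap concrete, here is a minimal sketch of word-level Jaccard similarity on the example pair above; only the word "the" overlaps, so the overlap score is tiny even though the meanings nearly match.
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity: |intersection| / |union|."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

# Only "the" is shared, so the score is ~0.11 despite near-identical meaning
print(jaccard("The dog chased the ball", "A puppy ran after the sphere"))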
What are LLM Embeddings?
An embedding is a fixed-length numerical vector that encodes the semantic content of a piece of text. Two texts with similar meaning produce vectors that point in nearly the same direction in the embedding space, while unrelated texts produce vectors that are far apart. LLMs trained on large corpora learn rich representations that capture nuance, context, and domain knowledge.
The most common metric for comparing embedding vectors is cosine similarity, which measures the angle between two vectors and returns a value between -1 and 1. A score close to 1 means the texts are semantically very similar; a score near 0 means they are unrelated; a score near -1 means they are semantically opposite. In practice, sentence embedding models tend to produce scores in the 0–1 range for natural language text.
| Cosine Score Range | Interpretation |
|---|---|
| 0.90 – 1.00 | Near-duplicate or paraphrase |
| 0.70 – 0.89 | Highly similar in topic and meaning |
| 0.50 – 0.69 | Somewhat related |
| 0.30 – 0.49 | Weakly related or loosely connected |
| 0.00 – 0.29 | Unrelated or different topics |
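For intuition, here is a minimal from-scratch sketch of the cosine formula using NumPy; the rest of this post uses the equivalent util.cos_sim helper from sentence-transformers instead.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a . b) / (||a|| * ||b||)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors; real sentence embeddings have hundreds of dimensions
a = np.array([0.2, 0.8, 0.1])
b = np.array([0.25, 0.75, 0.05])
print(f"{cosine_similarity(a, b):.4f}")  # close to 1: the vectors point the same way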
Installation
We will use the sentence-transformers library, which wraps Hugging Face Transformer models and provides convenient utilities for generating sentence embeddings and computing similarity scores. Install it along with numpy and matplotlib for numerical operations and visualisation.
pip install sentence-transformers numpy matplotlib
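A quick way to confirm the installation succeeded is to import the package and print its version (the exact number will vary with your environment):
import sentence_transformers
print(sentence_transformers.__version__)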
Loading an Embedding Model
We load the all-MiniLM-L6-v2 model, a compact but high-quality sentence embedding model available on the Hugging Face Model Hub. It maps sentences and short paragraphs into a 384-dimensional vector (inputs beyond the model's maximum sequence length are truncated) and runs efficiently on CPU. The SentenceTransformer class downloads and caches the model automatically on first use.
from sentence_transformers import SentenceTransformer
# Load a lightweight, high-quality sentence embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")
print("Model loaded successfully.")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")
Output:
Model loaded successfully.
Embedding dimension: 384
Each sentence will be encoded into a vector of 384 floating-point numbers. You can swap in any other model from the SBERT model list simply by changing the model name string — larger models produce higher-dimensional embeddings and often better accuracy at the cost of speed.
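For instance, swapping in the larger all-mpnet-base-v2 checkpoint from the same model list requires only the name change; it yields 768-dimensional embeddings and is slower but generally more accurate.
from sentence_transformers import SentenceTransformer

# Same API, different checkpoint; downloaded and cached on first use like MiniLM
larger_model = SentenceTransformer("all-mpnet-base-v2")
print(larger_model.get_sentence_embedding_dimension())  # 768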
Cosine Similarity Between Two Sentences
The most basic use case is comparing two individual sentences. We encode both sentences into embeddings and then compute their cosine similarity using the utility function provided by sentence_transformers.
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
# Define sentence pairs to compare
pairs = [
    ("A dog is playing in the park.", "A puppy is running outside."),
    ("The stock market crashed today.", "Investors lost billions overnight."),
    ("She loves reading mystery novels.", "The weather forecast shows heavy rain."),
    ("Python is a popular programming language.", "Many developers use Python for data science."),
]
print(f"{'Sentence A':<45} {'Sentence B':<45} {'Score':>6}")
print("-" * 100)
for sent_a, sent_b in pairs:
    emb_a = model.encode(sent_a, convert_to_tensor=True)
    emb_b = model.encode(sent_b, convert_to_tensor=True)
    score = util.cos_sim(emb_a, emb_b).item()
    print(f"{sent_a:<45} {sent_b:<45} {score:>6.4f}")
Output:
Sentence A Sentence B Score
----------------------------------------------------------------------------------------------------
A dog is playing in the park. A puppy is running outside. 0.4189
The stock market crashed today. Investors lost billions overnight. 0.5578
She loves reading mystery novels. The weather forecast shows heavy rain. 0.0182
Python is a popular programming language. Many developers use Python for data science. 0.7595
Notice that the pair about the novel and the weather scores near zero, meaning the model correctly identifies that they share no semantic content, while the dog/puppy pair still scores around 0.42 despite sharing almost no words, and the two Python sentences score above 0.75.
Ranking Sentences by Similarity
A common practical task is ranking a list of candidate sentences by their similarity to a fixed reference sentence. We encode all candidates at once using batch encoding for efficiency, then sort by descending cosine score.
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
reference = "How do I train a neural network?"
candidates = [
"Steps to build a deep learning model.",
"What is the capital of France?",
"Gradient descent optimises model weights during training.",
"I enjoy hiking on weekends.",
"Backpropagation computes gradients in neural networks.",
"Which framework is best for machine learning?",
"The recipe requires two cups of flour.",
]
ref_emb = model.encode(reference, convert_to_tensor=True)
cand_emb = model.encode(candidates, convert_to_tensor=True)
# cos_sim returns a 1 x N matrix; take row 0 to get one score per candidate
scores = util.cos_sim(ref_emb, cand_emb)[0].cpu().numpy()
ranked = sorted(zip(scores, candidates), reverse=True)
print(f"Reference: \"{reference}\"\n")
print(f"{'Rank':<6} {'Score':<8} Candidate Sentence")
print("-" * 72)
for rank, (score, sentence) in enumerate(ranked, start=1):
    print(f"{rank:<6} {score:<8.4f} {sentence}")
Output:
Reference: "How do I train a neural network?"
Rank Score Candidate Sentence
------------------------------------------------------------------------
1 0.5657 Steps to build a deep learning model.
2 0.4655 Backpropagation computes gradients in neural networks.
3 0.3500 Gradient descent optimises model weights during training.
4 0.2918 Which framework is best for machine learning?
5 0.0509 The recipe requires two cups of flour.
6 0.0474 What is the capital of France?
7 -0.1051 I enjoy hiking on weekends.
The model surfaces machine-learning-related sentences at the top even though none of them contains the phrase "train a neural network" verbatim. The bottom-ranked sentences are semantically unrelated and score below 0.15.
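As an optional refinement, the hedged sketch below encodes with normalize_embeddings=True so that every vector has unit length; cosine similarity then reduces to a plain dot product, and util.dot_score produces the same ranking with slightly less work.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
# Unit-length vectors: dot product and cosine similarity coincide
ref_emb = model.encode("How do I train a neural network?",
                       convert_to_tensor=True, normalize_embeddings=True)
cand_emb = model.encode(["Steps to build a deep learning model.",
                         "I enjoy hiking on weekends."],
                        convert_to_tensor=True, normalize_embeddings=True)
print(util.dot_score(ref_emb, cand_emb))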
Batch Similarity with a Query
The semantic_search helper function in sentence_transformers.util performs an efficient nearest-neighbour search and returns the top-k most similar documents for one or more query embeddings in a single call. This mirrors the core operation of a semantic search engine or a retrieval-augmented generation (RAG) system.
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
# A small document corpus
corpus = [
    "Machine learning models learn patterns from data.",
    "Deep neural networks have many hidden layers.",
    "Paris is the capital and largest city of France.",
    "The Eiffel Tower was built in 1889.",
    "Convolutional networks excel at image recognition tasks.",
    "Transfer learning reuses knowledge from pre-trained models.",
    "French cuisine is famous around the world.",
    "Recurrent networks are designed for sequential data.",
]
query = "What neural network architecture works well for images?"
# Encode corpus and query
corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
# Retrieve top-3 results
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
print(f"Query: \"{query}\"\n")
print(f"Top-3 Results:")
print("-" * 60)
for hit in hits:
    print(f"Score: {hit['score']:.4f} | {corpus[hit['corpus_id']]}")
Output:
Query: "What neural network architecture works well for images?"
Top-3 Results:
------------------------------------------------------------
Score: 0.4764 | Convolutional networks excel at image recognition tasks.
Score: 0.4165 | Deep neural networks have many hidden layers.
Score: 0.3095 | Recurrent networks are designed for sequential data.
The function returns a list of dictionaries, each with a corpus_id (the index of the matched document) and a score. In a production setting, the corpus embeddings would be pre-computed and stored in a vector database such as FAISS, ChromaDB, or Pinecone for millisecond-latency retrieval over millions of documents.
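As a rough sketch of that pre-computation pattern (assuming pip install faiss-cpu; the two-document corpus here is purely illustrative), normalized embeddings let an exact inner-product FAISS index reproduce cosine-similarity search:
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "Machine learning models learn patterns from data.",
    "Convolutional networks excel at image recognition tasks.",
]
# Pre-compute once; with unit-length vectors, inner product == cosine similarity
corpus_emb = model.encode(corpus, normalize_embeddings=True).astype(np.float32)
index = faiss.IndexFlatIP(corpus_emb.shape[1])  # exact inner-product index
index.add(corpus_emb)

query_emb = model.encode(["What works well for images?"],
                         normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(query_emb, 1)  # top-1 result
print(f"Score: {scores[0][0]:.4f} | {corpus[ids[0][0]]}")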
Similarity Heatmap for a Sentence Set
Visualising the pairwise similarity matrix as a heatmap is a quick way to understand the semantic structure of a sentence collection at a glance. We compute the full similarity matrix with util.cos_sim and render it using matplotlib.
from sentence_transformers import SentenceTransformer, util
import matplotlib.pyplot as plt
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "I love machine learning.",
    "Deep learning is a branch of AI.",
    "Neural networks mimic the human brain.",
    "I enjoy cooking Italian food.",
    "Pasta and pizza are Italian dishes.",
    "Cats and dogs are popular pets.",
]
# Compute all embeddings and full pairwise similarity matrix
embeddings = model.encode(sentences, convert_to_tensor=True)
sim_matrix = util.cos_sim(embeddings, embeddings).detach().cpu().numpy()
# Short labels for the axes
labels = [s[:28] + "…" if len(s) > 28 else s for s in sentences]
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(sim_matrix, cmap="YlOrRd", vmin=0, vmax=1)
ax.set_xticks(range(len(labels)))
ax.set_yticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=45, ha="right", fontsize=9)
ax.set_yticklabels(labels, fontsize=9)
# Annotate each cell with its score
for i in range(len(sentences)):
    for j in range(len(sentences)):
        ax.text(j, i, f"{sim_matrix[i, j]:.2f}",
                ha="center", va="center", fontsize=8,
                color="black" if sim_matrix[i, j] < 0.7 else "white")
plt.colorbar(im, ax=ax, label="Cosine Similarity")
ax.set_title("Pairwise Semantic Similarity Heatmap", fontsize=13, pad=14)
plt.tight_layout()
plt.savefig("similarity_heatmap.png", dpi=150)
plt.show()
Output:
The heatmap reveals natural clusters: the three machine-learning sentences form a high-similarity block in the top-left corner, the two food-related sentences group together in the middle, and the pet sentence stands apart. Diagonal entries are always 1.00 since every sentence is identical to itself.
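To recover those clusters programmatically rather than by eye, one option is the library's util.community_detection helper, sketched below; the threshold of 0.5 and minimum cluster size of 2 are illustrative choices for this small sentence set.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "I love machine learning.",
    "Deep learning is a branch of AI.",
    "Neural networks mimic the human brain.",
    "I enjoy cooking Italian food.",
    "Pasta and pizza are Italian dishes.",
    "Cats and dogs are popular pets.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Group sentence indices whose pairwise similarity exceeds the threshold
clusters = util.community_detection(embeddings, threshold=0.5, min_community_size=2)
for i, cluster in enumerate(clusters, start=1):
    print(f"Cluster {i}: {[sentences[idx] for idx in cluster]}")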
Conclusion
In this post, we briefly learned what Semantic Text Similarity is and how LLM embeddings make it possible to measure meaning rather than just word overlap. We used the sentence-transformers library to load the all-MiniLM-L6-v2 model, compute cosine similarity between sentence pairs, rank candidates against a reference query, perform batch semantic search, and visualise a pairwise similarity heatmap. These building blocks are the foundation of modern semantic search and retrieval-augmented generation (RAG) pipelines.
