Word Embedding Example with GloVe in Python

    Word embeddings play an important role in representing words in a format that machines can comprehend. Among the various word embedding techniques, GloVe (Global Vectors for Word Representation) stands out as a powerful and widely used approach.

    In this blog post, we'll delve into the concept of word embeddings and their application with GloVe in Python. The tutorial covers:

  1. The concept of word embedding 
  2. Overview of GloVe
  3. Word embedding with GloVe
  4. t-SNE visualization of GloVe
  5. Conclusion

     Let's get started.

 

The concept of word embedding  

    Word embedding is a technique in NLP where words are represented as numerical vectors in a continuous space. In traditional NLP, words are often represented as discrete entities, devoid of any inherent relationship with one another. Word embeddings, however, transform words into continuous vector spaces, capturing semantic relationships and contextual meanings.
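
    To make the contrast concrete, the following is a minimal sketch (using hand-picked, illustrative vector values, not real embeddings) of how one-hot vectors hide word similarity while dense vectors expose it:

 
import numpy as np

# One-hot (discrete) representation: every pair of distinct words
# is equally unrelated
one_hot_king = np.array([1, 0, 0])
one_hot_queen = np.array([0, 1, 0])

# Dense embeddings (illustrative values only): related words get
# nearby vectors
embedding_king = np.array([0.80, 0.65, 0.10])
embedding_queen = np.array([0.78, 0.70, 0.12])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(one_hot_king, one_hot_queen))      # 0.0 - no similarity visible
print(cosine(embedding_king, embedding_queen))  # close to 1.0 - very similar
 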

 

Overview of GloVe  

    GloVe, developed by the Stanford NLP Group, is an unsupervised learning algorithm designed to obtain vector representations for words. Unlike some approaches that rely solely on local context (such as Word2Vec) or global context (like Latent Semantic Analysis), GloVe achieves a balance by incorporating both local and global co-occurrence information.

Key features of GloVe:

  • Global Context: Utilizes global statistics of the entire corpus to capture word relationships.
  • Efficiency: Trains faster and more efficiently than some other methods.
  • Captures Word Analogies: GloVe embeddings often perform well in tasks like word analogy completion, as illustrated in the sketch after this list.
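
    As a quick illustration of the analogy property, the sketch below uses gensim's most_similar with positive and negative word lists. It assumes the glove_model loaded later in this post; with the 6B vectors, the top result for king - man + woman is typically 'queen'.

 
# Assumes glove_model has been loaded as shown later in this post
# Classic analogy: king - man + woman ~= queen
result = glove_model.most_similar(positive=['king', 'woman'],
                                  negative=['man'], topn=1)
print(result)
 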

 

How GloVe works

    GloVe is based on the idea that the meaning of a word can be inferred from its co-occurrence probabilities with other words. The core concept involves constructing a word co-occurrence matrix, which is then factorized to obtain dense vector representations for each word.

Steps in GloVe embedding:

  1. Construct the Co-Occurrence Matrix: Count the number of times each word appears in the context of other words.
  2. Compute Word Probabilities: Normalize the co-occurrence counts to obtain probabilities.
  3. Define the Objective Function: Formulate an objective function that captures the relationship between word vectors.
  4. Optimization: Use optimization techniques (e.g., gradient descent) to minimize the objective function and obtain the optimal word vectors.
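
    As a rough sketch of steps 1 and 2, the toy example below builds a co-occurrence count table over a tiny corpus with a symmetric context window, then normalizes one row into probabilities. The corpus, window size, and whitespace tokenization are illustrative assumptions, not part of GloVe itself:

 
from collections import defaultdict

# Toy corpus and a symmetric context window of size 2 (illustrative choices)
corpus = ["the cat sat on the mat", "the dog sat on the log"]
window = 2

# Step 1: count how often each word appears in the context of other words
cooccur = defaultdict(float)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooccur[(word, tokens[j])] += 1.0

# Step 2: normalize the counts for one word into probabilities
total = sum(v for (w, _), v in cooccur.items() if w == 'the')
print(cooccur[('the', 'cat')] / total)  # probability of 'cat' in the context of 'the'
 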

 

Word embedding with GloVe

    Let's delve into the practical implementation of GloVe using Python and the 'gensim' library. Ensure 'gensim' is installed using the following command: 

 
pip install gensim
 

    Additionally, it's necessary to download the GloVe vectors file, which is available on the Stanford NLP Group's GloVe project page (https://nlp.stanford.edu/projects/glove/). Using the 'gensim' library, we load the GloVe word-vector file into a model, extract the vector for the word of interest, and identify its five most similar words.

 
from gensim.models import KeyedVectors

# Load GloVe vectors (change the path to the downloaded file)
glove_path = 'glove.6B.50d.txt'
glove_model = KeyedVectors.load_word2vec_format(glove_path, 
                binary=False, no_header=True)

# Word of interest
word_of_interest = 'ball'

# Get vector for the word of interest
vector_of_interest = glove_model[word_of_interest]

# Find similar words
similar_words = glove_model.most_similar(word_of_interest, topn=5)

print(f"Similar words to 'ball':, {similar_words}") 
 

Similar words are displayed below:

   
Similar words to 'ball': [('kick', 0.864284873008728), ('catch', 0.8190028667449951), 
('off', 0.8133060336112976), ('kicking', 0.8079286813735962), ('got', 0.8033515214920044)]
 

t-SNE visualization of GloVe

    GloVe embeddings represent words as high-dimensional vectors, with each dimension capturing a specific facet of the word's meaning. Although this information is rich, visualizing it in its raw form poses challenges. To address this, we employ t-SNE (t-distributed Stochastic Neighbor Embedding), a widely used technique for reducing high-dimensional data to two or three dimensions, enabling the visualization of relationships between words in our GloVe space. For this purpose, scikit-learn provides the TSNE class, designed for visualizing high-dimensional data.

    In the following code snippet, we implement T-SNE visualization for words similar to 'ball'.

 
from gensim.models import KeyedVectors
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np

# Load GloVe vectors (change the path to the downloaded file)
glove_path = 'glove.6B.50d.txt'
glove_model = KeyedVectors.load_word2vec_format(glove_path, binary=False,
                no_header=True)

# Word of interest
word_of_interest = 'ball'

# Get vector for the word of interest
vector_of_interest = glove_model[word_of_interest]

# Find similar words
similar_words = glove_model.most_similar(word_of_interest, topn=5)

# Get vectors for similar words
similar_vectors = [glove_model[word] for word, _ in similar_words]

# t-SNE requires perplexity to be smaller than the number of samples
perplexity = min(30, len(similar_vectors) - 1)

# Apply t-SNE for dimensionality reduction with the chosen perplexity
tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)

# Stack the vectors into a single 2D array (word of interest first)
all_vectors = np.vstack([vector_of_interest] + similar_vectors)
vectors_tsne = tsne.fit_transform(all_vectors)

# Visualize t-SNE result
plt.figure(figsize=(8, 6))
plt.scatter(vectors_tsne[:, 0], vectors_tsne[:, 1], color='blue', marker='o')

# Annotate words
for i, word in enumerate([word_of_interest] + [word for word, _ in similar_words]):
    plt.annotate(word, (vectors_tsne[i, 0], vectors_tsne[i, 1]))

plt.title(f't-SNE Visualization of Similar Words to "{word_of_interest}"')
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.show()
 

Conclusion
 
     Word embeddings are dense vector representations of words in a continuous vector space. These representations capture semantic relationships between words, enabling machines to understand the contextual meaning of words in natural language processing tasks. 
    In this tutorial, we've briefly explored word embeddings, their representation with GloVe, and t-SNE visualization in Python.
 
 
References:
  1. GloVe: Global Vectors for Word Representation, Stanford NLP Group. https://nlp.stanford.edu/projects/glove/
  2. scikit-learn documentation: sklearn.manifold.TSNE. https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html