Understanding Bags of n-grams in Natural Language Processing with Python

      The bag of n-grams is a concept in natural language processing (NLP) that represents text data by the frequency of contiguous sequences of n items (usually words) within a document. The term "bag" implies that the order of occurrence is not considered; the focus is on the presence and frequency of individual n-grams.

     In this blog post, we will explore the bag of n-grams concept and its application in Python. The tutorial covers:

  1. The concept of bags of n-grams 
  2. Bags of n-grams representation in Python
  3. Conclusion

     Let's get started.

 

The concept of bags of n-grams  

    The bag of n-grams is a fundamental concept in NLP, combining the simplicity of tokenization with the richness of local context analysis. In essence, it involves breaking a text down into its constituent n-grams (sequences of 'n' consecutive words) and collecting them into a bag, or multiset, that records how often each n-gram occurs.
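
    To make the bag idea concrete, here is a minimal sketch that builds one by hand with Python's collections.Counter; the sample sentence and the simple whitespace split are illustrative assumptions rather than part of the main tutorial code.

from collections import Counter

# Sample text (illustrative)
text = "the cat sat on the mat and the cat slept"

# Simple whitespace tokenization; a real pipeline would use a proper tokenizer
tokens = text.split()

# The bag of bigrams: a multiset mapping each bigram to its frequency
bigram_bag = Counter(zip(tokens, tokens[1:]))

for bigram, count in bigram_bag.items():
    print(' '.join(bigram), '->', count)

    Note that the bigram "the cat" is counted twice while its position in the sentence is ignored, which is exactly what "bag" means here.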

    N-grams play an important role in natural language processing (NLP) and text analysis. They capture local patterns, aiding in text representation and structure understanding. N-grams are essential for feature extraction in machine learning models, providing a straightforward method to convert text into numerical features. Additionally, they form the foundation for probabilistic language models, measure document similarity, assist in spell checking, enhance speech recognition, optimize search engine results, facilitate text generation, and contribute to named entity recognition and information extraction.

 

Bags of n-grams representation in Python

        A bag of n-grams represents a document by the frequencies of its n-grams, where an n-gram is simply a chunk of text containing 'n' consecutive words. Below, we'll walk through the process of creating bags of n-grams step by step.

 

    Tokenization

    Tokenization breaks a given piece of text down into individual words or tokens, as illustrated below.
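
    As a quick illustration, NLTK's word_tokenize handles this step (on a first run you may need to download the 'punkt' tokenizer models, as sketched here; the sample sentence is the one used throughout this post).

import nltk
from nltk.tokenize import word_tokenize

# Download the tokenizer models once, if not already present
nltk.download('punkt', quiet=True)

text = "Bag of N-grams enhances text representation in NLP."
print(word_tokenize(text))
# ['Bag', 'of', 'N-grams', 'enhances', 'text', 'representation', 'in', 'NLP', '.']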

 

     Creation of N-Grams

    After tokenization, the text is transformed into a collection of n-grams. These can be unigrams (single words), bigrams (two consecutive words), trigrams (three consecutive words), and so forth. The example below shows how to generate n-grams for a given text.

 
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

# Sample text
text = "Bag of N-grams enhances text representation in NLP."

# Tokenize the text
tokens = word_tokenize(text)

# Function to generate n-grams as space-joined strings
def generate_ngrams(tokens, n):
    n_grams = ngrams(tokens, n)
    return [' '.join(grams) for grams in n_grams]

# Generate bags of 2-grams
bigrams = generate_ngrams(tokens, 2)

# Generate bags of 3-grams
trigrams = generate_ngrams(tokens, 3)

# Display the bags of bigrams and trigrams
print("Bag of Bigrams:", bigrams)
print("\nBag of Trigrams:", trigrams)


The output appears as follows.

 
Bag of Bigrams: ['Bag of', 'of N-grams', 'N-grams enhances', 'enhances text', 
'text representation', 'representation in', 'in NLP', 'NLP .']
 
Bag of Trigrams: ['Bag of N-grams', 'of N-grams enhances', 'N-grams enhances text', 
'enhances text representation', 'text representation in', 'representation in NLP', 
'in NLP .'] 
 


    The Document-Term Matrix

    The Document-Term Matrix (DTM) is a numeric representation of text data, where each row signifies a document, each column represents a unique term, and the cells contain the term frequency in each document. Essential for machine learning, the DTM transforms raw text into a format suitable for analysis. Its vectorization enables tasks like similarity measurement and clustering. With a sparse structure, it efficiently handles large vocabularies. The DTM's incorporation of term frequency information is crucial for understanding the relevance of terms in documents, making it fundamental in text mining and natural language processing.

    The example below shows how to form the bags by aggregating all unique n-grams from the text, disregarding word order but preserving the frequency of occurrence.

 
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
documents = [
    "Programming is fun",
    "Machine learning is fascinating",
    "Natural language processing is essential for NLP",
    "Python programming is versatile and powerful"
]

# Create a CountVectorizer with unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Get the feature names (n-grams); get_feature_names_out()
# replaces the get_feature_names() removed in recent scikit-learn
feature_names = vectorizer.get_feature_names_out().tolist()

# Convert the sparse matrix to a dense array for
# better readability
dense_array = X.toarray()

# Display the results
print("Feature names (Unigrams and Bigrams):")
print(feature_names)

print("\nDocument-Term Matrix:")
print(dense_array)
 

The output appears as follows. 

 
Feature names (Unigrams and Bigrams):
['and', 'and powerful', 'essential', 'essential for', 'fascinating', 'for', 'for nlp',
'fun', 'is', 'is essential', 'is fascinating', 'is fun', 'is versatile', 'language',
'language processing', 'learning', 'learning is', 'machine', 'machine learning',
'natural', 'natural language', 'nlp', 'powerful', 'processing', 'processing is',
'programming', 'programming is', 'python', 'python programming', 'versatile',
'versatile and']

Document-Term Matrix:
[[0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0]
 [0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 1 0 1 1 0 1 1 0 0 0 1 1 0 0 0 0 1 1 1 0 1 1 0 0 0 0 0 0]
 [1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 1 1 1]]
 

This representation is useful for various NLP tasks, such as text classification and sentiment analysis, as it captures some context and relationships between words in the document.
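
    For instance, the Document-Term Matrix built above can feed directly into a similarity computation. The short sketch below (assuming the X matrix from the previous example is still in scope) uses scikit-learn's cosine_similarity to compare the documents pairwise; documents sharing more n-grams, such as the two that both contain "programming is", receive higher scores.

from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between the document vectors in X;
# cosine_similarity accepts the sparse matrix directly
similarity = cosine_similarity(X)
print(similarity.round(2))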

    While n-grams have proven useful in various NLP applications, it's important to note that they have limitations, especially in capturing long-range dependencies and understanding context. More advanced models, such as transformers, have been developed to address these challenges, but n-grams remain a valuable tool, particularly in scenarios where simplicity and interpretability are essential.
 
 
Conclusion
 
    In this tutorial, we briefly explored the concept of the bag of n-grams and learned how to represent the bag of n-grams for a given text using the NLTK and scikit-learn libraries.
 
 