Text Classification Example with SpaCy and Scikit-Learn

     Text classification is a fundamental task in natural language processing (NLP) that involves categorizing text into predefined categories or labels. In this blog post, we will explore how to perform text classification using the SpaCy library for text preprocessing and the Scikit-Learn library for building a machine learning classifier. The tutorial covers:

  1. Preparing data
  2. Feature extraction with TF-IDF
  3. Building a text classifier
  4. Evaluating the model and making predictions
  5. Conclusion

     Let's get started.

We'll begin by loading the necessary libraries for this tutorial.

 
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report 
 

 

Preparing data

    We'll be working with a simple dataset that contains text samples categorized into three labels: "NLP", "Programming", and "Machine Learning". The dataset is designed to showcase the diversity of language used in these domains. You can also use your own dataset instead of this one.

 
# Training dataset with 3 labels
data = [
    ("SpaCy is a natural language processing library.", "NLP"),
    ("Text classification is an important NLP task.", "NLP"),
    ("Python is a versatile programming language.", "Programming"),
    ("Programming is fun.", "Programming"),
    ("Machine learning involves training models to make predictions.", "Machine Learning"),
    ("Data analysis is a crucial part of any data science project.", "Machine Learning"),
    ("Deep learning is a subfield of machine learning that focuses on neural networks.", "Machine Learning"),
    ("Web development involves building and maintaining websites.", "Programming"),
    ("Java is a popular programming language for building large-scale applications.", "Programming"),
    ("Natural Language Processing is used for understanding and generating human language.", "NLP"),
    ("Machine learning algorithms can be supervised or unsupervised.", "Machine Learning"),
    ("Programming languages like Python, Java, and C++ are widely used in software development.", "Programming"),
    ("Neural networks are a key component of deep learning models.", "Machine Learning"),
    ("Web developers use HTML, CSS, and JavaScript to create interactive websites.", "Programming"),
    ("NLP tasks include sentiment analysis, named entity recognition, and part-of-speech tagging.", "NLP"),
    ("Reinforcement learning is a type of machine learning where agents learn through trial and error.", "Machine Learning"),
    ("Software engineers play a crucial role in developing and maintaining software applications.", "Programming"),
    ("Big data analytics involves processing and analyzing large volumes of data to extract valuable insights.", "Machine Learning"),
    ("Creating mobile applications requires knowledge of mobile development frameworks and programming languages.", "Programming"),
    ("Natural Language Processing is a fascinating field within computer science.", "NLP"),
    ("Recurrent Neural Networks (RNNs) are commonly used in natural language processing tasks.", "Machine Learning"),
    ("JavaScript is widely used for client-side scripting in web development.", "Programming"),
    ("In machine learning, feature engineering plays a crucial role in model performance.", "Machine Learning"),
    ("Regular expressions are powerful tools for text processing in programming.", "Programming"),
    ("Semantic analysis is a key component of natural language processing systems.", "NLP"),
    ("Software development involves collaboration between developers and other stakeholders.", "Programming"),
    ("Gradient boosting is an ensemble learning technique used in machine learning.", "Machine Learning"),
    ("Mobile app development includes building applications for iOS and Android platforms.", "Programming"),
    ("Named entity recognition is a common task in natural language processing.", "NLP")
]

# Unpack the data into texts and labels
texts, labels = zip(*data)
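
    As a quick sanity check, we can confirm what we are working with: 29 labeled samples across 3 distinct labels.

print(len(texts), len(set(labels)))   # prints: 29 3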

    Next, we load the SpaCy language model for tokenization and text preprocessing. Each text sample is tokenized and converted to lowercase for consistency.
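
    If the en_core_web_sm model is not installed on your machine yet, you can download it once from the command line:

python -m spacy download en_core_web_sm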

 
# Load spaCy's pre-trained English model
nlp = spacy.load("en_core_web_sm")

# Tokenize and preprocess the text using spaCy
tokenized_texts = []
for text in texts:
    doc = nlp(text)
    tokenized_text = " ".join([token.text.lower() for token in doc])
    tokenized_texts.append(tokenized_text)
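
    For example, the first sample should come out looking roughly like this (spaCy splits the trailing period into a separate token):

print(tokenized_texts[0])
# spacy is a natural language processing library .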

    The dataset is then split into training and testing sets using Scikit-Learn's train_test_split function. This ensures that the model is trained on one portion of the data and evaluated on another, unseen portion.

 
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(tokenized_texts, labels, 
                                                test_size=0.3, random_state=42)
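
    Note that with a dataset this small, a purely random split can leave a label underrepresented in the test set. If that happens with your own data, train_test_split's stratify argument keeps the label proportions roughly equal in both splits; a minimal variant of the call above:

# Optional: stratified split preserves label proportions in train/test sets
X_train, X_test, y_train, y_test = train_test_split(tokenized_texts, labels,
                                                test_size=0.3, random_state=42,
                                                stratify=labels)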
  

 

Feature extraction with TF-IDF

    To represent the text data numerically, we use the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer. This converts each text sample into a vector of numerical features, taking into account the importance of each term in the entire dataset.

 
# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

# Transform the training data
X_train_tfidf = vectorizer.fit_transform(X_train)

# Transform the testing data
X_test_tfidf = vectorizer.transform(X_test)
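
    As an optional sanity check, we can inspect the fitted vocabulary and the shape of the resulting matrix. Note that get_feature_names_out requires scikit-learn 1.0 or newer; older versions provide get_feature_names instead.

# Optional: inspect the fitted vocabulary and matrix shape
print(X_train_tfidf.shape)                      # (number of training samples, number of terms)
print(vectorizer.get_feature_names_out()[:10])  # a few of the learned terms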


Building a text classifier

    The Multinomial Naive Bayes (MNB) classifier is a probabilistic machine learning algorithm based on Bayes' theorem. It is particularly well-suited for text classification, where the features (word counts, or here their TF-IDF weights) are modeled as draws from a multinomial distribution.

    The algorithm estimates the probabilities of each label using the training data. For each label, it calculates the likelihood of observing the features given that label. The prior probability of each label is also estimated from the training data.

    During prediction, the algorithm uses Bayes' theorem to calculate the probability of each label given the observed features and selects the label with the highest probability.
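
    In compact form, for a document d containing terms t1, ..., tn:

P(label | d) ∝ P(label) × P(t1 | label) × ... × P(tn | label)

    In practice, the classifier maximizes the equivalent sum of log-probabilities, which avoids numerical underflow when many small probabilities are multiplied together.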

    We use a Multinomial Naive Bayes classifier from Scikit-Learn to build the text classification model. The model is trained on the TF-IDF transformed training data.

 
# Train a Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)
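
    To see how confident the model is, we can also inspect the predicted class probabilities (an optional check, not required for the rest of the tutorial):

# Optional: per-class probabilities for the test samples
probabilities = classifier.predict_proba(X_test_tfidf)
print(classifier.classes_)   # label order of the probability columns
print(probabilities[0])      # probability of each label for the first test sample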


Evaluating the model and making predictions

    The model's performance is evaluated on the testing set using accuracy and a detailed classification report. Here, we use Scikit-Learn's accuracy_score and classification_report functions.

 
# Make predictions on the test set
y_pred = classifier.predict(X_test_tfidf)

# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy*100:.2f}")

# Display classification report
print("Classification Report:\n", classification_report(y_test, y_pred))

The result:

   
Accuracy: 100.00
Classification Report:
                   precision    recall  f1-score   support

Machine Learning       1.00      1.00      1.00         3
             NLP       1.00      1.00      1.00         2
     Programming       1.00      1.00      1.00         4

        accuracy                           1.00         9
       macro avg       1.00      1.00      1.00         9
    weighted avg       1.00      1.00      1.00         9

    Keep in mind that this perfect score comes from a test set of just 9 samples, so it says little about how the model would perform on real-world data.

    The trained model is then used to predict the labels for new sentences. We define two new sentences and classify them with the trained model.


# Predict new sentences
new_sentences = [
    "Data scientists use machine learning algorithms to extract insights from data.",
    "Spacy helps for your text processing tasks"
]

# Tokenize and preprocess the new sentences using spaCy
tokenized_new_sentences = []
for sentence in new_sentences:
    doc = nlp(sentence)
    tokenized_sentence = " ".join([token.text.lower() for token in doc])
    tokenized_new_sentences.append(tokenized_sentence)

# Transform the new sentences using the TF-IDF vectorizer
new_sentences_tfidf = vectorizer.transform(tokenized_new_sentences)

# Make predictions for the new sentences
predicted_labels = classifier.predict(new_sentences_tfidf)

# Display the predicted labels for each new sentence
for sentence, label in zip(new_sentences, predicted_labels):
    print(f"Sentence: {sentence}\nPredicted Label: {label}\n")

The result is as follows.

   
Sentence: Data scientists use machine learning algorithms to extract insights from data.
Predicted Label: Machine Learning

Sentence: Spacy helps for your text processing tasks
Predicted Label: Programming

    Note that the second sentence is arguably about NLP, yet the model labels it "Programming". With a training set this small, individual terms carry a lot of weight: "text processing" appears in a "Programming" training example ("Regular expressions are powerful tools for text processing in programming."), which is enough to tip the prediction.
 

 

Conclusion
 
    In this tutorial, we explored how to classify text data with the SpaCy and Scikit-Learn libraries. Text classification is a valuable tool in various applications, from sentiment analysis to content categorization. By combining SpaCy's text processing capabilities with Scikit-Learn's machine learning tools, we can build accurate and efficient text classifiers. The full source code is listed below.
 

Source code listing

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Training dataset with 3 labels
data = [
    ("SpaCy is a natural language processing library.", "NLP"),
    ("Text classification is an important NLP task.", "NLP"),
    ("Python is a versatile programming language.", "Programming"),
    ("Programming is fun.", "Programming"),
    ("Machine learning involves training models to make predictions.", "Machine Learning"),
    ("Data analysis is a crucial part of any data science project.", "Machine Learning"),
    ("Deep learning is a subfield of machine learning that focuses on neural networks.", "Machine Learning"),
    ("Web development involves building and maintaining websites.", "Programming"),
    ("Java is a popular programming language for building large-scale applications.", "Programming"),
    ("Natural Language Processing is used for understanding and generating human language.", "NLP"),
    ("Machine learning algorithms can be supervised or unsupervised.", "Machine Learning"),
    ("Programming languages like Python, Java, and C++ are widely used in software development.", "Programming"),
    ("Neural networks are a key component of deep learning models.", "Machine Learning"),
    ("Web developers use HTML, CSS, and JavaScript to create interactive websites.", "Programming"),
    ("NLP tasks include sentiment analysis, named entity recognition, and part-of-speech tagging.", "NLP"),
    ("Reinforcement learning is a type of machine learning where agents learn through trial and error.", "Machine Learning"),
    ("Software engineers play a crucial role in developing and maintaining software applications.", "Programming"),
    ("Big data analytics involves processing and analyzing large volumes of data to extract valuable insights.", "Machine Learning"),
    ("Creating mobile applications requires knowledge of mobile development frameworks and programming languages.", "Programming"),
    ("Natural Language Processing is a fascinating field within computer science.", "NLP"),
    ("Recurrent Neural Networks (RNNs) are commonly used in natural language processing tasks.", "Machine Learning"),
    ("JavaScript is widely used for client-side scripting in web development.", "Programming"),
    ("In machine learning, feature engineering plays a crucial role in model performance.", "Machine Learning"),
    ("Regular expressions are powerful tools for text processing in programming.", "Programming"),
    ("Semantic analysis is a key component of natural language processing systems.", "NLP"),
    ("Software development involves collaboration between developers and other stakeholders.", "Programming"),
    ("Gradient boosting is an ensemble learning technique used in machine learning.", "Machine Learning"),
    ("Mobile app development includes building applications for iOS and Android platforms.", "Programming"),
    ("Named entity recognition is a common task in natural language processing.", "NLP")
]

# Unpack the data into texts and labels
texts, labels = zip(*data)

# Load spaCy's pre-trained English model
nlp = spacy.load("en_core_web_sm")

# Tokenize and preprocess the text using spaCy
tokenized_texts = []
for text in texts:
    doc = nlp(text)
    tokenized_text = " ".join([token.text.lower() for token in doc])
    tokenized_texts.append(tokenized_text)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(tokenized_texts, labels, test_size=0.3, random_state=42)

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

# Transform the training data
X_train_tfidf = vectorizer.fit_transform(X_train)

# Transform the testing data
X_test_tfidf = vectorizer.transform(X_test)

# Train a Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)

# Make predictions on the test set
y_pred = classifier.predict(X_test_tfidf)

# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy*100:.2f}")

# Display classification report
print("Classification Report:\n", classification_report(y_test, y_pred))

# Predict new sentences
new_sentences = [
"Data scientists use machine learning algorithms to extract insights from data.",
"Spacy helps for your text processing tasks"
]
# Tokenize and preprocess the new sentences using spaCy
tokenized_new_sentences = []
for sentence in new_sentences:
    doc = nlp(sentence)
    tokenized_sentence = " ".join([token.text.lower() for token in doc])
    tokenized_new_sentences.append(tokenized_sentence)

# Transform the new sentences using the TF-IDF vectorizer
new_sentences_tfidf = vectorizer.transform(tokenized_new_sentences)

# Make predictions for the new sentences
predicted_labels = classifier.predict(new_sentences_tfidf)

# Display the predicted labels for each new sentence
for sentence, label in zip(new_sentences, predicted_labels):
print(f"Sentence: {sentence}\nPredicted Label: {label}\n")

 
 