Text Classification with BERT in PyTorch

       Text classification is a fundamental task in NLP that involves categorizing text into predefined categories or labels. With the advent of deep learning and transformer-based models like BERT (Bidirectional Encoder Representations from Transformers), text classification has witnessed significant advancements in accuracy and performance. 

    In this tutorial, we will explore how to perform text classification using BERT in PyTorch, from data preparation through model training to prediction. The tutorial covers:

  1. Preparing data for text classification
  2. Overview of BERT
  3. Tokenization and encoding
  4. Loading the pre-trained BERT model
  5. Training (fine-tuning) the model
  6. Making predictions on new sentences
  7. Conclusion 
  8. Source code listing

     Let's get started.

 

Preparing Data for Text Classification

    First, we'll prepare the training data for fine-tuning the BERT model. Below, I have prepared a small custom dataset where train_data contains labeled training samples, each represented as a tuple of text and its corresponding label. The dataset has three categories: "NLP", "Machine Learning", and "Programming". The new_sentences list contains new, unlabeled sentences for prediction. In practice you would usually load such data from a file; a short sketch of that is shown after the listing below.

 
# Training dataset with 3 labels
train_data = [
("SpaCy is a natural language processing library.", "NLP"),
("Text classification is an important NLP task.", "NLP"),
("Python is a versatile programming language.", "Programming"),
("Programming is fun.", "Programming"),
("Machine learning involves training models to make predictions.", "Machine Learning"),
("Data analysis is a crucial part of any data science project.", "Machine Learning"),
("Deep learning is a subfield of machine learning that focuses on neural networks.", "Machine Learning"),
("Web development involves building and maintaining websites.", "Programming"),
("Java is a popular programming language for building large-scale applications.", "Programming"),
("Natural Language Processing is used for understanding and generating human language.", "NLP"),
("Machine learning algorithms can be supervised or unsupervised.", "Machine Learning"),
("Programming languages like Python, Java, and C++ are widely used in software development.", "Programming"),
("Neural networks are a key component of deep learning models.", "Machine Learning"),
("Web developers use HTML, CSS, and JavaScript to create interactive websites.", "Programming"),
("NLP tasks include sentiment analysis, named entity recognition, and part-of-speech tagging.", "NLP"),
("Software engineers play a crucial role in developing and maintaining software applications.", "Programming"),
("Natural Language Processing is a fascinating field within computer science.", "NLP"),
("Recurrent Neural Networks (RNNs) are commonly used in natural language processing tasks.", "Machine Learning"),
("JavaScript is widely used for client-side scripting in web development.", "Programming"),
("In machine learning, feature engineering plays a crucial role in model performance.", "Machine Learning"),
("Regular expressions are powerful tools for text processing in programming.", "Programming"),
("Semantic analysis is a key component of natural language processing systems.", "NLP"),
("Software development involves collaboration between developers and other stakeholders.", "Programming"),
("Gradient boosting is an ensemble learning technique used in machine learning.", "Machine Learning"),
("Mobile app development includes building applications for iOS and Android platforms.", "Programming"),
("Named entity recognition is a common task in natural language processing.", "NLP")
]

# New sentences for prediction
new_sentences = [
"Data scientists use machine learning algorithms to extract insights from data.",
"Spacy helps for your text processing tasks"
]
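
    In a real project, the labeled examples would usually be loaded from a file rather than hard-coded. Below is a minimal sketch, assuming a hypothetical CSV file named train.csv with "text" and "label" columns; the file name and column names are illustrative, not part of this tutorial's dataset.

import csv

# Hypothetical alternative: load (text, label) pairs from a CSV file
# (the file name "train.csv" and its columns are assumptions for this sketch)
with open("train.csv", newline="", encoding="utf-8") as f:
    train_data = [(row["text"], row["label"]) for row in csv.DictReader(f)]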
 

 

Overview of BERT

      BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model developed by Google for natural language understanding tasks. BERT leverages bidirectional transformers to capture contextual information from both preceding and succeeding words in a sentence. It has achieved state-of-the-art performance on various NLP tasks, including text classification.
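
    To get a feel for this bidirectionality, you can try BERT's original masked-language-modeling objective, in which the model predicts a hidden word from the context on both sides. The short, optional illustration below uses the Hugging Face fill-mask pipeline; it is not needed for the rest of the tutorial.

from transformers import pipeline

# Optional illustration: BERT predicts a masked word from both-sided context
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("Text classification is an important [MASK] task."):
    print(prediction["token_str"], round(prediction["score"], 3))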

    First, ensure you have the necessary libraries installed. We'll be using the transformers library from Hugging Face and PyTorch for this tutorial. You can install them by using the following command:


pip install transformers torch

    We'll begin by importing necessary modules and classes from the transformers and torch libraries. Specifically, we'll import BertTokenizer and BertForSequenceClassification from transformers, as well as softmax from torch.nn.functional and the torch module itself. 

 
from transformers import BertTokenizer, BertForSequenceClassification
from torch.nn.functional import softmax
import torch
 

 

Tokenization and Encoding   

    Next, we load the tokenizer. The BertTokenizer class is used to tokenize and encode the text data: it converts input text into numerical tokens suitable for the BERT model. The tokenizer is initialized with the pre-trained "bert-base-uncased" model. The training texts are then tokenized and encoded (note that zip returns a tuple, which we convert to a list before passing it to the tokenizer), and the labels are encoded into numerical form.

 
# Tokenize and encode the training data
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
train_texts, train_labels = zip(*train_data)
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True,
                            return_tensors="pt", max_length=128)

# Encode labels into numerical form
label2id = {label: i for i, label in enumerate(set(train_labels))}
id2label = {i: label for label, i in label2id.items()}
train_labels = torch.tensor([label2id[label] for label in train_labels])
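
    The tokenizer returns a dictionary-like batch of tensors. If you want to check what it produced before training, a quick optional inspection looks like this:

# Optional: inspect the encoded batch
print(train_encodings.keys())                # input_ids, token_type_ids, attention_mask
print(train_encodings["input_ids"].shape)    # (number of examples, padded sequence length)
print(train_labels)                          # numeric label ids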

 

Loading the Pre-trained BERT Model

    The BertForSequenceClassification class is used to load a pre-trained BERT model with a sequence classification head. The model is initialized with the pre-trained "bert-base-uncased" weights and the number of unique labels in the training data; this loads the pre-trained weights and configuration for the specified model name and adds a randomly initialized classification layer on top.

 
# Load pre-trained BERT model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                       num_labels=len(label2id))
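
    Optionally, the label mappings can also be stored in the model configuration so that a saved checkpoint carries human-readable label names. This is a small variation on the same call; the id2label and label2id arguments are standard Hugging Face configuration options.

# Optional variation: store the label mappings in the model config
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                       num_labels=len(label2id),
                                                       id2label=id2label,
                                                       label2id=label2id)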

 

Training (Fine-Tuning) the Model

    Next, we define an optimizer (AdamW) for training. Because we pass the labels directly to the model, BertForSequenceClassification computes the cross-entropy loss internally, so no separate loss function is needed. The model is trained for a specified number of epochs; in each epoch, the optimizer minimizes the loss returned by the model.


# Define the optimizer (the model computes the cross-entropy loss internally
# when labels are passed to it)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Train for a few epochs (you may need more epochs based on your dataset)
model.train()
for epoch in range(10):
    optimizer.zero_grad()
    outputs = model(**train_encodings, labels=train_labels)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch + 1}, Loss: {loss.item()}")
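
    The loop above trains on the whole dataset as a single batch, which is fine for this tiny example. For larger datasets you would typically iterate over mini-batches; below is a minimal sketch using PyTorch's TensorDataset and DataLoader, with an arbitrary batch size of 8.

from torch.utils.data import DataLoader, TensorDataset

# Mini-batch training sketch (reuses train_encodings, train_labels, model, optimizer)
dataset = TensorDataset(train_encodings["input_ids"],
                        train_encodings["attention_mask"],
                        train_labels)
loader = DataLoader(dataset, batch_size=8, shuffle=True)

model.train()
for epoch in range(10):
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        outputs.loss.backward()
        optimizer.step()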

 

Making Predictions on New Sentences

        With the trained model, new sentences are tokenized, encoded, and passed through the model for prediction. The output logits from the model are converted into probabilities using softmax. The predicted class indices are obtained by selecting the class with the highest probability. Finally, the predicted class labels are mapped back to their original labels using id2label.


# Make predictions on new sentences
model.eval()
with torch.no_grad():
    # Tokenize and encode new sentences
    new_encodings = tokenizer(new_sentences, truncation=True, padding=True,
                              return_tensors="pt", max_length=128)

    # Get model predictions (logits)
    logits = model(**new_encodings).logits

    # Apply softmax to obtain probabilities
    probs = softmax(logits, dim=1)

    # Get the predicted class indices
    predicted_class_indices = torch.argmax(probs, dim=1).numpy()

    # Map predicted class indices to labels
    predicted_labels = [id2label[i] for i in predicted_class_indices]
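
    If you also want to see how confident the model is in each category, the probability tensor can be inspected directly:

# Optional: per-class probabilities for each new sentence
for sentence, prob in zip(new_sentences, probs):
    scores = {id2label[i]: round(p.item(), 3) for i, p in enumerate(prob)}
    print(sentence, scores)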

    Finally, we print the predicted labels for the new sentences.

 
# Print the results
for sentence, label in zip(new_sentences, predicted_labels):
    print(f"Sentence: {sentence}")
    print(f"Predicted Label: {label}")
 

We run the code, and the result looks as follows.

 
Epoch 1, Loss: 1.0877890586853027
Epoch 2, Loss: 0.9981281757354736
Epoch 3, Loss: 0.9305252432823181
Epoch 4, Loss: 0.7934939861297607
Epoch 5, Loss: 0.6986249089241028
Epoch 6, Loss: 0.6253791451454163
Epoch 7, Loss: 0.5116406679153442
Epoch 8, Loss: 0.43368178606033325
Epoch 9, Loss: 0.33982008695602417
Epoch 10, Loss: 0.26518476009368896
Sentence: Data scientists use machine learning algorithms to extract insights from data.
Predicted Label: Machine Learning
Sentence: Spacy helps for your text processing tasks
Predicted Label: NLP
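
    To reuse the fine-tuned model and tokenizer later without retraining, they can be saved to a directory and reloaded. A minimal sketch follows; the directory name "bert-text-classifier" is an arbitrary choice.

# Save the fine-tuned model and tokenizer (directory name is arbitrary)
model.save_pretrained("bert-text-classifier")
tokenizer.save_pretrained("bert-text-classifier")

# Reload them later for inference
loaded_model = BertForSequenceClassification.from_pretrained("bert-text-classifier")
loaded_tokenizer = BertTokenizer.from_pretrained("bert-text-classifier")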
 

 

Conclusion
 
    In this tutorial, we have covered the entire process of text classification with BERT in PyTorch, from data preparation to model training and prediction. By following these steps and leveraging the capabilities of BERT, you can develop accurate and efficient text classification models for various real-world applications in natural language processing. The full source code is listed below.
 
 
Source code listing

 
from transformers import BertTokenizer, BertForSequenceClassification
from torch.nn.functional import softmax
import torch

# Training dataset with 3 labels
train_data = [
("SpaCy is a natural language processing library.", "NLP"),
("Text classification is an important NLP task.", "NLP"),
("Python is a versatile programming language.", "Programming"),
("Programming is fun.", "Programming"),
("Machine learning involves training models to make predictions.", "Machine Learning"),
("Data analysis is a crucial part of any data science project.", "Machine Learning"),
("Deep learning is a subfield of machine learning that focuses on neural networks.", "Machine Learning"),
("Web development involves building and maintaining websites.", "Programming"),
("Java is a popular programming language for building large-scale applications.", "Programming"),
("Natural Language Processing is used for understanding and generating human language.", "NLP"),
("Machine learning algorithms can be supervised or unsupervised.", "Machine Learning"),
("Programming languages like Python, Java, and C++ are widely used in software development.", "Programming"),
("Neural networks are a key component of deep learning models.", "Machine Learning"),
("Web developers use HTML, CSS, and JavaScript to create interactive websites.", "Programming"),
("NLP tasks include sentiment analysis, named entity recognition, and part-of-speech tagging.", "NLP"),
("Software engineers play a crucial role in developing and maintaining software applications.", "Programming"),
("Natural Language Processing is a fascinating field within computer science.", "NLP"),
("Recurrent Neural Networks (RNNs) are commonly used in natural language processing tasks.", "Machine Learning"),
("JavaScript is widely used for client-side scripting in web development.", "Programming"),
("In machine learning, feature engineering plays a crucial role in model performance.", "Machine Learning"),
("Regular expressions are powerful tools for text processing in programming.", "Programming"),
("Semantic analysis is a key component of natural language processing systems.", "NLP"),
("Software development involves collaboration between developers and other stakeholders.", "Programming"),
("Gradient boosting is an ensemble learning technique used in machine learning.", "Machine Learning"),
("Mobile app development includes building applications for iOS and Android platforms.", "Programming"),
("Named entity recognition is a common task in natural language processing.", "NLP")
]

# New sentences for prediction
new_sentences = [
"Data scientists use machine learning algorithms to extract insights from data.",
"Spacy helps for your text processing tasks"
]

# Tokenize and encode the training data
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
train_texts, train_labels = zip(*train_data)
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True, return_tensors="pt", max_length=128)

# Encode labels into numerical form
label2id = {label: i for i, label in enumerate(set(train_labels))}
id2label = {i: label for label, i in label2id.items()}
train_labels = torch.tensor([label2id[label] for label in train_labels])

# Load pre-trained BERT model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=len(label2id))

# Define the optimizer (the model computes the cross-entropy loss internally
# when labels are passed to it)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Train for a few epochs (you may need more epochs based on your dataset)
model.train()
for epoch in range(10):
    optimizer.zero_grad()
    outputs = model(**train_encodings, labels=train_labels)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch + 1}, Loss: {loss.item()}")

# Make predictions on new sentences
model.eval()
with torch.no_grad():
    # Tokenize and encode new sentences
    new_encodings = tokenizer(new_sentences, truncation=True, padding=True, return_tensors="pt", max_length=128)

    # Get model predictions (logits)
    logits = model(**new_encodings).logits

    # Apply softmax to obtain probabilities
    probs = softmax(logits, dim=1)

    # Get the predicted class indices
    predicted_class_indices = torch.argmax(probs, dim=1).numpy()

    # Map predicted class indices to labels
    predicted_labels = [id2label[i] for i in predicted_class_indices]

# Print the results
for sentence, label in zip(new_sentences, predicted_labels):
    print(f"Sentence: {sentence}")
    print(f"Predicted Label: {label}")

 
 

