Binary Classification with Logistic Regression using PyTorch

     Logistic regression is a fundamental machine learning algorithm for binary classification tasks. In this tutorial, we'll explore how to perform binary classification with logistic regression using the PyTorch deep learning framework. We'll cover the following topics:

  1. Introduction to logistic regression
  2. Preparing data
  3. Building the classifier model
  4. Training the model
  5. Prediction and accuracy check
  6. Conclusion 
  7. Source code listing

     Let's get started.

    Please note that this tutorial provides a basic introduction to implementing logistic regression for data classification using PyTorch. Keep in mind that parameters and model definitions may require adjustments when dealing with larger datasets.


Introduction to logistic regression

    Logistic regression is a linear classification algorithm that predicts the probability that an instance belongs to a particular class. It's commonly used for binary classification tasks where the target variable has two possible outcomes, such as spam detection, disease diagnosis, and sentiment analysis.

    The logistic regression model calculates the probability that an input sample belongs to the positive class using the logistic function (also known as the sigmoid function). Mathematically, the logistic regression model can be represented as:       

P(y=1 | x; w) = 1 / (1 + e^(-(w^T x + b)))

Where:

  • P(y=1 | x; w) is the probability that y equals 1 given input x and model parameters w.
  • x is the vector of input features.
  • w is the weight vector.
  • b is the bias term.
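
    To make the formula concrete, here is a minimal sketch that evaluates the logistic function on one sample. The weights, bias, and input values below are made up purely for illustration.

import torch

# Hypothetical weights, bias, and input (illustrative values only)
w = torch.tensor([0.8, -0.4])
b = torch.tensor(0.1)
x = torch.tensor([1.5, 2.0])

# Linear score z = w^T x + b, squashed into (0, 1) by the sigmoid
z = torch.dot(w, x) + b
p = torch.sigmoid(z)  # same as 1 / (1 + exp(-z))
print(p.item())       # probability that y = 1; here about 0.62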

 

Preparing data

    We'll begin by loading the necessary libraries for this tutorial.

 
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report 
 

     Before building the model, it's essential to preprocess the data. This may include tasks such as data cleaning, feature scaling, and splitting the dataset into training and test sets. 

    We first load the breast cancer dataset using scikit-learn's load_breast_cancer function. Then, we separate the features X and the target labels y. Next, we standardize the features using StandardScaler so that each feature has a mean of 0 and a standard deviation of 1. After standardization, we convert the data into PyTorch tensors, X_tensor for features and y_tensor for labels, using the torch.tensor function. Finally, we split the tensors into training and test sets with train_test_split, holding out 20% of the samples for testing.

 
# Load the breast cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Convert data to PyTorch tensors
X_tensor = torch.tensor(X_scaled, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.long)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tensor, y_tensor, 
                                    test_size=0.2, random_state=42)
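
    As a quick optional sanity check, we can print the shapes of the splits. The breast cancer dataset has 569 samples and 30 features, so a 20% test split should leave 455 training rows and 114 test rows.

# Optional sanity check on the split sizes
print(X_train.shape, X_test.shape)  # torch.Size([455, 30]) torch.Size([114, 30])
print(y_train.shape, y_test.shape)  # torch.Size([455]) torch.Size([114])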

 

Building the classifier model

     We define a new class named LogisticRegression, which inherits from nn.Module. Inside it, we create an instance of the nn.Linear module, which applies a linear transformation to the input data. nn.Linear takes two arguments: input_size, the number of features in the input data, and num_classes, the number of output classes for classification.

    The forward method specifies how input data flows through the model. In this case, we apply the linear transformation defined by self.linear to the input x. The output out represents the logits for each class, which are then used to compute probabilities during inference.


# Define logistic regression model
class LogisticRegression(nn.Module):
    def __init__(self, input_size, num_classes):
        super(LogisticRegression, self).__init__()
        # A single linear layer maps input features to class logits
        self.linear = nn.Linear(input_size, num_classes)

    def forward(self, x):
        # Return raw logits; probabilities are computed from them later
        out = self.linear(x)
        return out
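
    Since the model returns raw logits, turning them into probabilities at inference time is a one-liner with torch.softmax. A small sketch with made-up logits for three samples:

# Hypothetical logits for three samples and two classes (illustrative values)
logits = torch.tensor([[2.0, -1.0],
                       [0.5, 0.4],
                       [-3.0, 1.0]])
probs = torch.softmax(logits, dim=1)  # each row sums to 1
print(probs)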
 

 

Training the model 

    We optimize the model parameters (the weights and bias) with gradient descent, minimizing the cross-entropy loss during training. With two output classes, softmax cross-entropy is mathematically equivalent to the sigmoid-based binary logistic regression described above.

    We determine the input size from the number of features in our dataset (X_train.shape[1]) and the number of classes from the count of unique labels in the training set (len(torch.unique(y_train))). Then, we initialize our logistic regression model LogisticRegression with the determined input size and number of classes. For the loss function, we use nn.CrossEntropyLoss(), which applies softmax to the logits internally and is commonly used for classification problems. For optimization, we choose stochastic gradient descent (SGD) with a learning rate (lr) of 0.01.

 
# Initialize the model
input_size = X_train.shape[1]
num_classes = len(torch.unique(y_train))
model = LogisticRegression(input_size, num_classes)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
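
    As an aside, the textbook binary formulation with a sigmoid output can be written just as easily. The sketch below (not used in the rest of this tutorial) uses a single output logit with nn.BCEWithLogitsLoss, which combines the sigmoid and binary cross-entropy in one numerically stable operation.

# Alternative: strict binary logistic regression with one output logit
binary_model = nn.Linear(input_size, 1)           # one logit per sample
binary_criterion = nn.BCEWithLogitsLoss()         # sigmoid + binary cross-entropy
binary_optimizer = optim.SGD(binary_model.parameters(), lr=0.01)

binary_optimizer.zero_grad()
logits = binary_model(X_train).squeeze(1)         # shape: (num_samples,)
loss = binary_criterion(logits, y_train.float())  # targets must be floats in {0., 1.}
loss.backward()
binary_optimizer.step()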

       To train the model, we set the number of epochs to 200. Within the training loop, for each epoch we reset the gradients of the model parameters to prevent accumulation from prior iterations (optimizer.zero_grad()), then pass the training features X_train through the model to get the predicted outputs. Using the predicted outputs and the true labels y_train, we calculate the loss with the specified loss function (criterion). Next, we perform backpropagation to compute the gradients of the loss with respect to the model parameters (loss.backward()). Finally, we update the model parameters with the optimizer (optimizer.step()), taking a step towards minimizing the loss.

 
# Train the model
num_epochs = 200
for epoch in range(num_epochs):
    optimizer.zero_grad()               # clear gradients from the previous step
    outputs = model(X_train)            # forward pass on the full training set
    loss = criterion(outputs, y_train)  # compute the cross-entropy loss
    loss.backward()                     # backpropagate gradients
    optimizer.step()                    # update the weights and bias
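
    The loop above performs full-batch gradient descent, which is fine for 455 rows. For the larger datasets mentioned in the note at the start, the usual adjustment is mini-batch training with torch.utils.data.DataLoader; a minimal sketch, with the batch size chosen arbitrarily:

from torch.utils.data import TensorDataset, DataLoader

# Mini-batch variant for larger datasets (batch_size is an arbitrary choice)
train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)

for epoch in range(num_epochs):
    for batch_X, batch_y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_X), batch_y)
        loss.backward()
        optimizer.step()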

 

Prediction and accuracy check

    In this part, we first make predictions on the test data using the trained logistic regression model. Inside the with torch.no_grad() block, we ensure that no gradients are calculated during inference, saving memory and computation. We obtain the predicted class labels y_pred by taking the index of the maximum value along the second dimension of the output tensor outputs, which corresponds to the predicted class.

    Next, we convert the predicted labels and the true labels from PyTorch tensors to numpy arrays using the numpy() method. This conversion is necessary for scikit-learn's accuracy_score and classification_report functions.

    Then, we calculate the accuracy of the predictions by comparing the predicted labels y_pred_np with the true labels y_test_np using the accuracy_score function from scikit-learn.

    Finally, we print the accuracy score and the classification report, which summarizes evaluation metrics such as precision, recall, and F1-score for each class.

# Predict on test data
with torch.no_grad():
    outputs = model(X_test)
    _, y_pred = torch.max(outputs, 1)  # index of the larger logit = predicted class

# Convert predictions and labels to numpy arrays
y_pred_np = y_pred.numpy()
y_test_np = y_test.numpy()

# Calculate accuracy
accuracy = accuracy_score(y_test_np, y_pred_np)
print("Accuracy:", accuracy)

# Print classification report
print("Classification Report:")
print(classification_report(y_test_np, y_pred_np))


The result is as follows.

  
Accuracy: 0.9736842105263158
Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.98      0.97        43
           1       0.99      0.97      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
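
    To score a new, unscaled sample with the trained model, remember to apply the same scaler before prediction. A small sketch; for illustration it reuses the first row of the raw feature matrix X as a stand-in for a fresh measurement.

# Score one new sample (here: the first raw row, for illustration)
new_sample = X[:1]
new_tensor = torch.tensor(scaler.transform(new_sample), dtype=torch.float32)

with torch.no_grad():
    probs = torch.softmax(model(new_tensor), dim=1)  # class probabilities
    pred = torch.argmax(probs, dim=1)                # predicted class label
print(probs, pred)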
 


Conclusion
 
    In this tutorial, we've covered the basics of logistic regression and demonstrated how to implement it using PyTorch. Logistic regression is a powerful algorithm for binary classification tasks, and with PyTorch, building and training logistic regression models becomes straightforward. The full source code is listed below. 

 

Source code listing

 
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# Load the breast cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Convert data to PyTorch tensors
X_tensor = torch.tensor(X_scaled, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.long)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tensor, y_tensor, 
                                    test_size=0.2, random_state=42)

# Define logistic regression model
class LogisticRegression(nn.Module):
    def __init__(self, input_size, num_classes):
        super(LogisticRegression, self).__init__()
        self.linear = nn.Linear(input_size, num_classes)

    def forward(self, x):
        out = self.linear(x)
        return out

# Initialize the model
input_size = X_train.shape[1]
num_classes = len(torch.unique(y_train))
model = LogisticRegression(input_size, num_classes)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Train the model
num_epochs = 200
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    loss.backward()
    optimizer.step()

# Predict on test data
with torch.no_grad():
    outputs = model(X_test)
    _, y_pred = torch.max(outputs, 1)

# Convert predictions and labels to numpy arrays
y_pred_np = y_pred.numpy()
y_test_np = y_test.numpy()

# Calculate accuracy
accuracy = accuracy_score(y_test_np, y_pred_np)
print("Accuracy:", accuracy)

# Print classification report
print("Classification Report:")
print(classification_report(y_test_np, y_pred_np))
 

