Classification Example with BaggingClassifier in Python

   Bagging, short for Bootstrap Aggregating, is an ensemble learning technique in machine learning that combines multiple models to improve predictive performance. It works by training multiple models independently on different subsets of the training data and then combining their predictions through averaging (for regression) or voting (for classification).

    In this tutorial, we'll explore the basics of the bagging technique and how to implement classification with Scikit-learn's BaggingClassifier class. The tutorial covers:

  1. Introduction to bagging
  2. Bagging with a single estimator
  3. Bagging with multiple estimators  
  4. Conclusion

 


Introduction to Bagging

    Bagging, short for Bootstrap Aggregating, is a widely used technique in ensemble learning to improve the performance of machine learning models.

    In bagging, multiple base learners (often of the same type) are trained independently on different subsets of the training data. These subsets are typically created by sampling the training data with replacement. Each base learner then makes its predictions, and for classification tasks the final prediction is typically obtained by majority voting over the predictions of all base learners.

    The main idea behind bagging is to reduce overfitting and variance by combining the predictions of multiple models trained on different subsets of the data. This often leads to better generalization performance compared to individual models. RandomForest, for example, is a popular ensemble learning method that uses bagging with decision trees as base learners.
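    To make these two steps concrete, here is a minimal sketch of bootstrap sampling and majority voting written with plain NumPy. It is only an illustration of the idea, not part of the tutorial's main code, and the toy arrays are made up for this example.

import numpy as np

rng = np.random.default_rng(42)
X = np.arange(10)  # a toy "training set" of 10 samples

# Bootstrap sampling: draw each subset's indices with replacement
for i in range(3):
    idx = rng.integers(0, len(X), size=len(X))
    print("subset", i, "->", X[idx])  # duplicate samples are expected

# Majority voting: combine the class predictions of three base learners
preds = np.array([[0, 1, 1],   # learner 1's predictions for 3 test samples
                  [0, 1, 0],   # learner 2's predictions
                  [1, 1, 0]])  # learner 3's predictions
final = np.array([np.bincount(col).argmax() for col in preds.T])
print("voted prediction:", final)  # [0 1 0]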

 

Bagging with a single estimator

    Now let's start implementing classification with the bagging method in Python. We'll begin by loading the necessary libraries for this tutorial.

 
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

    Then we load the Iris dataset and split it into train and test sets using the train_test_split function.

 
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

To define a base classifier, we use the DecisionTreeClassifier class, then initialize the bagging classifier with the base estimator and the number of estimators. We train the model on the training data using the fit() method.

 
# Initialize a base classifier (Decision Tree)
base_classifier = DecisionTreeClassifier(random_state=42)

# Initialize a Bagging Classifier with base classifier
bagging_classifier = BaggingClassifier(base_classifier, n_estimators=10, random_state=42)

# Train the bagging classifier
bagging_classifier.fit(X_train, y_train)
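
As a side note, BaggingClassifier also exposes parameters that control how each bootstrap subset is drawn. The variant below is only an illustration; its values are not tuned for this dataset.

# Illustrative variant: each tree sees 80% of the training samples
bagging_variant = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=10,
    max_samples=0.8,   # fraction of training samples drawn for each base estimator
    max_features=1.0,  # fraction of features used by each base estimator
    bootstrap=True,    # sample with replacement (the "bootstrap" in bagging)
    random_state=42,
)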

After training the model, we can predict on the test data and calculate the prediction accuracy.


# Make predictions on the test set
y_pred = bagging_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))
 

The result looks as follows.

 
Accuracy: 1.0
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
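
Because each base estimator is trained on a bootstrap sample, roughly a third of the training rows are left out of any given subset. These out-of-bag samples provide a built-in validation estimate via the oob_score parameter. A short sketch, using more estimators so that every sample is likely left out at least once:

# Out-of-bag evaluation: score each estimator on samples it never saw
bagging_oob = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                                n_estimators=100, oob_score=True,
                                random_state=42)
bagging_oob.fit(X_train, y_train)
print("OOB score:", bagging_oob.oob_score_)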

 

Bagging with multiple estimators  

   In this part of the tutorial, we use multiple base estimators and compare their performance. To evaluate the estimators, we create a custom classification dataset and apply bagging classification with each one.

 
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

def create_data(N):
    data = np.random.randint(10, size=(N, 3))  # random integers in [0, 10) for columns a, b, c
    sums = data.sum(axis=1)
    y = np.where(sums > 25, 'high',
                 np.where(sums < 12, 'low', 'normal'))  # label each row by its sum
    df = pd.DataFrame(data, columns=['a', 'b', 'c'])  # keep the feature columns numeric
    df['target'] = y
    return df

df = create_data(1000)

Xtrain, Xtest, ytrain, ytest = train_test_split(df[['a', 'b', 'c']], df["target"], test_size=0.2)
 
lr = LogisticRegression()
gnb = GaussianNB()
dt = DecisionTreeClassifier()

base_methods = [lr, gnb, dt]
for bm in base_methods:
    print("Method: ", bm)
    # 'estimator' replaced the deprecated 'base_estimator' argument in scikit-learn 1.2
    bag_model = BaggingClassifier(estimator=bm, n_estimators=100, bootstrap=True)
    bag_model.fit(Xtrain, ytrain)
    ypred = bag_model.predict(Xtest)
    print("Accuracy:", accuracy_score(ytest, ypred))
    print(classification_report(ytest, ypred))

    Results:

 
Method: LogisticRegression()
Accuracy: 1.0
              precision    recall  f1-score   support

        high       1.00      1.00      1.00         1
         low       1.00      1.00      1.00        76
      normal       1.00      1.00      1.00       123

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200

Method: GaussianNB()
Accuracy: 0.91
              precision    recall  f1-score   support

        high       1.00      1.00      1.00         1
         low       1.00      0.76      0.87        76
      normal       0.87      1.00      0.93       123

    accuracy                           0.91       200
   macro avg       0.96      0.92      0.93       200
weighted avg       0.92      0.91      0.91       200

Method: DecisionTreeClassifier()
Accuracy: 0.955
              precision    recall  f1-score   support

        high       1.00      1.00      1.00         1
         low       0.96      0.92      0.94        76
      normal       0.95      0.98      0.96       123

    accuracy                           0.95       200
   macro avg       0.97      0.97      0.97       200
weighted avg       0.96      0.95      0.95       200
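
As a quick sanity check of the variance-reduction claim from the introduction, you can also fit each base estimator on its own, without bagging, and compare its accuracy to the bagged results above. This sketch reuses the variables defined earlier; the exact numbers will vary from run to run since the dataset is random.

# Fit each base estimator alone and compare with its bagged counterpart
for bm in base_methods:
    ypred_single = bm.fit(Xtrain, ytrain).predict(Xtest)
    print(bm.__class__.__name__, "alone:", accuracy_score(ytest, ypred_single))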

 

Conclusion

    In this tutorial, we learned about the Bagging technique and how to classify data using the Scikit-learn BaggingClassifier class. We also implemented multiple estimators for classifying data and evaluated their performance.


Reference

1. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html


