Classification Example with BaggingClassifier in Python

   Bagging, short for Bootstrap Aggregating, is an ensemble learning technique in machine learning that combines multiple models to improve predictive performance. It works by training multiple models independently on different subsets of the training data and then combining their predictions through averaging (for regression) or voting (for classification).

    In this tutorial, we'll explore the basics of the bagging technique and how to implement classification with Scikit-learn's BaggingClassifier class. The tutorial covers:

  1. Introduction to bagging
  2. Bagging with a single estimator
  3. Bagging with multiple estimators  
  4. Conclusion

 


Introduction to Bagging

    Bagging, short for Bootstrap Aggregating, is a widely used technique in ensemble learning to improve the performance of machine learning models.

    In bagging, multiple base learners (often of the same type) are trained independently on different subsets of the training data. These subsets are typically created by sampling the training data with replacement. Each base learner then makes its predictions, and for classification tasks the final prediction is typically obtained by majority voting over the predictions of all base learners.

    The main idea behind bagging is to reduce overfitting and variance by combining the predictions of multiple models trained on different subsets of the data. This often leads to better generalization performance compared to individual models. RandomForest, for example, is a popular ensemble learning method that uses bagging with decision trees as base learners.
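    To make these two steps concrete, here is a minimal sketch of bootstrap sampling and majority voting written with plain NumPy. It is only an illustration of the idea, not part of the tutorial's main code, and the toy arrays are made up for this example.

import numpy as np

rng = np.random.default_rng(42)
X = np.arange(10)  # a toy "training set" of 10 samples

# Bootstrap sampling: draw each subset's indices with replacement
for i in range(3):
    idx = rng.integers(0, len(X), size=len(X))
    print("subset", i, "->", X[idx])  # duplicate samples are expected

# Majority voting: combine the class predictions of three base learners
preds = np.array([[0, 1, 1],   # learner 1's predictions for 3 test samples
                  [0, 1, 0],   # learner 2's predictions
                  [1, 1, 0]])  # learner 3's predictions
final = np.array([np.bincount(col).argmax() for col in preds.T])
print("voted prediction:", final)  # [0 1 0]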

 

Bagging with a single estimator

    Now let's start implementing classification with the bagging method in Python. We'll begin by loading the necessary libraries for this tutorial.

 
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

    Then we load the Iris dataset and split it into train and test sets using the train_test_split function.

 
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

To define a base classifier, we use the DecisionTreeClassifier class, then initialize the bagging classifier with the base estimator and the number of estimators. We train the model on the training data using the fit() method.

 
# Initialize a base classifier (Decision Tree)
base_classifier = DecisionTreeClassifier(random_state=42)

# Initialize a Bagging Classifier with base classifier
bagging_classifier = BaggingClassifier(base_classifier, n_estimators=10, random_state=42)

# Train the bagging classifier
bagging_classifier.fit(X_train, y_train)
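
As a side note, BaggingClassifier also exposes parameters that control how each bootstrap subset is drawn. The variant below is only an illustration; its values are not tuned for this dataset.

# Illustrative variant: each tree sees 80% of the training samples
bagging_variant = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=10,
    max_samples=0.8,   # fraction of training samples drawn for each base estimator
    max_features=1.0,  # fraction of features used by each base estimator
    bootstrap=True,    # sample with replacement (the "bootstrap" in bagging)
    random_state=42,
)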

After training the model, we can predict on the test data and calculate the prediction accuracy.


# Make predictions on the test set
y_pred = bagging_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))
 

The result looks as follows.

 
Accuracy: 1.0
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
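
Because each base estimator is trained on a bootstrap sample, roughly a third of the training rows are left out of any given subset. These out-of-bag samples provide a built-in validation estimate via the oob_score parameter. A short sketch, using more estimators so that every sample is likely left out at least once:

# Out-of-bag evaluation: score each estimator on samples it never saw
bagging_oob = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                                n_estimators=100, oob_score=True,
                                random_state=42)
bagging_oob.fit(X_train, y_train)
print("OOB score:", bagging_oob.oob_score_)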

 

Bagging with multiple estimators  

   In this part of the tutorial, we use multiple base estimators and compare their performance. To evaluate the estimators, we create a custom classification dataset and apply bagging classification with each one.

 
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

def create_data(N):
    data = np.random.randint(10, size=(N, 3))  # random integers in [0, 10) for columns a, b, c
    sums = data.sum(axis=1)
    y = np.where(sums > 25, 'high',
                 np.where(sums < 12, 'low', 'normal'))  # label each row by its sum
    df = pd.DataFrame(data, columns=['a', 'b', 'c'])  # keep the feature columns numeric
    df['target'] = y
    return df

df = create_data(1000)

Xtrain, Xtest, ytrain, ytest = train_test_split(df[['a', 'b', 'c']], df["target"], test_size=0.2)
 
lr = LogisticRegression()
gnb = GaussianNB()
dt = DecisionTreeClassifier()

base_methods = [lr, gnb, dt]
for bm in base_methods:
    print("Method: ", bm)
    # 'estimator' replaced the deprecated 'base_estimator' argument in scikit-learn 1.2
    bag_model = BaggingClassifier(estimator=bm, n_estimators=100, bootstrap=True)
    bag_model.fit(Xtrain, ytrain)
    ypred = bag_model.predict(Xtest)
    print("Accuracy:", accuracy_score(ytest, ypred))
    print(classification_report(ytest, ypred))

    Results:

 
Method: LogisticRegression()
Accuracy: 1.0
              precision    recall  f1-score   support

        high       1.00      1.00      1.00         1
         low       1.00      1.00      1.00        76
      normal       1.00      1.00      1.00       123

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200

Method: GaussianNB()
Accuracy: 0.91
              precision    recall  f1-score   support

        high       1.00      1.00      1.00         1
         low       1.00      0.76      0.87        76
      normal       0.87      1.00      0.93       123

    accuracy                           0.91       200
   macro avg       0.96      0.92      0.93       200
weighted avg       0.92      0.91      0.91       200

Method: DecisionTreeClassifier()
Accuracy: 0.955
              precision    recall  f1-score   support

        high       1.00      1.00      1.00         1
         low       0.96      0.92      0.94        76
      normal       0.95      0.98      0.96       123

    accuracy                           0.95       200
   macro avg       0.97      0.97      0.97       200
weighted avg       0.96      0.95      0.95       200
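
As a quick sanity check of the variance-reduction claim from the introduction, you can also fit each base estimator on its own, without bagging, and compare its accuracy to the bagged results above. This sketch reuses the variables defined earlier; the exact numbers will vary from run to run since the dataset is random.

# Fit each base estimator alone and compare with its bagged counterpart
for bm in base_methods:
    ypred_single = bm.fit(Xtrain, ytrain).predict(Xtest)
    print(bm.__class__.__name__, "alone:", accuracy_score(ytest, ypred_single))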

 

Conclusion

    In this tutorial, we learned about the Bagging technique and how to classify data using the Scikit-learn BaggingClassifier class. We also implemented multiple estimators for classifying data and evaluated their performance.


Reference

1. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html


