Classification with Bagging Classifier in Python

   Bagging (Bootstrap Aggregating) is a widely used ensemble learning algorithm in machine learning. The algorithm builds multiple models from randomly drawn subsets of the training dataset and aggregates these learners into a stronger overall learner. In this post, we'll learn how to classify data with the BaggingClassifier class of the scikit-learn library in Python. The tutorial covers:
  1. Preparing data
  2. Training bagging classifier
  3. Predicting test data and checking the accuracy
  4. Checking accuracy by changing base estimator
   We'll start by loading the required libraries.

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
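
Before moving on, the bagging idea described in the introduction can be sketched by hand: draw bootstrap samples of the training rows, fit one model per sample, and combine the predictions by majority vote. This is only an illustration under assumptions of my own. The iris data, the ensemble size of 25, and all variable names are placeholders, and BaggingClassifier itself averages predicted probabilities when the base estimator provides them rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris is a stand-in dataset here (an assumption, not the data used in this post).
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # hypothetical ensemble size
all_preds = []

for _ in range(n_models):
    # Bootstrap step: sample len(X_tr) row indices with replacement.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    all_preds.append(tree.predict(X_te))

# Aggregation step: majority vote across models for each test row.
all_preds = np.array(all_preds)  # shape (n_models, n_test)
manual_pred = np.array([np.bincount(col).argmax() for col in all_preds.T])
manual_acc = (manual_pred == y_te).mean()
print("manual bagging accuracy:", manual_acc)
```

The BaggingClassifier class used in the rest of the post wraps these two steps behind a single fit and predict interface.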

Preparing data

   In this tutorial, I'll generate a dataset of random numbers with the function below. You can use any classification dataset instead, such as iris. After generating the data, we separate it into X (the features) and Y (the output label).

def CreateDataFrame(N):
    # Build N rows of three random integer features and a label
    # derived from their sum.
    columns = ['a', 'b', 'c', 'y']
    df = pd.DataFrame(columns=columns)
    for i in range(N):
        a = np.random.randint(10)   # 0..9
        b = np.random.randint(20)   # 0..19
        c = np.random.randint(5)    # 0..4
        y = "normal"
        if (a + b + c) > 25:
            y = "high"
        elif (a + b + c) < 12:
            y = "low"

        df.loc[i] = [a, b, c, y]
    return df

df = CreateDataFrame(200)
df.head()
   a   b  c       y
0  4  13  0  normal
1  9  17  2    high
2  5   3  0     low
3  7   1  4  normal
4  3   6  3  normal 

X = df[["a","b","c"]]
Y = df[["y"]]

As you can see, the output Y is categorical. It needs to be encoded, so we convert it to numeric codes with the LabelEncoder class.

Y.head()
        y
0    high
1  normal
2  normal
3  normal
4     low
 
le = LabelEncoder()
# Pass the column as a 1-D Series; a one-column DataFrame triggers a DataConversionWarning
y = le.fit_transform(Y["y"])
 
print(y[0:5])
[0 2 2 2 1] 
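
To see which code belongs to which label, we can inspect the fitted encoder. The short sketch below re-creates the five labels shown above in a plain list (the names labels, enc and codes are only for illustration). LabelEncoder sorts the classes, so the codes follow alphabetical order.

```python
from sklearn.preprocessing import LabelEncoder

# The five labels printed by Y.head() above, as a plain list.
labels = ["high", "normal", "normal", "normal", "low"]

enc = LabelEncoder()
codes = enc.fit_transform(labels)

# classes_ holds the original labels; the index of each label is its code.
print(list(enc.classes_))                   # ['high', 'low', 'normal']
print(codes)                                # [0 2 2 2 1]
# inverse_transform maps codes back to the original labels.
print(list(enc.inverse_transform(codes)))
```

This mapping is also what the rows and columns of the confusion matrices later in the post refer to: index 0 is "high", 1 is "low", and 2 is "normal".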

Next, we split the data into train and test parts with the train_test_split function.

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
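
With no test_size argument, train_test_split holds out 25% of the rows, so 200 rows give 150 for training and 50 for testing. The sketch below checks this on placeholder arrays and also shows the optional stratify argument, which keeps class proportions the same in both parts. The imbalanced label counts (20, 40, 140) are made up for the demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 200 rows, 3 features, hypothetical imbalanced labels.
X_demo = np.arange(600).reshape(200, 3)
y_demo = np.array([0] * 20 + [1] * 40 + [2] * 140)

# Default split: 75% train, 25% test.
X_a, X_b, y_a, y_b = train_test_split(X_demo, y_demo, random_state=0)
print(X_a.shape, X_b.shape)  # (150, 3) (50, 3)

# Stratified split: each class keeps its share in the test part.
_, _, _, y_strat = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)
strat_counts = np.bincount(y_strat)
print(strat_counts)  # [ 5 10 35]
```

Stratifying can be worth considering for data like ours, where the "high" and "low" classes are much rarer than "normal".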


Training bagging classifier

We use the BaggingClassifier class from the sklearn.ensemble module to build the bagging classifier model. Here, we set a DecisionTreeClassifier as the base estimator and the number of estimators to 100, then train the model on the training data.

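dtc = DecisionTreeClassifier(criterion="entropy")
# estimator= requires scikit-learn >= 1.2 (the argument was named base_estimator in older releases)
bag_model = BaggingClassifier(estimator=dtc, n_estimators=100, bootstrap=True)
bag_model = bag_model.fit(Xtrain, ytrain)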
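
Because bootstrap=True samples rows with replacement, each tree never sees some of the training rows. BaggingClassifier can score every row using only the trees that did not train on it, which gives an out-of-bag accuracy estimate without a separate test set. The sketch below uses iris as a stand-in dataset and passes the base estimator positionally so that it runs regardless of whether the keyword is named estimator or base_estimator. The variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Iris stands in for the post's generated data (an assumption).
X_demo, y_demo = load_iris(return_X_y=True)

oob_model = BaggingClassifier(
    DecisionTreeClassifier(criterion="entropy"),  # base estimator, passed positionally
    n_estimators=100,
    bootstrap=True,
    oob_score=True,   # score each row with trees that did not train on it
    random_state=0,
)
oob_model.fit(X_demo, y_demo)

print("OOB accuracy estimate:", oob_model.oob_score_)
print("fitted trees:", len(oob_model.estimators_))
```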

Next, we predict the test data and check the prediction accuracy.

ytest_pred=bag_model.predict(Xtest)
print(bag_model.score(Xtest, ytest))
0.9 
print(confusion_matrix(ytest, ytest_pred)) 
[[ 4  0  1]
 [ 0 10  1]
 [ 1  2 31]] 

The model predicted the test data with 90% accuracy.
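
The 90% figure can be read directly off the confusion matrix: the diagonal holds the correctly classified rows, so accuracy is the diagonal sum divided by the total. The sketch below redoes that arithmetic on the matrix printed above and also computes per-class recall, which is not part of the original output but follows from the same numbers (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix printed above (class order: high, low, normal).
cm = np.array([[4, 0, 1],
               [0, 10, 1],
               [1, 2, 31]])

# Accuracy = correct predictions (diagonal) / all predictions.
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 0.9, i.e. 45 of 50

# Recall per class = diagonal entry / row total (number of true members).
recall_per_class = np.diag(cm) / cm.sum(axis=1)
print(recall_per_class)
```

Per-class recall is useful here because the overall accuracy is dominated by the large "normal" class.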

Checking accuracy by changing base estimator

We can change the base estimator in the BaggingClassifier class. Here, we'll use logistic regression and Naive Bayes (Gaussian and Bernoulli) as base estimators, alongside the decision tree from above, and check their prediction accuracy.

lr = LogisticRegression()
bnb = BernoulliNB()
gnb = GaussianNB()

base_methods = [lr, bnb, gnb, dtc]
for bm in base_methods:
    print("Method: ", bm)
    # estimator= requires scikit-learn >= 1.2 (base_estimator in older releases)
    bag_model = BaggingClassifier(estimator=bm, n_estimators=100, bootstrap=True)
    bag_model = bag_model.fit(Xtrain, ytrain)
    ytest_pred = bag_model.predict(Xtest)
    print(bag_model.score(Xtest, ytest))
    print(confusion_matrix(ytest, ytest_pred))
 
Method: LogisticRegression(C=1.0,class_weight=None,dual=False,fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
0.9
[[ 0  0  5]
 [ 0 11  0]
 [ 0  0 34]]
Method: BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
0.74
[[ 0  0  5]
 [ 0  3  8]
 [ 0  0 34]]
Method: GaussianNB(priors=None)
0.82
[[ 1  0  4]
 [ 0 11  0]
 [ 1  4 29]]
Method: DecisionTreeClassifier(class_weight=None,criterion='entropy',max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
0.92
[[ 4  0  1]
 [ 0 10  1]
 [ 1  1 32]]  

You can compare each method's prediction accuracy in the results above.
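
A single 50-row test split gives a fairly noisy comparison, so a difference such as 0.90 versus 0.92 may not mean much. One way to get a steadier picture is cross-validation. The sketch below is an assumption-laden variant of the loop above: it rebuilds similar data in vectorized form with a fixed seed, uses 25 estimators and 5 folds to keep it quick, and raises max_iter for logistic regression. None of these settings come from the original experiment.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Data generated by the same rule as CreateDataFrame, but vectorized and seeded.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.integers(0, 10, 200),
    rng.integers(0, 20, 200),
    rng.integers(0, 5, 200),
])
total = X_demo.sum(axis=1)
y_demo = np.where(total > 25, 0, np.where(total < 12, 1, 2))  # 0=high, 1=low, 2=normal

candidates = [
    LogisticRegression(max_iter=1000),
    GaussianNB(),
    DecisionTreeClassifier(criterion="entropy"),
]

results = {}
for est in candidates:
    # Base estimator passed positionally so the keyword name does not matter.
    model = BaggingClassifier(est, n_estimators=25, random_state=0)
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[type(est).__name__] = scores.mean()

for name, mean_acc in results.items():
    print(name, round(mean_acc, 3))
```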

   In this post, we've briefly learned how to classify data with BaggingClassifier class in Python. Thank you for reading! The full source code is listed below.

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression

def CreateDataFrame(N):
    # Build N rows of three random integer features and a label
    # derived from their sum.
    columns = ['a', 'b', 'c', 'y']
    df = pd.DataFrame(columns=columns)
    for i in range(N):
        a = np.random.randint(10)   # 0..9
        b = np.random.randint(20)   # 0..19
        c = np.random.randint(5)    # 0..4
        y = "normal"
        if (a + b + c) > 25:
            y = "high"
        elif (a + b + c) < 12:
            y = "low"

        df.loc[i] = [a, b, c, y]
    return df

df = CreateDataFrame(200)
df.head()

X = df[["a","b","c"]]
Y = df[["y"]]

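le = LabelEncoder()
# Pass the column as a 1-D Series; a one-column DataFrame triggers a DataConversionWarning
y = le.fit_transform(Y["y"])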

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

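dtc = DecisionTreeClassifier(criterion="entropy")
# estimator= requires scikit-learn >= 1.2 (the argument was named base_estimator in older releases)
bag_model = BaggingClassifier(estimator=dtc, n_estimators=100, bootstrap=True)
bag_model = bag_model.fit(Xtrain, ytrain)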
ytest_pred=bag_model.predict(Xtest)
print(bag_model.score(Xtest, ytest))
print(confusion_matrix(ytest, ytest_pred)) 

lr = LogisticRegression()
bnb = BernoulliNB()
gnb = GaussianNB()

base_methods = [lr, bnb, gnb, dtc]
for bm in base_methods:
    print("Method: ", bm)
    # estimator= requires scikit-learn >= 1.2 (base_estimator in older releases)
    bag_model = BaggingClassifier(estimator=bm, n_estimators=100, bootstrap=True)
    bag_model = bag_model.fit(Xtrain, ytrain)
    ytest_pred = bag_model.predict(Xtest)
    print(bag_model.score(Xtest, ytest))
    print(confusion_matrix(ytest, ytest_pred))


Reference:

1. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html



2 comments:

  1. Hi dude, this post is one of the most simple and explanatory I could find, it helped me a lot.

    Only in the part of:
    Y = df[["y"]]
    I changed:
    Y = df[["y"][0]]

    So that Python doesn't show a message on the terminal.
