Classification with Adaboost Classifier in Python

   AdaBoost stands for Adaptive Boosting, and it is a widely used ensemble learning algorithm in machine learning. It boosts weak learners, base classifiers such as shallow decision trees, by repeatedly increasing the weights of misclassified training samples and combining the learners into a weighted vote that forms the final model; a minimal sketch of this loop follows the outline below. In this post, we'll learn how to classify data with the AdaBoost Classifier model in Python. This tutorial covers:
  1. Preparing data
  2. Training the Adaboost Classifier model
  3. Predicting test data and checking the accuracy
  4. Testing iris dataset with different base classifiers 
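
To make the reweighting idea concrete, here is a minimal, illustrative sketch of the discrete SAMME boosting loop (the multiclass AdaBoost variant). This is a simplified teaching version written for this post, not scikit-learn's implementation, and the function names are made up here.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit_sketch(X, y, n_rounds=10):
    y = np.asarray(y)
    n, K = len(y), len(np.unique(y))
    w = np.full(n, 1.0 / n)                   # start with uniform sample weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)   # weighted error rate
        if err >= 1.0 - 1.0 / K:              # no better than chance: stop
            break
        alpha = np.log((1 - err) / max(err, 1e-10)) + np.log(K - 1)
        w *= np.exp(alpha * (pred != y))      # up-weight misclassified samples
        w /= w.sum()                          # renormalize the weights
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict_sketch(learners, alphas, X, classes):
    classes = np.asarray(classes)
    votes = np.zeros((len(X), len(classes)))  # weighted vote per class
    for stump, alpha in zip(learners, alphas):
        pred = stump.predict(X)
        for k, c in enumerate(classes):
            votes[:, k] += alpha * (pred == c)
    return classes[np.argmax(votes, axis=1)]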
We'll start by loading the required libraries.

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets


Preparing data

   In this tutorial, we'll first generate a dataset of random numbers labeled by some simple rules, and then check the iris dataset with the AdaBoost Classifier. The function below helps us create the dataset.

def CreateDataFrame(N):
    columns = ['a', 'b', 'c', 'y']
    df = pd.DataFrame(columns=columns)
    for i in range(N):
        # three random integer features
        a = np.random.randint(10)
        b = np.random.randint(20)
        c = np.random.randint(5)
        # label each row by a simple rule on the feature sum
        y = "normal"
        if (a + b + c) > 25:
            y = "high"
        elif (a + b + c) < 12:
            y = "low"
        df.loc[i] = [a, b, c, y]
    return df

df = CreateDataFrame(200)
print(df.head())
   a   b  c       y
0  4  11  3  normal
1  0   9  1     low
2  2  18  0  normal
3  9  11  1  normal
4  4   7  1  normal 
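
Row-by-row assignment with df.loc works fine for 200 rows but becomes slow for large N. If needed, the same dataset can be generated in vectorized form; this is an optional alternative, and create_dataframe_vectorized is a name made up here.

def create_dataframe_vectorized(n, seed=None):
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "a": rng.integers(0, 10, n),
        "b": rng.integers(0, 20, n),
        "c": rng.integers(0, 5, n),
    })
    total = df["a"] + df["b"] + df["c"]
    # same labeling rule as above
    df["y"] = np.select([total > 25, total < 12], ["high", "low"], default="normal")
    return df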

Here, y is the output column, and it is categorical. We need to convert it into a numeric type. First, we separate the features X from the target Y, and then encode Y with LabelEncoder().

X = df[["a", "b", "c"]]
Y = df[["y"]]

le = LabelEncoder()
y = le.fit_transform(Y.values.ravel())

print(Y.head())
        y
0    high
1  normal
2  normal
3     low
4  normal 
 
print(y[0:5])
[0 2 2 1 2]
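
The encoder remembers the label mapping, so encoded values (and later predictions) can always be turned back into the original strings with inverse_transform(); a small optional check:

print(le.classes_)                    # the original labels, e.g. ['high' 'low' 'normal']
print(le.inverse_transform(y[0:5]))   # decode the first five encoded labels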

Next, we'll split the X and y data into training and test parts with train_test_split().

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)


Training the Adaboost Classifier model

We use the AdaBoostClassifier class of the sklearn.ensemble package to build the AdaBoost Classifier model. As a base classifier, we use a DecisionTreeClassifier and train the model with the training data.

dtc = DecisionTreeClassifier(criterion="entropy", max_depth=3)
# note: in scikit-learn 1.2+, the base_estimator parameter is named estimator
ada_model = AdaBoostClassifier(base_estimator=dtc, n_estimators=100)
ada_model = ada_model.fit(Xtrain, ytrain)
print(ada_model)
AdaBoostClassifier(algorithm='SAMME.R',
    base_estimator=DecisionTreeClassifier(class_weight=None, 
    criterion='entropy',max_depth=3,
    max_features=None, max_leaf_nodes=None,
    min_impurity_decrease=0.0, min_impurity_split=None,
    min_samples_leaf=1, min_samples_split=2,
    min_weight_fraction_leaf=0.0, presort=False, random_state=None,
    splitter='best'),
    learning_rate=1.0, n_estimators=100, random_state=None)
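
Since the ensemble is built one estimator at a time, we can also watch how the test accuracy evolves as boosting rounds are added with staged_score(); an optional sketch:

# test accuracy after each boosting round, sampled every 20 rounds
for i, score in enumerate(ada_model.staged_score(Xtest, ytest), start=1):
    if i % 20 == 0:
        print("after %3d estimators: %.3f" % (i, score))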


Predicting test data and checking the accuracy

After training, we can classify the test data and check the accuracy of the model.

ytest_pred = ada_model.predict(Xtest)
print(ada_model.score(Xtest, ytest))
0.94
print(confusion_matrix(ytest, ytest_pred))
[[ 3  0  2]
 [ 0 16  0]
 [ 0  1 28]]
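
Accuracy alone can hide weak per-class results (the first class is the smallest here), so a per-class report is worth printing as well; a small optional addition:

from sklearn.metrics import classification_report

# precision/recall/F1 per class, using the encoder's original label names
print(classification_report(ytest, ytest_pred, target_names=le.classes_))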
 

Testing iris dataset with different base classifiers 

Next, we apply the AdaBoost classification method to the iris dataset. Here, we prepare the data the same way as we did above.

iris = datasets.load_iris()
X = iris.data
Y = iris.target

# iris.target is already numeric, so this encoding is a no-op;
# it is kept only to mirror the steps above
le = LabelEncoder()
y = le.fit_transform(Y)

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

We check the performance of the model by changing the base classifier to the Naive Bayes and Random Forest methods.
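
One requirement to keep in mind: AdaBoost reweights the training samples at every round, so the base estimator's fit() must accept a sample_weight argument. A quick, optional way to verify this:

from sklearn.utils.validation import has_fit_parameter

print(has_fit_parameter(GaussianNB(), "sample_weight"))  # True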

gnb = GaussianNB()
rf = RandomForestClassifier(n_estimators=10)

base_methods = [rf, gnb, dtc]
for bm in base_methods:
    print("Method: ", bm)
    ada_model = AdaBoostClassifier(base_estimator=bm)
    ada_model = ada_model.fit(Xtrain, ytrain)
    ytest_pred = ada_model.predict(Xtest)
    print(ada_model.score(Xtest, ytest))
    print(confusion_matrix(ytest, ytest_pred))
 


Method: RandomForestClassifier(bootstrap=True, class_weight=None,criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
0.9736842105263158
[[13  0  0]
 [ 0 15  1]
 [ 0  0  9]]
Method:  GaussianNB(priors=None)
0.9736842105263158
[[13  0  0]
 [ 0 15  1]
 [ 0  0  9]]
Method: DecisionTreeClassifier(class_weight=None,criterion='entropy',max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
0.9736842105263158
[[13  0  0]
 [ 0 15  1]
 [ 0  0  9]] 

Here, all three base classifiers achieved the same accuracy on this test set.
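
A single train/test split on 150 iris samples is a fairly noisy comparison; if you want a more robust check, cross-validation averages the score over several splits. An optional sketch:

from sklearn.model_selection import cross_val_score

for bm in [rf, gnb, dtc]:
    scores = cross_val_score(AdaBoostClassifier(base_estimator=bm), X, y, cv=5)
    print(type(bm).__name__, "mean cv accuracy: %.3f" % scores.mean())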

   In this post, we have briefly learned how to use the AdaBoost Classifier to classify data in Python.

Thank you for reading. The full source code is listed below.

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets

def CreateDataFrame(N):
    columns = ['a', 'b', 'c', 'y']
    df = pd.DataFrame(columns=columns)
    for i in range(N):
        # three random integer features
        a = np.random.randint(10)
        b = np.random.randint(20)
        c = np.random.randint(5)
        # label each row by a simple rule on the feature sum
        y = "normal"
        if (a + b + c) > 25:
            y = "high"
        elif (a + b + c) < 12:
            y = "low"
        df.loc[i] = [a, b, c, y]
    return df

df = CreateDataFrame(200)
print(df.head())

X = df[["a","b","c"]]
Y = df[["y"]]

le = LabelEncoder()
y = le.fit_transform(Y.values.ravel())

print(Y.head())
print(y[0:5])

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

dtc = DecisionTreeClassifier(criterion="entropy", max_depth=3)
ada_model = AdaBoostClassifier(base_estimator=dtc, n_estimators=100)  # estimator= in scikit-learn 1.2+
ada_model = ada_model.fit(Xtrain, ytrain)
ytest_pred = ada_model.predict(Xtest)
print(ada_model.score(Xtest, ytest))
print(confusion_matrix(ytest, ytest_pred))

iris = datasets.load_iris()
X = iris.data
Y = iris.target

le = LabelEncoder()
y = le.fit_transform(Y)  # iris.target is already numeric; kept to mirror the steps above

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

gnb = GaussianNB()
rf = RandomForestClassifier(n_estimators=10)

base_methods = [rf, gnb, dtc]
for bm in base_methods:
    print("Method: ", bm)
    ada_model = AdaBoostClassifier(base_estimator=bm)
    ada_model = ada_model.fit(Xtrain, ytrain)
    ytest_pred = ada_model.predict(Xtest)
    print(ada_model.score(Xtest, ytest))
    print(confusion_matrix(ytest, ytest_pred))


