Gradient Boosting Classification Example in Python


   Boosting is an ensemble learning technique that is widely used in regression and classification problems. The main idea of boosting is to combine many weak learners into a single strong learner.
   In gradient boosting, the learners are added one at a time. The predictions of the current ensemble are compared with the actual values, and the gradient of the loss function with respect to those predictions shows the direction in which the error decreases. Each new weak learner is fitted to that negative gradient, so adding it to the ensemble lowers the loss in the next round.
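To make the idea of fitting each new learner to the current errors concrete, here is a minimal sketch for a regression problem with squared error, where the negative gradient is simply the residual. The function name fit_boosted_stumps and the toy sine data are assumptions made for illustration; this is not how GradientBoostingClassifier is implemented internally, since the classifier works with a classification loss rather than squared error.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_stumps(X, y, n_rounds=50, learning_rate=0.1):
    """Hypothetical helper: boost depth-1 trees on squared-error residuals."""
    base = float(np.mean(y))              # initial constant prediction
    pred = np.full(len(y), base)
    trees, losses = [], []
    for _ in range(n_rounds):
        residuals = y - pred              # negative gradient of 0.5 * (y - pred)^2
        tree = DecisionTreeRegressor(max_depth=1)
        tree.fit(X, residuals)            # the weak learner models the current error
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
        losses.append(float(np.mean((y - pred) ** 2)))
    return base, trees, losses

# Toy data (an assumption for the demo): a noisy sine curve
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.1, size=200)

base, trees, losses = fit_boosted_stumps(X_toy, y_toy)
print("MSE after first round:", losses[0])
print("MSE after last round: ", losses[-1])
```

The learning_rate argument here plays the same role as the parameter of the same name used later in this post: it shrinks the contribution of each tree, so smaller values usually need more rounds to reach the same training error.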

   In this post, we'll learn how to classify data with GradientBoostingClassifier in Python. We'll also vary the learning rate and the number of estimators to see which settings give the best accuracy. The tutorial covers:
  • Preparing data
  • Prediction with GradientBoostingClassifier
  • Checking the learning rate 
  • Checking estimator number
 We'll start by loading the required libraries.

import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

Preparing data 

   First, we generate a random dataset for this tutorial. We create a data frame with three integer features (a, b, c) and a label y derived from their sum, then separate it into the feature part X and the label part Y. Finally, we split X and Y into train and test parts.

def CreateDataFrame(N):
 columns = ['a','b','c','y']
 df = pd.DataFrame(columns=columns)
 for i in range(N):
  a = np.random.randint(10)
  b = np.random.randint(20)
  c = np.random.randint(5)
  y = "normal"
  if((a+b+c)>25):
   y="high"
  elif((a+b+c)<12):
   y= "low"

  df.loc[i]= [a, b, c, y]
 return df

df = CreateDataFrame(200)
>>> df.head(10)
   a   b  c       y
0  2   6  0     low
1  5   6  0     low
2  6   2  4  normal
3  1  10  1  normal
4  0   2  3     low
5  4   0  1     low
6  8   2  3  normal
7  2   7  0     low
8  6   2  3     low
9  8  13  1  normal

X = df[["a","b","c"]]
Y = df[["y"]]
Xtrain, Xtest, ytrain, ytest = train_test_split(X, Y, random_state=0)
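The row-by-row loop above is easy to read, but it is slow for large N and produces different data on every run. As an alternative, here is a vectorized sketch that applies the same labeling rule. The function name create_data_frame_vectorized and the seed argument are assumptions introduced for this example, and its random draws will not match the output shown above.

```python
import numpy as np
import pandas as pd

def create_data_frame_vectorized(n, seed=0):
    """Hypothetical variant of CreateDataFrame that draws all rows at once."""
    rng = np.random.default_rng(seed)       # fixed seed makes the data reproducible
    a = rng.integers(0, 10, size=n)         # same ranges as np.random.randint(10), etc.
    b = rng.integers(0, 20, size=n)
    c = rng.integers(0, 5, size=n)
    total = a + b + c
    # Same rule as above: > 25 is "high", < 12 is "low", everything else "normal"
    y = np.where(total > 25, "high", np.where(total < 12, "low", "normal"))
    return pd.DataFrame({"a": a, "b": b, "c": c, "y": y})

df_fast = create_data_frame_vectorized(200)
print(df_fast.head())
print(df_fast["y"].value_counts())
```

Because the columns are built from NumPy integer arrays, a, b, and c keep an integer dtype, which is worth checking with df.dtypes when a frame is filled one row at a time.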

Prediction with GradientBoostingClassifier

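The 'sklearn' package provides the GradientBoostingClassifier class to build a gradient boosting model. We can create it, train the model, and get the prediction results as shown below. The parameter listing printed after fitting comes from the scikit-learn version used when this post was written, so newer versions may display it differently.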

gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.5, max_depth=1)
gbc.fit(Xtrain, np.ravel(ytrain, order='C'))
ypred = gbc.predict(Xtest)
 
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.5, loss='deviance', max_depth=1,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False) 
 
print(gbc.score(Xtest, ytest))
0.86
print(confusion_matrix(ytest, ypred))
[[ 2  0  1]
 [ 0 17  2]
 [ 2  2 24]]
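The rows and columns of the confusion matrix follow the sorted class labels, which is easy to forget when reading the raw array. The sketch below wraps the matrix in a labeled data frame using the fitted model's classes_ attribute. It builds its own small dataset with the same labeling rule so that it runs on its own; the variable names and the fixed seeds are assumptions, and the numbers it prints will differ from the output above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Demo data following the same rule as CreateDataFrame (seed is an assumption)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "a": rng.integers(0, 10, size=300),
    "b": rng.integers(0, 20, size=300),
    "c": rng.integers(0, 5, size=300),
})
total = X_demo.sum(axis=1)
y_demo = np.where(total > 25, "high", np.where(total < 12, "low", "normal"))

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.5,
                                   max_depth=1, random_state=0)
model.fit(Xtr, ytr)

# classes_ holds the labels in the order used for the matrix rows and columns
labels = list(model.classes_)
cm = confusion_matrix(yte, model.predict(Xte), labels=labels)
cm_table = pd.DataFrame(cm,
                        index=["true_" + str(l) for l in labels],
                        columns=["pred_" + str(l) for l in labels])
print(cm_table)
```

Read this way, each row shows how the true members of one class were distributed over the predicted classes.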

Checking the learning rate 

Next, we'll try several learning rate values and print the results for each one.

# find optimal learning rate value
learning_rate = [0.01, 0.05, 0.1, 0.5, 1]
for n in learning_rate:
 gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=n, max_depth=1)
 gbc.fit(Xtrain, np.ravel(ytrain, order='C'))
 ypred = gbc.predict(Xtest)
 acc=gbc.score(Xtest, ytest) 
 print("Learning rate: ",n, "  Accuracy: ", acc)
 print("Confusion matrix:")
 print(confusion_matrix(ytest, ypred))
 
...
Learning rate:  0.01   Accuracy:  0.78
Confusion matrix:
[[ 0  0  3]
 [ 0 17  2]
 [ 0  6 22]]
...
Learning rate:  0.05   Accuracy:  0.8
Confusion matrix:
[[ 0  0  3]
 [ 0 16  3]
 [ 0  4 24]]
...
Learning rate:  0.1   Accuracy:  0.86
Confusion matrix:
[[ 1  0  2]
 [ 0 16  3]
 [ 0  2 26]]
...
Learning rate:  0.5   Accuracy:  0.86
Confusion matrix:
[[ 2  0  1]
 [ 0 17  2]
 [ 2  2 24]]
...
Learning rate:  1   Accuracy:  0.9
Confusion matrix:
[[ 1  0  2]
 [ 0 18  1]
 [ 0  2 26]]
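With only 50 test rows, a single misclassified sample changes the accuracy by 0.02, so differences between neighboring learning rates can come down to the particular split. One way to get a steadier comparison is k-fold cross-validation. The sketch below uses cross_val_score with 5 folds on a self-contained demo dataset; the dataset, the seeds, and the choice of 5 folds are assumptions, and no particular scores are implied.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Demo data following the same rule as CreateDataFrame (seed is an assumption)
rng = np.random.default_rng(0)
X_cv = pd.DataFrame({
    "a": rng.integers(0, 10, size=300),
    "b": rng.integers(0, 20, size=300),
    "c": rng.integers(0, 5, size=300),
})
total = X_cv.sum(axis=1)
y_cv = np.where(total > 25, "high", np.where(total < 12, "low", "normal"))

mean_scores = {}
for rate in [0.01, 0.05, 0.1, 0.5, 1]:
    model = GradientBoostingClassifier(n_estimators=100, learning_rate=rate,
                                       max_depth=1, random_state=0)
    # For classifiers, an integer cv uses stratified folds by default
    scores = cross_val_score(model, X_cv, y_cv, cv=5)
    mean_scores[rate] = scores.mean()
    print("Learning rate:", rate,
          " Mean accuracy:", round(scores.mean(), 3),
          " Std:", round(scores.std(), 3))
```

The standard deviation across folds gives a rough sense of how much of the gap between two settings is noise.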

Checking estimator number

Next, we'll try several values for the number of estimators and print the results for each one.

# find optimal number of estimators
estimators = [10, 50, 100, 200, 500]
for e in estimators:
 gbc = GradientBoostingClassifier(n_estimators=e, learning_rate=1, max_depth=1)
 gbc.fit(Xtrain, np.ravel(ytrain, order='C'))
 ypred = gbc.predict(Xtest)
 acc=gbc.score(Xtest, ytest) 
 print("Number of estimators: ",e, "  Accuracy: ", acc)
 print("Confusion matrix:")
 print(confusion_matrix(ytest, ypred)) 
 
...
Number of estimators:  10   Accuracy:  0.86
Confusion matrix:
[[ 1  0  2]
 [ 0 17  2]
 [ 1  2 25]]
...
Number of estimators:  50   Accuracy:  0.9
Confusion matrix:
[[ 2  0  1]
 [ 0 17  2]
 [ 0  2 26]]
...
Number of estimators:  100   Accuracy:  0.9
Confusion matrix:
[[ 1  0  2]
 [ 0 18  1]
 [ 0  2 26]]
...
Number of estimators:  200   Accuracy:  0.9
Confusion matrix:
[[ 1  0  2]
 [ 0 18  1]
 [ 0  2 26]]
...
Number of estimators:  500   Accuracy:  0.9
Confusion matrix:
[[ 1  0  2]
 [ 0 18  1]
 [ 0  2 26]]
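Refitting the model for every estimator count repeats a lot of work, because a model with 500 estimators already contains the first 10, 50, 100, and 200 stages. As a sketch of an alternative, the staged_predict method of GradientBoostingClassifier yields the predictions after each boosting stage from a single fit. The demo data, seeds, and variable names below are assumptions, and the printed accuracies will not match the output above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Demo data following the same rule as CreateDataFrame (seed is an assumption)
rng = np.random.default_rng(0)
X_st = pd.DataFrame({
    "a": rng.integers(0, 10, size=300),
    "b": rng.integers(0, 20, size=300),
    "c": rng.integers(0, 5, size=300),
})
total = X_st.sum(axis=1)
y_st = np.where(total > 25, "high", np.where(total < 12, "low", "normal"))
Xtr_st, Xte_st, ytr_st, yte_st = train_test_split(X_st, y_st, random_state=0)

# Fit once with the largest number of estimators of interest
gbc_staged = GradientBoostingClassifier(n_estimators=500, learning_rate=1,
                                        max_depth=1, random_state=0)
gbc_staged.fit(Xtr_st, ytr_st)

# staged_predict yields test predictions after stage 1, 2, ..., 500
staged_acc = [accuracy_score(yte_st, pred)
              for pred in gbc_staged.staged_predict(Xte_st)]

for n in [10, 50, 100, 200, 500]:
    print("Number of estimators:", n, " Accuracy:", staged_acc[n - 1])
```

The full staged_acc list can also be plotted against the stage number to see where the accuracy levels off.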

   In this post, we've learned how to classify data with the GradientBoostingClassifier model in Python and how the learning rate and the number of estimators affect its accuracy.

The full source is listed below.

import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

def CreateDataFrame(N):
 columns = ['a','b','c','y']
 df = pd.DataFrame(columns=columns)
 for i in range(N):
  a = np.random.randint(10)
  b = np.random.randint(20)
  c = np.random.randint(5)
  y = "normal"
  if((a+b+c)>25):
   y="high"
  elif((a+b+c)<12):
   y= "low"

  df.loc[i]= [a, b, c, y]
 return df

df = CreateDataFrame(200)
df.head()

X = df[["a","b","c"]]
Y = df[["y"]]
Xtrain, Xtest, ytrain, ytest = train_test_split(X, Y, random_state=0)

# build and train GradientBoostingClassifier model
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.5, max_depth=1)
gbc.fit(Xtrain, np.ravel(ytrain, order='C'))
ypred = gbc.predict(Xtest)
print(gbc.score(Xtest, ytest))
print(confusion_matrix(ytest, ypred)) 

# find optimal learning rate value
learning_rate = [0.01, 0.05, 0.1, 0.5, 1]
for n in learning_rate:
 gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=n, max_depth=1)
 gbc.fit(Xtrain, np.ravel(ytrain, order='C'))
 ypred = gbc.predict(Xtest)
 acc=gbc.score(Xtest, ytest) 
 print("Learning rate: ",n, "  Accuracy: ", acc)
 print("Confusion matrix:")
 print(confusion_matrix(ytest, ypred))

# find optimal number of estimators
estimators = [10, 50, 100, 200, 500]
for e in estimators:
 gbc = GradientBoostingClassifier(n_estimators=e, learning_rate=1, max_depth=1)
 gbc.fit(Xtrain, np.ravel(ytrain, order='C'))
 ypred = gbc.predict(Xtest)
 acc=gbc.score(Xtest, ytest) 
 print("Number of estimators: ",e, "  Accuracy: ", acc)
 print("Confusion matrix:")
 print(confusion_matrix(ytest, ypred)) 
 
 