In gradient boosting, each weak learner's prediction is compared with the actual value and the error is measured. Based on this error, the model computes the gradient of the loss function and updates the ensemble to reduce the error in the next round. In other words, each new weak learner is fit to the negative gradient of the loss function.
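To make this concrete, below is a minimal sketch of boosting for regression with squared loss, where the negative gradient is simply the residual y - F(x). This toy example is illustrative only and not part of the tutorial's code; all names in it are hypothetical.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X).ravel()

learning_rate = 0.5
F = np.zeros_like(y)              # start from a zero prediction
for _ in range(50):               # 50 boosting rounds
    residual = y - F              # negative gradient of squared loss
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    F += learning_rate * stump.predict(X)   # move predictions toward y

print("Training MSE after boosting:", np.mean((y - F) ** 2))

Each stump corrects the remaining error of the ensemble built so far, which is the same mechanism GradientBoostingClassifier applies to a classification loss.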
In this post, we'll learn how to classify data with the GradientBoostingClassifier in Python. We'll also vary the learning rate and the number of estimators to look for good parameter settings. The tutorial covers:
- Preparing data
- Prediction with GradientBoostingClassifier
- Checking the learning rate
- Checking the number of estimators
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
Preparing data
First, we generate a random dataset for this tutorial. Here, we create a data frame and separate it into the feature (X) and label (Y) parts. Then, we split the X, Y data into train and test sets.
def CreateDataFrame(N):
    columns = ['a','b','c','y']
    df = pd.DataFrame(columns=columns)
    for i in range(N):
        a = np.random.randint(10)
        b = np.random.randint(20)
        c = np.random.randint(5)
        y = "normal"
        if (a + b + c) > 25:
            y = "high"
        elif (a + b + c) < 12:
            y = "low"
        df.loc[i] = [a, b, c, y]
    return df

df = CreateDataFrame(200)

>>> df.head(10)
   a   b  c       y
0  2   6  0     low
1  5   6  0     low
2  6   2  4  normal
3  1  10  1  normal
4  0   2  3     low
5  4   0  1     low
6  8   2  3  normal
7  2   7  0     low
8  6   2  3     low
9  8  13  1  normal

X = df[["a","b","c"]]
Y = df[["y"]]
Xtrain, Xtest, ytrain, ytest = train_test_split(X, Y, random_state=0)
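Because the labels are derived from thresholds on a + b + c, the three classes come out with different frequencies, which is worth knowing before reading the confusion matrices below. A quick check (counts will vary from run to run, since the data are random):

# show how many samples fall into each class
print(df['y'].value_counts())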
Prediction with GradientBoostingClassifier
The 'sklearn' package provides the GradientBoostingClassifier class to build a gradient boosting model. We can create the classifier, train the model, and get the prediction results as shown below.
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.5, max_depth=1)
gbc.fit(Xtrain, np.ravel(ytrain, order='C'))
ypred = gbc.predict(Xtest)
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.5, loss='deviance', max_depth=1,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0,
              verbose=0, warm_start=False)
print(gbc.score(Xtest, ytest))
0.86

print(confusion_matrix(ytest, ypred))
[[ 2  0  1]
 [ 0 17  2]
 [ 2  2 24]]
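The confusion matrix suggests that most mistakes involve the small 'high' class. For a per-class breakdown, a short sketch using sklearn's classification_report can help; this is not part of the original code, and the exact numbers will differ on each run.

# per-class precision, recall, and F1 for the same predictions
from sklearn.metrics import classification_report
print(classification_report(ytest, ypred))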
Checking the learning rate
Next, we'll check the learning rate by setting different values and printing the results.
# find the optimal learning rate value
learning_rates = [0.01, 0.05, 0.1, 0.5, 1]
for n in learning_rates:
    gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=n, max_depth=1)
    gbc.fit(Xtrain, np.ravel(ytrain, order='C'))
    ypred = gbc.predict(Xtest)
    acc = gbc.score(Xtest, ytest)
    print("Learning rate: ", n, " Accuracy: ", acc)
    print("Confusion matrix:")
    print(confusion_matrix(ytest, ypred))
Learning rate:  0.01  Accuracy:  0.78
Confusion matrix:
[[ 0  0  3]
 [ 0 17  2]
 [ 0  6 22]]

Learning rate:  0.05  Accuracy:  0.8
Confusion matrix:
[[ 0  0  3]
 [ 0 16  3]
 [ 0  4 24]]

Learning rate:  0.1  Accuracy:  0.86
Confusion matrix:
[[ 1  0  2]
 [ 0 16  3]
 [ 0  2 26]]

Learning rate:  0.5  Accuracy:  0.86
Confusion matrix:
[[ 2  0  1]
 [ 0 17  2]
 [ 2  2 24]]

Learning rate:  1  Accuracy:  0.9
Confusion matrix:
[[ 1  0  2]
 [ 0 18  1]
 [ 0  2 26]]
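Note that picking the learning rate by scoring on the test set reuses the test data for tuning. A more careful alternative is to cross-validate each candidate on the training data only; the following is a minimal sketch of that approach, assuming 5-fold cross-validation, and is not part of the original tutorial.

# score each learning rate with 5-fold cross-validation on the training set
from sklearn.model_selection import cross_val_score

for lr in [0.01, 0.05, 0.1, 0.5, 1]:
    gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=lr, max_depth=1)
    scores = cross_val_score(gbc, Xtrain, np.ravel(ytrain, order='C'), cv=5)
    print("Learning rate:", lr, " CV accuracy: %.2f (+/- %.2f)" % (scores.mean(), scores.std()))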
Checking the number of estimators
Next, we'll check the number of estimators by setting different values and printing the results.
# find the optimal number of estimators
estimators = [10, 50, 100, 200, 500]
for e in estimators:
    gbc = GradientBoostingClassifier(n_estimators=e, learning_rate=1, max_depth=1)
    gbc.fit(Xtrain, np.ravel(ytrain, order='C'))
    ypred = gbc.predict(Xtest)
    acc = gbc.score(Xtest, ytest)
    print("Number of estimators: ", e, " Accuracy: ", acc)
    print("Confusion matrix:")
    print(confusion_matrix(ytest, ypred))
Number of estimators:  10  Accuracy:  0.86
Confusion matrix:
[[ 1  0  2]
 [ 0 17  2]
 [ 1  2 25]]

Number of estimators:  50  Accuracy:  0.9
Confusion matrix:
[[ 2  0  1]
 [ 0 17  2]
 [ 0  2 26]]

Number of estimators:  100  Accuracy:  0.9
Confusion matrix:
[[ 1  0  2]
 [ 0 18  1]
 [ 0  2 26]]

Number of estimators:  200  Accuracy:  0.9
Confusion matrix:
[[ 1  0  2]
 [ 0 18  1]
 [ 0  2 26]]

Number of estimators:  500  Accuracy:  0.9
Confusion matrix:
[[ 1  0  2]
 [ 0 18  1]
 [ 0  2 26]]
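The accuracy plateaus once enough estimators have been added. Instead of retraining a model for every setting, one can fit a single large model and use staged_predict to trace test accuracy after each boosting stage. Below is a sketch of that approach (not in the original tutorial).

# trace test accuracy after selected boosting stages of one 500-tree model
from sklearn.metrics import accuracy_score

gbc = GradientBoostingClassifier(n_estimators=500, learning_rate=1, max_depth=1)
gbc.fit(Xtrain, np.ravel(ytrain, order='C'))
for i, staged_pred in enumerate(gbc.staged_predict(Xtest), start=1):
    if i in (10, 50, 100, 200, 500):
        print("Stage:", i, " Accuracy:", accuracy_score(np.ravel(ytest), staged_pred))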
In this post, we've learned how to classify data with the GradientBoostingClassifier model in Python.
The full source is listed below.
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

def CreateDataFrame(N):
    columns = ['a','b','c','y']
    df = pd.DataFrame(columns=columns)
    for i in range(N):
        a = np.random.randint(10)
        b = np.random.randint(20)
        c = np.random.randint(5)
        y = "normal"
        if (a + b + c) > 25:
            y = "high"
        elif (a + b + c) < 12:
            y = "low"
        df.loc[i] = [a, b, c, y]
    return df

df = CreateDataFrame(200)
df.head()

X = df[["a","b","c"]]
Y = df[["y"]]
Xtrain, Xtest, ytrain, ytest = train_test_split(X, Y, random_state=0)

# build and train a GradientBoostingClassifier model
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.5, max_depth=1)
gbc.fit(Xtrain, np.ravel(ytrain, order='C'))
ypred = gbc.predict(Xtest)
print(gbc.score(Xtest, ytest))
print(confusion_matrix(ytest, ypred))

# find the optimal learning rate value
learning_rates = [0.01, 0.05, 0.1, 0.5, 1]
for n in learning_rates:
    gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=n, max_depth=1)
    gbc.fit(Xtrain, np.ravel(ytrain, order='C'))
    ypred = gbc.predict(Xtest)
    acc = gbc.score(Xtest, ytest)
    print("Learning rate: ", n, " Accuracy: ", acc)
    print("Confusion matrix:")
    print(confusion_matrix(ytest, ypred))

# find the optimal number of estimators
estimators = [10, 50, 100, 200, 500]
for e in estimators:
    gbc = GradientBoostingClassifier(n_estimators=e, learning_rate=1, max_depth=1)
    gbc.fit(Xtrain, np.ravel(ytrain, order='C'))
    ypred = gbc.predict(Xtest)
    acc = gbc.score(Xtest, ytest)
    print("Number of estimators: ", e, " Accuracy: ", acc)
    print("Confusion matrix:")
    print(confusion_matrix(ytest, ypred))