Classification Example with XGBClassifier in Python


  XGBoost stands for Extreme Gradient Boosting; it is a boosting algorithm based on gradient boosting machines. XGBoost applies regularization techniques to reduce overfitting, which is one of the differences from classical gradient boosting. Another advantage of XGBoost over classical gradient boosting is its fast execution speed.
   In this post, we'll briefly learn how to classify the iris dataset with an XGBoost model in Python. We'll use the xgboost Python module, so make sure it is available on your machine.
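If it is not installed yet, it can typically be added from the command line with pip:

pip install xgboost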
The tutorial covers:
  1. Preparing data
  2. Defining the model
  3. Predicting test data
We'll start by loading the required libraries.

from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold


Preparing data

In this tutorial, we'll use the iris dataset as the classification data. First, we'll separate the data into x (features) and y (labels) parts.

iris = load_iris()
x, y = iris.data, iris.target
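
The iris dataset contains 150 samples with four features and three target classes; a quick check confirms the shapes and class names.

print(x.shape, y.shape)
(150, 4) (150,)
print(iris.target_names)
['setosa' 'versicolor' 'virginica']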

Then we'll split them into train and test parts. Here, we'll hold out 15 percent of the dataset as test data.

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)
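
Note that train_test_split shuffles the data randomly, so the exact split (and therefore the scores below) varies between runs. For a reproducible split, you can pass a fixed seed, for example:

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15, random_state=12)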


Defining the model

We've loaded the XGBClassifier class from the xgboost module above. Now we can define the classifier model.

xgbc = XGBClassifier()
print(xgbc)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='multi:softprob', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1) 

You can change the classifier model parameters according to your dataset's characteristics. Here, we've defined it with the default parameter values.
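For instance, a version with more trees, a deeper tree limit, and explicit regularization terms could be defined as below; the values are illustrative rather than tuned for this dataset.

xgbc_custom = XGBClassifier(n_estimators=200,   # number of boosting rounds
                            max_depth=4,        # maximum depth of each tree
                            learning_rate=0.1,  # shrinkage applied to each tree's contribution
                            reg_alpha=0.1,      # L1 regularization on leaf weights
                            reg_lambda=1.0)     # L2 regularization on leaf weights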
We'll fit the model with the training data.

xgbc.fit(xtrain, ytrain)
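
As a quick sanity check, we can print the accuracy on the training data itself with the score() method; keep in mind that this is usually optimistic compared to the test accuracy.

print("Training score: %.2f" % xgbc.score(xtrain, ytrain))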

Next, we'll check the training accuracy with cross-validation, first with the default five-fold splitting and then with an explicit k-fold splitter.

scores = cross_val_score(xgbc, xtrain, ytrain, cv=5)
print("Mean cross-validation score: %.2f" % scores.mean())
Mean cross-validation score: 0.94 
 
kfold = KFold(n_splits=10, shuffle=True)
kf_cv_scores = cross_val_score(xgbc, xtrain, ytrain, cv=kfold)
print("K-fold CV average score: %.2f" % kf_cv_scores.mean())
K-fold CV average score: 0.94 
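
To see how much the score varies across folds, the individual fold scores can be printed as well:

print(kf_cv_scores)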


Predicting test data

Finally, we'll predict the test data and check the prediction accuracy with a confusion matrix.

ypred = xgbc.predict(xtest)
cm = confusion_matrix(ytest, ypred)
print(cm)
[[8 0 0]
 [0 8 0]
 [0 2 5]] 
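
In the confusion matrix, rows are the true classes and columns are the predicted ones, so in this run only two test samples were misclassified: two virginica flowers were predicted as versicolor. The overall test accuracy can also be computed directly with sklearn's accuracy_score:

from sklearn.metrics import accuracy_score

print("Test accuracy: %.2f" % accuracy_score(ytest, ypred))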


   In this post, we've briefly learned how to classify data with the XGBClassifier model in Python. Thank you for reading!
   The full source code is listed below.

from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold

iris = load_iris()
x, y = iris.data, iris.target
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)

xgbc = XGBClassifier()
print(xgbc)

xgbc.fit(xtrain, ytrain)

# cross-validation
scores = cross_val_score(xgbc, xtrain, ytrain, cv=5)
print("Mean cross-validation score: %.2f" % scores.mean())

kfold = KFold(n_splits=10, shuffle=True)
kf_cv_scores = cross_val_score(xgbc, xtrain, ytrain, cv=kfold)
print("K-fold CV average score: %.2f" % kf_cv_scores.mean())

ypred = xgbc.predict(xtest)
cm = confusion_matrix(ytest, ypred)
print(cm)

