Classification Example with XGBClassifier in Python

XGBoost stands for eXtreme Gradient Boosting, a boosting algorithm built on gradient-boosted decision trees. One of its differences from classical gradient boosting is that XGBoost applies a stronger regularization technique to reduce overfitting.
The ‘xgboost’ package is an open-source library that provides machine learning algorithms under the gradient boosting framework.
The xgboost.XGBClassifier is a scikit-learn API compatible class for classification.
In this post, we'll briefly learn how to classify iris data with XGBClassifier in Python. We'll use the xgboost library, which you may need to install (for example, with pip install xgboost) if it is not available on your machine. The tutorial covers:
  1. Preparing data
  2. Defining the model
  3. Predicting test data
  4. Video tutorial
  5. Source code listing
We'll start by loading the required libraries.

from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold


Preparing data

In this tutorial, we'll use the iris dataset as the classification data. First, we'll separate the data into feature (x) and target (y) parts.

iris = load_iris()
x, y = iris.data, iris.target
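A quick sanity check on the shapes confirms what we loaded: 150 samples with 4 features each, and one integer class label (0, 1, or 2) per sample. This check is optional and not part of the listing below.

print(x.shape, y.shape)
(150, 4) (150,)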

Then we'll split them into train and test parts. Here, we'll hold out 15 percent of the dataset as test data.

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)
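Note that train_test_split shuffles the data randomly, so the split (and the scores below) will differ between runs. If you want a reproducible, class-balanced split, you can also pass random_state and stratify; the seed value in this sketch is an arbitrary assumption:

xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.15,
    random_state=42,   # arbitrary seed for reproducibility
    stratify=y)        # keep class proportions equal in both parts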


Defining the model

We've already imported the XGBClassifier class from the xgboost library above. Now we can define the classifier model.

xgbc = XGBClassifier()
print(xgbc)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='multi:softprob', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1) 

You can change the classifier model parameters according to your dataset characteristics. Here, we've defined it with default parameter values.
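For instance, a few commonly tuned parameters can be passed to the constructor. The values below are illustrative assumptions, not recommendations for the iris data:

xgbc_tuned = XGBClassifier(n_estimators=200,    # number of boosting rounds
                           max_depth=4,         # maximum tree depth
                           learning_rate=0.05,  # shrinkage applied each round
                           subsample=0.8,       # row sampling ratio per tree
                           reg_lambda=1.0)      # L2 regularization weight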
We'll fit the model with train data.

xgbc.fit(xtrain, ytrain)

Next, we'll check the training accuracy with cross-validation, first using the default 5-fold splitting and then an explicit shuffled KFold.

scores = cross_val_score(xgbc, xtrain, ytrain, cv=5)
print("Mean cross-validation score: %.2f" % scores.mean())
Mean cross-validation score: 0.94 
 
kfold = KFold(n_splits=10, shuffle=True)
kf_cv_scores = cross_val_score(xgbc, xtrain, ytrain, cv=kfold)
print("K-fold CV average score: %.2f" % kf_cv_scores.mean())
K-fold CV average score: 0.94 
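cross_val_score returns one accuracy value per fold, so besides the mean you can also inspect the spread across folds; an optional sketch:

for i, score in enumerate(kf_cv_scores):
    print("Fold %d accuracy: %.2f" % (i + 1, score))
print("K-fold CV score std: %.2f" % kf_cv_scores.std())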


Predicting test data

Finally, we'll predict the test data and check the prediction accuracy with a confusion matrix.

ypred = xgbc.predict(xtest)
cm = confusion_matrix(ytest, ypred)
print(cm)
[[8 0 0]
 [0 8 0]
 [0 2 5]] 
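The matrix shows that all test samples were classified correctly except two samples of the third class that were predicted as the second class. If you prefer a single summary number, scikit-learn's accuracy_score (an extra import not used elsewhere in this post) computes the fraction of correct predictions:

from sklearn.metrics import accuracy_score

print("Test accuracy: %.2f" % accuracy_score(ytest, ypred))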


   In this post, we've briefly learned how to classify data with the XGBClassifier class in Python. The full source code is listed below.


Video tutorial


 https://youtu.be/kqzJLSReg9c


Source code listing

from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold

iris = load_iris()
x, y = iris.data, iris.target
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)

xgbc = XGBClassifier()
print(xgbc)

xgbc.fit(xtrain, ytrain)

# cross-validation
scores = cross_val_score(xgbc, xtrain, ytrain, cv=5)
print("Mean cross-validation score: %.2f" % scores.mean())

kfold = KFold(n_splits=10, shuffle=True)
kf_cv_scores = cross_val_score(xgbc, xtrain, ytrain, cv=kfold)
print("K-fold CV average score: %.2f" % kf_cv_scores.mean())

ypred = xgbc.predict(xtest)
cm = confusion_matrix(ytest, ypred)
print(cm)

