XGBoost is an open-source library that implements machine learning algorithms based on gradient boosting.
xgboost.XGBClassifier is a classification class that is compatible with the scikit-learn API.
In this post, we'll briefly learn how to classify the iris data with XGBClassifier in Python. We'll use the xgboost library, which you may need to install if it is not already available on your machine. The tutorial covers:
- Preparing data
- Defining the model
- Predicting test data
- Video tutorial
- Source code listing
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold
Preparing data
In this tutorial, we'll use the iris dataset as the classification data. First, we'll separate the data into x (features) and y (labels) parts.
iris = load_iris()
x, y = iris.data, iris.target
Then we'll split them into train and test parts. Here, we'll extract 15 percent of the dataset as test data.
xtrain, xtest, ytrain, ytest=train_test_split(x, y, test_size=0.15)
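Because train_test_split shuffles the data randomly, each run produces a different split and slightly different scores. As a minimal sketch, the split can be made repeatable and class-balanced as shown here. Note that the random_state and stratify arguments are additions that this tutorial's code does not use, and the seed value 42 is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
x, y = iris.data, iris.target

# random_state fixes the shuffle so the split is the same on every run;
# stratify=y keeps the class proportions similar in both parts
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.15, random_state=42, stratify=y
)

print(xtrain.shape, xtest.shape)
```

With 150 samples and test_size=0.15, the test part holds 23 rows and the training part holds 127.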
Defining the model
We've loaded the XGBClassifier class from the xgboost library above. Now we can define the classifier model.
xgbc = XGBClassifier()
print(xgbc)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=100, n_jobs=1, nthread=None, objective='multi:softprob', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=1)
You can change the classifier model parameters according to your dataset characteristics. Here, we've defined it with default parameter values.
We'll fit the model on the training data.
xgbc.fit(xtrain, ytrain)
Next, we'll check the model's accuracy on the training data, first with 5-fold cross-validation and then with a shuffled 10-fold KFold split.
scores = cross_val_score(xgbc, xtrain, ytrain, cv=5)
print("Mean cross-validation score: %.2f" % scores.mean())
Mean cross-validation score: 0.94
kfold = KFold(n_splits=10, shuffle=True)
kf_cv_scores = cross_val_score(xgbc, xtrain, ytrain, cv=kfold)
print("K-fold CV average score: %.2f" % kf_cv_scores.mean())
K-fold CV average score: 0.94
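To make clearer what KFold does with the training data, here is a small sketch that only inspects the fold sizes, without training any model. It assumes a training set of 127 samples (what remains of 150 rows after a 15 percent test split); the random_state value is an arbitrary addition for repeatability.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in for the training rows: 127 sample indices
indices = np.arange(127)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)

# Each split yields (train indices, validation indices); collect validation sizes
fold_sizes = [len(val_idx) for _, val_idx in kfold.split(indices)]

print(fold_sizes)
print(sum(fold_sizes))
```

Every sample lands in exactly one validation fold, so the fold sizes add up to the size of the training set, and each of the 10 scores is computed on roughly a tenth of it.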
Predicting test data
Finally, we'll predict the test data and check the prediction accuracy with a confusion matrix.
ypred = xgbc.predict(xtest)
cm = confusion_matrix(ytest, ypred)
print(cm)
[[8 0 0]
 [0 8 0]
 [0 2 5]]
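In scikit-learn's confusion_matrix, rows correspond to the true classes and columns to the predicted classes, so the diagonal holds the correct predictions. As a small sketch, the matrix printed above can be turned into accuracy, recall, and precision with NumPy alone. The numbers are copied from this run's output and will differ on other random splits.

```python
import numpy as np

# Confusion matrix from the run above: rows = true class, columns = predicted class
cm = np.array([[8, 0, 0],
               [0, 8, 0],
               [0, 2, 5]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # correct / total per true class (row)
precision = np.diag(cm) / cm.sum(axis=0)  # correct / total per predicted class (column)

print("Accuracy: %.3f" % accuracy)
print("Recall per class:", np.round(recall, 3))
print("Precision per class:", np.round(precision, 3))
```

Here 21 of the 23 test samples are classified correctly; the two errors are samples of the third class that were predicted as the second class.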
In this post, we've briefly learned how to classify data with the XGBClassifier class in Python. The full source code is listed below.
Video tutorial
https://youtu.be/kqzJLSReg9c
Source code listing
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold

# load the data and split it into train and test parts
iris = load_iris()
x, y = iris.data, iris.target
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)

# define and fit the model
xgbc = XGBClassifier()
print(xgbc)
xgbc.fit(xtrain, ytrain)

# cross-validation
scores = cross_val_score(xgbc, xtrain, ytrain, cv=5)
print("Mean cross-validation score: %.2f" % scores.mean())

kfold = KFold(n_splits=10, shuffle=True)
kf_cv_scores = cross_val_score(xgbc, xtrain, ytrain, cv=kfold)
print("K-fold CV average score: %.2f" % kf_cv_scores.mean())

# predict the test data and check the result
ypred = xgbc.predict(xtest)
cm = confusion_matrix(ytest, ypred)
print(cm)