Classification with XGBoost Model in R


   XGBoost (Extreme Gradient Boosting) is a boosting algorithm based on Gradient Boosting Machines. Unlike classical gradient boosting, XGBoost applies regularization techniques to reduce overfitting, and it is also notably fast in execution. It performs well in predictive modeling for both classification and regression problems.
   In this post, we'll briefly learn how to classify data with an xgboost model in R, using the xgboost package. The tutorial covers:
  1. Preparing data
  2. Defining the model
  3. Predicting test data

We'll start by loading the required packages.

library(xgboost)
library(caret) 


Preparing data

In this tutorial, we'll use the iris dataset as the classification problem. First, we'll split the dataset into train and test parts; here, I'll hold out 10 percent of the data as the test set.

indexes = createDataPartition(iris$Species, p=.9, list=F)
train = iris[indexes, ]
test = iris[-indexes, ]
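
Note that createDataPartition() draws a random, stratified sample, so the selected rows (and all the numbers shown below) change from run to run. To reproduce a run exactly, a seed can be fixed before partitioning (the value 123 is arbitrary):

set.seed(123)   # optional: makes the random split reproducible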

Next, we'll separate the x (feature) and y (label) parts. The training x data must be a matrix to be used in an xgboost model, so we'll convert the x parts with data.matrix().

train_x = data.matrix(train[,-5])
train_y = train[,5]
 
test_x = data.matrix(test[,-5])
test_y = test[,5]

Here, 5 is the index of the "Species" column in the iris data frame.
Next, we need to convert the train and test data into xgboost's xgb.DMatrix type. The label must be numeric, so we also convert the Species factor into its integer codes (1, 2, 3) with as.numeric().

xgb_train = xgb.DMatrix(data=train_x, label=as.numeric(train_y))
xgb_test = xgb.DMatrix(data=test_x, label=as.numeric(test_y))
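
As a quick sanity check, we can confirm that the labels stored in the DMatrix are the numeric class codes (getinfo() comes with the xgboost package):

head(getinfo(xgb_train, "label"))   # should show the integer codes, e.g. 1 1 1 ...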


Defining the model

We can define the model with the xgboost() function, changing some of its parameters. Note that xgboost() is a training function, so we pass the training data into it; once we run the function, it fits the model to the training data. Because our labels are plain numbers (1 to 3), xgboost falls back to its default regression objective here, which is why the training log below reports RMSE rather than a classification metric; we'll round the continuous predictions back to class indexes later.

xgbc = xgboost(data=xgb_train, max.depth=3, nrounds=50)
[1] train-rmse:1.213938 
[2] train-rmse:0.865807 
[3] train-rmse:0.622092 
[4] train-rmse:0.451725 
[5] train-rmse:0.334372 
[6] train-rmse:0.255238
.... 
[43] train-rmse:0.026330 
[44] train-rmse:0.026025 
[45] train-rmse:0.025677 
[46] train-rmse:0.025476 
[47] train-rmse:0.024495 
[48] train-rmse:0.023678 
[49] train-rmse:0.022138 
[50] train-rmse:0.020715 
 
print(xgbc)
##### xgb.Booster
raw: 30.2 Kb 
call:
  xgb.train(params = params, data = dtrain, nrounds = nrounds, 
    watchlist = watchlist, verbose = verbose, print_every_n = print_every_n, 
    early_stopping_rounds = early_stopping_rounds, maximize = maximize, 
    save_period = save_period, save_name = save_name, xgb_model = xgb_model, 
    callbacks = callbacks, max.depth = 3)
params (as set within xgb.train):
  max_depth = "3", silent = "1"
xgb.attributes:
  niter
callbacks:
  cb.print.evaluation(period = print_every_n)
  cb.evaluation.log()
  cb.save.model(save_period = save_period, save_name = save_name)
niter: 50
evaluation_log:
    iter train_rmse
       1   1.213938
       2   0.865807
---                
      49   0.022138
      50   0.020715
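
As an aside, treating the class index as a regression target is a shortcut. xgboost also supports native multi-class classification through the multi:softmax objective, which expects 0-based labels. A minimal sketch (the parameter values, including the lambda/alpha regularization terms mentioned in the introduction, are illustrative):

# Sketch: native multi-class training; multi:softmax needs 0-based labels
xgb_train_mc = xgb.DMatrix(data=train_x, label=as.numeric(train_y) - 1)
xgbc_mc = xgboost(data=xgb_train_mc, max.depth=3, nrounds=50,
                  objective="multi:softmax", num_class=3,
                  lambda=1, alpha=0)   # L2/L1 regularization (illustrative values)
# predict() on this model returns 0-based class indexes directly, no rounding needed

We'll stick with the regression-style model for the rest of this post.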


Predicting test data

The model is ready and we can predict our test data.

pred = predict(xgbc, xgb_test)
print(pred)
 [1] 1.0083745 0.9993168 0.7263275 0.9887304 0.9993168 1.9989902 1.9592317 1.9999132
 [9] 2.0134101 1.9976928 2.9946277 3.5094361 2.8852687 2.8306360 2.1748595

The predictions are continuous values, so we'll round them to the nearest class index and map them back to the factor levels. Since one prediction is above 3, we first cap the values at 3, the largest class index.

pred[(pred > 3)] = 3                              # cap at the largest class index
pred_y = as.factor(levels(test_y)[round(pred)])   # map codes 1-3 back to species names
print(pred_y)
 [1] setosa     setosa     setosa     setosa     setosa     versicolor versicolor
 [8] versicolor versicolor versicolor virginica  virginica  virginica  virginica 
[15] versicolor
Levels: setosa versicolor virginica
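
The single cap at 3 is enough for this run because the smallest prediction is well above 0.5, but in general a rounded prediction could also fall below 1. A more defensive mapping (just a sketch; it produces the same result here) clamps both ends of the range:

# Clamp rounded predictions into the valid range 1..number_of_levels
pred_idx = pmin(pmax(round(pred), 1), length(levels(test_y)))
pred_y = factor(levels(test_y)[pred_idx], levels=levels(test_y))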

We'll check the prediction accuracy with a confusion matrix. Note that caret's confusionMatrix() expects the predicted classes as its first argument and the reference (true) classes as its second.

cm = confusionMatrix(pred_y, test_y)
print(cm)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa          5          0         0
  versicolor      0          5         1
  virginica       0          0         4

Overall Statistics
                                          
               Accuracy : 0.9333          
                 95% CI : (0.6805, 0.9983)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 2.16e-06        
                                          
                  Kappa : 0.9             
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           0.8000
Specificity                 1.0000            0.9000           1.0000
Pos Pred Value              1.0000            0.8333           1.0000
Neg Pred Value              1.0000            1.0000           0.9091
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.2667
Detection Prevalence        0.3333            0.4000           0.2667
Balanced Accuracy           1.0000            0.9500           0.9000
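
The individual statistics can also be pulled out of the returned object programmatically; for example, the overall accuracy:

print(cm$overall["Accuracy"])   # the same 0.9333 reported above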


Finally, we can compare the predictions with the original values side by side.

result = cbind(orig=as.character(test_y),
               factor=as.factor(test_y),
               pred=pred,
               rounded=round(pred),
               pred=as.character(levels(test_y))[round(pred)])
 
print(data.frame(result))
         orig factor              pred rounded     pred.1
1      setosa      1  1.00837445259094       1     setosa
2      setosa      1 0.999316811561584       1     setosa
3      setosa      1 0.726327538490295       1     setosa
4      setosa      1 0.988730430603027       1     setosa
5      setosa      1 0.999316811561584       1     setosa
6  versicolor      2  1.99899017810822       2 versicolor
7  versicolor      2  1.95923173427582       2 versicolor
8  versicolor      2  1.99991321563721       2 versicolor
9  versicolor      2  2.01341009140015       2 versicolor
10 versicolor      2  1.99769282341003       2 versicolor
11  virginica      3   2.9946277141571       3  virginica
12  virginica      3                 3       3  virginica
13  virginica      3   2.8852686882019       3  virginica
14  virginica      3   2.8306360244751       3  virginica
15  virginica      3  2.17485952377319       2 versicolor


   In this post, we've briefly learned how to classify data with an xgboost model in R. Thank you for reading!
   The full source code is listed below.
library(xgboost)
library(caret)

indexes = createDataPartition(iris$Species, p=.9, list=F)
train = iris[indexes, ]
test = iris[-indexes, ]

train_x = data.matrix(train[,-5])
train_y = train[,5]
test_x = data.matrix(test[,-5])
test_y = test[,5]

xgb_train = xgb.DMatrix(data=train_x, label=as.numeric(train_y))
xgb_test = xgb.DMatrix(data=test_x, label=as.numeric(test_y))

xgbc = xgboost(data=xgb_train, max.depth=3, nrounds=50)
print(xgbc)

pred = predict(xgbc, xgb_test)
print(pred)

pred[(pred > 3)] = 3
pred_y = as.factor(levels(test_y)[round(pred)])
print(pred_y)

cm = confusionMatrix(pred_y, test_y)
print(cm)

result = cbind(orig=as.character(test_y),
               factor=as.factor(test_y),
               pred=pred,
               rounded=round(pred),
               pred=as.character(levels(test_y))[round(pred)])
print(data.frame(result))


