### Classification with XGBoost Model in R

Extreme Gradient Boosting (XGBoost) is a gradient boosting algorithm in machine learning. XGBoost applies regularization to reduce overfitting. Its advantages over classical gradient boosting are fast execution and strong predictive performance on classification and regression problems.

In this tutorial, we'll briefly learn how to classify data with XGBoost by using the xgboost package in R. The tutorial covers:

1. Preparing data
2. Defining the model
3. Predicting test data

We'll start by loading the required libraries.

`library(xgboost)`
`library(caret)`

#### Preparing data

In this tutorial, we'll use the Iris dataset as the target classification data. First, we'll split the dataset into train and test parts. Here, ten percent of the dataset is held out as test data.

```
indexes = createDataPartition(iris$Species, p=.9, list=F)
train = iris[indexes, ]
test = iris[-indexes, ]
```
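Because the partition is drawn at random, each run produces a different split. The sketch below repeats the same split with a fixed seed and prints the class counts of each part. The seed value is an arbitrary assumption and is not part of the original code.

```r
library(caret)

# Assumption: a fixed seed (the value 123 is arbitrary) makes the split repeatable.
set.seed(123)

indexes <- createDataPartition(iris$Species, p = 0.9, list = FALSE)
train <- iris[indexes, ]
test  <- iris[-indexes, ]

# Class counts in each part; createDataPartition samples within each species.
print(table(train$Species))
print(table(test$Species))
```

Checking the counts is a quick way to confirm that every species appears in both the train and test parts.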

Next, we'll separate the x (feature) and y (label) parts. The xgboost package requires the x data as a matrix, so we'll convert it with data.matrix.

```
train_x = data.matrix(train[,-5])
train_y = train[,5]

test_x = data.matrix(test[,-5])
test_y = test[,5]
```

Next, we need to convert the train and test data into the xgb.DMatrix type.

```
xgb_train = xgb.DMatrix(data=train_x, label=train_y)
xgb_test = xgb.DMatrix(data=test_x, label=test_y)
```
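The labels here are a factor, while XGBoost works with numeric labels. The base R sketch below shows how the Species factor maps to integer codes and back. The predictions printed later in this tutorial cluster around 1, 2, and 3, which matches the 1-based codes. The 0-based variant is included because XGBoost's multiclass objectives expect labels starting at 0. The variable names are illustrative only.

```r
y <- iris$Species

codes <- as.integer(y)        # 1-based level codes: 1, 2, 3
zero_based <- codes - 1L      # 0-based codes, as multiclass objectives expect

# Map the 0-based codes back to the species names.
recovered <- levels(y)[zero_based + 1L]

print(table(codes))
print(identical(recovered, as.character(y)))
```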

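#### Defining the model

We can define the model with the xgboost function, changing some of its parameters. Note that xgboost is a training function, so we need to pass the training data too. Once we run the function, it fits the model on the training data. No objective is set in this call, and the train-rmse values in the log show that the model is fit as a regression on the numeric class codes.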

```
xgbc = xgboost(data=xgb_train, max.depth=3, nrounds=50)
 train-rmse:1.213938
 train-rmse:0.865807
 train-rmse:0.622092
 train-rmse:0.451725
 train-rmse:0.334372
 train-rmse:0.255238
 ....
 train-rmse:0.026330
 train-rmse:0.026025
 train-rmse:0.025677
 train-rmse:0.025476
 train-rmse:0.024495
 train-rmse:0.023678
 train-rmse:0.022138
 train-rmse:0.020715
```
```
print(xgbc)
##### xgb.Booster
raw: 30.2 Kb
call:
  xgb.train(params = params, data = dtrain, nrounds = nrounds,
    watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
    early_stopping_rounds = early_stopping_rounds, maximize = maximize,
    save_period = save_period, save_name = save_name, xgb_model = xgb_model,
    callbacks = callbacks, max.depth = 3)
params (as set within xgb.train):
  max_depth = "3", silent = "1"
xgb.attributes:
  niter
callbacks:
  cb.print.evaluation(period = print_every_n)
  cb.evaluation.log()
  cb.save.model(save_period = save_period, save_name = save_name)
niter: 50
evaluation_log:
 iter train_rmse
    1   1.213938
    2   0.865807
---
   49   0.022138
   50   0.020715
```
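The train-rmse metric in the log above indicates a regression fit. As an alternative, the sketch below trains with the "multi:softmax" objective so that the model predicts class indices directly. It is a minimal sketch, not the tutorial's method: the every-10th-row hold-out and the names dtrain, dtest, and xgb_multi are assumptions made to keep the example self-contained, and the exact return type of predict may differ between xgboost versions.

```r
library(xgboost)

# Assumption: a simple deterministic hold-out (every 10th row) stands in for the caret split.
test_idx <- seq(10, nrow(iris), by = 10)
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

# Multiclass objectives expect labels in 0 .. num_class - 1.
dtrain <- xgb.DMatrix(data = data.matrix(train[, -5]),
                      label = as.integer(train$Species) - 1)
dtest  <- xgb.DMatrix(data = data.matrix(test[, -5]),
                      label = as.integer(test$Species) - 1)

params <- list(objective = "multi:softmax", num_class = 3, max_depth = 3)
xgb_multi <- xgb.train(params = params, data = dtrain, nrounds = 50)

# Predictions are class indices (0, 1, 2); shift by one to index the factor levels.
pred_class <- predict(xgb_multi, dtest)
pred_label <- factor(levels(iris$Species)[pred_class + 1],
                     levels = levels(iris$Species))
print(table(Prediction = pred_label, Reference = test$Species))
```

With this objective there is no need to round or cap the predictions, since each prediction is already a class index.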

#### Predicting test data

The model is ready and we can predict the test data.

```
pred = predict(xgbc, xgb_test)
print(pred)
 1.0083745 0.9993168 0.7263275 0.9887304 0.9993168 1.9989902 1.9592317 1.9999132
 2.0134101 1.9976928 2.9946277 3.5094361 2.8852687 2.8306360 2.1748595
```

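The predictions are continuous values near the class codes 1, 2, and 3. We'll cap any value above 3, round each prediction to the nearest class code, and convert the result into a factor.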

```
pred[(pred>3)] = 3
pred_y = as.factor((levels(test_y))[round(pred)])
print(pred_y)
 setosa     setosa     setosa     setosa     setosa     versicolor versicolor
 versicolor versicolor versicolor virginica  virginica  virginica  virginica
 versicolor
Levels: setosa versicolor virginica
```
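The conversion above caps only the upper end. A prediction below 0.5 would round to 0, and indexing the levels with 0 silently drops that element. The base R sketch below clamps the rounded values on both sides. It uses a few of the predictions printed above; the names lv and idx are illustrative.

```r
# A few of the predictions printed above, including the out-of-range 3.5094361.
pred <- c(1.0083745, 0.7263275, 1.9592317, 3.5094361, 2.1748595)
lv <- c("setosa", "versicolor", "virginica")

# Round to the nearest class code, then clamp into 1 .. length(lv).
idx <- pmin(pmax(round(pred), 1), length(lv))

pred_y <- factor(lv[idx], levels = lv)
print(pred_y)
```

Passing levels explicitly to factor keeps all three classes as levels even if one of them is never predicted.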

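We'll check the prediction accuracy with a confusion matrix. Note that caret's confusionMatrix takes the predictions as its first argument and the reference values as its second. The call below passes test_y first, so the rows labeled Prediction in the output hold the true classes and the columns hold the predicted ones.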

```
cm = confusionMatrix(test_y, pred_y)
print(cm)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa          5          0         0
  versicolor      0          5         0
  virginica       0          1         4

Overall Statistics

Accuracy : 0.9333
95% CI : (0.6805, 0.9983)
No Information Rate : 0.4
P-Value [Acc > NIR] : 2.523e-05

Kappa : 0.9
Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8333           1.0000
Specificity                 1.0000            1.0000           0.9091
Pos Pred Value              1.0000            1.0000           0.8000
Neg Pred Value              1.0000            0.9000           1.0000
Prevalence                  0.3333            0.4000           0.2667
Detection Rate              0.3333            0.3333           0.2667
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            0.9167           0.9545
```
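The accuracy figure can be cross-checked without caret. The base R sketch below rebuilds the true and predicted labels from the values printed in this tutorial (five test rows per species, with the last virginica row predicted as versicolor) and computes the cross-table and accuracy by hand.

```r
lv <- c("setosa", "versicolor", "virginica")

# True labels: five test rows per species, in order.
test_y <- factor(rep(lv, each = 5), levels = lv)

# Predicted labels as printed above: the last virginica row is predicted as versicolor.
pred_y <- factor(c(rep("setosa", 5), rep("versicolor", 5),
                   rep("virginica", 4), "versicolor"), levels = lv)

# Rows are predictions, columns are the true classes.
cm <- table(Prediction = pred_y, Reference = test_y)
print(cm)

# Accuracy is the share of matching labels: 14 of 15.
acc <- mean(pred_y == test_y)
print(round(acc, 4))
```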

We can compare the result with the original values.

```
result = cbind(orig=as.character(test_y),
               factor=as.factor(test_y),
               pred=pred,
               rounded=round(pred),
               pred=as.character(levels(test_y))[round(pred)])

print(data.frame(result))
         orig factor              pred rounded     pred.1
1      setosa      1  1.00837445259094       1     setosa
2      setosa      1 0.999316811561584       1     setosa
3      setosa      1 0.726327538490295       1     setosa
4      setosa      1 0.988730430603027       1     setosa
5      setosa      1 0.999316811561584       1     setosa
6  versicolor      2  1.99899017810822       2 versicolor
7  versicolor      2  1.95923173427582       2 versicolor
8  versicolor      2  1.99991321563721       2 versicolor
9  versicolor      2  2.01341009140015       2 versicolor
10 versicolor      2  1.99769282341003       2 versicolor
11  virginica      3   2.9946277141571       3  virginica
12  virginica      3                 3       3  virginica
13  virginica      3   2.8852686882019       3  virginica
14  virginica      3   2.8306360244751       3  virginica
15  virginica      3  2.17485952377319       2 versicolor
```
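In the table above, cbind combines character and numeric vectors into a character matrix, which is why the predictions print as long strings and the repeated pred name becomes pred.1. A possible alternative, sketched below with four of the rows shown above, builds the data frame directly so that numeric columns stay numeric. The column names code, pred_label, and match are illustrative choices.

```r
lv <- c("setosa", "versicolor", "virginica")

# Four rows taken from the results printed above.
test_y <- factor(c("setosa", "versicolor", "virginica", "virginica"), levels = lv)
pred <- c(1.0083745, 1.9989902, 2.9946277, 2.1748595)

result <- data.frame(
  orig = as.character(test_y),
  code = as.integer(test_y),     # numeric class code of the true label
  pred = pred,                   # raw model output, kept numeric
  rounded = round(pred),
  pred_label = lv[round(pred)]   # predicted class name
)

# Flag the rows where the predicted label matches the true one.
result$match <- result$orig == result$pred_label
print(result)
```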

In this tutorial, we've briefly learned how to classify data with xgboost in R. The full source code is listed below.

