XGBoost stands for "Extreme Gradient Boosting"; it is an efficient implementation of the gradient-boosted trees algorithm. It is a popular supervised machine learning method known for its computation speed, parallelization, and predictive performance. XGBoost is an open-source software library, and you can use it in R by installing the xgboost package.

In this tutorial, we'll briefly learn how to fit and predict regression data with the 'xgboost' function. The tutorial covers:

- Preparing the data
- Fitting the model and prediction
- Accuracy checking
- Source code listing

We'll start by loading the required library.

```r
library(xgboost)
library(caret)
```

**Preparing the data**

We use the Boston house-price dataset (from the MASS package) as the regression dataset in this tutorial. After loading the dataset, we'll split it into train and test parts and extract the x (input) and y (label) parts. Here, we'll set aside 15 percent of the dataset as test data. Note that the target variable, medv, is the 14th column of the dataset. The xgboost function works on matrix data, so we need to convert our data into the xgb.DMatrix type.

```r
boston = MASS::Boston
str(boston)

set.seed(12)
indexes = createDataPartition(boston$medv, p = .85, list = F)
train = boston[indexes, ]
test = boston[-indexes, ]

# medv (the target) is column 14; the remaining columns are predictors
train_x = data.matrix(train[, -14])
train_y = train[, 14]

test_x = data.matrix(test[, -14])
test_y = test[, 14]

xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)
```

**Fitting the model and prediction**

We'll define the model by using the xgboost() function of the xgboost package. Here, we'll set the 'max_depth' and 'nrounds' parameters. The 'max_depth' parameter defines the maximum depth of each tree; the higher the value, the more complex the model. The 'nrounds' parameter is the maximum number of boosting iterations.
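If you want to choose 'nrounds' less arbitrarily, the package also provides xgb.cv() for cross-validation. Below is a minimal sketch, assuming xgb_train has been built as shown earlier; the nfold value here is an arbitrary choice, not a recommendation from this tutorial:

```r
# 5-fold cross-validation to watch how RMSE evolves over boosting rounds
cv = xgb.cv(data = xgb_train, max_depth = 2, nrounds = 50,
            nfold = 5, verbose = FALSE)

# the evaluation log holds per-round train/test RMSE;
# a sensible nrounds can be read off where test RMSE stops improving
print(cv$evaluation_log)
```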

Calling the function is enough to train the model on the provided data. You can check the summary of the model by using the print() and str() functions.

```r
xgbc = xgboost(data = xgb_train, max_depth = 2, nrounds = 50)
print(xgbc)
```

```
##### xgb.Booster
raw: 22.2 Kb
call:
  xgb.train(params = params, data = dtrain, nrounds = nrounds,
    watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
    early_stopping_rounds = early_stopping_rounds, maximize = maximize,
    save_period = save_period, save_name = save_name, xgb_model = xgb_model,
    callbacks = callbacks, max_depth = 2)
params (as set within xgb.train):
  max_depth = "2", validate_parameters = "1"
xgb.attributes:
  niter
callbacks:
  cb.print.evaluation(period = print_every_n)
  cb.evaluation.log()
# of features: 13
niter: 50
nfeatures : 13
evaluation_log:
    iter train_rmse
       1  10.288543
       2   7.710918
---
      49   2.007022
      50   1.997438
```
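Beyond print() and str(), the fitted booster can also be inspected with the package's importance helpers to see which predictors drive the model. A minimal sketch, assuming the xgbc model trained above:

```r
# per-feature importance scores (Gain, Cover, Frequency) from the trained booster
importance = xgb.importance(model = xgbc)
print(importance)

# optional bar chart of the importance scores
xgb.plot.importance(importance)
```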

Next, we'll predict the test data with the trained xgbc model.

```r
pred_y = predict(xgbc, xgb_test)
```

**Accuracy check**

Next, we'll check the prediction accuracy with MSE, MAE, and RMSE metrics.
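As a quick reminder of what these metrics compute, here is a toy sketch with hypothetical values, independent of the model above:

```r
# toy example: hand-computed error metrics for two predictions
y    = c(3, 5)   # hypothetical true values
yhat = c(2, 7)   # hypothetical predictions

mean((y - yhat)^2)        # MSE:  (1^2 + (-2)^2) / 2 = 2.5
mean(abs(y - yhat))       # MAE:  (1 + 2) / 2 = 1.5
sqrt(mean((y - yhat)^2))  # RMSE: sqrt(2.5), about 1.581
```

Note that RMSE is simply the square root of MSE, which is why the two move together below.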

```r
mse = mean((test_y - pred_y)^2)
mae = caret::MAE(test_y, pred_y)
rmse = caret::RMSE(test_y, pred_y)

cat("MSE: ", mse, "MAE: ", mae, " RMSE: ", rmse)
```

```
MSE: 11.99942 MAE: 2.503739 RMSE: 3.464018
```

Finally, we'll visualize the original and predicted test data in a plot.

```r
x = 1:length(test_y)
plot(x, test_y, col = "red", type = "l")
lines(x, pred_y, col = "blue", type = "l")
legend(x = 1, y = 38, legend = c("original test_y", "predicted test_y"),
       col = c("red", "blue"), box.lty = 1, cex = 0.8, lty = c(1, 1))
```

In this tutorial, we've learned how to fit and predict regression data with xgboost in R. The full source code is listed below.

**Source code listing**

```r
library(xgboost)
library(caret)

boston = MASS::Boston
str(boston)

set.seed(12)
indexes = createDataPartition(boston$medv, p = .85, list = F)
train = boston[indexes, ]
test = boston[-indexes, ]

# medv (the target) is column 14; the remaining columns are predictors
train_x = data.matrix(train[, -14])
train_y = train[, 14]
test_x = data.matrix(test[, -14])
test_y = test[, 14]

xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)

xgbc = xgboost(data = xgb_train, max_depth = 2, nrounds = 50)
print(xgbc)

pred_y = predict(xgbc, xgb_test)

mse = mean((test_y - pred_y)^2)
mae = caret::MAE(test_y, pred_y)
rmse = caret::RMSE(test_y, pred_y)
cat("MSE: ", mse, "MAE: ", mae, " RMSE: ", rmse)

x = 1:length(test_y)
plot(x, test_y, col = "red", type = "l")
lines(x, pred_y, col = "blue", type = "l")
legend(x = 1, y = 38, legend = c("original test_y", "predicted test_y"),
       col = c("red", "blue"), box.lty = 1, cex = 0.8, lty = c(1, 1))
```
