Regression Example with XGBoost in R

    XGBoost stands for "Extreme Gradient Boosting", and it is an implementation of the gradient boosted trees algorithm. It is a popular supervised machine learning method valued for its computation speed, parallelization, and predictive performance. XGBoost is an open-source software library, and you can use it in R by installing the xgboost package.
    In this tutorial, we'll briefly learn how to fit and predict regression data with the xgboost() function. The tutorial covers:
  1. Preparing the data
  2. Fitting the model and prediction
  3. Accuracy checking
  4. Source code listing
We'll start by loading the required libraries.

library(xgboost)
library(caret)

Preparing the data

   We use the Boston house-price dataset from the MASS package as the regression dataset in this tutorial. After loading the dataset, we'll split it into train and test parts and extract the x-input and y-label parts. Here, I'll hold out 15 percent of the dataset as test data. xgboost works on its own matrix format, so we also need to convert the data into xgb.DMatrix objects.

boston = MASS::Boston
str(boston)

set.seed(12)

indexes = createDataPartition(boston$medv, p = .85, list = F)
train = boston[indexes, ]
test = boston[-indexes, ]

train_x = data.matrix(train[, -14])   # all predictors; medv is column 14
train_y = train[, 14]                 # medv, the median house value

test_x = data.matrix(test[, -14])
test_y = test[, 14]

xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)
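
As a quick sanity check, an xgb.DMatrix supports dim(), so we can confirm that the conversion kept every row and all 13 predictor columns:

dim(xgb_train)   # training rows x 13 predictors
dim(xgb_test)    # test rows x 13 predictors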


Fitting the model and prediction

   We'll define the model by using the xgboost() function of the xgboost package. Here, we'll set the 'max_depth' and 'nrounds' parameters. The 'max_depth' parameter limits the depth of the trees; the higher the value, the more complex the model. The 'nrounds' parameter sets the number of boosting iterations.
    Calling the function is enough to train the model on the provided data. You can check the summary of the model by using the print() and str() functions.

xgbc = xgboost(data = xgb_train, max_depth = 2, nrounds = 50)
print(xgbc)

##### xgb.Booster
raw: 22.2 Kb
call:
  xgb.train(params = params, data = dtrain, nrounds = nrounds,
    watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
    early_stopping_rounds = early_stopping_rounds, maximize = maximize,
    save_period = save_period, save_name = save_name, xgb_model = xgb_model,
    callbacks = callbacks, max_depth = 2)
params (as set within xgb.train):
  max_depth = "2", validate_parameters = "1"
xgb.attributes:
  niter
callbacks:
  cb.print.evaluation(period = print_every_n)
  cb.evaluation.log()
# of features: 13
niter: 50
nfeatures : 13
evaluation_log:
    iter train_rmse
       1  10.288543
       2   7.710918
---
      49   2.007022
      50   1.997438
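
The xgboost() call above leaves every other parameter at its default. For finer control, you can call the lower-level xgb.train() function that xgboost() wraps (visible in the call shown in the output above), passing a params list and a watchlist of evaluation sets to monitor. Below is a minimal sketch; the eta value and the "reg:squarederror" objective are simply the package defaults written out explicitly, not settings this tutorial depends on:

params = list(max_depth = 2,                    # same tree depth as above
              eta = 0.3,                        # learning rate (package default)
              objective = "reg:squarederror")   # squared-error regression (default)

xgbc2 = xgb.train(params = params, data = xgb_train, nrounds = 50,
                  watchlist = list(train = xgb_train, test = xgb_test),
                  verbose = 0)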

Next, we'll predict the test data with the fitted xgbc model.

pred_y = predict(xgbc, xgb_test)
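
For a quick look at how close the predictions are, we can print the first few actual and predicted values side by side:

head(data.frame(actual = test_y, predicted = pred_y))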



Accuracy checking

Next, we'll check the prediction accuracy with the MSE, MAE, and RMSE metrics.

mse = mean((test_y - pred_y)^2)
mae = caret::MAE(pred_y, test_y)     # caret metric functions take (pred, obs)
rmse = caret::RMSE(pred_y, test_y)

cat("MSE: ", mse, "MAE: ", mae, " RMSE: ", rmse)

MSE: 11.99942 MAE: 2.503739 RMSE: 3.464018
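
The caret package also provides an R2() function, so if you want an R-squared value alongside the other metrics, you can collect everything into a one-row data frame. Note that caret::R2() defaults to the squared correlation between predictions and observations:

data.frame(MSE  = mean((test_y - pred_y)^2),
           MAE  = caret::MAE(pred_y, test_y),
           RMSE = caret::RMSE(pred_y, test_y),
           R2   = caret::R2(pred_y, test_y))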

Finally, we'll visualize the original and predicted test values in a plot.

x = 1:length(test_y)
plot(x, test_y, col = "red", type = "l")
lines(x, pred_y, col = "blue", type = "l")
legend(x = 1, y = 38, legend = c("original test_y", "predicted test_y"),
       col = c("red", "blue"), box.lty = 1, cex = 0.8, lty = c(1, 1))


   In this tutorial, we've learned how to fit and predict regression data with xgboost in R. The full source code is listed below.


Source code listing


library(xgboost)
library(caret)

boston = MASS::Boston
str(boston)

set.seed(12)
indexes = createDataPartition(boston$medv, p = .85, list = F)
train = boston[indexes, ]
test = boston[-indexes, ]

train_x = data.matrix(train[, -14])   # all predictors; medv is column 14
train_y = train[, 14]

test_x = data.matrix(test[, -14])
test_y = test[, 14]

xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)

xgbc = xgboost(data = xgb_train, max_depth = 2, nrounds = 50)
print(xgbc)

pred_y = predict(xgbc, xgb_test)

mse = mean((test_y - pred_y)^2)
mae = caret::MAE(pred_y, test_y)
rmse = caret::RMSE(pred_y, test_y)

cat("MSE: ", mse, "MAE: ", mae, " RMSE: ", rmse)

x = 1:length(test_y)
plot(x, test_y, col = "red", type = "l")
lines(x, pred_y, col = "blue", type = "l")
legend(x = 1, y = 38, legend = c("original test_y", "predicted test_y"),
       col = c("red", "blue"), box.lty = 1, cex = 0.8, lty = c(1, 1))


