Regression Example with XGBoost in R

    XGBoost stands for "Extreme Gradient Boosting", and it is an implementation of the gradient boosted trees algorithm. It is a popular supervised machine learning method valued for its computation speed, parallelization, and predictive performance. XGBoost is an open-source software library, and you can use it in R by installing the xgboost package.
    In this tutorial, we'll briefly learn how to fit and predict regression data with the xgboost() function. The tutorial covers:
  1. Preparing the data
  2. Fitting the model and prediction
  3. Accuracy checking
  4. Source code listing
We'll start by loading the required libraries.

library(xgboost)
library(caret)

Preparing the data

   We use the Boston house-price dataset from the MASS package as the regression dataset in this tutorial. After loading the dataset, we'll split it into train and test parts and extract the x-input and y-label parts. Here, I'll hold out 15 percent of the dataset as test data. xgboost works on its own matrix format, so we also need to convert the data into xgb.DMatrix objects.

boston = MASS::Boston
str(boston)

set.seed(12)

indexes = createDataPartition(boston$medv, p = .85, list = F)
train = boston[indexes, ]
test = boston[-indexes, ]

train_x = data.matrix(train[, -14])   # all predictors; medv is column 14
train_y = train[, 14]                 # medv, the median house value

test_x = data.matrix(test[, -14])
test_y = test[, 14]

xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)
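
As a quick sanity check, an xgb.DMatrix supports dim(), so we can confirm that the conversion kept every row and all 13 predictor columns:

dim(xgb_train)   # training rows x 13 predictors
dim(xgb_test)    # test rows x 13 predictors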


Fitting the model and prediction

   We'll define the model by using the xgboost() function of the xgboost package. Here, we'll set the 'max_depth' and 'nrounds' parameters. The 'max_depth' parameter limits the depth of the trees; the higher the value, the more complex the model. The 'nrounds' parameter sets the number of boosting iterations.
    Calling the function is enough to train the model on the provided data. You can check the summary of the model by using the print() and str() functions.

xgbc = xgboost(data = xgb_train, max_depth = 2, nrounds = 50)
print(xgbc)

##### xgb.Booster
raw: 22.2 Kb
call:
  xgb.train(params = params, data = dtrain, nrounds = nrounds,
    watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
    early_stopping_rounds = early_stopping_rounds, maximize = maximize,
    save_period = save_period, save_name = save_name, xgb_model = xgb_model,
    callbacks = callbacks, max_depth = 2)
params (as set within xgb.train):
  max_depth = "2", validate_parameters = "1"
xgb.attributes:
  niter
callbacks:
  cb.print.evaluation(period = print_every_n)
  cb.evaluation.log()
# of features: 13
niter: 50
nfeatures : 13
evaluation_log:
    iter train_rmse
       1  10.288543
       2   7.710918
---
      49   2.007022
      50   1.997438
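
The xgboost() call above leaves every other parameter at its default. For finer control, you can call the lower-level xgb.train() function that xgboost() wraps (visible in the call shown in the output above), passing a params list and a watchlist of evaluation sets to monitor. Below is a minimal sketch; the eta value and the "reg:squarederror" objective are simply the package defaults written out explicitly, not settings this tutorial depends on:

params = list(max_depth = 2,                    # same tree depth as above
              eta = 0.3,                        # learning rate (package default)
              objective = "reg:squarederror")   # squared-error regression (default)

xgbc2 = xgb.train(params = params, data = xgb_train, nrounds = 50,
                  watchlist = list(train = xgb_train, test = xgb_test),
                  verbose = 0)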

Next, we'll predict the test data with the fitted xgbc model.

pred_y = predict(xgbc, xgb_test)
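
For a quick look at how close the predictions are, we can print the first few actual and predicted values side by side:

head(data.frame(actual = test_y, predicted = pred_y))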



Accuracy checking

Next, we'll check the prediction accuracy with the MSE, MAE, and RMSE metrics.

mse = mean((test_y - pred_y)^2)
mae = caret::MAE(pred_y, test_y)     # caret metric functions take (pred, obs)
rmse = caret::RMSE(pred_y, test_y)

cat("MSE: ", mse, "MAE: ", mae, " RMSE: ", rmse)

MSE: 11.99942 MAE: 2.503739 RMSE: 3.464018
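
The caret package also provides an R2() function, so if you want an R-squared value alongside the other metrics, you can collect everything into a one-row data frame. Note that caret::R2() defaults to the squared correlation between predictions and observations:

data.frame(MSE  = mean((test_y - pred_y)^2),
           MAE  = caret::MAE(pred_y, test_y),
           RMSE = caret::RMSE(pred_y, test_y),
           R2   = caret::R2(pred_y, test_y))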

Finally, we'll visualize the original and predicted test values in a plot.

x = 1:length(test_y)
plot(x, test_y, col = "red", type = "l")
lines(x, pred_y, col = "blue", type = "l")
legend(x = 1, y = 38, legend = c("original test_y", "predicted test_y"),
       col = c("red", "blue"), box.lty = 1, cex = 0.8, lty = c(1, 1))


   In this tutorial, we've learned how to fit and predict regression data with xgboost in R. The full source code is listed below.


Source code listing


library(xgboost)
library(caret)

boston = MASS::Boston
str(boston)

set.seed(12)
indexes = createDataPartition(boston$medv, p = .85, list = F)
train = boston[indexes, ]
test = boston[-indexes, ]

train_x = data.matrix(train[, -14])   # all predictors; medv is column 14
train_y = train[, 14]

test_x = data.matrix(test[, -14])
test_y = test[, 14]

xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)

xgbc = xgboost(data = xgb_train, max_depth = 2, nrounds = 50)
print(xgbc)

pred_y = predict(xgbc, xgb_test)

mse = mean((test_y - pred_y)^2)
mae = caret::MAE(pred_y, test_y)
rmse = caret::RMSE(pred_y, test_y)

cat("MSE: ", mse, "MAE: ", mae, " RMSE: ", rmse)

x = 1:length(test_y)
plot(x, test_y, col = "red", type = "l")
lines(x, pred_y, col = "blue", type = "l")
legend(x = 1, y = 38, legend = c("original test_y", "predicted test_y"),
       col = c("red", "blue"), box.lty = 1, cex = 0.8, lty = c(1, 1))


