Gradient Boosting Regression Example with GBM in R

   Gradient boosting regression is a powerful machine learning technique used for predicting continuous outcomes. In this tutorial, we will learn how to use the gbm package in R to perform gradient boosting regression. The tutorial covers:

  1. Introduction to Gradient Boosting
  2. Package installation and preparing data
  3. Fitting the model and prediction
  4. Accuracy checking 
  5. Conclusion
  6. Source code listing
Let's get started.


 

Introduction to Gradient Boosting

     Gradient boosting combines many weak learners into a single strong predictive model. The model is built iteratively: at each step a new weak learner is fit to the errors of the current ensemble so that adding it reduces a chosen loss function. This is what distinguishes gradient boosting from earlier boosting methods such as AdaBoost, which reweight the training observations rather than fitting each new learner to the gradient of an explicitly chosen loss.

    The model training process includes components such as base learner, loss function, and optimization.

  • Base Learners are the individual models (for example, shallow decision trees) in the ensemble; each one captures part of the remaining structure in the data and contributes to the final prediction.
  • Loss Functions quantify how far the predictions are from the actual values; for regression this is typically squared error.
  • The Optimization Process minimizes the loss function by iteratively adding weak learners fit to its negative gradient, as sketched in the short example below. 
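
     To make the optimization step concrete, below is a minimal sketch of the boosting loop for squared-error loss, using shallow rpart trees as the weak learners. It is purely illustrative; the gbm() function used later in this tutorial handles all of this internally.

# Illustrative sketch of gradient boosting with squared-error loss
# (shallow rpart trees act as weak learners; gbm() does this internally)
library(rpart)

x <- MASS::Boston[, -14]            # features
y <- MASS::Boston$medv              # continuous outcome
pred <- rep(mean(y), length(y))     # initial prediction: the overall mean
shrinkage <- 0.1                    # learning rate

for (i in 1:100) {
  resid <- y - pred                 # negative gradient of squared-error loss
  fit <- rpart(resid ~ ., data = cbind(x, resid = resid),
               control = rpart.control(maxdepth = 2))    # weak learner
  pred <- pred + shrinkage * predict(fit, x)              # add scaled learner
}

mean((y - pred)^2)                  # training MSE drops as trees are added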

    The gbm (Generalized Boosted Regression Models) package in R fits gradient boosting models for both regression and classification tasks. It provides an extended implementation of AdaBoost (Adaptive Boosting) and Friedman's gradient boosting machine algorithm. 

 

Package installation and preparing data

    First, we need to install and load the required packages for this tutorial.

 
install.packages("gbm", "caret")
library(gbm)
library(caret)

    For this tutorial, we use the Boston housing dataset from the MASS package. We split the data into training and testing sets with the createDataPartition() function from the caret package. The last column, 'medv' (median home value), is the outcome (label), and the remaining columns are used as features of the 'boston' data frame.

 
# Load the Boston dataset
boston <- MASS::Boston

# Check the structure of the Boston dataset
str(boston)
 
# Set seed for reproducibility
set.seed(123)

# Split the data into training and testing sets
indexes <- createDataPartition(boston$medv, p = 0.8, list = FALSE)
train <- boston[indexes, ]
test <- boston[-indexes, ]
test_x <- test[,1:13]
test_y <- test[,14]
 

   

Fitting the model and prediction

    Now, let's fit the gradient boosting regression model using the gbm() function. Here, we provide the following hyperparameters:

  • medv ~ .: This specifies the formula for the regression model. medv is the outcome variable, and . indicates that all other variables in the dataset should be used as predictors.

  • data = train: The training dataset to be used for model fitting.

  • distribution = "gaussian": Defines the distribution family for the response variable. In this case, it is set to "gaussian", indicating that the response variable follows a Gaussian (normal) distribution.

  • cv.folds = 10: Specifies the number of cross-validation folds to be used for model evaluation.

  • shrinkage = 0.1: The learning rate, which controls the contribution of each tree to the final prediction.

  • n.trees = 200: The number of trees (boosting iterations) to fit.

 
# Train the gradient boosting regression model
model <- gbm(medv ~ .,
             data = train,
             distribution = "gaussian",
             cv.folds = 10,
             shrinkage = 0.1,
             n.trees = 200)
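
    Before predicting, we can optionally check which predictors contribute most to the model. Calling summary() on a fitted gbm object returns the relative influence of each variable and, by default, also draws it as a bar plot (this call appears again in the source code listing below).

# Relative influence of each predictor (also drawn as a bar plot)
summary(model)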

After training, we can generate predictions for the test data with the predict() function.

 
# Make predictions on the test set
pred_y <- predict(model, newdata = test_x)
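
    Note that when n.trees is not supplied, predict() picks a default number of trees and prints a message. Because we trained with cv.folds = 10, we can instead choose the number of trees explicitly with gbm.perf(), which returns the iteration with the lowest cross-validated error, and pass it to predict().

# Optional: choose the number of trees by cross-validation
best_iter <- gbm.perf(model, method = "cv")   # also plots training and CV error
pred_y <- predict(model, newdata = test_x, n.trees = best_iter)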


Accuracy checking

    Finally, we evaluate the performance of our model using metrics such as MSE (Mean Squared Error), MAE (Mean Absolute Error), and RMSE (Root Mean Squared Error).

 
# Calculate evaluation metrics
mse <- mean((test_y - pred_y)^2)
mae <- mean(abs(test_y - pred_y))
rmse <- sqrt(mse)

# Print evaluation metrics
cat("MAE:", mae, "\n", "MSE:", mse, "\n", "RMSE:", rmse, "\n", "\n")

The result looks as follows:

 
MAE: 2.458947
MSE: 11.46156
RMSE: 3.385492  
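
    Since the caret package is already loaded, the same kind of metrics (RMSE, R-squared, and MAE) can also be obtained in a single call with postResample().

# Alternative: caret reports RMSE, R-squared, and MAE in one call
postResample(pred = pred_y, obs = test_y)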
 

 We can also plot the actual and predicted values together to see how closely the model tracks the test data.

 
# Plot actual and predicted values
plot(test_y, pch = 18, col = "red", xlab = "Observation",
     ylab = "medv", main = "Actual vs. Predicted")
lines(pred_y, lwd = 1, col = "blue")
legend("topleft", legend = c("Actual", "Predicted"),
       col = c("red", "blue"), pch = c(18, NA), lty = c(NA, 1), lwd = 1, cex = 0.8)


Conclusion

    In this tutorial, we learned how to perform gradient boosting regression in R using the gbm package. The gbm library makes it easy to build predictive models that capture complex relationships in the data, which is why it is widely used in machine learning applications in R.


Source code listing

 
install.packages("gbm", "caret")
library(gbm)
library(caret)

# Load the Boston dataset
boston <- MASS::Boston

# Check the structure of the Boston dataset
str(boston)

# Set seed for reproducibility
set.seed(123)

# Split the data into training and testing sets
indexes <- createDataPartition(boston$medv, p = 0.8, list = FALSE)
train <- boston[indexes, ]
test <- boston[-indexes, ]
test_x <- test[,1:13]
test_y <- test[,14]

# Train the gradient boosting regression model
model <- gbm(medv ~ .,
             data = train,
             distribution = "gaussian",
             cv.folds = 10,
             shrinkage = 0.1,
             n.trees = 200)

# Print the relative influence of each predictor
rel_influence <- summary(model)
print(rel_influence)

# Make predictions on the test set
pred_y <- predict(model, newdata = test_x)

# Calculate evaluation metrics
mse <- mean((test_y - pred_y)^2)
mae <- mean(abs(test_y - pred_y))
rmse <- sqrt(mse)

# Print evaluation metrics
cat("MAE:", mae, "\n", "MSE:", mse, "\n", "RMSE:", rmse, "\n", "\n")

# Plot actual and predicted values
plot(test_y, pch = 18, col = "red", xlab = "Observation",
     ylab = "medv", main = "Actual vs. Predicted")
lines(pred_y, lwd = 1, col = "blue")
legend("topleft", legend = c("Actual", "Predicted"),
       col = c("red", "blue"), pch = c(18, NA), lty = c(NA, 1), lwd = 1, cex = 0.8)

