Gradient boosting regression is a powerful machine learning technique used for predicting continuous outcomes. In this tutorial, we will learn how to use the gbm package in R to perform gradient boosting regression. The tutorial covers:
- Introduction to Gradient Boosting
- Package installation and preparing data
- Fitting the model and prediction
- Accuracy checking
- Conclusion
- Source code listing
Introduction to Gradient Boosting
Gradient boosting combines the strengths of multiple weak learners to improve predictive models. It iteratively refines the model by adding new weak learners, each fit to the gradient of a chosen loss function. This direct optimization of the loss function distinguishes gradient boosting from earlier boosting methods such as AdaBoost, which instead reweights the training examples between rounds.
The model training process includes components such as base learner, loss function, and optimization.
- Base Learners are individual models (e.g., decision trees) within the ensemble, each specializing in specific data aspects and contributing to the final prediction.
- Loss Functions calculate the difference between predicted and actual values.
- The Optimization Process minimizes the loss function by iteratively adding weak learners.
The gbm (Generalized Boosted Regression Models) package in R is used for fitting gradient boosting models for regression and classification tasks. It provides an extended implementation of AdaBoost (Adaptive Boosting) and Friedman's gradient boosting machine algorithm.
Package installation and preparing data
First, we need to install and load the required packages for this tutorial.
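A minimal setup sketch: gbm for modeling, caret for data splitting, and MASS, which provides the Boston dataset used below.

```r
# Install the packages once, if they are not already installed.
install.packages("gbm")    # gradient boosting models
install.packages("caret")  # data splitting utilities
install.packages("MASS")   # provides the Boston dataset

# Load the packages for this session.
library(gbm)
library(caret)
library(MASS)
```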
For this tutorial, we use the Boston housing dataset, available in the MASS package. We split the data into training and testing sets using the createDataPartition() function from the caret package. The last column ('medv') is the outcome (label), and the other columns become the feature data in a 'boston' data frame.
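A sketch of this preparation step; the 80/20 split ratio and the seed are assumptions, since the original split parameters are not shown.

```r
# Copy the Boston housing data; 'medv' (median home value) is the outcome.
boston <- Boston

# Split into training (80%) and testing (20%) sets.
set.seed(123)  # assumed seed, for a reproducible split
index <- createDataPartition(boston$medv, p = 0.8, list = FALSE)
train <- boston[index, ]
test  <- boston[-index, ]
```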
Fitting the model and prediction
Now, let's fit the gradient boosting regression model using the gbm() function. Here, we provide the following hyperparameters (the full call is sketched after this list):
- medv ~ . : Specifies the formula for the regression model. medv is the outcome variable, and . indicates that all other variables in the dataset should be used as predictors.
- data = train : The training dataset to be used for model fitting.
- distribution = "gaussian" : Defines the distribution family for the response variable. In this case, it is set to "gaussian", indicating that the response variable follows a Gaussian (normal) distribution.
- cv.folds = 10 : Specifies the number of cross-validation folds to be used for model evaluation.
- shrinkage = 0.01 : The learning rate, which controls the contribution of each tree to the final prediction.
- n.trees = 100 : The number of trees to be used in the boosting process.
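Putting these hyperparameters together, the model fit might look like the following sketch (the set.seed call is an assumption, added for reproducibility):

```r
# Fit the gradient boosting regression model with the hyperparameters above.
set.seed(123)
model <- gbm(medv ~ .,
             data = train,
             distribution = "gaussian",
             cv.folds = 10,
             shrinkage = 0.01,
             n.trees = 100)

print(model)  # summary of the fit, including the best cross-validated iteration
```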
After training, we can predict the test data using the predict() method, as shown below.
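A sketch of the prediction step; predict.gbm() expects the number of trees to use, set here to the same value used in fitting.

```r
# Predict medv for the test set.
pred <- predict(model, newdata = test, n.trees = 100)
head(pred)
```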
Accuracy checking
Finally, we evaluate the performance of our model using metrics such as MSE (Mean Squared Error), MAE (Mean Absolute Error), and RMSE (Root Mean Squared Error).
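A minimal sketch of these metric calculations in base R, comparing the predictions with the actual test labels:

```r
# Compute error metrics between actual and predicted values.
mse  <- mean((test$medv - pred)^2)   # Mean Squared Error
mae  <- mean(abs(test$medv - pred))  # Mean Absolute Error
rmse <- sqrt(mse)                    # Root Mean Squared Error

cat("MSE: ", mse, "\n")
cat("MAE: ", mae, "\n")
cat("RMSE:", rmse, "\n")
```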
The result looks as follows:
MSE: 11.46156
RMSE: 3.385492
We can also plot the original and predicted values to inspect the difference visually.
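One way to draw this comparison, sketched with base R graphics (colors and layout are assumptions):

```r
# Plot actual vs. predicted medv values for the test set.
x_axis <- seq_along(pred)
plot(x_axis, test$medv, type = "l", col = "blue", lwd = 2,
     ylim = range(c(test$medv, pred)),
     xlab = "Observation", ylab = "medv")
lines(x_axis, pred, col = "red", lwd = 2)
legend("topright", legend = c("original", "predicted"),
       col = c("blue", "red"), lty = 1, lwd = 2)
```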
Conclusion
In this tutorial, we learned how to perform gradient boosting regression in R using the gbm package. The gbm library helps us build predictive models that capture complex relationships in the data, making it widely used in machine learning applications in R.
Source code listing
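The snippets above are consolidated here into one script; the split ratio and seed remain assumptions, as noted earlier.

```r
# Gradient boosting regression with gbm in R

library(gbm)
library(caret)
library(MASS)

# Prepare the data.
boston <- Boston

set.seed(123)  # assumed seed
index <- createDataPartition(boston$medv, p = 0.8, list = FALSE)
train <- boston[index, ]
test  <- boston[-index, ]

# Fit the model.
set.seed(123)
model <- gbm(medv ~ .,
             data = train,
             distribution = "gaussian",
             cv.folds = 10,
             shrinkage = 0.01,
             n.trees = 100)

# Predict the test data.
pred <- predict(model, newdata = test, n.trees = 100)

# Accuracy checking.
mse  <- mean((test$medv - pred)^2)
mae  <- mean(abs(test$medv - pred))
rmse <- sqrt(mse)
cat("MSE: ", mse, "\nMAE: ", mae, "\nRMSE:", rmse, "\n")

# Visualize original vs. predicted values.
x_axis <- seq_along(pred)
plot(x_axis, test$medv, type = "l", col = "blue", lwd = 2,
     ylim = range(c(test$medv, pred)),
     xlab = "Observation", ylab = "medv")
lines(x_axis, pred, col = "red", lwd = 2)
legend("topright", legend = c("original", "predicted"),
       col = c("blue", "red"), lty = 1, lwd = 2)
```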