DataTechNotes: Gradient Boosting Regression Example with Scikit-learn

Gradient Boosting Regression is a supervised learning algorithm used for regression tasks. The idea behind gradient boosting is to combine many weak learners, added one at a time, into a single strong prediction model. Decision trees are the most commonly used base learners in this algorithm.

In this tutorial, we'll explore the fundamentals of gradient boosting regression and how to implement it using scikit-learn's GradientBoostingRegressor. The tutorial covers the following topics:

Introduction to Gradient Boosting regression
Preparing data
Defining the model
Evaluation and visualizing the result
Conclusion
Source code listing

Introduction to Gradient Boosting regression

Gradient boosting combines multiple weak learners into a stronger predictive model. It builds the ensemble iteratively: at each step, a new weak learner is fitted to correct the errors of the current ensemble, and its output is added to the running prediction. Each new learner is trained on the negative gradient of the loss function with respect to the current predictions, which amounts to gradient descent on the loss. This is what distinguishes gradient boosting from other boosting methods such as AdaBoost, which instead reweights the training samples. The model training process includes the following components:

Base Learners are the individual models (e.g., shallow decision trees) within the ensemble. Each one corrects part of the error left by the learners before it and contributes to the final prediction.
Loss Functions measure the difference between predicted and actual values. A common choice for regression tasks is the mean squared error (MSE).
The Optimization Process minimizes the loss function by iteratively adding weak learners. With a squared-error loss, each new learner is fitted to the residuals of the current ensemble, which refines the predictions and improves overall performance.
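
To make the components above concrete, here is a minimal from-scratch sketch of the boosting loop for a squared-error loss. It is a simplified illustration, not the way GradientBoostingRegressor is implemented internally. The names learning_rate, n_rounds, and trees, the noise level, and the seed are assumptions chosen for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Illustrative data; the noise level and seed are assumptions for repeatability
X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

learning_rate = 0.1   # shrinkage applied to each tree's output
n_rounds = 50         # number of weak learners to add

# Start from a constant model: the mean of the targets
prediction = np.full(y.shape, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # For a squared-error loss, the negative gradient is the residual
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residuals)
    # Add a shrunken version of the new tree's output to the ensemble
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"Training MSE before boosting: {initial_mse:.2f}")
print(f"Training MSE after {n_rounds} rounds: {final_mse:.2f}")
```

Each round fits a shallow tree to what the ensemble still gets wrong, so the training error shrinks step by step. The learning rate keeps any single tree from dominating the result.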

Preparing data

We start by loading the necessary libraries for this tutorial.

 
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
import matplotlib.pyplot as plt

Next, we generate simple regression data using the make_regression() function. This creates a dataset with 400 samples and 3 features. The generated data is then split into training and testing sets using the train_test_split() function. 80% of the data is used for training, and 20% is used for testing.

 
# Generating synthetic regression dataset
X, y = make_regression(n_samples=400, n_features=3)

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
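
As a quick sanity check, the shapes of the resulting arrays can be printed to confirm the 80/20 split. This is an optional sketch that repeats the two calls above so it runs on its own. If a reproducible dataset is wanted, make_regression() also accepts a random_state argument, which the tutorial code does not set.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Same calls as in the tutorial
X, y = make_regression(n_samples=400, n_features=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 80% of 400 samples go to training, 20% to testing
print(X_train.shape, X_test.shape)  # (320, 3) (80, 3)
print(y_train.shape, y_test.shape)  # (320,) (80,)
```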

Defining the model

We create an instance of the Gradient Boosting regressor model using the GradientBoostingRegressor class from the sklearn.ensemble module. Here, we set the hyperparameters n_estimators, learning_rate, and max_depth, and fix random_state so that training is repeatable.

n_estimators specifies the number of weak learners (decision trees) to be sequentially added to the ensemble during the training process.
learning_rate scales the contribution of each weak learner to the final prediction. A lower learning rate slows down learning, which often reduces overfitting but usually requires more estimators to reach the same training error.
max_depth sets the maximum depth of each decision tree in the ensemble.

The model is then trained on the training data using the fit() method. After training, we can make predictions on the test data using the predict() method.

 
# Create instance of the Gradient Boosting Regressor model
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
 
# Train the Gradient Boosting Regression model
gbr.fit(X_train, y_train)

# Predict the test data
y_pred = gbr.predict(X_test)
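
One way to see how n_estimators affects the model is to track the test error as trees are added, using the staged_predict() method of GradientBoostingRegressor. The sketch below is self-contained. The noise level and the dataset seed are assumptions added so the example is repeatable, and test_mse and best_n are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Noise level and dataset seed are assumptions added for repeatability
X, y = make_regression(n_samples=400, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

gbr = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
gbr.fit(X_train, y_train)

# staged_predict yields the test predictions after 1, 2, ..., n_estimators trees
test_mse = np.array(
    [np.mean((y_test - y_stage) ** 2) for y_stage in gbr.staged_predict(X_test)]
)

# Number of trees (1-based) at which the test MSE is lowest
best_n = int(np.argmin(test_mse)) + 1
print(f"Test MSE with 1 tree: {test_mse[0]:.2f}")
print(f"Test MSE with all {len(test_mse)} trees: {test_mse[-1]:.2f}")
print(f"Lowest test MSE at {best_n} trees: {test_mse[best_n - 1]:.2f}")
```

If the lowest test error occurs well before the last tree, a smaller n_estimators or a lower learning_rate may be worth trying.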

Evaluation and visualizing the result

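We define a function to evaluate the prediction accuracy. The function mse_rmse() calculates the Mean Squared Error (MSE) between the actual and predicted values, along with its square root, the Root Mean Squared Error (RMSE). The RMSE is expressed in the same units as the target, which makes it easier to interpret.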

 
# Define a function to calculate the MSE and RMSE
def mse_rmse(y, y_pred):
    # Calculate the mean squared error (MSE)
    mse = np.mean((y - y_pred) ** 2)
    
    # Calculate the root mean squared error (RMSE) by taking the square root of MSE
    rmse = np.sqrt(mse)
    
    # Return both MSE and RMSE
    return mse, rmse 
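
As a cross-check, the hand-written function can be compared with mean_squared_error() from sklearn.metrics on a few hand-made values. The arrays y_true and y_hat below are illustrative and are not taken from the tutorial's dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def mse_rmse(y, y_pred):
    # Same calculation as the tutorial's function
    mse = np.mean((y - y_pred) ** 2)
    return mse, np.sqrt(mse)

# Small hand-made arrays (illustrative values)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

mse, rmse = mse_rmse(y_true, y_hat)
sk_mse = mean_squared_error(y_true, y_hat)

# Squared errors are 0.25, 0.25, 0 and 1, so the mean is 0.375
print(mse, sk_mse)  # 0.375 0.375
```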
   

Finally, we print the calculated MSE and RMSE to evaluate the performance of the model and visualize the result on a graph.

 
# Calculate and print the MSE and RMSE between the actual and predicted values
mse, rmse = mse_rmse(y_test, y_pred)
print(f"MSE: {mse}, RMSE: {rmse}")

# Plot the actual test data points and the predicted values
x = range(len(y_test))
plt.figure(figsize=(10, 6))
plt.scatter(x, y_test, color="blue", s=10, label='Ground truth')
plt.plot(x, y_pred, color='red', linewidth=1, label="Predicted")
plt.legend()
plt.show()
 

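The result looks as follows. The exact values differ from run to run because make_regression() is called without a random_state: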

 MSE: 67.58306936784967, RMSE: 8.220892248889392
 

Conclusion

In this tutorial, we learned about Gradient Boosting regression and how to implement it using scikit-learn's GradientBoostingRegressor. Gradient boosting combines multiple weak learners into a stronger predictive model, adding one learner at a time so that each step reduces the loss of the current ensemble.

The GradientBoostingRegressor class helps us to build a gradient boosting regression model suitable for a wide range of regression tasks.

Source code listing

 
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
import matplotlib.pyplot as plt

# Generating synthetic regression dataset
X, y = make_regression(n_samples=400, n_features=3)

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create instance of the Gradient Boosting Regressor model
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Train the Gradient Boosting Regression model
gbr.fit(X_train, y_train)

# Predict the test data
y_pred = gbr.predict(X_test)

# Define a function to calculate the MSE and RMSE
def mse_rmse(y, y_pred):
    # Calculate the mean squared error (MSE)
    mse = np.mean((y - y_pred) ** 2)
    
    # Calculate the root mean squared error (RMSE) by taking the square root of MSE
    rmse = np.sqrt(mse)
    
    # Return both MSE and RMSE
    return mse, rmse  

# Calculate and print the MSE and RMSE between the actual and predicted values
mse, rmse = mse_rmse(y_test, y_pred)
print(f"MSE: {mse}, RMSE: {rmse}")

# Plot the actual test data points and the predicted values
x = range(len(y_test))
plt.figure(figsize=(10, 6))
plt.scatter(x, y_test, color="blue", s=10, label='Ground truth')
plt.plot(x, y_pred, color='red', linewidth=1, label="Predicted")
plt.legend()
plt.show()
