DataTechNotes: Regression Example with RandomForestRegressor in Python

Random Forest Regression is a powerful machine learning algorithm widely used for predicting continuous values. It belongs to the family of ensemble learning techniques, where multiple decision trees are combined to make more accurate predictions. In this tutorial, we will explore the concept of Random Forest Regression and its implementation with scikit-learn in Python.

Table of contents:

Introduction to random forest regression
Preparing the data
Building the random forest regressor
Making predictions and evaluating the model
Conclusion
Source code listing

Let's get started.

Introduction to random forest regression

Random Forest Regression is a machine learning algorithm used for predicting continuous values. It combines multiple decision trees, and the combined prediction is usually more accurate than that of any single tree. Because it averages many trees, it is less prone to overfitting than an individual decision tree, and it can handle large datasets with many features.

Random Forest Regression belongs to the family of ensemble learning methods. The following points explain how it works.

Ensemble Learning: Random Forest Regression is based on the concept of ensemble learning, which combines multiple individual models to make more accurate predictions than any single model alone. In the case of Random Forest Regression, the individual models are decision trees.
Decision Trees: Decision trees are simple, tree-like structures that recursively divide the feature space into regions, with each region corresponding to a prediction.
Random Forest: A Random Forest Regression model consists of a collection of decision trees, where each tree is trained on a random sample of the training data (by default in scikit-learn, a bootstrap sample drawn with replacement). Additionally, at each node of a decision tree, a random subset of features can be considered for splitting; in scikit-learn this is controlled by the max_features parameter.
Prediction: To make a prediction using a Random Forest Regression model, the predictions of all individual trees are averaged. This ensemble approach typically results in more accurate predictions compared to a single decision tree.
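
The averaging step described above can be checked directly. The following is a minimal sketch, assuming a small synthetic dataset; the names X_demo, y_demo, and forest, as well as the dataset size, noise level, and seeds, are illustrative choices and not part of the tutorial's own code. It compares the mean of the per-tree predictions with the output of the forest's predict method.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic dataset; sizes, noise and seeds are illustrative assumptions.
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted decision tree is stored in forest.estimators_.
per_tree = np.array([tree.predict(X_demo) for tree in forest.estimators_])

# Averaging the per-tree predictions should reproduce the forest's prediction.
manual_average = per_tree.mean(axis=0)
print(np.allclose(manual_average, forest.predict(X_demo)))
```

Keeping the forest small (10 trees) makes the per-tree array easy to inspect; the same check should hold for any number of trees.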

Preparing the data

We'll begin by loading the necessary libraries for this tutorial.

 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt

We'll use a synthetic regression dataset generated using the make_regression function from scikit-learn. The dataset contains 1000 samples with 3 input features.

 
# Generating synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=3)
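
The call above does not fix a seed, so every run produces different data. As a hedged variation, the sketch below passes random_state and noise to make_regression; both values (42 and 10.0) and the variable names are assumptions chosen for illustration, not settings used in this tutorial.

```python
from sklearn.datasets import make_regression

# random_state makes the generated data repeatable; noise adds Gaussian noise
# to the targets. Both values are illustrative assumptions.
X_fixed, y_fixed = make_regression(n_samples=1000, n_features=3, noise=10.0, random_state=42)
X_again, y_again = make_regression(n_samples=1000, n_features=3, noise=10.0, random_state=42)

# The feature matrix has one row per sample and one column per feature.
print(X_fixed.shape, y_fixed.shape)

# Two calls with the same seed return identical arrays.
same_data = bool((X_fixed == X_again).all() and (y_fixed == y_again).all())
print(same_data)
```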

Next, we split the data into training and test sets, holding out 20 percent of the samples for testing.

 
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
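
To see what the split produces, the sketch below repeats it on a freshly generated dataset and prints the resulting shapes. The variable names and the seed passed to make_regression are assumptions made for this example.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Illustrative dataset with the same dimensions as in the tutorial.
X_demo, y_demo = make_regression(n_samples=1000, n_features=3, random_state=0)

# test_size=0.2 holds out 20 percent of the 1000 samples.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```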

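You can also apply preprocessing steps such as feature scaling, although tree-based models like random forests are generally insensitive to the scale of the input features.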
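
If a preprocessing step is wanted, one possible arrangement is to chain it with the regressor in a scikit-learn pipeline, so that the scaler is fitted on the training data only. This is a sketch under that assumption; the choice of StandardScaler and the names X_demo, scaled_forest, and scaled_pred are illustrative and do not appear in the tutorial.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset and split, mirroring the sizes used in the tutorial.
X_demo, y_demo = make_regression(n_samples=1000, n_features=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training data, then fits the forest
# on the scaled features.
scaled_forest = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=120, random_state=42),
)
scaled_forest.fit(X_tr, y_tr)

# At prediction time the stored scaling is applied before the forest is used.
scaled_pred = scaled_forest.predict(X_te)
print(scaled_pred.shape)
```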

Building the random forest regressor

We initialize the random forest regressor using the RandomForestRegressor class from scikit-learn, where we specify hyperparameters such as the number of trees (n_estimators) and any other optional parameters.

We proceed to train the Random Forest regressor on the training data by invoking the fit() method.

 
# Initializing the Random Forest Regression model
rf_regressor = RandomForestRegressor(n_estimators=120, random_state=42)

# Training the model
rf_regressor.fit(X_train, y_train)
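
After fitting, the model exposes a few attributes that are useful for a quick sanity check. The sketch below, written against an illustrative dataset with assumed variable names, looks at the number of fitted trees and the impurity-based feature importances.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Illustrative dataset and split; the seed for make_regression is an assumption.
X_demo, y_demo = make_regression(n_samples=1000, n_features=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

forest = RandomForestRegressor(n_estimators=120, random_state=42)
forest.fit(X_tr, y_tr)

# One fitted decision tree per requested estimator.
n_trees = len(forest.estimators_)
print(n_trees)

# One importance value per input feature; the values are normalized.
importances = forest.feature_importances_
print(importances)
```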

Making predictions and evaluating the model

Using the trained model, we predict the target values for the test set with the predict() method.

Then, we assess the performance of the model by comparing the predicted values with the true values from the testing set. To accomplish this, we utilize the mean_squared_error and r2_score functions from scikit-learn's metrics module. These functions compute the mean squared error (MSE) and the R-squared (R2) score, respectively, based on the actual and predicted values, providing insights into the regression model's performance.

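R-squared, or the coefficient of determination, measures the proportion of variance in the dependent variable explained by the independent variables in a regression model. The best possible score is 1, and higher values indicate a better fit. A model that always predicts the mean of the target scores 0, and r2_score can return negative values when the model performs worse than that baseline.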

 
# Making predictions on the testing data
y_pred = rf_regressor.predict(X_test)

# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Printing evaluation metrics
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2) Score: {r2}")
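
Both metrics can be reproduced by hand, which may help clarify what they measure. The sketch below uses a tiny made-up set of true and predicted values (an assumption for illustration only) and compares a NumPy computation with the scikit-learn functions. It also includes a deliberately poor prediction to show that R-squared can fall below zero.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Tiny made-up example values, not results from the tutorial's model.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# MSE: the mean of the squared differences.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R2: one minus the ratio of the residual sum of squares
# to the total sum of squares around the mean of y_true.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))

# Predictions that are worse than always predicting the mean give a negative R2.
r2_bad = r2_score([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(r2_bad)
```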
 

We also visualize the original and predicted data in a plot to assess them visually.

 
# Plotting the original vs predicted values
x_ax = range(len(y_test))
plt.figure(figsize=(10, 6))
plt.scatter(x_ax, y_test,  label="original")
plt.plot(x_ax, y_pred, 'r',  label="predicted")
plt.title("Original vs Predicted Values")
plt.xlabel('Index')
plt.ylabel('Value')
plt.legend(loc='best', fancybox=True, shadow=True)
plt.grid(True)
plt.show()
 

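Results are displayed as follows. Because make_regression is called without a random_state, the generated data and the exact numbers will differ from run to run.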

 
 Mean Squared Error (MSE): 175.4521517089352
 R-squared (R2) Score: 0.9728976287872141

Conclusion

In this tutorial, we've briefly learned how to train a model and make predictions on regression data using the RandomForestRegressor class from the scikit-learn API in Python.

Random Forest Regression is a versatile algorithm suitable for regression tasks. Leveraging the power of ensemble learning, it enhances prediction accuracy and generalization, making it widely utilized in various machine learning applications. The complete source code is provided below for reference.

Source code listing

 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt

# Generating synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=3)

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initializing the Random Forest Regression model
rf_regressor = RandomForestRegressor(n_estimators=120, random_state=42)

# Training the model
rf_regressor.fit(X_train, y_train)

# Making predictions on the testing data
y_pred = rf_regressor.predict(X_test)

# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Printing evaluation metrics
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2) Score: {r2}")

# Plotting the original vs predicted values
x_ax = range(len(y_test))
plt.figure(figsize=(10, 6))
plt.scatter(x_ax, y_test,  label="original")
plt.plot(x_ax, y_pred, 'r',  label="predicted")
plt.title("Original vs Predicted Values")
plt.xlabel('Index')
plt.ylabel('Value')
plt.legend(loc='best', fancybox=True, shadow=True)
plt.grid(True)
plt.show()
 

