How to Implement Grid Search Using GridSearchCV in Python

    In machine learning, model performance depends on the choice of hyperparameters, which are set before training and guide the learning process. Grid Search is a popular method for finding the best combination of hyperparameter values.

    In this tutorial, we'll learn how to use GridSearchCV to determine the optimal parameters for the AdaBoostRegressor model using the California housing dataset in Python. This tutorial will cover the following steps:

  1. Introduction to Grid Search
  2. Preparing data, base estimator, and parameters
  3. Extracting the best hyperparameters
  4. Source code listing

Let's get started.


 

Introduction to Grid Search

    Grid Search is a method used to exhaustively search for the best combination of hyperparameter values in a provided grid for a given estimator (model). It is particularly useful when we want to identify the optimal hyperparameters for a model based on a specific dataset.

   The GridSearchCV class in scikit-learn is used for hyperparameter tuning in machine learning models. It exhaustively searches over a specified grid of hyperparameter values, evaluating model performance for each combination using cross-validation. By dividing the data into training and validation sets multiple times, it identifies the best hyperparameters that optimize model performance on unseen data. GridSearchCV automates and simplifies the process of finding the optimal settings, enhancing model accuracy and robustness.
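    To make the exhaustive enumeration concrete, here is a minimal sketch (separate from the tutorial's pipeline) using scikit-learn's ParameterGrid helper, which GridSearchCV uses internally to generate every combination of values:

from sklearn.model_selection import ParameterGrid

# A toy grid with two hyperparameters, two values each
toy_grid = {'n_estimators': [50, 100], 'learning_rate': [0.01, 0.1]}

# ParameterGrid enumerates every combination: 2 x 2 = 4 candidates
for candidate in ParameterGrid(toy_grid):
    print(candidate)

GridSearchCV performs exactly this enumeration, fitting and cross-validating the model once per candidate.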
 

 

Preparing data, base estimator, and parameters

    We'll start by loading the necessary libraries for this tutorial.

 
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

   In this tutorial, we use the California housing dataset as the regression data. After loading the dataset, we extract the features (X) and the labels (y), then split the data into training and testing sets, reserving 15 percent of the dataset as the test set.

 
# Load the California housing dataset
california = fetch_california_housing()
X, y = california.data, california.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
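    As an optional sanity check, we can print the resulting shapes. The California housing dataset contains 20,640 samples with 8 features, so a 15 percent split leaves 17,544 training rows and 3,096 test rows:

# Optional: verify the split sizes
print(X_train.shape, X_test.shape)  # (17544, 8) (3096, 8)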

    We apply standard scaling to the feature data using StandardScaler(). Note that the scaler is fit on the training data only and then applied to the test data, which prevents information from the test set from leaking into training.

 
# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
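    After scaling, every training feature has zero mean and unit variance; if you want to verify this, a quick optional check:

import numpy as np

# Each column should be ~0 in mean and ~1 in standard deviation
print(np.round(X_train.mean(axis=0), 2))
print(np.round(X_train.std(axis=0), 2))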

    We define the base estimator; in this case, we use the AdaBoostRegressor class from scikit-learn, which by default boosts a shallow decision tree regressor. Candidate values for its hyperparameters (the number of estimators, the learning rate, and the loss function) are defined in the grid below.

 
# Define the AdaBoost Regressor
abreg = AdaBoostRegressor(random_state=42)

# Define the parameter grid for hyperparameter tuning
h_params = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1],
    'loss': ['linear', 'square']
}
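    This grid contains 2 x 2 x 2 = 8 candidate combinations. With the 5-fold cross-validation configured in the next step, the search will therefore perform 8 x 5 = 40 model fits, which matches the "totalling 40 fits" message in the output further below.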
 



Extracting the best hyperparameters

    Next, we initialize GridSearchCV with the estimator and the parameter grid defined above, and set the cross-validation strategy to 5-fold (cv=5). Setting n_jobs=-1 runs the candidate fits in parallel on all available CPU cores, and verbose=2 prints a progress line for each fit. To run the search on the training data, we call the fit() method.

 
# Initialize GridSearchCV with the model, grid, and cross-validation strategy
gridsearch = GridSearchCV(estimator=abreg, param_grid=h_params, cv=5, n_jobs=-1, verbose=2)
gridsearch.fit(X_train, y_train)

After fitting, we can extract the best hyperparameters and retrieve the best estimator.


# Print the best parameters
print("Best parameters found: ", gridsearch.best_params_)

# Retrieve the best estimator
best_estim = gridsearch.best_estimator_
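    GridSearchCV also exposes the mean cross-validated score of the best candidate (best_score_, which is the R2 score by default for regressors) and a full per-candidate results table (cv_results_). An optional way to inspect them, assuming pandas is installed:

import pandas as pd

# Mean cross-validated R2 of the best candidate
print("Best CV score: ", gridsearch.best_score_)

# Per-candidate results as a table
results = pd.DataFrame(gridsearch.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']])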

Using the best estimator, we can predict on the test data and evaluate performance with the mean squared error (MSE) and the R2 score.

 
# Evaluate the model on the test set
y_pred = best_estim.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R2: {r2:.2f}")

We run the code, and the output looks as follows:


Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV] END ...learning_rate=0.01, loss=linear, n_estimators=50; total time= 3.2s
[CV] END ...learning_rate=0.01, loss=linear, n_estimators=50; total time= 3.2s
[CV] END ...learning_rate=0.01, loss=linear, n_estimators=50; total time= 3.2s
[CV] END ...learning_rate=0.01, loss=linear, n_estimators=50; total time= 3.2s
[CV] END ...learning_rate=0.01, loss=square, n_estimators=50; total time= 3.2s
[CV] END ...learning_rate=0.01, loss=linear, n_estimators=50; total time= 3.3s
[CV] END ...learning_rate=0.01, loss=square, n_estimators=50; total time= 3.3s
[CV] END ..learning_rate=0.01, loss=linear, n_estimators=100; total time= 6.5s
[CV] END ..learning_rate=0.01, loss=linear, n_estimators=100; total time= 6.5s
[CV] END ..learning_rate=0.01, loss=linear, n_estimators=100; total time= 6.5s
[CV] END ...learning_rate=0.01, loss=square, n_estimators=50; total time= 3.2s
....
 
[CV] END ...learning_rate=0.1, loss=linear, n_estimators=100; total time= 5.8s
[CV] END ...learning_rate=0.1, loss=linear, n_estimators=100; total time= 5.8s
[CV] END ...learning_rate=0.1, loss=linear, n_estimators=100; total time= 5.8s
[CV] END ...learning_rate=0.1, loss=square, n_estimators=100; total time= 4.2s
[CV] END ...learning_rate=0.1, loss=square, n_estimators=100; total time= 4.2s
[CV] END ...learning_rate=0.1, loss=square, n_estimators=100; total time= 4.3s
[CV] END ...learning_rate=0.1, loss=square, n_estimators=100; total time= 4.3s
Best parameters found: {'learning_rate': 0.1, 'loss': 'linear', 'n_estimators': 50}
MSE: 0.57
R2: 0.56
 


   In this tutorial, we've learned the concept of grid search and how to implement it with the GridSearchCV class for regression data.



Source code listing

 
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler


# Load the California housing dataset
california = fetch_california_housing()
X, y = california.data, california.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define the AdaBoost Regressor
abreg = AdaBoostRegressor(random_state=42)

# Define the parameter grid for hyperparameter tuning
h_params = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1],
    'loss': ['linear', 'square']
}

# Initialize GridSearchCV with the model, grid, and cross-validation strategy
gridsearch = GridSearchCV(estimator=abreg, param_grid=h_params, cv=5, n_jobs=-1, verbose=2)

gridsearch.fit(X_train, y_train)

# Print the best parameters
print("Best parameters found: ", gridsearch.best_params_)

# Retrieve the best estimator
best_estim = gridsearch.best_estimator_

# Evaluate the model on the test set
y_pred = best_estim.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R2: {r2:.2f}")
 

