How to Use GridSearchCV in Python


   GridSearchCV exhaustively searches a grid of candidate parameter values for the combination that performs best. It takes a target estimator (model) and the parameters to search, evaluates every combination by cross-validation, and exposes the best-scoring parameters so they can be applied to a predictive model. This makes it useful when we need to tune a model for a particular dataset.
   In this article, we'll learn how to use scikit-learn's GridSearchCV class to find the best parameters of an AdaBoostRegressor model for the Boston housing-price dataset in Python. The tutorial covers:
  1. Preparing data, base estimator, and parameters
  2. Fitting the model and getting the best estimator
  3. Prediction and accuracy check
  4. Source code listing
We'll start by loading the required modules.

from sklearn.datasets import load_boston
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error, make_scorer, r2_score
import matplotlib.pyplot as plt




Preparing data, base estimator, and parameters

   We use the Boston house-price dataset as regression data in this tutorial. Note that load_boston was removed in scikit-learn 1.2, so this code requires an earlier release. After loading the dataset, we'll first separate it into x (features) and y (labels), then split it into train and test parts. Here, we'll hold out 15 percent of the dataset as test data.

boston = load_boston()
x, y = boston.data, boston.target
xtrain, xtest, ytrain, ytest=train_test_split(x, y, test_size=0.15)
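
The split above is shuffled randomly, so the scores reported later will vary from run to run. A minimal sketch of a reproducible split follows. It uses a placeholder array with the same shape as the Boston data (506 rows, 13 features) rather than the real dataset, and the names x_demo, y_demo and the value random_state=42 are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the Boston dataset's shape (506 samples, 13 features).
x_demo = np.arange(506 * 13, dtype=float).reshape(506, 13)
y_demo = np.arange(506, dtype=float)

# random_state fixes the shuffle, so repeated calls give the same split.
xtr, xte, ytr, yte = train_test_split(x_demo, y_demo, test_size=0.15, random_state=42)
xtr2, xte2, ytr2, yte2 = train_test_split(x_demo, y_demo, test_size=0.15, random_state=42)

print(xtr.shape, xte.shape)        # sizes of the train and test parts
print(np.array_equal(xte, xte2))   # True: both calls selected the same test rows
```

With 506 rows and test_size=0.15, the test part holds 76 rows and the train part 430.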

As a base estimator, we'll use AdaBoostRegressor.

abreg = AdaBoostRegressor()

Next, we need to provide the parameters to search for this estimator. The full parameter list of the AdaBoostRegressor class is available in the scikit-learn documentation. We create a params dictionary that maps each parameter name to the list of values to try.

params = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.05, 0.1, 0.5],
    'loss': ['linear', 'square', 'exponential']
}
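
To get a feel for how much work this grid implies, the sketch below expands it with scikit-learn's ParameterGrid helper and counts the candidates. The fold count of 5 is an assumption matching the cv value used later in this tutorial.

```python
from sklearn.model_selection import ParameterGrid

params = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.05, 0.1, 0.5],
    'loss': ['linear', 'square', 'exponential'],
}

grid = ParameterGrid(params)
n_candidates = len(grid)     # 2 * 4 * 3 = 24 combinations
n_fits = n_candidates * 5    # one fit per candidate per fold, assuming cv=5

print(n_candidates, n_fits)
for candidate in list(grid)[:3]:
    print(candidate)         # each candidate is a dict of parameter values
```

Because the count is the product of the list lengths, each extra value or parameter multiplies the search time. With refit enabled, GridSearchCV performs one more fit at the end to train the best candidate on the full data it was given.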


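We can also pass a scoring parameter to GridSearchCV, as follows. By default, it uses the estimator's own score method, which is the R-squared metric for regressors. Because GridSearchCV always selects the candidate with the highest score, an error metric such as MSE must be wrapped with greater_is_better=False so that its sign is flipped.

score = make_scorer(mean_squared_error, greater_is_better=False)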
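
GridSearchCV treats every score as "higher is better", so an error metric has to be negated before it can drive model selection. The sketch below illustrates this on small synthetic data with a LinearRegression model. Both the data and the model are stand-ins chosen only to keep the example fast; they are not part of this tutorial's pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error

# Small synthetic regression problem (illustrative only).
rng = np.random.RandomState(0)
X_demo = rng.rand(40, 2)
y_demo = X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=40)

model = LinearRegression().fit(X_demo, y_demo)

# greater_is_better=False makes the scorer return the negated MSE.
neg_mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

plain_mse = mean_squared_error(y_demo, model.predict(X_demo))
scored = neg_mse_scorer(model, X_demo, y_demo)

print(plain_mse)   # a positive error value
print(scored)      # the same magnitude with a negative sign
```

The built-in scoring string 'neg_mean_squared_error' gives the same behaviour without calling make_scorer.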


Fitting the model and getting the best estimator

Next, we'll define the GridSearchCV model with the above estimator and parameters. We'll set the number of cross-validation folds to 5 and fit it on the training data, so that the test data stays unseen during the search.

gridsearch = GridSearchCV(abreg, params, cv=5, return_train_score=True)
gridsearch.fit(xtrain, ytrain)
GridSearchCV(cv=5, error_score='raise',
       estimator=AdaBoostRegressor(base_estimator=None, learning_rate=1.0, 
       loss='linear', n_estimators=50, random_state=None),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [50, 100], 
                   'learning_rate': [0.01, 0.05, 0.1, 0.5], 
                   'loss': ['linear', 'square', 'exponential']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0) 
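
After fitting, the per-candidate results are stored in the cv_results_ attribute. The sketch below shows one way to list the candidates by rank. To keep it quick and self-contained, it runs on synthetic data from make_regression with a deliberately small grid; the data, the grid values, and the names search and small_params are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the example runs in a moment.
X_demo, y_demo = make_regression(n_samples=80, n_features=4, noise=5.0, random_state=0)

small_params = {'n_estimators': [5, 10], 'learning_rate': [0.1, 0.5]}
search = GridSearchCV(AdaBoostRegressor(random_state=0), small_params, cv=3)
search.fit(X_demo, y_demo)

results = search.cv_results_
order = np.argsort(results['rank_test_score'])   # best candidate first
for i in order:
    print(results['rank_test_score'][i],
          round(results['mean_test_score'][i], 3),
          results['params'][i])
```

The mean_test_score column is the score averaged over the folds, and best_score_ is simply this value for the top-ranked candidate.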


If you want to change the scoring method, you can also set the scoring parameter.

gridsearch=GridSearchCV(abreg,params,scoring=score,cv=5,return_train_score=True)

After fitting the model, we can read the best parameters and the best score, which is the mean cross-validated score of the best candidate.

print(gridsearch.best_params_)
{'learning_rate': 0.5, 'loss': 'exponential', 'n_estimators': 50}

print(gridsearch.best_score_)
0.5913769411856192 


Now, we can get the best estimator from the grid search result and store it as best_estim for further use.

best_estim=gridsearch.best_estimator_
print(best_estim)
AdaBoostRegressor(base_estimator=None, learning_rate=0.5, loss='exponential',
         n_estimators=50, random_state=None)
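
With the default refit=True, GridSearchCV retrains the best candidate on all the data passed to fit, so best_estimator_ is ready to predict straight away. The sketch below checks this on synthetic data; the data and the tiny grid are assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (illustrative only).
X_demo, y_demo = make_regression(n_samples=80, n_features=4, noise=5.0, random_state=1)

search = GridSearchCV(
    AdaBoostRegressor(random_state=0),
    {'n_estimators': [5, 10]},
    cv=3,
)  # refit=True is the default
search.fit(X_demo, y_demo)

# The search object forwards predict() to the refitted best estimator.
via_search = search.predict(X_demo[:5])
via_best = search.best_estimator_.predict(X_demo[:5])
print(np.allclose(via_search, via_best))
```

Calling predict on the search object or on best_estimator_ therefore gives the same result.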



Prediction and accuracy check

   We've extracted the best estimator, and now we can use it as a predictive model. We'll fit it again on the training data and check the accuracy metrics.

best_estim.fit(xtrain,ytrain)

ytr_pred=best_estim.predict(xtrain)
mse = mean_squared_error(ytrain, ytr_pred)
r2 = r2_score(ytrain, ytr_pred)
print("MSE: %.2f" % mse)
MSE: 7.54
print("R2: %.2f" % r2)
R2: 0.89

Next, we'll predict test data and check the accuracy metrics.

ypred=best_estim.predict(xtest)
mse = mean_squared_error(ytest, ypred)
r2 = r2_score(ytest, ypred)
print("MSE: %.2f" % mse)
MSE: 11.51
print("R2: %.2f" % r2)
R2: 0.85 
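
Argument order matters for these metrics: scikit-learn expects the true values first and the predictions second. MSE is symmetric, so swapping the arguments makes no difference, but R-squared is not. The sketch below shows this with a small hand-made example; the numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([4.0, 5.0, 6.0, 7.0])

# MSE is the mean of squared differences, so the order does not matter.
mse_a = mean_squared_error(y_true, y_hat)
mse_b = mean_squared_error(y_hat, y_true)

# R2 = 1 - SS_res / SS_tot, where SS_tot is computed from the FIRST argument.
r2_correct = r2_score(y_true, y_hat)
r2_swapped = r2_score(y_hat, y_true)

print(mse_a, mse_b)             # 1.5 1.5
print(round(r2_correct, 2))     # 0.7
print(round(r2_swapped, 2))     # -0.2
```

Here the squared residuals sum to 6, while the total sum of squares is 20 around the mean of y_true but only 5 around the mean of y_hat, which is why the swapped call reports a very different value.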



Finally, we'll visualize the results in a plot.

x_ax = range(len(ytest))
plt.scatter(x_ax, ytest, s=5, color="blue", label="original")
plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
plt.legend()
plt.show()



   In this article, we've briefly learned the grid search method with the GridSearchCV class and applied it to regression data in Python. The full source code is listed below. Thank you for reading!


Source code listing


 
from sklearn.datasets import load_boston
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error, make_scorer, r2_score
import matplotlib.pyplot as plt

boston = load_boston()
x, y = boston.data, boston.target
xtrain, xtest, ytrain, ytest=train_test_split(x, y, test_size=0.15)

abreg = AdaBoostRegressor()
params = {
 'n_estimators': [50, 100],
 'learning_rate' : [0.01, 0.05, 0.1, 0.5],
 'loss' : ['linear', 'square', 'exponential']
 }

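# Optional custom scorer; greater_is_better=False negates MSE so higher is better.
score = make_scorer(mean_squared_error, greater_is_better=False)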

gridsearch = GridSearchCV(abreg, params, cv=5, return_train_score=True)
gridsearch.fit(xtrain, ytrain)
print(gridsearch.best_params_)

best_estim=gridsearch.best_estimator_
print(best_estim)

best_estim.fit(xtrain,ytrain)

ytr_pred=best_estim.predict(xtrain)
mse = mean_squared_error(ytrain, ytr_pred)
r2 = r2_score(ytrain, ytr_pred)
print("MSE: %.2f" % mse)
print("R2: %.2f" % r2)

ypred=best_estim.predict(xtest)
mse = mean_squared_error(ytest, ypred)
r2 = r2_score(ytest, ypred)
print("MSE: %.2f" % mse)
print("R2: %.2f" % r2)

x_ax = range(len(ytest))
plt.scatter(x_ax, ytest, s=5, color="blue", label="original")
plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
plt.legend()
plt.show()
 

