DataTechNotes: Ridge Regression Example in Python

Ridge method applies L2 regularization to reduce overfitting in the regression model. In this post, we'll learn how to use sklearn's Ridge and RidgCV classes for regression analysis in Python. The tutorial covers:

Preparing data
Best alpha
Fitting the model and checking the results
Cross-validation with RidgeCV
Source code listing

We'll start by loading the required libraries.

from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

Preparing data

We use Boston house-price dataset as regression dataset in this tutorial. After loading the dataset, first, we'll separate data into x - feature and y - label. Then we'll split them into the train and test parts. Here, I'll extract 15 percent of the dataset as test data.

boston = load_boston()
x, y = boston.data, boston.target
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)

Best alpha

Alpha is an important factor in regularization. It defines Ridge shrinkage or regularization strength. The higher value means the stronger regularization. We don't know which value works efficiently for our regularization method. Thus we'll figure out the best alpha value by checking the model accuracy with setting multiple alpha values.

alphas = [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1,0.5, 1]

We can define Ridge model by setting alfa and fit it with x, y data. Then we check the R-squared, MSE, RMSE values for each alpha.

for a in alphas:
 model = Ridge(alpha=a, normalize=True).fit(x,y) 
 score = model.score(x, y)
 pred_y = model.predict(x)
 mse = mean_squared_error(y, pred_y) 
 print("Alpha:{0:.6f}, R2:{1:.3f}, MSE:{2:.2f}, RMSE:{3:.2f}"
    .format(a, score, mse, np.sqrt(mse)))

Alpha:0.000001, R2:0.741, MSE:21.90, RMSE:4.68
Alpha:0.000010, R2:0.741, MSE:21.90, RMSE:4.68
Alpha:0.000100, R2:0.741, MSE:21.90, RMSE:4.68
Alpha:0.001000, R2:0.741, MSE:21.90, RMSE:4.68
Alpha:0.010000, R2:0.740, MSE:21.92, RMSE:4.68
Alpha:0.100000, R2:0.732, MSE:22.66, RMSE:4.76
Alpha:0.500000, R2:0.686, MSE:26.49, RMSE:5.15
Alpha:1.000000, R2:0.635, MSE:30.81, RMSE:5.55

The result shows that alpha with a 0.01 is the best value we can use.

Fitting the model and checking the results

Next, we'll define the Ridge model again with alpha 0.01 values and fit it with xtrain and ytrain data, then we'll predict the xtest data and check the prediction accuracy.

ridge_mod=Ridge(alpha=0.01, normalize=True).fit(xtrain,ytrain)
ypred = ridge_mod.predict(xtest)
score = model.score(xtest,ytest)
mse = mean_squared_error(ytest,ypred)
print("R2:{0:.3f}, MSE:{1:.2f}, RMSE:{2:.2f}"
   .format(score, mse,np.sqrt(mse)))

R2:0.691, MSE:15.56, RMSE:3.95

Finally, we'll visualize the result.

x_ax = range(len(xtest))
plt.scatter(x_ax, ytest, s=5, color="blue", label="original")
plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
plt.legend()
plt.show()

Cross-validation with RidgeCV

RidgeCV is built-in cross-validation class. In this model, we can set all alpha values and get the efficient alpha value in a set.

ridge_cv=RidgeCV(alphas=alphas, store_cv_values=True)
ridge_mod = ridge_cv.fit(xtrain,ytrain)
print(ridge_mod.alpha_)

0.01

print(np.mean(ridge_mod.cv_values_, axis=0))

[25.38818446 25.388184   25.38817941 25.388134   25.387734   25.38842764
 25.44565372 25.54571739]

Now, we can predict test data and check the accuracy.

ypred = ridge_mod.predict(xtest)
score = ridge_mod.score(xtest,ytest)
mse = mean_squared_error(ytest,ypred)
print("R2:{0:.3f}, MSE:{1:.2f}, RMSE:{2:.2f}"
   .format(score, mse, np.sqrt(mse)))

R2:0.814, MSE:15.49, RMSE:3.94

We can also plot the result

x_ax = range(len(xtest))
plt.scatter(x_ax, ytest, s=5, color="blue", label="original")
plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
plt.legend()
plt.show()

In this post, we've briefly learned how to use Ridge and RidgeCV classes for regression data analysis in Python. The full source code is listed below. Thank you for reading!

Source code listing

from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

boston = load_boston()
x, y = boston.data, boston.target
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)

alphas = [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1,0.5, 1]
for a in alphas:
 model = Ridge(alpha=a, normalize=True).fit(x,y) 
 score = model.score(x, y)
 pred_y = model.predict(x)
 mse = mean_squared_error(y, pred_y) 
 print("Alpha:{0:.6f}, R2:{1:.3f}, MSE:{2:.2f}, RMSE:{3:.2f}"
    .format(a, score, mse, np.sqrt(mse)))

ridge_mod=Ridge(alpha=0.01, normalize=True).fit(xtrain,ytrain)
ypred = ridge_mod.predict(xtest)
score = model.score(xtest,ytest)
mse = mean_squared_error(ytest,ypred)
print("R2:{0:.3f}, MSE:{1:.2f}, RMSE:{2:.2f}"
   .format(score, mse,np.sqrt(mse)))

x_ax = range(len(xtest))
plt.scatter(x_ax, ytest, s=5, color="blue", label="original")
plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
plt.legend()
plt.show()

# RidgeCV method
ridge_cv=RidgeCV(alphas=alphas, store_cv_values=True)
ridge_mod = ridge_cv.fit(xtrain,ytrain)
print(ridge_mod.alpha_)
print(np.mean(ridge_mod.cv_values_, axis=0))

ypred = ridge_mod.predict(xtest)
score = ridge_mod.score(xtest,ytest)
mse = mean_squared_error(ytest,ypred)
print("R2:{0:.3f}, MSE:{1:.2f}, RMSE:{2:.2f}"
   .format(score, mse, np.sqrt(mse)))

x_ax = range(len(xtest))
plt.scatter(x_ax, ytest, s=5, color="blue", label="original")
plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
plt.legend()
plt.show()

DataTechNotes

Pages

Ridge Regression Example in Python

4 comments: