Lasso Regression Example with R

   LASSO (Least Absolute Shrinkage and Selection Operator) is a regularization method that helps prevent overfitting in a model. It adds an L1-norm penalty, the sum of the absolute values of the coefficients, to the loss function. The penalty shrinks the coefficients and pushes those with little predictive value to exactly zero, which reduces model complexity and performs variable selection.
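   Formally, for a linear model the lasso coefficients minimize the penalized least-squares criterion

   \min_{\beta} \sum_{i=1}^{n} \big( y_i - x_i^\top \beta \big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|

   where the tuning parameter \lambda \ge 0 controls how strongly the coefficients are shrunk; \lambda = 0 gives ordinary least squares.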
   In this post, we'll briefly learn how to apply Lasso regularization in R. The 'glmnet' package provides the regularized fitting functions we need. The tutorial covers:
  1. Preparing the data
  2. Defining the model
  3. Predicting test data
  4. Source code listing
We'll start by loading the required libraries in R.
 
library(glmnet)
library(caret)

Preparing the data

We'll use the Boston housing dataset in this tutorial. We'll load it and split it into train and test parts.

set.seed(123)
boston = MASS::Boston
indexes = createDataPartition(boston$medv, p=.85, list=F)
train = boston[indexes, ]
test =  boston[-indexes, ]

You can inspect the dataset with the 'str(boston)' command. Here, the 'medv' variable is the y (label) column and the remaining columns are the x (feature) data.
Now, we'll separate the x and y parts of the train and test data. The x input must be a matrix for glmnet, so we'll convert it.

xtrain = as.matrix(train)[,-14] 
ytrain = train[,14]
 
xtest = as.matrix(test)[,-14] 
ytest = test[,14]


Defining the model

Next, we'll find the lambda value, which controls the amount of shrinkage, using glmnet's cross-validation function. We'll run cv.glmnet() with alpha=1, which selects the Lasso penalty.

lasso_cv = cv.glmnet(xtrain, ytrain, family="gaussian", alpha=1)

We can check the coefficients. Note that coef() on a cv.glmnet object reports them at lambda.1se (the largest lambda within one standard error of the minimum) by default, so some coefficients are shrunk to exactly zero.

coef(lasso_cv)
14 x 1 sparse Matrix of class "dgCMatrix"
                        1
(Intercept) 21.7893929324
crim        -0.0122937347
zn           0.0098930474
indus        .           
chas         2.5311725502
nox         -5.7841042218
rm           3.9974018400
age          .           
dis         -0.4873179429
rad          .           
tax         -0.0003229633
ptratio     -0.8262576443
black        0.0068882072
lstat       -0.5161864470
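
The coefficients shown as '.' have been shrunk to exactly zero. If you only want the predictors that the lasso keeps, a small check like the following (an optional addition, not part of the original listing) extracts the nonzero rows:

cf = as.matrix(coef(lasso_cv))      # convert the sparse coefficient matrix to a regular matrix
cf[cf[, 1] != 0, , drop = FALSE]    # keep only the nonzero (selected) coefficients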

The cross-validation result can also be plotted (mean cross-validated error against log(lambda)).

plot(lasso_cv)


From the above cross-validation result, we'll extract the best lambda value, lambda.min, the lambda that minimizes the cross-validated error.

best_lambda = lasso_cv$lambda.min
cat(best_lambda)
0.01014079 
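
As an alternative, cv.glmnet also reports lambda.1se, the largest lambda whose cross-validated error is within one standard error of the minimum; it gives a sparser model and can be used instead of lambda.min if you prefer a simpler fit:

cat(lasso_cv$lambda.1se)   # larger penalty, fewer nonzero coefficients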

Next, we'll fit the model again using the best lambda value.

lasso_mod = glmnet(xtrain, ytrain, family = "gaussian", 
                   alpha = 1, lambda = best_lambda)

We'll check the coefficients again. With the smaller lambda.min penalty, none of the coefficients are shrunk all the way to zero in this run.

coef(lasso_mod)
14 x 1 sparse Matrix of class "dgCMatrix"
                       s0
(Intercept)  37.021345010
crim         -0.086726277
zn            0.050935275
indus         0.011735446
chas          3.073769966
nox         -18.075419693
rm            3.676920080
age           0.004979978
dis          -1.446401516
rad           0.270290914
tax          -0.011341606
ptratio      -0.940283913
black         0.008557049
lstat        -0.521339466


Predicting test data

Finally, we'll predict the test data with the trained model and evaluate the accuracy with the MSE, MAE, RMSE, and R-squared metrics.

yhat = predict(lasso_mod, xtest)
 
mse = mean((ytest - yhat) ^ 2)
mae = MAE(ytest, yhat)
rmse = RMSE(ytest, yhat)
r2 = R2(ytest, yhat, form = "traditional")
 
cat(" MAE:", mae, "\n", "MSE:", mse, "\n", 
    "RMSE:", rmse, "\n", "R-squared:", r2)
 MAE: 3.889155 
 MSE: 24.04217 
 RMSE: 4.903281 
 R-squared: 0.5536334
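
If you prefer not to depend on caret's helper functions, the same metrics can be computed by hand with the standard definitions (a quick sketch; it treats ytest as the observed values, so the R-squared may differ slightly from caret's output above):

# manual metric computations in base R
mae_manual  = mean(abs(ytest - yhat))
rmse_manual = sqrt(mean((ytest - yhat)^2))
r2_manual   = 1 - sum((ytest - yhat)^2) / sum((ytest - mean(ytest))^2)
cat(mae_manual, rmse_manual, r2_manual)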

We can visualize the results in a plot.

x = 1:length(ytest)
plot(x, ytest, ylim=range(c(ytest, yhat)), pch=20, col="red")
lines(x, yhat, lwd=1, col="blue")
legend("topleft", legend=c("medv", "pred-medv"),
       col=c("red", "blue"), lty=1, cex=0.8, lwd=1, bty='n')

   In this tutorial, we've briefly learned how to use the glmnet Lasso method to fit a regression model and predict test data in R. The full source code is listed below.


Source code listing

library(glmnet)
library(caret)
 
# load the Boston housing data and split it into train (85%) and test parts
set.seed(123)
boston = MASS::Boston
indexes = createDataPartition(boston$medv, p=.85, list=F)
train = boston[indexes, ]
test =  boston[-indexes, ]
 
# separate the x (feature) and y (medv label) parts; glmnet needs x as a matrix
xtrain = as.matrix(train)[,-14] 
ytrain = train[,14]
 
xtest = as.matrix(test)[,-14] 
ytest = test[,14]
 
# cross-validate to select lambda (alpha=1 gives the Lasso penalty)
lasso_cv = cv.glmnet(xtrain, ytrain, family="gaussian", alpha=1)
 
coef(lasso_cv)
plot(lasso_cv) 
 
# lambda that minimizes the cross-validated error
best_lambda = lasso_cv$lambda.min
cat(best_lambda)
 
# refit the model with the selected lambda
lasso_mod = glmnet(xtrain, ytrain, family = "gaussian", 
                   alpha = 1, lambda = best_lambda)
 
coef(lasso_mod) 
 
# predict the test data and compute accuracy metrics
yhat = predict(lasso_mod, xtest)
 
mse = mean((ytest - yhat) ^ 2)
mae = MAE(ytest, yhat)
rmse = RMSE(ytest, yhat)
r2 = R2(ytest, yhat, form = "traditional")
 
cat(" MAE:", mae, "\n", "MSE:", mse, "\n", 
    "RMSE:", rmse, "\n", "R-squared:", r2)
 
# plot the original and predicted medv values
x = 1:length(ytest)
plot(x, ytest, ylim=range(c(ytest, yhat)), pch=20, col="red")
lines(x, yhat, lwd=1, col="blue")
legend("topleft", legend=c("medv", "pred-medv"),
       col=c("red", "blue"), lty=1, cex=0.8, lwd=1, bty='n') 

