Regression with Generalized Additive Model (GAM) in R


   A Generalized Additive Model (GAM) extends the linear model by letting some predictors enter through smooth, possibly nonlinear functions while the rest stay linear. In this tutorial, we'll briefly learn how to fit regression data with the gam function in R. The 'mgcv' package provides the 'gam' fitting function. The post covers:
  1. Preparing data
  2. GAM fitting and predicting
  3. Source code listing
We'll start by loading the required libraries.

library(mgcv)
library(corrplot) 



Preparing data

   In this tutorial, we'll use the Boston housing dataset from the 'MASS' package as the regression dataset. In a gam formula, we decide which predictors enter as smooth terms; the smoothing parameters themselves are estimated during fitting. Thus, first, we'll identify the features most strongly correlated with the target variable 'medv'. We can check the correlations between all the variables as follows.

boston = MASS::Boston

cors = cor(boston)
corrplot(cors, method="number")


The correlation matrix shows that the rm, lstat, ptratio, and indus features have the strongest correlations with the medv variable (positive for rm, negative for the other three). These are natural candidates for smooth terms in gam.
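
With 14 variables, the numbers in the correlation plot can be hard to read, so it may help to rank the features by the absolute value of their correlation with medv. The snippet below is a minimal sketch of that idea; the names medv_cors and ranked are ours and not part of any package.

```r
# Load the data and compute the full correlation matrix, as above
boston <- MASS::Boston
cors <- cor(boston)

# Keep only the correlations with the target, excluding medv itself
medv_cors <- cors["medv", colnames(cors) != "medv"]

# Order by absolute correlation, strongest first
ranked <- medv_cors[order(abs(medv_cors), decreasing = TRUE)]

# Show the four features most strongly correlated with medv
print(round(head(ranked, 4), 2))
```

The sign is kept in the printed values, so we can still see whether each feature moves with or against medv.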



GAM fitting and predicting

Now, we can define the gam model and fit it to the Boston dataset. Here, we wrap 'rm' and 'lstat' in s() so that they enter as smooth terms. We also add ptratio, indus, crim, zn, and age as ordinary linear terms.

bgam=gam(medv~s(rm)+s(lstat)+ptratio+indus+crim+zn+age, data=boston)
summary(bgam) 
 
Family: gaussian 
Link function: identity 

Formula:
medv ~ s(rm) + s(lstat) + ptratio + indus + crim + zn + age

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 31.54656    2.08187  15.153  < 2e-16 ***
ptratio     -0.52355    0.10112  -5.178 3.29e-07 ***
indus        0.00383    0.04052   0.095   0.9247    
crim        -0.13005    0.02498  -5.207 2.84e-07 ***
zn          -0.01682    0.01065  -1.579   0.1150    
age          0.01848    0.01055   1.751   0.0806 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
           edf Ref.df     F p-value    
s(rm)    6.514  7.680 24.30  2e-16 ***
s(lstat) 6.272  7.451 34.29  2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.807   Deviance explained = 81.4%
GCV = 16.948  Scale est. = 16.319    n = 506
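
In the summary, the edf (effective degrees of freedom) column tells us how wiggly each smooth is: a value close to 1 means the term is nearly linear, while the values of about 6.5 and 6.3 here point to clearly nonlinear effects of rm and lstat. The sketch below shows one way to pull these numbers out of the summary object and to plot the estimated smooths. It refits the same model so that it can run on its own; the variable name s is ours.

```r
library(mgcv)

boston <- MASS::Boston
bgam <- gam(medv ~ s(rm) + s(lstat) + ptratio + indus + crim + zn + age,
            data = boston)

# Store the summary so that its parts can be accessed by name
s <- summary(bgam)

# Table of smooth terms: edf, Ref.df, F and p-value
print(s$s.table)

# Adjusted R-squared and proportion of deviance explained
print(s$r.sq)
print(s$dev.expl)

# Draw both smooths on one page with shaded confidence bands
plot(bgam, pages = 1, shade = TRUE)
```

The plot shows each smooth on the scale of the linear predictor, which makes the nonlinear shape of the rm and lstat effects easy to see.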


Next, we'll predict medv for the Boston data with the fitted model. Column 14 is the target 'medv', so we drop it from newdata.
 
pred = predict(bgam, newdata = boston[,-14])
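
To put a number on how close the predictions are, we can compute simple error measures such as RMSE and MAE. The following is a minimal sketch that repeats the fit so it is self-contained; the names actual, rmse, and mae are ours. Note that these are in-sample errors, because the model is evaluated on the same rows it was fitted to.

```r
library(mgcv)

boston <- MASS::Boston
bgam <- gam(medv ~ s(rm) + s(lstat) + ptratio + indus + crim + zn + age,
            data = boston)

# Predict on the training rows, without the target column
pred <- predict(bgam, newdata = boston[, -14])
actual <- boston$medv

# Root mean squared error and mean absolute error
rmse <- sqrt(mean((actual - pred)^2))
mae <- mean(abs(actual - pred))

cat("RMSE:", round(rmse, 3), " MAE:", round(mae, 3), "\n")
```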

We can compare the predicted values with the original ones by visualizing them in a plot.
 
x = 1:nrow(boston)
plot(x, boston$medv, col="blue", type = "l")
lines(x, pred, col="red", type = "l" )
legend("bottomleft", legend=c("y-fitted", "y-original"),
        col=c("red", "blue"), lty=1, cex=0.7)
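
The comparison above uses the same rows for fitting and predicting, so it tends to look better than the model would on unseen data. As a rough check, we can hold out part of the data. The sketch below assumes a random 80/20 split with an arbitrary seed; the names test_idx, train, test, fit, test_pred, and test_rmse are ours.

```r
library(mgcv)

boston <- MASS::Boston

# Randomly hold out about 20% of the rows for testing (seed is arbitrary)
set.seed(123)
test_idx <- sample(nrow(boston), size = round(0.2 * nrow(boston)))
train <- boston[-test_idx, ]
test <- boston[test_idx, ]

# Fit the same formula on the training rows only
fit <- gam(medv ~ s(rm) + s(lstat) + ptratio + indus + crim + zn + age,
           data = train)

# Predict the held-out rows and compute the test RMSE
test_pred <- predict(fit, newdata = test)
test_rmse <- sqrt(mean((test$medv - test_pred)^2))

cat("Test RMSE:", round(test_rmse, 3), "\n")
```

The result depends on the split, so a different seed will give a somewhat different test RMSE.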
 


   In this tutorial, we've briefly learned how to use a GAM for a regression problem in R. The full source code is listed below.


Source code listing 
 
library(mgcv)
library(corrplot) 
boston = MASS::Boston

cors = cor(boston)
corrplot(cors, method="number")
 
bgam=gam(medv~s(rm)+s(lstat)+ptratio+indus+crim+zn+age, data=boston)
summary(bgam)
 
pred = predict(bgam, newdata = boston[,-14])
 
x = 1:nrow(boston)
plot(x, boston$medv, col="blue", type = "l")
lines(x, pred, col="red", type = "l" )
legend("bottomleft", legend=c("y-fitted", "y-original"),
        col=c("red", "blue"), lty=1, cex=0.7) 
 