Linear Regression Example with R

   Linear regression is a widely used statistical technique for modeling the relationship between a dependent variable and one or more independent variables. In this tutorial, we'll learn how to perform linear regression in R using the lm() function and evaluate the model's performance. 
    The tutorial covers:
  1. Introduction to linear regression
  2. Preparing the data
  3. Fitting the model
  4. Accuracy check
  5. Source code listing


Introduction to linear regression

    Linear regression models the relationship between a dependent variable (target) and one or more independent variables (predictors) by fitting a linear equation to the observed data.

    In simple linear regression, there is only one independent variable, while in multiple linear regression, there are several. The simple linear equation can be expressed as:

y = mx + b

Where:

  • y is the dependent variable
  • x is the independent variable
  • m is the slope of the line
  • b is the y-intercept

The parameters m and b are estimated from the data using various techniques such as the method of least squares. Once the parameters are estimated, the linear equation can be used to predict the value of the dependent variable y for new values of the independent variable x.
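
    To make the estimation step concrete, below is a minimal sketch that fits the line y = mx + b by least squares with lm(). The x and y values are made up purely for illustration:

# Fit a simple linear regression on made-up data (values are illustrative only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)   # least-squares estimates of b (intercept) and m (slope)
coef(fit)          # named vector: (Intercept) and x

# Use the fitted line to predict y for new x values
predict(fit, newdata = data.frame(x = c(6, 7)))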

 
Preparing the data

    
We'll start by loading the required libraries for this tutorial.

 
# Load libraries
library(caret)

 
    In this tutorial, we'll use the Boston housing price dataset, which ships with the MASS package, for regression analysis. We'll begin by preparing the data, which involves splitting it into training and testing sets with caret's createDataPartition() function. You can examine the structure of the dataset with the str() command. In this dataset, medv is the target variable (output or label), while the remaining variables serve as input features (predictors).
 

# Load the Boston dataset
boston <- MASS::Boston
 
str(boston)

# Set seed for reproducibility
set.seed(123)

# Split the data into training and testing sets
indexes <- createDataPartition(boston$medv, p = 0.85, list = FALSE)
train <- boston[indexes, ]
test <- boston[-indexes, ]

 
 The str() function displays the structure of an R object.
 
 
'data.frame': 506 obs. of 14 variables:
 $ crim   : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int 0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num 6.58 6.42 7.18 7 7.15 ...
 $ age    : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num 4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int 1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num 296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ black  : num 397 397 393 395 397 ...
 $ lstat  : num 4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
 


Fitting the model

    In the model fitting step, we use R's lm() function to fit a linear regression model to the training data.

    The formula medv ~ . specifies the linear regression model. In this formula, medv represents the dependent variable (also known as the response or target variable), and . denotes all other variables in the dataset (excluding medv). This notation instructs R to utilize all other variables in the training dataset as predictors to forecast medv.

    The argument data = train specifies the dataset from which the variables are derived. In this context, train refers to the training dataset containing both the dependent variable (medv) and the independent variables utilized for prediction.
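
    As a side note, the dot in medv ~ . is just shorthand; you can also name predictors explicitly, which is handy when you only want a subset of them. A minimal sketch (the choice of crim, rm, and lstat here is arbitrary, purely to illustrate the formula syntax):

# Explicitly named predictors instead of the "." shorthand
# (crim, rm, and lstat are chosen only for illustration)
subset_model <- lm(medv ~ crim + rm + lstat, data = train)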
 

# Fit linear regression model
model <- lm(medv ~ ., data = train)

# Print model summary
summary <- summary(model)
print(summary)

    The summary() function reports the estimated coefficients, their standard errors and significance levels, and overall fit statistics such as R-squared.

 
Call:
lm(formula = medv ~ ., data = train)

Residuals:
     Min       1Q   Median       3Q      Max
-15.3499  -2.7247  -0.4358   1.7651  26.7350

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  33.191930   5.294047   6.270 9.03e-10 ***
crim         -0.082117   0.036324  -2.261 0.024294 *
zn            0.044211   0.014316   3.088 0.002148 **
indus         0.019038   0.063700   0.299 0.765189
chas          2.347720   0.894673   2.624 0.009006 **
nox         -17.763468   3.906448  -4.547 7.13e-06 ***
rm            4.010790   0.455132   8.812  < 2e-16 ***
age           0.012242   0.014162   0.864 0.387866
dis          -1.327688   0.204758  -6.484 2.52e-10 ***
rad           0.261296   0.069392   3.766 0.000190 ***
tax          -0.010614   0.003922  -2.706 0.007084 **
ptratio      -0.942385   0.134082  -7.028 8.55e-12 ***
black         0.009253   0.002788   3.319 0.000984 ***
lstat        -0.518389   0.057466  -9.021  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.564 on 418 degrees of freedom
Multiple R-squared:  0.75,  Adjusted R-squared:  0.7422
F-statistic: 96.45 on 13 and 418 DF,  p-value: < 2.2e-16
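
    If you want to use these numbers programmatically rather than reading them off the printout, the pieces of the fit can be extracted directly; a brief sketch:

# Extract pieces of the fitted model programmatically
coef(model)             # named vector of estimated coefficients
confint(model)          # 95% confidence intervals for the coefficients
summary$coefficients    # coefficient table as a matrix
summary$adj.r.squared   # adjusted R-squared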

    Next, we'll predict on the test data with the trained model. Here, test[, -14] selects all columns of the test dataset except the 14th, which corresponds to the medv variable. This ensures that only the predictor variables are passed to predict(), matching the predictors used when fitting the model.

 
# Make predictions on the test set
pred_medv <- predict(model, newdata = test[, -14])


Accuracy check

    We can assess the prediction accuracy using several metrics, including Mean Squared Error (MSE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared. Let's examine these metrics to evaluate the performance of our linear regression model:

 
# Calculate evaluation metrics
mse <- mean((test$medv - pred_medv)^2)
mae <- mean(abs(test$medv - pred_medv))
rmse <- sqrt(mse)
r2 <- summary$r.squared   # R-squared from the model summary (computed on the training data)

# Print evaluation metrics
cat("MAE:", mae, "\n", "MSE:", mse, "\n",
"RMSE:", rmse, "\n", "R-squared:", r2, "\n")

    The result looks as follows.
 
 
MAE: 3.879989
MSE: 32.87188
RMSE: 5.7334
R-squared: 0.7499765
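
    As a side note, medv in the Boston dataset is recorded in thousands of dollars, so an RMSE of about 5.73 corresponds to a typical prediction error on the order of $5,700.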

    Finally, we'll create a plot to visually inspect the results of our linear regression model.

 
# Plot actual vs. predicted values
plot(test$medv, pch = 18, col = "red", xlab = "Observation", ylab = "medv",
     main = "Actual vs. Predicted")
lines(pred_medv, lwd = 1, col = "blue")
legend("topleft", legend = c("Actual", "Predicted"), col = c("red", "blue"),
       pch = c(18, NA), lty = c(NA, 1), lwd = 1, cex = 0.8)

    In this plot, the actual values are represented by red dots, and the predicted values are connected by a blue line.

    This plot allows for a quick visual comparison between the actual and predicted values, helping us assess the performance of our linear regression model.
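
    A related diagnostic, included here as an optional extra, is a scatter plot of predicted against actual values; the closer the points sit to the 45-degree line, the better the predictions. A minimal sketch:

# Predicted vs. actual scatter; points near the y = x line are well predicted
plot(test$medv, pred_medv, pch = 18, col = "darkgreen",
     xlab = "Actual medv", ylab = "Predicted medv",
     main = "Predicted vs. Actual")
abline(0, 1, col = "gray", lwd = 2)   # reference line y = x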


 

Conclusion

    In this tutorial, we learned how to perform linear regression in R using the lm() function. We split the data into training and testing sets, trained the model, evaluated its performance using evaluation metrics, and visualized the results. Linear regression is a powerful tool for modeling relationships between variables and making predictions based on observed data. The full source code is listed below.


Source code listing  

 
# Load libraries
library(caret)

# Load the Boston dataset
boston <- MASS::Boston

# Set seed for reproducibility
set.seed(123)

# Split the data into training and testing sets
indexes <- createDataPartition(boston$medv, p = 0.85, list = FALSE)
train <- boston[indexes, ]
test <- boston[-indexes, ]

# Fit linear regression model
model <- lm(medv ~ ., data = train)

# Print model summary
summary <- summary(model)
print(summary)

# Make predictions on the test set
pred_medv <- predict(model, newdata = test[, -14])

# Calculate evaluation metrics
mse <- mean((test$medv - pred_medv)^2)
mae <- mean(abs(test$medv - pred_medv))
rmse <- sqrt(mse)
r2 <- summary$r.squared   # R-squared from the model summary (computed on the training data)

# Print evaluation metrics
cat("MAE:", mae, "\n", "MSE:", mse, "\n",
"RMSE:", rmse, "\n", "R-squared:", r2, "\n")

# Plot actual vs. predicted values
plot(test$medv, pch = 18, col = "red", xlab = "Observation", ylab = "medv",
     main = "Actual vs. Predicted")
lines(pred_medv, lwd = 1, col = "blue")
legend("topleft", legend = c("Actual", "Predicted"), col = c("red", "blue"),
       pch = c(18, NA), lty = c(NA, 1), lwd = 1, cex = 0.8)

