Logistic Regression Example in R

   Logistic regression is a widely used classification algorithm for binary class data. Although its name includes 'regression', it is a classification algorithm that separates the classes with a linear decision boundary. The model can also be extended to multiclass problems. The plot of a logistic regression model is an S-shaped curve bounded between 0 and 1, where 0 and 1 are the two class labels.
   In this post, we'll briefly learn how to use the logistic regression model to classify data in R. We use the glm() function to define a logistic regression model in R.
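To see where the S-shaped curve comes from, a minimal sketch of the logistic (sigmoid) function, which maps any real value into the (0, 1) range (the function name 'sigmoid' here is our own; R's built-in equivalent is plogis()):

```r
# The logistic (sigmoid) function maps any real value into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

sigmoid(-5)   # close to 0
sigmoid(0)    # exactly 0.5
sigmoid(5)    # close to 1

# plogis() is R's built-in version of the same function;
# plotting it shows the S-shaped curve
curve(plogis(x), from = -6, to = 6, ylab = "probability")
```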


Creating data

   First, we create a dataset and split it into train and test parts. Binary class data makes it easier to see how logistic regression works. The dataset contains exam data with a binary 'result' output (1 - pass, 0 - fail).


set.seed(123)   # for reproducibility
exam = data.frame(test = sample(40:100, 200, replace = T),
                  paper = sample(30:100, 200, replace = T)) 
 
exam = cbind(exam, 
             result = ifelse(exam$test > 65 & exam$paper > 40, 1, 0))
 
index = sample(1:nrow(exam), size = .80 * nrow(exam))
train = exam[index, ]
test = exam[-index, ] 

head(train)
    test paper result
198   80    37    0
28    76    69    1
180   75    98    1
114   97    33    0
78    77    63    1
88    94    46    1


Building the model

  Next, we build a logistic regression model with the glm() function, using the binomial family.
 
exam_glm = glm(result~test + paper, data = train, family = "binomial")
summary(exam_glm)

Call:
glm(formula = result ~ test + paper, family = "binomial", data = train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.40363  -0.43696  -0.07681   0.44613   2.08213  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -15.52439    2.35453  -6.593 4.30e-11 ***
test          0.16269    0.02443   6.658 2.77e-11 ***
paper         0.06139    0.01417   4.332 1.48e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 221.41  on 159  degrees of freedom
Residual deviance: 102.86  on 157  degrees of freedom
AIC: 108.86

Number of Fisher Scoring iterations: 6
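To connect the summary to the logistic function, we can compute a predicted probability by hand from the coefficients reported above (the student with test = 80 and paper = 60 is a hypothetical example; your coefficients will differ with different random data):

```r
# Coefficients taken from the model summary above
b <- c(intercept = -15.52439, test = 0.16269, paper = 0.06139)

# Linear predictor (log-odds) for a hypothetical student
# with test = 80 and paper = 60
eta <- b["intercept"] + b["test"] * 80 + b["paper"] * 60

# Inverse-logit: 1 / (1 + exp(-eta))
prob <- plogis(eta)
prob   # about 0.76, so the model predicts class 1 at the 0.5 threshold
```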

Now, we can draw a plot of the model with ggplot.

library(ggplot2)
 
ggplot(exam_glm, aes(x = test + paper, y = result)) + 
   geom_point() + 
   stat_smooth(method = "glm", method.args=list(family="binomial"), se=F)


Predicting the class with logistic regression 

The logistic regression model outputs the probability of the positive class, a value in the range [0, 1]. We convert these probabilities into class labels with a threshold: if the predicted probability is higher than 0.5, the class is 1; otherwise, it is 0.

pred = predict(exam_glm, test, type="response")   
test=cbind(test, pred_result=ifelse(pred >.5, 1, 0))
 
table(test$result, test$pred_result)  # confusion matrix 
   
     0  1
  0 15  5
  1  4 16
 
head(test)
   test paper result pred_result
4    93    66      1           1
7    72    55      1           0
12   67    47      1           0
19   60    88      0           0
21   94    50      1           1
24  100    78      1           1

We check the accuracy.

acc = mean(test$result==test$pred_result)
cat("Accuracy: ", acc)
Accuracy:  0.775
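The same accuracy can be read off the confusion matrix above: correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total count.

```r
# Confusion matrix from above: rows = actual, columns = predicted
cm <- matrix(c(15, 4, 5, 16), nrow = 2,
             dimnames = list(actual = c(0, 1), predicted = c(0, 1)))

acc <- sum(diag(cm)) / sum(cm)   # (15 + 16) / 40
acc   # 0.775
```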

In this post, we've briefly learned how to fit a logistic regression model and predict classes in R.
The full source code is listed below.

library(ggplot2)

set.seed(123)
exam = data.frame(test=sample(40:100,200,replace = T),
                  paper=sample(30:100,200,replace = T)) 
 
exam = cbind(exam, 
             result=ifelse(exam$test>65 & exam$paper>40, 1, 0))
head(exam)
 
index = sample(1:nrow(exam), size = .80 * nrow(exam))
train = exam[index, ]
test = exam[-index, ]
 
head(train)
 
exam_glm = glm(result~test + paper, data = train, family = "binomial")
summary(exam_glm)
 
ggplot(exam_glm, aes(x = test + paper, y = result)) + 
       geom_point() + 
       stat_smooth(method = "glm", method.args=list(family="binomial"), se=F)
 
 
pred = predict(exam_glm, test, type="response")   
test=cbind(test, pred_result=ifelse(pred>.5, 1, 0))
 
table(test$result, test$pred_result)
   
head(test)

