How to create a ROC curve in R

   ROC stands for Reciever Operating Characteristics, and it is used to evaluate the prediction accuracy of a classifier model. ROC curve is a metric describing the trade-off between the sensitivity (true positive rate, TPR) and specificity (false positive rate, FPR) of a prediction in all probability cutoffs (thresholds). It can be used for binary and multiclass classification accuracy checking. To evaluate the ROC in multiclass prediction, we create binary classes by mapping each class against the other classes.

   In this post, we briefly learn how to check the ROC curve in R. We use 'ROCR' package in this tutorial. Since ROC is created by the TPR and FPR value, here I'll mention the formula of both metrics. Please refer my previous post about the confusion matrix to understand below metrics.

                    TPR = TP / (TP + FN)


                     FPR = FP / (FP + TN)


   First, we'll generate sample dataset and build a classifier with a logistic regression model, then predict the test data. Based on prediction data we'll create a ROC curve and find out some other metrics.

library(ROCR)
  
df = data.frame(a=sample(1:25,400,replace = T),
                 b=runif(400)*3, 
                 c=sample(1:10,400,replace = T))
 
df = cbind(df,type=ifelse((df$a+df$b+df$c)>=20, "high", "low")) 
index = sample(1:nrow(df), size = .80 * nrow(df))
train = df[index, ]
test = df[-index, ]
 
model = glm(type~a+b,data=train, 
            family = binomial(link = "logit"))
 
pred = predict(model,test,type="response")
 

Performance check

Next, we'll use a 'prediction' and 'performance' functions of a 'ROCR' package to check the accuracy. Accuracy versus cutoff values plot is shown below.


pred = prediction(pred, test$type)
perf = performance(pred, "acc")
plot(perf)


You can also check the other metrics with a 'performance' function and visualize them in a plot.

perf_cost = performance(pred, "cost")
perf_err = performance(pred, "err")
perf_tpr = performance(pred, "tpr")
perf_sn_sp = performance(pred, "sens", "spec")

plot(perf_cost)

We can get maximum accuracy cutoff from accuracy performance.

max_ind = which.max(slot(perf, "y.values")[[1]] )
acc = slot(perf, "y.values")[[1]][max_ind]
cutoff = slot(perf, "x.values")[[1]][max_ind]
print(c(accuracy= acc, cutoff = cutoff))
  accuracy cutoff.347 
 0.9375000  0.5627766
 

ROC curve

Next, we'll create a ROC curve.

roc = performance(pred,"tpr","fpr")
plot(roc, colorize = T, lwd = 2)
abline(a = 0, b = 1) 
 

   A random guess is a diagonal line and the model does not make any sense. If the curve approaches closer to the top-left corner, the model performance becomes much better. Any curve under the diagonal line is worst than a random guess. We can set the cutoff threshold based on our requirement in terms of sensitivity and specificity importance.

AUC

   The AUC represents the area under the ROC curve. We can evaluate the model the performance by the value of AUC. Higher than 0.5  shows a better model performance. If the curve changes to rectangle it is perfect classifier with AUC value 1.

auc = performance(pred, measure = "auc")
print(auc@y.values)
[[1]]
[1] 0.9756098

   In this post, we've briefly learned how to build a ROC curve and find out AUC with ROCR package. The full source code is listed below.

library(ROCR)
 
df = data.frame(a=sample(1:25,400,replace = T),
                 b=runif(400)*3, 
                 c=sample(1:10,400,replace = T))
 
df = cbind(df,type=ifelse((df$a+df$b+df$c)>=20, "high", "low")) 
 
index = sample(1:nrow(df), size = .80 * nrow(df))
train = df[index, ]
test = df[-index, ]
 
model = glm(type~a+b,data=train, 
             family = binomial(link = "logit"))
 
pred = predict(model,test,type="response")
pred = prediction(pred, test$type)
perf = performance(pred, "acc")
plot(perf)
 
max_ind = which.max(slot(perf, "y.values")[[1]] )
acc = slot(perf, "y.values")[[1]][max_ind]
cutoff = slot(perf, "x.values")[[1]][max_ind]
print(c(accuracy= acc, cutoff = cutoff))
 
perf_cost = performance(pred, "cost")
perf_err = performance(pred, "err")
perf_tpr = performance(pred, "tpr")
perf_sn_sp = performance(pred, "sens", "spec")
 
roc = performance(pred,"tpr","fpr")
plot(roc, colorize = T, lwd = 2)
abline(a = 0, b = 1)
 
auc = performance(pred, measure = "auc")
print(auc@y.values) 


References:
  1. https://cran.r-project.org/web/packages/ROCR/ROCR.pdf 
 

1 comment: