In this tutorial, we'll learn how to extract ROC data from a binary classifier's predictions and visualize it in a plot with Python. The tutorial covers:

- Metrics
- Defining the binary classifier
- Extract ROC and AUC
- Source code listing

    from sklearn import metrics
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression

**Metrics**

The ROC (Receiver Operating Characteristic) curve is built from the classifier's true positive rate (TPR) and false positive rate (FPR), computed at different decision thresholds, so we first need to understand these two metrics. Their formulas are given below, where TP is True Positive, FP is False Positive, TN is True Negative, and FN is False Negative. These four counts are the cells of the confusion matrix.

TPR = TP / (TP + FN)

FPR = FP / (FP + TN)
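To make the two formulas concrete, here is a minimal sketch that counts the four confusion-matrix cells by hand and applies them. The helper names `confusion_counts` and `rates` and the label lists are made up for illustration; they are not part of scikit-learn.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels, treating 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def rates(y_true, y_pred):
    """Apply TPR = TP / (TP + FN) and FPR = FP / (FP + TN)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical labels: 4 positives and 4 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

tpr, fpr = rates(y_true, y_pred)
print("TPR:", tpr)  # TPR: 0.75
print("FPR:", fpr)  # FPR: 0.25
```

Three of the four positives are caught (TPR = 3/4) and one of the four negatives is flagged by mistake (FPR = 1/4).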

**Defining the binary classifier**

To get prediction data, we need to prepare a dataset and a classifier model. We'll use the Breast Cancer dataset for this tutorial. After separating it into X (features) and Y (labels), we'll split the data into train and test parts.

    bc = load_breast_cancer()
    x, y = bc.data, bc.target
    trainX, testX, trainY, testY = train_test_split(x, y, test_size=0.3, random_state=12)

Since the label data Y is binary, we'll use the Logistic Regression classifier. We'll define the model and fit it on the train data. To predict the test data, we'll use the 'predict_proba' method, which returns the predicted probability of each class for every sample.

    lr = LogisticRegression()
    lr.fit(trainX, trainY)
    predY = lr.predict_proba(testX)

    print(predY[1:10,])

    [[1.88177195e-03 9.98118228e-01]
     [2.33546181e-02 9.76645382e-01]
     [1.01389601e-03 9.98986104e-01]
     [7.44853637e-03 9.92551464e-01]
     [1.35032503e-01 8.64967497e-01]
     [8.19804471e-02 9.18019553e-01]
     [1.21066735e-03 9.98789333e-01]
     [4.52790405e-02 9.54720960e-01]
     [9.99995472e-01 4.52802862e-06]]
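Each row holds two probabilities that sum to 1, with columns ordered as in `lr.classes_`, so the second column is the probability of class 1. The sketch below shows how such probabilities turn into hard labels once a cut-off is chosen. The values are copied from the second column of the printed output; the helper `to_labels` and the 0.95 cut-off are hypothetical and only for illustration.

```python
# Class-1 probabilities copied from the second column of the printed rows
pos_proba = [
    9.98118228e-01, 9.76645382e-01, 9.98986104e-01,
    9.92551464e-01, 8.64967497e-01, 9.18019553e-01,
    9.98789333e-01, 9.54720960e-01, 4.52802862e-06,
]


def to_labels(probabilities, threshold=0.5):
    """Label a sample 1 when its class-1 probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


default_labels = to_labels(pos_proba)        # cut-off at 0.5
strict_labels = to_labels(pos_proba, 0.95)   # a stricter, made-up cut-off

print(default_labels)  # [1, 1, 1, 1, 1, 1, 1, 1, 0]
print(strict_labels)   # [1, 1, 1, 1, 0, 0, 1, 1, 0]
```

Raising the cut-off turns two of the positive predictions into negatives. The ROC curve summarizes exactly this trade-off across all possible cut-offs, which is why it needs probabilities rather than hard labels.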

**Extract ROC and AUC**

We can extract the ROC data by using the 'roc_curve' function of sklearn.metrics.

fpr, tpr, thresh = metrics.roc_curve(testY, predY[:,1])
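To see what kind of data 'roc_curve' returns, the following sketch sweeps a threshold over every distinct score and records one (FPR, TPR) point per threshold. It is a plain-Python illustration of the idea, not scikit-learn's actual implementation, which is vectorized and by default drops some intermediate points (the `drop_intermediate` parameter). The function name `roc_points` and the toy data are assumptions.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists, one point per distinct score used as threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Start at (0, 0): a threshold above every score predicts no positives
    fpr, tpr = [0.0], [0.0]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr


# Hypothetical labels and class-1 scores (not from the Breast Cancer model)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr_pts, tpr_pts = roc_points(y_true, scores)
print(fpr_pts)  # [0.0, 0.0, 0.5, 0.5, 1.0]
print(tpr_pts)  # [0.0, 0.5, 0.5, 1.0, 1.0]
```

As the threshold drops, more samples are predicted positive, so both rates can only stay the same or grow until the curve ends at (1, 1).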

Using 'fpr' and 'tpr', we can compute the AUC value, which is the area under the ROC curve.

    auc = metrics.auc(fpr, tpr)
    print("AUC:", auc)

AUC: 0.9871495327102804
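The 'metrics.auc' function computes this area with the trapezoidal rule. A minimal hand-written version, applied to a small set of made-up ROC points (not the Breast Cancer results), might look like this; `trapezoid_auc` is a hypothetical helper name.

```python
def trapezoid_auc(x, y):
    """Sum the areas of the trapezoids between consecutive (x, y) points."""
    area = 0.0
    for i in range(1, len(x)):
        width = x[i] - x[i - 1]
        mean_height = (y[i] + y[i - 1]) / 2.0
        area += width * mean_height
    return area


# Hypothetical ROC points, ordered by increasing FPR
toy_fpr = [0.0, 0.0, 0.5, 0.5, 1.0]
toy_tpr = [0.0, 0.5, 0.5, 1.0, 1.0]

toy_auc = trapezoid_auc(toy_fpr, toy_tpr)
print("AUC:", toy_auc)  # AUC: 0.75
```

Vertical segments (where FPR does not change) have zero width and add nothing, so only the two horizontal steps contribute: 0.5 × 0.5 + 0.5 × 1.0 = 0.75.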

Finally, we'll visualize the ROC in a plot.

    plt.plot(fpr, tpr, label='ROC curve (area = %.2f)' %auc)
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Random guess')
    plt.title('ROC curve')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.grid()
    plt.legend()
    plt.show()

The diagonal line represents random guessing, that is, a model with no ability to separate the two classes. The closer the curve gets to the top-left corner, the better the model performs. Any curve under the diagonal line is worse than a random guess.
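One way to build intuition for these cases is the ranking view of AUC: it equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The sketch below uses that view on made-up scores; `pairwise_auc` is a hypothetical helper, not a scikit-learn function.

```python
def pairwise_auc(y_true, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]

good_scores = [0.1, 0.2, 0.8, 0.9]    # positives always score higher
mixed_scores = [0.1, 0.4, 0.35, 0.8]  # one pair is ranked the wrong way
bad_scores = [0.9, 0.8, 0.2, 0.1]     # positives always score lower

auc_good = pairwise_auc(y_true, good_scores)
auc_mixed = pairwise_auc(y_true, mixed_scores)
auc_bad = pairwise_auc(y_true, bad_scores)

print(auc_good)   # 1.0
print(auc_mixed)  # 0.75
print(auc_bad)    # 0.0
```

A curve hugging the top-left corner corresponds to the first case, and a curve under the diagonal to the last one, where the scores rank the classes in reverse.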

In this tutorial, we've briefly learned how to create a ROC curve plot from a binary classifier's predictions. The full source code is listed below. Thank you for reading!

**Source code listing**

    from sklearn import metrics
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression

    bc = load_breast_cancer()
    x, y = bc.data, bc.target
    trainX, testX, trainY, testY = train_test_split(x, y, test_size=0.3, random_state=12)

    lr = LogisticRegression()
    lr.fit(trainX, trainY)
    predY = lr.predict_proba(testX)

    fpr, tpr, thresh = metrics.roc_curve(testY, predY[:,1])
    auc = metrics.auc(fpr, tpr)
    print("AUC:", auc)

    plt.plot(fpr, tpr, label='ROC curve (area = %.2f)' %auc)
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Random guess')
    plt.title('ROC curve')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.grid()
    plt.legend()
    plt.show()
