DataTechNotes: Anomaly Detection Example with DBSCAN in Python

The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. The main principle of this algorithm is that it finds core samples in a dense area and groups the samples around those core samples to create clusters. The samples in a low-density area become the outliers. We'll focus on finding out those outliers in this tutorial.

The Scikit-learn API provides the DBSCAN class for this algorithm and we'll use it in this tutorial. The tutorial covers:

Preparing the dataset
Defining the model and anomaly detection
Source code listing

If you want to know other anomaly detection methods, please check out my A Brief Explanation of 8 Anomaly Detection Methods with Python tutorial.

We'll start by loading the required libraries for this tutorial.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from numpy import random, where
import matplotlib.pyplot as plt

Preparing the dataset

We'll create a random sample dataset for this tutorial by using the make_blob() function.

random.seed(7)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))

We'll check the dataset by visualizing it in a plot.

plt.scatter(x[:,0], x[:,1])
plt.show()

Defining the model and anomaly detection

We'll define the model by using the DBSCAN class of Scikit-learn API. We'll define the 'eps' and 'min_sample' in the arguments of the class. The argument 'eps' is the distance between two samples to be considered as a neighborhood and 'min_samples' is the number of samples in a neighborhood.

dbscan = DBSCAN(eps = 0.28, min_samples = 20)
print(dbscan)

DBSCAN(algorithm='auto', eps=0.28, leaf_size=30, metric='euclidean',
       metric_params=None, min_samples=20, n_jobs=None, p=None)

We'll fit the model with x dataset and get the prediction data with the fit_predict() method.

pred = elenv.fit_predict(x)

Next, we'll extract the negative outputs as the outliers.

anom_index = where(pred == -1)
values = x[anom_index]

Finally, we'll visualize the results in a plot by highlighting the anomalies with a color.

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()

In this tutorial, we've learned how to detect the anomalies with the DBSCAN method by using the Scikit-learn's DBSCAN class in Python. The full source code is listed below.

Source code listing

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from numpy import random, where
import matplotlib.pyplot as plt

random.seed(7)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))

plt.scatter(x[:,0], x[:,1])
plt.show()

dbscan = DBSCAN(eps = 0.28, min_samples = 20)
print(dbscan)

pred = dbscan.fit_predict(x)
anom_index = where(pred == -1)
values = x[anom_index]

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()

References:

DataTechNotes

Pages

Anomaly Detection Example with DBSCAN in Python

1 comment: