Anomaly Detection Example with DBSCAN in Python

   DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. It finds core samples in dense areas and groups the surrounding samples around those core samples to form clusters. Samples that fall in low-density areas are treated as outliers, and finding those outliers is the focus of this tutorial.
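
To make the idea of core samples and outliers concrete, here is a minimal sketch on a tiny hand-made dataset. The points and the eps and min_samples values are illustrative assumptions chosen for this sketch, not the data used later in the tutorial.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Five points packed closely together plus one isolated point
# (illustrative values chosen for this sketch).
points = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [0.0, 0.1],
    [0.1, 0.1],
    [0.05, 0.05],
    [5.0, 5.0],
])

model = DBSCAN(eps=0.5, min_samples=3).fit(points)

# labels_ holds the cluster index of each sample; -1 marks noise.
labels = model.labels_
# core_sample_indices_ lists the samples that have at least
# min_samples neighbors (counting themselves) within eps.
core_indices = model.core_sample_indices_

print(labels)
print(core_indices)
```

The five close points end up in cluster 0 and are all core samples, while the isolated point gets the label -1.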
 
   The Scikit-learn API provides the DBSCAN class for this algorithm and we'll use it in this tutorial. The tutorial covers:
  1. Preparing the dataset
  2. Defining the model and anomaly detection
  3. Source code listing
We'll start by loading the required libraries for this tutorial.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from numpy import random, where
import matplotlib.pyplot as plt

Preparing the dataset

We'll create a random sample dataset for this tutorial by using the make_blobs() function.

random.seed(7)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))

We'll check the dataset by visualizing it in a plot.

plt.scatter(x[:,0], x[:,1])
plt.show()



Defining the model and anomaly detection

   We'll define the model by using the DBSCAN class of the Scikit-learn API and set its 'eps' and 'min_samples' arguments. The 'eps' argument is the maximum distance between two samples for them to be considered neighbors, and 'min_samples' is the number of samples (including the point itself) required in a neighborhood for a point to be treated as a core sample.

dbscan = DBSCAN(eps = 0.28, min_samples = 20)
print(dbscan) 
 
DBSCAN(algorithm='auto', eps=0.28, leaf_size=30, metric='euclidean',
       metric_params=None, min_samples=20, n_jobs=None, p=None)
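
To get a feel for how 'eps' influences the result, the sketch below counts the samples labeled as noise for several eps values while keeping min_samples at 20. The list of eps values and the use of random_state=7 in make_blobs() are assumptions made so the sketch runs on its own; it does not reproduce the exact dataset above.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# One blob of 200 points; random_state=7 is an assumption made here
# so the sketch is reproducible on its own.
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3, random_state=7)

eps_values = [0.1, 0.2, 0.28, 0.4, 0.6]
noise_counts = []
for eps in eps_values:
    labels = DBSCAN(eps=eps, min_samples=20).fit_predict(x)
    # Samples labeled -1 were not assigned to any cluster.
    noise_counts.append(int((labels == -1).sum()))

for eps, count in zip(eps_values, noise_counts):
    print(f"eps={eps}: {count} outliers")
```

With min_samples fixed, a larger eps can only keep or reduce the number of samples labeled as noise, so a smaller eps makes the detector stricter and a larger one makes it more tolerant.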

We'll fit the model on the x dataset and get the predicted cluster labels with the fit_predict() method.

pred = dbscan.fit_predict(x)
 
Next, we'll extract the samples labeled -1, which DBSCAN assigns to noise points, as the outliers.

anom_index = where(pred == -1)
values = x[anom_index]
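
As a small aside on the indexing step: when where() is called with only a condition, it returns a tuple of index arrays, and a boolean mask selects the same rows. The sketch below uses small hypothetical pred and x arrays (made up for illustration) rather than the tutorial's data.

```python
import numpy as np

# Hypothetical labels as returned by fit_predict(); -1 marks outliers.
pred = np.array([0, 0, -1, 0, -1])
x = np.array([[1.0, 1.0], [1.1, 0.9], [4.0, 4.0], [0.9, 1.0], [-3.0, 2.0]])

# where() with a single argument returns a tuple of index arrays.
anom_index = np.where(pred == -1)
values = x[anom_index]

# A boolean mask selects the same rows without the intermediate indices.
mask = pred == -1
values_from_mask = x[mask]

print(anom_index[0])  # positions of the outliers in x
print(len(values), "outliers out of", len(x))
```

Either form works for the plotting step; the mask version is convenient when you also want the inliers, which are simply x[~mask].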

Finally, we'll visualize the results in a plot, highlighting the anomalies in red.

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()



   In this tutorial, we've learned how to detect anomalies with the DBSCAN method by using Scikit-learn's DBSCAN class in Python. The full source code is listed below.

   Previous tutorials covered several other anomaly detection methods in Python and R. Please check this blog to learn more about them.


Source code listing

 
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from numpy import random, where
import matplotlib.pyplot as plt

random.seed(7)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))

plt.scatter(x[:,0], x[:,1])
plt.show()

dbscan = DBSCAN(eps = 0.28, min_samples = 20)
print(dbscan)

pred = dbscan.fit_predict(x)
anom_index = where(pred == -1)
values = x[anom_index]

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()
 



