Anomaly Detection Example with DBSCAN in Python

   DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. It finds core samples in dense regions and groups the nearby samples around them to form clusters. Samples in low-density regions that do not belong to any cluster are treated as outliers. Finding those outliers is the focus of this tutorial.
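
As a minimal sketch of this idea, the toy example below places four points close together and one point far away. The five points and the parameter values are made up for illustration and are not part of the tutorial's dataset. DBSCAN assigns the label -1 to samples it treats as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Four points packed closely together and one isolated point (illustrative data).
points = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [5.0, 5.0]])

# eps and min_samples are arbitrary values chosen to suit this toy data.
labels = DBSCAN(eps=0.3, min_samples=3).fit_predict(points)

# The four dense points share cluster 0; the isolated point is labeled -1.
print(labels.tolist())  # [0, 0, 0, 0, -1]
```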
 
   The Scikit-learn API provides the DBSCAN class for this algorithm and we'll use it in this tutorial. The tutorial covers:
  1. Preparing the dataset
  2. Defining the model and anomaly detection
  3. Source code listing

    To learn about other anomaly detection methods, see my tutorial "A Brief Explanation of 8 Anomaly Detection Methods with Python".

We'll start by loading the required libraries for this tutorial.

 
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from numpy import random, where
import matplotlib.pyplot as plt
 


Preparing the dataset

We'll create a random sample dataset for this tutorial by using the make_blobs() function.

 
random.seed(7)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))
 

We'll check the dataset by visualizing it in a plot.

 
plt.scatter(x[:,0], x[:,1])
plt.show()
 



Defining the model and anomaly detection

   We'll define the model by using the DBSCAN class of the Scikit-learn API, setting the 'eps' and 'min_samples' arguments. The 'eps' argument is the maximum distance between two samples for one to be considered in the neighborhood of the other. The 'min_samples' argument is the number of samples (including the sample itself) that a neighborhood must contain for the sample to be considered a core sample.

 
dbscan = DBSCAN(eps = 0.28, min_samples = 20)
print(dbscan)
 
 
DBSCAN(algorithm='auto', eps=0.28, leaf_size=30, metric='euclidean',
       metric_params=None, min_samples=20, n_jobs=None, p=None)
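
To make the roles of 'eps' and 'min_samples' concrete, the sketch below recomputes the core samples by hand and compares them with the fitted model's core_sample_indices_ attribute. It assumes the default Euclidean metric. The variable names and the random_state value are illustrative choices, so the data differs from the dataset used in this tutorial.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Illustrative data; random_state is an assumption made for reproducibility.
data, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3, random_state=7)

eps, min_samples = 0.28, 20
model = DBSCAN(eps=eps, min_samples=min_samples).fit(data)

# Euclidean distance between every pair of samples.
distances = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)

# For each sample, count the samples within eps (the sample itself included).
neighbor_counts = (distances <= eps).sum(axis=1)

# A sample is a core sample when its neighborhood holds at least min_samples points.
manual_core = np.flatnonzero(neighbor_counts >= min_samples)

print(np.array_equal(manual_core, model.core_sample_indices_))
```

Samples that are neither core samples nor within 'eps' of a core sample end up labeled as noise.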


We'll fit the model on the x dataset and get the cluster label of each sample with the fit_predict() method.

 
pred = dbscan.fit_predict(x)
  

Next, we'll extract the samples labeled -1, which DBSCAN marks as noise, as the outliers.


anom_index = where(pred == -1)
values = x[anom_index]
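
As an optional check, the sketch below summarizes how many samples received each label. It uses a boolean mask as an alternative to where(). The variable names and the random_state value are illustrative assumptions, so the counts will not necessarily match the dataset above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Illustrative data; random_state is an assumption made for reproducibility.
data, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3, random_state=7)
labels = DBSCAN(eps=0.28, min_samples=20).fit_predict(data)

# A boolean mask selects the rows whose label is -1 (noise).
is_outlier = labels == -1
outliers = data[is_outlier]

# Count how many samples received each label.
unique_labels, counts = np.unique(labels, return_counts=True)
print(dict(zip(unique_labels.tolist(), counts.tolist())))
print("outliers:", outliers.shape[0])
```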
 

Finally, we'll visualize the results in a plot by highlighting the anomalies with a color.


plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()
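
The number of samples flagged as anomalies depends on the chosen 'eps' value. The sketch below, which is not part of the original walkthrough, refits DBSCAN for a few arbitrary 'eps' candidates while holding 'min_samples' fixed. The data and random_state value are illustrative assumptions.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Illustrative data; random_state is an assumption made for reproducibility.
data, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3, random_state=7)

# Candidate eps values are arbitrary; min_samples is held fixed at 20.
eps_values = [0.15, 0.2, 0.28, 0.35, 0.5]
noise_counts = []
for eps in eps_values:
    labels = DBSCAN(eps=eps, min_samples=20).fit_predict(data)
    # Samples labeled -1 are the ones DBSCAN treats as noise.
    noise_counts.append(int((labels == -1).sum()))
    print(f"eps={eps:.2f} -> {noise_counts[-1]} outliers")
```

With 'min_samples' fixed, a larger 'eps' can only enlarge each neighborhood, so the outlier count never increases as 'eps' grows.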
 



   In this tutorial, we've learned how to detect anomalies with the DBSCAN method by using Scikit-learn's DBSCAN class in Python. The full source code is listed below.



Source code listing

 
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from numpy import random, where
import matplotlib.pyplot as plt

random.seed(7)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))

plt.scatter(x[:,0], x[:,1])
plt.show()

dbscan = DBSCAN(eps = 0.28, min_samples = 20)
print(dbscan)

pred = dbscan.fit_predict(x)
anom_index = where(pred == -1)
values = x[anom_index]

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()
 



