Anomaly Detection Example With OPTICS Method in Python

    Ordering Points To Identify the Clustering Structure (OPTICS) is an algorithm that estimates the density-based clustering structure of a given dataset. Its clustering approach is similar to that of the DBSCAN algorithm.

    In this tutorial, we'll learn how to apply the OPTICS method to detect anomalies in given data. Here, we use the OPTICS class of the Scikit-learn API. The tutorial covers:

  1. Preparing the data
  2. Anomaly detection with OPTICS
  3. Source code listing

    If you want to learn about other anomaly detection methods, please check out my tutorial A Brief Explanation of 8 Anomaly Detection Methods with Python.

We'll start by loading the required libraries and functions for this tutorial.
 
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt
  


Preparing the data

    We'll generate simple data for this tutorial by using the make_blobs() function and visualize it in a plot.

random.seed(123)
x, _ = make_blobs(n_samples=350, centers=1, cluster_std=.4, center_box=(20, 5))

plt.scatter(x[:,0], x[:,1])
plt.grid(True)
plt.show() 

Anomaly detection with OPTICS

    We'll define the model by using the OPTICS class with its default parameters, and then fit it on the x data. You can check the parameters of the class and change them according to your analysis and target data.
 
model = OPTICS().fit(x)
print(model)
 
OPTICS(algorithm='auto', cluster_method='xi', eps=None, leaf_size=30,
max_eps=inf, metric='minkowski', metric_params=None,
min_cluster_size=None, min_samples=5, n_jobs=None, p=2,
predecessor_correction=True, xi=0.05)
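
As a rough illustration of how changing a parameter affects the result, the sketch below fits OPTICS twice with different min_samples values and compares the resulting core distances. The names x_demo, core_5 and core_10, the random_state=0 argument, and the values 5 and 10 are assumptions chosen only to make the snippet reproducible on its own; they are not part of the tutorial's data.

```python
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Assumption: random_state=0 and the default center_box are used here
# only so that this sketch is reproducible by itself.
x_demo, _ = make_blobs(n_samples=350, centers=1, cluster_std=.4, random_state=0)

# Fit the same data with two neighborhood sizes.
core_5 = OPTICS(min_samples=5).fit(x_demo).core_distances_
core_10 = OPTICS(min_samples=10).fit(x_demo).core_distances_

# A larger min_samples looks further out for neighbors,
# so no sample's core distance should shrink.
print(core_5.mean(), core_10.mean())
print((core_10 >= core_5).all())
```

Because the scores used later in this tutorial are core distances, a larger min_samples tends to give smoother scores that are less sensitive to a single close neighbor.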
 

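Next, we'll obtain the score of each sample in x by using the core_distances_ attribute of the model. A core distance is the distance at which a sample becomes a core point, so samples in sparse regions get larger values.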

scores = model.core_distances_ 
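
To make the score more concrete, here is a minimal sketch that recomputes the core distances by hand with NearestNeighbors. It assumes that, with the default parameters, the core distance of a sample is its distance to the min_samples-th nearest neighbor, counting the sample itself as the first neighbor. The names x_demo, model_demo and manual, and the random_state=0 argument, are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Assumption: random_state=0 is used only to make this sketch reproducible.
x_demo, _ = make_blobs(n_samples=350, centers=1, cluster_std=.4, random_state=0)

min_samples = 5  # the OPTICS default
model_demo = OPTICS(min_samples=min_samples).fit(x_demo)

# Distances from every sample to its min_samples nearest neighbors.
# Querying the training data returns the sample itself at distance 0.
nn = NearestNeighbors(n_neighbors=min_samples).fit(x_demo)
distances, _ = nn.kneighbors(x_demo)

# The last column is the distance to the min_samples-th neighbor.
manual = distances[:, -1]

print(np.allclose(manual, model_demo.core_distances_))
```

Seen this way, thresholding core_distances_ is close in spirit to a k-nearest-neighbor distance score: a sample is flagged when its neighbors are unusually far away.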
 

Then, we'll extract the threshold value from the scores by using the quantile() function. You can set your own target percentage in quantile(). In this example, we'll treat 98% of the data as normal, and the remaining 2% becomes outliers.
 
thresh = quantile(scores, .98)
print(thresh) 
 
0.35064484877392416 
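
The arithmetic behind the threshold can be checked with a small sketch. It uses randomly generated stand-in scores (demo_scores is a hypothetical array, not the tutorial's core distances) and assumes the scores contain no ties around the cut-off. With 350 samples and a 0.98 quantile, roughly 2% of the samples, that is 7 of them, end up at or above the threshold.

```python
import numpy as np

# Hypothetical stand-in scores; the distribution and seed are assumptions.
rng = np.random.default_rng(0)
demo_scores = rng.exponential(scale=0.1, size=350)

# The 0.98 quantile sits between the 343rd and 344th smallest scores.
demo_thresh = np.quantile(demo_scores, 0.98)

# Samples at or above the threshold are the ones treated as outliers.
n_flagged = int((demo_scores >= demo_thresh).sum())
print(n_flagged)
```

Lowering the quantile, for example to 0.95, flags more samples, so the value works as a contamination setting that you choose rather than something the model estimates.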
 

By using the threshold value, we'll find the samples whose scores are equal to or higher than the threshold.

index = where(scores >= thresh)
values = x[index]
print(values)
 
[[ 9.45071447 14.58847433]
[ 8.500387 16.2113985 ]
[ 9.56481939 16.89136015]
[ 9.63176979 14.41548797]
[ 8.43771706 15.07302741]
[10.33672675 14.89789167]
[10.43533425 16.58262441]]
 

Finally, we'll visualize the results in a plot by highlighting the anomalies with a color.

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.legend(("normal", "anomaly"), loc="best", fancybox=True, shadow=True)
plt.grid(True)
plt.show()  
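
The steps above can also be wrapped into a small helper for reuse. The following is only a sketch: the function name core_distance_outliers and its parameters q and min_samples are hypothetical, and the random_state=0 argument is an assumption made to keep the usage example reproducible.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs


def core_distance_outliers(data, q=0.98, min_samples=5):
    """Return a boolean mask marking samples whose OPTICS core distance
    is at or above the q-quantile of all core distances."""
    scores = OPTICS(min_samples=min_samples).fit(data).core_distances_
    thresh = np.quantile(scores, q)
    return scores >= thresh


# Usage on data similar to the tutorial's (random_state is an assumption).
x_demo, _ = make_blobs(n_samples=350, centers=1, cluster_std=.4, random_state=0)
mask = core_distance_outliers(x_demo)

print(mask.sum())         # number of flagged samples
print(x_demo[mask][:3])   # first few flagged points
```

Returning a boolean mask instead of an index tuple makes it easy to select both groups, with x_demo[mask] for the anomalies and x_demo[~mask] for the normal samples.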
 

   In this tutorial, we've briefly learned how to detect anomalies with the OPTICS method by using Scikit-learn's OPTICS class in Python. The full source code is listed below.


Source code listing

from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt

random.seed(123)
x, _ = make_blobs(n_samples=350, centers=1, cluster_std=.4, center_box=(20, 5))

plt.scatter(x[:,0], x[:,1])
plt.grid(True)
plt.show() 
 
model = OPTICS().fit(x)
print(model)

scores = model.core_distances_

thresh = quantile(scores, .98)
print(thresh) 

index = where(scores >= thresh)
values = x[index]
print(values)

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.legend(("normal", "anomaly"), loc="best", fancybox=True, shadow=True)
plt.grid(True)
plt.show()
  

