Ordering Points To Identify the Clustering Structure (OPTICS) is an algorithm that estimates the density-based clustering structure of a dataset. Its clustering approach is similar to that of the DBSCAN algorithm.
In this tutorial, we'll learn how to apply the OPTICS method to detect anomalies in a given dataset. Here, we use the OPTICS class of the Scikit-learn API. The tutorial covers:
- Preparing the data
- Anomaly detection with OPTICS
- Source code listing
If you want to learn about other anomaly detection methods, please check out my A Brief Explanation of 8 Anomaly Detection Methods with Python tutorial.
We'll start by loading the required libraries and functions for this tutorial.
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt
Preparing the data
We'll generate simple data for this tutorial by using the make_blobs() function and visualize it in a plot.
random.seed(123)
x, _ = make_blobs(n_samples=350, centers=1, cluster_std=.4, center_box=(20, 5))
plt.scatter(x[:,0], x[:,1])
plt.grid(True)
plt.show()
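As a side note, make_blobs() also accepts a random_state argument, which may be a more explicit way to get reproducible data than seeding NumPy globally. The sketch below is only an illustration, not the setup used in the rest of this tutorial: the names x_a and x_b are made up for the example, and center_box is written as (5, 20) on the assumption that the bounds are meant as (min, max).

```python
import numpy as np
from sklearn.datasets import make_blobs

# Assumption: random_state=123 is an arbitrary seed chosen for this sketch,
# and center_box=(5, 20) is given in (min, max) order.
x_a, _ = make_blobs(n_samples=350, centers=1, cluster_std=0.4,
                    center_box=(5, 20), random_state=123)
x_b, _ = make_blobs(n_samples=350, centers=1, cluster_std=0.4,
                    center_box=(5, 20), random_state=123)

print(x_a.shape)                  # (350, 2): 350 samples, 2 features by default
print(np.array_equal(x_a, x_b))   # True: the same seed gives the same data
```

Note that data generated this way will differ from the data shown in this tutorial, so the threshold and outlier values will not match the outputs printed here.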
Anomaly detection with OPTICS
We'll define the model by using the OPTICS class with its default parameters, and then fit it on the x data. You can check the parameters of the class and change them according to your analysis and target data.
model = OPTICS().fit(x)
print(model)
OPTICS(algorithm='auto', cluster_method='xi', eps=None, leaf_size=30,
max_eps=inf, metric='minkowski', metric_params=None,
min_cluster_size=None, min_samples=5, n_jobs=None, p=2,
predecessor_correction=True, xi=0.05)
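To illustrate how changing a parameter affects the result, here is a minimal sketch that compares the default min_samples=5 with min_samples=10. The variable names (x_demo, default_model, wide_model) and the random_state value are assumptions made for this example only. Since a larger neighborhood has to be reached before a sample becomes a core point, the core distances should not get smaller when min_samples grows.

```python
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Hypothetical demo data, generated separately from the tutorial's x
x_demo, _ = make_blobs(n_samples=350, centers=1, cluster_std=0.4,
                       random_state=123)

default_model = OPTICS().fit(x_demo)             # min_samples=5 by default
wide_model = OPTICS(min_samples=10).fit(x_demo)  # larger neighborhood

# get_params() lists all current parameter values of the estimator
print(wide_model.get_params()["min_samples"])

# Each sample's core distance with min_samples=10 is at least as large
# as its core distance with min_samples=5
print((wide_model.core_distances_ >= default_model.core_distances_).all())
```

A larger min_samples tends to smooth the scores, which may make isolated single points stand out less and sparse regions stand out more.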
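Next, we'll obtain the score of each sample in the x data by using the core_distances_ attribute of the model. The core distance is the distance at which a sample becomes a core point, so samples in sparse regions get larger values.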
scores = model.core_distances_
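To make the meaning of these scores more concrete, the sketch below recomputes them with a plain nearest-neighbor query. It rests on my understanding that, with default settings, a sample's core distance equals the distance to its min_samples-th nearest neighbor when the sample itself is counted as the first neighbor. The names x_demo, model_demo, and manual are hypothetical and used only in this example.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Hypothetical demo data, generated separately from the tutorial's x
x_demo, _ = make_blobs(n_samples=350, centers=1, cluster_std=0.4,
                       random_state=123)

min_samples = 5
model_demo = OPTICS(min_samples=min_samples).fit(x_demo)

# Query the training data itself, so every sample is its own first
# neighbor at distance 0; the last column is then the distance to the
# min_samples-th neighbor.
nn = NearestNeighbors(n_neighbors=min_samples).fit(x_demo)
dist, _ = nn.kneighbors(x_demo)
manual = dist[:, -1]

# Compare the manual values with the ones stored by OPTICS
print(np.allclose(manual, model_demo.core_distances_))
```

If the comparison holds, it shows why the score works as an outlier measure: a point far from its neighbors needs a large radius to collect min_samples points.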
Then, we'll extract the threshold value from the scores by using the quantile() function. You can set your own target percentage; in this example, we'll treat 98% of the data as normal and the remaining 2% as outliers.
thresh = quantile(scores, .98)
print(thresh)
0.35064484877392416
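To see how the quantile cut-off behaves, here is a small sketch on made-up scores. The array toy_scores is purely illustrative and is not derived from the model above.

```python
import numpy as np

# Toy scores 0.0, 1.0, ..., 99.0 (hypothetical values for illustration)
toy_scores = np.arange(100, dtype=float)

# 98% quantile: with the default linear interpolation this lands
# between the values 97 and 98 (about 97.02)
toy_thresh = np.quantile(toy_scores, 0.98)

# Count the scores at or above the threshold
n_flagged = int((toy_scores >= toy_thresh).sum())
print(toy_thresh, n_flagged)
```

Only the two largest of the 100 scores are flagged, that is, 2% of the samples. The same logic explains why 7 of the 350 samples end up above the threshold in this tutorial.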
Using the threshold value, we'll find the samples whose scores are equal to or higher than the threshold.
index = where(scores >= thresh)
values = x[index]
print(values)
[[ 9.45071447 14.58847433]
[ 8.500387 16.2113985 ]
[ 9.56481939 16.89136015]
[ 9.63176979 14.41548797]
[ 8.43771706 15.07302741]
[10.33672675 14.89789167]
[10.43533425 16.58262441]]
Finally, we'll visualize the results in a plot, highlighting the anomalies in red.
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.legend(("normal", "anomaly"), loc="best", fancybox=True, shadow=True)
plt.grid(True)
plt.show()
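The steps in this section can also be wrapped in a small reusable function. The sketch below is one possible way to do that, not part of the Scikit-learn API: the function name optics_outliers, its arguments, and the demo data are all assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs


def optics_outliers(data, q=0.98, **optics_kwargs):
    """Hypothetical helper: flag samples whose core distance is at or
    above the q-th quantile of all core distances."""
    model = OPTICS(**optics_kwargs).fit(data)
    scores = model.core_distances_
    thresh = np.quantile(scores, q)
    return scores >= thresh, thresh


# Demo data generated with an arbitrary seed (differs from the tutorial's x)
x_demo, _ = make_blobs(n_samples=350, centers=1, cluster_std=0.4,
                       random_state=123)

mask, thresh = optics_outliers(x_demo, q=0.98)
outliers = x_demo[mask]   # rows of x_demo flagged as anomalies
print(outliers.shape, thresh)
```

Returning a boolean mask rather than index arrays keeps the helper easy to combine with plotting, for example by scattering x_demo[~mask] and x_demo[mask] in different colors.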
Source code listing
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt
random.seed(123)
x, _ = make_blobs(n_samples=350, centers=1, cluster_std=.4, center_box=(20, 5))
plt.scatter(x[:,0], x[:,1])
plt.grid(True)
plt.show()
model = OPTICS().fit(x)
print(model)
scores = model.core_distances_
thresh = quantile(scores, .98)
print(thresh)
index = where(scores >= thresh)
values = x[index]
print(values)
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.legend(("normal", "anomaly"), loc="best", fancybox=True, shadow=True)
plt.grid(True)
plt.show()