Anomaly Detection Example With OPTICS Method in Python

    OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based clustering algorithm closely related to DBSCAN. Rather than committing to one fixed density level, it orders the points so that the density-based cluster structure of the data can be extracted from that ordering. In this tutorial, we will explore how to apply the OPTICS method to detect anomalies in a dataset using the OPTICS class from the Scikit-learn library in Python.

Tutorial Overview

    In this tutorial, we will cover the following steps:

  1. Understanding OPTICS: An overview of OPTICS and its suitability for anomaly detection

  2. Preparing the Data: Generating synthetic data using the make_blobs function

  3. Anomaly Detection with OPTICS: Defining an OPTICS model and identifying anomalies in the dataset

  4. Source Code Listing

 
    If you'd like to learn about other anomaly detection methods, see my tutorial A Brief Explanation of 8 Anomaly Detection Methods with Python.


Why Use OPTICS for Anomaly Detection?

    OPTICS is primarily a clustering algorithm, but it can be adapted for anomaly detection due to its unique characteristics:

  • Density-Based: OPTICS identifies dense clusters, making it well-suited for identifying outliers in sparser regions, which are often anomalies.

  • Hierarchical Structure: OPTICS produces an ordering of the points together with their reachability distances, which reveals clusters nested at different density levels. This helps separate truly isolated points from points that belong to a small or sparse cluster.

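  • Automated Parameter Selection: Unlike DBSCAN, OPTICS does not require a single fixed neighborhood radius (eps), which lets it handle clusters of varying density, size, and shape. The min_samples parameter still has to be chosen (Scikit-learn defaults to 5).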

  • Robust to Noise: Data points that do not fit into any cluster are treated as noise rather than forced into one. These noise points are natural candidates for anomalies, as they often represent outliers.
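
The last point can be seen directly in Scikit-learn: after fitting, OPTICS stores a cluster label for every sample in labels_, and samples assigned to no cluster get the label -1. The sketch below is only an illustration. The two-blob layout, the three hand-placed distant points, and the settings min_samples=10, cluster_method="dbscan", and eps=1.0 are all assumptions chosen to keep the outcome easy to reason about; they are not values used elsewhere in this tutorial.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Two tight blobs plus three hand-placed distant points (illustrative data)
blobs, _ = make_blobs(n_samples=200, centers=[[0, 0], [6, 6]],
                      cluster_std=0.3, random_state=0)
far_points = np.array([[3.0, -4.0], [-5.0, 5.0], [12.0, 0.0]])
data = np.vstack([blobs, far_points])

# Extract DBSCAN-style clusters from the OPTICS ordering (assumed settings)
model = OPTICS(min_samples=10, cluster_method="dbscan", eps=1.0).fit(data)

# Samples labeled -1 belong to no cluster
noise_mask = model.labels_ == -1
print("number of noise points:", noise_mask.sum())
print("distant points flagged as noise:", noise_mask[-3:])
```

The three distant points lie several units away from every other sample, so with eps=1.0 they can neither be core points nor be reached from one, which is why they end up as noise.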

 

Required Libraries and Functions

    Before we begin, let's load the required libraries and functions:

 
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt
  


Preparing the Data

    We'll start by generating synthetic data using the make_blobs function and visualizing it in a plot:
 
 
# Generate random data  
random.seed(123)
x, _ = make_blobs(n_samples=350, centers=1, cluster_std=.4, center_box=(20, 5)) 
 
# Visualize the data 
plt.scatter(x[:,0], x[:,1])
plt.grid(True)
plt.show()
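
As a side note, make_blobs also accepts a random_state argument, which pins the generated data without relying on NumPy's global seed. The sketch below is a variant of the call above, not the tutorial's exact data: random_state=123 and writing center_box in the documented (min, max) order as (5, 20) are choices made here for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same blob settings as above, but seeded through random_state (assumed value)
x_a, _ = make_blobs(n_samples=350, centers=1, cluster_std=0.4,
                    center_box=(5, 20), random_state=123)
x_b, _ = make_blobs(n_samples=350, centers=1, cluster_std=0.4,
                    center_box=(5, 20), random_state=123)

print(x_a.shape)                  # 350 samples, 2 features by default
print(np.array_equal(x_a, x_b))   # identical data on every call
```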
 

Anomaly Detection with OPTICS

    Next, we'll apply the OPTICS algorithm for anomaly detection. We'll define the model, obtain core distances, set a threshold for anomaly detection, and identify anomalies in the data:
 
 
# Define the model 
model = OPTICS().fit(x)
 
# Get core distances 
scores = model.core_distances_ 
 
# Set a threshold  
thresh = quantile(scores, 0.98)
 
# Identify anomalies  
index = where(scores >= thresh)
values = x[index] 
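
To make the score more concrete: with the default settings, the core distance of a sample is the distance to its min_samples-th nearest neighbor, counting the sample itself, and min_samples defaults to 5. The sketch below compares core_distances_ with the same quantity computed through NearestNeighbors. It generates its own data, and the random_state value is an assumption made for reproducibility.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

data, _ = make_blobs(n_samples=350, centers=1, cluster_std=0.4, random_state=123)

# Default OPTICS: min_samples=5, max_eps=inf
model = OPTICS().fit(data)

# Distance from each sample to its 5th nearest neighbor. Because the query
# set is the training data itself, each sample is its own first neighbor.
distances, _ = NearestNeighbors(n_neighbors=5).fit(data).kneighbors(data)
kth_distance = distances[:, -1]

print(np.allclose(model.core_distances_, kth_distance))
```

A sample in a sparse region has to reach further to find its neighbors, so its core distance is larger. That is what makes the value usable as an anomaly score.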
 

Core distances serve as the anomaly scores here, since samples in sparse regions have larger core distances. We'll extract the threshold value from the scores using the quantile() function. In this example, we treat 98% of the data as normal and the remaining 2% as outliers. You can adjust the quantile to match the share of anomalies you expect. Samples with a score equal to or higher than the threshold are flagged as anomalies.
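
The thresholding step can also be wrapped in a small helper so that the outlier fraction becomes a parameter. The function name flag_by_quantile and its contamination argument are hypothetical names introduced here; the logic mirrors the quantile() and where() calls above. The score array is made up for the demo.

```python
import numpy as np

def flag_by_quantile(scores, contamination=0.02):
    """Return the threshold and the indices of scores at or above it."""
    thresh = np.quantile(scores, 1.0 - contamination)
    index = np.where(scores >= thresh)[0]
    return thresh, index

# Illustrative scores: 98 small values followed by 2 clearly larger ones
scores = np.concatenate([np.linspace(0.1, 0.2, 98), [0.9, 1.0]])
thresh, index = flag_by_quantile(scores, contamination=0.02)

print("threshold:", thresh)
print("flagged indices:", index)
```

Only the two large values at positions 98 and 99 sit at or above the 98th percentile, so they are the ones flagged.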

Finally, we'll visualize the results by highlighting the anomalies in a plot:

 
# Visualize the anomalies 
plt.scatter(x[:, 0], x[:, 1])
plt.scatter(values[:, 0],values[:, 1], color='r')
plt.legend(("normal", "anomaly"), loc="best", fancybox=True, shadow=True)
plt.grid(True)
plt.show()  
 

   In this tutorial, we've introduced you to the OPTICS algorithm for anomaly detection using Scikit-learn's OPTICS class in Python. You've learned how to prepare data, apply the algorithm, and visualize anomalies. The source code for the entire tutorial is provided below.
 

Source Code Listing

 
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt

random.seed(123)
x, _ = make_blobs(n_samples=350, centers=1, cluster_std=.4, center_box=(20, 5))

plt.scatter(x[:,0], x[:,1])
plt.grid(True)
plt.show() 
 
# Define the model 
model = OPTICS().fit(x) 
 
# Get core distances 
scores = model.core_distances_  
 
# Set a threshold  
thresh = quantile(scores, 0.98) 
 
# Identify anomalies  
index = where(scores >= thresh)
values = x[index] 

# Visualize the anomalies
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.legend(("normal", "anomaly"), loc="best", fancybox=True, shadow=True)
plt.grid(True)
plt.show()
  

