DataTechNotes: Anomaly Detection Example With OPTICS Method in Python

OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based clustering algorithm similar to DBSCAN. Rather than producing a single flat clustering, it orders the points so that the density-based cluster structure of the data can be read off at different density levels. In this tutorial, we will explore how to apply the OPTICS method to detect anomalies in a given dataset using the OPTICS class from the Scikit-learn library in Python.

Tutorial Overview

In this tutorial, we will cover the following steps:

Understanding OPTICS: An overview of OPTICS and its suitability for anomaly detection
Preparing the Data: Generating synthetic data using the make_blobs function
Anomaly Detection with OPTICS: Defining an OPTICS model and identifying anomalies in the dataset
Source Code Listing

To learn about other anomaly detection methods, please check out my tutorial "A Brief Explanation of 8 Anomaly Detection Methods with Python".

Why Use OPTICS for Anomaly Detection?

OPTICS is primarily a clustering algorithm, but it can be adapted for anomaly detection due to its unique characteristics:

Density-Based: OPTICS identifies dense clusters, making it well-suited for identifying outliers in sparser regions, which are often anomalies.
Hierarchical Structure: OPTICS reveals the hierarchical structure of clusters. It can help differentiate between anomalies within smaller clusters and those in larger clusters.
Less Parameter Tuning: Unlike DBSCAN, OPTICS does not depend on a single fixed neighborhood radius (eps), so it can adapt to clusters of varying density, size, and shape.
Robust to Noise: Noise points are those data points that do not fit into any cluster. These can be useful for anomaly detection, as they often represent outliers.
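As a rough illustration of the last point, the sketch below fits OPTICS on a toy dataset and treats the points labelled -1 (noise) as candidate anomalies. The toy data, the three injected far-away points, and the settings min_samples=10, cluster_method="dbscan", and eps=1.0 are all assumptions chosen for this illustration; they are not part of the main example in this tutorial, which scores points by core distance instead.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Toy data (assumed for illustration): one dense blob plus three far-away points
X_blob, _ = make_blobs(n_samples=200, centers=[[0, 0]], cluster_std=0.5,
                       random_state=0)
X_far = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -10.0]])
X_demo = np.vstack([X_blob, X_far])

# DBSCAN-style label extraction at eps=1.0 (an arbitrary choice for this toy data)
demo_model = OPTICS(min_samples=10, cluster_method="dbscan", eps=1.0).fit(X_demo)

# Points that belong to no cluster receive the label -1
noise_mask = demo_model.labels_ == -1
print("number of noise points:", noise_mask.sum())
print("far-away points flagged as noise:", noise_mask[-3:])
```

How many points end up as noise depends heavily on the data and on the chosen parameters, so these values should be tuned rather than copied.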

Required Libraries and Functions

Before we begin, let's load the required libraries and functions:

from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt

Preparing the Data

We'll start by generating synthetic data using the make_blobs function and visualizing it in a plot:

# Generate random data

random.seed(123)
# center_box gives the (low, high) bounds for the random cluster center
x, _ = make_blobs(n_samples=350, centers=1, cluster_std=.4, center_box=(5, 20))

# Visualize the data

plt.scatter(x[:,0], x[:,1])
plt.grid(True)
plt.show()
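If you prefer not to rely on NumPy's global random state, make_blobs also accepts a random_state argument. The sketch below is a minimal variant of the data generation step under that assumption; the seed value 123 is arbitrary, the names x_a and x_b are only for illustration, and the center_box bounds are written in (low, high) order.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Passing random_state makes the call reproducible without a global seed
x_a, _ = make_blobs(n_samples=350, centers=1, cluster_std=0.4,
                    center_box=(5, 20), random_state=123)
x_b, _ = make_blobs(n_samples=350, centers=1, cluster_std=0.4,
                    center_box=(5, 20), random_state=123)

# Two calls with the same seed should return identical samples
print(np.array_equal(x_a, x_b))
print(x_a.shape)
```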

Anomaly Detection with OPTICS

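Next, we'll apply the OPTICS algorithm for anomaly detection. We'll define the model, obtain the core distances to use as anomaly scores, set a threshold, and identify anomalies in the data. A point's core distance measures how far it must reach to find min_samples neighbors (5 by default), so a large core distance means the point lies in a sparse region: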

# Define the model

model = OPTICS().fit(x)

# Get core distances

scores = model.core_distances_

# Set a threshold

thresh = quantile(scores, 0.98)

# Identify anomalies

index = where(scores >= thresh)
values = x[index]
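To make the meaning of the scores more concrete, here is a small sketch that recomputes core distances by hand with a nearest-neighbor search and compares them with core_distances_. It assumes the default Euclidean metric and default max_eps, and that the point itself counts as one of the min_samples neighbors, which is how current Scikit-learn versions appear to behave. The names x_demo, model_demo, and manual_core are used only in this sketch.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Small stand-in dataset for this check (seed chosen arbitrarily)
x_demo, _ = make_blobs(n_samples=350, centers=1, cluster_std=0.4,
                       random_state=123)

model_demo = OPTICS(min_samples=5).fit(x_demo)

# Distance from each point to its 5th nearest neighbor, counting the point itself
knn = NearestNeighbors(n_neighbors=5).fit(x_demo)
dists, _ = knn.kneighbors(x_demo)
manual_core = dists[:, -1]

# The two arrays are expected to agree up to floating-point rounding
print(np.allclose(model_demo.core_distances_, manual_core))
```

Seen this way, the score is simply a k-nearest-neighbor distance: points in dense areas reach their neighbors quickly, while isolated points have to look much farther.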

We extract the threshold value from the scores using the quantile() function. In this example, the 0.98 quantile is the cutoff, so roughly 98% of the samples are treated as normal and the remaining 2%, those with the largest core distances, are classified as outliers. You can set the quantile to suit your own data. Samples with scores equal to or higher than the threshold are identified as anomalies.
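The effect of the quantile cutoff is easy to check on made-up numbers. In the sketch below, toy_scores is a hypothetical stand-in for the core distances (350 evenly spaced values, not real model output), which makes it simple to count how many samples land at or above the 0.98 quantile.

```python
import numpy as np

# Hypothetical scores: 350 evenly spaced values standing in for core distances
toy_scores = np.arange(350, dtype=float)

# Same thresholding logic as in the tutorial
toy_thresh = np.quantile(toy_scores, 0.98)
toy_index = np.where(toy_scores >= toy_thresh)[0]

print("threshold:", toy_thresh)
print("flagged samples:", len(toy_index))  # 7, i.e. 2% of 350
```

With 350 samples, a 0.98 quantile therefore flags about 7 points regardless of whether the data actually contain anomalies, so the quantile should reflect how many outliers you expect.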

Finally, we'll visualize the results by highlighting the anomalies in a plot:

# Visualize the anomalies

plt.scatter(x[:, 0], x[:, 1])
plt.scatter(values[:, 0],values[:, 1], color='r')
plt.legend(("normal", "anomaly"), loc="best", fancybox=True, shadow=True)
plt.grid(True)
plt.show()

In this tutorial, we've introduced you to the OPTICS algorithm for anomaly detection using Scikit-learn's OPTICS class in Python. You've learned how to prepare data, apply the algorithm, and visualize anomalies. The source code for the entire tutorial is provided below.

Source Code Listing

from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt

random.seed(123)
# center_box gives the (low, high) bounds for the random cluster center
x, _ = make_blobs(n_samples=350, centers=1, cluster_std=.4, center_box=(5, 20))

plt.scatter(x[:,0], x[:,1])
plt.grid(True)
plt.show() 
 
# Define the model
model = OPTICS().fit(x)

# Get core distances
scores = model.core_distances_

# Set a threshold
thresh = quantile(scores, 0.98)

# Identify anomalies
index = where(scores >= thresh)
values = x[index]
 
# Visualize the anomalies 
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.legend(("normal", "anomaly"), loc="best", fancybox=True, shadow=True)
plt.grid(True)
plt.show()  

References:

Scikit-learn API
