DataTechNotes: Anomaly Detection Example with K-means in Python

K-means clustering is primarily used for grouping similar data points together. In this tutorial, I'll share my approach how to use the KMeans to detect outlier detection in data. The tutorial covers:

Introduction to K-Means algorithm
Approach for anomaly detection
Preparing the data
Anomaly detection with K-means
Conclusion
Source code listing

If you want to know other anomaly detection methods, please check out my A Brief Explanation of 8 Anomaly Detection Methods with Python tutorial.

Introduction to K-Means algorithm

K-means is a clustering algorithm that partitions data into 'k' clusters. It starts by randomly selecting 'k' centroids, then iteratively assigns each data point to the nearest centroid and recalculates the centroids based on the mean of the points assigned to each cluster. This process continues until the centroids stabilize or a specified number of iterations is reached. K-means aims to minimize the within-cluster sum of squared distances, effectively grouping similar data points together. The result is 'k' clusters with centroids representing their respective centers, allowing for data segmentation based on similarity.

Approach for anomaly detection

In this tutorial, we fit the K-means algorithm to the data and obtain the cluster labels. However, setting n_clusters=1 means that only one cluster is created, effectively finding the centroid of the data. Then, the distance of each point to this centroid is calculated to identify outliers.

Using K-means for anomaly detection by setting n_clusters=1 and considering points far from the centroid as outliers can work in some cases, especially when you're dealing with data that forms well-defined clusters. However, it's not a robust method for anomaly detection in general.

Note that this is only the approach I wanted to try it.

Preparing the data

We'll start by loading the necessary libraries for this tutorial.

 
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

We'll use randomly generated data by using make_blob() function of Scikit-learn.

# Generate synthetic data with three clusters
X, _ = make_blobs(n_samples=300, centers=2, cluster_std=2.6, random_state=42)
 
# Visualize the data
plt.scatter(X[:,0], X[:,1])
plt.show() 

Anomaly detection with KMeans

We instantiate KMeans class of Scikit-learn and fit to the data with n_clusters=1, effectively finding one cluster.

# Fit k-means to the  data
kmeans = KMeans(n_clusters=1, random_state=42)
kmeans.fit(X)
 

Next, we predict cluster labels using predict() method.

# Predict cluster labels
labels = kmeans.predict(X)

The distances of each data point to the cluster center are calculated using transform() method.

# Calculate distances of each point to the cluster center
cluster_distances = kmeans.transform(X)

Euclidean distances from each point to the cluster center are calculated and the top n_outlier points with the longest distances are identified as outliers.

 
# Number of outliers to identify
n_outlier = 4

# Calculate Euclidean distances and find the outliers
euclidean_distances = np.linalg.norm(cluster_distances, axis=1)
outlier_indices = np.argsort(euclidean_distances)[-n_outlier:]

Data points are plotted with their cluster labels. Outliers are highlighted in red, and the cluster center is marked with a black cross.

 
# Plot the data points and cluster center
plt.scatter(X[:, 0], X[:, 1], c=labels, alpha=0.7)

# Highlight outliers with a different color
plt.scatter(X[outlier_indices, 0], X[outlier_indices, 1], marker='x', 
            s=70, c='red', label='Anomalies')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 
            marker='o', c='blue', label='Cluster Center')
plt.title('K-Means Anomaly Detection')
plt.legend()
plt.show()

Conclusion

In this tutorial, we learned how to detect anomalies using Kmeans and distance calculation.

While K-means can be a simple and computationally efficient method for clustering, it might not always be the best choice for anomaly detection. It's essential to consider the characteristics of your data and explore other methods that are specifically designed for anomaly detection to ensure robust and accurate results.

Source code listing

 
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data with three clusters
X, _ = make_blobs(n_samples=300, centers=2, cluster_std=2.6, random_state=42)

# Visualize the data
plt.scatter(X[:,0], X[:,1])
plt.show()

# Fit k-means to the  data
kmeans = KMeans(n_clusters=1, random_state=42)
kmeans.fit(X)

# Predict cluster labels
labels = kmeans.predict(X)

# Calculate distances of each point to the cluster center
cluster_distances = kmeans.transform(X)

# Number of outliers to identify
n_outlier = 4

# Calculate Euclidean distances and find the outliers
euclidean_distances = np.linalg.norm(cluster_distances, axis=1)
outlier_indices = np.argsort(euclidean_distances)[-n_outlier:]

# Plot the data points and cluster center
plt.scatter(X[:, 0], X[:, 1], c=labels, alpha=0.7)

# Highlight outliers with a different color
plt.scatter(X[outlier_indices, 0], X[outlier_indices, 1], marker='x', 
            s=70, c='red', label='Anomalies')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 
            marker='o', c='blue', label='Cluster Center')
plt.title('K-Means Anomaly Detection')
plt.legend()
plt.show()
 

DataTechNotes

Pages

Anomaly Detection Example with K-means in Python

2 comments: