Anomaly Detection Example with K-means in Python

   K-means is primarily a clustering method, but it can also be repurposed for anomaly detection: samples that lie far from their cluster centroid are candidate outliers. I experimented with this approach and it worked well for my test scenario. That said, it is best to choose an anomaly detection method that suits the data you are dealing with.
   In this tutorial, we'll learn how to detect outliers in regression data by applying the KMeans class of the Scikit-learn API in Python. The tutorial covers:
  • The K-Means algorithm
  • Preparing the data
  • Anomaly detection with K-means
  • Testing with Boston housing dataset
  • Source code listing

    If you want to know other anomaly detection methods, please check out my A Brief Explanation of 8 Anomaly Detection Methods with Python tutorial.  

We'll start by loading the required libraries for this tutorial.

from sklearn.cluster import KMeans
from numpy import sqrt, random, array, argsort
from sklearn.preprocessing import scale
from sklearn.datasets import load_boston
import matplotlib.pyplot as plt


The K-Means algorithm

   K-Means is a clustering algorithm. It selects K random points as initial centroids, assigns each sample to its closest centroid by distance, then recomputes each centroid from its assigned samples and repeats until the assignments stabilize.
   In this tutorial, we'll limit the number of clusters to 1 and fit the model on the data to find a single centroid. Then we'll calculate each sample's distance from that centroid and extract the top n most distant samples as outliers. Note that we use the K-means method only to find the center point of the given dataset.
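To make the assignment and update steps concrete, here is a minimal from-scratch sketch of K-means on 1-D data (the function name kmeans_1d and the toy data are illustrative, not part of the tutorial's code):

```python
import numpy as np

def kmeans_1d(data, k, n_iter=10, seed=0):
    """Toy 1-D K-means: pick k random points as centroids, then
    alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = rng.choice(data, size=k, replace=False)
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        labels = np.argmin(np.abs(data[:, None] - centroids[None, :]), axis=1)
        # update step: each centroid moves to the mean of its points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean()
    return centroids, labels

data = np.array([1.0, 1.2, 0.9, 8.0, 8.3, 7.9])
centroids, labels = kmeans_1d(data, k=2)
```

With this toy data the two centroids settle on the means of the two obvious groups; Scikit-learn's KMeans does the same thing with a smarter initialization and a convergence tolerance.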


Preparing the data

We'll use randomly generated regression data as the target dataset. The simple function below generates the sample data.

random.seed(123)

def makeData(N):
	x = []
	for i in range(N):
		a = i / 1000 + random.uniform(-3, 2)
		r = random.uniform(-5, 10)
		# push values in the tails further out to create outliers
		if r >= 9.9:
			r = r + 10
		elif r < -4.8:
			r = r - 10
		x.append([a + r])
	return array(x)

x = makeData(500)

We'll plot the generated data to check it visually.

x_ax = range(500)
plt.plot(x_ax, x)
plt.show()


Next, we'll scale the dataset.

x = scale(x)
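Under the hood, scale standardizes each column to zero mean and unit variance; a quick NumPy equivalent (assuming scale's default settings, with made-up data) looks like this:

```python
import numpy as np

data = np.array([[1.0], [2.0], [3.0], [10.0]])
# what sklearn.preprocessing.scale computes by default: (x - mean) / std
scaled = (data - data.mean(axis=0)) / data.std(axis=0)
```

After this transform, the centroid distances computed later are measured in standard-deviation units, which makes a "top n farthest" cutoff easier to reason about.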


Anomaly detection with KMeans

We'll use the KMeans class of the Scikit-learn API to define the K-means model.

kmeans = KMeans(n_clusters = 1).fit(x)
print(kmeans)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=1, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
We'll get centroids from the fitted model.

center = kmeans.cluster_centers_
print(center)
[[-2.30926389e-17]]

Next, we'll calculate the distance of each sample from the center value.

distance = sqrt((x - center)**2)
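Since the data here is one-dimensional, sqrt((x - center)**2) is simply the absolute deviation from the centroid; a small check with made-up values shows the equivalence:

```python
import numpy as np

x = np.array([[-1.5], [0.2], [2.0]])
center = np.array([[0.0]])
d1 = np.sqrt((x - center) ** 2)   # the tutorial's formula
d2 = np.abs(x - center)           # identical in one dimension
```

For multi-dimensional data, KMeans.transform(x) returns each sample's Euclidean distance to every cluster center, which generalizes this step.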

Then, we'll sort the distances with the argsort() method and extract the indices of the elements with the longest distances.

order_index = argsort(distance, axis = 0)
indexes = order_index[-5:]
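A tiny example with made-up distances shows what argsort returns and why slicing the tail gives the farthest samples:

```python
import numpy as np

d = np.array([0.3, 2.1, 0.7, 5.4, 1.0])
order = np.argsort(d)   # indices that would sort d in ascending order
top2 = order[-2:]       # indices of the two largest distances
# order is [0, 2, 4, 1, 3], so top2 is [1, 3]
```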

We'll get the values of the elements.

values = x[indexes]

Finally, we'll visualize the results in a plot by highlighting the anomalies with a color.

plt.plot(x_ax, x)
plt.scatter(indexes, values, color='r')
plt.show()
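The steps above can be bundled into one reusable helper; the function name top_outliers and the synthetic test data below are assumptions for illustration, not part of the tutorial's code:

```python
import numpy as np
from sklearn.cluster import KMeans

def top_outliers(data, n_outliers=5):
    """Fit a single-cluster KMeans, rank samples by distance to the
    centroid, and return the indices and values of the n farthest ones."""
    km = KMeans(n_clusters=1, n_init=10).fit(data)
    center = km.cluster_centers_
    dist = np.sqrt(((data - center) ** 2).sum(axis=1))
    idx = np.argsort(dist)[-n_outliers:]
    return idx, data[idx]

rng = np.random.default_rng(1)
data = rng.normal(size=(100, 1))
data[[10, 50]] += 8          # plant two obvious outliers
idx, vals = top_outliers(data, n_outliers=2)
```

The helper recovers the two planted outliers, and the same call works unchanged on the Boston housing targets in the next section.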



Testing with Boston housing dataset

We can apply the same method to the Boston housing dataset, using only the y target values, which we'll reshape and scale for the KMeans model. (Note that load_boston is deprecated and was removed in scikit-learn 1.2, so this part requires an older version or a substitute dataset.)

boston = load_boston()
y =  boston.target
y = y.reshape(y.shape[0], 1)
y = scale(y)

Next, we'll define the model, fit the model on y data, and find out the center. Then, we'll calculate the distances of each sample.

kmeans = KMeans(n_clusters = 1).fit(y)
print(kmeans)

center = kmeans.cluster_centers_
print(center)

distance = sqrt((y - center)**2)
order_index = argsort(distance, axis = 0)
indexes = order_index[-10:]
values = y[indexes]

Finally, we'll visualize the results in a plot by highlighting the anomalies with a color.

x_ax = range(y.shape[0])
plt.plot(x_ax, y)
plt.scatter(indexes,values, color='r')
plt.show()


   In this tutorial, we've briefly learned how to detect anomalies with the K-means method by using Scikit-learn's KMeans class in Python. The full source code is listed below.


Source code listing


from sklearn.cluster import KMeans
from numpy import sqrt, array, random, argsort
from sklearn.preprocessing import scale
from sklearn.datasets import load_boston
import matplotlib.pyplot as plt

random.seed(123)

def makeData(N):
	x = []
	for i in range(N):
		a = i / 1000 + random.uniform(-3, 2)
		r = random.uniform(-5, 10)
		if r >= 9.9:
			r = r + 10
		elif r < -4.8:
			r = r - 10
		x.append([a + r])
	return array(x)

x = makeData(500)

x_ax = range(500)
plt.plot(x_ax, x)
plt.show()

x = scale(x)
kmeans = KMeans(n_clusters = 1).fit(x)
print(kmeans)

center = kmeans.cluster_centers_
print(center)

distance = sqrt((x - center)**2)
order_index = argsort(distance, axis = 0)
indexes = order_index[-5:]
values = x[indexes]

plt.plot(x_ax, x)
plt.scatter(indexes, values, color='r')
plt.show()

# Boston housing dataset case
boston = load_boston()
y = boston.target
y = y.reshape(y.shape[0], 1)
y = scale(y)

kmeans = KMeans(n_clusters = 1).fit(y)
print(kmeans)

center = kmeans.cluster_centers_
print(center)

distance = sqrt((y - center)**2)
order_index = argsort(distance, axis = 0)
indexes = order_index[-10:]
values = y[indexes]

x_ax = range(y.shape[0])
plt.plot(x_ax, y)
plt.scatter(indexes, values, color='r')
plt.show()

2 comments:

  1. Hi, thanks for sharing!! I have a question: why do we need to use K-means when k=1? Can we simply use the average or median instead?

    Reply:
    1. You are welcome! Here, k=1 means a single cluster for the given dataset, so we find all outliers around one center.
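The reply's point can be verified directly: with a single cluster, the fitted centroid converges to the sample mean, so for this 1-D case the mean would indeed give the same center. A quick check (with made-up data, not from the tutorial):

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[1.0], [2.0], [3.0], [10.0]])
km = KMeans(n_clusters=1, n_init=10).fit(data)
# with one cluster, the centroid is the mean of all samples
print(km.cluster_centers_[0, 0], data.mean())
```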