K-means is primarily a clustering method, but it can also be used to flag outliers: samples that lie far from a cluster centroid are candidate anomalies. I experimented with this approach and it worked for my test scenario. That said, it is best to choose an anomaly detection method suited to the data you are working with.
In this tutorial, we'll learn how to detect outliers in regression data by applying the KMeans class of the Scikit-learn API in Python. The tutorial covers:
- The K-Means algorithm
- Preparing the data
- Anomaly detection with K-means
- Testing with Boston housing dataset
- Source code listing
If you want to know other anomaly detection methods, please check out my A Brief Explanation of 8 Anomaly Detection Methods with Python tutorial.
We'll start by loading the required libraries.

from sklearn.cluster import KMeans
from numpy import sqrt, random, array, argsort
from sklearn.preprocessing import scale
from sklearn.datasets import load_boston
import matplotlib.pyplot as plt
The K-Means algorithm
K-Means is a clustering algorithm. First, K points are selected as initial centroids. Each sample is then assigned to its closest centroid by distance, and each centroid is recomputed as the mean of the samples assigned to it. The process repeats until the centroids stop moving, giving optimal distances between the samples and their centroids.
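The loop described above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's actual implementation: `kmeans_sketch` is a hypothetical helper name, Euclidean distance is assumed, and empty clusters are not handled. Scikit-learn's KMeans adds smarter initialization (k-means++), multiple restarts, and tolerance-based stopping.

```python
import numpy as np

def kmeans_sketch(x, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k random samples as the initial centroids.
    centroids = x[rng.choice(len(x), k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each sample with its nearest centroid.
        labels = np.linalg.norm(x[:, None, :] - centroids[None, :, :],
                                axis=2).argmin(axis=1)
        # Update step: move each centroid to the mean of its samples.
        new_centroids = np.array([x[labels == j].mean(axis=0)
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return centroids, labels

# Tiny demo with two well-separated groups of points.
pts = np.array([[0.0], [1.0], [10.0], [11.0]])
centers, labels = kmeans_sketch(pts, k=2)
print(sorted(centers.ravel()))  # -> [0.5, 10.5]
```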
In this tutorial, we'll set the number of clusters to 1 and fit the model on the data to find the single centroid. Then we'll calculate each sample's distance from the centroid and extract the top n most distant samples as outliers. Note that we use the K-means method only to find the center point of the given dataset.
Preparing the data
We'll use randomly generated regression data as the target dataset. Here is a simple function to generate sample data.
random.seed(123)

def makeData(N):
    x = []
    for i in range(N):
        a = i / 1000 + random.uniform(-3, 2)
        r = random.uniform(-5, 10)
        if r >= 9.9:
            r = r + 10
        elif r < -4.8:
            r = r - 10
        x.append([a + r])
    return array(x)

x = makeData(500)
We'll plot the data to inspect it visually.
x_ax = range(500)
plt.plot(x_ax, x)
plt.show()
Next, we'll scale the dataset.
x = scale(x)
Anomaly detection with KMeans
We'll use the Scikit-learn API's KMeans class to define the K-Means model.
kmeans = KMeans(n_clusters=1).fit(x)
print(kmeans)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=1, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
We'll get the centroid from the fitted model.
center = kmeans.cluster_centers_
print(center)
[[-2.30926389e-17]]
Next, we'll calculate the distance of each sample from the center value.
distance = sqrt((x - center)**2)
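For one-dimensional data, this expression reduces to the absolute deviation from the center, which can be verified directly (a small check with made-up values):

```python
import numpy as np

# Hypothetical 1-D samples and a center value, just for illustration.
x = np.array([[-2.0], [0.5], [3.0]])
center = np.array([[0.1]])

# sqrt((x - center)**2) is the element-wise absolute deviation.
distance = np.sqrt((x - center) ** 2)
print(distance.ravel())  # -> [2.1 0.4 2.9]
assert np.allclose(distance, np.abs(x - center))
```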
Then, we'll sort the distances with the argsort() function and extract the indices of the elements with the longest distances.
order_index = argsort(distance, axis=0)
indexes = order_index[-5:]
We'll get the values of the elements.
values = x[indexes]
Finally, we'll visualize the results in a plot, highlighting the anomalies in color.
plt.plot(x_ax, x)
plt.scatter(indexes, values, color='r')
plt.show()
Testing with Boston housing dataset
We can apply the same method to the Boston housing dataset. We'll use only the y target data from this dataset, reshaping and scaling it before fitting the KMeans model. (Note that load_boston is deprecated in recent versions of scikit-learn.)
boston = load_boston()
y = boston.target
y = y.reshape(y.shape[0], 1)
y = scale(y)
Next, we'll define the model, fit it on the y data, and find the center. Then we'll calculate each sample's distance from the center.
kmeans = KMeans(n_clusters=1).fit(y)
print(kmeans)
center = kmeans.cluster_centers_
print(center)
distance = sqrt((y - center)**2)
order_index = argsort(distance, axis=0)
indexes = order_index[-10:]
values = y[indexes]
Finally, we'll visualize the results in a plot by highlighting the anomalies with a color.
x_ax = range(y.shape[0])
plt.plot(x_ax, y)
plt.scatter(indexes, values, color='r')
plt.show()
Source code listing
from sklearn.cluster import KMeans
from numpy import sqrt, array, random, argsort
from sklearn.preprocessing import scale
from sklearn.datasets import load_boston
import matplotlib.pyplot as plt

random.seed(123)

def makeData(N):
    x = []
    for i in range(N):
        a = i / 1000 + random.uniform(-3, 2)
        r = random.uniform(-5, 10)
        if r >= 9.9:
            r = r + 10
        elif r < -4.8:
            r = r - 10
        x.append([a + r])
    return array(x)

x = makeData(500)

x_ax = range(500)
plt.plot(x_ax, x)
plt.show()

x = scale(x)
kmeans = KMeans(n_clusters=1).fit(x)
print(kmeans)
center = kmeans.cluster_centers_
print(center)

distance = sqrt((x - center)**2)
order_index = argsort(distance, axis=0)
indexes = order_index[-5:]
values = x[indexes]

plt.plot(x_ax, x)
plt.scatter(indexes, values, color='r')
plt.show()
# Boston housing dataset case
boston = load_boston()
y = boston.target
y = y.reshape(y.shape[0], 1)
y = scale(y)

kmeans = KMeans(n_clusters=1).fit(y)
print(kmeans)
center = kmeans.cluster_centers_
print(center)

distance = sqrt((y - center)**2)
order_index = argsort(distance, axis=0)
indexes = order_index[-10:]
values = y[indexes]

x_ax = range(y.shape[0])
plt.plot(x_ax, y)
plt.scatter(indexes, values, color='r')
plt.show()
Comment: Hi, thanks for sharing! I have a question: why do we need to use K-means with k=1? Can we simply use the average or median instead?

Reply: You are welcome! Here, k=1 means a single cluster for the given dataset, so we find all outliers around one center.
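On the mean-vs-K-means question: with k=1, the fitted centroid is exactly the sample mean (the update step averages all points), so the mean would indeed give the same center here, while the median would generally differ. A quick check (the variable names are just for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Some arbitrary 1-D samples for the comparison.
rng = np.random.default_rng(42)
x = rng.normal(size=(200, 1))

# With a single cluster, the K-Means centroid equals the data mean.
kmeans = KMeans(n_clusters=1, n_init=10).fit(x)
center = kmeans.cluster_centers_[0, 0]

assert np.isclose(center, x.mean())
```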