K-means is primarily a clustering method, but it can also be used to flag outliers: samples that lie far from their cluster centroid. I experimented with this approach for anomaly detection and it worked well for my test scenario. That said, it is usually better to choose a method designed for anomaly detection that fits the data you are dealing with.

In this tutorial, we'll learn how to detect outliers for regression data by applying the KMeans class of Scikit-learn API in Python. The tutorial covers:

- The K-Means algorithm
- Preparing the data
- Anomaly detection with K-means
- Testing with Boston housing dataset
- Source code listing

```python
from sklearn.cluster import KMeans
from numpy import sqrt, random, array, argsort
from sklearn.preprocessing import scale
from sklearn.datasets import load_boston
import matplotlib.pyplot as plt
```

**The K-Means algorithm**

K-means is a clustering algorithm. The method selects K random points in the dataset as initial centroids, then assigns each sample to its closest centroid by distance. The centroids are recomputed from their assigned samples and the process repeats until the distances between samples and centroids stop improving.
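The assign-then-update loop above can be sketched in a few lines of NumPy. This is an illustrative toy, not the scikit-learn implementation; the point counts, K = 3, and seeding centroids from the first samples are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(20, 2))   # 20 two-dimensional samples
centroids = points[:3].copy()       # K = 3, seeded from the data

for _ in range(10):
    # distance from every point to every centroid, shape (20, 3)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    # assignment step: each point joins its nearest centroid
    labels = dists.argmin(axis=1)
    # update step: each centroid moves to the mean of its assigned points
    # (keep the old centroid if a cluster happens to be empty)
    centroids = np.array([
        points[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(3)
    ])
```

Scikit-learn's `KMeans` adds smarter initialization (`k-means++`), multiple restarts, and a convergence tolerance on top of this basic loop.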

In this tutorial, we'll limit cluster numbers to 1 and fit the model on data to find out single centroid. Then, we'll calculate the distances from the centroid and extract the top n long-distance samples as outliers. Note that we use the K-means method to detect center point of a given dataset only.

**Preparing the data**

We'll use randomly generated regression data as the target dataset. We can write a simple function to generate the sample data.

```python
random.seed(123)

def makeData(N):
    x = []
    for i in range(N):
        a = i / 1000 + random.uniform(-3, 2)
        r = random.uniform(-5, 10)
        if r >= 9.9:
            r = r + 10
        elif r < -4.8:
            r = r + (-10)
        x.append([a + r])
    return array(x)

x = makeData(500)
```

We'll plot the data for a visual check.

```python
x_ax = range(500)
plt.plot(x_ax, x)
plt.show()
```

Next, we'll scale the dataset.

`x = scale(x)`
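The `scale()` function standardizes each column: it subtracts the mean and divides by the standard deviation, so the scaled data has mean close to 0 and standard deviation close to 1. A quick check with a tiny made-up array:

```python
import numpy as np
from sklearn.preprocessing import scale

data = np.array([[1.0], [2.0], [3.0], [4.0]])
scaled = scale(data)
# the column is standardized: mean ~0.0, standard deviation ~1.0
print(scaled.mean(), scaled.std())
```

Standardizing matters here because the outlier threshold is based on distance from the centroid, and scaling puts that distance on a consistent footing.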

**Anomaly detection with K-means**

We'll use the KMeans class of the Scikit-learn API to define the K-means model.

```python
kmeans = KMeans(n_clusters=1).fit(x)
print(kmeans)
```

```
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=1, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
```

We'll get the centroid from the fitted model.

```python
center = kmeans.cluster_centers_
print(center)
```

[[-2.30926389e-17]]

Next, we'll calculate the distance of each sample from the center value.

`distance = sqrt((x - center)**2)`
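Since our data has a single feature, the Euclidean distance to the centroid reduces to an absolute difference, so `sqrt((x - center)**2)` is equivalent to `abs(x - center)`. A small check with made-up values:

```python
import numpy as np

x = np.array([[-1.5], [0.2], [2.0]])
center = np.array([[0.1]])
# for one-dimensional data these two computations agree element-wise
d1 = np.sqrt((x - center) ** 2)
d2 = np.abs(x - center)
print(np.allclose(d1, d2))   # True
```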

Then, we'll sort the distances by using the argsort() method and extract the indexes of the elements with the longest distances.

```python
order_index = argsort(distance, axis=0)
indexes = order_index[-5:]
```
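To see why slicing the last entries gives the outliers: `argsort` returns the indexes that would sort the array in ascending order, so the final n entries point at the n largest distances. A tiny example with made-up distances:

```python
import numpy as np

distance = np.array([0.3, 2.1, 0.7, 5.4, 1.0])
order_index = np.argsort(distance)   # indexes in ascending distance order
top2 = order_index[-2:]              # the two most distant samples
print(top2, distance[top2])          # → [1 3] [2.1 5.4]
```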

We'll get the values of the elements.

`values = x[indexes]`

Finally, we'll visualize the results in a plot by highlighting the anomalies with a color.

```python
plt.plot(x_ax, x)
plt.scatter(indexes, values, color='r')
plt.show()
```

**Testing with Boston housing dataset**

We can apply the same method to the Boston housing dataset. We'll use only the y target data from this dataset, reshaping and scaling it for the KMeans model.

```python
boston = load_boston()
y = boston.target
y = y.reshape(y.shape[0], 1)
y = scale(y)
```
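Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in version 1.2. On newer versions, a sketch of one alternative is to fetch the same dataset from OpenML (this assumes network access and that the OpenML "boston" dataset, version 1, remains available):

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import scale

# fetch the Boston housing data from OpenML instead of load_boston()
boston = fetch_openml(name="boston", version=1, as_frame=False)
y = np.asarray(boston.target, dtype=float)
y = y.reshape(-1, 1)   # column vector, as the KMeans model expects
y = scale(y)
```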

Next, we'll define the model, fit it on the y data, and find the center. Then, we'll calculate the distance of each sample.

```python
kmeans = KMeans(n_clusters=1).fit(y)
print(kmeans)
center = kmeans.cluster_centers_
print(center)
```

```python
distance = sqrt((y - center)**2)
order_index = argsort(distance, axis=0)
indexes = order_index[-10:]
values = y[indexes]
```

Finally, we'll visualize the results in a plot by highlighting the anomalies with a color.

```python
x_ax = range(y.shape[0])
plt.plot(x_ax, y)
plt.scatter(indexes, values, color='r')
plt.show()
```

**Source code listing**

```python
from sklearn.cluster import KMeans
from numpy import sqrt, array, random, argsort
from sklearn.preprocessing import scale
from sklearn.datasets import load_boston
import matplotlib.pyplot as plt

random.seed(123)

def makeData(N):
    x = []
    for i in range(N):
        a = i / 1000 + random.uniform(-3, 2)
        r = random.uniform(-5, 10)
        if r >= 9.9:
            r = r + 10
        elif r < -4.8:
            r = r + (-10)
        x.append([a + r])
    return array(x)

x = makeData(500)
x_ax = range(500)
plt.plot(x_ax, x)
plt.show()

x = scale(x)
kmeans = KMeans(n_clusters=1).fit(x)
print(kmeans)
center = kmeans.cluster_centers_
print(center)
distance = sqrt((x - center)**2)
order_index = argsort(distance, axis=0)
indexes = order_index[-5:]
values = x[indexes]
plt.plot(x_ax, x)
plt.scatter(indexes, values, color='r')
plt.show()

# Boston housing dataset case
boston = load_boston()
y = boston.target
y = y.reshape(y.shape[0], 1)
y = scale(y)
kmeans = KMeans(n_clusters=1).fit(y)
print(kmeans)
center = kmeans.cluster_centers_
print(center)
distance = sqrt((y - center)**2)
order_index = argsort(distance, axis=0)
indexes = order_index[-10:]
values = y[indexes]
x_ax = range(y.shape[0])
plt.plot(x_ax, y)
plt.scatter(indexes, values, color='r')
plt.show()
```
