Outlier Detection Example With K-means Distance Calculation in R

Outliers in data can be detected by calculating each element's distance from the center of the cluster it belongs to. We can divide the data into a specified number of clusters by using R's kmeans() function.

In this tutorial, I'll detect outliers in a numeric vector by using the kmeans() function and a distance calculation in R. The tutorial covers:

  1. Preparing test data
  2. K-means distance calculation
  3. Source code listing
Preparing test data
 
We'll start by preparing the test data for this tutorial. Here, we use the target column of the Boston housing dataset (column 14, medv). We'll load the dataset and plot the target values.


boston = MASS::Boston
dim(boston)

test = boston[,14]
plot(test, pch=16, col="blue") 
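As a quick sanity check before clustering, the sketch below confirms that column 14 is the medv variable and inspects its size and range. It assumes the MASS package, which ships with standard R installations, is available.

```r
# Load the data the same way as above
boston = MASS::Boston
test = boston[, 14]

# Column 14 is 'medv', so selecting by name is equivalent and more readable
names(boston)[14]
identical(test, boston$medv)

# Basic checks: number of observations, missing values, and value range
length(test)
sum(is.na(test))
range(test)
```

Selecting the column by name (boston$medv) may be easier to read than the positional index, but either form gives the same vector.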
 


K-means distance calculation

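The number of clusters can be decided by inspecting the structure of the test data. Here, we divide the test data into two clusters by passing 2 to the 'centers' parameter of kmeans(). Because kmeans() chooses its starting centers at random, the cluster numbering (1 and 2) may be swapped between runs.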

 
km = kmeans(test, centers=2)
print(km)
  
K-means clustering with 2 clusters of sizes 106, 400

Cluster means:
[,1]
1 36.73019
2 18.77050

Clustering vector:
[1] 2 2 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[34] 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 1 2
[67] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 1 1
[100] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[133] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 1 1 2
[166] 2 1 2 2 2 2 2 2 2 2 1 2 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
[199] 1 1 1 2 1 1 1 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 1 1 1 1 1 1 1 2
[232] 1 1 1 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 1 1 1 1 1 1 1 1
[265] 1 2 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 2 2 1 2
[298] 2 2 1 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[331] 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2
[364] 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[397] 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[430] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[463] 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[496] 2 2 2 2 2 2 2 2 2 2 2

Within cluster sum of squares by cluster:
[1] 6040.163 9648.192
(between_SS / total_SS = 63.3 %)

Available components:

[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
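If the right number of clusters is not obvious from the plot, one common heuristic is to compare the total within-cluster sum of squares for several values of k and look for an "elbow". The following is a minimal sketch of that idea, not part of the original workflow; the seed value, the range 1 to 6, and nstart = 10 are arbitrary assumptions.

```r
test = MASS::Boston[, 14]

# Fix the random seed so the random starting centers are repeatable
set.seed(123)

# Total within-cluster sum of squares for k = 1..6;
# nstart = 10 keeps the best of 10 random starts for each k
ks = 1:6
wss = sapply(ks, function(k) kmeans(test, centers = k, nstart = 10)$tot.withinss)

# Look for the "elbow" where adding clusters stops helping much
plot(ks, wss, type = "b", pch = 16,
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

The tot.withinss component is the sum of the per-cluster values that print(km) reports under "Within cluster sum of squares by cluster".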
 
 

We'll extract the centers from the km object. Indexing km$centers by km$cluster returns, for each element, the center value of the cluster it belongs to; the names of the result are the cluster ids.


centers=km$centers[km$cluster,] 
str(centers)
 
Named num [1:506] 18.8 18.8 36.7 36.7 36.7 ...
- attr(*, "names")= chr [1:506] "2" "2" "1" "1" ...
 
 
head(centers)
 
 2        2        1        1        1        1 
18.77050 18.77050 36.73019 36.73019 36.73019 36.73019 
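The same per-observation centers can also be obtained with the fitted() method that R provides for kmeans objects. The sketch below compares the two approaches; the seed value is an arbitrary assumption added only for repeatability.

```r
test = MASS::Boston[, 14]
set.seed(123)
km = kmeans(test, centers = 2)

# Index the centers matrix by each observation's cluster id ...
centers = km$centers[km$cluster, ]

# ... or let fitted() do the same lookup (it returns a one-column matrix here)
centers_fitted = fitted(km, method = "centers")

# Both give one center value per observation
all.equal(as.numeric(centers), as.numeric(centers_fitted))
```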

Next, we'll calculate the distance of each observation from its cluster center and order the observations by that distance, largest first.

 
distance = sqrt((test-centers)^2)
ordered = order(distance, decreasing = TRUE)
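For a single column, the expression sqrt((test-centers)^2) is simply the absolute difference. With more than one column, the squared differences are summed across columns before taking the square root. The sketch below illustrates both cases; the choice of medv and lstat, the scaling step, and the seed are assumptions made only for illustration.

```r
boston = MASS::Boston
test = boston[, 14]

set.seed(123)
km = kmeans(test, centers = 2)
centers = km$centers[km$cluster, ]

# One column: the Euclidean distance reduces to an absolute difference
distance = sqrt((test - centers)^2)
all.equal(distance, abs(test - centers))

# Two columns (illustrative choice: medv and lstat), scaled to comparable units
X = scale(boston[, c("medv", "lstat")])
km2 = kmeans(X, centers = 2)
centers2 = km2$centers[km2$cluster, ]

# Row-wise Euclidean distance to the assigned center
distance2 = sqrt(rowSums((X - centers2)^2))
head(order(distance2, decreasing = TRUE))
```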

We'll decide how many outliers to report by counting the observations that take one of the two extreme (minimum and maximum) values of the data.


min_out = min(test)
max_out = max(test)
outs_val = c(test[test==min_out], test[test==max_out])

outs_count = length(outs_val)
 

Now we can obtain the outliers by taking the first outs_count indices from the ordered list.


outs = head(ordered, outs_count)
cat("Outliers index: ", outs, "\n")
cat("Outliers value: ", test[outs], "\n")

Outliers index:  399 406 162 163 164 167 187 196 205 226 258 268 284 369 370 371 372 373
Outliers value:  5 5 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50  
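Counting the extreme values is one way to decide how many points to flag. Two other simple options are a fixed count or a cutoff on the distance itself. The sketch below shows both; n_top and the 0.97 quantile level are hypothetical values chosen for illustration, not recommendations.

```r
test = MASS::Boston[, 14]
set.seed(123)
km = kmeans(test, centers = 2)
centers = km$centers[km$cluster, ]
distance = abs(test - centers)

# Option 1: flag a fixed number of points (n_top is an arbitrary choice)
n_top = 10
outs_top = head(order(distance, decreasing = TRUE), n_top)

# Option 2: flag points whose distance exceeds a quantile cutoff
# (the 0.97 level is an assumption, not a recommended value)
cutoff = quantile(distance, 0.97)
outs_q = which(distance > cutoff)

cat("Top-n outlier indices:   ", outs_top, "\n")
cat("Quantile outlier indices:", outs_q, "\n")
```

A distance cutoff adapts to the data, while a fixed count always returns the same number of points even when none of them is unusual.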
 

Finally, we'll plot the data again and mark the outliers in red.

 
plot(test, pch=16, col="blue")
points(outs, test[outs], pch=16, col="red")  
 


In this tutorial, we have briefly learned how to detect outliers by using the kmeans() function and a distance calculation in R. The full source code is listed below.



Source code listing


# load Boston data and extract the target column (medv)
boston = MASS::Boston
dim(boston)
 
test = boston[,14]
plot(test, pch=16, col="blue")

# apply kmeans and extract centers 
km = kmeans(test, centers=2)
centers = km$centers[km$cluster,]
head(centers)

# calculate distance 
distance = sqrt((test-centers)^2)
ordered = order(distance, decreasing = TRUE)

# extract outliers 
min_out = min(test)
max_out = max(test)
outs_val = c(test[test==min_out], test[test==max_out])

outs_count = length(outs_val)
outs = head(ordered, outs_count)
 
cat("Outliers index: ", outs, "\n")
cat("Outliers value: ", test[outs], "\n")

# visualize in a plot
plot(test, pch=16, col="blue")
points(outs, test[outs], pch=16, col="red")
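To reuse these steps on other numeric vectors, they can be wrapped in a small function. The sketch below is one possible way to do so; kmeans_outliers is a hypothetical helper name, and its parameters (k, n, seed) and their defaults are assumptions rather than part of the original code.

```r
# Hypothetical helper: returns the indices of the n points farthest
# from their k-means cluster center, for a numeric vector x
kmeans_outliers = function(x, k = 2, n = 10, seed = 123) {
  set.seed(seed)                               # repeatable starting centers
  km = kmeans(x, centers = k)
  centers = km$centers[km$cluster, 1]          # center assigned to each point
  distance = abs(x - centers)                  # 1-D Euclidean distance
  head(order(distance, decreasing = TRUE), n)  # indices, farthest first
}

test = MASS::Boston[, 14]
outs = kmeans_outliers(test, k = 2, n = 18)
cat("Outliers index: ", outs, "\n")
cat("Outliers value: ", test[outs], "\n")
```

Here n = 18 matches the outlier count found in the tutorial; for other data it would need to be chosen separately.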


