Outlier Detection with Local Outlier Factor in R

   The 'Rlof' package provides the lof() function, which computes the local outlier factor (LOF) of each observation in a dataset based on its k nearest neighbors. An LOF close to 1 means a point is about as densely surrounded as its neighbors, while a value well above 1 marks a point in a sparser region, that is, a potential outlier.
   In this post, we'll learn how to use the lof() function to extract outliers from a dataset with a decision threshold. The tutorial requires the 'Rlof' package, so we'll start by installing it.

install.packages("Rlof")

Then we can load the package.

library(Rlof)

Preparing the data 

First, we'll generate a sample dataset for this tutorial and visualize it in a plot.

set.seed(124)
test = runif(100)*10
test[sample(1:100, 6)] = sample(-10:30, 6)
 
plot(test, col="blue", type='p', pch=19)
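Because the injected values are drawn from -10:30, some of them can fall inside the 0 to 10 range of the base data and will not look unusual at all. The sketch below is a variation on the data generation above that keeps the injected positions and values in separate variables so they can be compared with the detection results later. The names x, injected_idx, injected_val and outside_idx are made up for illustration, and since the random draws are made in separate steps, the values need not match the listing above.

```r
set.seed(124)
x <- runif(100) * 10            # base data, uniform on (0, 10)

# Draw positions and values separately so they can be inspected later.
injected_idx <- sample(1:100, 6)
injected_val <- sample(-10:30, 6)
x[injected_idx] <- injected_val

# Injected values that land outside the 0-10 range of the base data.
outside_idx <- injected_idx[injected_val < 0 | injected_val > 10]

print(data.frame(index = injected_idx, value = injected_val))
print(outside_idx)
```

Only the positions in outside_idx are reasonable candidates for detection; the remaining injected values blend in with the uniform data.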
 



Computing the LOF scores

Next, we calculate the LOF of each element in the test data. Here, we set the argument k to 5, so each score is computed from a point's 5 nearest neighbors. We can print the first few scores.

mlof = lof(test, k=5)
 
head(mlof)
[1] 1.0379733 1.0355735 1.0372052 0.9481038 0.9537252 1.1713114 
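To make clearer what these scores measure, here is a minimal base-R sketch of the standard LOF definition (k-distance, reachability distance, local reachability density). It is an illustration only, not the code that 'Rlof' runs; the function name lof_sketch and the toy vector are assumptions, and the sketch does not handle duplicated values, which would produce a zero reachability distance.

```r
# Illustrative LOF for a numeric vector; not the Rlof implementation.
lof_sketch <- function(x, k) {
  d <- as.matrix(dist(x))        # pairwise distances
  diag(d) <- Inf                 # a point is not its own neighbor
  n <- nrow(d)

  # k-distance: distance from each point to its k-th nearest neighbor
  kdist <- apply(d, 1, function(row) sort(row)[k])

  # neighborhood: every point within the k-distance
  nbrs <- lapply(seq_len(n), function(i) which(d[i, ] <= kdist[i]))

  # local reachability density: inverse of the mean reachability distance,
  # where reach-dist(p, o) = max(k-distance(o), d(p, o))
  lrd <- sapply(seq_len(n), function(i) {
    nb <- nbrs[[i]]
    1 / mean(pmax(kdist[nb], d[i, nb]))
  })

  # LOF: mean density of the neighbors divided by the point's own density
  sapply(seq_len(n), function(i) mean(lrd[nbrs[[i]]]) / lrd[i])
}

toy <- c(1, 2, 3, 4, 5, 100)     # the last value is far from the rest
toy_lof <- lof_sketch(toy, k = 2)
print(round(toy_lof, 2))
print(which.max(toy_lof))
```

Points inside the dense group get scores near 1, while the isolated value gets a score far above 1 because its neighbors are much more densely packed than it is.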

Next, we check the quantiles of the LOF scores to see how they are distributed. Most values lie close to 1, while the maximum is far larger.

quantile(mlof)
        0%        25%        50%        75%       100% 
 0.9007470  0.9794372  1.0363230  1.1577253 26.9784639 

Here, I use the 97th percentile of the LOF scores as the threshold above which a value is treated as an outlier. You may change it according to your data.

quantile(mlof, .97)
     97% 
1.976995 
 
thr = quantile(mlof, .97)
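One thing to keep in mind with a quantile threshold is that it flags a roughly fixed share of the observations, whether or not that many are truly unusual. The sketch below shows how the number of flagged points changes with the chosen percentile. The helper flag_by_quantile and the scores vector are made up for illustration; in practice the LOF scores from lof() would take their place.

```r
# Hypothetical helper: indices of scores at or above the given quantile.
flag_by_quantile <- function(scores, prob = 0.97) {
  thr <- quantile(scores, prob, names = FALSE)
  which(scores >= thr)
}

# Made-up scores: 97 values near 1 and three clearly larger ones.
scores <- c(seq(0.9, 1.2, length.out = 97), 5, 6, 7)

probs <- c(0.90, 0.95, 0.97, 0.99)
n_flagged <- sapply(probs, function(p) length(flag_by_quantile(scores, p)))
print(data.frame(prob = probs, flagged = n_flagged))
```

With these scores, lower percentiles also flag ordinary values near 1, while the 99th percentile misses two of the three large ones, so it helps to look at the score distribution before settling on a cut-off.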

Next, we'll extract the elements that are equal to or higher than the threshold value from test data.

out_index = which(mlof >= thr)
 
print(out_index)
[1] 14 19 34
 
print(test[out_index])
[1] 18 -4 -3

Finally, we'll plot the results to check the outliers in a chart.

plot(test, col="blue", type='p', pch=19)
points(x=out_index, y=test[out_index], pch=19, col="red")
 

The plot shows the detected outliers in red against the rest of the test data in blue.
   In this post, we've briefly learned how to use the lof() function to find outliers in a dataset.


Source code listing

install.packages("Rlof") 
library(Rlof)
 
set.seed(124)
 
test = runif(100)*10
test[sample(1:100, 6)] = sample(-10:30, 6)
  
plot(test, col="blue", type='p', pch=19)
 
mlof = lof(test, k=5)
print(mlof) 
 
quantile(mlof)
 
quantile(mlof, .97)
 
thr = quantile(mlof, .97)
out_index = which(mlof >= thr)
 
print(out_index)
print(test[out_index])
   
plot(test, col="blue", type='p', pch=19)
points(x=out_index, y=test[out_index], pch=19, col="red")


