Anomaly Detection Example with Local Outlier Factor in Python

   The Local Outlier Factor (LOF) is an algorithm for detecting anomalies in observation data. Its main idea is to measure the local density of each sample and compare it with the local densities of its neighbors. Samples whose density is substantially lower than that of their neighbors are treated as anomalies.

   "The local outlier factor is based on a concept of a local density, where locality is given by nearest neighbors, whose distance is used to estimate the density. By comparing the local density of an object to the local densities of its neighbors, one can identify regions of similar density, and points that have a substantially lower density than their neighbors. These are considered to be outliers."
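To make the density idea concrete, here is a minimal sketch on a hypothetical toy dataset (the six points and the choice of n_neighbors=3 are assumptions made only for illustration). Five points sit close together and one lies far away, so the isolated point should receive a score far below -1 while the others stay near -1.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical toy data: five tightly packed points plus one isolated point.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [1.0, 1.0], [0.5, 0.5], [5.0, 5.0]])

toy_model = LocalOutlierFactor(n_neighbors=3)
toy_labels = toy_model.fit_predict(points)
toy_scores = toy_model.negative_outlier_factor_

# Scores close to -1 mean "about as dense as the neighbors";
# the isolated point gets a much more negative score.
for point, score, label in zip(points, toy_scores, toy_labels):
    print(point, round(float(score), 2), label)
```

The last row is the only one expected to carry the label -1, because its local density is several times lower than that of its nearest neighbors.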

   In this tutorial, we'll learn how to detect anomalies in a dataset by using the Local Outlier Factor method in Python. Scikit-learn provides the LocalOutlierFactor class for this algorithm, and we'll use it throughout. The tutorial covers:
  1. Preparing the dataset
  2. Defining the model and prediction
  3. Anomaly detection with scores
  4. Source code listing
We'll start by loading the required packages for this tutorial.

from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt


Preparing the dataset

We'll create a random sample dataset for this tutorial by using the make_blobs() function. Seeding NumPy's random generator first makes the sample reproducible.

random.seed(1)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(10,10))

We'll check the dataset by visualizing it in a plot.

plt.scatter(x[:,0], x[:,1])
plt.show()
 


Defining the model and prediction 

We'll define the model by using the LocalOutlierFactor class of the Scikit-learn API. We'll set the number of neighbors (n_neighbors) and the contamination value in the arguments. Contamination defines the expected proportion of outliers in the dataset.

lof = LocalOutlierFactor(n_neighbors=20, contamination=.03)

We'll fit the model on the x dataset and get the predicted labels with the fit_predict() method.

y_pred = lof.fit_predict(x)

The method returns 1 for inliers and -1 for outliers, so we'll extract the samples labeled -1 as the outliers.

lofs_index = where(y_pred==-1)
values = x[lofs_index]
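As a quick sanity check, the sketch below counts the labels returned by fit_predict(). It rebuilds the dataset with random_state=1, which is an assumption made here so the block runs on its own, and the names data, detector, and labels are illustrative only. With contamination=.03 and 200 samples, roughly 6 samples should be labeled -1.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor

# random_state=1 is an assumption for reproducibility of this sketch.
data, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3,
                     center_box=(10, 10), random_state=1)

detector = LocalOutlierFactor(n_neighbors=20, contamination=.03)
labels = detector.fit_predict(data)

# Count how many samples fall under each label (1 = inlier, -1 = outlier).
label_values, label_counts = np.unique(labels, return_counts=True)
print(dict(zip(label_values.tolist(), label_counts.tolist())))

# Fraction of samples flagged as outliers; should be close to 0.03.
n_outliers = int((labels == -1).sum())
print(n_outliers / len(data))
```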

Finally, we'll visualize the results in a plot by highlighting the anomalies with a color.

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.show()
 




Anomaly detection with scores

In the second method, we'll define the model without setting the contamination argument.

model = LocalOutlierFactor(n_neighbors=20) 

We'll fit the model on the x dataset, then extract the score of each sample from the negative_outlier_factor_ attribute.

model.fit_predict(x)
lof = model.negative_outlier_factor_ 
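The scores are the negated LOF values, so inliers sit near -1 and outliers are more negative. When contamination is left at its default, scikit-learn documents a fixed cutoff of -1.5, exposed as the offset_ attribute. The sketch below rebuilds the labels from the scores under that rule; the dataset is regenerated with random_state=1 and the names scorer and manual_labels are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor

# random_state=1 is an assumption for reproducibility of this sketch.
data, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3,
                     center_box=(10, 10), random_state=1)

scorer = LocalOutlierFactor(n_neighbors=20)  # contamination defaults to 'auto'
auto_labels = scorer.fit_predict(data)
scores = scorer.negative_outlier_factor_

# With the default contamination, the cutoff is the fixed offset_ value.
print(scorer.offset_)

# Samples scoring below offset_ are labeled -1, all others 1.
manual_labels = np.where(scores < scorer.offset_, -1, 1)
print(np.array_equal(auto_labels, manual_labels))
```

Because this fixed cutoff is not tied to a chosen proportion of outliers, the tutorial derives its own threshold from the scores instead.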

Next, we'll obtain the threshold value from the scores by using the quantile function. Here, we'll treat the lowest 3 percent of the score values as anomalies.

thresh = quantile(lof, .03)
print(thresh)
-1.8191482960907037
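To see how the quantile turns into a cutoff, consider this small sketch on made-up scores (200 evenly spaced values between -2 and -1, chosen only for illustration). The 3 percent quantile falls between the 6th and 7th smallest values, so 6 of the 200 scores end up at or below it.

```python
import numpy as np

# Hypothetical scores: 200 evenly spaced values from -2 to -1.
sample_scores = np.linspace(-2.0, -1.0, 200)

# The 0.03 quantile is interpolated between neighboring sorted values.
cutoff = np.quantile(sample_scores, .03)

# Count the scores at or below the cutoff and their share of the sample.
n_below = int((sample_scores <= cutoff).sum())
print(cutoff, n_below, n_below / sample_scores.size)
```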

We'll extract the anomalies by selecting the samples whose score is at or below the threshold.

index = where(lof<=thresh)
values = x[index]

Finally, we can visualize the results in a plot by highlighting the anomalies with a color.

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()

Both methods above give the same result on this dataset, so you can use either of them in your analysis. Lowering the threshold quantile or the contamination value keeps only the more extreme cases.
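The agreement between the two methods can be checked directly. The sketch below is one way to do it, assuming the dataset is regenerated with random_state=1; the names by_label and by_score are illustrative. It compares the indices flagged through contamination with those flagged through the score quantile.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor

# random_state=1 is an assumption for reproducibility of this sketch.
data, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3,
                     center_box=(10, 10), random_state=1)

# Method 1: let the model label outliers through the contamination argument.
labeler = LocalOutlierFactor(n_neighbors=20, contamination=.03)
by_label = np.where(labeler.fit_predict(data) == -1)[0]

# Method 2: threshold the raw scores at their 3 percent quantile.
scorer = LocalOutlierFactor(n_neighbors=20)
scorer.fit(data)
scores = scorer.negative_outlier_factor_
by_score = np.where(scores <= np.quantile(scores, .03))[0]

# Both index arrays should contain the same samples.
print(np.array_equal(by_label, by_score))
```

The two match because the contamination argument only moves the cutoff applied to the scores; the scores themselves do not depend on it.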

   In this tutorial, we've learned how to detect the anomalies with the Local Outlier Factor algorithm by using the Scikit-learn API class in Python. The full source code is listed below.


Source code listing

from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt

random.seed(1)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(10,10))

plt.scatter(x[:,0], x[:,1])
plt.show()

lof = LocalOutlierFactor(n_neighbors=20, contamination=.03)
print(lof)
 
y_pred = lof.fit_predict(x)

lofs_index=where(y_pred==-1)
values = x[lofs_index]

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.show()

model = LocalOutlierFactor(n_neighbors=20)
print(model)  
model.fit_predict(x) 
 
lof = model.negative_outlier_factor_
thresh = quantile(lof, .03)
print(thresh) 
 
index = where(lof<=thresh)
values = x[index]

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.show()


References:
  1. Scikit-learn API
  2. Wikipedia: Local Outlier Factor
