Anomaly Detection with Isolation Forest in Python

   An anomaly, or outlier, is an element whose properties differ from those of the majority of the observations. Anomalies may indicate errors, extremes, or abnormal cases in the data. There are several methods to detect anomalies in a dataset, and Isolation Forest is one of them.

   Isolation Forest is a learning algorithm that detects anomalies by isolating the instances in a dataset. The algorithm builds an ensemble of isolation trees (iTrees), and the path length needed to isolate an instance in these trees characterizes how anomalous it is: anomalies are easier to isolate, so they have shorter paths. Isolation Forest (iForest) applies no distance or density measures to detect anomalies. To learn more about the algorithm, please refer to the links listed in the reference section.
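The path-length idea above can be made concrete with the anomaly score from the original Isolation Forest paper. The sketch below is not Scikit-learn's internal code, and the helper names are my own; it compares a sample's average path length E(h(x)) against c(n), the average path length of an unsuccessful search in a binary search tree over n samples:

```python
from math import log

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    # Average path length of an unsuccessful BST search over n samples:
    # c(n) = 2*H(n-1) - 2*(n-1)/n, with H(i) ~ ln(i) + Euler's constant.
    if n <= 1:
        return 0.0
    harmonic = log(n - 1) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    # s(x, n) = 2 ** (-E(h(x)) / c(n)): scores near 1 mean the point is
    # easy to isolate (anomalous); scores well below 0.5 mean it is normal.
    return 2.0 ** (-avg_path_length / n_norm) if (n_norm := c(n)) else 1.0
```

A sample whose average path length equals c(n) gets a score of exactly 0.5; a very short path (easy to isolate) pushes the score toward 1.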

   In this tutorial, we'll learn how to detect anomalies in a dataset by using the Isolation Forest method in Python. The Scikit-learn API provides the IsolationForest class for this algorithm, and we'll use it in this tutorial. The tutorial covers:
  1. Preparing the dataset
  2. Defining the model and prediction 
  3. Anomaly detection with scores
  4. Source code listing
We'll start by loading the required packages for this tutorial.

from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt


Preparing the dataset

We'll create a random sample dataset for this tutorial by using the make_blobs() function.

random.seed(3)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))

We'll check the dataset by visualizing it in a plot.
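The plotting step is the same two lines that appear in the source code listing; the dataset creation is repeated here so the snippet runs on its own:

```python
from sklearn.datasets import make_blobs
from numpy import random
import matplotlib.pyplot as plt

random.seed(3)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))

# Plot the 200 two-dimensional samples as a single cluster.
plt.scatter(x[:, 0], x[:, 1])
plt.show()
```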



Defining the model and prediction

We'll define the model by using the IsolationForest class of the Scikit-learn API. We'll set the number of estimators and the contamination value in the class arguments.

iforest = IsolationForest(n_estimators=100, contamination=.02)

We'll fit the model on the x data and get the predictions with the fit_predict() function.

pred = iforest.fit_predict(x)
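fit_predict() returns 1 for normal points and -1 for anomalies; with contamination=.02 on 200 samples, about 2 percent (roughly 4 points) should be flagged. A self-contained check, re-creating the dataset from above:

```python
from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs
from numpy import random

random.seed(3)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))

iforest = IsolationForest(n_estimators=100, contamination=.02)
pred = iforest.fit_predict(x)

# The output contains only the labels 1 (normal) and -1 (anomaly).
print((pred == -1).sum())  # number of flagged anomalies
```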
 
We'll extract the instances labeled -1 as the outliers.

anom_index = where(pred==-1)
values = x[anom_index]

Finally, we'll visualize the results in a plot, highlighting the anomalies in a different color.

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()



Anomaly detection with scores

In the second method, we'll define the model without setting the contamination argument. 

iforest = IsolationForest(n_estimators=100)
print(iforest)
IsolationForest(behaviour='deprecated', bootstrap=False, contamination='auto',
                max_features=1.0, max_samples='auto', n_estimators=100,
                n_jobs=None, random_state=None, verbose=0, warm_start=False)

The exact parameters printed depend on your scikit-learn version; the point to notice is that contamination defaults to 'auto'.
 
We'll fit the model on the x data, then extract the sample scores with the score_samples() function.

iforest.fit(x)
scores = iforest.score_samples(x)
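score_samples() returns negative values: the closer a score is to -1, the more anomalous the point. Internally, predict() compares these scores against the fitted offset_ attribute, which is -0.5 when contamination='auto'. A self-contained sketch of that documented relationship:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs

np.random.seed(3)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))

iforest = IsolationForest(n_estimators=100).fit(x)
scores = iforest.score_samples(x)

# All raw scores are negative; decision_function() is the same score
# shifted by the fitted offset, so anomalies get negative decision values.
print(scores.min(), scores.max())
print(iforest.offset_)
```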

Next, we'll obtain the threshold value from the scores by using the quantile() function. Here, we'll treat the lowest 2 percent of the scores as anomalies.

thresh = quantile(scores, .02)
print(thresh)
-0.6262114295204622

We'll extract the anomalies by comparing the scores against the threshold value and then getting the values at those indexes.

index = where(scores <= thresh)
values = x[index]

Finally, we can visualize the results in a plot, highlighting the anomalies in a different color.

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()



Both methods above give the same result, so you can use either of them in your analysis. You can change the contamination argument or the threshold value to filter out more or fewer extreme cases.
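The equivalence of the two methods can be checked directly: with contamination=.02, the model's offset is the 2nd-percentile score of the training data, so predict() flags the same points as the manual quantile threshold. The sketch below fixes random_state so the comparison is deterministic; random_state is an assumption not used in the tutorial code above:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs

np.random.seed(3)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))

# Method 1: contamination argument sets the threshold internally.
iforest = IsolationForest(n_estimators=100, contamination=.02, random_state=0)
pred = iforest.fit_predict(x)

# Method 2: manual thresholding on the raw scores of the same fitted model.
scores = iforest.score_samples(x)
thresh = np.quantile(scores, .02)
manual = np.where(scores <= thresh, -1, 1)

print((pred == manual).all())
```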

   In this tutorial, we've learned how to detect anomalies with the Isolation Forest algorithm by using the IsolationForest class of the Scikit-learn API in Python. The full source code is listed below.


Source code listing

from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt

random.seed(3)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))

plt.scatter(x[:,0], x[:,1])
plt.show() 
 
iforest = IsolationForest(n_estimators=100, contamination=.02)
print(iforest)
 
pred = iforest.fit_predict(x) 
 
anom_index = where(pred==-1)
values = x[anom_index]

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show() 
 
iforest = IsolationForest(n_estimators=100)
print(iforest)

iforest.fit(x)
scores = iforest.score_samples(x)

thresh = quantile(scores, .02)
print(thresh)
 
index = where(scores <= thresh)
values = x[index]

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.show()


References:
  1. Isolation Forest
  2. Wikipedia: Isolation Forest
  3. Scikit-learn API
