Anomaly Detection with Isolation Forest in Python

    Anomalies, or outliers, are elements that deviate from the typical characteristics of the majority of the observed data. They can represent errors, extreme values, or otherwise unusual instances in a dataset. Several methods are available for detecting anomalies, and the Isolation Forest is one of them.

    In this tutorial, we'll learn how to detect anomalies in a dataset by using the Isolation Forest method in Python. The Scikit-learn API provides the IsolationForest class for this algorithm, and we'll use it throughout the tutorial. The tutorial covers:

  1. What is Isolation Forest?
  2. Preparing the dataset
  3. Defining the model and prediction 
  4. Anomaly detection with scores
  5. Source code listing

    If you want to learn about other anomaly detection methods, please check out my A Brief Explanation of 8 Anomaly Detection Methods with Python tutorial.

 


What is Isolation Forest?

    The Isolation Forest is an efficient anomaly detection technique. It constructs a set of binary trees called Isolation Trees by randomly selecting features and recursively splitting data points. Anomalies, being rare, are easier to isolate and have shorter paths in the trees, while normal data points require more splits, resulting in longer paths. By measuring the average path length, the algorithm identifies anomalies as data points with shorter paths, falling below a predefined threshold. Isolation Forest is highly effective for detecting outliers, especially in high-dimensional data, making it valuable for applications like fraud detection and network security.
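To make the isolation idea concrete, here is a toy sketch (not the scikit-learn implementation) that repeatedly applies random split values to 1-D data and counts how many splits are needed to isolate a given point. An extreme outlier is separated in fewer splits, on average, than a typical point:

```python
import numpy as np

def isolation_depth(data, point, rng):
    """Count how many random splits are needed to isolate `point` from 1-D data."""
    depth = 0
    current = data
    while len(current) > 1:
        lo, hi = current.min(), current.max()
        if lo == hi:
            break
        split = rng.uniform(lo, hi)
        # Follow the partition that still contains the point.
        current = current[current < split] if point < split else current[current >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, 100)      # typical points clustered around 0
data = np.append(normal, 10.0)      # plus one extreme outlier

# Average depth over many random "trees": the outlier isolates faster.
outlier_depth = np.mean([isolation_depth(data, 10.0, rng) for _ in range(200)])
typical_depth = np.mean([isolation_depth(data, normal[0], rng) for _ in range(200)])
print(outlier_depth, typical_depth)
```

Here the outlier's average depth comes out well below the typical point's, which is exactly the signal the algorithm turns into an anomaly score.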

 

Preparing the dataset

We'll start by loading the required packages for this tutorial.

from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt

Next, we'll create a random sample dataset for this tutorial by using the make_blobs() function.
 
random.seed(3)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))
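As a quick sanity check (a sketch, not part of the tutorial's original flow), make_blobs() returns the sample array together with cluster labels, which we discard here; the call above yields 200 two-dimensional points:

```python
from numpy import random
from sklearn.datasets import make_blobs

random.seed(3)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))
print(x.shape)  # (200, 2): 200 samples, 2 features
```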

We'll check the dataset by visualizing it in a plot.

plt.scatter(x[:,0], x[:,1])
plt.show()


Defining the model and prediction

We'll define the model by using the IsolationForest class of the Scikit-learn API, setting the number of estimators and the contamination value in the class arguments. The contamination parameter specifies the expected proportion of outliers in the data.

iforest = IsolationForest(n_estimators=100, contamination=.02)

We'll fit the model on the x dataset and get the predictions with the fit_predict() function. It returns 1 for normal points and -1 for anomalies.

pred = iforest.fit_predict(x)
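As a quick check (a sketch; the random_state=0 argument is my addition for reproducibility, not part of the tutorial's code), the predictions contain only the labels 1 and -1, and contamination=.02 flags about 2 percent of the 200 samples:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

np.random.seed(3)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))

iforest = IsolationForest(n_estimators=100, contamination=.02, random_state=0)
pred = iforest.fit_predict(x)

print(np.unique(pred))     # the two label values: -1 (anomaly) and 1 (normal)
print(np.sum(pred == -1))  # number of flagged points, ~2% of 200
```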
 
We'll extract the points labeled -1 as the outliers.

anom_index = where(pred==-1)
values = x[anom_index]

Finally, we'll visualize the results in a plot by highlighting the anomalies with a color.

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()



Anomaly detection with scores

In the second method, we'll define the model without setting the contamination argument. 

iforest = IsolationForest(n_estimators=100)
print(iforest)
IsolationForest(behaviour='deprecated', bootstrap=False, contamination='auto',
                max_features=1.0, max_samples='auto', n_estimators=100,
                n_jobs=None, random_state=None, verbose=0, warm_start=False)

Note that this printed output comes from an older scikit-learn release; recent versions display only non-default parameters and no longer accept the behaviour argument.
 
We'll fit the model on the x dataset, then extract the anomaly scores of the samples with the score_samples() function. The lower the score, the more anomalous the sample.

iforest.fit(x)
scores = iforest.score_samples(x)
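A brief sketch of how score_samples() relates to the model's decision (again with a random_state=0 added by me for reproducibility): the scores are always non-positive, and decision_function() is simply score_samples() shifted by the fitted offset_ attribute:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

np.random.seed(3)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))

iforest = IsolationForest(n_estimators=100, random_state=0)
iforest.fit(x)
scores = iforest.score_samples(x)

print(scores.max() <= 0)  # True: scores lie in [-1, 0), lower = more anomalous
# decision_function is score_samples shifted by the learned offset_:
print(np.allclose(iforest.decision_function(x), scores - iforest.offset_))  # True
```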

Next, we'll obtain a threshold value from the scores by using the quantile function. Here, we'll treat the samples with the lowest 2 percent of scores as the anomalies.

thresh = quantile(scores, .02)
print(thresh)
-0.6262114295204622

We'll extract the anomalies by comparing the scores against the threshold value and retrieving the data points at the matching indexes.

index = where(scores <= thresh)
values = x[index]
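To make the equivalence with the first method concrete (a sketch; the random_state=0 is my addition, not the tutorial's), cutting at the 2 percent quantile of the scores flags the lowest-scoring 2 percent of the 200 samples, just as contamination=.02 does:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

np.random.seed(3)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))

iforest = IsolationForest(n_estimators=100, random_state=0)
iforest.fit(x)
scores = iforest.score_samples(x)

thresh = np.quantile(scores, .02)
index = np.where(scores <= thresh)
print(len(index[0]))  # number of flagged points, ~2% of 200
```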

Finally, we can visualize the results in a plot by highlighting the anomalies with a color. 

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()



Both methods above give essentially the same result, since contamination=.02 and the 2 percent quantile threshold flag the same proportion of points. You can use either in your analysis, and you can change the threshold or the contamination argument to filter out more or fewer extreme cases.

    In this tutorial, we've learned how to detect anomalies with the Isolation Forest algorithm using the Scikit-learn IsolationForest class in Python. The full source code is listed below.


Source code listing

from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt

random.seed(3)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))

plt.scatter(x[:,0], x[:,1])
plt.show() 
 
iforest = IsolationForest(n_estimators=100, contamination=.02)
print(iforest)
 
pred = iforest.fit_predict(x) 
 
anom_index = where(pred==-1)
values = x[anom_index]

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show() 
 
iforest = IsolationForest(n_estimators=100)
print(iforest)

iforest.fit(x)
scores = iforest.score_samples(x)

thresh = quantile(scores, .02)
print(thresh)
 
index = where(scores <= thresh)
values = x[index]

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.show()


