Anomaly Detection Example with Elliptical Envelope in Python

The Elliptical Envelope method is a statistical technique for detecting outliers, or anomalies, in a dataset. It's particularly useful when you have multivariate data (data with multiple features or dimensions) and want to identify observations that deviate significantly from the rest.

The method assumes that the regular data follows a Gaussian distribution. It fits a robust covariance estimate to the data and treats the points that lie far outside the fitted ellipse as outliers.
Scikit-learn provides the EllipticEnvelope class to apply this method for anomaly detection. In this tutorial, we'll learn how to detect anomalies by using the Elliptical Envelope method in Python. The tutorial covers:

  1. Preparing the data
  2. Defining the model and prediction
  3. Anomaly detection with scores
  4. Source code listing

If you want to learn about other anomaly detection methods, please check out my tutorial A Brief Explanation of 8 Anomaly Detection Methods with Python.

 

 

We'll start by loading the required libraries for this tutorial.

from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt


Preparing data

We'll create a random sample dataset for this tutorial by using the make_blobs() function. Seeding NumPy's random generator first makes the sample reproducible.

random.seed(2)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5)) 

We'll check the dataset by visualizing it in a plot.

plt.scatter(x[:,0], x[:,1])
plt.show()
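As an alternative to seeding NumPy's global generator, make_blobs() also accepts a random_state argument. The lines below are a minimal sketch of that option, not the code used in the rest of this tutorial. The names x_a and x_b are illustrative, the seed value is arbitrary, and writing center_box as (low, high) = (5, 20) is an assumption about the intended range.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Two calls with the same random_state return identical samples,
# regardless of the state of NumPy's global random generator.
x_a, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3,
                    center_box=(5, 20), random_state=2)
x_b, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3,
                    center_box=(5, 20), random_state=2)

print(x_a.shape)                 # number of samples and features
print(np.array_equal(x_a, x_b))  # whether both samples are identical
```

Passing the seed directly keeps the data generation independent of any other code that touches the global generator.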



Defining the model and prediction

We'll define the model with Scikit-learn's EllipticEnvelope class and set the contamination argument in the constructor. The contamination argument defines the expected proportion of outliers in the dataset.

elenv = EllipticEnvelope(contamination=.02)
print(elenv)
EllipticEnvelope(assume_centered=False, contamination=0.02, random_state=None,
                 store_precision=True, support_fraction=None) 

We'll fit the model on the x data and get the predictions with the fit_predict() method, which returns 1 for inliers and -1 for outliers.

pred = elenv.fit_predict(x)

Next, we'll extract the negative outputs as the outliers.

anom_index = where(pred==-1)
values = x[anom_index]

Finally, we'll visualize the results in a plot by highlighting the anomalies with a color.

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.show()
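To see how the contamination value changes the number of flagged points, we can fit the model with several values and count the -1 labels. This is a small sketch under stated assumptions: the data and counts names, the list of contamination values, and the fixed random_state are illustrative choices and not part of the example above.

```python
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

# Illustrative data: a single Gaussian blob with 200 samples.
data, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3, random_state=2)

counts = {}
for c in (0.01, 0.02, 0.05, 0.1):
    # random_state is fixed so every fit uses the same covariance estimate.
    model = EllipticEnvelope(contamination=c, random_state=0)
    labels = model.fit_predict(data)
    counts[c] = int((labels == -1).sum())
    print(c, counts[c])
```

A larger contamination value moves the decision boundary inward, so more samples are labeled as outliers.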
 

Anomaly detection with scores

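We can also find anomalies from the sample scores. In this approach, we'll define the model without setting the contamination argument, so the model applies the default value of 0.1. The contamination value doesn't affect the scores, and we'll derive our own threshold from them.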
 
elenv = EllipticEnvelope()
print(elenv)
EllipticEnvelope(assume_centered=False, contamination=0.1, random_state=None,
                 store_precision=True, support_fraction=None) 

We'll fit the model on the x data, then extract the score of each sample with the score_samples() method. Lower scores indicate more abnormal samples.

elenv.fit(x)
scores = elenv.score_samples(x) 
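It helps to know what these scores are. In Scikit-learn, score_samples() returns the negative of the squared Mahalanobis distance that the fitted model's mahalanobis() method computes, and decision_function() shifts the scores by the fitted offset_ attribute. The sketch below checks both relations on illustrative data. The data and model names and the fixed random_state are assumptions made for this example only.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

# Illustrative data: a single Gaussian blob with 200 samples.
data, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3, random_state=2)
model = EllipticEnvelope(random_state=0).fit(data)

sample_scores = model.score_samples(data)

# Scores are the negated squared Mahalanobis distances to the fitted center.
same_as_distance = np.allclose(sample_scores, -model.mahalanobis(data))

# The decision function is the score shifted by the contamination-based offset.
same_as_decision = np.allclose(model.decision_function(data),
                               sample_scores - model.offset_)

print(same_as_distance, same_as_decision)
```

Because the distances are non-negative, the scores are never positive, and the most distant samples have the lowest scores.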

Next, we'll obtain the threshold value from the scores by using the quantile function. Here, we'll treat the samples whose scores fall in the lowest 2 percent as anomalies.

thresh = quantile(scores, .02)
print(thresh)
-9.469243838613968 

Next, we'll find the indices of the samples whose scores are at or below the threshold and extract those samples from x.

index = where(scores <= thresh)
values = x[index] 

Finally, we can visualize the results in a plot by highlighting the anomalies with a color.
 
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.show()

Both approaches above flag the same points, so you can use either of them in your analysis. Lowering the contamination value or the quantile flags only the most extreme cases, while raising it flags more points.
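The agreement between the two approaches can be checked directly by comparing the flagged indices. This is a minimal sketch on illustrative data. The data, idx_pred and idx_score names are hypothetical, and random_state is fixed (an assumption that the code above doesn't make) so that both models fit the same covariance estimate.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

# Illustrative data: a single Gaussian blob with 200 samples.
data, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3, random_state=2)

# Approach 1: let the model label outliers through the contamination value.
labels = EllipticEnvelope(contamination=0.02, random_state=0).fit_predict(data)
idx_pred = np.where(labels == -1)[0]

# Approach 2: threshold the scores at their 2 percent quantile.
scores = EllipticEnvelope(random_state=0).fit(data).score_samples(data)
idx_score = np.where(scores <= np.quantile(scores, 0.02))[0]

print(idx_pred)
print(idx_score)
print(np.array_equal(idx_pred, idx_score))
```

Without a fixed random_state, the two fits can differ slightly, so a borderline point may occasionally be flagged by one approach and not by the other.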

In this tutorial, we've learned how to detect anomalies with the Elliptical Envelope method by using Scikit-learn's EllipticEnvelope class in Python. The full source code is listed below.


Source code listing

from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt

random.seed(2)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))

plt.scatter(x[:,0], x[:,1])
plt.show() 
 
elenv = EllipticEnvelope(contamination=.02)
print(elenv)

pred = elenv.fit_predict(x)
anom_index=where(pred==-1)
values = x[anom_index]

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.show()
 
 
elenv = EllipticEnvelope()
print(elenv)

elenv.fit(x)
scores = elenv.score_samples(x)

thresh = quantile(scores, .02)
print(thresh) 
 
index = where(scores <= thresh)
values = x[index]

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.show() 


References:

  1. Scikit-learn API

2 comments:

  1. Hi,

    Within the sklearn.covariance package, there are many methods and algorithms (Empirical covariance, Shrunk Covariance, OAS, GraphicalLasso, etc.). Why did you specifically use Elliptical Envelope for this example vs. any other algorithm?

    It would be great if you can also share some examples highlighting the different scenarios in which we should use the different sklearn.covariance algorithms.

    Replies
    1. Thanks for your suggestion, I'll think about it. The purpose of this post is to show an example of anomaly detection with the Elliptical Envelope method.
