Anomaly Detection Example with Gaussian Mixture in Python

A Gaussian mixture is a probabilistic model that represents the distribution of the data as a mixture of several Gaussian distributions. The model is widely used in clustering problems. In this tutorial, we'll learn how to detect anomalies in a dataset by using a Gaussian mixture model.
 
The Scikit-learn API provides the GaussianMixture class for this algorithm, and we'll apply it to an anomaly detection problem. The tutorial covers:
  1. Preparing the dataset
  2. Defining the model and anomaly detection
  3. Source code listing

We'll start by loading the required libraries for this tutorial.

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt

Besides the GaussianMixture class, we need the make_blobs function to create a sample dataset, a few numpy functions, and the matplotlib library to visualize the data.


Preparing the dataset

We'll create a random sample dataset for this tutorial by using the make_blobs() function.

random.seed(4)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))

We'll check the dataset by visualizing it in a plot.

plt.scatter(x[:,0], x[:,1])
plt.show()


Now, we need to detect anomalies in this dataset.


Defining the model and anomaly detection

We'll define the model by using the GaussianMixture class of the Scikit-learn API. Here, I'll create the model with its default parameters, which means a single component (n_components=1). You can set the arguments according to the content of your dataset.

gausMix = GaussianMixture().fit(x)
print(gausMix) 
 
GaussianMixture(covariance_type='full', init_params='kmeans', max_iter=100,
                means_init=None, n_components=1, n_init=1, precisions_init=None,
                random_state=None, reg_covar=1e-06, tol=0.001, verbose=0,
                verbose_interval=10, warm_start=False, weights_init=None)
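
As a rough sketch of how the arguments might be chosen for a dataset with more structure, the snippet below fits models with different n_components values and compares them with the bic() method of GaussianMixture (lower is better). The three-blob data, the candidate range 1 to 5, and the names x_demo, bics, and best_k are assumptions made for this illustration only; they are not part of the tutorial's dataset.

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# Illustrative data only (an assumption): three blobs instead of one.
x_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=.5, random_state=4)

# Fit one model per candidate component count and record its BIC.
bics = {}
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, covariance_type='full',
                         random_state=4).fit(x_demo)
    bics[k] = gm.bic(x_demo)

# The candidate with the lowest BIC is a reasonable starting choice.
best_k = min(bics, key=bics.get)
print(bics)
print(best_k)
```

BIC is only one possible criterion; for a single compact cluster like the one in this tutorial, the default of one component is usually sufficient.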

We'll compute the log-likelihood of each sample with the score_samples() method. Samples that fit the model poorly receive low scores.

scores = gausMix.score_samples(x)
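
To make the meaning of these scores more concrete, here is a small check, assuming a one-component model as in this tutorial: the values returned by score_samples() should agree, up to floating-point error, with the log-density of a single multivariate normal built from the fitted means_ and covariances_. The scipy call and the names x_demo, gm, scores_demo, and reference are introduced only for this illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Illustrative single-cluster data, similar in spirit to the tutorial's dataset.
x_demo, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, random_state=4)
gm = GaussianMixture(n_components=1, random_state=4).fit(x_demo)

# One log-likelihood value per sample.
scores_demo = gm.score_samples(x_demo)

# With a single component, the mixture reduces to one Gaussian, so the
# log-density can be recomputed directly from the fitted parameters.
reference = multivariate_normal(mean=gm.means_[0],
                                cov=gm.covariances_[0]).logpdf(x_demo)

print(scores_demo.shape)
print(np.allclose(scores_demo, reference))
```

With more than one component, the score is the log of the weighted sum of the component densities, so this direct comparison no longer applies.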

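Next, we'll obtain a threshold value from the scores by using the quantile() function. Here, we take the 3 percent quantile, so roughly the lowest-scoring 3 percent of the samples will be treated as anomalies.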

thresh = quantile(scores, .03)
print(thresh) 
-2.4998195352804533
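
The effect of the quantile level can be sketched on stand-in data. The snippet below uses random numbers in place of real log-likelihood scores (an assumption for illustration; scores_demo, thresh_demo, and n_flagged are hypothetical names) and counts how many values fall at or below the 3 percent quantile.

```python
import numpy as np

# Stand-in scores (an assumption): 200 random values instead of model output.
rng = np.random.default_rng(4)
scores_demo = rng.normal(size=200)

# The 3 percent quantile of the scores serves as the cutoff.
thresh_demo = np.quantile(scores_demo, .03)

# Count the values at or below the cutoff; this is about 3 percent of 200.
n_flagged = int(np.sum(scores_demo <= thresh_demo))
print(n_flagged, n_flagged / scores_demo.size)
```

Raising or lowering the quantile level changes the share of samples flagged, so it acts as the expected contamination rate of the dataset.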

Based on this threshold value, we'll find the samples whose scores are equal to or lower than the threshold.

index = where(scores <= thresh)
values = x[index]
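
The same threshold can, in principle, be reused to check observations that were not part of the training data. The following is a minimal sketch under stated assumptions: the training data is centered at the origin for readability, and the two points in x_new are hypothetical, one close to the cluster and one far from it.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Illustrative training data (an assumption): one cluster centered at (0, 0).
x_train, _ = make_blobs(n_samples=200, centers=[[0, 0]], cluster_std=.3,
                        random_state=4)
gm = GaussianMixture(random_state=4).fit(x_train)

# Threshold taken from the training scores, as in the tutorial.
thresh_demo = np.quantile(gm.score_samples(x_train), .03)

# Hypothetical new observations: one near the center, one far away.
x_new = np.array([[0.0, 0.1], [3.0, 3.0]])

# A new point is flagged when its score is at or below the threshold.
is_anomaly = gm.score_samples(x_new) <= thresh_demo
print(is_anomaly)
```

In this sketch the point near the center should pass and the distant point should be flagged, since its log-likelihood under the fitted Gaussian is far below the threshold.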

Finally, we'll visualize the results in a plot, highlighting the anomalies in red.

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()



In this tutorial, we've learned how to detect anomalies with a Gaussian mixture model by using Scikit-learn's GaussianMixture class in Python. The full source code is listed below.

We've covered several other anomaly detection methods with Python and R in previous tutorials. Please check this blog to learn more about them.


Source code listing

 
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt

random.seed(4)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))

plt.scatter(x[:,0], x[:,1])
plt.show()

gausMix = GaussianMixture().fit(x)
print(gausMix)

scores = gausMix.score_samples(x)

thresh = quantile(scores, .03)
print(thresh)
 
index = where(scores <= thresh)
values = x[index]

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.show()
 


References:
  1. Scikit-learn API
