Anomaly Detection Example with Gaussian Mixture in Python

   The Gaussian Mixture Model (GMM) is a powerful probabilistic model that represents the data as a mixture of Gaussian distributions, and it is widely used in clustering problems. 

    In this tutorial, we'll learn how to detect anomalies in a dataset by using the GaussianMixture class of the Scikit-learn API. The tutorial covers:

  1. Preparing the dataset
  2. Defining the model and anomaly detection
  3. Source code listing


    
If you want to learn about other anomaly detection methods, please check out my A Brief Explanation of 8 Anomaly Detection Methods with Python tutorial.  

We'll start by loading the required libraries for this tutorial.

 
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt


 

Preparing the dataset

We'll create a random sample dataset for this tutorial by using the make_blobs() function.

 
# Set a seed for reproducibility
random.seed(4)

# Generate a synthetic dataset with make_blobs
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))
 
# Visualize the dataset using a scatter plot
plt.scatter(x[:, 0], x[:, 1])
plt.show()
 

This is the target data in which we'll detect anomalies by using the Gaussian Mixture method.


Defining the model and anomaly detection

    In scikit-learn's GaussianMixture class, the score_samples method computes the log likelihood of each sample in the input data. The log likelihood represents how well the observed data fits the estimated Gaussian mixture model. 

    In the context of anomaly detection, we can set a threshold on these log likelihood scores. Samples with log likelihoods below a certain threshold are considered anomalies or outliers, as they are less likely to be generated by the learned Gaussian mixture model. 

    We'll define the model by using the GaussianMixture class of Scikit-learn. Here, we'll use the class with its default parameters. You can set some of the arguments according to your dataset content, and you can check all parameters used by the class with the get_params() method.

 
# Fit a Gaussian Mixture Model to the dataset
gausMix = GaussianMixture().fit(x)

# Print the model's parameter dictionary
print(gausMix.get_params())

{'covariance_type': 'full', 'init_params': 'kmeans', 'max_iter': 100, 'means_init': None, 
'n_components': 1, 'n_init': 1, 'precisions_init': None, 'random_state': None, 
'reg_covar': 1e-06, 'tol': 0.001, 'verbose': 0, 'verbose_interval': 10, 'warm_start': False, 
'weights_init': None}
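
If your data contains more than one cluster or needs a different covariance structure, you can pass these arguments explicitly. The values below are purely illustrative and not tuned for this dataset:

# Example of setting parameters explicitly (illustrative values, not tuned for this data)
gausMixCustom = GaussianMixture(n_components=2, covariance_type='diag', random_state=0)
gausMixCustom.fit(x)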

We'll get the weighted log probabilities for each sample with the score_samples() method.

 
# Compute the weighted log probabilities for each sample
scores = gausMix.score_samples(x)
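
Lower (more negative) scores indicate samples that the fitted model considers less likely. To get a feel for the scale of these scores, you can inspect their range (the exact values depend on the generated data):

# Inspect the range of the log likelihood scores (values depend on the random data)
print(scores.min(), scores.max())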

Next, we'll extract the threshold value from the scores by using the quantile() function.

 
# Extract the threshold for anomaly detection using quantile
thresh = quantile(scores, .03)
print(thresh)

-2.4998195352804533
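
Here the 0.03 quantile flags roughly 3% of the samples as anomalies. In general, the quantile should reflect the fraction of outliers you expect in your data; a quick sketch for comparing a few candidate values (the quantiles below are arbitrary examples):

# Compare how many samples each candidate quantile would flag (arbitrary example values)
for q in (0.01, 0.03, 0.05):
    t = quantile(scores, q)
    print(q, t, (scores <= t).sum())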
 

Based on the extracted threshold value, we'll identify samples with scores equal to or lower than the threshold. 

 
# Identify samples with scores equal to or lower than the threshold
index = where(scores <= thresh)
values = x[index]
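
Here where() returns the indices of the low-scoring samples. Equivalently, you could index with the boolean mask directly:

# Same result using a boolean mask instead of where()
values = x[scores <= thresh]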
 

Finally, we'll visualize the results by highlighting the anomalies in red.

 
# Visualize the dataset with anomalies highlighted in red
plt.scatter(x[:, 0], x[:, 1])
plt.scatter(values[:, 0], values[:, 1], color='r')
plt.show()
 

   In this tutorial, we've learned how to detect anomalies with the Gaussian mixture method by using Scikit-learn's GaussianMixture class in Python. We detected the anomalies in the data by using their log likelihood scores. The full source code is listed below.


Source code listing

 
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt

# Set a seed for reproducibility
random.seed(4)

# Generate a synthetic dataset with make_blobs
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))

# Visualize the dataset using a scatter plot
plt.scatter(x[:, 0], x[:, 1])
plt.show()

# Fit a Gaussian Mixture Model to the dataset
gausMix = GaussianMixture().fit(x)

# Print the model's parameter dictionary
print(gausMix.get_params())

# Compute the weighted log probabilities for each sample
scores = gausMix.score_samples(x)

# Extract the threshold for anomaly detection using quantile
thresh = quantile(scores, .03)
print(thresh)

# Identify samples with scores equal to or lower than the threshold
index = where(scores <= thresh)
values = x[index]

# Visualize the dataset with anomalies highlighted in red
plt.scatter(x[:, 0], x[:, 1])
plt.scatter(values[:, 0], values[:, 1], color='r')
plt.show()
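
As an extension not covered above, the fitted model and the learned threshold can also be reused to flag anomalies in new, unseen points. A minimal sketch, where the new observations are made-up values used purely for illustration:

from numpy import array

# Hypothetical new observations (made-up values, just for illustration)
x_new = array([[10.0, 12.0], [30.0, 30.0]])

# Score the new points with the fitted model and flag those below the learned threshold
new_scores = gausMix.score_samples(x_new)
print(new_scores <= thresh)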



References:
  1. Scikit-learn API

4 comments:

  1. A comment and a question.
    Comment: when I print(gausMix), I get "GaussianMixture()". Do you know why we see a difference? (I am following your code verbatim).

    Question: I think you chose your threshold based on how you designed your data blobs. How would you pick a threshold in general?

    Reply:
    1. This may help you. By default, newer versions of scikit-learn only print the parameters that were changed from their defaults, which is why print(gausMix) shows just "GaussianMixture()". You can change that behavior with:

      from sklearn import set_config
      set_config(print_changed_only=False)  # print(gausMix) will then show all parameters

      2. Yes, you need to set the threshold according to your data content.

  2. Hello, I would like to ask: is there a way to split the data into quartiles (Q1, Q2, and Q3) after fitting the GMM?
