Anomaly Detection Example with Gaussian Mixture in R

    A Gaussian Mixture is a probabilistic model that represents population data as a mixture of several Gaussian distributions. The model is widely used in clustering problems.
    In this tutorial, you'll briefly learn how to detect outliers in a dataset by using the Gaussian Mixture method in R. We'll use the Mclust() function of the mclust package.
    The tutorial covers:
  1. Preparing the data
  2. Defining the model and anomaly detection
  3. Video tutorial
  4. Source code listing
    We'll start by loading the required library.

 
library(mclust) 
 


Preparing the data

   We'll create a random sample dataset for this tutorial, replace 10 randomly chosen values with integers drawn from -20 to 20, and plot the result to inspect it.

 
set.seed(124)

n = 500
x = runif(n)*10
x[sample(1:n, 10)] <- sample(-20:20, 10)
plot(x, col="blue", type='l', pch=19)
 
 


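We'll try to find the outliers in this dataset.

First, we scale the data to zero mean and unit standard deviation. scale() returns a one-column matrix, so we take its first column to get a plain vector.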

 
x = scale(x)[,1]
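
As a quick check of what scale() does here, the following sketch uses a small made-up vector v (an assumption for illustration, not the tutorial data) and compares the result with centering and dividing by the sample standard deviation by hand.

```r
# Toy vector for illustration only (not the tutorial data)
v <- c(2, 4, 4, 5, 7, 30)

# scale() returns a one-column matrix; [, 1] turns it back into a plain vector
scaled <- scale(v)[, 1]

# The same transformation by hand: subtract the mean, divide by the sample sd
manual <- (v - mean(v)) / sd(v)

# TRUE when both versions agree
print(all.equal(scaled, manual))
```

Scaling does not change which points are extreme; it only puts the values on a common scale before fitting.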
 


Defining the model and anomaly detection

   We'll define the model by using the Mclust() function of the mclust package. Here, I'll set the number of components G to 3 and use the V model type (univariate, unequal variance). We'll fit the model on the x data and print its summary.

 
xfit = Mclust(x, G=3, modelNames="V")

summary(xfit)
 
---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust V (univariate, unequal variance) model with 3 components: 

 log-likelihood   n df       BIC       ICL
      -610.9331 500  8 -1271.583 -1344.313

Clustering table:
  1   2   3 
  6 331 163 

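Next, we'll call predict() on the xfit model. For each sample it returns the assigned component (classification) and the probabilities of belonging to each component (z).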

 
pred = predict(xfit)
str(pred)
 
List of 2
 $ classification: int [1:500] 2 2 2 2 2 2 2 2 3 2 ...
 $ z             : num [1:500, 1:3] 0.00804 0.00324 0.00424 0.0032 0.00394 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:3] "1" "2" "3"

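We'll use the first column of z, which holds each sample's probability of belonging to component 1. In the clustering table above, component 1 contains only 6 samples, so in this fit it is the small group that stands apart from the bulk of the data.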

 
xpred = pred$z[,1]
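
To make the meaning of these probabilities concrete, here is a minimal base-R sketch of how they arise from Bayes' rule in a Gaussian mixture. The two-component parameters (pro, mu, sigma) and the points v are made up for illustration; they are not the values fitted above.

```r
# Hypothetical two-component mixture (made-up parameters, not fitted values):
# a rare, wide component 1 and a common, narrow component 2
pro   <- c(0.05, 0.95)  # mixing proportions
mu    <- c(0, 0)        # component means
sigma <- c(3, 1)        # component standard deviations

# A few made-up points: two far from the centre, three close to it
v <- c(-6, -0.5, 0, 0.8, 5)

# Weighted density of every point under every component (one column each)
weighted <- sapply(seq_along(pro), function(k) pro[k] * dnorm(v, mu[k], sigma[k]))

# Posterior membership probabilities: normalise each row to sum to 1
z <- weighted / rowSums(weighted)

# Column 1 is close to 1 for the far-away points and small for the central ones
print(round(z, 3))
```

In this sketch, a high value in column 1 marks a point that the wide, rare component explains better than the main one, which is the same idea the tutorial relies on when it uses the first column of z as a score.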
 

Next, we'll derive a threshold from the probability scores by using the quantile() function. Here, 0.99 requests the 99th percentile, the value below which 99% of the scores fall, so roughly the top 1% of the samples will be flagged.

 
thr = quantile(xpred, .99)
print(thr)
 
     99% 
0.5860772 
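
The effect of a quantile threshold can be checked on its own. The sketch below uses made-up uniform scores (an assumption for illustration, not the model's probabilities) to show that a 0.99 quantile cutoff flags about 1% of the samples.

```r
set.seed(1)             # arbitrary seed for this illustration
scores <- runif(1000)   # made-up scores, not the model's probabilities

# 99th percentile of the scores
thr <- quantile(scores, 0.99)

# Positions of the scores at or above the threshold
flagged <- which(scores >= thr)

print(length(flagged))      # number of flagged samples
print(mean(scores >= thr))  # share of flagged samples
```

Because the cutoff is a fixed percentile, the number of flagged points depends on the sample size rather than on how many anomalies the data really contains, so the 0.99 level is a choice worth tuning.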

By using the threshold value, we'll find the samples whose scores are equal to or higher than the threshold. Here, outliers holds the positions of those samples and index holds their values in x.

 
outliers = which(xpred >= thr)
index = x[outliers]
 

Finally, we'll visualize the results in a plot by highlighting the anomalies with a color.

 
plot(x, col="blue", type='l', pch=19)
points(outliers,index, pch=19, col="red")
 


   In this tutorial, we've learned how to detect anomalies with the Gaussian Mixture method by using the Mclust() function of the mclust package in R. The full source code is listed below.


Video Tutorial



Source code listing

 
library(mclust)
 
set.seed(124)
n = 500
x = runif(n)*10
x[sample(1:n, 10)] <- sample(-20:20, 10)
plot(x, col="blue", type='l', pch=19)
 
x = scale(x)[,1]
xfit = Mclust(x, G=3, modelNames="V")

summary(xfit)
pred = predict(xfit)
str(pred)

xpred = pred$z[,1]
thr = quantile(xpred, .99)
print(thr)

outliers = which(xpred >= thr)
index = x[outliers]

plot(x, col="blue", type='l', pch=19)
points(outliers,index, pch=19, col="red")
 

