We saw the anomaly detection technique with the Gaussian Mixture method in Python in the previous post. In this tutorial, we'll learn how to apply the same method in R.

In brief, a Gaussian mixture is a probabilistic model that represents population data as a mixture of multiple Gaussian distributions. The model is widely used in clustering problems. Here, we use the predicted membership probabilities as scores to find the outliers in a dataset. We'll use the Mclust() function of the mclust package in R.
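To make the idea of a membership probability concrete, here is a minimal base R sketch. The weights, means, and standard deviations are made-up values for illustration, not parameters fitted by mclust; the function computes the probability that a value came from a given component.

```r
# Hypothetical two-component mixture: a narrow "normal" component and a
# wide, low-weight component that absorbs extreme values.
weights <- c(0.95, 0.05)   # mixing proportions (sum to 1)
means   <- c(0, 0)
sds     <- c(1, 6)

# Posterior probability that each value in `v` came from component `k`.
posterior <- function(v, k) {
  # Weighted density of every component at every value (rows = values).
  dens <- sapply(seq_along(weights), function(j) {
    weights[j] * dnorm(v, mean = means[j], sd = sds[j])
  })
  dens <- matrix(dens, nrow = length(v))
  dens[, k] / rowSums(dens)
}

v <- c(0.2, 1.5, 8)
print(round(posterior(v, 2), 3))
```

Values near the center get a low probability for the wide component, while extreme values get a probability close to 1. This is the kind of score the tutorial thresholds later.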

The tutorial covers:

- Preparing the data
- Defining the model and anomaly detection
- Source code listing

We'll start by loading the required library.

`library(mclust)`

**Preparing the data**

We'll create a random sample dataset for this tutorial, replace 10 of its values with random integers between -20 and 20 to act as anomalies, and plot it.

```
set.seed(124)

n = 500
x = runif(n)*10
x[sample(1:n, 10)] <- sample(-20:20, 10)

plot(x, col="blue", type='l', pch=19)
```

We'll try to find out the outliers in this dataset.

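Next, we'll standardize the data with scale(), which centers it to zero mean and unit standard deviation. The `[,1]` index converts the one-column matrix it returns back to a plain vector.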

`x = scale(x)[,1]`
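As a quick check of what scale() does here, the following sketch compares it with the manual formula on a small made-up vector (the name `v` and the seed are only for illustration).

```r
set.seed(1)
v <- runif(20) * 10

# Standardize by hand: subtract the mean, divide by the sample standard deviation.
manual <- (v - mean(v)) / sd(v)

# scale() returns a one-column matrix; [, 1] extracts it as a vector.
scaled <- scale(v)[, 1]

print(all.equal(manual, scaled))
```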

**Defining the model and anomaly detection**

We'll define the model with the Mclust() function of the mclust package. Here, I'll set the number of components G to 3 and use the "V" (univariate, unequal variance) model type. We'll fit the model on the x data and print its summary.

```
xfit = Mclust(x, G=3, modelNames="V")
summary(xfit)
```

```
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust V (univariate, unequal variance) model with 3 components:
log-likelihood n df BIC ICL
-610.9331 500 8 -1271.583 -1344.313
Clustering table:
1 2 3
6 331 163
```
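To see what the fitted components look like, the estimated parameters can be read from the model object. The sketch below refits the same model so that it runs on its own. It assumes the mclust package is installed, and the `params` data frame is just an illustrative way to collect the values. The exact numbers can differ between R versions, so no output is shown.

```r
library(mclust)

# Recreate the tutorial's data so the block is self-contained.
set.seed(124)
n <- 500
x <- runif(n) * 10
x[sample(1:n, 10)] <- sample(-20:20, 10)
x <- scale(x)[, 1]

xfit <- Mclust(x, G = 3, modelNames = "V", verbose = FALSE)

# Mixing proportions, means, and standard deviations of the 3 components.
params <- data.frame(
  proportion = xfit$parameters$pro,
  mean       = xfit$parameters$mean,
  sd         = sqrt(xfit$parameters$variance$sigmasq)
)
print(params)
```

A component with a small proportion and a large standard deviation is typically the one that absorbs the extreme values.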

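Next, we'll call predict() on the fitted model. Without new data, it returns the cluster assignment and the membership probabilities for the x data used in fitting.

```
pred = predict(xfit)
str(pred)
```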

```
List of 2
$ classification: int [1:500] 2 2 2 2 2 2 2 2 3 2 ...
$ z : num [1:500, 1:3] 0.00804 0.00324 0.00424 0.0032 0.00394 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:3] "1" "2" "3"
```

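We'll use the first column of the z matrix, which holds each sample's probability of belonging to component 1. In the clustering table above, component 1 contains only 6 samples, so we treat it as the outlier component.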

`xpred = pred$z[,1]`
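Component labels depend on the fit, so hard-coding column 1 may not carry over to other data. The sketch below is one possible alternative, not part of the original method: it picks the least populated component instead, under the assumption that the outlier component is the smallest one. The names `smallest` and `xpred_alt` are hypothetical.

```r
library(mclust)

# Recreate the tutorial's data and model so the block is self-contained.
set.seed(124)
n <- 500
x <- runif(n) * 10
x[sample(1:n, 10)] <- sample(-20:20, 10)
x <- scale(x)[, 1]

xfit <- Mclust(x, G = 3, modelNames = "V", verbose = FALSE)
pred <- predict(xfit)

# Each row of z holds one sample's membership probabilities, so it sums to 1.
row_totals <- rowSums(pred$z)
print(range(row_totals))

# Find the least populated component rather than assuming it is column 1.
sizes <- table(factor(pred$classification, levels = 1:3))
smallest <- as.integer(names(sizes)[which.min(sizes)])
xpred_alt <- pred$z[, smallest]   # hypothetical name for the resulting scores
```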

Next, we'll derive a threshold from the probability scores with the quantile() function. Here, 0.99 selects the 99th percentile, so roughly the top 1% of the scores will be flagged.

```
thr = quantile(xpred, .99)
print(thr)
```

```
99%
0.5860772
```
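The effect of a quantile threshold can be checked on its own. The sketch below uses uniformly random scores as a hypothetical stand-in for xpred and counts how many of 500 values reach the 99th percentile.

```r
set.seed(42)
scores <- runif(500)              # hypothetical stand-in for xpred

thr <- quantile(scores, 0.99)     # 99th percentile of the scores
flagged <- which(scores >= thr)   # positions at or above the threshold

print(length(flagged))
```

Note that a quantile threshold flags about 1% of the samples regardless of how many true anomalies the data contains, so the cutoff is worth adjusting to the expected anomaly rate.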

Using the threshold value, we'll find the samples whose scores are equal to or higher than the threshold. The which() function returns the positions of those samples in x, and x[outliers] retrieves their values.

```
outliers = which(xpred >= thr)
index = x[outliers]
```

Finally, we'll visualize the results in a plot by highlighting the anomalies with a color.

```
plot(x, col="blue", type='l', pch=19)
points(outliers,index, pch=19, col="red")
```

In this tutorial, we've learned how to detect anomalies with the Gaussian mixture method by using the Mclust() function of the mclust package in R. The full source code is listed below.

**Source code listing**

```
library(mclust)

set.seed(124)
n = 500
x = runif(n)*10
x[sample(1:n, 10)] <- sample(-20:20, 10)
plot(x, col="blue", type='l', pch=19)

x = scale(x)[,1]

xfit = Mclust(x, G=3, modelNames="V")
summary(xfit)

pred = predict(xfit)
str(pred)

xpred = pred$z[,1]
thr = quantile(xpred, .99)
print(thr)

outliers = which(xpred >= thr)
index = x[outliers]

plot(x, col="blue", type='l', pch=19)
points(outliers,index, pch=19, col="red")
```
