The empirical rule, or 68-95-99.7 rule, describes the percentage of data that falls within 1, 2, and 3 standard deviations of the mean in a normal distribution.

In this post, we'll briefly learn about standard deviation and the empirical rule with R.

**Preparing the data**

First, we'll generate sample data for this tutorial. We can create it with the sample() function in R.

```
set.seed(1234)

# 100 integers from -50..50, then 80 of them overwritten with values from -20..20
x <- sample(-50:50, 100, replace = TRUE)
x[sample(1:100, 80)] <- sample(-20:20, 80, replace = TRUE)
```

We'll visualize the x data in a plot.

`plot(x, type = "l", col = "blue")`
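A line plot shows the values in index order. Since the empirical rule is about the shape of the distribution, a histogram can be a useful second view. The following is a minimal sketch; the histogram settings (20 breaks, colors, title) are arbitrary choices for illustration, and x is regenerated so that the block runs on its own.

```r
set.seed(1234)
x <- sample(-50:50, 100, replace = TRUE)
x[sample(1:100, 80)] <- sample(-20:20, 80, replace = TRUE)

# Store the sample mean and sd first: inside curve(), `x` refers to
# the curve's own plotting variable, not to the data vector
m <- mean(x)
s <- sd(x)

# Histogram on the density scale with a normal curve drawn over it
hist(x, breaks = 20, freq = FALSE, col = "lightblue", main = "Distribution of x")
curve(dnorm(x, mean = m, sd = s), add = TRUE, lwd = 2)
```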

**Standard deviation**

Before going to the standard deviation, we need to understand the mean of a given vector.

**Mean** *(μ) is the central value of the elements in a numerical set. We can get the mean of the x vector with the mean() function in R.*

```
mean(x)
[1] -0.93
```

**Standard deviation** *measures how far the elements of a set spread from its mean. It is denoted by the letter σ, std, or SD. To get the σ value, we'll use the sd() function in R.*

```
sd(x)
[1] 16.07951
```
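To make the definition concrete, here is a minimal sketch that computes the sample standard deviation by hand and compares it with sd(). The vector v is made-up illustration data, not the x vector from this post.

```r
# Made-up illustration data (not the x vector from this post)
v <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Sample standard deviation: square root of the summed squared
# deviations from the mean, divided by n - 1
manual_sd <- sqrt(sum((v - mean(v))^2) / (length(v) - 1))

manual_sd
all.equal(manual_sd, sd(v))  # TRUE: sd() uses the same n - 1 formula
```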

**Variance** *is the average squared deviation of the elements from the mean of a set, and it equals the square of the standard deviation. We can compute it with either of the commands below.*

```
var(x)
[1] 258.5506

sd(x)^2
[1] 258.5506
```
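One detail worth knowing is that var() in R divides by n - 1 (the sample variance) rather than by n. The sketch below shows both versions on a small made-up vector v, which is illustration data and not the x vector from this post.

```r
# Made-up illustration data (not the x vector from this post)
v <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(v)
sq_dev <- (v - mean(v))^2          # squared deviations from the mean

sample_var <- sum(sq_dev) / (n - 1)   # what var() returns
population_var <- sum(sq_dev) / n     # divides by n instead

all.equal(sample_var, var(v))     # TRUE
all.equal(sample_var, sd(v)^2)    # TRUE
population_var                    # 4 for this vector
```

For a vector of 100 elements, as in this post, the difference between the two versions is small.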

**Empirical or 68-95-99.7 rule**

For normally distributed data, about 68%, 95%, and 99.7% of the values lie within 1σ, 2σ, and 3σ of the mean, respectively. The 68-95-99.7 rule takes its name from those percentages. Here, 1σ denotes the range from μ-σ to μ+σ, 2σ the range from μ-2σ to μ+2σ, and 3σ the range from μ-3σ to μ+3σ.
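The three percentages can be derived from the normal cumulative distribution function. As a short sketch, pnorm() gives the probability that a standard normal variable falls below a given value, so the difference between pnorm(k) and pnorm(-k) is the probability of landing within k standard deviations of the mean.

```r
k <- 1:3

# Probability that a normal variable lies within k standard
# deviations of its mean, for k = 1, 2, 3
theoretical <- pnorm(k) - pnorm(-k)

round(100 * theoretical, 1)
# [1] 68.3 95.4 99.7
```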

We can check the x data and its sigma ranges by plotting a normal density curve.

```
s <- sd(x)
m <- mean(x)

# Normal density with the sample mean and standard deviation
index <- seq(min(x), max(x), length.out = 100)
dn <- dnorm(index, mean = m, sd = s)

plot(index, dn, type = "l", lwd = 2)
grid()

# Vertical lines at the mean and at 1, 2, and 3 sigmas around it
abline(v = m)
text(m - 2, .02, "μ", pos = 3)
abline(v = m + s, col = "green")
abline(v = m - s, col = "green")
text(m + s + 2, .02, "σ", pos = 3, col = "darkgreen")
text(m - s + 2, .02, "-σ", pos = 3, col = "darkgreen")
abline(v = m - 2 * s, col = "blue")
abline(v = m + 2 * s, col = "blue")
text(m + 2 * s + 3, .02, "2σ", pos = 3, col = "blue")
text(m - 2 * s + 3, .02, "-2σ", pos = 3, col = "blue")
abline(v = m + 3 * s, col = "red")
abline(v = m - 3 * s, col = "red")
text(m + 3 * s + 3, .02, "3σ", pos = 3, col = "red")
text(m - 3 * s + 3, .02, "-3σ", pos = 3, col = "red")
```

**Calculating the percentages**

Finally, we'll count the values in the 1σ [-σ, σ], 2σ [-2σ, 2σ], and 3σ [-3σ, 3σ] ranges. Since x has 100 elements, each count is also a percentage.

```
# The mean of x is close to 0, so the ranges here are taken around 0
sigma1 <- x >= -s & x <= s
sum(sigma1)
[1] 67
```

```
sigma2 <- x >= -2 * s & x <= 2 * s
sum(sigma2)
[1] 95
```

```
sigma3 <- x >= -3 * s & x <= 3 * s
sum(sigma3)
[1] 99
```
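The counts above take each range around 0, which works here because the mean of x is close to 0. Strictly, the ranges are centered on the mean. A minimal sketch of that version follows. The name within_k_sigma is a hypothetical helper introduced for illustration, and x is regenerated so that the block runs on its own. No output is shown, since these results were not part of the original run.

```r
# Hypothetical helper: share of values within k standard deviations of the mean
within_k_sigma <- function(v, k) {
  m <- mean(v)
  s <- sd(v)
  mean(v >= m - k * s & v <= m + k * s)
}

set.seed(1234)
x <- sample(-50:50, 100, replace = TRUE)
x[sample(1:100, 80)] <- sample(-20:20, 80, replace = TRUE)

# Percentages for k = 1, 2, 3
pct <- sapply(1:3, function(k) 100 * within_k_sigma(x, k))
pct
```

Taking the mean of a logical vector gives the proportion of TRUE values, so the helper returns a share between 0 and 1 for a vector of any length.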

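The results are close to the expected values. Note that x was built with sample() and is not drawn from a normal distribution, so the match is only approximate. For normally distributed data, the proportions come closer to 68-95-99.7 as the number of samples increases.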
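To see how the proportions behave for data that really is normal, we can repeat the check on a large sample from rnorm(). This is a sketch under assumed settings: the sample size of 100000 and the seed are arbitrary choices, and z is a new vector introduced for illustration.

```r
set.seed(1234)
n <- 100000
z <- rnorm(n)   # standard normal sample: mean 0, sd 1

# Percentage of values within k sample standard deviations of the sample mean
pct_normal <- sapply(1:3, function(k) {
  100 * mean(abs(z - mean(z)) <= k * sd(z))
})

round(pct_normal, 1)
```

With a sample this large, the three values should land very near 68.3, 95.4, and 99.7.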

In this post, we've briefly learned about standard deviation and the 68-95-99.7 rule in R. The full source code is listed below.

**Source code listing**

```
set.seed(1234)

x <- sample(-50:50, 100, replace = TRUE)
x[sample(1:100, 80)] <- sample(-20:20, 80, replace = TRUE)

plot(x, type = "l", col = "blue")

mean(x)
sd(x)
var(x)
sd(x)^2

s <- sd(x)
m <- mean(x)

# Normal density with the sample mean and standard deviation
index <- seq(min(x), max(x), length.out = 100)
dn <- dnorm(index, mean = m, sd = s)

plot(index, dn, type = "l", lwd = 2)
grid()

# Vertical lines at the mean and at 1, 2, and 3 sigmas around it
abline(v = m)
text(m - 2, .02, "μ", pos = 3)
abline(v = m + s, col = "green")
abline(v = m - s, col = "green")
text(m + s + 2, .02, "σ", pos = 3, col = "darkgreen")
text(m - s + 2, .02, "-σ", pos = 3, col = "darkgreen")
abline(v = m - 2 * s, col = "blue")
abline(v = m + 2 * s, col = "blue")
text(m + 2 * s + 3, .02, "2σ", pos = 3, col = "blue")
text(m - 2 * s + 3, .02, "-2σ", pos = 3, col = "blue")
abline(v = m + 3 * s, col = "red")
abline(v = m - 3 * s, col = "red")
text(m + 3 * s + 3, .02, "3σ", pos = 3, col = "red")
text(m - 3 * s + 3, .02, "-3σ", pos = 3, col = "red")

# The mean of x is close to 0, so the ranges here are taken around 0
sigma1 <- x >= -s & x <= s
sum(sigma1)
sigma2 <- x >= -2 * s & x <= 2 * s
sum(sigma2)
sigma3 <- x >= -3 * s & x <= 3 * s
sum(sigma3)
```
