Understanding standard deviation and the 68-95-99.7 rule with R

   Standard deviation is one of the fundamental concepts in statistics and numerical data analysis. In this post, we'll briefly look at the standard deviation and the 68-95-99.7 rule with R.
   First, we'll generate a sample vector for this tutorial with the sample() command in R. The code below draws 100 integers between -50 and 50, then overwrites 80 randomly chosen positions with integers between -20 and 20, so most values cluster around zero.

> set.seed(1234)  # reproduces the same result
> x <- sample(-50:50, 100, replace = TRUE)
> x[sample(1:100, 80)] <- sample(-20:20, 80, replace = TRUE)

To check the content of the x vector, we'll visualize it in a plot.

> plot(x, type = "l", col = "blue")



Before moving on to the standard deviation, we need to understand the mean of the given vector.

The mean (μ) is the central value of a numerical set: the sum of its elements divided by their count. We can get the mean of the x vector with the mean() command in R.

> mean(x)
[1] -0.93

Standard deviation measures how much the elements of a set vary (differ) from the set's mean. It is commonly denoted by the letter σ, or abbreviated as std or SD. To get the σ value, we'll use the sd() command in R, which computes the sample standard deviation (with an n - 1 denominator).

> sd(x)
[1] 16.07951

Variance is the average squared deviation from the mean of a set (R's var() divides by n - 1, giving the sample variance), and it equals the square of the standard deviation. It can be computed with the commands below.

> var(x)
[1] 258.5506
> sd(x)^2
[1] 258.5506
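
To see what var() and sd() compute, the calculation can be written out by hand. The sketch below uses a small made-up vector v (not the x vector of this post) so the arithmetic is easy to follow. It assumes the sample formulas with an n - 1 denominator, which is what R's var() and sd() use.

```r
# Small illustrative vector (hypothetical data, not the x vector above)
v <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(v)
mu <- sum(v) / n                     # mean: 40 / 8 = 5
sq_dev <- (v - mu)^2                 # squared deviations from the mean
manual_var <- sum(sq_dev) / (n - 1)  # sample variance: 32 / 7
manual_sd <- sqrt(manual_var)        # standard deviation is its square root

# Both comparisons should be TRUE
all.equal(manual_var, var(v))
all.equal(manual_sd, sd(v))
```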

68-95-99.7 rule

   For normally distributed data, about 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively. The 68-95-99.7 rule takes its name from those percentages. The corresponding intervals, [μ-σ, μ+σ], [μ-2σ, μ+2σ], and [μ-3σ, μ+3σ], are 2σ, 4σ, and 6σ wide. For example, the 2σ range runs from -σ to σ around the mean, and about 68% of the data fall within it.
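
The three percentages come from the normal distribution itself and can be reproduced with pnorm(), the normal cumulative distribution function in R. A minimal sketch for a standard normal distribution (mean 0, standard deviation 1); the variable names are only illustrative.

```r
k <- 1:3                          # number of standard deviations
coverage <- pnorm(k) - pnorm(-k)  # probability of falling within +/- k sigma
round(100 * coverage, 1)
# [1] 68.3 95.4 99.7
```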

We can check the x data and its sigma ranges by plotting a normal distribution curve with the same mean and standard deviation.

> s <- sd(x)
> m <- mean(x)
> index <- seq(min(x), max(x), length.out = 100)
> dn <- dnorm(index, mean = m, sd = s)
> plot(index, dn, type = "l", lwd = 2)
> grid()
> abline(v = m)  # vertical line at the mean
> text(m - 2, .02, "μ", pos = 3)
> abline(v = m + c(-1, 1) * s, col = "green")  # vertical lines at μ ± σ
> text(m + s + 2, .02, "σ", pos = 3, col = "darkgreen")
> text(m - s + 2, .02, "-σ", pos = 3, col = "darkgreen")
> abline(v = m + c(-2, 2) * s, col = "blue")   # vertical lines at μ ± 2σ
> text(m + 2 * s + 3, .02, "2σ", pos = 3, col = "blue")
> text(m - 2 * s + 3, .02, "-2σ", pos = 3, col = "blue")
> abline(v = m + c(-3, 3) * s, col = "red")    # vertical lines at μ ± 3σ
> text(m + 3 * s + 3, .02, "3σ", pos = 3, col = "red")
> text(m - 3 * s + 3, .02, "-3σ", pos = 3, col = "red")




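   Finally, we'll count the values in the 2σ [-σ:σ], 4σ [-2σ:2σ], and 6σ [-3σ:3σ] ranges. Since x has 100 elements, each count is also a percentage. The mean here (-0.93) is close to zero, so the ranges are taken around zero as an approximation.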

> sigma1 <- x >= -s & x <= s
> sum(sigma1)  # number of TRUE values
[1] 67
> sigma2 <- x >= -2 * s & x <= 2 * s
> sum(sigma2)
[1] 95
> sigma3 <- x >= -3 * s & x <= 3 * s
> sum(sigma3)
[1] 99



The results are close to the expected values. The match is only approximate because x is not drawn from a normal distribution; with a larger sample drawn from a normal distribution, the percentages get closer to 68-95-99.7.
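
As a rough check of that claim, the same counting can be repeated on a large sample generated with rnorm(). This is a sketch under assumptions: the helper name within_k_sd, the sample size, the standard deviation of 16, and the seed are arbitrary choices for illustration, and the ranges are centered on the sample mean.

```r
# Hypothetical helper: percentage of values within k standard deviations of the mean
within_k_sd <- function(v, k) {
  100 * mean(abs(v - mean(v)) <= k * sd(v))
}

set.seed(1234)
z <- rnorm(100000, mean = 0, sd = 16)  # large, normally distributed sample

pct <- sapply(1:3, function(k) within_k_sd(z, k))
round(pct, 1)  # should be close to 68.3, 95.4, 99.7
```

Taking mean() of a logical vector gives the fraction of TRUE values, so the helper works for any sample size, not only for vectors of length 100.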

