Understanding standard deviation and the 68-95-99.7 rule with R


   Standard deviation is one of the fundamental concepts in statistics and numerical data analysis. This post briefly explains standard deviation and the 68-95-99.7 rule using R.
   First, we'll generate a data vector x with the sample() function in R.

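> set.seed(1234)  # makes the random results reproducible across runs
> x <- sample(-50:50, 100, replace = TRUE)
> # overwrite 80 randomly chosen positions with values from -20:20,
> # so that most of the data clusters around 0
> x[sample(1:100, 80)] <- sample(-20:20, 80, replace = TRUE)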

To inspect the contents of x, we'll plot it.

> plot(x, type = "l", col = "blue")



Before turning to standard deviation, we need the mean.
 The mean (μ) is the central value of a numerical set: the sum of its elements divided by their count.
To get the mean of x, we can use the mean() function.

> mean(x)
[1] -0.93
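
As a small illustration of what mean() computes, the sketch below works out the mean by hand. The vector v is a made-up example, not the x data from this post.

```r
# Hypothetical small vector, used only for illustration
v <- c(2, 4, 4, 4, 5, 5, 7, 9)

# The mean is the sum of the elements divided by their count
manual_mean <- sum(v) / length(v)

manual_mean  # 5
mean(v)      # same value from the built-in function
```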

Standard deviation measures how much the elements of a set vary from its mean. It is commonly written as σ, std, or SD. To get it, we'll use the sd() function in R, which returns the sample standard deviation (with an n - 1 denominator).

> sd(x)
[1] 16.07951

Variance is the average of the squared deviations from the mean of a set; it is the square of the standard deviation.
It can be computed with either of the commands below.

> var(x)
[1] 258.5506
> sd(x)^2
[1] 258.5506
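
To make the definitions concrete, here is a minimal sketch that computes the variance and standard deviation step by step and compares them with var() and sd(). The vector v and the variable names are made up for illustration. Note that var() and sd() divide by n - 1 (sample statistics); dividing by n gives the population variance instead.

```r
# Hypothetical small vector, used only for illustration
v <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(v)

# Deviation of each element from the mean
dev <- v - mean(v)

# Sample variance and standard deviation (n - 1 denominator, as in var() and sd())
sample_var <- sum(dev^2) / (n - 1)
sample_sd <- sqrt(sample_var)

# Population variance (n denominator), shown for comparison
pop_var <- sum(dev^2) / n

all.equal(sample_var, var(v))  # TRUE
all.equal(sample_sd, sd(v))    # TRUE
pop_var                        # 4
```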

68-95-99.7 rule

   For normally distributed data, about 68%, 95%, and 99.7% of the values lie within ranges 2σ, 4σ, and 6σ wide centered on the mean, respectively. Hence the name 68-95-99.7 rule.
Here, the 2σ range runs from μ - σ to μ + σ (roughly -σ to σ for our data, since the mean is close to 0), and about 68% of the data falls within it.
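
The percentages in the rule come from the normal distribution itself. As a minimal sketch, they can be recovered with pnorm(), the normal cumulative distribution function in base R; the variable names k and coverage are just illustrative.

```r
# Number of standard deviations on each side of the mean
k <- 1:3

# Probability that a normal value falls within k standard deviations of its mean
coverage <- pnorm(k) - pnorm(-k)

round(100 * coverage, 1)  # 68.3 95.4 99.7
```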
Next, we'll plot a normal density curve with the mean and standard deviation of x and mark the sigma ranges on it.

> s <- sd(x)
> m <- mean(x)
> index <- seq(min(x), max(x), length.out = 100)
> dn <- dnorm(index, mean = m, sd = s)  # normal density with the mean and sd of x
> plot(index, dn, type = "l", lwd = 2)
> grid()
> abline(v = m)  # vertical line at the mean
> text(m - 2, .02, "μ", pos = 3)
> # vertical lines one standard deviation on each side of the mean
> abline(v = m + s, col = "green")
> abline(v = m - s, col = "green")
> text(m + s + 2, .02, "σ", pos = 3, col = "darkgreen")
> text(m - s + 2, .02, "-σ", pos = 3, col = "darkgreen")
> # two standard deviations
> abline(v = m + 2 * s, col = "blue")
> abline(v = m - 2 * s, col = "blue")
> text(m + 2 * s + 3, .02, "2σ", pos = 3, col = "blue")
> text(m - 2 * s + 3, .02, "-2σ", pos = 3, col = "blue")
> # three standard deviations
> abline(v = m + 3 * s, col = "red")
> abline(v = m - 3 * s, col = "red")
> text(m + 3 * s + 3, .02, "3σ", pos = 3, col = "red")
> text(m - 3 * s + 3, .02, "-3σ", pos = 3, col = "red")




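   Finally, we'll count the values that fall in the 2σ [-σ:σ], 4σ [-2σ:2σ], and 6σ [-3σ:3σ] ranges. The ranges are taken around 0, which is close to the mean of x (-0.93). Since x has 100 elements, each count is also a percentage.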

> sigma1 <- x >= -s & x <= s  # TRUE for values within one sd of 0
> sum(sigma1)                 # number of TRUE values
[1] 67
> sigma2 <- x >= -2 * s & x <= 2 * s
> sum(sigma2)
[1] 95
> sigma3 <- x >= -3 * s & x <= 3 * s
> sum(sigma3)
[1] 99
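
The counts above use ranges centered on 0. A more general check centers the ranges on the mean, and the sketch below does that with a small helper. The function name share_within and the simulated vector z are assumptions made for illustration, not part of the original post. It is applied to simulated normal data, where the shares should land close to 68%, 95%, and 99.7%.

```r
# Hypothetical helper: share of values within k standard deviations of the mean
share_within <- function(x, k) {
  mean(abs(x - mean(x)) <= k * sd(x))
}

# Simulated normal data (the mean and sd here are arbitrary choices)
set.seed(42)
z <- rnorm(100000, mean = 10, sd = 3)

# Shares within 1, 2, and 3 standard deviations of the mean
shares <- sapply(1:3, function(k) share_within(z, k))
round(100 * shares, 1)
```

The same helper can be applied to the x vector from this post, for example share_within(x, 1), to get the mean-centered version of the counts above.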
