Understanding standard deviation and the 68-95-99.7 rule with R

   Standard deviation is one of the fundamental concepts in statistics and numerical data analysis. It tells us how widely the values of a data set are spread around their mean.
   The empirical rule, also called the 68-95-99.7 rule, gives the percentage of data that falls within 1, 2, and 3 standard deviations of the mean for a normal distribution.
   In this post, we'll briefly go through both concepts with R.

Preparing the data

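   First, we'll generate sample data for this tutorial with the sample() function in R. We draw 100 integers between -50 and 50, then overwrite 80 randomly chosen elements with integers between -20 and 20, so that most values cluster around zero.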

set.seed(1234)
 
x <- sample(-50:50, 100, replace = TRUE)
x[sample(1:100, 80)] <- sample(-20:20, 80, replace = TRUE)

We'll visualize the x data in a plot.

plot(x, type = "l", col = "blue")
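The line plot shows the values in index order. To see how they are distributed, a histogram may be more telling. The block below is a minimal sketch that recreates x so it runs on its own; the 10-unit bin width, the color, and the title are arbitrary choices, not part of the original post.

```r
# Recreate the post's sample vector so the block is self-contained
set.seed(1234)
x <- sample(-50:50, 100, replace = TRUE)
x[sample(1:100, 80)] <- sample(-20:20, 80, replace = TRUE)

# Bin the values into 10-unit intervals; most counts should fall
# in the bins between -20 and 20
h <- hist(x, breaks = seq(-50, 50, by = 10), col = "lightblue",
          main = "Distribution of x", xlab = "x")

# Number of values per bin
h$counts
```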


Standard deviation

   Before getting to the standard deviation, we need to understand the mean of a given vector.

Mean (μ) is the central value of the elements in a numerical set: their sum divided by their count. We can get the mean of the x vector with the mean() function in R.

mean(x)
[1] -0.93
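To make the definition concrete, the mean can also be computed by hand. This is a small sketch that recreates x so it runs on its own; manual_mean is a name introduced here for illustration.

```r
# Recreate the post's sample vector so the block is self-contained
set.seed(1234)
x <- sample(-50:50, 100, replace = TRUE)
x[sample(1:100, 80)] <- sample(-20:20, 80, replace = TRUE)

# Mean by definition: sum of the elements divided by their count
manual_mean <- sum(x) / length(x)

# Should agree with the built-in mean()
all.equal(manual_mean, mean(x))
```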

Standard deviation measures how much the elements of a set vary from its mean. It is commonly denoted by the letter σ, std, or SD. To get its value, we'll use the sd() function in R. Note that sd() computes the sample standard deviation, which uses n - 1 in the denominator.

sd(x)
[1] 16.07951

Variance is the average of the squared deviations from the mean, and it equals the square of the standard deviation. We can get it with either of the commands below.

var(x)
[1] 258.5506
 
sd(x)^2
[1] 258.5506
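The two definitions above can be checked by computing them step by step. The following is a minimal sketch, assuming the sample (n - 1) form that var() and sd() use; dev, manual_var, and manual_sd are names introduced here for illustration.

```r
# Recreate the post's sample vector so the block is self-contained
set.seed(1234)
x <- sample(-50:50, 100, replace = TRUE)
x[sample(1:100, 80)] <- sample(-20:20, 80, replace = TRUE)

n <- length(x)

# Deviation of each element from the mean
dev <- x - mean(x)

# Sample variance: sum of squared deviations divided by n - 1
manual_var <- sum(dev^2) / (n - 1)

# Standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)

# Both should agree with the built-in functions
all.equal(manual_var, var(x))
all.equal(manual_sd, sd(x))
```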


Empirical or 68-95-99.7 rule

   For normally distributed data, about 68%, 95%, and 99.7% of the values lie within 1σ, 2σ, and 3σ of the mean, respectively. The 68-95-99.7 rule takes its name from those percentages. Here, 1σ represents the range from μ - σ to μ + σ, 2σ is from μ - 2σ to μ + 2σ, and 3σ is from μ - 3σ to μ + 3σ. When the mean is zero, these become -σ to σ, -2σ to 2σ, and -3σ to 3σ.

We can check the x data and its sigma ranges by plotting a normal density curve with the same mean and standard deviation.
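The three percentages can be derived from the normal cumulative distribution function. As a short sketch using pnorm() on the standard normal distribution (k and coverage are names introduced here):

```r
# Number of standard deviations on either side of the mean
k <- 1:3

# Probability mass of a standard normal between -k and k
coverage <- pnorm(k) - pnorm(-k)

# Expressed as percentages
round(100 * coverage, 2)
# [1] 68.27 95.45 99.73
```

The rule's name rounds these figures to 68, 95, and 99.7.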

s <- sd(x)
m <- mean(x)
 
index <- seq(min(x), max(x), length.out = 100)
dn <- dnorm(index, mean = m, sd = s)
 
# Density curve with a vertical line at the mean
plot(index, dn, type = "l", lwd = 2)
grid()
abline(v = m)
text(m - 2, .02, "μ", pos = 3)
# Vertical lines at 1, 2, and 3 standard deviations on either side of the mean
abline(v = m + s, col = "green")
abline(v = m - s, col = "green")
text(m + s + 2, .02, "σ", pos = 3, col = "darkgreen")
text(m - s + 2, .02, "-σ", pos = 3, col = "darkgreen")
abline(v = m - 2 * s, col = "blue")
abline(v = m + 2 * s, col = "blue")
text(m + 2 * s + 3, .02, "2σ", pos = 3, col = "blue")
text(m - 2 * s + 3, .02, "-2σ", pos = 3, col = "blue")
abline(v = m + 3 * s, col = "red")
abline(v = m - 3 * s, col = "red")
text(m + 3 * s + 3, .02, "3σ", pos = 3, col = "red")
text(m - 3 * s + 3, .02, "-3σ", pos = 3, col = "red")


Calculating the percentages


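   Finally, we'll count the values that fall within 1σ [-σ, σ], 2σ [-2σ, 2σ], and 3σ [-3σ, 3σ]. Since x has 100 elements, each count is also a percentage. The mean of x is close to zero, so the code uses intervals centered on zero as an approximation of μ ± kσ.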

# TRUE for each element within one standard deviation of zero
sigma1 <- x >= -s & x <= s
sum(sigma1)
[1] 67
 
# Within two standard deviations
sigma2 <- x >= -2 * s & x <= 2 * s
sum(sigma2)
[1] 95
 
# Within three standard deviations
sigma3 <- x >= -3 * s & x <= 3 * s
sum(sigma3)
[1] 99
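Strictly speaking, the ranges are centered on the mean rather than on zero. A sketch of the mean-centered version follows; within_k_sd is a hypothetical helper introduced here, and the block recreates x so it runs on its own. Its results may differ slightly from the zero-centered counts above, since the mean of x is not exactly zero.

```r
# Recreate the post's sample vector so the block is self-contained
set.seed(1234)
x <- sample(-50:50, 100, replace = TRUE)
x[sample(1:100, 80)] <- sample(-20:20, 80, replace = TRUE)

# Hypothetical helper: share of values v with |v - mean(v)| <= k * sd(v)
within_k_sd <- function(v, k) mean(abs(v - mean(v)) <= k * sd(v))

# Proportion of x within 1, 2, and 3 standard deviations of its mean
coverage <- sapply(1:3, function(k) within_k_sd(x, k))
names(coverage) <- c("1 sd", "2 sd", "3 sd")

# Expressed as percentages
round(100 * coverage, 1)
```

Taking mean() of a logical vector gives the proportion of TRUE values directly, so the same helper works for vectors of any length.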


The counts of 67, 95, and 99 out of 100 are close to the expected 68%, 95%, and 99.7%. Our x vector is only roughly bell-shaped, so the match is approximate; for data drawn from a normal distribution, the percentages get closer to 68-95-99.7 as the number of samples grows.

   In this post, we've briefly covered standard deviation and the 68-95-99.7 rule with R, and learned how to calculate the percentage of data within the 1σ, 2σ, and 3σ ranges. The full source code is listed below.
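The effect of sample size can be explored with data that really is normal. The block below is a sketch under assumptions not in the original post: it draws 100,000 values with rnorm(), and the seed 42 is arbitrary.

```r
# Draw a large sample from a standard normal distribution
set.seed(42)  # arbitrary seed, chosen only for reproducibility
z <- rnorm(100000, mean = 0, sd = 1)

# Proportion of z within 1, 2, and 3 standard deviations of its mean
prop <- sapply(1:3, function(k) mean(abs(z - mean(z)) <= k * sd(z)))

# Expressed as percentages; expected to land near 68.3, 95.4, and 99.7
round(100 * prop, 1)
```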


Source code listing

set.seed(1234)
 
x <- sample(-50:50, 100, replace = TRUE)
x[sample(1:100, 80)] <- sample(-20:20, 80, replace = TRUE)
 
plot(x, type = "l", col = "blue")
 
mean(x) 
 
sd(x)
 
var(x)
 
sd(x)^2
 
s <- sd(x)
m <- mean(x)
 
index <- seq(min(x), max(x), length.out = 100)
dn <- dnorm(index, mean = m, sd = s)
 
# Density curve with a vertical line at the mean
plot(index, dn, type = "l", lwd = 2)
grid()
abline(v = m)
text(m - 2, .02, "μ", pos = 3)
# Vertical lines at 1, 2, and 3 standard deviations on either side of the mean
abline(v = m + s, col = "green")
abline(v = m - s, col = "green")
text(m + s + 2, .02, "σ", pos = 3, col = "darkgreen")
text(m - s + 2, .02, "-σ", pos = 3, col = "darkgreen")
abline(v = m - 2 * s, col = "blue")
abline(v = m + 2 * s, col = "blue")
text(m + 2 * s + 3, .02, "2σ", pos = 3, col = "blue")
text(m - 2 * s + 3, .02, "-2σ", pos = 3, col = "blue")
abline(v = m + 3 * s, col = "red")
abline(v = m - 3 * s, col = "red")
text(m + 3 * s + 3, .02, "3σ", pos = 3, col = "red")
text(m - 3 * s + 3, .02, "-3σ", pos = 3, col = "red")
 
# Count the elements within 1, 2, and 3 standard deviations of zero
sigma1 <- x >= -s & x <= s
sum(sigma1)

sigma2 <- x >= -2 * s & x <= 2 * s
sum(sigma2)

sigma3 <- x >= -3 * s & x <= 3 * s
sum(sigma3)


