
### Understanding standard deviation and 68-95-99.7 rule with R

Standard deviation is a fundamental concept in statistics and numerical data analysis. In this post, we'll briefly cover the standard deviation and the 68-95-99.7 rule with R.
First, we'll generate a sample vector for this tutorial. We can create it with the sample() function in R.

```
> set.seed(1234)  # makes the result reproducible
> x <- sample(-50:50, 100, replace = TRUE)  # 100 integers drawn from -50..50
> x[sample(1:100, 80)] <- sample(-20:20, 80, replace = TRUE)  # overwrite 80 of them with values from -20..20
```

To check the content of the x vector, we'll visualize it in a plot.

`> plot(x, type = "l", col = "blue")`

Before turning to the standard deviation, we need to understand the mean of the given vector.

The mean (μ) is the central value of a numerical set: the sum of its elements divided by their count. We can get the mean of the x vector with the mean() function in R.

```
> mean(x)
[1] -0.93
```
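To make the definition concrete, here is a minimal sketch that computes the mean by hand and compares it with mean(). The vector `v` and the name `manual_mean` are illustrative assumptions, not part of the tutorial data.

```r
# Illustrative vector (not the tutorial's x), chosen so the result is easy to check
v <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Mean = sum of the elements divided by their count
manual_mean <- sum(v) / length(v)

manual_mean                      # 5
all.equal(manual_mean, mean(v))  # TRUE
```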

Standard deviation measures how much the elements of a set vary from its mean. It is commonly denoted by the letter σ, std, or SD. To get its value, we'll use the sd() function in R, which computes the sample standard deviation (with an n - 1 denominator).

```
> sd(x)
[1] 16.07951
```

Variance is the average of the squared deviations from the mean of a set; it equals the square of the standard deviation. We can compute it with either of the commands below.

```
> var(x)
[1] 258.5506
> sd(x)^2
[1] 258.5506
```
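As a rough sketch of what var() and sd() do internally, the variance and standard deviation can be computed step by step. The vector `v` and the variable names below are illustrative assumptions. Note that R divides by n - 1 (sample variance); dividing by n gives the population variance instead.

```r
# Illustrative vector (not the tutorial's x)
v <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(v)

# Squared deviation of each element from the mean
sq_dev <- (v - mean(v))^2

sample_var <- sum(sq_dev) / (n - 1)  # sample variance, the quantity var() returns
sample_sd  <- sqrt(sample_var)       # sample standard deviation, as in sd()
pop_var    <- sum(sq_dev) / n        # population variance, shown for comparison

all.equal(sample_var, var(v))  # TRUE
all.equal(sample_sd, sd(v))    # TRUE
pop_var                        # 4
```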

#### 68-95-99.7 rule

For normally distributed data, about 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, that is, within ranges that are 2σ, 4σ, and 6σ wide. The 68-95-99.7 rule takes its name from these percentages. Here, the 2σ range runs from μ - σ to μ + σ, and about 68% of the data fall within it.
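The percentages in the rule can be recovered from the normal cumulative distribution function. The sketch below uses pnorm() for the standard normal distribution; the names `k` and `coverage` are illustrative.

```r
# Probability that a standard normal value falls within k standard deviations of the mean
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)

round(100 * coverage, 1)
# [1] 68.3 95.4 99.7
```

The rule's "68" and "95" are rounded forms of these values.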

We can check the x data and its sigma ranges by plotting a normal distribution curve with the same mean and standard deviation.

```
> s <- sd(x)
> m <- mean(x)
> index <- seq(min(x), max(x), length.out = 100)
> dn <- dnorm(index, mean = m, sd = s)
> plot(index, dn, type = "l", lwd = 2)
> grid()
> abline(v = m)  # vertical line at the mean
> text(m - 2, .02, "μ", pos = 3)
> # the mean is close to 0, so the sigma lines are drawn at ±σ, ±2σ, and ±3σ
> abline(v = c(-s, s), col = "green")
> text(s + 2, .02, "σ", pos = 3, col = "darkgreen")
> text(-s + 2, .02, "-σ", pos = 3, col = "darkgreen")
> abline(v = c(-2, 2) * s, col = "blue")
> text(2 * s + 3, .02, "2σ", pos = 3, col = "blue")
> text(-2 * s + 3, .02, "-2σ", pos = 3, col = "blue")
> abline(v = c(-3, 3) * s, col = "red")
> text(3 * s + 3, .02, "3σ", pos = 3, col = "red")
> text(-3 * s + 3, .02, "-3σ", pos = 3, col = "red")
```
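To see how closely the actual data follow that curve, one option is to overlay the fitted normal density on a histogram of x. This is a sketch, not part of the original walkthrough: the bin count, colors, and the names `h` and `grid_x` are assumptions, and the first lines simply regenerate x so the block runs on its own.

```r
# Regenerate the tutorial's x so this block is self-contained
set.seed(1234)
x <- sample(-50:50, 100, replace = TRUE)
x[sample(1:100, 80)] <- sample(-20:20, 80, replace = TRUE)

m <- mean(x)
s <- sd(x)

# Histogram on the density scale so it is comparable with dnorm()
h <- hist(x, breaks = 20, freq = FALSE, col = "grey90",
          main = "x compared with a fitted normal curve")

# Normal density with the same mean and standard deviation as x
grid_x <- seq(min(x), max(x), length.out = 200)
lines(grid_x, dnorm(grid_x, mean = m, sd = s), lwd = 2, col = "blue")

# Dashed lines at 1, 2, and 3 standard deviations on each side of the mean
abline(v = m + c(-3:-1, 1:3) * s, lty = 2, col = "grey40")
```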

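Finally, we'll count the values in the 2σ [-σ:σ], 4σ [-2σ:2σ], and 6σ [-3σ:3σ] ranges. Because the mean (-0.93) is close to zero, the ranges are centered on zero as an approximation, and because x has 100 elements, each count is also a percentage.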

```
> sigma1 <- x >= -s & x <= s
> sum(sigma1)  # number of TRUE values
[1] 67
> sigma2 <- x >= -2 * s & x <= 2 * s
> sum(sigma2)
[1] 95
> sigma3 <- x >= -3 * s & x <= 3 * s
> sum(sigma3)
[1] 99
```

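The results are close to the expected values. The match is only approximate because the rule holds exactly for a normal distribution, while x was drawn from uniform integer ranges. For a large sample drawn from a normal distribution, the percentages come closer to 68-95-99.7.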
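The effect of sample size can be sketched with simulated normal data. The helper `within_k_sd()` and the chosen sample sizes are illustrative assumptions. Unlike the zero-centered counts above, the helper centers each interval on the sample mean.

```r
# Proportion of values within k standard deviations of the mean
within_k_sd <- function(v, k) {
  mean(abs(v - mean(v)) <= k * sd(v))
}

set.seed(1234)
for (n in c(100L, 10000L, 1000000L)) {
  z <- rnorm(n)  # simulated standard normal data, not the tutorial's x
  pct <- 100 * sapply(1:3, function(k) within_k_sd(z, k))
  cat(sprintf("n = %7d: %.2f%% %.2f%% %.2f%%\n", n, pct[1], pct[2], pct[3]))
}
```

With the larger samples, the printed percentages should settle near 68.27, 95.45, and 99.73.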