## Important Notation

Mathematical notation, while often confusing and frustrating at first, is very useful for conveying complex ideas in a compact and precise manner. The table below lists the notation relevant to the previous tutorial on central tendency and to this one on variability.

We will be using the following notation in this class:

Symbol | Meaning |
---|---|
\(y\) | Dependent variable |
\(x\) | Independent variable |
\(N\) | Population size |
\(n\) | Sample size |
\(\Sigma\) | Sum of |
\(\mu\) | Population mean |
\(M\) | Sample mean (also often \(\bar{x}\)) |
\(M_w\) | Weighted mean (or \(\bar{x}_w\)) |
\(IQR\) | Interquartile range |
\(\sigma^2\) | Population variance |
\(s^2\) | Sample variance |
\(\sigma\) | Population standard deviation |
\(s\) | Sample standard deviation |

## Range

#### GRE scores for two classes with the same mean (151.3) and same range (40):

We need some way to quantify the variability of the scores in distributions that reflects all scores and is sensitive to outliers. We already have a single score for the average or “typical” score - the mean. Ideally, we want a single number that represents the typical variation from the mean.


The range is the simplest way to describe variability or how scores are dispersed across possible values. The range is the difference between the highest and lowest values in the variable.

`Range = Highest - Lowest`

The range is useful for identifying outliers, but it is also very sensitive to them. If one value is drastically different from the others, the range can be misleading. For example, in the histograms of GRE scores above, the two classes have the same range even though their scores are spread out very differently. This exposes a major limitation of the range: it is based on only two of the scores in the variable.
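As a minimal sketch of how a single extreme value can dominate the range, the Python snippet below uses a small set of hypothetical scores (invented for illustration, not the GRE data) and a hypothetical helper named `value_range`.

```python
# Hypothetical scores, invented for illustration (not the tutorial's GRE data).
scores = [148, 150, 151, 152, 153, 154]
with_outlier = scores + [190]  # the same scores plus one extreme value


def value_range(values):
    """Return the range: the highest value minus the lowest value."""
    return max(values) - min(values)


print(value_range(scores))        # 6
print(value_range(with_outlier))  # 42
```

Adding one score changes the range from 6 to 42, even though the other six scores are untouched.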

## Quantiles

Quantiles are a set of values that divide a variable into equal-sized groups. The most common quantiles are quartiles, which divide a variable into four equal parts, so that there are the same number of scores in each part. The first (lower) quartile separates the lower 25% of the scores from the upper 75%. The second quartile, which is the median, separates the lower 50% from the upper 50%. The third (upper) quartile separates the lower 75% from the upper 25%. Together, these three quartiles separate the variable into four equal parts.
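The quartiles can be sketched in Python with the standard library's `statistics.quantiles`. The scores below are hypothetical, and the interquartile range is computed with its usual definition, the third quartile minus the first. Note that software packages use slightly different interpolation conventions, so quartiles from other tools may differ a little from these.

```python
import statistics

# Hypothetical scores, invented for illustration (not the tutorial's GRE data).
scores = [142, 145, 147, 149, 150, 151, 152, 154, 156, 158, 161]

# n=4 asks for the three cut points that split the data into four equal parts.
# The default method is "exclusive"; other conventions can give other values.
q1, q2, q3 = statistics.quantiles(scores, n=4)

iqr = q3 - q1  # interquartile range: spread of the middle 50% of the scores

print(q1, q2, q3)  # 147.0 151.0 156.0
print(iqr)         # 9.0
```

The second quartile matches `statistics.median(scores)`, which is also 151.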

## Variance

The variance is a very important measure of variability. It is also closely related to the standard deviation, which we talk about next.

### Deviations

We can calculate the distance between each score and the mean by subtracting the mean from the score. These distances are called deviations. Each score has a deviation: positive if the score is above the mean, negative if it is below. I have plotted the histogram of the deviations for each of the two classes' GRE scores below.

We might think to just take the mean of the deviations as a measure of the average or “typical” distance from the mean for each set of scores. But the mean deviation for the first set of scores is 0, and the mean deviation for the second set is also 0. This is because, being a measure of central tendency, the mean is in the middle, and the positive distances of scores above the mean cancel out the negative distances of scores below the mean.
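The cancellation described above can be checked with a short Python sketch. The five scores are hypothetical and were chosen so that the mean is a whole number.

```python
# Hypothetical scores, invented for illustration (not the tutorial's GRE data).
scores = [148, 150, 151, 152, 154]

mean = sum(scores) / len(scores)  # 151.0

# Deviation of each score: the score minus the mean (sign is kept).
deviations = [x - mean for x in scores]

print(deviations)                         # [-3.0, -1.0, 0.0, 1.0, 3.0]
print(sum(deviations))                    # 0.0
print(sum(deviations) / len(deviations))  # 0.0, the mean deviation
```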

We could take the mean of the absolute values of the deviations, but a more ingenious solution is to square the deviations.

This does two things:

- Squaring takes care of the problem of the deviations summing to 0, as none of the squared deviations can be negative. Remember, a positive times a positive is a positive, but a negative times a negative is also a positive.
- The sum of the squared deviations is at its minimum when the deviations are taken from the mean rather than from any other value. Explaining why would take us too far afield. Suffice it to say, the sum of the squared deviations is more influenced by scores further from the mean, making the variance sensitive to outliers.
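Both points can be illustrated with a small Python sketch. The scores are hypothetical, and `sum_of_squares` is a helper name introduced here only for illustration.

```python
# Hypothetical scores, invented for illustration (not the tutorial's GRE data).
scores = [148, 150, 151, 152, 154]
mean = sum(scores) / len(scores)  # 151.0


def sum_of_squares(values, center):
    """Sum of squared deviations of the values from a chosen center."""
    return sum((x - center) ** 2 for x in values)


# Deviations taken from the mean give the smallest sum of squares.
print(sum_of_squares(scores, mean))  # 20.0
print(sum_of_squares(scores, 150))   # 25
print(sum_of_squares(scores, 152))   # 25

# Scores further from the mean contribute far more after squaring.
squared = [(x - mean) ** 2 for x in scores]
print(squared)  # [9.0, 1.0, 0.0, 1.0, 9.0]
```

A score 3 points from the mean contributes 9 to the total, while a score 1 point away contributes only 1.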

The sum of the squared deviations is often referred to simply as the “sum of squares”, and symbolized as \(SS\). The sum of squares is not very meaningful by itself, so we often calculate the mean squared deviation by dividing the \(SS\) by \(N\). This gives us the population variance:

\[ \sigma^2 = \frac{SS}{N} = \frac{\Sigma(x - \mu)^2}{N} \]

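The sample variance is calculated in a similar way, except that the deviations are taken from the sample mean \(M\) and the sum of squares is divided by \(n - 1\) rather than \(N\):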

\[ s^2 = \frac{\Sigma{(x - M)^2}}{n - 1} \]
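As a rough sketch of the two formulas, the Python snippet below computes both variances by hand for a set of hypothetical scores and compares them with the standard library functions `statistics.pvariance` (divides by \(N\)) and `statistics.variance` (divides by \(n - 1\)).

```python
import statistics

# Hypothetical scores, invented for illustration (not the tutorial's GRE data).
scores = [148, 150, 151, 152, 154]
n = len(scores)
mean = sum(scores) / n  # 151.0

# Sum of squares (SS): the sum of the squared deviations from the mean.
ss = sum((x - mean) ** 2 for x in scores)  # 20.0

population_variance = ss / n    # treats the scores as the whole population
sample_variance = ss / (n - 1)  # treats the scores as a sample

print(population_variance)  # 4.0
print(sample_variance)      # 5.0

# The standard library gives the same values.
print(statistics.pvariance(scores) == population_variance)  # True
print(statistics.variance(scores) == sample_variance)       # True
```

Because the sample formula divides by a smaller number, the sample variance is always a little larger than the population variance computed from the same scores.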

### Comparing formulae for mean and variance

\[ \mu = \frac{\Sigma x}{N}, \quad \sigma^2 = \frac{\Sigma(x - \mu)^2}{N}. \]

Comparing the formulae for the mean and variance makes clear that the variance is the mean squared deviation.

## Standard Deviation

If the variance is the mean squared distance from the mean, then taking the square root of the variance returns us to the original units of the scores and gives us a measure of the typical, or roughly average, distance from the mean.

### Population standard deviation

\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{SS}{N}} = \sqrt{\frac{\Sigma{(x -\mu)^2}}{N}}. \]

### Sample standard deviation

\[ s = \sqrt{s^2} = \sqrt{\frac{SS}{n-1}} = \sqrt{\frac{\Sigma{(x - M)^2}}{n-1}}. \]
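The two standard deviation formulas can be sketched in Python as the square roots of the corresponding variances. The scores are hypothetical, and the hand calculation is checked against the standard library functions `statistics.pstdev` and `statistics.stdev`.

```python
import math
import statistics

# Hypothetical scores, invented for illustration (not the tutorial's GRE data).
scores = [148, 150, 151, 152, 154]
n = len(scores)
mean = sum(scores) / n
ss = sum((x - mean) ** 2 for x in scores)  # sum of squares, 20.0

population_sd = math.sqrt(ss / n)    # square root of the population variance
sample_sd = math.sqrt(ss / (n - 1))  # square root of the sample variance

print(population_sd)        # 2.0
print(round(sample_sd, 2))  # 2.24

# The standard library agrees with the hand calculation.
print(math.isclose(statistics.pstdev(scores), population_sd))  # True
print(math.isclose(statistics.stdev(scores), sample_sd))       # True
```

Unlike the variance, both results are in the same units as the original scores.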

The variance of x1 is 54.71 and the variance of x2 is 9.51. The standard deviation of x1 is 7.4 and the standard deviation of x2 is 3.08.

```
   vars   n   mean   sd    min    max range   se
x1    1 980 151.15 7.40 130.01 169.49 39.48 0.24
x2    2 980 151.34 3.08 130.00 170.00 40.00 0.10
```