
Stats Hack #2

Next up from O’Reilly’s Statistics Hacks is: Describe the World Using Just Two Numbers. This is an explanation of something called the Central Limit Theorem.

Amongst all of the chi-squares and alphas, one of the few concepts I know very well is the mean … the average of a set of numbers, derived by adding up all the values and dividing by the number of data points in the sample. Easy. What I didn’t know, though, is that there is something very interesting about that number. A mean is a number that is as close as a number can be to all of the values in the set. That is, if you squared the difference between the mean and each value and added those squares up, that sum total would be smaller than if you did that with any other number. (The squaring matters: the raw differences from the mean are a mix of positive and negative numbers that cancel out to exactly zero.) And, if you take that sum total for the mean, divide it by the number of scores minus 1, and take the square root, you get the standard deviation, which is roughly the typical distance of a score from the mean.
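To convince myself of the "closest number" idea, here is a minimal Python sketch. The scores are made up for illustration, and the helper name sum_squared_differences is my own, not something from the book. It compares the mean against a few nearby candidates and then builds the standard deviation by hand.

```python
import math

scores = [2, 4, 4, 5, 7, 8]  # made-up scores, purely illustrative


def sum_squared_differences(values, center):
    """Add up the squared distance from each value to `center`."""
    return sum((v - center) ** 2 for v in values)


mean = sum(scores) / len(scores)

# Compare the mean against a handful of nearby candidate "centers".
candidates = [mean - 1, mean - 0.5, mean, mean + 0.5, mean + 1]
totals = {c: sum_squared_differences(scores, c) for c in candidates}
best_center = min(totals, key=totals.get)  # candidate with the smallest total

# Sample standard deviation: square root of the summed squares over (n - 1).
sample_sd = math.sqrt(sum_squared_differences(scores, mean) / (len(scores) - 1))

print("mean:", mean)
print("candidate with the smallest sum of squares:", best_center)
print("standard deviation:", sample_sd)
```

Only five candidates are tried here, so this is a spot check rather than a proof, but the mean wins against each of them.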

The Central Limit Theorem states that “if you randomly select multiple samples from a population, the means of each of those samples will be normally distributed.” (Frey) That translates to being able to use the standard deviation and mean of a sample (with a sample size of at least N = 30) to project what the entire population is like. The larger the sample size, the more accurate the estimate, but even one sample of adequate size can be a good estimate. Dead math people told us so. What is ultimately calculated is the standard error of the mean, or the degree to which the sample mean would stray from the population mean: the standard deviation of the sample divided by the square root of the sample size.
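A small simulation makes this easier to believe. The sketch below is my own illustration, not code from the book. It assumes a deliberately lopsided (exponential) population, draws many samples of 30, and compares the spread of the sample means with what the standard error predicts. The population size, number of samples, and seed are arbitrary choices.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# A deliberately non-normal, right-skewed population (assumed for illustration).
population = [rng.expovariate(1.0) for _ in range(50_000)]
population_mean = statistics.fmean(population)
population_sd = statistics.pstdev(population)

sample_size = 30

# Draw many random samples and record the mean of each one.
sample_means = [
    statistics.fmean(rng.sample(population, sample_size))
    for _ in range(2_000)
]

# The spread of those means should be close to sd / sqrt(n).
predicted_se = population_sd / sample_size ** 0.5
observed_spread = statistics.stdev(sample_means)

# In practice there is only one sample, so the standard error is
# estimated from that sample's own standard deviation.
one_sample = rng.sample(population, sample_size)
estimated_se = statistics.stdev(one_sample) / sample_size ** 0.5

print("population mean:", population_mean)
print("mean of sample means:", statistics.fmean(sample_means))
print("predicted standard error:", predicted_se)
print("observed spread of sample means:", observed_spread)
print("standard error estimated from one sample:", estimated_se)
```

Plotting a histogram of sample_means would show the bell shape directly. I left that out to keep the sketch dependency-free.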

This can be used as a test of whether a sample was a random sampling. If the mean of the sample is far from the actual mean relative to the standard error (more than about two standard errors is a common rule of thumb), then the sample probably wasn’t drawn by chance. Sample means behave this predictably because each one is shaped by “lots of random forces and unrelated events,” and thus the means have a normal distribution. This is true whether or not the population itself is normal.
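As a rough sketch of that check, assuming the population mean is already known: the function name standard_errors_from_mean, the scores, and the cutoff of two standard errors are all my own illustrative choices, not something prescribed by the book.

```python
import math
import statistics


def standard_errors_from_mean(sample, population_mean):
    """Return how many standard errors the sample mean sits from the population mean."""
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return (statistics.fmean(sample) - population_mean) / standard_error


# Made-up scores, repeated three times to reach the N = 30 rule of thumb.
sample = [98, 102, 101, 97, 103, 99, 100, 104, 96, 100] * 3

CUTOFF = 2  # assumed rule of thumb, roughly a 95% band under a normal curve

for claimed_mean in (100, 105):
    distance = standard_errors_from_mean(sample, claimed_mean)
    plausible = abs(distance) <= CUTOFF
    print(f"claimed mean {claimed_mean}: {distance:.2f} standard errors away,",
          "plausibly random" if plausible else "unlikely to be chance")
```

A sample that lands inside the cutoff is only consistent with random sampling. It doesn't prove the sampling was random.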



Some definitions:

descriptive statistics

the properties of the sample scores

inferential statistics

the properties of the entire population, inferred from what is known about sample scores

Central Limit Theorem

the principle that the means of random samples are normally distributed, which allows two sample values (the mean and the standard deviation) and an assumption about the shape of the distribution to describe a population

central tendency

a fair summary representation of all scores in a sample

mean

the arithmetic average of all scores in a sample (total value / total data points), often the best measure of central tendency

variability

a representation of how far from center most of the data falls

standard deviation

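roughly the average distance from each score to the mean (formally, the square root of the sum of squared distances over the number of scores minus 1), often the best measure of variability in a data set because it uses all values in the distribution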

variance

the square of the standard deviation, most useful in comparing different distributions rather than describing a single distribution

standard error of the mean

the degree to which the sample mean would stray from the population mean, calculated using standard deviation of a sample over the square root of the number in the sample
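
For reference, most of these definitions map onto Python's standard statistics module. This is a minimal sketch with made-up scores. Six scores is far fewer than the N = 30 suggested above, so the numbers are only there to show the relationships.

```python
import math
import statistics

scores = [2, 4, 4, 5, 7, 8]  # made-up scores, purely illustrative

mean = statistics.mean(scores)          # central tendency
sd = statistics.stdev(scores)           # sample standard deviation (n - 1)
variance = statistics.variance(scores)  # the square of the standard deviation
sem = sd / math.sqrt(len(scores))       # standard error of the mean

print("mean:", mean)
print("standard deviation:", sd)
print("variance:", variance)
print("standard error of the mean:", sem)
```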