August 2, 2006

More from Statistics Hacks … Hack #5: Go Big to Get Small.

Accuracy is the big bugaboo with statistics. We can crunch numbers and manipulate (er, massage) data to get interesting results, but the proof is in the pudding. And since the whole point of statistics in the first place is to avoid experimentation on massive populations, it takes lots of pudding to get a good enough taste to satisfy. Increasing the size of the sample tends to improve the reliability of the conclusions.

The basis for this insight is Jakob Bernoulli’s Golden Theorem. “It is likely the single most useful discovery in the history of statistics,” according to author Bruce Frey.

In probability, this means that the number of sample points examined determines how far the predictions are likely to fall from the actual results embedded in the world. This standard error, the typical gap between the guess and the ideal observation, shrinks in random systems in proportion to the inverse of the square root of the sample size. The specific value depends on the scale of measurement and variability of the sample, but the gist is: big samples = small error.
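To put rough numbers on the square-root rule, here is a minimal Python sketch. It assumes the textbook formula for the standard error of a sample mean (the population's standard deviation divided by the square root of the sample size); the function name and the 50/50 yes-or-no example are illustrative choices, not taken from the book.

```python
import math

def standard_error(sd, n):
    """Standard error of a sample mean: population SD divided by sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return sd / math.sqrt(n)

# Assumed example: a yes/no question with a true 50/50 split, so SD = 0.5.
SD = 0.5
for n in (100, 400, 1600, 6400):
    print(f"n = {n:>5}: standard error = {standard_error(SD, n):.4f}")
```

Each step quadruples the sample but only halves the error, which is why shaving off the last bit of uncertainty takes so much pudding.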

To illustrate this further, imagine yourself as Raymond, counting cards with your exploitive brother in the movie “Rain Man.” As soon as the first cards are revealed, Raymond can make a guess about what cards are still in the deck. But with so many cards left in the deck, it is not likely to be an accurate guess. The only certainty is that the chance of getting cards like the ones he has seen has gone down. If he sees an ace of hearts on the table, then there is one fewer ace hidden in the deck. The more cards played, the more certain Raymond can be that he knows what remains in the shrinking deck. Fortunately, statisticians don’t get kicked out of academia for counting cards; they publish papers.
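Raymond's arithmetic can be sketched in a few lines of Python. This is a simplified illustration that assumes a single standard 52-card deck with four cards of each rank (casinos typically deal from several decks); the function name and the way seen cards are recorded are hypothetical.

```python
from fractions import Fraction

def chance_of_rank(rank, seen, copies_per_rank=4, deck_size=52):
    """Chance that the next card has the given rank, given the ranks already seen."""
    remaining_of_rank = copies_per_rank - seen.count(rank)
    remaining_cards = deck_size - len(seen)
    if remaining_cards <= 0:
        raise ValueError("no cards left in the deck")
    return Fraction(remaining_of_rank, remaining_cards)

# Nothing revealed yet: 4 aces among 52 cards.
print(chance_of_rank("A", []))       # 1/13

# One ace on the table: 3 aces among the 51 hidden cards.
print(chance_of_rank("A", ["A"]))    # 1/17
```

With every card revealed the denominator shrinks, so each new observation moves the estimate more. If 48 non-aces have gone by, the function returns 1: the next card must be an ace.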

This all works, though, only if the sample is considered statistically random. Otherwise (sorry) … all bets are off.


Some definitions:

Law of Large Numbers

as the size of the sample increases, the mean of the sample grows closer to the mean of the whole population

standard error

the typical gap between a sample estimate and the true value in the whole population
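The Law of Large Numbers defined above can also be watched in action. The sketch below simulates fair coin flips with Python's standard random module and reports the share of heads for growing sample sizes. The fair-coin setup, the function name, and the fixed seed are assumptions made for the sake of a repeatable illustration.

```python
import random

def share_of_heads(n, seed=0):
    """Flip a simulated fair coin n times and return the fraction of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n))
    return heads / n

# The population mean is 0.5; the sample mean should tend to drift toward it.
for n in (10, 100, 1_000, 10_000, 100_000):
    mean = share_of_heads(n)
    print(f"n = {n:>7}: share of heads = {mean:.4f}, gap = {abs(mean - 0.5):.4f}")
```

Any single run can wobble, and a small sample may happen to land close to 0.5 by luck, but the gap for the largest samples should be reliably tiny.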