Monday, October 16, 2006

Statistics Primer: Part 4: Sampling Distributions

The time has come to make a clearer link between the messages of my previous posts. One of them talked about the concept of probability and one about the sample mean, sample proportion and sample standard deviation. We are now going to build on those two posts to bring them together into a fruitful marriage of sorts.

To begin with, cast your mind back to the two disciplines: statistics and probability theory. How do they relate to each other? Think of this example: I have a deck of 52 cards, half of which are red and half black. If I randomly draw five cards from this deck, how likely is it that all of them are red? Now that is a question in probability theory. We know what the population (the deck of cards) looks like, and we wish to learn what the sample (the hand of five cards) might look like. Statistics reverses this way of thinking. For example, suppose that you have a deck of 52 cards, but you have manipulated the deck so that it no longer has exactly half red cards. You give me a randomly drawn hand of five cards, and my job is to try to figure out, from the hand I have, what the deck's proportions of red and black cards might be.

In short, statistics uses probability theory "in reverse".

We need one additional pair of terms to get going, and that is the pair of "a variable" and "a constant". A constant is exactly what it sounds like: a number which has a specified constant value. For example, if we calculate the sample proportion of people who love chocolate to be 0.95, then that is a constant. A variable is then not a constant. If we have a sample of annual incomes for Americans, and these incomes vary from zero to, say, 200,000 dollars, then "annual income" is a variable with many different values depending on which sample observation we select.

Variables can be quantitative, as in the previous example, or they can be qualitative. An example of the latter would be the religion of a person in a poll about voting. It can take many different values (Catholic, Evangelical Christian, Jewish, Muslim, Buddhist, Other, Atheist). These are qualities, not quantities. But remember: we can compute the proportion of people in each category to get quantitative data from them.

Quantitative variables are of two main types: continuous and discrete. A continuous variable is one where the following question will be answered in the affirmative: If you take any two values of the variable, is there always a third possible value between the two? Heights and weights are continuous variables. Check this: If one piece of chocolate weighs 2 oz. and another 3 oz., it's clearly possible to have a third piece which weighs something between 2 and 3 oz. The number of visitors to a store selling chocolate on one weekday, on the other hand, is a discrete variable. Check this: If the store had 20 visitors one Monday and 21 on another Monday, is it possible that it had, say, 20.75 visitors on some other Monday? The answer is negative (unless we are averaging over the visitor numbers, but then we are talking about something different).

The next stage in our adventure is a magic trick. We are going to take the sample statistics such as the mean, the variance and the standard deviation, all constants, and we are going to transform them into variables! Well, not really, but something a little like that. How is this trick carried out?

Let's go back to the example of a deck of 52 playing cards. This deck has been manipulated by someone so that it doesn't necessarily have 26 red cards and 26 black cards as the usual decks do, and we don't know the true proportion of, say, red cards. Earlier I suggested that one hand of five playing cards has been randomly drawn from this deck. This sample could be used to make a point estimate of the proportion of red cards in the deck. Suppose the hand contains three red cards and two black ones. Then the proportion of red cards in the sample is 0.6. This is a constant, right?

Suppose now the evil person who messed up the deck puts these cards back into the deck, reshuffles it, and deals out another five cards. The new sample has two red cards, so the sample proportion is 0.4. The cards are returned to the deck again, and yet another hand of five cards is dealt out, this one with three red cards (a sample proportion of 0.6 again). And so on. What is going on here?

Note that we now have a variable, the sample proportion of red cards. It can take more than one value in the experiment. Imagine the dealing-out of five cards continued for, say, a hundred times. We'd then have a hundred sample proportions, and we could use those numbers to figure out the average of all these proportions! We could even calculate a variance and a standard deviation for the sample proportion! Layers upon layers, you might say. First we had one single sample proportion and one single sample standard deviation. Now we have a whole distribution of sample proportions, and we have a standard deviation for the sample proportion itself!
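The card-dealing experiment is easy to mimic on a computer. Here is a small sketch in Python; the deck's composition (30 red cards, 22 black) is my own made-up example of what the "evil person" might have done, chosen purely for illustration:

```python
import random

# Hypothetical manipulated deck: 30 red cards, 22 black. The true
# proportion of red (30/52) is what a statistician would try to estimate.
deck = ["red"] * 30 + ["black"] * 22

random.seed(0)  # fixed seed so the run is reproducible

sample_proportions = []
for _ in range(100):
    hand = random.sample(deck, 5)        # deal five cards without replacement
    p_hat = hand.count("red") / 5        # sample proportion of red cards
    sample_proportions.append(p_hat)     # cards go back; reshuffled next round

# The sample proportion is now a variable: it takes several different
# values (multiples of 0.2, since a five-card hand has 0 to 5 red cards).
print(sorted(set(sample_proportions)))
```

Collecting a hundred such proportions gives a picture of how the sample proportion bounces around from hand to hand, which is exactly the "layers upon layers" idea above.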

The title of this post has to do with sampling distributions. This is a fancy name for showing all the possible values that, say, a sample mean could take in this sort of repeated sampling, together with their probabilities. Or the sample proportion. Or the sample standard deviation. So a sampling distribution is a probability distribution for a sample statistic. If we bring in the heavy artillery of mathematics, it turns out that we can derive explicit formulas for the measures of central tendency and dispersion of these probability distributions. Quite wonderfully, it turns out that the average (or the expected value, as it is properly called) of all sample proportions is... voilà! the population proportion! The very thing we wish to estimate! The same is true for the sampling distribution of the sample mean. It averages (in the expected value sense) to the population mean in this sort of repeated sampling (dealing out the five cards over and over from the deck).
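We can check the expected-value claim by brute force: deal out five-card hands a very large number of times and average the sample proportions. Again, the 30-red/22-black deck is my own illustrative assumption, not anything from the original example:

```python
import random

deck = ["red"] * 30 + ["black"] * 22   # true proportion of red: 30/52 ≈ 0.577
random.seed(1)

n_repeats = 100_000
total = 0.0
for _ in range(n_repeats):
    hand = random.sample(deck, 5)
    total += hand.count("red") / 5     # add this hand's sample proportion

average_of_proportions = total / n_repeats
print(round(average_of_proportions, 3))  # lands very close to 30/52 ≈ 0.577
```

Individual hands give proportions as far off as 0.0 or 1.0, yet their long-run average homes in on the population proportion. That is what it means for the sample proportion to be an unbiased estimator.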

The formulas for the variance and the standard deviation of the sample proportion and the sample mean are not quite as intuitive, but they make sense after some thought. Take the sample variance and divide it by the sample size: this gives an estimate of the variance of the sample mean (or sample proportion). If you need the standard deviation, take the square root of the whole thing. So these measures of dispersion get smaller the larger the sample we use. Makes sense. Think about the deck example, but with ten cards dealt out every time. The proportion of red cards in these bigger samples is not going to vary as much from the true unknown proportion in the deck as it does in the smaller samples.
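The shrinking-dispersion claim can also be seen by simulation. The sketch below (using the same made-up 30-red deck as an assumption) builds the empirical sampling distribution of the proportion for five-card hands and for ten-card hands, and compares their standard deviations:

```python
import random
import statistics

deck = ["red"] * 30 + ["black"] * 22
random.seed(2)

def proportions(hand_size, repeats=20_000):
    """Empirical sampling distribution of the proportion of red cards."""
    return [random.sample(deck, hand_size).count("red") / hand_size
            for _ in range(repeats)]

sd_5 = statistics.stdev(proportions(5))
sd_10 = statistics.stdev(proportions(10))

print(round(sd_5, 3), round(sd_10, 3))  # the 10-card hands vary less
```

The standard deviation for ten-card hands comes out noticeably smaller than for five-card hands: bigger samples hug the true proportion more tightly, which is why larger polls deserve more trust.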

We are almost ready to start playing with real examples of polls. There's only one missing link, and that has to do with the question of probabilities in the sampling distributions. How do we know how likely each possible sample mean or proportion is in the repeated sampling? That is the topic of the next post in this series.

The crucial point of this post in the series is the following: When we take a sample and use it to make estimates about the unknown population characteristics of, say, voter preferences between candidates, we think of this sample as coming from a distribution of many possible samples. This means that the sample proportion we get is a variable. It has a mean and a variance of its own, and the variance, in particular, affects the trust we can place in our estimates.
This is a wonderful toy to play with. It lets you see the sampling distribution appear.
You can read the earlier parts of this series here: Part 1, Part 2 and Part 3.