## Wednesday, October 04, 2006

### Statistics Primer. Part 1: Samples

Don't run away. This is going to be gentle and soft and not as hard as you expect. Honest. I'm going to teach you some statistics and you don't have to pay the sorts of fees I would usually get for this. Something for nothing! Well, not quite. You still have to be willing to work just a little. But by the end of this post you will be so smart and bullet-proofed against a lot of lying with statistics.

So let us begin, you and I, even though we are not T.S. Eliot or in a poem. Rather, imagine that we are in a kitchen, a kitchen with a gigantic pot of really wonderful-smelling soup in it, and imagine that we are responsible for deciding if the soup needs its seasonings corrected.

How would you go about doing that? Yep, you would take a spoon or a ladle and taste the soup. That is pretty much what statisticians do when they take a sample. A sample is a ladleful of information from the population which is the whole soup. The reason for studying a sample is also fairly close to the same reason we only taste a ladleful of the soup to check the seasoning. If we ate all the soup there would be none left and we'd have to make more which would be time-consuming and expensive. Likewise, studying the whole population would be time-consuming and expensive, and in some cases also destructive (imagine testing how long light bulbs work, say).

The soup-and-ladle analogy works pretty well for explaining how sampling works. Think about a soup that has not been well stirred, which has lumps of carrots in one area and all the onions in another area. If you dip a ladle into that soup and then taste the contents of the ladle you may get a very different idea of the overall taste depending on where the ladle happened to enter the soup.

The way to correct that problem is to stir the soup first. Stirring makes the soup random. But we can't really stir populations, and so the solution in sampling is a little different. For example, we might skim the ladle across the surface of the soup or dip it a little into several different places in the soup to get an idea of the totality of the soup. These and similar approaches are ways of trying to guarantee that we get what statisticians call a random sample. In the simplest case a random sample gives each unit in the population the same chance of being selected for the sample. More generally, random sampling tries to avoid bias. A biased sampling process is one where different elements in the population have different likelihoods of being entered into the sample. A biased sample might overrepresent the carrots, say, and have too few onions, because the carrots are volunteering for the sample and the onions are refusing to participate.
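To make the idea concrete for readers who like to tinker, here is a tiny sketch of simple random sampling in Python. The "population" and its makeup are invented for illustration; the point is only that `random.sample` gives every element the same chance of being chosen.

```python
import random

# A made-up "population": each element is one vegetable piece in the soup.
population = ["carrot"] * 60 + ["onion"] * 40

random.seed(42)  # fixed seed just so the example is reproducible

# One well-stirred ladleful: every piece has the same chance of selection,
# so the ladle's makeup should roughly mirror the population's 60/40 split.
ladle = random.sample(population, 20)

print(ladle.count("carrot"), ladle.count("onion"))
```

Run it a few times with different seeds and the counts will bounce around the 12-carrots-to-8-onions mark, which is exactly the sampling luck discussed further below.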

More generally, samples where people choose to participate, such as those you often see on the internet, are biased samples. They omit the opinions of all those people who don't go on the net or who don't feel strongly enough or have enough time to click the vote-button. The results, then, tell us little about what people in general might think about the question the poll posed. Another example of a biased sample would be to carry out a general health assessment study by using only hospital records to pick subjects. People who have not been in a hospital in the recent past will not have any chance of being included in the sample, and the results are probably going to be biased towards greater apparent ill-health. In short, we don't want to let people decide themselves if they want to be in the sample and we don't want to exclude some people altogether by picking a sampling frame (here hospital records and more generally the source we use to find the sample) that doesn't include them.

A good sample is not based on convenience sampling, either. An example of the latter would be when a reporter goes out to the local mall to ask people about their opinions on some hot-button issue. This is convenient for the reporter, but unless we are interested in the population of people in malls it is not a way of getting a representative range of opinion. It excludes all those who don't visit malls (the bedridden, for one group) and, depending on the time of the day, it might also exclude all people at work. And these excluded groups might have quite different average opinions.

So polls usually employ random sampling to get the group that then is questioned. What sampling frame should they use in this? The most common one today consists of telephone numbers for landlines. But you can see how this might become a poor sampling frame as owning only a cell phone becomes more common, especially among younger people.

Most polls don't use simple random sampling of the kind I described, the kind that would be close to putting all names in a large hat and then stirring the names and randomly picking some. There are three reasons for this. First, simple random sampling could be incredibly expensive. Imagine that you are doing a study and that you need to interview 2,000 people in person. If you pick the names for these people randomly all across the United States, you might end up having to travel to two thousand different localities. To avoid this, many studies first draw randomly a smaller number of geographical localities and then randomly pick a certain number of respondents within each of these localities. In my example this could be picking twenty random places and then picking a hundred respondents randomly in each place.
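The twenty-places-then-a-hundred-respondents idea can be sketched in a few lines. The localities, their sizes, and the names are all hypothetical; the sketch only shows the two random draws happening in sequence.

```python
import random

random.seed(0)

# Hypothetical sampling frame: 500 localities, each with 1,000 residents.
localities = {f"town_{i}": [f"town_{i}_resident_{j}" for j in range(1000)]
              for i in range(500)}

# Stage 1: randomly pick 20 localities instead of visiting all 500.
chosen_towns = random.sample(list(localities), 20)

# Stage 2: randomly pick 100 respondents within each chosen locality.
respondents = [person
               for town in chosen_towns
               for person in random.sample(localities[town], 100)]

print(len(respondents))  # 20 towns x 100 respondents = 2000 interviews
```

The interviewers now travel to twenty places rather than up to two thousand, which is the whole point of the design.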

Second, a simple random sample of all Americans would need to be enormous to include a meaningful number of people who belong to the less common minority groups, American Indians, for example. This is because the sample is likely to contain each group in roughly its proportion of the target population, and so even a large simple random sample might include just one American Indian. If pollsters wished to understand the opinions of these smaller groups, they would then be basing all their evidence on one person's opinions. Not very sensible. The solution is to oversample the rarer groups, so that the study includes enough variety within the subgroup, and then to shrink back the share of this group in the overall results by weighting it down to the relative population share of the group.
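Here is a rough sketch of the oversample-then-weight-down step. All the numbers (the 1% population share, the opinion rates, the sample sizes) are invented for the illustration.

```python
import random

random.seed(1)

# Hypothetical setup: a minority group is 1% of the population.
# Suppose 30% of the minority and 50% of the majority favor some policy.
# Oversample: 200 minority respondents instead of the ~10 a simple
# random sample of 1,000 would yield, plus 800 majority respondents.
minority_sample = [random.random() < 0.30 for _ in range(200)]
majority_sample = [random.random() < 0.50 for _ in range(800)]

minority_rate = sum(minority_sample) / len(minority_sample)
majority_rate = sum(majority_sample) / len(majority_sample)

# Weighting down: in the overall estimate each group counts by its true
# population share (1% and 99%), not by its inflated share of the sample.
overall = 0.01 * minority_rate + 0.99 * majority_rate
print(round(overall, 3))
```

With 200 minority respondents the subgroup's own estimate is reasonably steady, yet the weights keep it from distorting the overall figure.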

Third (though in some ways this isn't completely separate from the second reason explained above), sometimes the question being studied suggests obvious subgroups which are internally very similar but very different from each other. It might make more sense in a setting like this to randomly sample some respondents within each subgroup, especially if what we are interested in are the very differences between the subgroups. An example would be to poll a certain number of individuals with each possible religious affiliation on the question of how these individuals view government sanctioned torture, or to poll anti-choice and pro-choice voters on their views on political topics other than reproductive choice.
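A small sketch of this stratified approach, with made-up affiliations and group sizes: the point is that each subgroup gets the same fixed number of respondents, however small the subgroup is in the population.

```python
import random

random.seed(7)

# Hypothetical sampling frame, grouped by affiliation (the "strata").
strata = {
    "affiliation_A": [f"A_{i}" for i in range(5000)],
    "affiliation_B": [f"B_{i}" for i in range(300)],
    "affiliation_C": [f"C_{i}" for i in range(80)],
}

# Draw a fixed number from each stratum, so even the smallest group is
# represented well enough to compare the groups against each other.
per_stratum = 50
sample = {name: random.sample(members, per_stratum)
          for name, members in strata.items()}

for name, picks in sample.items():
    print(name, len(picks))
```

A simple random sample of the same total size would have drawn only a handful from affiliation_C, making between-group comparisons shaky.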

This might be a good time to leave the kitchen and to remind all of us about the basic problem we have: There is this population (the soup) and we don't know its characteristics (what it tastes like). It's too expensive and time-consuming to study the whole population (drink all the soup) to find out, so we take a sample (a ladleful) and we try to make sure that our sample is representative of the population, like a microcosm of the population macrocosm. So we use a method of random sampling. This lets us exclude bias.

Our sample might still not reflect the population, just because we might have bad luck in the sampling (such as happening to get all the bay leaf in our ladle when tasting a soup), but statisticians have a way of figuring out what the risk of this happening might be. (This will be the topic of my next post on statistics.)

Note also that if we sampled a very large soup with a very tiny spoon we'd be unlikely to get a very good idea of the taste of the soup. I recently read about a study used to justify single-sex schooling where the population studied consisted of fewer than twenty teenagers and where the whole result touted in the media was based on two teenage boys' responses. Now this is a very tiny spoon, especially to use in an attempt to overturn the whole education system.

More generally, samples in statistics must be of a certain size to be meaningful representations of populations. How large depends partly on the population we are looking at. If it's very diverse we need a larger sample to capture that diversity. The size of the sample also depends on the precision we are seeking and on the kinds of questions we are asking.

But it's clear that asking one person in a telephone poll isn't enough to get an idea about the general views in the United States. What isn't quite as clear is the question of how many people we should pick for the sample to get a representative sample. Remember that the bigger the sample the more it will cost to interview or to study. This means that statisticians must weigh the needs for a larger sample against the costs of acquiring one, always remembering that one of the costs of a too-small sample is that it will have a greater chance of being unrepresentative. More about that later on, too.
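As a small preview of that tradeoff, here is a back-of-the-envelope sketch. The standard error of an estimated percentage (taking the worst case, a 50/50 split) shrinks with the square root of the sample size, so quadrupling the sample only halves the uncertainty, while quadrupling the interviewing cost.

```python
import math

# Rough standard error of an estimated proportion, worst case p = 0.5:
# se = sqrt(p * (1 - p) / n), which shrinks like 1 / sqrt(n).
for n in (100, 400, 1600):
    se = math.sqrt(0.5 * 0.5 / n)
    print(n, round(se, 4))
```

Each fourfold jump in sample size buys only a halving of the standard error, which is why statisticians stop somewhere instead of interviewing everyone.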
