Wednesday, January 05, 2005

Exit Polls Make My Heart Beat Faster

They do. Statistics is one of those fields which are almost impossible to popularize without making serious mistakes, and I have noticed that the many articles on the exit polls of the 2004 presidential elections in the U.S. don't seem to make the arguments clear enough for the average intelligent reader. My attempts are no better in this respect, but I'm going to try nevertheless, with a new study by Jonathan D. Simon and Ron P. Baiman ('The 2004 Presidential Election: Who Won the Popular Vote?'), because this is a very important topic and statistics shouldn't keep it from being understood more widely.

Here we go: Suppose that some unknown admirer delivers you a large shipment of wonderful chocolates, say, a hundred million boxes. They are delectable chocolates, and you want to give a box to most everyone you know and even to people you don't know. But you worry that such joy can't be for you, that something, after all, might be wrong with the chocolates. Why else would someone send them to you?

So you walk around this mountain of chocolates in your warehouse, thinking over this problem, one of quality control. Finally you decide to hire lots of people to go around and randomly test chocolates by opening a box here and there and eating its contents. You tell your testers to eat a total of 13,047 boxes, and to rate each box as either "great" or "so-so". Because a hundred million boxes is a lot of boxes and takes up a large area, you tell your testers that they can each pick a corner of the warehouse or one wall and take their samples from that assigned area only, but you scatter the testers so that most of the perimeter of the chocolate mountain is covered.

The results come in from this testing. It turns out that 50.8% of the boxes are labeled "great" and 48.2% "so-so". The rest are unassigned to these categories.

The testers go home with aching tummies, and you send out all the remaining boxes with a questionnaire asking the eater to rate the chocolates as either "great" or "so-so". The responses come back: 50.9% say that the chocolates are "so-so" and only 48.1% think that they are "great". Thus, your testing indicated that the majority of the chocolates are "great", but the overall eating indicated the opposite: that the majority were just "so-so".

This doesn't seem to mean very much in my example, but suppose that the question was whether the chocolates were spoiled or not. Then the example becomes a lot less trivial. Or suppose that the testing is exit polling, the final eating is the election results, and the qualities of chocolate are votes for Kerry or Bush.

You might want to know why your testing didn't produce the same ratio of great to so-so as the final questionnaires, and you might also want to know whether the difference matters. As an example of the former, maybe your testers somehow picked a larger share of great boxes than the chocolate mountain actually contains. Or maybe the questionnaires you sent to all the eaters worded the question badly, so that the eaters answered differently than your testers.

It's possible to sample something and get results which don't represent the whole process or population. For example, if you make soup, tasting it is a way of taking a sample of its flavor. If you don't stir first you might taste only the last seasoning you added and get the flavor wrong. In statistics the stirring is achieved by making certain that the sample is drawn as randomly as possible from the whole population, without focusing only on some corner of the warehouse. Because we allowed the testers to take clusters of boxes rather than requiring them to run to a different spot for every box, the sample is a little less random than it could be, so there is a slightly larger chance that it doesn't represent the whole mountain.
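For readers who like to see the stirring problem in numbers, here is a small simulation sketch. Everything in it is made up for illustration: the idea that each area of the warehouse has its own share of "great" boxes, the 49% average, the amount of variation between areas, and the sample sizes. It compares testers who pick every box from a fresh spot with testers who each take many boxes from one area, and looks at how much the estimated share of "great" boxes jumps around from one repetition to the next.

```python
import random
import statistics

def area_share(rng):
    # Hypothetical assumption: each warehouse area has its own share of
    # "great" boxes, scattered around 49% and clipped to the range 0..1.
    return min(max(rng.gauss(0.49, 0.10), 0.0), 1.0)

def simple_random_sample(rng, n):
    # Every single box comes from a freshly chosen area (well stirred soup).
    hits = sum(rng.random() < area_share(rng) for _ in range(n))
    return hits / n

def clustered_sample(rng, n_areas, boxes_per_area):
    # Each tester stays in one area and eats many boxes from it.
    hits = 0
    for _ in range(n_areas):
        share = area_share(rng)
        hits += sum(rng.random() < share for _ in range(boxes_per_area))
    return hits / (n_areas * boxes_per_area)

def spread(estimates):
    # Standard deviation of the estimates across repetitions.
    return statistics.pstdev(estimates)

rng = random.Random(2005)  # fixed seed so the run is repeatable
reps = 300
srs = [simple_random_sample(rng, 1000) for _ in range(reps)]
clu = [clustered_sample(rng, 20, 50) for _ in range(reps)]

print("spread, box-by-box sampling:", round(spread(srs), 4))
print("spread, clustered sampling: ", round(spread(clu), 4))
```

Both schemes eat 1,000 boxes per repetition, but under these assumptions the clustered estimates wobble more, which is the reason clustered polls need wider error bars than their raw sample size suggests.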

Here comes the statistics bit: We can figure out the odds of getting a result as far off as the one we got by running a mental experiment. Suppose that the chocolate mountain actually contains 50.9% "so-so" boxes of chocolate. If we could go back and repeat the testing, say, a hundred times, in how many of those hundred times would our testers report 50.8% or more "great" boxes? Or in election terms, if 50.9% of voters actually voted for Bush, how often would exit polls, repeated (mentally) a hundred times in exactly the same way, lead us to believe that Kerry's share of the vote is at least 50.8%?

The answer to these questions is that we would get such testing or exit poll results not even once in a hundred repetitions. In fact, we would get this result once in 959,000 repetitions.
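The mental experiment can be roughed out with a normal approximation. This is only a sketch, not the calculation from the paper: I don't reproduce how the once-in-959,000 figure was derived. The function names are mine, and the `design_factor` of 1.3 is an assumed, illustrative allowance for the testers having sampled in clusters; a value of 1.0 would mean a perfectly random sample.

```python
import math

def upper_tail(z):
    # P(Z >= z) for a standard normal variable, computed via erfc so that
    # very small tail probabilities are not lost to rounding.
    return 0.5 * math.erfc(z / math.sqrt(2))

def poll_tail_probability(true_share, poll_share, n, design_factor=1.0):
    # Standard error of a sample proportion under simple random sampling,
    # inflated by design_factor (an assumption) to allow for clustering.
    se = design_factor * math.sqrt(true_share * (1 - true_share) / n)
    z = (poll_share - true_share) / se
    return z, upper_tail(z)

# Kerry: 48.1% in the count, 50.8% in the exit poll, 13,047 respondents.
z_srs, p_srs = poll_tail_probability(0.481, 0.508, 13047)
z_clu, p_clu = poll_tail_probability(0.481, 0.508, 13047, design_factor=1.3)

print(f"simple random sample: z = {z_srs:.2f}, about 1 in {1 / p_srs:,.0f}")
print(f"with assumed factor:  z = {z_clu:.2f}, about 1 in {1 / p_clu:,.0f}")
```

Treated as a simple random sample, the gap is more than six standard errors, which is far rarer than one in a billion. With the assumed 1.3 inflation the odds soften to roughly one in a million, still nowhere near something that happens once in a hundred repetitions.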

This means either that the chocolate testing was not a fair draw from the chocolate mountain or that the essentially impossible happened. I would argue for the first explanation. In other words, the chocolate testing and the exit polls each differ from the final results (from eating the mountain and from voting, respectively) by more than chance can plausibly explain.

What went wrong? Here I drop the chocolate example, as it's getting awfully boring. Two possibilities exist: either the exit polls were not done properly or the election results were not proper. If the first possibility is the correct one, the pollsters should explain why their methods resulted in such very biased findings (biased meaning that the error almost always favored Kerry; if the error had been random we should expect some samples to favor Bush). As Simon and Baiman point out, the two theories that exist about this are not at all strong. The first one argues that the exit polls oversampled women and that women were more likely to vote for Kerry. But even if we adjust the sample for this, the gap between the exit polls and the final results remains essentially impossible to attribute to chance. The second theory is the "reluctant voter" idea: that Republican voters are less likely to fill in the form that was used in exit polling. There is no evidence for this theory, nothing to explain why Republicans would be more reluctant than Democrats.

Thus, unless the final release of the exit poll data tells us something new, the conclusion is that we should address the question of the election process itself. This is not being a tinfoilhatter or a conspiracy theorist or a fraudster (using just some of the labels that have cropped up among the Democrats on the net). It is common sense.

Note that the Simon-Baiman paper addresses the popular vote, not the specific situation in Ohio or Florida. In the plainest terms, it asks some very awkward questions about Bush's victory margin in the popular vote numbers. Though these numbers don't legally matter in deciding the winner of the elections, they do offer Bush a facade of legitimacy which he didn't have after the 2000 elections (as Gore won the popular vote).

You might wonder why I make such a fuss about differences between numbers which are very close together. How can we make such clear-cut arguments about a poll that ended up showing 50.8% of votes for Kerry and an election that gave him 48.1% of the votes? The answer is in the enormous numbers used in the exit polling: 13,047 voters answered the questions in the national poll. The more data we have, the sharper the distinctions we can make with some confidence.
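To put a rough number on "the more data, the sharper the distinctions", here is a small sketch of the usual 95% margin of error for a sample proportion. It assumes a simple random sample and a share near 50%, which flatters a clustered exit poll somewhat; the function name and the smaller sample sizes are mine, chosen only for comparison.

```python
import math

def margin_of_error(n, share=0.5, z=1.96):
    # Half-width of an approximate 95% interval for a sample proportion,
    # assuming a simple random sample of n respondents.
    return z * math.sqrt(share * (1 - share) / n)

for n in (100, 1000, 13047):
    print(f"n = {n:>6}: plus or minus {100 * margin_of_error(n):.1f} points")
```

With a hundred respondents the margin is nearly ten points, so a gap of 2.7 points would mean nothing. With 13,047 respondents the margin drops below one point, and the same 2.7-point gap becomes hard to shrug off.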

It is very important to explain why the exit poll results differed as much as they did from the final results. If we don't get an acceptable explanation, and preferably quite soon, the legitimacy of any future elections in the U.S. will be doubtful.

I have not included the criticisms of the Simon-Baiman paper here, but you can get a flavor of them at the Mystery Pollster. The problem with even the statistical wars about this topic is that the individuals' political views seem to affect their ideas about statistics, too. This is nothing new, of course, but we sometimes forget to be wary when reading more scientific-looking stuff. In general, the only criticism that is important is that the standard errors may be underestimated in this paper as in the earlier Friedman paper. But they could be enormously larger without changing any of the conclusions the authors make.