Wednesday, November 15, 2006

Statistics Primer. Part 6: Wrapping it All Up In A Nice Package

My series on a primer for statistics is pretty much done. I could go on a little bit longer about how confidence intervals can be improved when a poll uses cluster sampling, for example, but you can google "bootstrap confidence intervals" yourself. And it's worth reminding all you erudite and nice readers that no poll is better than the basic plan for collecting the data. If the researchers did not use random sampling and/or if the sampling frame was selected poorly, the results mean nothing outside the actual group that is being questioned. The crucial question to answer is whether the findings can be generalized to the population that we are interested in, for example, to all Americans, and a poor polling plan makes the answer always negative.

Even a random sample can be off because of the effect of pure luck, but that sampling error can be quantified, as discussed in the last installment of this series. But when a sample is non-randomly drawn, it can also be biased, and we cannot quantify the bias. For instance, a study which tries to find out the health habits of Americans by only asking people at the emergency clinics of hospitals is going to be biased. That is a bad way of trying to get a random sample, because it will oversample people who are sick right now and it is also likely to oversample poor people who have no regular general practitioner to go to. A similar problem may be created in polls which use land-line telephones, if the individuals who only have a cell phone are different from those who have a land-line phone. It's possible that the group these polls reach is older, on average, and more likely to be at home. If being older and spending more time at home answering the phone makes one's opinions different from those of people who flit about with a cell phone on their belt, then the polls could be biased.

A further selection bias in polls may come from non-response bias. Lots of people refuse to answer polls altogether. If the people who refuse have different opinions, on average, than those who eagerly chat with pollsters, then the results are likely to be biased, i.e., not generalizable to the population we want to learn about, say, likely voters.
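To see how non-response alone can move a poll, here is a small arithmetic sketch in Python. The response rates and the 50% true share are numbers I made up for illustration, and the function name is hypothetical; nothing here comes from a real poll.

```python
def observed_share(true_share, response_rate_yes, response_rate_no):
    """Share of 'yes' answers among the people who actually respond."""
    yes_respondents = true_share * response_rate_yes
    no_respondents = (1 - true_share) * response_rate_no
    return yes_respondents / (yes_respondents + no_respondents)

# Made-up numbers: half the population would say yes, but yes-sayers
# agree to be polled 30% of the time and no-sayers only 20% of the time.
poll_result = observed_share(0.50, 0.30, 0.20)
print(f"True share: 50%, poll shows: {poll_result:.0%}")
```

With these invented rates the poll reports 60% even though the true figure is 50%, and no amount of extra sample size would fix it, because the error comes from who answers and not from luck.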

Then there is the way questions are framed in polls. It is well known that a certain answer can be elicited by just changing the question. Introducing clear judgemental components is one nifty trick to achieve this. Or one could use versions of the old "Have you started brushing your teeth yet?" One reason why polls on what to do about abortions get such different results is in the way the questions are framed.

Statistics is a very large field, and what I've touched in this short series of posts is just a very small square inch of the field. I linked to some internet courses on statistics earlier on and I encourage you to pursue more study on your own. Or you can ask me questions either by e-mail or in the comments threads, and if I know the answer I will let you know what it is. Statistics can be fun! Honest!

To finish off these meanderings, I want to talk about something that pisses me off: the way the term "statistically significant" is misused all over the place. First, this term is a technical one, and the "significant" part does NOT have its everyday meaning. If a finding is statistically significant it could be totally unimportant in everyday utility, not earth-shaking at all, even trivial and frivolous. And a statistically nonsignificant finding does NOT mean that the study found nothing of importance. Let me clarify all this a little.

We have talked about confidence intervals in this series. If you have taken a statistics intro course you may remember that the topic of confidence intervals is usually followed by the topic of hypothesis testing, and that is the application where statistical significance is commonly introduced. "Hypothesis" is just a guess or a theory you have about the likely magnitude of some number or relationship between numbers, and "testing" means that we use data we have collected to see how your guess fares.

The way this testing goes is by setting the theory you DON'T support as the one you try to disprove. The theory you don't support is usually the conventional knowledge you try to prove wrong or the idea that some new policy or treatment has no effect and so on, and it's called "the null hypothesis". The theory you secretly want to prove is then called the "alternative hypothesis". And yes, statisticians are terrible wordsmiths.

So your hypothesis testing tries to prove the null hypothesis wrong. If you can prove it wrong then the alternative hypothesis must be right. Of course for this to work you must frame the two hypotheses so that nothing falls outside them. An example might be useful here:

Suppose that you have a new treatment for the symptoms of some chronic disease. You run a study where you give the new treatment to some patients randomly and the old treatment to a similar group of patients, also randomly selected. You then measure the reduction in unpleasant symptoms in the two groups, and you use these data to determine if the new treatment is worthwhile. Now, the null hypothesis here could be that the new treatment is the same as the old treatment. The alternative hypothesis would then be that the new treatment is different; it could be either better or worse than the old treatment. If you decide to test this pair of hypotheses, you are said to do a two-sided test of hypotheses, because both large and small values in your experimental group might be evidence that the new treatment is different from the old one. It could be better or it could be worse.

It is more likely that you are interested in finding if the new treatment is better than the old one. This would be a one-sided test of hypotheses, and you would write the null hypothesis differently. It would be that the new treatment is either the same as the old treatment or worse. Then the alternative hypothesis would be that the new treatment is better than the old treatment. If fewer bad symptoms is what the test measures then only low values in the experimental group would support the idea that the new treatment works better.

Given that you are using sample data to do all this, you add something like the staple of the confidence intervals to your testing procedure, and you report the results in a form which tells the informed reader how likely the disproving of the null hypothesis is to work in the population rather than in the sample. The staple we use in the two-sided test of hypotheses is exactly the same as we used for confidence interval construction, except that we construct the interval for the way the world looks if the null hypothesis is true! Remember the 95% confidence interval? In a two-sided test of hypothesis, using this level of confidence translates into a 5% significance level of the results, meaning that the interval we have created around the possible mean under the null hypothesis is so long that it only omits the utmost 2.5% of the possible distribution of sample means at each end of the distribution.

Now suppose that the made-up study I have described here finds that the sample mean of bad symptoms in the experimental group is so low that the probability of drawing such a value from a population actually centered on the average symptoms from the old treatment is at most 2.5%. Then statisticians using the 5% level of significance would argue that the study disproves the null hypothesis that the new treatment is no different from the old one.
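For those who like to see the machinery, here is a minimal sketch of the two-sided test in Python. It rests on assumptions that are mine and not part of the study story: the average under the old treatment (10 symptoms) and the standard deviation (4) are treated as known, the sample has 64 patients, and sample means are taken to be normally distributed. All the numbers and the function name are invented.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_test(sample_mean, null_mean, sd, n, alpha=0.05):
    """Build the interval around the null mean and check the sample mean."""
    se = sd / sqrt(n)                          # standard error of the mean
    z = NormalDist().inv_cdf(1 - alpha / 2)    # about 1.96 when alpha = 0.05
    lower = null_mean - z * se
    upper = null_mean + z * se
    reject = sample_mean < lower or sample_mean > upper
    return lower, upper, reject

# Made-up study: old treatment averages 10 symptoms, new group averages 8.9.
lower, upper, reject = two_sided_test(sample_mean=8.9, null_mean=10.0, sd=4.0, n=64)
print(f"Interval under the null: {lower:.2f} to {upper:.2f}, reject: {reject}")
```

The interval here is the staple placed around the null hypothesis value, leaving 2.5% in each tail. A sample mean that lands outside it is what gets called statistically significant at the 5% level.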

If the study had used a one-sided test of hypothesis, the story changes slightly. It's as if we are only going to look at the staple arrows missing the bull's eye when they do it on one side of the dartboard (read the earlier posts for this metaphor), and we are going to hone down the staple width appropriately to do that. Thus, a 90% confidence interval based on the null hypothesis would leave 5% in each end of the distribution uncovered, and in my example we'd only look at the lower end of the distribution to see if the experimental results fall into that area or not. If they do, we reject the null hypothesis at the same 5% level of significance (or at 0.05 level if you go all decimal on me). If the results fall elsewhere on the distribution, we keep the null hypothesis and find the results not statistically significant.
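The one-sided version can be sketched the same way. Again the assumptions are mine: a known old-treatment average of 10, a known standard deviation of 4, 64 patients, and normally distributed sample means. The function name and every number are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_lower_test(sample_mean, null_mean, sd, n, alpha=0.05):
    """Reject the null only if the sample mean falls in the lower tail."""
    se = sd / sqrt(n)                   # standard error of the mean
    z = NormalDist().inv_cdf(alpha)     # about -1.645 when alpha = 0.05
    cutoff = null_mean + z * se         # lower end of the 90% interval
    return cutoff, sample_mean < cutoff

cutoff, reject = one_sided_lower_test(sample_mean=9.1, null_mean=10.0, sd=4.0, n=64)
print(f"Cutoff: {cutoff:.2f}, reject: {reject}")
```

With these made-up numbers the cutoff is about 9.18, so a sample mean of 9.1 is statistically significant in the one-sided test. The same 9.1 would not be significant in a two-sided test at the 5% level, because the lower end of the 95% interval sits at about 9.02. The choice between one-sided and two-sided testing can matter for borderline results.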

But of course finding that the new treatment is no better than the old one IS significant! And finding that something is "statistically significant" just means that the null hypothesis was rejected at the 0.05 or 5% level, that the researchers used either a 95% or a 90% confidence interval for the null hypothesis data, depending on whether the test was two-sided or one-sided.

Likewise, I could make up a silly study about something quite silly and find the results statistically significant without saying anything at all about their real-world relevance. Note also that we could find something to be statistically significant and that something could be such a minor effect in reality that it would hardly matter at all.
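A quick sketch of that last point, with invented numbers: suppose the new treatment lowers the average number of symptoms by a measly 0.05, with a standard deviation of 4. The sketch assumes a one-sided test at the 5% level and a normal distribution of sample means, and the helper name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_score(difference, sd, n):
    """How many standard errors the observed difference is from zero."""
    return difference / (sd / sqrt(n))

critical = NormalDist().inv_cdf(0.05)   # about -1.645, one-sided 5% level

# Same tiny effect, two very different sample sizes.
results = {}
for n in (100, 1_000_000):
    z = z_score(-0.05, 4.0, n)
    results[n] = z < critical
    print(f"n = {n:>9,}: z = {z:.3f}, statistically significant: {results[n]}")
```

The effect is identical in both runs. Only the sample size changes, yet the tiny difference becomes statistically significant once the sample is huge. Significance tells you the effect is unlikely to be pure luck, not that it is big enough to care about.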

The convention is to call the results "statistically significant" if the null hypothesis is rejected at the 0.05 or 5% level. If the researchers used 0.01 or 1% level in their calculations, then any rejecting of the null hypothesis that takes place is "statistically very significant". That's all these terms mean.

Many studies now dispense with the terms altogether and instead report something called p-values. These are the actual probabilities of getting the experimental sample result, or one more extreme, if the null hypothesis was in fact the correct one. The smaller the p-value is, the less likely the null hypothesis looks. You can always compare the p-values in your head to the conventional 0.05 and 0.01 values if you so wish.
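Here is a minimal sketch of how such a p-value might be computed, under assumptions of my own: a known old-treatment average of 10 symptoms, a known standard deviation of 4, 64 patients, a sample mean of 8.9, and normally distributed sample means. The function name and the numbers are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def p_values(sample_mean, null_mean, sd, n):
    """Return (one-sided lower-tail p, two-sided p) for a z-test."""
    se = sd / sqrt(n)                            # standard error of the mean
    z = (sample_mean - null_mean) / se
    one_sided = NormalDist().cdf(z)              # chance of a mean this low or lower
    two_sided = 2 * NormalDist().cdf(-abs(z))    # this far out in either direction
    return one_sided, two_sided

p_one, p_two = p_values(sample_mean=8.9, null_mean=10.0, sd=4.0, n=64)
print(f"one-sided p = {p_one:.4f}, two-sided p = {p_two:.4f}")
```

With these numbers the one-sided p-value is roughly 0.014 and the two-sided one roughly 0.028, so both would pass the 0.05 convention and neither would pass the 0.01 one.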

The end of this particular road. I hope that you have enjoyed the series and that you will now go out and learn the zillions of additional things in statistics.
The earlier posts in this series are here:
Part I
Part II
Part III
Part IV
Part V