Monday, October 09, 2006

Statistics Primer. Part 3: Sample Statistics



Here is where writing this primer gets difficult. An introductory course in statistics is at least one semester long for a good reason. To do something much shorter requires a certain number of omissions and a couple of rough approximations or almost-lies. So if you know a lot of statistics, go gently on me in the comments section. OK?

Suppose, then, that we have some information we have gathered by a proper random sampling process. It could be the yearly incomes of one hundred people, ranging from zero dollars to, let's say, two hundred thousand, and we want to do something interesting with these sample data. For one thing, nobody is going to think we are great if we just print out the hundred numbers on a piece of paper and distribute it. Human beings are not good at seeing general patterns in numbers like that. So what can we do to summarize the information?

Two things come to mind right away. We could try to condense the information into just a few numbers or we could try to make a mental picture of it (a topic I might or might not cover in this primer, depending on whether it seems needed). Let's begin by trying to condense the information into just a few numbers, called sample statistics. These statistics are numbers, to be distinguished from the science of statistics in general. If you could only give one sample statistic to represent all the information in the sample of one hundred incomes, what would it be?

Probably some measure of central tendency, meaning that the number we pick should somehow represent what is average, common or representative in the sample. There are three candidates for this measure: the mean (or the arithmetic average), the mode (or the most common value) and the median (or the middle value). Most of us are familiar with the mean, and it turns out to be the overall winner for reasons that have more to do with its statistical usefulness than its ability to otherwise beat the competition. But the mode and the median are also handy to know about.

In our income example, the mode would be the income value which appears most often in our sample. The median income of the sample would be found as follows: Arrange the hundred income numbers in increasing order. Give each individual an ordinal number corresponding to his or her place in the line-up. Then the median income is the income of the individual who is standing smack in the middle of the line-up. Oops, you say now. There is nobody standing there! True, because my sample has an even number of observations. The trick in this case is to use the arithmetic average of the two incomes belonging to the two people on either side of the missing central person.

An example might come in useful here, and one with fewer than a hundred numbers. Suppose that we have some data on yearly incomes for five people only, and the incomes are as follows:

0, 45,000, 45,000, 70,000, 100,000

I have put them into an increasing order for your convenience. The arithmetic average for this sample is 52,000. The mode is 45,000 (it occurs twice and no other figure occurs more than once) and the median is the middle income in this ordered array or 45,000. Note that the three measures of central tendency may or may not be the same and that each of them might be useful for different purposes, including different political manipulations. For example, see what happens when I add one more observation to the sample:

0, 45,000, 45,000, 70,000, 100,000, 700,000

The arithmetic mean is now 160,000! The mode is still 45,000, but the median is now the income half-way between the second 45,000 and the 70,000 figure following it or 57,500.
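If you would rather let a computer do the arithmetic, here is a minimal sketch of the two income examples above. It assumes Python and its standard `statistics` module, which is my choice for illustration and not something this primer otherwise relies on; the variable names are made up.

```python
from statistics import mean, median, mode

# The two income samples from the text (names are illustrative).
five_incomes = [0, 45_000, 45_000, 70_000, 100_000]
six_incomes = five_incomes + [700_000]

for label, incomes in (("five incomes", five_incomes), ("six incomes", six_incomes)):
    print(label)
    print("  mean:  ", mean(incomes))    # arithmetic average
    print("  median:", median(incomes))  # middle value, or average of the two middle values
    print("  mode:  ", mode(incomes))    # most common value
```

Notice how the one very large income drags the mean up to 160,000 while the mode stays put and the median moves only a little.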

You might want to play with this a little more. For example, it's possible to have more than one mode. Take out the 700,000 I added and replace it with a second 100,000 figure, and both 45,000 and 100,000 now occur twice. But there is always only one mean and one median.
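The two-mode case can be checked with a short sketch too. This assumes Python 3.8 or later, where the standard `statistics` module has a `multimode` function that returns every value tied for most common; the variable names are made up.

```python
from statistics import mean, median, multimode

# The six-income sample with 700,000 swapped for a second 100,000.
two_mode_incomes = [0, 45_000, 45_000, 70_000, 100_000, 100_000]

modes = multimode(two_mode_incomes)  # all values tied for most frequent
print("modes: ", modes)
print("mean:  ", mean(two_mode_incomes))    # still a single number
print("median:", median(two_mode_incomes))  # still a single number
```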

As I mentioned above, the mean is the workhorse among these measures of central tendency, even though it may not always be the most representative single number in a sample. What we ultimately use it for is estimating the corresponding number in the population. For example, if we found that the average income of the one hundred people in my original sample is 67,000, then we could use that as a point estimate of the average income of all people in the population I drew the sample from. But this sounds a little dangerous, doesn't it? After all, I might have gotten the same sample mean from a sample of only ten people, and the mean by itself clearly isn't a very good guess if the sample incomes varied wildly all over the place.

What about that varying wildly all over the place? Let's take a different imaginary set of three samples:

Sample A:

7, 8, 10, 12, 13

Sample B:

7, 9, 10, 11, 13

Sample C:

10, 10, 10, 10, 10

All these samples have the same mean, 10. (Note that the mean doesn't have to be one of the numbers in the initial sample; it just happened to be in this case.) But the samples are clearly showing very different stuff otherwise, and if we only reported the mean to someone we'd be omitting important information. Sample C is just the same number five times. Sample B has the three numbers in the center closer to each other than is the case in sample A. So A has the most variation of the three. How could we express that in one single number?

Statisticians came up with a way of doing it. To understand the thinking behind the favorite selected for the job it might be useful to discard a few other candidates first.

The starting point would be to note that we need to fix the measure of scatter to something and the mean is already there as a good candidate for that. What if we measured the general variation in the sample by looking at the distance of the various sample values from the mean? The further these values fall from the mean, on average, the more scattered is the sample, after all. Suppose that we calculated all these distances. To get just one number to reflect the dispersion we could use the average of the distances.

Let's try it for sample A. The first distance is 7-10 = -3, the second 8-10 = -2, the third 10-10 = 0, the fourth 12-10 = 2 and the fifth 13-10 = 3. To make these into one overall measure of the scatter or dispersion we could add them up and then divide by the sample size, five. Except that what we get as the sum of the distances from the mean is zero.
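Here is the same dead end as a small sketch, assuming Python; the variable names are mine. The signed distances from the mean always add up to zero, whatever the sample, because the mean is the balance point of the data.

```python
sample_a = [7, 8, 10, 12, 13]

mean_a = sum(sample_a) / len(sample_a)
deviations = [x - mean_a for x in sample_a]  # signed distances from the mean

print("mean:", mean_a)
print("deviations:", deviations)
print("average deviation:", sum(deviations) / len(deviations))
```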

That's why this one was rejected. The problem has to do with the negative and positive values canceling each other out. So a slightly different approach would be to use the absolute values of the distances in these calculations. This would work, but it turns out to be cumbersome in the various statistical uses the measure is put to later on. Still, the idea of getting rid of the negative signs in the distances or deviations around the mean is a good one. Is there any other way we could get this trick to work?
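For the curious, the absolute-value candidate looks like this as a sketch, assuming Python. The function name `mean_absolute_deviation` is made up for this illustration and is not a standard library function.

```python
def mean_absolute_deviation(data):
    """Average distance from the mean, ignoring the signs (illustrative helper)."""
    center = sum(data) / len(data)
    return sum(abs(x - center) for x in data) / len(data)

samples = {
    "A": [7, 8, 10, 12, 13],
    "B": [7, 9, 10, 11, 13],
    "C": [10, 10, 10, 10, 10],
}

for name, data in samples.items():
    print(name, mean_absolute_deviation(data))
```

It does rank the three samples the way our eyes did, with A the most scattered and C not scattered at all.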

Yes, and that is by squaring the deviations around the mean before we add them up and then average them. To get back to the original units we used we then take the square root of the result. This number is called the standard deviation. The number we have before we take the square root is the variance.

Dividing by the sample size of five, the variances for the three samples A, B and C are 5.20, 4.00 and 0 respectively, and the standard deviations 2.28, 2.00 and 0. For those who know some statistics and want to know more about how to average the squared deviations around the mean correctly, see the footnote marked with an asterisk.
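These figures can be reproduced with a short sketch, assuming Python's standard `statistics` module. Its `pvariance` and `pstdev` functions divide by the sample size, which matches the averaging used in the text; the dictionary of samples is just my way of organizing the example.

```python
from statistics import pvariance, pstdev

samples = {
    "A": [7, 8, 10, 12, 13],
    "B": [7, 9, 10, 11, 13],
    "C": [10, 10, 10, 10, 10],
}

for name, data in samples.items():
    # pvariance: average squared deviation around the mean (divides by n)
    # pstdev: square root of that, back in the original units
    print(name, round(pvariance(data), 2), round(pstdev(data), 2))
```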

To recap the conversation so far: We have two formulas, one of which is the average value in the sample, the mean, and the other one of which is the average squared deviation around the mean (for the variance) or the square root of that (for the standard deviation). I'm not sure if you can see how these could start a pattern for more formulas. For example, suppose that we calculated the average cubed deviation around the mean and so on. What might we get? It turns out that we'd get measures for finding out how lopsided our sample might be and other interesting things like that. Lopsided distributions aren't going to be central in what I'm covering here, but they can be quite fascinating.
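To make the pattern concrete, here is a rough sketch, assuming Python. The helper `average_powered_deviation` is a made-up name, and what it computes with a power of three is only the raw average cubed deviation, not the rescaled skewness coefficient that textbooks usually report.

```python
def average_powered_deviation(data, power):
    """Average of (value - mean) raised to the given power (illustrative helper)."""
    center = sum(data) / len(data)
    return sum((x - center) ** power for x in data) / len(data)

sample_a = [7, 8, 10, 12, 13]
six_incomes = [0, 45_000, 45_000, 70_000, 100_000, 700_000]

# Power 2 gives back the variance from before.
print(average_powered_deviation(sample_a, 2))

# Power 3: zero for the symmetric sample A, positive for the income
# sample, whose one huge value stretches it out to the right.
print(average_powered_deviation(sample_a, 3))
print(average_powered_deviation(six_incomes, 3) > 0)
```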

You may be silently complaining that none of this seems to have much to do with opinion polls where the data we get tend to consist of verbal answers to questions. Data like that are qualitative, not quantitative, and we can't just storm ahead to calculate means and variances for them. But there is a way around that problem, and that is by counting.

Suppose that we have asked five people whether they prefer Smith or Jones to be their state senator in the next elections, and suppose that four people say they prefer Jones and one person says that she or he prefers Smith. It turns out that all the work we have done can serve here, too, if we make one additional change: We are going to count each expressed preference for Jones as 1 and each expressed preference for Smith as 0. The data will then look like this:
1, 1, 1, 1, 0
and the mean of these numbers will be 4/5 or 0.80. Using the earlier formula for the variance gives us the figure 0.16, which produces 0.40 for the standard deviation.

The only snag here is that we might as well have counted the votes for Smith as 1, and then the mean would have been 0.20. But the variance and standard deviation are unchanged (though usually we have a shortcut formula for calculating these rather than doing what I made you do here for learning purposes).
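Here is the poll example as a sketch, assuming Python's standard `statistics` module. I am also assuming that the shortcut in question is the usual one for 0/1 data, the proportion times one minus the proportion; the variable names are made up.

```python
from statistics import mean, pvariance, pstdev

jones_votes = [1, 1, 1, 1, 0]                # 1 = prefers Jones, 0 = prefers Smith
smith_votes = [1 - v for v in jones_votes]   # same answers, counted the other way

p_jones = mean(jones_votes)  # sample proportion for Jones
p_smith = mean(smith_votes)  # sample proportion for Smith

print("proportions:", p_jones, p_smith)
print("variance, long way:", pvariance(jones_votes), pvariance(smith_votes))
print("variance, shortcut:", p_jones * (1 - p_jones))
print("standard deviation:", pstdev(jones_votes))
```

Whichever candidate gets coded as 1, the variance comes out the same, because swapping the two proportions leaves their product unchanged.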

The mean for binary data like this is called the sample proportion rather than the sample mean, and we need to decide which of the two alternatives we are going to focus on. But nothing is lost, as you can clearly see if we multiply the proportions by 100 to get percentages. If Jones gets 80% in the poll then Smith must get 20% (assuming everybody expressed preference for one or the other or that we took out the undecideds before the calculations started).

That's probably enough for one post. Note that I gave you no actual formulas and neither did I give you the Greek letters usually employed to denote the population mean, proportion and variance or the letters used to denote the sample equivalents. We'll see how far I get before I have to do something like that. But you should now know what people mean by a sample mean, a sample proportion or a sample standard deviation, and that the mysterious population lends itself to calculating corresponding measures should we ever have enough time and money to do that.
-----
*You may be aware that the formulas for sample variances and standard deviations usually don't employ the sample size, n, as the denominator, but n-1. I omitted the necessary explanation for that here, because my goals are more modest for this series. But someone asked why the formulas usually employ n-1 rather than n.

The answer has to do with the final use of these formulas, which is to estimate population equivalents to the sample concepts we have talked about here. There is no really easy and juicy way of explaining this (at least I haven't found one), but perhaps the best explanation has to do with the concept called degrees of freedom (d.o.f.). Roughly, the degrees of freedom is the number of independent sample observations we have for calculating a formula such as the sample mean or the sample standard deviation. Note that when we calculated the mean we could use all the sample data freely in that work. But when we calculated the variance we were using the previously calculated mean, with the added restriction that the sum of the sample observations divided by the sample size must equal that value. So we lost some independence there and this is reflected in the use of n-1 when we figure out the average dispersion in the sample. Sigh. I'm afraid that this wasn't very helpful unless you already knew all about it.

More generally, the degrees of freedom is the sample size minus the number of population parameters already estimated from the sample. Here we have only one such already estimated parameter, namely the population mean, estimated here by the sample mean.
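To see what the choice of denominator does in practice, here is a sketch for sample A from the main text, assuming Python's standard `statistics` module. There `pvariance` divides by n, as I did above, and `variance` divides by n-1.

```python
from statistics import pvariance, variance, pstdev, stdev

sample_a = [7, 8, 10, 12, 13]

# The sum of squared deviations around the mean is 26 either way;
# only the denominator differs.
print("divide by n:  ", pvariance(sample_a), round(pstdev(sample_a), 2))
print("divide by n-1:", variance(sample_a), round(stdev(sample_a), 2))
```

The n-1 version is always a little larger, and the gap shrinks as the sample grows.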