Sunday, August 07, 2011

A Re-Post: The Decline Effect

This is from last January. Jonah Lehrer's article is important because of the implications it has for certain biased types of research into gender differences. Here it stands as a prequel to my posts on gender science this week.

Jonah Lehrer has written a fascinating article about something called the decline effect. His article reads like a detective story from beginning to end, even for me, who knew whodunit before starting.

The decline effect can be best defined by examples:

By 2001, Eli Lilly's Zyprexa was generating more revenue than Prozac. It remains the company's top-selling drug.

But the data presented at the Brussels meeting made it clear that something strange was happening: the therapeutic power of the drugs appeared to be steadily waning. A recent study showed an effect that was less than half of that documented in the first trials, in the early nineteen-nineties. Many researchers began to argue that the expensive pharmaceuticals weren't any better than first-generation antipsychotics, which have been in use since the fifties. "In fact, sometimes they now look even worse," John Davis, a professor of psychiatry at the University of Illinois at Chicago, told me.

But now all sorts of well-established, multiply confirmed findings have started to look increasingly uncertain. It's as if our facts were losing their truth: claims that have been enshrined in textbooks are suddenly unprovable. This phenomenon doesn't yet have an official name, but it's occurring across a wide range of fields, from psychology to ecology. In the field of medicine, the phenomenon seems extremely widespread, affecting not only antipsychotics but also therapies ranging from cardiac stents to Vitamin E and antidepressants: Davis has a forthcoming analysis demonstrating that the efficacy of antidepressants has gone down as much as threefold in recent decades.

For many scientists, the effect is especially troubling because of what it exposes about the scientific process. If replication is what separates the rigor of science from the squishiness of pseudoscience, where do we put all these rigorously validated findings that can no longer be proved? Which results should we believe? Francis Bacon, the early-modern philosopher and pioneer of the scientific method, once declared that experiments were essential, because they allowed us to "put nature to the question." But it appears that nature often gives us different answers.

That last sentence is a false clue. The real reason for the decline effect can ultimately be found in the pressure on researchers to publish "significant" findings in order to get tenure and/or further research grants, "significant" meaning a new-and-different positive finding, rather than, say, the mere refutation of an older theory*. Thus, differences will be stressed, not similarities, and few tenure-track academics wish to send an article that finds nothing to the relevant journals, even though finding "nothing" can be truly important if that nothing happens to be the efficacy of a new therapeutic drug, say.

Add to this tenure pressure the general "file drawer" problem, i.e., the tendency for journals to publish positive findings (of difference or of support for a new theory) rather than negative findings (of no difference or of no support for a new theory), and you get to the roots of the decline effect:

Early studies often produce exaggerated results.
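To see how the file drawer alone can inflate early results, here is a minimal simulation sketch. Everything in it is my own assumption for illustration, not anything from Lehrer's article: a small true effect of 0.2 standard deviations, twenty subjects per group, and journals that print only studies with a positive result at roughly the usual 5% significance cut-off (a z-like statistic above 1.96). The function names are hypothetical.

```python
import random
import statistics


def simulate_study(rng, true_effect, n):
    """Run one two-group study; return (observed difference, z-like statistic)."""
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    diff = statistics.fmean(treated) - statistics.fmean(control)
    # Standard error of the difference between two group means.
    se = (statistics.variance(treated) / n + statistics.variance(control) / n) ** 0.5
    return diff, diff / se


def run(seed=0, true_effect=0.2, n=20, studies=2000, threshold=1.96):
    """Compare the average effect in all studies with the average in 'published' ones."""
    rng = random.Random(seed)
    results = [simulate_study(rng, true_effect, n) for _ in range(studies)]
    all_effects = [diff for diff, _ in results]
    # Assumed publication filter: only positive, "significant" studies get printed.
    published = [diff for diff, z in results if z > threshold]
    return statistics.fmean(all_effects), statistics.fmean(published), len(published)


if __name__ == "__main__":
    mean_all, mean_published, n_published = run()
    print(f"average effect, all studies:       {mean_all:.2f}")
    print(f"average effect, published studies: {mean_published:.2f}")
    print(f"published: {n_published} of 2000")
```

Under these assumptions the studies that make it into print report an effect several times larger than the true one, simply because a small study can only clear the significance bar when chance has pushed its estimate upward. Later and larger replications have nowhere to go but down.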

What happens next? Suppose a study finds something astonishing, interesting and novel. Its creator(s) get invited to give seminars all over the academic world and might even get short-listed for new and much better positions. Suddenly the topic is hot! Hot, and other researchers sharpen their metaphorical pencils to join the fray.

It's almost like a fad. An excellent example (and one I've often criticized on this blog) has to do with the evolutionary concept of fluctuating asymmetry:

In 1991, the Danish zoologist Anders Møller, at Uppsala University, in Sweden, made a remarkable discovery about sex, barn swallows, and symmetry. It had long been known that the asymmetrical appearance of a creature was directly linked to the amount of mutation in its genome, so that more mutations led to more "fluctuating asymmetry." (An easy way to measure asymmetry in humans is to compare the length of the fingers on each hand.) What Møller discovered is that female barn swallows were far more likely to mate with male birds that had long, symmetrical feathers. This suggested that the picky females were using symmetry as a proxy for the quality of male genes. Møller's paper, which was published in Nature, set off a frenzy of research. Here was an easily measured, widely applicable indicator of genetic quality, and females could be shown to gravitate toward it. Aesthetics was really about genetics.

In the three years following, there were ten independent tests of the role of fluctuating asymmetry in sexual selection, and nine of them found a relationship between symmetry and male reproductive success. It didn't matter if scientists were looking at the hairs on fruit flies or replicating the swallow studies—females seemed to prefer males with mirrored halves. Before long, the theory was applied to humans. Researchers found, for instance, that women preferred the smell of symmetrical men, but only during the fertile phase of the menstrual cycle. Other studies claimed that females had more orgasms when their partners were symmetrical, while a paper by anthropologists at Rutgers analyzed forty Jamaican dance routines and discovered that symmetrical men were consistently rated as better dancers.

Then the theory started to fall apart. In 1994, there were fourteen published tests of symmetry and sexual selection, and only eight found a correlation. In 1995, there were eight papers on the subject, and only four got a positive result. By 1998, when there were twelve additional investigations of fluctuating asymmetry, only a third of them confirmed the theory. Worse still, even the studies that yielded some positive result showed a steadily declining effect size. Between 1992 and 1997, the average effect size shrank by eighty per cent.


What happened? Leigh Simmons, a biologist at the University of Western Australia, suggested one explanation when he told me about his initial enthusiasm for the theory: "I was really excited by fluctuating asymmetry. The early studies made the effect look very robust." He decided to conduct a few experiments of his own, investigating symmetry in male horned beetles. "Unfortunately, I couldn't find the effect," he said. "But the worst part was that when I submitted these null results I had difficulty getting them published. The journals only wanted confirming data. It was too exciting an idea to disprove, at least back then." For Simmons, the steep rise and slow fall of fluctuating asymmetry is a clear example of a scientific paradigm, one of those intellectual fads that both guide and constrain research: after a new paradigm is proposed, the peer-review process is tilted toward positive results. But then, after a few years, the academic incentives shift—the paradigm has become entrenched—so that the most notable results are now those that disprove the theory.

I remember writing about those Jamaican dancing men, for example, because I wondered how the audience could spot such extremely, extremely minute body asymmetries when I have been known to leave the house wearing two different-colored shoes.

Perhaps all we see here IS a Kuhnian paradigm shift. But note the costs of a poor paradigm: We have been told for over a decade that even human females will pick (!) their mates based on how symmetric their fingers are or something similar to that. Bad paradigms hurt real people out there, in this case by providing sciencey-looking research which makes other pseudo-science pieces come across as more plausible.

More from Simmons:

"A lot of scientific measurement is really hard," Simmons told me. "If you're talking about fluctuating asymmetry, then it's a matter of minuscule differences between the right and left sides of an animal. It's millimetres of a tail feather. And so maybe a researcher knows that he's measuring a good male"—an animal that has successfully mated—"and he knows that it's supposed to be symmetrical. Well, that act of measurement is going to be vulnerable to all sorts of perception biases. That's not a cynical statement. That's just the way human beings work."

So it goes.

Another example of a decline effect can be found in the studies which analyze the ratio of the forefinger to the ring finger (2D-4D) as a measure of how much androgen a person may have been "bathed" in while in utero. All sorts of fascinating conclusions have been based on that idea: Men are better stockbrokers if they have relatively longer ring fingers (more of the good testosterone juice!), male Neanderthals were probably polygamous and adulterous and violent guys because a Neanderthal skeleton's hand bones appear to show a low 2D-4D measure, and the early human cave painters may have included women due to the high 2D-4D ratios of some of the hands painted on cave walls.

But my recent short search on the topic spotted many more recent studies which failed to find any correlation between various forms of behavior and the 2D-4D ratio, and my guess is that what we are observing is the decline effect. Measuring finger lengths is pretty tricky, after all.

Perhaps the most frightening quote from Lehrer's article is this one:

The situation is even worse when a subject is fashionable. In recent years, for instance, there have been hundreds of studies on the various genes that control the differences in disease risk between men and women. These findings have included everything from the mutations responsible for the increased risk of schizophrenia to the genes underlying hypertension. Ioannidis and his colleagues looked at four hundred and thirty-two of these claims. They quickly discovered that the vast majority had serious flaws. But the most troubling fact emerged when he looked at the test of replication: out of four hundred and thirty-two claims, only a single one was consistently replicable. "This doesn't mean that none of these claims will turn out to be true," he says. "But, given that most of them were done badly, I wouldn't hold my breath."

Why is this so frightening? First, it is about diseases. The consequences of bad research are more immediate and more serious in that field. Second, as I have written before, findings like these will be eagerly grasped by the popularizers who will then transmit them to an audience eager to devour anything to do with sex differences. Third, some odd type of psychological reproduction of ignorance is especially potent in this field. I see it all the time. This means that even the correction of incorrect results may not be enough, because the very idea of simple genetic explanations for complicated phenomena is so immensely appealing.

Don't let this short review stop you from reading the whole article. It's fun.
Thanks to Geralyn Horton for the link to Lehrer's article.

*There can be more to this than just choosing between different manuscripts. It's not unknown for researchers to keep on trying various combinations of variables until they get something significant. This isn't necessarily bad if the study also reports how many other analyses produced only non-significant findings, of course.
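A rough sketch of why the footnote matters, under assumptions of my own: the analyses are independent of each other, none of the variables has any real effect, and the significance cut-off is the conventional 5%. The function names are hypothetical. With twenty such analyses, the chance that at least one comes out "significant" is roughly 0.64.

```python
import random


def chance_of_false_positive(analyses, alpha=0.05):
    """Probability that at least one of several independent null tests is 'significant'."""
    return 1.0 - (1.0 - alpha) ** analyses


def simulate(analyses, trials=10000, alpha=0.05, seed=0):
    """Estimate the same probability by simulating many researchers."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # When there is no real effect, each p-value is uniform on [0, 1].
        if any(rng.random() < alpha for _ in range(analyses)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, round(chance_of_false_positive(k), 2), round(simulate(k), 2))
```

This is why reporting how many analyses were tried is so important: a single significant result out of twenty attempts is about what chance alone would be expected to deliver.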