A dozen times I’ve heard my mom say that every few years the scientists change their minds on whether eggs are healthy. In the last few years, ketchup bottles have shed claims that lycopene reduces your risk of prostate cancer. I want to explain why these U-turns happen so often. It is the skeleton in the closet the public doesn’t know about.

**A Pictorial Representation**

This cartoon hit the nerd-o-sphere (aka the econo-blogosphere) recently. It hits home with virtually every data-head out there. Unfortunately, I am afraid most lay people and the media don’t get the very serious point it makes.

Here is the cartoon:

If you look closely, you will see that only for *green* jelly beans is P<0.05. We say that the “p-value” is less than 0.05. When you look for correlations in a sample between two variables (e.g. jelly beans and acne), you essentially end up with two pieces of information. First, the “estimate,” which tells you the *magnitude* of the link between the two variables. For example, eating jelly beans might increase the number of pimples by 43%. Second, we have the “p-value,” which tells us how sure we are that the estimate is not just a statistical anomaly. The p-value tells you how likely it is that you would find a link even if no true relationship exists. The reason you might find such a link is that randomness sometimes looks like a pattern. (For example, sometimes you do flip 10 heads in a row.)

To decide if two variables are likely related, it is important to use both pieces of information – the estimate and the p-value. The rule-of-thumb for saying we think an estimated relationship is real is usually P < 0.05. If P < 0.05, the result is “statistically significant.” That is, there is less than a 5% chance we would find evidence for an effect that doesn’t actually exist. Of course, this means *if you estimate the effect of 20 different colors of jelly beans on acne, then simply by chance you should expect to find at least one “effect” at the 5% significance level that is not real.*
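That multiple-testing arithmetic is easy to check for yourself. Here is a minimal simulation (my own sketch, not from the cartoon), using the fact that under the null hypothesis a continuous p-value is uniformly distributed on [0, 1]:

```python
import random

random.seed(0)

ALPHA = 0.05
N_COLORS = 20

# Simulate many "jelly bean experiments", each testing 20 colors
# that have NO real effect on acne. Under the null, each p-value
# is a uniform draw on [0, 1].
n_experiments = 10_000
runs_with_false_positive = 0
for _ in range(n_experiments):
    p_values = [random.random() for _ in range(N_COLORS)]
    if any(p < ALPHA for p in p_values):
        runs_with_false_positive += 1

simulated = runs_with_false_positive / n_experiments
analytic = 1 - (1 - ALPHA) ** N_COLORS  # chance of at least one "hit"

print(f"analytic  P(at least one false positive) = {analytic:.3f}")
print(f"simulated P(at least one false positive) = {simulated:.3f}")
```

With 20 tests, the chance of at least one spurious “significant” color is about 64% – so the cartoon’s green jelly beans are not lucky, they are expected.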

**Data Mining**

For the one reader still with me after the statistics lesson, I will now explain how we might end up with a bunch of flip-flopping on scientific (and economic) studies.

Suppose you wanted to know if lycopene – which is in tomatoes – is good for health. One study you could do is to look at how much lycopene people in different countries get from their diets. Let’s say you have a bunch of data on broad health outcomes like cancer rates or life spans. You could then estimate the effect of having more lycopene in your diet on these broad health outcomes. My gut feeling is you would find no effect. Because so many things impact life span, the effect of lycopene is probably small.

Of course, it’s no fun to find that something doesn’t matter. So, what you would then do is look at different types of diseases. Maybe you have data on 20 diseases from heart failure to diabetes to brain cancer to prostate cancer. You then estimate the effect of lycopene on each disease individually. Amazingly, you find that “lycopene at statistically significant levels reduces the chance of prostate cancer.” The news reports this important result, Heinz Ketchup prints it on every bottle, and husbands who hate tomatoes have them shoved down their throats!

My point is that it may not be true at all. All the time, follow-up studies with different data overturn “statistically significant” results because the first finding was due to randomness. If you have enough variables, you can always find significant results. You can find stock prices that correlate merely by chance, or maybe unemployment and skirt length. You might find Viagra also helps with arthritis. In short, you get lots of fun things to talk about at cocktail parties.
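You can see why data-mined results keep getting overturned with a quick simulation (again my own sketch, treating each null study’s p-value as a uniform draw): mine 20 hypothetical “diseases” for significant hits, then rerun each hit on fresh data.

```python
import random

random.seed(1)

ALPHA = 0.05

def null_study():
    """One study of a nonexistent effect: its p-value is uniform on [0, 1]."""
    return random.random()

hits = 0          # "lycopene reduces disease X!" headlines
replications = 0  # headlines that survive a follow-up study
for _ in range(10_000):
    for _ in range(20):                  # data-mine 20 diseases
        if null_study() < ALPHA:         # a significant first finding
            hits += 1
            if null_study() < ALPHA:     # follow-up study on new data
                replications += 1

print(f"{hits} significant findings, {replications} replicated "
      f"({replications / hits:.1%})")
```

Since none of the effects are real, only about 5% of the “significant” findings replicate – exactly the false-positive rate, which is why the follow-up study usually delivers the U-turn.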

The danger of data mining is you cease to be able to know what is “truth” and what is randomness.*

**Publication Bias**

The second factor, which exacerbates the danger of data mining, is “publication bias.” Five different researchers might run studies to determine the effect of lycopene on health. Four of the five might find no effect on prostate cancer. If you knew this, you would say there probably is no effect.

The problem is everyone likes sexy results. Saying A doesn’t affect B is kind of boring. Furthermore, journals tend to reward results rather than non-results. And if you know anything about academia, you know you “publish or perish.” The one guy who found a statistically significant effect will get published in a peer-reviewed journal. The other four who found no effect might not even waste their time sending their results in, since the deck is stacked against them. Zero results are boring results.

Most news outlets will report on “peer reviewed” work. The net effect is that what gets reported is biased toward finding effects. In fact, if you wanted to do your own search of the literature you still may never find out about the other studies because they were never published. They are crammed in the dark reaches of someone’s desk.
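The distortion from this filter is easy to simulate (my sketch; the numbers are illustrative): suppose the true effect is exactly zero, every study estimates it with standard error 1, and only statistically significant estimates get published.

```python
import random

random.seed(2)

# True effect is zero, so each study's estimate is a standard normal
# draw (estimate / standard error = a z statistic). A study is
# "published" only if it is significant at the 5% level: |z| > 1.96.
estimates = [random.gauss(0, 1) for _ in range(100_000)]
published = [e for e in estimates if abs(e) > 1.96]

share_published = len(published) / len(estimates)
avg_published_magnitude = sum(abs(e) for e in published) / len(published)

print(f"share of studies published:        {share_published:.1%}")
print(f"average |effect| in the journals:  {avg_published_magnitude:.2f}")
print(f"true effect:                       0.00")
```

Only about 5% of the studies make it into print, but every one of them reports a large effect – more than two standard errors from zero – even though the truth is exactly nothing. Reading only the published literature, you would be confident the effect is real.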

So, the next time you hear “researchers have found…”, keep in mind that data mining plus publication bias means you should be very skeptical of new findings.

*I am not saying truth is determined by empirical evidence, which is why I put it in quotes.


Very interesting. This is why I choose to ignore what scientists and statisticians say in general. They are all out to say something significant about something that no one else has ever said and will go to great lengths to get data that seems to support them.

That’s an extreme position… Scientific consensus can generally be trusted. It’s frontier science that warrants a degree of skepticism.

Right, so consensus is different from the first study that comes out, which is what typically gets reported. When studies have been repeated, or similar studies point in the same direction, then we can be more certain about what we know. I don’t see what is extreme about my point. From what I can tell, pretty much everyone working in science and economics with data is aware of this issue.

Here is what I think is an interesting post making a point similar to mine. He even provides an equation for how you should update your beliefs in a world of publication bias:

http://codeandculture.wordpress.com/2009/03/17/publication-bias/
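I haven’t reproduced the linked post’s equation here, but a standard Bayes-rule calculation in the same spirit shows how little one published significant finding should move you (all three numbers below are made up for illustration):

```python
# Assumed inputs (illustrative, not from the linked post):
prior = 0.10   # your prior that the hypothesis is true
power = 0.80   # P(significant result | hypothesis true)
alpha = 0.05   # P(significant result | hypothesis false)

# Bayes' rule: probability the hypothesis is true given one
# significant (and hence publishable) result.
posterior = (prior * power) / (prior * power + (1 - prior) * alpha)
print(f"P(true | one significant published result) = {posterior:.2f}")
```

With these numbers the posterior is 0.64 – better than a coin flip, but a long way from the certainty a “researchers have found…” headline suggests, and that is before accounting for the unpublished null results.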

No, I was responding to Will’s comment about ignoring scientists and statisticians altogether.

Gotcha – that makes a little more sense to me. I missed that that’s who you were responding to.

Technically, a p-value is the probability of obtaining a test statistic at least as extreme as the one computed from the data, if the data-generating process posited by the null hypothesis is the true data-generating process. Granted, that is a much more complicated exposition, but misunderstanding of statistical analysis begins with misunderstanding of p-values, which in turn begins with somewhat sloppy definitions, so it pays to be precise.

I will agree with that.

Upon reflection, I can’t see how your definition would reduce misunderstanding of statistical analysis, or where mine would mislead a non-practitioner into believing something horribly untrue. Trying to explain statistics to non-statistical people means some compromises, but compromises don’t mean you have to say something incorrect. Explaining the general concepts in a short blog post (as long as it is not wrong) is not the road to confusion for the average person. Telling a two-year-old that “babies come from mommies” might be a perfectly acceptable answer. Is this answer wrong? You seem to say it is “sloppy” to not give the kid a technical anatomy lesson.

No I agree completely. Obviously the definition I gave is much more opaque and doesn’t contribute really to the understanding of statistics for someone not already well-versed. I just wanted to have the textbook definition of a p-value on the record here because there does exist a controversy about the usefulness of classical statistical hypothesis testing and it centers on the perceived sanctity of the p-value and its real interpretation.

If the null hypothesis is that no relationship exists (i.e. beta=0), and “finding a link” is defined as obtaining a t-statistic greater than 1.96 (or whatever for OLS) i.e. generating a point estimate that is more than 1.96 standard errors from zero, then I think your definition is perfectly fine. I realize my comment came off a little condescending, but I wasn’t trying to insinuate that yours was a bad definition, just trying to state a little more generally the idea of a p-value. Because, technically, a p-value can be used as a decision criterion for a broader range of claims than establishing a correlation between two variables, right? In the xkcd case, that’s the claim being made, but you could be using an F-statistic or a chi-square or something.
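For the concrete case mentioned above – a point estimate more than 1.96 standard errors from zero – the two-sided p-value under a standard normal null can be computed directly; here is a small sketch using only the standard library:

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal null:
    P(|Z| >= |z|) = 2 * (1 - Phi(|z|)), with Phi computed via math.erf."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

print(two_sided_p(1.96))   # just under 0.05
print(two_sided_p(2.58))   # roughly 0.01
```

This is why 1.96 is the magic threshold: it is exactly the z value at which the two-sided p-value crosses 0.05.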

Thanks Mike. I agree too. And, I am in fact sympathetic to the whole issue about statistical inference that you brought up. I wish it got more talk than it does. As always I really appreciate your input.

For a little more light-heartedness,

http://xkcd.com/892/

That’s pretty good.