November 11, 2013

Hilda Bastian likes thinking about bias, uncertainty and how we come to know all sorts of things. Her day job is making clinical effectiveness research accessible. And she explores the limitless comedic potential of clinical epidemiology at her cartoon blog, Statistically Funny. Follow her on Twitter @hildabast.

Imagine if there were a simple single statistical measure everybody could use with any set of data and it would reliably separate true from false. Oh, the things we would know! Unrealistic to expect such wizardry though, huh?

Yet, statistical significance is commonly treated as though it is that magic wand. Take a null hypothesis or look for any association between factors in a data set and *abracadabra*! Get a "*p* value" over or under 0.05 and you can be **95% certain** it's either a fluke or it isn't. You can eliminate the play of chance! You can separate the signal from the noise!

Except that you can’t. That’s not really what testing for statistical significance does. And therein lies the rub.

Testing for statistical significance estimates the probability of getting roughly that result *if* the study hypothesis is assumed to be true. It can’t on its own tell you whether this assumption was right, or whether the results would hold true in different circumstances. It provides a limited picture of probability, taking limited information about the data into account and giving only “yes” or “no” as options.

What’s more, the finding of statistical significance itself can be a “fluke,” and that becomes more likely with bigger data sets and when you run the test on multiple comparisons in the same data. You can read more about that here.
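A quick back-of-the-envelope calculation shows why multiple comparisons inflate the chance of a spurious "significant" finding. The 20-test figure below is just an illustrative assumption:

```python
# If every null hypothesis is actually true, each test still has a 5%
# chance of crossing the p < 0.05 threshold by chance alone.
alpha = 0.05
n_tests = 20  # hypothetical: 20 independent comparisons on one data set

# Probability that at least one of the tests is a false positive
familywise_error = 1 - (1 - alpha) ** n_tests

print(f"Chance of at least one spurious 'significant' result: {familywise_error:.0%}")
```

Run 20 independent comparisons and there's roughly a 64% chance that at least one of them crosses the threshold purely by luck.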

Statistical significance testing can easily sound as though it sorts the wheat from the chaff, telling you what’s “true” and what isn’t. But it can’t do that on its own. What’s more, “significant” doesn’t mean important, either. A sliver of an effect can cross the less-than-5% threshold. We’ll come back shortly to what all this means in practice.

The common approach to statistical significance testing was so simple to grasp, though, and so easy to do even before there were computers, that it took the science world by storm. As Stephen Stigler explains in his piece on Fisher and the 5% level, “it opened the arcane domain of statistical calculation to a world of experimenters and research workers.”

But it also led to something of an avalanche of abuses. The over-simplistic approach to statistical significance has a lot to answer for. As John Ioannidis points out here, it is a serious player in science’s failure to replicate results.

Before we go any further, I need to ‘fess up. I’m not a statistician but I’ve been explaining statistical concepts for a long time. I took the easy way out on this subject for the longest time, too. But I now think the perpetuation of the over-simplified ways of explaining this in so much training is a major part of the problem.

The need for us to get better at communicating the complexity of what statistical significance does and does not mean burst forth in question time at our panel on numbers at the recent annual meeting of the National Association of Science Writers in Florida.

Fellow statistics enthusiast and SciAm blogger Kathleen Raven organized and led the panel, which included me, SciAm mathematician blogger Evelyn Lamb, statistics professor Regina Nuzzo, and mathematician John Allen Paulos. Raven is organizing an ongoing blog called Noise and Numbers around this crew of fun-loving, science-writing geeks. My slides for that day are here on the left.

Two of the points I was making there are relevant to this issue. Firstly, there’s the need to avoid over-precision and to take confidence intervals or standard deviations into account. When you have the data for the confidence intervals, you have a better picture than statistical significance’s *p* value can possibly provide. It’s far more interesting and far more intuitive, too. You can learn more about these concepts here and here.
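As a sketch of why an interval is more informative than a lone *p* value, here is a 95% confidence interval for a mean. The data are invented, and the 1.96 multiplier assumes a large-sample normal approximation:

```python
import math
from statistics import mean, stdev

# Hypothetical sample of measurements
data = [2, 4, 4, 4, 5, 5, 7, 9]

m = mean(data)                            # point estimate of the mean
se = stdev(data) / math.sqrt(len(data))   # standard error of the mean

# Large-sample 95% CI: estimate plus or minus 1.96 standard errors
lower, upper = m - 1.96 * se, m + 1.96 * se

print(f"mean = {m:.1f}, 95% CI ({lower:.2f} to {upper:.2f})")
```

The interval shows the range of plausible effect sizes, rather than collapsing everything into a yes/no verdict.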

Secondly, it’s important to not consider the information from one study in isolation, a topic I go into here. One study on its own is rarely going to provide “the” answer.

Which brings us at last to Thomas Bayes, the mathematician and minister from the 1700s whose thinking is critical to debates about calculating and interpreting probability. Bayes argued that we should consider our prior knowledge when we consider probabilities, not just count the frequency of the specific data set in front of us against a fixed, unvarying quantity regardless of the question.

You can read more about Bayesian statistics on Wikipedia. An example given there goes like this: suppose someone told you they were speaking to someone. The chance that the person was a woman might ordinarily be 50%. But if they said they were speaking to someone with long hair, then that knowledge could increase the probability that the person is a woman. And you could calculate a new probability based on that knowledge.
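Bayes’ theorem turns that intuition into arithmetic. The conditional probabilities below are invented purely for illustration:

```python
# Prior: before any extra information, P(woman) = 0.5
p_woman = 0.5
p_man = 1 - p_woman

# Hypothetical likelihoods: how often each group has long hair
p_hair_given_woman = 0.75
p_hair_given_man = 0.15

# Bayes' theorem: P(woman | long hair) =
#   P(long hair | woman) * P(woman) / P(long hair)
p_hair = p_hair_given_woman * p_woman + p_hair_given_man * p_man
posterior = p_hair_given_woman * p_woman / p_hair

print(f"P(woman | long hair) = {posterior:.2f}")
```

With these made-up numbers, learning about the long hair moves the probability from 50% to about 83% — the prior has been updated by the new evidence.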

Statisticians are often characterized as either Bayesians or frequentists. The statistician doing the ward rounds in my cartoon at the top of this post is definitely a Bayesian!

An absolute hewing to *p* < 0.05 (or 0.001) no matter what would be classically frequentist. Important reasons for insisting on this are the weakness of much of our prior knowledge – and the knowledge that people can be very biased and may play fast and loose with data if there aren’t fixed goal posts.

Bayesianism has risen and fallen several times, but increasing statistical sophistication and computer power are enabling it to come to the fore in the 21st century. Nor is everyone in one or the other camp: there’s a lot of “fusion” thinking.

Valen Johnson has just argued in PNAS (the Proceedings of the National Academy of Sciences of the USA) that Bayesian methods for calculating statistical significance have evolved to the point that they are ready to influence practice. The implication, according to Johnson, is that the threshold for statistical significance needs to be ratcheted much, much lower – more like 0.005 than 0.05. Gulp. The implications of that for the sample sizes needed for studies would be drastic.

It doesn’t really all come down to where the threshold for a *p* value is, though. Statistically significant findings may be important or not for a variety of reasons. One rule of thumb is that when a result does reach that numerical level, the data are showing something, but it always needs to be embedded in a wider consideration – factors such as how big and important the apparent effect is, and whether or not the confidence intervals suggest the estimate is an extreme long shot.

What the debate about the level of statistical significance doesn’t mean, though, is that not being statistically significant is irrelevant. Data that don’t reach statistical significance may simply be too weak to support any conclusion. But just as being statistically significant doesn’t mean something is necessarily “true,” not having enough evidence doesn’t necessarily prove that something is “false.” More on that here.

The debate about Bayesians versus frequentists and hypothesis testing is a vivid reminder that the field of statistics is dynamic – just like other parts of science. Not every statistician will see things the same way. Theories and practices will be contested, and knowledge is going to develop. There are many ways to interrogate data and interpret their meaning, and it makes little sense to look at data through the lens of only one measure. The *p* value is not one number to rule them all.

~~~~

For more on statistics and science writing, the website arising from our adventure in Florida is Noise and Numbers.

If you’ve got a good way of explaining statistical significance precisely in lay terms, please add it to the comments! I’m very keen to find better ways to do this. The paragraph explaining what statistically significant actually means has been refined from the original.

A good book free online to help with understanding health statistics is Know Your Chances by Steve Woloshin, Lisa Schwartz and Gilbert Welch.

Gerd Gigerenzer tackles the many limitations and “wishful thinking” about simple hypothesis and significance tests in his article, Mindless statistics. Wikipedia is a good place to start to learn more, too. Another good article on understanding probabilities is by Gerd Gigerenzer and Adrian Edwards here.

Relevant posts on Statistically Funny are:

- You will meet too much false precision
- Nervously approaching significance
- Don’t worry…it’s just a standard deviation
- Alleged effects include howling

The Statistically Funny cartoons are my original work (Creative Commons, non-commercial, share-alike license).

The picture of the portrait claiming to depict Thomas Bayes is from Wikimedia Commons.

*The thoughts Hilda Bastian expresses here at Absolutely Maybe are personal, and do not necessarily reflect the views of the National Institutes of Health or the U.S. Department of Health and Human Services.*

~~~~

p-value is only a difficult concept because it keeps getting repeated incorrectly. Journalists are notoriously bad about this.

P-value =

Assuming that the null hypothesis is true, p-value represents the probability of getting a value at least as extreme as the one observed.

That’s it.
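That definition can be made concrete with a toy example: 15 heads in 20 coin flips, testing the null hypothesis that the coin is fair (the scenario is hypothetical):

```python
from math import comb

n, observed = 20, 15  # 20 flips, 15 heads observed

# Under the null hypothesis (fair coin), P(exactly k heads) = C(n, k) / 2^n.
# The one-sided p-value sums the probabilities of results at least as
# extreme as the one observed: 15, 16, ..., 20 heads.
p_value = sum(comb(n, k) for k in range(observed, n + 1)) / 2 ** n

print(f"one-sided p-value = {p_value:.4f}")
```

The result, about 0.02, is exactly what the commenter describes: the probability of a result at least this extreme *assuming the null hypothesis is true* — not the probability that the coin is biased.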

Any claims about the actual probability of the result being true require a prior probability. No statisticians will dispute that. There is debate on how you should be allowed to assign your prior probability. I’m personally pretty freewheeling about it. I think it’s okay to be subjective about your prior, as long as you tell me what your prior is and why you think that. If we had an all-trials registry for medical trials, then your pretest probability could be part of what you specified prior to the trial. Others are much less willing to allow for subjectivity in your priors.

I agree that constantly hearing incorrect explanations makes it harder (mea culpa!). But I don’t think it’s only journalists. Gerd Gigerenzer cites statisticians who can’t precisely define statistical significance either.* Little wonder, then, if the rest of us have problems with the concepts. I don’t think the “getting a value at least as extreme” is necessarily easy to grasp. I’m not sure that I’m with you on enthusiastic proponents of interventions specifying subjective priors for clinical trials.

* http://library.mpib-berlin.mpg.de/ft/gg/GG_Mindless_2004.pdf

Fake results often come from a somewhat different error. Classical statistics concerns only single experiments. It breaks down if many labs perform many experiments worldwide and there is a bias to publish only positive results.

Theoretically, a department with 40 people doing 1 experiment per month will publish a fake p<0.05 positive result every two weeks, a fake p<0.005 result every half a year, and a fake breakthrough with p<0.001 significance about every two years. A hot research field with hundreds of scientists working worldwide can publish many papers every year that are all artifacts.
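The commenter's arithmetic checks out under the worst-case assumption that every tested hypothesis is false, so every "positive" is a fluke:

```python
# 40 researchers, one experiment each per month
experiments_per_month = 40

# If every null hypothesis is true, a fraction alpha of experiments
# cross each threshold purely by chance, so the expected gap between
# fake positives is 1 / (experiments per month * alpha) months.
months_between_fakes = {
    alpha: 1 / (experiments_per_month * alpha)
    for alpha in (0.05, 0.005, 0.001)
}

for alpha, months in months_between_fakes.items():
    print(f"p < {alpha}: one fake positive about every {months:g} months")
```

That gives one p<0.05 fluke every half a month (two weeks), one p<0.005 fluke every five months, and one p<0.001 "breakthrough" roughly every 25 months — matching the comment's figures.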

Then, an experiment is usually meant to be a proof of a general principle. Chance positive results are easily extended across all conditions, but negative ones are forgotten.

The problem is that lemming-run competition in science makes it impossible to correct for this bias. A competitive scientist can no longer publish negative results or verify positive ones over more conditions. They would get scooped, or have too few publications to make a career.

Yikes, Jerzy v 3.0! I hadn’t thought of trying to calculate that – that’s a scary thought! Yes, it’s no wonder we have the problems with bad research that were discussed in this other post I wrote that might interest you (Bad research rising). But I’m not so negative: involving good statisticians more in training, research and research review would make a difference to over-confidence in results. And yes, there has to be far more willingness by authors to publish their “negative” results – and for others to work on replicating important findings.

For the non-technical, when it comes to things medical it is useful to keep in mind that statistical significance is not the same as clinical significance. Recent studies about red meat, bacon, and so on asserted a correlation between eating more of these foods and worse health. But the actual risk for a given demographic may imply that the increased risk is from 1 in a thousand to 3 in a thousand. Such numbers may be statistically significant but not clinically significant.

Hi Hilda,

Thanks for the pointer!

You may also be aware of the recent letters from scientists at the companies Bayer and Amgen, who found that they cannot replicate 3/4 to 9/10 of breakthrough publications in cancer.

I think this “positive artifact picking” is one of the reasons why there is a gigantic number of publications in molecular biology but very few new drugs on the market.

In my classes in statistics, I have always emphasized that concluding something based on statistical significance is a BET, pure and simple. That is the essence of probability. You NEVER know if you are right – never. And conclusions are ALWAYS tentative. Hard for students to grasp: science isn’t about truth and falsehood, and most scientists don’t seem to understand it either, as they are so sure of themselves and their theories.

That’s a good way to think about it, Foozler8. Although based on other things, evidence can mount to an extent that you can be sure: it just won’t be from one study. It can be very hard to accept uncertainty, that’s for sure. It’s also psychologically hard, when you’ve worked on something for a long time, to put it in perspective and be humble about it.

Not sure if this helps at all: When I taught statistics, I took pains to explain that the classical statistical tests were approximations that require us to make assumptions that we know may be wrong (e.g., random sampling from a Normal population). In contrast, resampling/randomization tests that we can now do thanks to computers are assumption-free (almost).

For example, suppose we had 10 people drink a cup of coffee and 12 others a bottle of beer and then gave them some test. Here are their scores: 23,27,35,36,41,53,58,60,65,72 and 32,43,48,59,63,65,77,81,98,115,139,185. A classical t-test would be based on the difference between the means in the two groups (47 and 84) and the variability of scores around those means. You get a t of -2.47, which would correspond to a p of 0.023, the probability of observing a difference this big by chance if the coffee/beer had the same or no effect.

The resampling test requires that we assume only that people were randomly assigned to the two groups. If you rearrange all 22 scores into all possible subsets of 10 and 12, there are 646,646 of them. In 10,305 arrangements, there is a mean difference of at least 37 (but you can focus on any statistic; it doesn’t have to be the mean). So the probability of getting a difference at least as large as what you observed by chance alone is 10,305/646,646 = 0.016.

Our t-test p of 0.023 is an approximation to this 0.016.
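Thanks to modern computers, the commenter's exact resampling test is easy to reproduce. Here is a sketch in Python; the one-sided counting rule is my reading of the comment ("a mean difference of at least 37" in favor of the beer group), so the exact tally may differ slightly from the 10,305 quoted:

```python
from itertools import combinations

# Scores from the commenter's example: 10 coffee drinkers, 12 beer drinkers
coffee = [23, 27, 35, 36, 41, 53, 58, 60, 65, 72]
beer = [32, 43, 48, 59, 63, 65, 77, 81, 98, 115, 139, 185]

scores = coffee + beer
total = sum(scores)
observed_diff = sum(beer) / 12 - sum(coffee) / 10  # about 37

# Enumerate every way of assigning 10 of the 22 scores to the "coffee" group
n_arrangements = 0
count_extreme = 0
for idx in combinations(range(22), 10):
    group_sum = sum(scores[i] for i in idx)
    diff = (total - group_sum) / 12 - group_sum / 10
    n_arrangements += 1
    if diff >= observed_diff:
        count_extreme += 1

p_value = count_extreme / n_arrangements
print(f"{n_arrangements} arrangements, exact p = {p_value:.3f}")
```

The only assumption needed is that people were randomly assigned to the two groups; the full enumeration of all 646,646 arrangements replaces the t-test's distributional assumptions.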

While I’m sure you don’t intend to mislead, there’s a fallacy in suggesting that applying conditional probability to events makes one a Bayesian. The definition of conditional probability must be used by all who employ the probability calculus. The fallacy is in thinking that because it’s perfectly sensible with ordinary events, it’s sensible to be a Bayesian when the priors apply to statistical hypotheses! You can call conditional probability Bayes’s theorem if you want, but using it—it’s a theorem after all—does not make one a Bayesian about statistical inference. The Bayesian statistician sees statistical inference as updating priors on hypotheses by conditional probability.

What do the priors mean? where do they come from? Subjective Bayesians will say they reflect their degrees of belief in the hypothesis under study, e.g., that drug X is free of side-effects, the Higgs particle exists in such an such energy range, prions cause Mad Cow, etc. etc. It’s a matter of their hunch and how much they’ll bet!

Is this really a way to hold researchers more accountable?

Non-subjective Bayesians strive to find priors that influence the data AS LITTLE as possible. In that case who needs them? Posteriors come from two things: priors and likelihoods. We’ve just seen what’s wrong with priors. But likelihoods are problematic too. The trouble with likelihoods* is that they only reflect how well data fits hypotheses. If H makes data x highly probable, then H gets a high likelihood. The Bayesian likelihood component reflects the likelihood ratio statistic

LR: P(x|H) divided by P(x|not-H).

In a typical example not-H would be the null hypothesis, and H some chosen alternative hypothesis. Trouble is, one can always find an H that is maximally likely on the data! One can, for example, construct the hypothesis H to perfectly fit the data (so P(x|H) = 1)! Then the LR in favor of H (over not-H) is high. But can’t frequentists do this too?

Answer: not without paying for it in high error probabilities. You see, a frequentist will also look at the error probability associated with the inference. They deny x is good evidence for H when

P(LR so high as this| even if not-H) = high!

This is to say that the probability of erroneously finding such high support in favor of H, even if H is false, is very high.

Bayesians reject the use of error probabilities! This follows from inference by way of Bayes’ theorem.

So remember, when you hear someone say the Bayesian avoids problems with error statistical methods, like significance tests and confidence intervals, the Bayesian posterior probability consists of two things: a subjective prior and a likelihood ratio. Both require considering an exhaustive set of hypotheses that could explain x—else you are incoherent as a Bayesian—but neither takes into account the error probabilities associated with the method!

*Even if it is assumed the underlying statistical model is valid.

To consider Hilda’s remark: “testing for statistical significance estimates the probability of getting roughly that result if the underlying hypothesis is assumed to be true.”

Rather than “roughly,” the p-value (observed statistical significance level) considers results that fit H (as opposed to not-H) even more strongly than the likelihood ratio (LR) you observed. In other words, we place a probability on the LR statistic. Please note that the likelihood ratio LR can be in terms of P(x|H)/P(x|not-H) or it can be P(x|not-H)/P(x|H). Where x “fits” H better than not-H, the first version has a large LR, the second has a small LR.

(i) If it’s fairly common to get even higher observed support for H, even if H is false, then the p-value is not low.

(ii) If it’s extremely rare to be able to generate such high support for H, even if H is false, then the p-value is low.

Thus, unless the p-value is sufficiently low, x is not taken as evidence against not-H.

Second, the computation is like ordinary, non-statistical testing. We hypothetically assume something (e.g., not-H) in order to derive a prediction. Here the prediction is only very probable:

If not-H, then the probability of getting a p-value greater than .01 = .99

Therefore if we get a p-value < .01, we infer there is evidence against not-H (in favor of H).

It’s hypothetical reasoning.

Thanks, Errorstat. I agree that my translation into “roughly” of the probability of the result falling where it has been observed is itself pretty rough. I’ve struggled with finding a way to explain statistical significance testing in enough detail for it to make sense, and without so much detail that it’s too hard for most people to understand. And it can definitely be done better – I’ll keep trying to improve my way of explaining it. I know that a lot gets lost in attempts at translation.

But then I also believe that if we don’t find really good ways to explain these things more simply, we will continue to have non-statisticians who want to be well-informed by research results misinterpreting aspects of statistical significance. It’s too important to let it remain so hard for so many people to get a good handle on the degree of certainty they should take away from a test of significance on its own. I’ll chew over this for a while and see if I can improve the phrasing.

About the Bayesian and frequentist argument: it’s a weakness of my post, I think, that I’ve not done justice to either end of the spectrum – nor to “fusion” positions. So a general point about not being dismissive of a frequentist perspective is very well taken – as is the one about being careful not to define “a Bayesian” as anyone who considers conditional probabilities (I hope I didn’t do it, but agree, it would be misleading).

I’m sure I will re-visit this cornerstone debate in statistical philosophy and practice again, because it’s important and I’m fascinated by it. I totally agree that letting go of strong goal posts in favor of really fuzzy highly biased approximations based in quicksand would cause us all manner of grief. That said, I don’t know that your comment does justice to subjective or objective Bayesian approaches either. Bayesian methods can be very rigorous and priors can be on sturdy ground – although the lack of sturdy ground can inhibit practical application. What’s more, it’s not unusual to see both approaches being applied, with results of more than one approach to analysis included with a piece of work. Regardless of the type of analysis being done, high degrees of bias and sloppiness aren’t going to leave us very well-informed.

I enjoyed the article and the illuminating comments. I would like to add two points:

1) Your definition, “Testing for statistical significance estimates the probability of getting roughly that result if the underlying hypothesis is assumed to be true,” is either unclear or wrong. P values are computed based on the assumption that the NULL hypothesis is correct. The “underlying hypothesis” seems to indicate the novel hypothesis that the experiment is designed to test.

2) When I taught I found it useful to offer students the distinction between basic “discovery” research (testing my hare-brained idea in the lab, with the hope of starting off a new path) and “crucial-test” research (exemplified by a phase III NIH trial that would lead to exposing the general population to a new clinical modality, e.g., a drug). In my opinion, it is not a problem if lots or even most basic discovery research is wrong, for two reasons. First, neither lives nor large amounts of money depend on its being right (if lots of money does, then maybe it’s more like a crucial test). Second, the process is self-correcting in that basic research always leads to new experiments to elaborate the first ones, and (AND HERE’S THE KEY POINT) properly designed follow-ups should include replicating the initial tests as a lesser included goal. People who don’t design follow-ups in this way often end up with a different phenomenon in their hands.

Thanks, Geary. I was trying to communicate the concept of the null hypothesis without using jargon: I don’t think we can know that people assume “underlying” means novel. In fact, it probably doesn’t mean anything much. I changed it to study hypothesis. I disagree, though, that it doesn’t matter if basic research is wrong. Lives and large amounts of money are involved, because it determines what goes forward, new ideas are built on it and so on. The process is only theoretically self-correcting. In practice there’s a serious dearth of replication science. See for example this post, “Bad research rising.”
