Wednesday, 10 April 2013

(Appropriately powered) replication's what you need

Thanks to Mark Stokes for picture
There has been some truly excellent coverage this morning of the very important paper published today by Kate Button, Marcus Munafò and colleagues in Nature Reviews Neuroscience, entitled “Power failure: why small sample size undermines the reliability of neuroscience”.

For example, Ed Yong has written a fantastic piece on the issues raised by the realisation that insufficient statistical power plagues much neuroscience research, and Christian Jarrett has an equally good article on the implications of these issues for the field.

As I commented in Ed Yong’s article, I think this is a landmark paper. It's very good to see these issues receiving exposure in such a widely read and highly respected journal - I think it says a lot for the willingness of the neuroscience field to consider and hopefully tackle these problems, which are being identified in so many branches of science.

I really like the section of the paper focusing on the fact that the issues of power failure are a particular problem for replication attempts, which I think is a point not many people are conscious of.  You'll often see an experiment's sample size justified on the basis of an argument like "well, that number is what's used in previous studies".  Button et al demonstrate that such a justification is unlikely to be sufficient.  To be adequately powered, replications need a larger sample size than the original study they’re seeking to replicate.  There are very few replication studies in the literature that fulfil this criterion.
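To make the sample-size point concrete, here is a rough sketch of the arithmetic for a simple two-group comparison. It uses the standard normal approximation to the two-sample t-test, and every number in it is hypothetical rather than taken from Button et al: an original study with 20 participants per group reporting a standardised effect of d = 0.8, and an assumption that the true effect is only half that size because small studies tend to overestimate effects.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def required_n_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power for a given effect size and n per group.

    Ignores the negligible chance of rejecting in the wrong direction.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(d * sqrt(n_per_group / 2) - z_alpha)


# Hypothetical values, for illustration only.
original_n = 20        # participants per group in the original study
reported_d = 0.8       # effect size reported by the original study
assumed_true_d = 0.4   # assumption: the true effect is half the reported one

print("n per group if the reported effect is taken at face value:",
      required_n_per_group(reported_d))
print("n per group if the true effect is half as large:",
      required_n_per_group(assumed_true_d))
print("power of a same-size replication under that assumption:",
      round(approx_power(assumed_true_d, original_n), 2))
```

Under these assumptions, a replication that simply copies the original sample size would be badly underpowered, and one planned around the halved effect would need several times as many participants. The normal approximation slightly underestimates the sample size an exact t-test calculation would give, so dedicated power software is the better choice for real planning.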

I feel deep down that greater emphasis on replication is the answer to a lot of the current problems facing the field, but replication only helps if it is adequately powered, which is why the points raised by Button et al are ones that researchers need to take account of.

The good thing is that I think the field is taking notice of papers such as this one, and is making progress towards developing more robust methodological principles. Button et al's paper, the recent Nature Neuroscience papers by Kriegeskorte et al and Nieuwenhuis et al, and the recent moves by Psychological Science, Cortex, and other journals to promote the use of more reliable methodology are all excellent contributions to that progress. I think it's a sign of a healthy, thriving scientific discipline that these methodological developments are being published in such prominent flagship journals. It gives me confidence about the future of the field.

Update 10/4/13, 3pm: I'm grateful to Paul Fletcher for highlighting that NeuroImage: Clinical has created a new section to help address concerns about the lack of replication in clinical neuroimaging.  Very happy to publicise any other similar moves to improve things.

Button, K., Ioannidis, J., Mokrysz, C., Nosek, B., Flint, J., Robinson, E., & Munafò, M. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience. DOI: 10.1038/nrn3475
Chambers, C. (2013). Registered Reports: A new publishing initiative at Cortex. Cortex, 49 (3), 609-610. DOI: 10.1016/j.cortex.2012.12.016
Kriegeskorte, N., Simmons, W., Bellgowan, P., & Baker, C. (2009). Circular analysis in systems neuroscience: the dangers of double dipping. Nature Neuroscience, 12 (5), 535-540. DOI: 10.1038/nn.2303
Nieuwenhuis, S., Forstmann, B., & Wagenmakers, E. (2011). Erroneous analyses of interactions in neuroscience: a problem of significance. Nature Neuroscience, 14 (9), 1105-1107. DOI: 10.1038/nn.2886


  1. Charan Ranganath, 10 April 2013 at 19:29

    Jon, I agree about the power issue, but not necessarily all the emphasis on methodology of late. The Nieuwenhuis paper is a good example of misguided thinking, IMHO. Their paper is predicated on the idea that we should always test for interactions, but in most neuroscience studies, the hypothesis concerns a *selective effect*, not an interaction. For instance, in a pre- vs post- design, we shoot for an effect in a drug or lesion group and no effect in a control group. In fact, if a true interaction was seen in typical drug or lesion studies, they would be difficult to explain (e.g., why would a placebo make performance worse on the second test?)

    1. @Charan
      I disagree. The scenario you describe (an effect of intervention X on patients but not on controls) is a standard single dissociation and readily detectable through ANOVA or other approaches as a significant interaction.

      A double dissociation may be easier to detect but it is not the only form of interaction that can be observed, and the point that Nieuwenhuis et al make isn't restricted to DDs.

    2. Charan Ranganath, 10 April 2013 at 21:21

      Dear Chris, you are of course right that a selective effect can be detected with a factorial analysis, but that's not to say that it is the best and only test of the predicted effect. In a factorial analysis, a selective effect will split the difference between interaction and main effect, so my point is that the interaction term of an ANOVA is not the most efficient or straightforward test of a selective effect. But of course in order to prove that the effect is larger in the experimental group than in the control group, an interaction is necessary. So the issue to me is not what is right or wrong, but what conclusion can be supported by a particular analysis.

    3. @Charan. I disagree - the only other viable test I can see is an a priori contrast of the form {-3, 1, 1, 1} but that is rarely performed and requires the strong prediction that three of the conditions are identical. In a factorial design the natural contrast to use is the interaction. Looking just at the simple main effects is more sensitive, but only because it effectively raises Type I error (e.g., where the p-value for one of the pre-post differences is .049 and the other is .051).

  2. I do agree with you Charan on some aspects of the recent focus on neuroscience methodology. As I hope I've made clear in this article and previously, I'm 100% behind moves to improve the robustness and reliability of the methods we use in our science. I think it's a fundamental part of science (all fields of science!) that methods improve over time, that each time some people are resistant to changing what they know, but eventually the field adopts a new convention.

    However, I disagree with the emphasis of some like John Ioannidis (and to an extent, the tone in parts of Button et al's paper) that studies not adopting some methodological development or other are therefore "false". In fact I think it's a shame some people have the view that any single study can be true or false; to my mind, every experiment we do is about finding another tiny bit of evidence to add to the growing picture. Some of that evidence will take us a little closer to the truth, some a little further away - and we should keep improving our methods so that there's a greater chance each individual study helps rather than hinders. But to assert that a study using p<0.05 whole-brain corrected is true (just to take one example), but a study with an uncorrected threshold is false, is barmy in my opinion, when in 10 years we're highly likely to have moved on methodologically and will think RFT correction is old hat.

  3. Jon, I agree with much of what you say; data are data. The problems lie with the inferences and conclusions drawn from those data if the underlying analytical assumptions and methodological limitations are not fully acknowledged. I think presenting findings as estimates with confidence intervals helps here, because it makes the precision of the estimated effect, and how much confidence it deserves, clear.