I’ve written previously about the problems associated with an unhealthy fixation on P-values in psychology. Although null hypothesis significance testing (NHST) remains the dominant approach, there are a number of important problems with it. Tressoldi and colleagues summarise some of these in a recent article.

First, NHST focuses on rejection of the null hypothesis at a pre-specified level of probability (typically 5%, or 0.05). The implicit assumption, therefore, is that we are only interested answering “Yes!” to questions of the form “Is there a difference from zero?”. What if we are interested in cases where the answer is “No!”? Since the null hypothesis is hypothetical and unobserved, NHST doesn’t allow us to conclude that the null hypothesis is true.

Second, P-values can vary widely when the same experiment is repeated (for example, because the participants you sample will be different each time) – in other words, it gives very unreliable information about whether a finding is likely to be reproducible. This is important in the context of recent concerns about the poor reproducibility of many scientific findings.

Third, with a large enough sample size we will always be able to reject the null hypothesis. No observed distribution is ever exactly consistent with the null hypothesis, and as sample size increases the likelihood of being able to reject the null increases. This means that trivial differences (for example, a difference in age of a few days) can lead to a P-value less than 0.05 in a large enough sample, despite the difference having no theoretical or practical importance.

The last point is particularly important, and relates to two other limitations. Namely, the P-value doesn’t tell us anything about how large an effect is (i.e., the effect size), or about how precise our estimate of the effect size is. Any measurement will include a degree of error, and it’s important to know how large this is likely to be.

There are a number of things that can be done to address these limitations. One is the routine reporting of effect size and confidence intervals. The confidence interval is essentially a measure of the reliability of our estimate of the effect size, and can be calculated for different ranges. A 95% confidence interval, for example, represents the range of values that we can be 95% confident that the true effect size in the underlying population lies within. Reporting the effect size and associated confidence interval therefore tells us both the likely magnitude of the observed effect, and the degree of precision associated with that estimate. The reporting of effect sizes and confidence intervals is recommended by a number of scientific organisations, including the American Psychological Association, and the International Committee of Medical Journal Editors.

How often does this happen in the best journals? Tressoldi and colleagues go on to assess the frequency with which effect sizes and confidence intervals are reported in some of the most prestigious journals, including *Science*, *Nature*, *Lancet* and *New England Journal of Medicine*. The results showed a clear split. Prestigious medical journals did reasonably well, with most selected articles reporting prospective power (*Lancet* 66%, *New England Journal of Medicine* 61%) and an effect size and associated confidence interval (*Lancet* 86%, *New England Journal of Medicine* 83%). However, non-medical journals did very poorly, with hardly any selected articles reporting prospective power (*Science* 0%, *Nature* 3%) or an effect size and associated confidence interval (*Science* 0%, *Nature* 3%). Conversely, these journals frequently (*Science* 42%, *Nature* 89%) reported P-values in the absence of any other information (such as prospective power, effect size or confidence intervals).

There are a number of reasons why we should be cautious when ranking journals according to metrics intended to reflect quality and convey a sense of prestige. One of these appears to be that many of the articles in the “best” journals neglect some simple reporting procedures for statistics. This may be for a number of reasons – editorial policy, common practices within a particular field, or article formats which encourage extreme brevity. Fortunately the situation appears to be improving – *Nature* recently introduced a methods reporting checklist for new submissions, which includes statistical power and sample size calculation. It’s not perfect (there’s no mention of effect size or confidence intervals, for example), but it’s a start…

**Reference:**

** **Tressoldi, P.E., Giofré, D., Sella, F. & Cumming, G. (2013). High impact = high statistical standards? Not necessarily so. PLoS One, e56180.

**Posted by Marcus Munafo**