I’ve written previously about the problems associated with an unhealthy fixation on P-values in psychology. Although null hypothesis significance testing (NHST) remains the dominant approach, there are a number of important problems with it. Tressoldi and colleagues summarise some of these in a recent article.

First, NHST focuses on rejection of the null hypothesis at a pre-specified level of probability (typically 5%, or 0.05). The implicit assumption, therefore, is that we are only interested in answering “Yes!” to questions of the form “Is there a difference from zero?”. What if we are interested in cases where the answer is “No!”? Since the null hypothesis is hypothetical and unobserved, NHST doesn’t allow us to conclude that it is true.

Second, P-values can vary widely when the same experiment is repeated (for example, because the participants you sample will be different each time) – in other words, they give very unreliable information about whether a finding is likely to be reproducible. This is important in the context of recent concerns about the poor reproducibility of many scientific findings.
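This variability is easy to see in simulation. The sketch below (a toy example in Python; the true effect size of 0.5 SD and the sample size of 30 per group are arbitrary choices, not from the article) repeats the same two-group experiment many times – nothing changes except the random sampling – and records the P-value each run:

```python
# Repeat an identical two-group experiment many times and record the
# P-value each run. The true effect (d = 0.5) and sample size (n = 30
# per group) never change -- only the random sampling does.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
p_values = []
for _ in range(1000):
    control = rng.normal(0.0, 1.0, size=30)
    treatment = rng.normal(0.5, 1.0, size=30)   # true effect of d = 0.5
    _, p = stats.ttest_ind(treatment, control)
    p_values.append(p)

p_values = np.array(p_values)
print(f"smallest P: {p_values.min():.5f}")
print(f"largest  P: {p_values.max():.5f}")
print(f"proportion significant at 0.05: {(p_values < 0.05).mean():.2f}")
```

Across runs, the P-value ranges from far below 0.05 to far above it, even though the underlying effect is identical every time – which is exactly why a single P-value is a poor guide to reproducibility.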

Third, with a large enough sample size we will always be able to reject the null hypothesis. No observed distribution is ever exactly consistent with the null hypothesis, and as sample size increases the likelihood of being able to reject the null increases. This means that trivial differences (for example, a difference in age of a few days) can lead to a P-value less than 0.05 in a large enough sample, despite the difference having no theoretical or practical importance.
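The same point can be demonstrated directly. In this sketch (again simulated; the tiny true difference of 0.02 SD units is an arbitrary stand-in for a “trivial” effect, like a difference in age of a few days), the effect is held fixed while the sample size grows:

```python
# A tiny true difference (0.02 SD units -- practically negligible)
# reliably reaches P < 0.05 once the sample is large enough.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_at_n = {}
for n in (100, 10_000, 1_000_000):
    a = rng.normal(0.00, 1.0, size=n)   # control group
    b = rng.normal(0.02, 1.0, size=n)   # trivially small true effect
    _, p = stats.ttest_ind(a, b)
    p_at_n[n] = p
    print(f"n = {n:>9,}  P = {p:.4g}")
```

At n = 100 the difference is invisible; at n = 1,000,000 it is overwhelmingly “significant” – yet its practical importance hasn’t changed at all.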

The last point is particularly important, and relates to two other limitations. Namely, the P-value doesn’t tell us anything about how large an effect is (i.e., the effect size), or about how precise our estimate of the effect size is. Any measurement will include a degree of error, and it’s important to know how large this is likely to be.

There are a number of things that can be done to address these limitations. One is the routine reporting of effect sizes and confidence intervals. The confidence interval is essentially a measure of the precision of our estimate of the effect size, and can be calculated at different levels of confidence. A 95% confidence interval, for example, gives a range of values constructed so that, across repeated samples, it would contain the true population effect size 95% of the time. Reporting the effect size and associated confidence interval therefore tells us both the likely magnitude of the observed effect and the degree of precision associated with that estimate. The reporting of effect sizes and confidence intervals is recommended by a number of scientific organisations, including the American Psychological Association and the International Committee of Medical Journal Editors.
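Neither quantity is hard to compute. A minimal sketch (in Python, with made-up data; the group means, SD and sample sizes are illustrative only) reports Cohen’s d and a 95% confidence interval for a mean difference alongside the usual P-value:

```python
# Report an effect size (Cohen's d) and a 95% confidence interval for a
# mean difference, alongside the usual P-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
control = rng.normal(100, 15, size=50)     # simulated scores, e.g. IQ-like scale
treatment = rng.normal(108, 15, size=50)

diff = treatment.mean() - control.mean()
pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd                # standardised effect size

# 95% CI for the mean difference, using the t distribution (df = n1 + n2 - 2)
se = pooled_sd * np.sqrt(1 / 50 + 1 / 50)
t_crit = stats.t.ppf(0.975, df=98)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

_, p = stats.ttest_ind(treatment, control)
print(f"P = {p:.4f}, d = {cohens_d:.2f}, 95% CI [{ci_low:.1f}, {ci_high:.1f}]")
```

A single line of output like this conveys magnitude and precision as well as significance – which is the whole argument for reporting it.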

How often does this happen in the best journals? Tressoldi and colleagues go on to assess the frequency with which effect sizes and confidence intervals are reported in some of the most prestigious journals, including *Science*, *Nature*, *Lancet* and *New England Journal of Medicine*. The results showed a clear split. Prestigious medical journals did reasonably well, with most selected articles reporting prospective power (*Lancet* 66%, *New England Journal of Medicine* 61%) and an effect size and associated confidence interval (*Lancet* 86%, *New England Journal of Medicine* 83%). However, non-medical journals did very poorly, with hardly any selected articles reporting prospective power (*Science* 0%, *Nature* 3%) or an effect size and associated confidence interval (*Science* 0%, *Nature* 3%). Conversely, these journals frequently (*Science* 42%, *Nature* 89%) reported P-values in the absence of any other information (such as prospective power, effect size or confidence intervals).

There are a number of reasons why we should be cautious when ranking journals according to metrics intended to reflect quality and convey a sense of prestige. One of these appears to be that many of the articles in the “best” journals neglect some simple reporting procedures for statistics. This may be for a number of reasons – editorial policy, common practices within a particular field, or article formats which encourage extreme brevity. Fortunately the situation appears to be improving – *Nature* recently introduced a methods reporting checklist for new submissions, which includes statistical power and sample size calculation. It’s not perfect (there’s no mention of effect size or confidence intervals, for example), but it’s a start…

**Reference:**

Tressoldi, P.E., Giofré, D., Sella, F. & Cumming, G. (2013). High impact = high statistical standards? Not necessarily so. *PLoS One*, e56180.

**Posted by Marcus Munafo**

Could it be possible for journals to simply require scientists to report prospective power, effect sizes, and confidence intervals? Can they just do that?

It seems like a tremendous amount of work, not to mention incredibly difficult, for scientists to compute and report these things, and it seems like even more work, and even more difficult, for scientific journals to require them. On second thought, it would be ridiculous to ask scientists and scientific journals to report these statistics. It is ridiculous…

Journals do require this – for example, APA guidelines explicitly require the reporting of effect sizes and confidence intervals.

The problem is that authors aren’t doing it, and journals and reviewers aren’t enforcing it:

http://www.ncbi.nlm.nih.gov/pubmed/21823805

It’s really no effort – standard statistical software packages can already generate what is required. It’s simply that people have been slow to adopt this better standard of reporting in certain fields.

“Journals do require this – for example, APA guidelines explicitly require the reporting of effect sizes and confidence intervals. The problem is that authors aren’t doing it, and journals and reviewers aren’t enforcing it (…)”

Ah, thank you kindly for this information! I don’t own an APA manual, but I have found one online. On page 34 it says something about reporting effect sizes and confidence intervals:

“The inclusion of confidence intervals (for estimates of parameters, for functions of parameters such as differences in means, and for effect sizes) can be an extremely effective way of reporting results. Because confidence intervals combine information on location and precision and can often be directly used to infer significance levels, they are, in general, the best reporting strategy. The use of confidence intervals is therefore strongly recommended.”

This got me thinking:

1. Does “require” equal “strongly recommend”?

2. What exactly makes a journal “scientific”?

In light of your post on effect sizes and confidence intervals:

1. If a journal requires following APA guidelines, but these guidelines in turn merely “strongly recommend” things (hence perhaps the term APA “guidelines”), where does the final responsibility lie: with the APA guidelines? The journals? The scientists?

2. It seems to me that when a journal requires the use of APA guidelines in its articles, the journal attempts to provide some “scientific” credibility. If this makes any sense, I then subsequently wonder whether the APA guidelines are defined optimally. Why don’t they use “require” instead of “recommend”, or make a clear priority list of things that are “required” and things that are “recommended”?

I also wonder whether journals give the optimal priority concerning the specific APA guidelines. Why don’t journals “require” certain APA guidelines and “recommend” other less important APA guidelines?

Altogether (and please note that I do not own an APA guideline book, or work in a scientific field, and may have misunderstood things), it seems to me that there could be (a) a diffusion of responsibilities and/or (b) a lack of clearly stated or prioritised “requirements” in the scientific publication process, which together leaves things in a mess, in my opinion. If this makes any sense, I find it curious, because it seems so easy for journals (I focus on journals because they seem to be the final step in the publication process, sort of the final quality check, if you will) to make a list of things that are “required” – not just “recommended” – to be present in their publications.