The p value guides us in judgements of statistical significance, accepting or rejecting the “null hypothesis”, but can exert excessive influence over judgements of clinical significance. Karl Marx wrote of the “fetishism of commodities” in criticising the undue influence of property on capitalism’s social processes. In much the same way, evidence-based medicine has created a “fetishism of p values”. When used to appraise the difference between two groups, the threshold against which the p value is judged is termed the “significance level”. The convention of relying on p<0.05 to indicate clinical significance is now deeply embedded in our critical appraisal of research papers, to such an extent that we can sometimes forget its true meaning.
A p value below 0.05 indicates that a difference at least as large as that observed between, say, two treatment groups would arise from “chance” less than one time in 20 if the null hypothesis were true; declaring significance on this basis carries the corresponding risk of a Type I error (signified α). This is conventionally taken to indicate clinical significance: useful as an aide memoire, but sometimes given unwarranted influence when conceptually detached from the clinical situation in which the test was conducted. p values must be interpreted in light of the clinical scenario. We misuse p values when we treat them in a purely dichotomous manner, as “significant” or “not significant”. The difference between p = 0.051 and p = 0.049 may be practically meaningless, and can only be understood with reference to the circumstances and power of the test.
When we carry out repeated independent tests on a single study group, the probability of finding a spuriously significant difference is artificially inflated. The study-wide error rate can be calculated from the number, n, of these independent tests, using the formula:

study-wide error rate = 1 − (1 − 0.05)^n
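The inflation of the study-wide error rate can be computed directly. A minimal Python sketch (the function name is ours, for illustration only):

```python
def familywise_error(n_tests, alpha=0.05):
    """Probability of at least one Type I error across n independent tests,
    each conducted at significance level alpha: 1 - (1 - alpha) ** n."""
    return 1 - (1 - alpha) ** n_tests

# With 20 independent tests at p < 0.05, the chance of at least one
# spurious "significant" result is roughly 64%.
for n in (1, 5, 10, 20):
    print(n, round(familywise_error(n), 3))
```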
It has therefore become standard practice to employ a statistical adjustment, known as the Bonferroni adjustment, to counteract the effect of multiple tests.1 This lowers the significance level applied to each individual test so that the study-wide error rate is held at 0.05:

adjusted significance level = 0.05/n
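Continuing the sketch above (again with illustrative function names of our own), the adjustment and its effect on the study-wide error rate are one line each:

```python
def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test significance level after a Bonferroni adjustment."""
    return alpha / n_tests

def familywise_error(n_tests, alpha):
    """Chance of at least one Type I error across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

# Ten tests: each must now reach p < 0.005 to be declared significant,
# which holds the study-wide error rate just under 0.05.
adjusted = bonferroni_alpha(10)
print(adjusted, round(familywise_error(10, adjusted), 4))
```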
Bonferroni adjustments are one of many different types of statistical adjustment.2 In the hands of well-informed statisticians, making such adjustments can sometimes be useful. But the trend has been to employ adjustments injudiciously. There are several arguments against Bonferroni adjustments in particular.3
TRADING TYPE I FOR TYPE II ERRORS
The Bonferroni adjustment is employed to minimise Type I errors, but does so only by increasing the probability of a Type II error: accepting the null hypothesis when the alternative is true. Sins of omission are no less than sins of commission. In a psychiatric setting if, say, we wrongly fail to accept the hypothesis that “domestic violence increases the probability of hospital admission”, the Type II error resulting from excessive statistical conservatism is no better than its opposite. Bonferroni adjustments are not a sign of judicious statistical caution, but simply a method of reducing Type I errors by increasing Type II errors.
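That trade can be quantified. Under a normal approximation for a two-sided z-test (the numbers below are illustrative, not drawn from this article), lowering the per-test significance level from 0.05 to a Bonferroni-adjusted 0.005 sharply cuts statistical power, i.e. inflates the Type II error rate:

```python
from statistics import NormalDist

def power_two_sided_z(effect_in_se, alpha):
    """Approximate power of a two-sided z-test when the true effect lies
    `effect_in_se` standard errors from zero (ignoring the far tail)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 - NormalDist().cdf(z_crit - effect_in_se)

# A true effect 3 standard errors from zero, tested among 10 comparisons:
unadjusted = power_two_sided_z(3.0, 0.05)        # about 0.85
bonferroni = power_two_sided_z(3.0, 0.05 / 10)   # about 0.58
print(round(unadjusted, 2), round(bonferroni, 2))
```

The same effect that would be detected 85 times in 100 at the conventional level is missed almost half the time after adjustment for ten tests.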
WHAT GOES INTO THE POT?
In determining the number, n, of independent tests, it is common to count only the tests reported in the published paper. But this neglects the hidden layers of testing behind that paper: statistical tests may have been carried out but not published, and tests may have been conducted in earlier studies on which the publication in question depends. To be theoretically “pure”, we should consider error rates well beyond the immediate study. Counting only the published tests, the Bonferroni adjustment applies no more than a veneer of authenticity.
WHAT IS THE CORRECT “NULL HYPOTHESIS”?
The Bonferroni adjustment attempts to control the study-wide error rate across a wide range of independent tests. If significance is detected, and the “null hypothesis” seemingly rejected, this leaves us ignorant as to which of the individual tests are significant and which are not. We can only conclude that the “universal” null hypothesis—that none of the individual differences is real—is rejected. But knowing merely that some of the individual tests are significant in some way is of no clinical interest. Clinical interest is served by knowing which test is significant and in what way.
ARE BONFERRONI ADJUSTMENTS EVER JUSTIFIED?
Bonferroni adjustments were developed by Neyman and Pearson4 in the 1920s as a means of enhancing decisions in recurring and repetitive circumstances. As a means of improving decision making through statistical inference, Bonferroni adjustments have a role to play. But they are easily misapplied in biomedical research, which operates within a distinct paradigm. Bonferroni adjustments are relevant to us only where the universal null hypothesis is of greater interest than individual hypotheses concerning individual independent tests. This is particularly true of hypothesis-generating, rather than hypothesis-testing, research. Even in that instance, there is a need to tether interpretation to clinically relevant information, and to consider the implications for Type II as much as for Type I error rates.
There are alternative multiple-test procedures to the Bonferroni method2 which overcome some, but not all, of the difficulties described above. These include the Holm method,5 which is, however, inaccessible to most biomedical researchers. The choice of which method, if any, to employ to adjust for multiple statistical tests needs to be made judiciously, with knowledge of what p values actually represent.
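Though unfamiliar, the Holm method5 is simple to compute: order the p values from smallest to largest and compare the kth smallest against 0.05/(n − k + 1), stopping at the first failure. A minimal sketch (the p values are invented for illustration):

```python
def holm(p_values, alpha=0.05):
    """Holm step-down procedure: returns, for each p value in its
    original order, whether that null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The (rank+1)-th smallest p value is tested against alpha / (m - rank).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one comparison fails, all larger p values fail too
    return reject

print(holm([0.01, 0.04, 0.03, 0.005]))  # [True, False, False, True]
```

Because its thresholds relax as hypotheses are rejected, Holm never rejects fewer hypotheses than a plain Bonferroni adjustment at the same study-wide error rate.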
p values communicate a specific probability that must be appraised in its clinical context; p value fetishism distracts from that underlying meaning. p<0.05 is a useful aide memoire, but we should know its limitations, and it is preferable to report an estimated difference with a confidence interval rather than a p value alone. The fetishism finds its greatest expression in the Bonferroni adjustment for repeated tests. Bonferroni adjustments were created to inform decision making, and have sometimes been misapplied to biomedical research. Alternatives to Bonferroni adjustments are available but, whenever adjustments have been made, it is desirable to report both adjusted and non-adjusted analyses.
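Reporting an estimated difference with its confidence interval requires nothing more than summary statistics. A sketch using made-up numbers and a normal approximation (not data from any study):

```python
import math

# Hypothetical summary data for two treatment groups.
mean_a, sd_a, n_a = 12.0, 4.0, 50
mean_b, sd_b, n_b = 10.0, 4.5, 50

diff = mean_a - mean_b
se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
lower, upper = diff - 1.96 * se, diff + 1.96 * se

# "Difference 2.0 (95% CI 0.3 to 3.7)" conveys both the size of the effect
# and the precision of the estimate; "p < 0.05" conveys neither.
print(f"difference {diff:.1f} (95% CI {lower:.1f} to {upper:.1f})")
```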