Competing interests None declared.
Provenance and peer review Not commissioned; internally peer reviewed.
One of the most frequent clinical decisions is the selection of the most appropriate treatment from a number of options.1 Network meta-analysis is probably the best statistical tool we have to answer this question because it allows for estimation of comparative efficacy and ranking interventions even if they have not been investigated head to head in randomised controlled trials.2 Clinicians, however, should be particularly careful when appraising network meta-analysis and should usually avoid simple conclusions, as evidence-based practice is not ‘cookbook’ medicine.3 4 Network meta-analysis is quickly gaining popularity in the literature, but quality is variable.5 Several tools have been developed to evaluate the extent to which the findings from a network meta-analysis would be valid and useful for decision-making.6 However, applying these tools is a time-consuming task and often requires specific expertise. Clinicians have little time for critical appraisal and so need to understand the key elements that help them select network meta-analyses that deserve further attention, optimising time and resources. In this paper, we propose a practical framework to assess the methodological robustness and reliability of results from network meta-analysis. A brief description of the key terms used throughout the paper is available in table 1.
We selected a network meta-analysis about drug treatments for generalised anxiety disorder (GAD),7 which was published in 2011 in the British Medical Journal and has been previously used as a working example in another paper.8 Twenty-seven randomised controlled trials were included in this review, which provided outcome data on 10 competing treatments (duloxetine, escitalopram, fluoxetine, lorazepam, paroxetine, pregabalin, quetiapine, sertraline, tiagabine, venlafaxine) and placebo. Network meta-analysis was used to synthesise the available data, with both a Bayesian hierarchical model and a frequentist meta-regression approach. Three clinical outcomes were considered: response and remission for efficacy, and dropout rate due to adverse events as a measure of tolerability. Results were reported as ORs and the relative ranking of interventions was based on the probability for each treatment of being the best. Fluoxetine emerged as the most efficacious treatment (62.9% probability of being best), while sertraline ranked first in terms of tolerability (49.3% probability of being best). We reanalysed the 27 studies included in this network following the methods reported in the original article and compared our findings with the published results (details on the statistical model can be found in the online supplementary web appendix). To illustrate how inappropriate methodological approaches and suboptimal presentation of results can affect conclusions from network meta-analysis, we have divided our paper into three sections, according to the specific issues that should always be checked. For the purpose of this article, we present and discuss here only some of the findings from our analyses, but full results are reported in the online supplementary web appendix.
Understanding the evidence base
Which is the network of treatments?
As for all systematic reviews and meta-analyses, the construction of the evidence base for a network meta-analysis should derive from a coherent and clear set of inclusion criteria, which define the competing interventions, the study characteristics and the patient population. It is worth noting that, for the same clinical question, researchers might be interested in comparing all the available interventions for the condition under investigation or only a subset of them. Interventions that are not of direct interest for the clinical question (eg, non-licensed drugs or placebo) can be added to the network to increase the amount of data and provide additional indirect evidence. In theory, both approaches are equally methodologically valid, but different interventions under consideration mean different networks of treatments and, therefore, potentially different results.9 To help reduce the potential for selection bias and the risk of incorrect results, a prespecified rationale for including or excluding interventions from the network should be reported in the study protocol, which should be made available (as URL link or web-only supplementary material). In the GAD network meta-analysis, we first checked the selected interventions and the included studies, to assess whether they were appropriate to answer the review question. The paper reported that ‘in this systematic review we compared the efficacy and tolerability of all drug treatments for generalised anxiety disorder by combining data from published randomised controlled trials. We also carried out a subanalysis comparing the five drugs currently licensed for generalised anxiety disorder in the United Kingdom (duloxetine, escitalopram, paroxetine, pregabalin, and venlafaxine)… 46 trials met the inclusion criteria, but only 27 contained sufficient or appropriate data to be included in the analysis’. 
It is not clear, though, what was meant by ‘sufficient or appropriate data’ and why some interventions such as alprazolam and buspirone were not incorporated in the analyses. These two drugs were used at a therapeutic dose in two three-arm studies that were included in the network (see references 24 and 35 in Table A of the Supplementary web material in Baldwin et al 7); however, the alprazolam and buspirone arms were excluded from the analyses without any justification. The availability of a review protocol would have clarified the selection of interventions and reduced the risk of selection bias.
Is the transitivity assumption likely to hold?
The synthesis of studies making a direct comparison of two treatments makes sense only when the studies are sufficiently similar in important clinical and methodological characteristics (eg, severity of illness at baseline, treatment dose, sample size and study quality; these are the so-called effect modifiers). Similarly, for an indirect comparison (such as A vs B) to be valid, it is necessary that the sets of direct comparisons (A vs C and B vs C) are similar in their distributions of effect modifiers. Only when this is the case can we assume that the intervention effects are transitive (ie, the effect of A vs B can be obtained by subtracting the effect of B vs C from the effect of A vs C).10 Transitivity can be viewed as the extension of clinical and methodological homogeneity to comparisons across groups of studies that compare treatments.2 In a network meta-analysis, the inclusion criteria should be wide enough to enable generalisation of the findings, but also sufficiently narrow to ensure the plausibility of the transitivity assumption. Transitivity can be assessed statistically by comparing the distribution of effect modifiers across the available direct comparisons when there are sufficient data,11 but inference on its plausibility should also be based on the clinical understanding of the evidence. For this reason, a clear and transparent presentation of the inclusion criteria is necessary in all network meta-analyses. This information should be reported in the Methods section of the paper, and all the primary studies should be described in detail in the Results section.12
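The subtraction that underlies an indirect comparison can be written out in a few lines. The sketch below is a minimal illustration of an adjusted indirect comparison through a common comparator C, assuming the two direct estimates are independent log ORs with known standard errors. The function name and all input values are hypothetical and are not taken from the GAD data set.

```python
import math

def indirect_comparison(log_or_ac, se_ac, log_or_bc, se_bc, crit=1.96):
    """Indirect estimate of A vs B from A vs C and B vs C (log OR scale).

    Assumes transitivity holds and the two direct estimates are independent,
    so their variances add.
    """
    log_or_ab = log_or_ac - log_or_bc          # the subtraction equation
    se_ab = math.sqrt(se_ac ** 2 + se_bc ** 2)  # uncertainty accumulates
    ci = (math.exp(log_or_ab - crit * se_ab),
          math.exp(log_or_ab + crit * se_ab))
    return math.exp(log_or_ab), se_ab, ci

# Hypothetical inputs: A vs C has OR 2.0, B vs C has OR 1.5.
or_ab, se_ab, ci_ab = indirect_comparison(math.log(2.0), 0.20,
                                          math.log(1.5), 0.25)
print(f"indirect OR A vs B = {or_ab:.2f}, "
      f"95% CI {ci_ab[0]:.2f} to {ci_ab[1]:.2f}")
```

Because the variances add, the indirect estimate is always less precise than either of the direct estimates it is built from. This is one reason why a treatment connected to the network by a single small trial carries so much uncertainty.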
In our working example, the systematic review found only one randomised controlled trial of fluoxetine. This was a three-arm study, which compared fluoxetine with venlafaxine and placebo.13 It was actually an analysis of a subgroup of patients from another randomised trial,14 in whom the formal assessment of GAD in comorbidity with major depression was made retrospectively (thus increasing the risk of selection bias) and for whom the randomisation was not stratified according to the comorbid diagnosis of GAD (stratified randomisation is usually employed in a trial to achieve approximate balance of patients with two or more characteristics that may influence the clinical outcome, without sacrificing the advantages of randomisation). Because its design was not properly randomised, the rationale for including this study in the network was questionable. Moreover, it probably violated the transitivity assumption, as its population was likely to be substantially different from those of the other included studies, which excluded patients with any current and primary Diagnostic and Statistical Manual of Mental Disorders-IV Axis I diagnosis other than GAD, including major depressive disorder, within the previous 6 months. To assess how much the fluoxetine study influenced the relative effects and ranking of treatments, we evaluated the contribution of each direct comparison in the network15 and found that, for the primary efficacy outcome, this study contributed 9% of the total amount of information and more than 30% of the information on the relative effects of all interventions versus fluoxetine (figure 1). As this study was the only trial providing evidence about fluoxetine, the superiority of this drug over other interventions is at least doubtful, and the validity of the findings for the whole network is called into question.
Checking the statistical analysis
Has consistency been assessed properly?
There are several statistical approaches for carrying out a network meta-analysis (eg, hierarchical models,16 meta-regression models17), and all of them, when the underlying assumptions hold, yield comparable results. All existing statistical models for network meta-analysis are based on the integration of the direct and all possible indirect estimates (ie, ‘mixed evidence’), assuming that the different sources of evidence (direct and indirect) are in agreement within the treatment network; this is the so-called consistency assumption.2 Consistency is the statistical manifestation of transitivity, and the validity of the findings from network meta-analysis is highly dependent on the plausibility of this assumption. Statistical inconsistency is inextricably connected to statistical heterogeneity and both need to be explicitly evaluated in a network meta-analysis. Inconsistency should be low enough to ensure validity of the results and heterogeneity should be low enough to make the results relevant to a clinical population of interest. Inconsistency can occur in one in eight networks6 and inconsistency models are used to evaluate it in a network meta-analysis.18 Despite its fundamental importance, however, researchers often assess the consistency assumption, if at all, using inappropriate methods.5
In our working example, it is reported that the authors ‘tested the validity of the mixed treatment model by comparing the consistency of results between the mixed treatment meta-analyses and the direct comparison meta-analyses.’ This is not a valid method for assessing inconsistency because the network estimates (mixed treatment meta-analyses) are a combination of the direct (direct comparison meta-analyses) and indirect estimates and consequently they are not expected to differ much, even in the presence of substantial inconsistency. We applied two tests in our reanalysis, the design-by-treatment interaction test18 (p value=0.85) and the loop-specific approach19 (figure 2), and neither revealed statistically significant inconsistency; however, the large uncertainty in the estimation of the ratio of ORs (RORs) between direct and indirect estimates in three out of six loops would require further exploration (figure 2). It is probably not by chance that the loop including the Silverstone study (fluoxetine, placebo, venlafaxine) had the largest upper CI limit (ROR=2.51, 95% CI 1.00 to 11.43), and this reinforced our concerns about the inclusion of this study in the network.
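The idea behind a loop-specific check can be illustrated by comparing the direct estimate for one comparison with the indirect estimate obtained from the rest of the loop. The sketch below is a simplified illustration with hypothetical numbers, assuming independent direct and indirect log ORs and a normal approximation. It is not the exact procedure used in our reanalysis.

```python
import math
from statistics import NormalDist

def loop_inconsistency(log_or_direct, se_direct,
                       log_or_indirect, se_indirect, level=0.95):
    """Ratio of ORs (ROR) between direct and indirect evidence in one loop."""
    diff = log_or_direct - log_or_indirect             # inconsistency factor
    se = math.sqrt(se_direct ** 2 + se_indirect ** 2)  # independent sources
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    ci = (math.exp(diff - crit * se), math.exp(diff + crit * se))
    return {"ror": math.exp(diff), "ci": ci, "p": p_value}

# Hypothetical loop: direct and indirect estimates differ, but both are imprecise.
res = loop_inconsistency(0.9, 0.4, 0.2, 0.5)
print(f"ROR = {res['ror']:.2f}, 95% CI {res['ci'][0]:.2f} to "
      f"{res['ci'][1]:.2f}, p = {res['p']:.2f}")
```

In this hypothetical loop the test is not statistically significant, yet the interval around the ROR is wide enough to be compatible with substantial disagreement. A non-significant result therefore shows that inconsistency was not detected, not that it is absent.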
Checking the reporting of findings
How was the relative ranking of treatments estimated?
Although it is often challenging, presenting the results of network meta-analysis in a way readers can understand should be the norm. From a decision-making point of view, the most important clinical output of network meta-analysis is the set of relative effects between all pairs of interventions, which can be reported in a league table20 or by using other graphical displays.21 According to our reanalysis, in terms of response, all drugs except fluoxetine, quetiapine and paroxetine appeared to be statistically significantly more effective than placebo (figure 3). However, no important differences existed between the 10 interventions in terms of efficacy (table 2), and the prediction intervals in figure 3 indicated that only for lorazepam, tiagabine and venlafaxine was the estimated heterogeneity (τ=0.26 (0.00, 0.53)) small enough to suggest a possible beneficial effect in a future study.22
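The difference between a CI and a prediction interval can be shown with a short calculation. The sketch below uses the heterogeneity estimate reported above (τ=0.26) together with a hypothetical summary log OR and standard error. It assumes a normal critical value of 1.96 for simplicity, although a t quantile is commonly used in practice, and the function names are illustrative only.

```python
import math

def confidence_interval(log_or, se, crit=1.96):
    """Interval for the mean effect across studies (OR scale)."""
    return (math.exp(log_or - crit * se), math.exp(log_or + crit * se))

def prediction_interval(log_or, se, tau, crit=1.96):
    """Interval for the effect expected in a new study (OR scale).

    Adds the between-study variance tau^2 to the variance of the mean.
    """
    half_width = crit * math.sqrt(se ** 2 + tau ** 2)
    return (math.exp(log_or - half_width), math.exp(log_or + half_width))

# Hypothetical drug vs placebo: OR 1.5 with SE 0.15 on the log scale.
log_or, se, tau = math.log(1.5), 0.15, 0.26
ci = confidence_interval(log_or, se)
pi = prediction_interval(log_or, se, tau)
print(f"95% CI {ci[0]:.2f} to {ci[1]:.2f}")
print(f"95% prediction interval {pi[0]:.2f} to {pi[1]:.2f}")
```

With these inputs the CI excludes an OR of 1 while the prediction interval includes it. A drug can therefore be statistically significantly better than placebo on average and still have an uncertain effect in a future study.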
The relative ranking of treatments is often used because it offers a concise summary of the findings. The hierarchy of competing interventions should be based on the network meta-analysis estimates presented in the Results section. The most popular ranking approach is based on the ‘ranking probabilities’, that is, the probabilities for each treatment to be placed at a specific ranking position (best treatment, second best, third best and so on) in comparison with all other treatments in the network.23 Like other measures of relative effect, ranking probabilities have a degree of uncertainty.7 Alternative measures that incorporate the entire distribution of the ranking probabilities include the mean (or median) ranks21 and the surface under the cumulative ranking curves (SUCRA)23 (see table 1 for a definition and description). By contrast, other approaches ignore the uncertainty in the relative ranking and focus only on the first position (like the ‘probability of being the best’). This way of ranking treatments can lead to misleading conclusions, because the probability of being the best does not account for the uncertainty in the estimate and can spuriously give higher ranks, especially to treatments with little evidence available. This was the case in the original analysis of the data set in our example,7 where fluoxetine emerged as the most effective treatment for both response (62.9%) and remission (60.6%). However, it is very likely that the advantage of fluoxetine in the hierarchy was the consequence of using an inappropriate ranking measure that did not properly account for statistical uncertainty. In our reanalysis, we ranked the treatments using the SUCRA percentages and found different results: lorazepam ranked first in terms of response (75.8%) and escitalopram for remission (80.8%) (table 3). Fluoxetine was still among the drugs with a potentially better efficacy profile; however, the small differences in the SUCRA values suggest that the most sensible conclusion would have been that the uncertainty around the hierarchy of treatments is too large to support a firm ranking (see online supplementary web appendix for full details).
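The calculation behind SUCRA and the mean rank can be sketched from a set of ranking probabilities. In the example below the probabilities are hypothetical, chosen only to show how a treatment can have the highest probability of being the best and still a lower SUCRA. The function names are illustrative and the values do not come from table 3.

```python
def sucra(rank_probs):
    """SUCRA for one treatment.

    rank_probs[k] is the probability of occupying rank k+1 (rank 1 = best);
    the probabilities must sum to 1 across the ranks.
    """
    n_treatments = len(rank_probs)
    cumulative, area = 0.0, 0.0
    for p in rank_probs[:-1]:      # the last cumulative value is always 1
        cumulative += p
        area += cumulative
    return area / (n_treatments - 1)

def mean_rank(rank_probs):
    """Expected rank (1 = best)."""
    return sum((k + 1) * p for k, p in enumerate(rank_probs))

# Hypothetical ranking probabilities for three treatments.
ranking = {
    "imprecise drug": [0.45, 0.05, 0.50],   # often best, but also often worst
    "precise drug":   [0.35, 0.50, 0.15],
    "third drug":     [0.20, 0.45, 0.35],
}
for name, probs in ranking.items():
    print(f"{name}: P(best) = {probs[0]:.2f}, "
          f"mean rank = {mean_rank(probs):.2f}, SUCRA = {sucra(probs):.3f}")
```

Here the imprecise drug has the highest probability of being the best, but the precise drug has the higher SUCRA because it is rarely ranked last. SUCRA is a rescaling of the mean rank, so the two measures always give the same ordering.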
The validity of the results from network meta-analysis depends on the plausibility of the transitivity assumption. As in pairwise meta-analysis, the risk of bias introduced by limitations of individual studies must be considered first, and judgement should be used to assess the plausibility of transitivity. Possible effect modifiers can be clinical (similarity in patients’ characteristics, interventions, settings, length of follow-up, outcomes) or methodological (similarity in study design and risk of bias). Sometimes, differences in the distribution of these moderators across studies are large enough to make network meta-analysis invalid. Inconsistency exists when treatment effects from direct and indirect evidence are in disagreement. Unlike transitivity, inconsistency can always be evaluated statistically. Network meta-analyses should describe in the protocol a clear strategy to deal with inconsistency, and any inconsistency found should always be checked for errors at the raw data level.
Clinicians usually want to know the preferential order of treatments that could be prescribed to an average patient. In this paper, we demonstrated that rankings could be misleading if based on the probability of being the best (see online supplementary web appendix box 1 for a hypothetical example showing how an imprecise estimate with a large variance can lead to wrong conclusions if this method is used). In a properly conducted network meta-analysis, ranking measures and probabilities are a convenient way to present results and the corresponding hierarchy of treatments. Good rankings, however, do not necessarily imply large or clinically important differences. Despite the ease of presentation, ranking measures should be presented and interpreted only in light of the estimated relative treatment effects. Clinicians should always be interested in the effect sizes and look at the SUCRAs (together with their degree of uncertainty) rather than the naive rankings.
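A small simulation shows how an imprecise estimate can distort the probability of being the best. The sketch below is our own hypothetical illustration, not the example in the web appendix. It assumes three treatments whose effects versus placebo follow independent normal distributions on a scale where larger values are better. All means, standard deviations, the seed and the number of draws are arbitrary choices.

```python
import random

def rank_summary(effects, n_draws=20000, seed=1):
    """Estimate P(best) and SUCRA by sampling from each effect distribution.

    effects maps a treatment name to (mean, sd); larger draws are better.
    """
    rng = random.Random(seed)
    names = list(effects)
    n_treatments = len(names)
    counts = {name: [0] * n_treatments for name in names}
    for _ in range(n_draws):
        draws = {name: rng.gauss(mean, sd)
                 for name, (mean, sd) in effects.items()}
        ordered = sorted(names, key=draws.get, reverse=True)
        for position, name in enumerate(ordered):
            counts[name][position] += 1
    summary = {}
    for name in names:
        probs = [c / n_draws for c in counts[name]]
        mean_rank = sum((k + 1) * p for k, p in enumerate(probs))
        summary[name] = {
            "p_best": probs[0],
            "sucra": (n_treatments - mean_rank) / (n_treatments - 1),
        }
    return summary

# Hypothetical effects: C has the smallest mean but by far the largest variance.
effects = {"A": (0.50, 0.10), "B": (0.45, 0.10), "C": (0.40, 1.00)}
summary = rank_summary(effects)
for name, s in summary.items():
    print(f"{name}: P(best) = {s['p_best']:.2f}, SUCRA = {s['sucra']:.2f}")
```

Under these assumptions treatment C obtains the highest probability of being the best even though its mean effect is the smallest, simply because its wide distribution often produces extreme draws. SUCRA, which also counts how often C is ranked last, places A ahead of C.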
Further implications for researchers and journal editors
The methodological approach can affect the magnitude of the estimated effect associated with an intervention and consequently materially change the study findings.24 Researchers should recognise the complexity of conducting a high-quality network meta-analysis, which requires a multidisciplinary team with clinical and technical expertise to adequately cover each step of the research project, including skills in literature search, data extraction and statistical analysis.25
As for all systematic reviews and standard meta-analyses, results from network meta-analyses should be replicable. Published papers must include all of the information that readers need to completely understand how the study was conducted, independently assess the validity of the analyses and reach their own interpretations.26 The availability of the review protocol and the code for statistical analyses should soon become a mandatory requirement for all network meta-analyses (mostly needed for the peer review process), as it is important to avoid misconceptions regarding the analysis undertaken.