• Open access
  • Published: 10 October 2023

Clinical systematic reviews – a brief overview

  • Mayura Thilanka Iddagoda 1,2 &
  • Leon Flicker 1,2

BMC Medical Research Methodology, volume 23, Article number: 226 (2023)


Systematic reviews answer research questions through a defined methodology. Conducting one is a complex task, and a wide range of knowledge, usually spread across multiple articles, is required. The aim of this article is to bring the whole process together in a single paper.

The authors examine the statistical concepts and the sequence of steps needed to conduct a systematic review or meta-analysis.

The process of conducting a clinical systematic review is described in seven manageable steps in this article. Each step is explained with examples so that the method can be clearly understood.

The complex process of conducting a systematic review is presented simply in a single article.


Systematic reviews are a structured approach to answer a research question based on all suitable available empirical evidence. The statistical methodology used to synthesize results in such a review is called ‘meta-analysis’. There are five types of clinical systematic reviews described in this article (see Fig. 1 ), including intervention, diagnostic test accuracy, prognostic, methodological and qualitative. This review will provide a very brief overview in a narrative fashion. This article does not cover systematic reviews of more epidemiologically based studies. The recommended process undertaken in a systematic review is described under seven steps in this paper [ 1 ].

Fig. 1 Types of systematic reviews

There are resources for those who are just beginning as well as for those gaining more expertise (see Table 1). Cochrane conducts online interactive master classes on systematic reviews throughout the year, and there are web tutorials in the form of e-learning modules. Some Cochrane groups commission a limited number of systematic reviews and can be contacted directly for support ([email protected]). Some institutions have systematic review training programs, including Johns Hopkins (Coursera), the Joanna Briggs Institute (JBI education), Yale University (search strategy), the University of York (Centre for Reviews and Dissemination) and the Mayo Clinic Libraries. The BMC Systematic Reviews group has also introduced a “Peer review mentoring” program to support early-career researchers in systematic reviews. The local university or hospital librarian is usually a good first point of reference for searches and can direct reviewers to other support.

Research question and study protocol

A clearly defined study question is vital and will direct the following steps in a systematic review. The question should have some novelty (e.g. there should be no existing review, unless new primary studies have since become available) and be of interest to the reviewers. Major conflicts of interest can be problematic (e.g. employment by a company that manufactures the intervention). The primary components of a research question should include inclusion criteria, search strategy, analysis or outcome measures, and interpretation. The type of review determines the category of research question, such as intervention, prognostic or diagnostic [ 1 ].

The study protocol elaborates the research question. The language of the protocol is important: it is usually written in the future tense, in accessible language, in the active voice and in full sentences [ 2 ]. The structure of the review protocol is described in Fig. 2.

Fig. 2 Structure of the review protocol

Searching studies

The comprehensive search for eligible studies is the most defining step in a systematic review. Guidance from an information specialist or an experienced librarian is a key requirement for designing a thorough search strategy [ 3 , 4 ].

The search strategy should explore multiple sources rigorously and it should be reproducible. It is important to balance sensitivity and precision in designing a search plan. A sensitive approach will provide a large number of studies, which lowers the risk of missing relevant studies but may produce a large workload. On the other hand, a focused search (precision) will give a more manageable number of studies but increases the risk of missing studies.
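These two properties can be stated more formally. As a reminder of the standard information-retrieval definitions (not given explicitly in the article):

$$\text{Sensitivity} = \frac{\text{relevant studies retrieved}}{\text{all relevant studies that exist}}, \qquad \text{Precision} = \frac{\text{relevant studies retrieved}}{\text{all studies retrieved}}$$

A broad search maximizes sensitivity at the cost of precision, while a narrow search does the reverse.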

There are multiple sources to search for eligible studies in a systematic review or meta-analysis. The key databases are CENTRAL (the Cochrane Central Register of Controlled Trials), MEDLINE (PubMed) and Embase. There are many other databases, published reviews and reference lists that may be used. Forward citation tracking can be done for identified studies using citation indices such as Google Scholar, Scopus or Web of Science. There may also be studies reported to governmental and non-governmental organizations that are not handled by commercial publishers; these are called ‘grey literature’. Extensive investigation of different sources is required to identify grey literature, and information specialists are helpful in finding these studies [ 2 ].

Designing the search strategy requires a structured approach. Again, assistance from a librarian or an information specialist is recommended. PICO, PICOS and PICOTS elements are used to frame the key concepts. Participants and study design are relevant elements in all reviews. Intervention reviews require specification of the intervention’s exact nature, and outcomes are important for both intervention and prognostic reviews.

Search terms are then developed from the key concepts. There are two main types of search term: text words and index terms. Text words, or natural-language terms, appear in most publications, and different authors may use different text words for the same pathology; for example, injury, wound and trauma are all used to describe physical damage to the body. Index terms, on the other hand, are controlled vocabularies defined by database indexers [ 4 ]. Common vocabularies are MeSH (Medical Subject Headings) in MEDLINE and Emtree in Embase. Index terms do not change with the interface (e.g. the MeSH term ‘Wounds and Injuries’ covers all types of damage to the body from external causes) [ 5 ].

Search filters are pre-tested combinations of search terms used to restrict retrieval, for example to particular study designs; the choice of filter depends on the study design, database and interface. Specific words called ‘Boolean operators’ are used to combine search terms. The main Boolean operators are ‘OR’, which broadens the search (accidents OR falls retrieves studies containing either term), and ‘AND’, which narrows the search (accidents AND falls retrieves only studies containing both terms). In a standard search strategy, all terms within a key concept are combined with ‘OR’, and the concepts are then combined with ‘AND’.
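As an illustration only (an invented PubMed-style example, not a search strategy from the article), a search for falls in older adults might combine terms within each concept using ‘OR’ and then join the concepts with ‘AND’:

```text
#1  "Accidental Falls"[MeSH] OR fall*[tiab]
#2  "Aged"[MeSH] OR elderly[tiab] OR "older adults"[tiab]
#3  #1 AND #2
```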

Limits and restrictions are used in the search strategy to improve precision. Common restrictions are language selections, publication date limits and format boundaries. These limits may result in missing relevant studies, so it is good practice to explain the reason for any restriction in the search strategy. It is also important to be aware of errors and retractions in selected studies; information specialists can add terms to remove such studies during the search. The final step is piloting the search strategy, which gives an opportunity to adjust it for optimal sensitivity and precision [ 6 ].

All systematic reviews require consistent management of the retrieved studies, and it is challenging to manage a large number of studies manually. Reference management software can merge all search results, remove duplicates, record the number of studies selected at each step, store the methodology and selection criteria, and export selected studies to analysis software. Specific platforms and software packages are extremely useful and can save time and effort in navigating the search and compiling the appropriate data. Many packages are available for systematic review reference management, including Covidence, Abstrackr, CADIMA, SUMARI and DistillerSR.

Throughout the search process, documentation is crucial. The search criteria and strategy, the total number of studies at each step, the databases and non-database sources searched, and copies of internet results are important records. If the search is more than 12 months old, it is advisable to re-run it to minimize the risk of missing recent studies [ 2 , 6 ].

Selecting studies

The studies retrieved by the search are then screened and selected for the synthesis. The number of studies at each stage of the selection process needs to be documented, and the PRISMA flow diagram (Fig. 3) can be used to report the selection process [ 7 ].

Fig. 3 PRISMA flow diagram for the systematic review study selection process

During the selection process, it is important to minimize bias. This can be achieved by measures such as having a pre-planned written review protocol with inclusion and exclusion criteria, adding study design as an inclusion criterion, and independent study selection by at least two researchers. Items to consider in collecting data are source, eligibility, methods, outcomes and results. Outcomes should be based on what is important to patients, not simply what researchers have decided to measure. Other items of interest are bibliographic information and references to other relevant studies. The most important decisions for the entire review are whether individual studies will be included or excluded from subsequent analyses; this may be the major determinant of the final composite results of the review. It is important to resolve any discrepancies in individual judgements by reviewers as objectively as possible, always remembering that individuals may by nature be “lumpers” or “splitters” (Darwin, Charles (1 August 1857). “Letter no. 2130”. Darwin Correspondence Project).

Once the items to collect are decided, data extraction forms can be used to collect data for the review. The extraction form can be set up on paper, as an electronic copy (Word, Excel or PDF format) or in a database within specific software (e.g. Covidence, EPPI-Reviewer). All recordable outcome measures are collected for optimal analysis. Almost invariably, some included studies will not provide usable data for extraction; these challenges are managed as shown in Table 2.

It is important to be polite and clear when contacting authors. Imputing missing data carries a risk of error, and it is best to obtain as much information as possible from the relevant authors. Different data categories are used to report outcomes in research studies; Table 3 summarizes common data types with some examples [ 2 ].

Study quality and bias

The results will not represent accurate evidence when a study is biased, and such poor-quality studies introduce bias into a systematic review. The risk of bias is decreased, and a study’s quality improved, by clear-cut randomization, outcome data on all participants (i.e. complete follow-up) and blinding (of both participants and outcome assessors) [ 2 , 8 ].

The Cochrane risk of bias tool (RoB) [ 9 ] can be used to assess risk of bias in randomized controlled trials (RCTs). For non-randomized studies of interventions (NRSI), tools such as the Newcastle–Ottawa Scale [ 10 ], ROBINS-I [ 11 ] and the Downs and Black checklist [ 12 ] can be used. The bias domains for RCTs and NRSI are shown in Table 4.

Blinding and masking can minimize bias arising from deviations from the intended interventions. Missing outcome data, or attrition due to issues such as participant withdrawal, loss to follow-up and lost data, are also common causes of bias. Researchers use imputation to address missing data, which can lead to over- or underestimation of intervention effects; sensitivity analysis can be conducted to investigate the effect of such assumptions. Selective reporting is another problem; it is difficult to identify, but sources such as clinical trial registries and published trial protocols can be used to detect such discrepancies.

Data analysis

Analysis of data is crucial in a systematic review, and important aspects of this step are described below [ 2 , 13 ].

  • Effect measure

Outcome data from the selected studies will be reported in different measures. It is important to select a comparable effect measure across all studies for a particular outcome, to allow synthesis of an overall effect measure. Common effect measures for dichotomous outcomes are risk ratios (RR), odds ratios (OR) and risk differences (absolute risk reduction, ARR). These measures are selected for the analysis based on their consistency, mathematical properties and ease of communication. For diagnostic test accuracy (DTA) reviews, sensitivity and specificity are commonly used.
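For a dichotomous outcome summarized in a 2 × 2 table, with a events and b non-events in the intervention group and c events and d non-events in the control group, the standard definitions (shown here as a reminder; the article itself does not give the formulas) are:

$$RR = \frac{a/(a+b)}{c/(c+d)}, \qquad OR = \frac{a/b}{c/d} = \frac{ad}{bc}, \qquad ARR = \frac{c}{c+d} - \frac{a}{a+b}$$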

The mean difference (MD) is the commonest effect measure for continuous outcome data. When interpreting the MD, report details such as the size of the difference, the nature of the outcome (beneficial or harmful) and the characteristics of the scale, to aid understanding of the results. However, studies in the review may not use the same scales and standardization of results may be required. The standardized mean difference (SMD) can be calculated in such situations, provided the studies measure the same underlying concept. The SMD is expressed in units of standard deviation (SD). It is important to correct the direction of the scales before combining them. All outcome data should be reported along with a measure of uncertainty such as a confidence interval (CI).
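For a single study comparing intervention and control groups, the usual definitions (a sketch of the standard formulas, not reproduced from the article) are:

$$MD = \bar{x}_{\text{int}} - \bar{x}_{\text{ctrl}}, \qquad SMD = \frac{\bar{x}_{\text{int}} - \bar{x}_{\text{ctrl}}}{SD_{\text{pooled}}}$$

where $SD_{\text{pooled}}$ is the pooled standard deviation of the two groups.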

Studies may report endpoint (final value) data or change-from-baseline data. Although it is possible to combine the two types of data when using the MD, SMD calculations are inaccurate in such situations. It is also good practice to conduct sensitivity analyses to assess the robustness of the choices made.

Meta-analysis

There are many advantages to performing a meta-analysis. It combines samples and provides a more precise quantitative answer to the study question. Study quality, comparability of data and data formats affect the output of the meta-analysis. The accepted steps in meta-analysis are described in Table 5.

  • Heterogeneity

Variation across studies, more than expected by chance, is called heterogeneity. Although there are several types of heterogeneity such as clinical (variations in population and interventions), methodological (differences in designs and outcomes) and statistical (variable measure of effects), statistical heterogeneity is the most important type to discuss in meta-analysis [ 2 , 14 , 15 ].

The heterogeneity assumptions affect data analysis. Two models, described in Fig. 4, are used to account for heterogeneity. Tau is the standard deviation of the true effects across studies, and Tau² is their variance. If heterogeneity is minimal, Tau² is close to zero and the study weights estimated by the two models are similar.
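In outline, and assuming the usual inverse-variance framework (the figure itself is not reproduced here), the two models differ only in how studies are weighted:

$$w_i^{\text{fixed}} = \frac{1}{v_i}, \qquad w_i^{\text{random}} = \frac{1}{v_i + \tau^2}, \qquad \hat{\theta} = \frac{\sum_i w_i y_i}{\sum_i w_i}$$

where $y_i$ and $v_i$ are the effect estimate and within-study variance of study $i$.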

Fig. 4 Heterogeneity assumption methods

There are a few tools to assess heterogeneity: the Q test, the I² statistic and visual inspection of the forest plot. The easiest method is visual inspection of the forest plot; studies whose confidence intervals do not overlap are unlikely to be homogeneous. When studies are spread across both sides of the line of no effect, heterogeneity becomes more consequential for the analysis because it affects the direction of the effect. The chi-squared (Q) test assumes that all studies measure the same effect, and a low p value suggests high heterogeneity. However, the reliability of the Q test is low when the number of studies is very small or very large, as the p value becomes insensitive or overly sensitive, under- or over-diagnosing heterogeneity respectively. The other tool for diagnosing heterogeneity is the I² statistic, which expresses heterogeneity as a percentage; low values, below 30%, suggest minimal heterogeneity.
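As a minimal illustration of how these statistics relate to each other, the sketch below uses invented data and the standard DerSimonian–Laird formulas (it is not code from the article) to compute Q, I² and τ² for four hypothetical studies:

```python
import numpy as np

# Invented effect estimates (e.g. log risk ratios) and standard errors for four studies
yi = np.array([0.30, 0.10, 0.45, 0.25])
se = np.array([0.12, 0.15, 0.20, 0.10])
vi = se ** 2

# Fixed-effect (inverse-variance) pooled estimate
w = 1.0 / vi
theta_fixed = np.sum(w * yi) / np.sum(w)

# Cochran's Q, I^2 and the DerSimonian-Laird estimate of tau^2
q = np.sum(w * (yi - theta_fixed) ** 2)
df = len(yi) - 1
i2 = max(0.0, (q - df) / q) * 100
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (q - df) / c)

# Random-effects pooled estimate using the extra between-study variance
w_re = 1.0 / (vi + tau2)
theta_random = np.sum(w_re * yi) / np.sum(w_re)

print(f"Q = {q:.2f} (df = {df}), I^2 = {i2:.1f}%, tau^2 = {tau2:.4f}")
print(f"Fixed-effect estimate = {theta_fixed:.3f}, random-effects estimate = {theta_random:.3f}")
```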

The next step is to deal with heterogeneity by exploring possible causes. Errors in data collection or analysis, and true variations in population or intervention, are common reasons for outlying results. These identified causes should be explored cautiously in subgroup analyses. If no cause is identified, report this in the review as unexplained heterogeneity (this also feeds into the GRADE approach, described later). Within each subgroup, the heterogeneity and any effect modification should be reported. It is also important to have a logical basis for each factor reported in the subgroup analysis, as too many factors may confuse readers, and to make sure the subgroups have meaningful clinical relevance.

Different study designs and missing data

Some studies may have more than one intervention arm. It is reasonable to ignore intervention arms of no interest to the review; but if all treatment arms need to be included, the control group could be divided evenly among the intervention arms, or all arms could be analyzed together or separately. Unit-of-analysis errors are common in the analysis of cluster randomized trials, because the cluster, rather than the individual, is the unit of randomization. Similarly, correlation should be considered in crossover trials to avoid over- or under-weighting the study in the analysis. There will be a higher risk of bias and heterogeneity when analyzing non-randomized studies (NRS); however, the usual effect measures can be used in a meta-analysis of relatively homogeneous NRS.
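One common way to handle the cluster design, under the usual assumptions, is to inflate each trial's variance (or deflate its sample size) by the design effect. A sketch of the standard formulas (not given in the article):

$$DE = 1 + (m - 1)\,\rho, \qquad n_{\text{effective}} = \frac{n}{DE}$$

where $m$ is the average cluster size and $\rho$ is the intracluster correlation coefficient (ICC).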

Sometimes summary statistics are missing, and it is reasonable to calculate means and SDs from the available data. Imputation of data should be done cautiously and examined in sensitivity analyses.
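For example, when a study reports a mean with a standard error or a 95% confidence interval rather than an SD, the SD can usually be back-calculated. These are standard conversions (a sketch, not the authors' own formulas), the second assuming a 95% CI for a mean from a reasonably large sample:

$$SD = SE \times \sqrt{n}, \qquad SD = \sqrt{n}\,\frac{\text{upper limit} - \text{lower limit}}{3.92}$$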

Reporting and interpretation of results

It is important to report results in depth and not merely as statistical values. The main measures used to report a meta-analysis are the confidence interval (CI) and the SMD [ 2 ].

The CI is the range within which the true value probably lies; a narrow CI suggests a more precise effect. The CI is usually presented as a 95% interval (corresponding to a p value of 0.05) and occasionally as a 90% interval (p of 0.1). The result is statistically significant when the CI excludes the line of no effect. However, even statistically significant effects may not have clinical value if they do not reach the minimal clinically important difference. Conversely, effects that are not statistically significant may still be clinically important, which raises questions about the overall power of the meta-analysis to detect clinically important effects.
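For a normally distributed effect estimate, the familiar form (a standard result, not spelled out in the article) is:

$$95\%\ CI = \hat{\theta} \pm 1.96 \times SE(\hat{\theta})$$

so the interval narrows as the standard error decreases, which is why pooling studies yields more precise estimates.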

The SMD is defined above (“Data analysis” section) as an effect measure. A value different from zero indicates an effect of the intervention, but the size of the effect is difficult to interpret because the SMD is expressed in units of standard deviation (SD). Reasonable ways to report SMD results include a common rule of thumb (SMD < 0.4 small effect, > 0.7 large effect, moderate in between), transformation to an OR (assuming equal SDs in the control and intervention arms) or re-expressing the SMD as an estimated MD on a familiar scale.
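The transformation to an OR mentioned above uses the standard logistic approximation (assuming equal SDs in both arms; shown here as a sketch rather than a formula from the article):

$$\ln(OR) \approx \frac{\pi}{\sqrt{3}}\, SMD \approx 1.81 \times SMD$$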

Reporting bias and certainty of evidence

Reporting bias is the risk that information is missing or distorted at any point between writing the study protocol and publication. Many factors, such as author beliefs, word limits and editorial and reviewer decisions, can cause reporting bias. Funnel plots are a recommended method to detect reporting bias in systematic reviews and meta-analyses.
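A funnel plot is simply each study's effect estimate plotted against a measure of its precision, usually the standard error with the axis inverted so that larger studies sit at the top; marked asymmetry suggests possible reporting bias. A minimal sketch using invented data (not from the article):

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented study effect estimates (e.g. log odds ratios) and standard errors
effects = np.array([0.12, 0.35, -0.05, 0.50, 0.20, 0.60, 0.15, 0.45])
std_errs = np.array([0.05, 0.15, 0.08, 0.30, 0.10, 0.35, 0.07, 0.25])

# Fixed-effect (inverse-variance) pooled estimate for the reference line
pooled = np.sum(effects / std_errs**2) / np.sum(1 / std_errs**2)

plt.scatter(effects, std_errs)
plt.axvline(pooled, linestyle="--")   # vertical line at the pooled estimate
plt.gca().invert_yaxis()              # smaller SE (larger studies) at the top
plt.xlabel("Effect estimate")
plt.ylabel("Standard error")
plt.title("Funnel plot")
plt.show()
```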

Reporting the certainty of the results is another important step at the end of the analysis. The Grading of Recommendations, Assessment, Development and Evaluation (GRADE) approach is a recommended structured way to report the certainty of evidence. Table 6 describes the domains used to rate certainty up or down under the GRADE system [ 16 ]. Another important aspect of a systematic review is to categorize and present research studies based on their quality.

The final certainty rating in a meta-analysis is based on the combination of all domains across the included studies. This information should be given numerically in the results section and explained in the discussion. The same system can be used for narrative syntheses of results in systematic reviews. It is important to remember that rating up is only relevant for non-randomized studies; randomized studies start at a higher level of certainty.

Reporting the review

The last step of a systematic review or meta-analysis is report writing. Here, all parts are merged to write the review in a structured format, using the protocol as the starting point; all systematic reviews should begin with a protocol. The structure for report writing is shown in Fig. 5 [ 2 ].

Fig. 5 Structure for report writing

Summary of findings table

The ‘summary of findings’ table is a useful step in the writing. All the outcomes, with the list of contributing studies, are recorded in this table. The relative and absolute effects (imported from the forest plots), the certainty of evidence (based on GRADE) and comments are then included in separate columns, and footnotes can be added to explain decisions. Software is available to develop summary of findings tables, such as GRADEpro, which is compatible with RevMan [ 17 ].

Presenting results

The first paragraph of the results describes the search process; the PRISMA flow diagram (Fig. 3) is recommended for reporting the search summary [ 7 ]. The second section is the summary of the risk of bias assessment for the included studies. This will be only a narrative account of important differences, as the risk of bias of individual studies will be presented in detail in data tables. Following this, the review findings are presented in a structured format.

The effects of interventions are presented in forest plots and in data tables or figures. It is important to remember that this is not the section in which to interpret or infer results. All outcomes planned in the protocol should be reported, including outcomes for which no evidence was found, and the order of outcomes should be consistent throughout the review. Present intervention versus no intervention before one intervention versus another, and primary outcomes before secondary outcomes. Throughout the writing, check the consistency of results among plots, tables, figures and text. It may not be feasible to publish all plots and tables in the main document; supplementary materials or appendices can be used for less important analyses.

There may be situations where selected studies are too diverse to conduct a meta-analysis. Narrative synthesis is an option in such situations to analyze results. It is easy to examine data by grouping studies in a narrative synthesis. Avoid vote counting of positive and negative studies in narrative reviews.

The first paragraph of the discussion should summarize the main findings (both positive and negative) along with the certainty of the evidence. The summary of findings table can be used to identify the most important outcomes. Then describe whether the results address the study questions in the PICOS format.

The quality of the review evidence is discussed next. All domains of the GRADE assessment, including inconsistency, indirectness, imprecision and publication bias, should be discussed in relation to the conclusions. Selection bias of studies can be included in the strengths and limitations section, along with other assumptions made during the review. It is reasonable to mention agreements or disagreements with other reviews at the end, in the context of past reviews.

The conclusion is the summary of the review findings that guides readers in making decisions in policy or clinical practice. It is important to mention both positive and negative salient results of the review in the conclusion. Make sure only your study findings are presented, and do not comment on outside sources. Finally, recommendations can be made to fill the gaps in the evidence. The primary value of systematic reviews is to drive improvements in evidence-based practice, based on the needs of patients.

There are often other versions of the review summary that present the major findings in plain language for the benefit of consumers and the general public. It is advisable to use bullet points, and subheadings can be phrased as questions (What is the intervention? Why is it important? What did we find? What are the limitations? What is the conclusion?). It is better to write in the first person and active voice to address readers directly.

All types of summaries should be consistent with the main text. When describing uncertainty, be clear about the study limitations. As the summary is a snapshot of the study report, focus on the main results and the quality of the evidence.

Availability of data and materials

Not applicable.

References

1. Chandler J, Cumpston M, Thomas J, Higgins JP, Deeks JJ, Clarke MJ, Li T, Page MJ, Welch VA. Chapter 1: introduction. Cochrane Handbook Syst Rev Interv Ver. 2019;5(0):3–8.

2. Cumpston M, Li T, Page MJ, Chandler J, Welch VA, Higgins JP, Thomas J. Updated guidance for trusted systematic reviews: a new edition of the Cochrane Handbook for Systematic Reviews of Interventions. Cochrane Database Syst Rev. 2019;2019(10).

3. Mueller M, D’Addario M, Egger M, Cevallos M, Dekkers O, Mugglin C, Scott P. Methods to systematically review and meta-analyse observational studies: a systematic scoping review of recommendations. BMC Med Res Methodol. 2018;18(1):1–8.

4. Chojecki D, Tjosvold L. Documenting and reporting the search process. HTAI Vortal [online]. 2020.

5. Sorden N. New MeSH Browser available. NLM Tech Bull. 2016;(413):e2. https://www.nlm.nih.gov/pubs/techbull/nd16/nd16_mesh_browser_upgrade.html.

6. Tawfik GM, Dila KA, Mohamed MY, Tam DN, Kien ND, Ahmed AM, Huy NT. A step by step guide for conducting a systematic review and meta-analysis with simulation data. Trop Med Health. 2019;47(1):1–9.

7. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71. https://doi.org/10.1136/bmj.n71.

8. Ma LL, Wang YY, Yang ZH, Huang D, Weng H, Zeng XT. Methodological quality (risk of bias) assessment tools for primary and secondary medical studies: what are they and which is better? Military Med Res. 2020;7(1):1–1.

9. Higgins JP, Savović J, Page MJ, Elbers RG, Sterne JA. Assessing risk of bias in a randomized trial. Cochrane Handb Syst Rev Interv. 2019:205–28.

10. Deeks JJ, Dinnes J, D’Amico R, Sowden AJ, Sakarovitch C, Song F, Petticrew M, Altman DG. Evaluating non-randomised intervention studies. Health Technol Assess. 2003;7(27):iii–173.

11. Sterne JAC, Hernán MA, Reeves BC, Savović J, Berkman ND, Viswanathan M, Henry D, Altman DG, Ansari MT, Boutron I, Carpenter JR, Chan AW, Churchill R, Deeks JJ, Hróbjartsson A, Kirkham J, Jüni P, Loke YK, Pigott TD, Ramsay CR, Regidor D, Rothstein HR, Sandhu L, Santaguida PL, Schünemann HJ, Shea B, Shrier I, Tugwell P, Turner L, Valentine JC, Waddington H, Waters E, Wells GA, Whiting PF, Higgins JPT. ROBINS-I: a tool for assessing risk of bias in non-randomized studies of interventions. BMJ. 2016;355:i4919. https://doi.org/10.1136/bmj.i4919.

12. Downs SH, Black N. The feasibility of creating a checklist for the assessment of the methodological quality both of randomised and non-randomised studies of health care interventions. J Epidemiol Community Health. 1998;52(6):377–84.

13. Ahn E, Kang H. Introduction to systematic review and meta-analysis. Korean J Anesthesiol. 2018;71(2):103–12.

14. Lin L. Comparison of four heterogeneity measures for meta-analysis. J Eval Clin Pract. 2020;26(1):376–84.

15. Mohan BP, Adler DG. Heterogeneity in systematic review and meta-analysis: how to read between the numbers. Gastrointest Endosc. 2019;89(4):902–3.

16. Schünemann HJ. GRADE: from grading the evidence to developing recommendations. A description of the system and a proposal regarding the transferability of the results of clinical research to clinical practice. Zeitschrift Evidenz, Fortbildung Qualitat Gesundheitswesen. 2009;103(6):391–400.

17. Taito S. The construct of certainty of evidence has not been disseminated to systematic reviews and clinical practice guidelines; response to ‘The GRADE Working Group’ et al. J Clin Epidemiol. 2022;147:171.


Acknowledgements

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Author information

Authors and affiliations

University of Western Australia, Stirling Hwy, Crawley, Perth, WA, 6009, Australia

Mayura Thilanka Iddagoda & Leon Flicker

Perioperative Service, Royal Perth Hospital, Wellington Street, Perth, WA, 6000, Australia


Contributions

M.I. was involved in conceptualization, the literature search and writing the article. L.F. reviewed and corrected the content. All authors reviewed the manuscript.

Corresponding author

Correspondence to Mayura Thilanka Iddagoda .

Ethics declarations

Ethics approval and consent to participate, consent for publication, competing interests.

The authors declare no competing interests.



Cite this article

Iddagoda, M.T., Flicker, L. Clinical systematic reviews – a brief overview. BMC Med Res Methodol 23, 226 (2023). https://doi.org/10.1186/s12874-023-02047-8

Received: 02 March 2023 | Accepted: 27 September 2023 | Published: 10 October 2023

Keywords

  • Systematic review
  • Meta-analysis
  • Risk of bias
  • Certainty of evidence


  • Open access
  • Published: 08 June 2023

Guidance to best tools and practices for systematic reviews

  • Kat Kolaski 1,
  • Lynne Romeiser Logan 2 &
  • John P. A. Ioannidis 3

Systematic Reviews, volume 12, Article number: 96 (2023)


Data continue to accumulate indicating that many systematic reviews are methodologically flawed, biased, redundant, or uninformative. Some improvements have occurred in recent years based on empirical methods research and standardization of appraisal tools; however, many authors do not routinely or consistently apply these updated methods. In addition, guideline developers, peer reviewers, and journal editors often disregard current methodological standards. Although these issues are extensively acknowledged and explored in the methodological literature, most clinicians seem unaware of them and may automatically accept evidence syntheses (and clinical practice guidelines based on their conclusions) as trustworthy.

A plethora of methods and tools are recommended for the development and evaluation of evidence syntheses. It is important to understand what these are intended to do (and cannot do) and how they can be utilized. Our objective is to distill this sprawling information into a format that is understandable and readily accessible to authors, peer reviewers, and editors. In doing so, we aim to promote appreciation and understanding of the demanding science of evidence synthesis among stakeholders. We focus on well-documented deficiencies in key components of evidence syntheses to elucidate the rationale for current standards. The constructs underlying the tools developed to assess reporting, risk of bias, and methodological quality of evidence syntheses are distinguished from those involved in determining overall certainty of a body of evidence. Another important distinction is made between those tools used by authors to develop their syntheses as opposed to those used to ultimately judge their work.

Exemplar methods and research practices are described, complemented by novel pragmatic strategies to improve evidence syntheses. The latter include preferred terminology and a scheme to characterize types of research evidence. We organize best practice resources in a Concise Guide that can be widely adopted and adapted for routine implementation by authors and journals. Appropriate, informed use of these is encouraged, but we caution against their superficial application and emphasize their endorsement does not substitute for in-depth methodological training. By highlighting best practices with their rationale, we hope this guidance will inspire further evolution of methods and tools that can advance the field.

Part 1. The state of evidence synthesis

Evidence syntheses are commonly regarded as the foundation of evidence-based medicine (EBM). They are widely accredited for providing reliable evidence and, as such, they have significantly influenced medical research and clinical practice. Despite their uptake throughout health care and ubiquity in contemporary medical literature, some important aspects of evidence syntheses are generally overlooked or not well recognized. Evidence syntheses are mostly retrospective exercises, they often depend on weak or irreparably flawed data, and they may use tools that have acknowledged or yet unrecognized limitations. They are complicated and time-consuming undertakings prone to bias and errors. Production of a good evidence synthesis requires careful preparation and high levels of organization in order to limit potential pitfalls [ 1 ]. Many authors do not recognize the complexity of such an endeavor and the many methodological challenges they may encounter. Failure to do so is likely to result in research and resource waste.

Given their potential impact on people’s lives, it is crucial for evidence syntheses to correctly report on the current knowledge base. In order to be perceived as trustworthy, reliable demonstration of the accuracy of evidence syntheses is equally imperative [ 2 ]. Concerns about the trustworthiness of evidence syntheses are not recent developments. From the early years when EBM first began to gain traction until recent times when thousands of systematic reviews are published monthly [ 3 ] the rigor of evidence syntheses has always varied. Many systematic reviews and meta-analyses had obvious deficiencies because original methods and processes had gaps, lacked precision, and/or were not widely known. The situation has improved with empirical research concerning which methods to use and standardization of appraisal tools. However, given the geometrical increase in the number of evidence syntheses being published, a relatively larger pool of unreliable evidence syntheses is being published today.

Publication of methodological studies that critically appraise the methods used in evidence syntheses is increasing at a fast pace. This reflects the availability of tools specifically developed for this purpose [ 4 , 5 , 6 ]. Yet many clinical specialties report that alarming numbers of evidence syntheses fail on these assessments. The syntheses identified report on a broad range of common conditions including, but not limited to, cancer, [ 7 ] chronic obstructive pulmonary disease, [ 8 ] osteoporosis, [ 9 ] stroke, [ 10 ] cerebral palsy, [ 11 ] chronic low back pain, [ 12 ] refractive error, [ 13 ] major depression, [ 14 ] pain, [ 15 ] and obesity [ 16 , 17 ]. The situation is even more concerning with regard to evidence syntheses included in clinical practice guidelines (CPGs) [ 18 , 19 , 20 ]. Astonishingly, in a sample of CPGs published in 2017–18, more than half did not apply even basic systematic methods in the evidence syntheses used to inform their recommendations [ 21 ].

These reports, while not widely acknowledged, suggest there are pervasive problems not limited to evidence syntheses that evaluate specific kinds of interventions or include primary research of a particular study design (eg, randomized versus non-randomized) [ 22 ]. Similar concerns about the reliability of evidence syntheses have been expressed by proponents of EBM in highly circulated medical journals [ 23 , 24 , 25 , 26 ]. These publications have also raised awareness about redundancy, inadequate input of statistical expertise, and deficient reporting. These issues plague primary research as well; however, there is heightened concern for the impact of these deficiencies given the critical role of evidence syntheses in policy and clinical decision-making.

Methods and guidance to produce a reliable evidence synthesis

Several international consortiums of EBM experts and national health care organizations currently provide detailed guidance (Table 1 ). They draw criteria from the reporting and methodological standards of currently recommended appraisal tools, and regularly review and update their methods to reflect new information and changing needs. In addition, they endorse the Grading of Recommendations Assessment, Development and Evaluation (GRADE) system for rating the overall quality of a body of evidence [ 27 ]. These groups typically certify or commission systematic reviews that are published in exclusive databases (eg, Cochrane, JBI) or are used to develop government or agency sponsored guidelines or health technology assessments (eg, National Institute for Health and Care Excellence [NICE], Scottish Intercollegiate Guidelines Network [SIGN], Agency for Healthcare Research and Quality [AHRQ]). They offer developers of evidence syntheses various levels of methodological advice, technical and administrative support, and editorial assistance. Use of specific protocols and checklists are required for development teams within these groups, but their online methodological resources are accessible to any potential author.

Notably, Cochrane is the largest single producer of evidence syntheses in biomedical research; however, these only account for 15% of the total [ 28 ]. The World Health Organization requires Cochrane standards be used to develop evidence syntheses that inform their CPGs [ 29 ]. Authors investigating questions of intervention effectiveness in syntheses developed for Cochrane follow the Methodological Expectations of Cochrane Intervention Reviews [ 30 ] and undergo multi-tiered peer review [ 31 , 32 ]. Several empirical evaluations have shown that Cochrane systematic reviews are of higher methodological quality compared with non-Cochrane reviews [ 4 , 7 , 9 , 11 , 14 , 32 , 33 , 34 , 35 ]. However, some of these assessments have biases: they may be conducted by Cochrane-affiliated authors, and they sometimes use scales and tools developed and used in the Cochrane environment and by its partners. In addition, evidence syntheses published in the Cochrane database are not subject to space or word restrictions, while non-Cochrane syntheses are often limited. As a result, information that may be relevant to the critical appraisal of non-Cochrane reviews is often removed or is relegated to online-only supplements that may not be readily or fully accessible [ 28 ].

Influences on the state of evidence synthesis

Many authors are familiar with the evidence syntheses produced by the leading EBM organizations but can be intimidated by the time and effort necessary to apply their standards. Instead of following their guidance, authors may employ methods that are discouraged or outdated [ 28 ]. Suboptimal methods described in the literature may then be taken up by others. For example, the Newcastle–Ottawa Scale (NOS) is a commonly used tool for appraising non-randomized studies [ 36 ]. Many authors justify their selection of this tool with reference to a publication that describes the unreliability of the NOS and recommends against its use [ 37 ]. Obviously, the authors who cite this report for that purpose have not read it. Authors and peer reviewers have a responsibility to use reliable and accurate methods and not copycat previous citations or substandard work [ 38 , 39 ]. Similar cautions may potentially extend to automation tools. These have concentrated on evidence searching [ 40 ] and selection given how demanding it is for humans to maintain truly up-to-date evidence [ 2 , 41 ]. Cochrane has deployed machine learning to identify randomized controlled trials (RCTs) and studies related to COVID-19 [ 2 , 42 ], but such tools are not yet commonly used [ 43 ]. The routine integration of automation tools in the development of future evidence syntheses should not displace the interpretive part of the process.

Editorials about unreliable or misleading systematic reviews highlight several of the intertwining factors that may contribute to continued publication of unreliable evidence syntheses: shortcomings and inconsistencies of the peer review process, lack of endorsement of current standards on the part of journal editors, the incentive structure of academia, industry influences, publication bias, and the lure of “predatory” journals [ 44 , 45 , 46 , 47 , 48 ]. At this juncture, clarification of the extent to which each of these factors contribute remains speculative, but their impact is likely to be synergistic.

Over time, the generalized acceptance of the conclusions of systematic reviews as incontrovertible has affected trends in the dissemination and uptake of evidence. Reporting of the results of evidence syntheses and recommendations of CPGs has shifted beyond medical journals to press releases and news headlines and, more recently, to the realm of social media and influencers. The lay public and policy makers may depend on these outlets for interpreting evidence syntheses and CPGs. Unfortunately, communication to the general public often reflects intentional or non-intentional misrepresentation or “spin” of the research findings [ 49 , 50 , 51 , 52 ]. News and social media outlets also tend to reduce conclusions on a body of evidence and recommendations for treatment to binary choices (eg, “do it” versus “don’t do it”) that may be assigned an actionable symbol (eg, red/green traffic lights, smiley/frowning face emoji).

Strategies for improvement

Many authors and peer reviewers are volunteer health care professionals or trainees who lack formal training in evidence synthesis [ 46 , 53 ]. Informing them about research methodology could increase the likelihood they will apply rigorous methods [ 25 , 33 , 45 ]. We tackle this challenge, from both a theoretical and a practical perspective, by offering guidance applicable to any specialty. It is based on recent methodological research that is extensively referenced to promote self-study. However, the information presented is not intended to be a substitute for committed training in evidence synthesis methodology; instead, we hope to inspire our target audience to seek such training. We also hope to inform a broader audience of clinicians and guideline developers influenced by evidence syntheses. Notably, these communities often include the same members who serve in different capacities.

In the following sections, we highlight methodological concepts and practices that may be unfamiliar, problematic, confusing, or controversial. In Part 2, we consider various types of evidence syntheses and the types of research evidence summarized by them. In Part 3, we examine some widely used (and misused) tools for the critical appraisal of systematic reviews and reporting guidelines for evidence syntheses. In Part 4, we discuss how to meet methodological conduct standards applicable to key components of systematic reviews. In Part 5, we describe the merits and caveats of rating the overall certainty of a body of evidence. Finally, in Part 6, we summarize suggested terminology, methods, and tools for development and evaluation of evidence syntheses that reflect current best practices.

Part 2. Types of syntheses and research evidence

A good foundation for the development of evidence syntheses requires an appreciation of their various methodologies and the ability to correctly identify the types of research potentially available for inclusion in the synthesis.

Types of evidence syntheses

Systematic reviews have historically focused on the benefits and harms of interventions; over time, various types of systematic reviews have emerged to address the diverse information needs of clinicians, patients, and policy makers [ 54 ]. Systematic reviews with traditional components have become defined by the different topics they assess (Table 2.1 ). In addition, other distinctive types of evidence syntheses have evolved, including overviews or umbrella reviews, scoping reviews, rapid reviews, and living reviews. The popularity of these has been increasing in recent years [ 55 , 56 , 57 , 58 ]. A summary of the development, methods, available guidance, and indications for these unique types of evidence syntheses is available in Additional File 2 A.

Both Cochrane [ 30 , 59 ] and JBI [ 60 ] provide methodologies for many types of evidence syntheses; they describe these with different terminology, but there is obvious overlap (Table 2.2 ). The majority of evidence syntheses published by Cochrane (96%) and JBI (62%) are categorized as intervention reviews. This reflects the earlier development and dissemination of their intervention review methodologies; these remain well-established [ 30 , 59 , 61 ] as both organizations continue to focus on topics related to treatment efficacy and harms. In contrast, intervention reviews represent only about half of the total published in the general medical literature, and several non-intervention review types contribute to a significant proportion of the other half.

Types of research evidence

There is consensus on the importance of using multiple study designs in evidence syntheses; at the same time, there is a lack of agreement on methods to identify included study designs. Authors of evidence syntheses may use various taxonomies and associated algorithms to guide selection and/or classification of study designs. These tools differentiate categories of research and apply labels to individual study designs (eg, RCT, cross-sectional). A familiar example is the Design Tree endorsed by the Centre for Evidence-Based Medicine [ 70 ]. Such tools may not be helpful to authors of evidence syntheses for multiple reasons.

Suboptimal levels of agreement and accuracy even among trained methodologists reflect challenges with the application of such tools [ 71 , 72 ]. Problematic distinctions or decision points (eg, experimental or observational, controlled or uncontrolled, prospective or retrospective) and design labels (eg, cohort, case control, uncontrolled trial) have been reported [ 71 ]. The variable application of ambiguous study design labels to non-randomized studies is common, making them especially prone to misclassification [ 73 ]. In addition, study labels do not denote the unique design features that make different types of non-randomized studies susceptible to different biases, including those related to how the data are obtained (eg, clinical trials, disease registries, wearable devices). Given this limitation, it is important to be aware that design labels preclude the accurate assignment of non-randomized studies to a “level of evidence” in traditional hierarchies [ 74 ].

These concerns suggest that available tools and nomenclature used to distinguish types of research evidence may not uniformly apply to biomedical research and non-health fields that utilize evidence syntheses (eg, education, economics) [ 75 , 76 ]. Moreover, primary research reports often do not describe study design or do so incompletely or inaccurately; thus, indexing in PubMed and other databases does not address the potential for misclassification [ 77 ]. Yet proper identification of research evidence has implications for several key components of evidence syntheses. For example, search strategies limited by index terms using design labels or study selection based on labels applied by the authors of primary studies may cause inconsistent or unjustified study inclusions and/or exclusions [ 77 ]. In addition, because risk of bias (RoB) tools consider attributes specific to certain types of studies and study design features, results of these assessments may be invalidated if an inappropriate tool is used. Appropriate classification of studies is also relevant for the selection of a suitable method of synthesis and interpretation of those results.

An alternative to these tools and nomenclature involves application of a few fundamental distinctions that encompass a wide range of research designs and contexts. While these distinctions are not novel, we integrate them into a practical scheme (see Fig. 1 ) designed to guide authors of evidence syntheses in the basic identification of research evidence. The initial distinction is between primary and secondary studies. Primary studies are then further distinguished by: 1) the type of data reported (qualitative or quantitative); and 2) two defining design features (group or single-case and randomized or non-randomized). The different types of studies and study designs represented in the scheme are described in detail in Additional File 2 B. It is important to conceptualize their methods as complementary as opposed to contrasting or hierarchical [ 78 ]; each offers advantages and disadvantages that determine their appropriateness for answering different kinds of research questions in an evidence synthesis.

Fig. 1 Distinguishing types of research evidence

Application of these basic distinctions may avoid some of the potential difficulties associated with study design labels and taxonomies. Nevertheless, debatable methodological issues are raised when certain types of research identified in this scheme are included in an evidence synthesis. We briefly highlight those associated with inclusion of non-randomized studies, case reports and series, and a combination of primary and secondary studies.

Non-randomized studies

When investigating an intervention’s effectiveness, it is important for authors to recognize the uncertainty of observed effects reported by studies with high RoB. Results of statistical analyses that include such studies need to be interpreted with caution in order to avoid misleading conclusions [ 74 ]. Review authors may consider excluding randomized studies with high RoB from meta-analyses. Non-randomized studies of interventions (NRSI) are affected by a greater potential range of biases and thus vary more than RCTs in their ability to estimate a causal effect [ 79 ]. If data from NRSI are synthesized in meta-analyses, it is helpful to separately report their summary estimates [ 6 , 74 ].

Nonetheless, certain design features of NRSI (eg, which parts of the study were prospectively designed) may help to distinguish stronger from weaker ones. Cochrane recommends that authors of a review including NRSI focus on relevant study design features when determining eligibility criteria instead of relying on non-informative study design labels [ 79 , 80 ]. This process is facilitated by a study design feature checklist; guidance on using the checklist is included with the developers’ description of the tool [ 73 , 74 ]. Authors collect information about these design features during data extraction and then consider it when making final study selection decisions and when performing RoB assessments of the included NRSI.

Case reports and case series

Correctly identified case reports and case series can contribute evidence not well captured by other designs [ 81 ]; in addition, some topics may be limited to a body of evidence that consists primarily of uncontrolled clinical observations. Murad and colleagues offer a framework for how to include case reports and series in an evidence synthesis [ 82 ]. Distinguishing between cohort studies and case series in these syntheses is important, especially for those that rely on evidence from NRSI. Additional data obtained from studies misclassified as case series can potentially increase the confidence in effect estimates. Mathes and Pieper provide authors of evidence syntheses with specific guidance on distinguishing between cohort studies and case series, but emphasize the increased workload involved [ 77 ].

Primary and secondary studies

Synthesis of combined evidence from primary and secondary studies may provide a broad perspective on the entirety of available literature on a topic. This is, in fact, the recommended strategy for scoping reviews that may include a variety of sources of evidence (eg, CPGs, popular media). However, except for scoping reviews, the synthesis of data from primary and secondary studies is discouraged unless there are strong reasons to justify doing so.

Combining primary and secondary sources of evidence is challenging for authors of other types of evidence syntheses for several reasons [ 83 ]. Assessments of RoB for primary and secondary studies are derived from conceptually different tools, thus obfuscating the ability to make an overall RoB assessment of a combination of these study types. In addition, authors who include primary and secondary studies must devise non-standardized methods for synthesis. Note this contrasts with well-established methods available for updating existing evidence syntheses with additional data from new primary studies [ 84 , 85 , 86 ]. However, a new review that synthesizes data from primary and secondary studies raises questions of validity and may unintentionally support a biased conclusion because no existing methodological guidance is currently available [ 87 ].

Recommendations

We suggest that journal editors require authors to identify which type of evidence synthesis they are submitting and reference the specific methodology used for its development. This will clarify the research question and methods for peer reviewers and potentially simplify the editorial process. Editors should announce this practice and include it in the instructions to authors. To decrease bias and apply correct methods, authors must also accurately identify the types of research evidence included in their syntheses.

Part 3. Conduct and reporting

The need to develop criteria to assess the rigor of systematic reviews was recognized soon after the EBM movement began to gain international traction [ 88 , 89 ]. Systematic reviews rapidly became popular, but many were very poorly conceived, conducted, and reported. These problems remain highly prevalent [ 23 ] despite development of guidelines and tools to standardize and improve the performance and reporting of evidence syntheses [ 22 , 28 ]. Table 3.1  provides some historical perspective on the evolution of tools developed specifically for the evaluation of systematic reviews, with or without meta-analysis.

These tools are often interchangeably invoked when referring to the “quality” of an evidence synthesis. However, quality is a vague term that is frequently misused and misunderstood; more precisely, these tools specify different standards for evidence syntheses. Methodological standards address how well a systematic review was designed and performed [ 5 ]. RoB assessments refer to systematic flaws or limitations in the design, conduct, or analysis of research that distort the findings of the review [ 4 ]. Reporting standards help systematic review authors describe the methodology they used and the results of their synthesis in sufficient detail [ 92 ]. It is essential to distinguish between these evaluations: a systematic review may be biased, it may fail to report sufficient information on essential features, or it may exhibit both problems; a thoroughly reported evidence synthesis may still be biased and flawed, while an otherwise unbiased one may suffer from deficient documentation.

We direct attention to the currently recommended tools listed in Table 3.1  but concentrate on AMSTAR-2 (update of AMSTAR [A Measurement Tool to Assess Systematic Reviews]) and ROBIS (Risk of Bias in Systematic Reviews), which evaluate methodological quality and RoB, respectively. For comparison and completeness, we include PRISMA 2020 (update of the 2009 Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement), which offers guidance on reporting standards. The exclusive focus on these three tools is by design; it addresses concerns related to the considerable variability in tools used for the evaluation of systematic reviews [ 28 , 88 , 96 , 97 ]. We highlight the underlying constructs these tools were designed to assess, then describe their components and applications. Their known (or potential) uptake, impact, and limitations are also discussed.

Evaluation of conduct

Development

AMSTAR [ 5 ] was in use for a decade prior to the 2017 publication of AMSTAR-2; both provide a broad evaluation of methodological quality of intervention systematic reviews, including flaws arising through poor conduct of the review [ 6 ]. ROBIS, published in 2016, was developed to specifically assess RoB introduced by the conduct of the review; it is applicable to systematic reviews of interventions and several other types of reviews [ 4 ]. Both tools reflect a shift to a domain-based approach as opposed to generic quality checklists. There are a few items unique to each tool; however, similarities between items have been demonstrated [ 98 , 99 ]. AMSTAR-2 and ROBIS are recommended for use by: 1) authors of overviews or umbrella reviews and CPGs to evaluate systematic reviews considered as evidence; 2) authors of methodological research studies to appraise included systematic reviews; and 3) peer reviewers for appraisal of submitted systematic review manuscripts. For authors, these tools may function as teaching aids and inform conduct of their review during its development.

Description

Systematic reviews that include randomized and/or non-randomized studies as evidence can be appraised with AMSTAR-2 and ROBIS. Other characteristics of AMSTAR-2 and ROBIS are summarized in Table 3.2 . Both tools define categories for an overall rating; however, neither tool is intended to generate a total score by simply calculating the number of responses satisfying criteria for individual items [ 4 , 6 ]. AMSTAR-2 focuses on the rigor of a review’s methods irrespective of the specific subject matter. ROBIS places emphasis on a review’s results section; this suggests it may be optimally applied by appraisers with some knowledge of the review’s topic, as they may be better equipped to determine if certain procedures (or lack thereof) would impact the validity of a review’s findings [ 98 , 100 ]. Reliability studies show AMSTAR-2 overall confidence ratings strongly correlate with the overall RoB ratings in ROBIS [ 100 , 101 ].

Interrater reliability has been shown to be acceptable for AMSTAR-2 [ 6 , 11 , 102 ] and ROBIS [ 4 , 98 , 103 ] but neither tool has been shown to be superior in this regard [ 100 , 101 , 104 , 105 ]. Overall, variability in reliability for both tools has been reported across items, between pairs of raters, and between centers [ 6 , 100 , 101 , 104 ]. The effects of appraiser experience on the results of AMSTAR-2 and ROBIS require further evaluation [ 101 , 105 ]. Updates to both tools should address items shown to be prone to individual appraisers’ subjective biases and opinions [ 11 , 100 ]; this may involve modifications of the current domains and signaling questions as well as incorporation of methods to make an appraiser’s judgments more explicit. Future revisions of these tools may also consider the addition of standards for aspects of systematic review development currently lacking (eg, rating overall certainty of evidence, [ 99 ] methods for synthesis without meta-analysis [ 105 ]) and removal of items that assess aspects of reporting that are thoroughly evaluated by PRISMA 2020.

Application

A good understanding of what is required to satisfy the standards of AMSTAR-2 and ROBIS involves study of the accompanying guidance documents written by the tools’ developers; these contain detailed descriptions of each item’s standards. In addition, accurate appraisal of a systematic review with either tool requires training. Most experts recommend independent assessment by at least two appraisers with a process for resolving discrepancies as well as procedures to establish interrater reliability, such as pilot testing, a calibration phase or exercise, and development of predefined decision rules [ 35 , 99 , 100 , 101 , 103 , 104 , 106 ]. These methods may, to some extent, address the challenges associated with the diversity in methodological training, subject matter expertise, and experience using the tools that are likely to exist among appraisers.

The standards of AMSTAR, AMSTAR-2, and ROBIS have been used in many methodological studies and epidemiological investigations. However, the increased publication of overviews or umbrella reviews and CPGs has likely been a greater influence on the widening acceptance of these tools. Critical appraisal of the secondary studies considered as evidence is essential to the trustworthiness of both the recommendations of CPGs and the conclusions of overviews. Currently both Cochrane [ 55 ] and JBI [ 107 ] recommend AMSTAR-2 and ROBIS in their guidance for authors of overviews or umbrella reviews. However, ROBIS and AMSTAR-2 were released in 2016 and 2017, respectively; thus, to date, limited data have been reported about the uptake of these tools or which of the two may be preferred [ 21 , 106 ]. Currently, in relation to CPGs, AMSTAR-2 appears to be overwhelmingly popular compared to ROBIS. A Google Scholar search of this topic (search terms “AMSTAR 2 AND clinical practice guidelines” and “ROBIS AND clinical practice guidelines”; 13 May 2022) found 12,700 hits for AMSTAR-2 and 1,280 for ROBIS. The apparent greater appeal of AMSTAR-2 may relate to its longer track record, given the original version of the tool was in use for 10 years prior to its update in 2017.

Barriers to the uptake of AMSTAR-2 and ROBIS include the real or perceived time and resources necessary to complete the items they include and appraisers’ confidence in their own ratings [ 104 ]. Reports from comparative studies available to date indicate that appraisers find AMSTAR-2 questions, responses, and guidance to be clearer and simpler compared with ROBIS [ 11 , 101 , 104 , 105 ]. This suggests that for appraisal of intervention systematic reviews, AMSTAR-2 may be a more practical tool than ROBIS, especially for novice appraisers [ 101 , 103 , 104 , 105 ]. The unique characteristics of each tool, as well as their potential advantages and disadvantages, should be taken into consideration when deciding which tool should be used for an appraisal of a systematic review. In addition, the choice of one or the other may depend on how the results of an appraisal will be used; for example, a peer reviewer’s appraisal of a single manuscript versus an appraisal of multiple systematic reviews in an overview or umbrella review, CPG, or systematic methodological study.

Authors of overviews and CPGs report results of AMSTAR-2 and ROBIS appraisals for each of the systematic reviews they include as evidence. Ideally, an independent judgment of their appraisals can be made by the end users of overviews and CPGs; however, most stakeholders, including clinicians, are unlikely to have a sophisticated understanding of these tools. Nevertheless, they should at least be aware that AMSTAR-2 and ROBIS ratings reported in overviews and CPGs may be inaccurate if the tools are not applied as intended by their developers. This can result from inadequate training of the overview or CPG authors who perform the appraisals, or from modifications of the appraisal tools imposed by them. The potential variability in overall confidence and RoB ratings highlights why appraisers applying these tools need to support their judgments with explicit documentation; this allows readers to judge for themselves whether they agree with the criteria used by appraisers [ 4 , 108 ]. When these judgments are explicit, the underlying rationale used when applying these tools can be assessed [ 109 ].

Theoretically, we would expect use of AMSTAR-2 to be associated with improved methodological rigor, and use of ROBIS with lower RoB, in recent systematic reviews compared to those published before 2017. To our knowledge, this has not yet been demonstrated; however, like reports about the actual uptake of these tools, time will tell. Additional data on user experience are also needed to further elucidate the practical challenges and methodological nuances encountered with the application of these tools. This information could potentially inform the creation of unifying criteria to guide and standardize the appraisal of evidence syntheses [ 109 ].

Evaluation of reporting

Complete reporting is essential for users to establish the trustworthiness and applicability of a systematic review’s findings. Efforts to standardize and improve the reporting of systematic reviews resulted in the 2009 publication of the PRISMA statement [ 92 ] with its accompanying explanation and elaboration document [ 110 ]. This guideline was designed to help authors prepare a complete and transparent report of their systematic review. In addition, adherence to PRISMA is often used to evaluate the thoroughness of reporting of published systematic reviews [ 111 ]. The updated version, PRISMA 2020 [ 93 ], and its guidance document [ 112 ] were published in 2021. Items on the original and updated versions of PRISMA are organized by the six basic review components they address (title, abstract, introduction, methods, results, discussion). The PRISMA 2020 update is a considerably expanded version of the original; it includes standards and examples for the 27 original and 13 additional reporting items that capture methodological advances and may enhance the replicability of reviews [ 113 ].

The original PRISMA statement fostered the development of various PRISMA extensions (Table 3.3 ). These include reporting guidance for scoping reviews and reviews of diagnostic test accuracy and for intervention reviews that report on the following: harms outcomes, equity issues, the effects of acupuncture, the results of network meta-analyses and analyses of individual participant data. Detailed reporting guidance for specific systematic review components (abstracts, protocols, literature searches) is also available.

Uptake and impact

The 2009 PRISMA standards [ 92 ] for reporting have been widely endorsed by authors, journals, and EBM-related organizations. We anticipate the same for PRISMA 2020 [ 93 ] given its co-publication in multiple high-impact journals. However, to date, there is a lack of strong evidence for an association between improved systematic review reporting and endorsement of PRISMA 2009 standards [ 43 , 111 ]. Most journals require that a PRISMA checklist accompany submissions of systematic review manuscripts. However, the accuracy of information presented on these self-reported checklists is not necessarily verified. It remains unclear which strategies (eg, authors’ self-report of checklists, peer reviewer checks) might improve adherence to the PRISMA reporting standards; in addition, the feasibility of any potentially effective strategies must be taken into consideration given the structure and limitations of current research and publication practices [ 124 ].

Pitfalls and limitations of PRISMA, AMSTAR-2, and ROBIS

Misunderstanding of the roles of these tools and their misapplication may be widespread problems. PRISMA 2020 is a reporting guideline that is most beneficial if consulted when developing a review as opposed to merely completing a checklist when submitting to a journal; at that point, the review is finished, with good or bad methodological choices. Moreover, PRISMA checklists evaluate how completely an element of review conduct was reported; they do not evaluate the caliber of that conduct or the performance of the review. Thus, review authors and readers should not think that a rigorous systematic review can be produced by simply following the PRISMA 2020 guidelines. Similarly, it is important to recognize that AMSTAR-2 and ROBIS are tools to evaluate the conduct of a review but do not substitute for conceptual methodological guidance. In addition, they are not intended to be simple checklists. In fact, they have the potential for misuse or abuse if applied as such; for example, by calculating a total score to make a judgment about a review’s overall confidence or RoB. Proper selection of a response for the individual items on AMSTAR-2 and ROBIS requires training or at least reference to their accompanying guidance documents.

Not surprisingly, it has been shown that compliance with the PRISMA checklist is not necessarily associated with satisfying the standards of ROBIS [ 125 ]. AMSTAR-2 and ROBIS were not available when PRISMA 2009 was developed; however, they were considered in the development of PRISMA 2020 [ 113 ]. Therefore, future studies may show a positive relationship between fulfillment of PRISMA 2020 standards for reporting and meeting the standards of tools evaluating methodological quality and RoB.

Choice of an appropriate tool for the evaluation of a systematic review first involves identification of the underlying construct to be assessed. For systematic reviews of interventions, recommended tools include AMSTAR-2 and ROBIS for appraisal of conduct and PRISMA 2020 for completeness of reporting. All three tools were developed rigorously and provide easily accessible and detailed user guidance, which is necessary for their proper application and interpretation. When considering a manuscript for publication, training in these tools can sensitize peer reviewers and editors to major issues that may affect the review’s trustworthiness and completeness of reporting. Judgment of the overall certainty of a body of evidence and formulation of recommendations rely, in part, on AMSTAR-2 or ROBIS appraisals of systematic reviews. Therefore, training on the application of these tools is essential for authors of overviews and developers of CPGs. Peer reviewers and editors considering an overview or CPG for publication must hold their authors to a high standard of transparency regarding both the conduct and reporting of these appraisals.

Part 4. Meeting conduct standards

Many authors, peer reviewers, and editors erroneously equate fulfillment of the items on the PRISMA checklist with superior methodological rigor. For direction on methodology, we refer them to available resources that provide comprehensive conceptual guidance [ 59 , 60 ] as well as primers with basic step-by-step instructions [ 1 , 126 , 127 ]. This section is intended to complement study of such resources by facilitating use of AMSTAR-2 and ROBIS, tools specifically developed to evaluate methodological rigor of systematic reviews. These tools are widely accepted by methodologists; however, in the general medical literature, they are not uniformly selected for the critical appraisal of systematic reviews [ 88 , 96 ].

To facilitate their uptake, Table 4.1  links review components to the corresponding appraisal tool items. Expectations of AMSTAR-2 and ROBIS are concisely stated, and the reasoning behind them is provided.

Issues involved in meeting the standards for seven review components (identified in bold in Table 4.1 ) are addressed in detail. These were chosen for elaboration for one (or both) of two reasons: 1) the component has been identified as potentially problematic for systematic review authors based on consistent reports of their frequent AMSTAR-2 or ROBIS deficiencies [ 9 , 11 , 15 , 88 , 128 , 129 ]; and/or 2) the review component is judged by standards of an AMSTAR-2 “critical” domain. These have the greatest implications for how a systematic review will be appraised: if standards for any one of these critical domains are not met, the review is rated as having “critically low confidence.”

Research question

Specific and unambiguous research questions may have more value for reviews that deal with hypothesis testing. Mnemonics for the various elements of research questions are suggested by JBI and Cochrane (Table 2.1 ). These prompt authors to consider the specialized methods involved for developing different types of systematic reviews; however, while inclusion of the suggested elements makes a question compatible with the methods of a particular type of review, it does not necessarily make the research question appropriate. Table 4.2  lists acronyms that may aid in developing the research question. They include overlapping concepts of importance in this time of proliferating reviews of uncertain value [ 130 ]. If these issues are not prospectively contemplated, systematic review authors may establish an overly broad scope, or develop a runaway scope that allows them to stray from predefined choices relating to key comparisons and outcomes.

Once a research question is established, searching on registry sites and databases for existing systematic reviews addressing the same or a similar topic is necessary in order to avoid contributing to research waste [ 131 ]. Repeating an existing systematic review must be justified, for example, if previous reviews are out of date or methodologically flawed. A full discussion on replication of intervention systematic reviews, including a consensus checklist, can be found in the work of Tugwell and colleagues [ 84 ].

Protocol development is considered a core component of systematic reviews [ 125 , 126 , 132 ]. Review protocols may allow researchers to plan and anticipate potential issues, assess validity of methods, prevent arbitrary decision-making, and minimize bias that can be introduced by the conduct of the review. Registration of a protocol that allows public access promotes transparency of the systematic review’s methods and processes and reduces the potential for duplication [ 132 ]. Thinking early and carefully about all the steps of a systematic review is pragmatic and logical and may mitigate the influence of the authors’ prior knowledge of the evidence [ 133 ]. In addition, the protocol stage is when the scope of the review can be carefully considered by authors, reviewers, and editors; this may help to avoid production of overly ambitious reviews that include excessive numbers of comparisons and outcomes or are undisciplined in their study selection.

An association between publishing a prospective protocol and better attainment of AMSTAR standards has been reported [ 134 ]. However, completeness of reporting does not seem to be different in reviews with a protocol compared to those without one [ 135 ]. PRISMA-P [ 116 ] and its accompanying elaboration and explanation document [ 136 ] can be used to guide and assess the reporting of protocols. A final version of the review should fully describe any protocol deviations. Peer reviewers may compare the submitted manuscript with any available pre-registered protocol; this is required if AMSTAR-2 or ROBIS are used for critical appraisal.

There are multiple options for the recording of protocols (Table 4.3 ). Some journals will peer review and publish protocols. In addition, many online sites offer date-stamped and publicly accessible protocol registration. Some of these are exclusively for protocols of evidence syntheses; others are less restrictive and offer researchers the capacity for data storage, sharing, and other workflow features. These sites document protocol details to varying extents and have different requirements [ 137 ]. The most popular site for systematic reviews, the International Prospective Register of Systematic Reviews (PROSPERO), for example, only registers reviews that report on an outcome with direct relevance to human health. The PROSPERO record documents protocols for all types of reviews except literature and scoping reviews. Of note, PROSPERO requires authors to register their review protocols prior to any data extraction [ 133 , 138 ]. The electronic records of most of these registry sites allow authors to update their protocols and facilitate transparent tracking of protocol changes, which are not unexpected during the progress of the review [ 139 ].

Study design inclusion

For most systematic reviews, broad inclusion of study designs is recommended [ 126 ]. This may allow comparison of results between contrasting study design types [ 126 ]. Certain study designs may be considered preferable depending on the type of review and nature of the research question. However, prevailing stereotypes about what each study design does best may not be accurate. For example, in systematic reviews of interventions, randomized designs are typically thought to answer highly specific questions while non-randomized designs often are expected to reveal greater information about harms or real-world evidence [ 126 , 140 , 141 ]. This may be a false distinction; randomized trials may be pragmatic [ 142 ], they may offer important (and more unbiased) information on harms [ 143 ], and data from non-randomized trials may not necessarily be more real-world-oriented [ 144 ].

Moreover, there may not be any available evidence reported by RCTs for certain research questions; in some cases, there may not be any RCTs or NRSI. When the available evidence is limited to case reports and case series, it is not possible to test hypotheses or provide descriptive estimates or associations; however, a systematic review of these studies can still offer important insights [ 81 , 145 ]. When authors anticipate that limited evidence of any kind may be available to inform their research questions, a scoping review can be considered. Alternatively, decisions regarding inclusion of indirect as opposed to direct evidence can be addressed during protocol development [ 146 ]. Including indirect evidence at an early stage of intervention systematic review development allows authors to decide if such studies offer any additional and/or different understanding of treatment effects for their population or comparison of interest. Issues of indirectness of included studies are accounted for later in the process, during determination of the overall certainty of evidence (see Part 5 for details).

Evidence search

Both AMSTAR-2 and ROBIS require systematic and comprehensive searches for evidence. This is essential for any systematic review. Both tools discourage search restrictions based on language and publication source. Given increasing globalism in health care, the practice of including English-only literature should be avoided [ 126 ]. There are many examples in which language bias (different results in studies published in different languages) has been documented [ 147 , 148 ]. This does not mean that all literature, in all languages, is equally trustworthy [ 148 ]; however, the only way to formally probe for the potential of such biases is to consider all languages in the initial search. The gray literature and trial registries may also reveal important details about topics that would otherwise be missed [ 149 , 150 , 151 ]. Again, inclusiveness will allow review authors to investigate whether results differ in the gray literature and trial registries [ 41 , 151 , 152 , 153 ].

Authors should make every attempt to complete their review within one year, as that is the likely viable life of a search. If that is not possible, the search should be updated close to the time of completion [ 154 ]. Some topics may warrant even less of a delay; in rapidly changing fields (as in the case of the COVID-19 pandemic), for example, even one month may radically change the available evidence.

Excluded studies

AMSTAR-2 requires authors to provide references for any studies excluded at the full text phase of study selection along with reasons for exclusion; this allows readers to feel confident that all relevant literature has been considered for inclusion and that exclusions are defensible.

Risk of bias assessment of included studies

The design of the studies included in a systematic review (eg, RCT, cohort, case series) should not be equated with appraisal of their RoB. To meet AMSTAR-2 and ROBIS standards, systematic review authors must examine RoB issues specific to the design of each primary study they include as evidence. It is unlikely that a single RoB appraisal tool will be suitable for all research designs. In addition to tools for randomized and non-randomized studies, specific tools are available for evaluation of RoB in case reports and case series [ 82 ] and single-case experimental designs [ 155 , 156 ]. Note the RoB tools selected must meet the standards of the appraisal tool used to judge the conduct of the review. For example, AMSTAR-2 identifies four sources of bias specific to RCTs and NRSI that must be addressed by the RoB tool(s) chosen by the review authors. The Cochrane RoB 2 tool [ 157 ] for RCTs and ROBINS-I [ 158 ] for NRSI meet the AMSTAR-2 standards for RoB assessment. Appraisers on the review team should not modify any RoB tool without complete transparency and acknowledgment that they have invalidated the interpretation of the tool as intended by its developers [ 159 ]. Conduct of RoB assessments is not addressed by AMSTAR-2; to meet ROBIS standards, two independent reviewers should complete RoB assessments of included primary studies.

Implications of the RoB assessments must be explicitly discussed and considered in the conclusions of the review. Discussion of the overall RoB of included studies may consider the weight of the studies at high RoB, the importance of the sources of bias in the studies being summarized, and if their importance differs in relationship to the outcomes reported. If a meta-analysis is performed, serious concerns for RoB of individual studies should be accounted for in these results as well. If the results of the meta-analysis for a specific outcome change when studies at high RoB are excluded, readers will have a more accurate understanding of this body of evidence. However, while investigating the potential impact of specific biases is a useful exercise, it is important to avoid over-interpretation, especially when there are sparse data.

Synthesis methods for quantitative data

Syntheses of quantitative data reported by primary studies are broadly categorized as one of two types: meta-analysis, and synthesis without meta-analysis (Table 4.4 ). Before deciding on one of these methods, authors should seek methodological advice about whether reported data can be transformed or used in other ways to provide a consistent effect measure across studies [ 160 , 161 ].

Meta-analysis

Systematic reviews that employ meta-analysis should not be referred to simply as “meta-analyses.” The term meta-analysis strictly refers to a specific statistical technique used when study effect estimates and their variances are available, yielding a quantitative summary of results. In general, methods for meta-analysis involve use of a weighted average of effect estimates from two or more studies. When applied appropriately, meta-analysis increases the precision of the estimated magnitude of effect and can offer useful insights about heterogeneity among the effect estimates. We refer to standard references for a thorough introduction and formal training [ 165 , 166 , 167 ].

There are three common approaches to meta-analysis in current health care–related systematic reviews (Table 4.4 ). Aggregate data meta-analysis is the most familiar to authors of evidence syntheses and their end users. This standard meta-analysis combines data on effect estimates reported by studies that investigate similar research questions involving direct comparisons of an intervention and comparator. Results of these analyses provide a single summary intervention effect estimate. If the included studies in a systematic review measure an outcome differently, their reported results may be transformed to make them comparable [ 161 ]. Forest plots visually present essential information about the individual studies and the overall pooled analysis (see Additional File 4  for details).
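
To make the mechanics of a weighted average of effect estimates concrete, the following minimal sketch (in Python with NumPy; all effect sizes and standard errors are hypothetical) pools log risk ratios with inverse-variance weights under a fixed-effect model and a DerSimonian-Laird random-effects model. It illustrates the arithmetic only; actual analyses should rely on validated meta-analysis software and statistical expertise.

```python
# Minimal sketch of an aggregate (study-level) inverse-variance meta-analysis.
# The effect sizes (log risk ratios) and standard errors are hypothetical.
import numpy as np

effects = np.array([-0.35, -0.10, -0.42, 0.05, -0.25])  # hypothetical log risk ratios
se = np.array([0.15, 0.20, 0.25, 0.18, 0.12])            # hypothetical standard errors

# Fixed-effect model: weight each study by the inverse of its variance
w = 1.0 / se**2
pooled_fe = np.sum(w * effects) / np.sum(w)
se_fe = np.sqrt(1.0 / np.sum(w))

# DerSimonian-Laird estimate of between-study variance (tau^2) for random effects
q = np.sum(w * (effects - pooled_fe) ** 2)               # Cochran's Q
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (q - (len(effects) - 1)) / c)

w_re = 1.0 / (se**2 + tau2)
pooled_re = np.sum(w_re * effects) / np.sum(w_re)
se_re = np.sqrt(1.0 / np.sum(w_re))

print(f"Fixed effect:   {pooled_fe:.3f} (95% CI {pooled_fe - 1.96*se_fe:.3f} to {pooled_fe + 1.96*se_fe:.3f})")
print(f"Random effects: {pooled_re:.3f} (95% CI {pooled_re - 1.96*se_re:.3f} to {pooled_re + 1.96*se_re:.3f})")
print(f"Heterogeneity:  Q = {q:.2f}, tau^2 = {tau2:.3f}")
```

Dedicated programs such as those mentioned later in this section implement these calculations (and many refinements) and also generate the corresponding forest plots.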

Less familiar and more challenging meta-analytical approaches used in secondary research include individual participant data (IPD) and network meta-analyses (NMA); PRISMA extensions provide reporting guidelines for both [ 117 , 118 ]. In IPD, the raw data on each participant from each eligible study are re-analyzed as opposed to the study-level data analyzed in aggregate data meta-analyses [ 168 ]. This may offer advantages, including the potential for limiting concerns about bias and allowing more robust analyses [ 163 ]. As suggested by the description in Table 4.4 , NMA is a complex statistical approach. It combines aggregate data [ 169 ] or IPD [ 170 ] for effect estimates from direct and indirect comparisons reported in two or more studies of three or more interventions. This makes it a potentially powerful statistical tool; while multiple interventions are typically available to treat a condition, few have been evaluated in head-to-head trials [ 171 ]. Both IPD and NMA facilitate a broader scope, and potentially provide more reliable and/or detailed results; however, compared with standard aggregate data meta-analyses, their methods are more complicated, time-consuming, and resource-intensive, and they have their own biases, so one needs sufficient funding, technical expertise, and preparation to employ them successfully [ 41 , 172 , 173 ].

Several items in AMSTAR-2 and ROBIS address meta-analysis; thus, understanding the strengths, weaknesses, assumptions, and limitations of methods for meta-analyses is important. According to the standards of both tools, plans for a meta-analysis must be addressed in the review protocol, including reasoning, description of the type of quantitative data to be synthesized, and the methods planned for combining the data. This should not consist of stock statements describing conventional meta-analysis techniques; rather, authors are expected to anticipate issues specific to their research questions. Concern for the lack of training in meta-analysis methods among systematic review authors cannot be overstated. For those with training, the use of popular software (eg, RevMan [ 174 ], MetaXL [ 175 ], JBI SUMARI [ 176 ]) may facilitate exploration of these methods; however, such programs cannot substitute for the accurate interpretation of the results of meta-analyses, especially for more complex meta-analytical approaches.

Synthesis without meta-analysis

There are varied reasons a meta-analysis may not be appropriate or desirable [ 160 , 161 ]. Syntheses that informally use statistical methods other than meta-analysis are variably referred to as descriptive, narrative, or qualitative syntheses or summaries; these terms are also applied to syntheses that make no attempt to statistically combine data from individual studies. However, use of such imprecise terminology is discouraged; in order to fully explore the results of any type of synthesis, some narration or description is needed to supplement the data visually presented in tabular or graphic forms [ 63 , 177 ]. In addition, the term “qualitative synthesis” is easily confused with a synthesis of qualitative data in a qualitative or mixed methods review. “Synthesis without meta-analysis” is currently the preferred description of other ways to combine quantitative data from two or more studies. Use of this specific terminology when referring to these types of syntheses also implies the application of formal methods (Table 4.4 ).

Methods for syntheses without meta-analysis involve structured presentations of the data in tables and plots. In comparison to narrative descriptions of each study, these are designed to more effectively and transparently show patterns and convey detailed information about the data; they also allow informal exploration of heterogeneity [ 178 ]. In addition, acceptable quantitative statistical methods (Table 4.4 ) are formally applied; however, it is important to recognize these methods have significant limitations for the interpretation of the effectiveness of an intervention [ 160 ]. Nevertheless, when meta-analysis is not possible, the application of these methods is less prone to bias compared with an unstructured narrative description of included studies [ 178 , 179 ].

Vote counting is commonly used in systematic reviews and involves a tally of studies reporting results that meet some threshold of importance applied by review authors. Until recently, it has not typically been identified as a method for synthesis without meta-analysis. Guidance on an acceptable vote counting method based on direction of effect is currently available [ 160 ] and should be used instead of narrative descriptions of such results (eg, “more than half the studies showed improvement”; “only a few studies reported adverse effects”; “7 out of 10 studies favored the intervention”). Unacceptable methods include vote counting by statistical significance or magnitude of effect or some subjective rule applied by the authors.
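
As an illustration of the direction-of-effect approach, the brief sketch below (Python with SciPy; the study directions are hypothetical) tallies how many studies favor the intervention and applies a two-sided sign (binomial) test; the formal method and its interpretation are described in the Cochrane guidance cited above [ 160 ].

```python
# Minimal sketch of vote counting based on direction of effect (not statistical
# significance), with a two-sided sign test. The study directions are hypothetical.
from scipy.stats import binomtest

# +1 = effect favors the intervention, -1 = favors the comparator
# (studies with no estimable direction would simply be excluded from the count)
directions = [+1, +1, -1, +1, +1, -1, +1, +1]

favorable = sum(1 for d in directions if d > 0)
total = len(directions)

# Under the null hypothesis of no effect, each study is equally likely to favor either arm
result = binomtest(favorable, total, p=0.5, alternative="two-sided")
print(f"{favorable}/{total} studies favor the intervention (sign test p = {result.pvalue:.3f})")
```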

AMSTAR-2 and ROBIS standards do not explicitly address conduct of syntheses without meta-analysis, although AMSTAR-2 items 13 and 14 might be considered relevant. Guidance for the complete reporting of syntheses without meta-analysis for systematic reviews of interventions is available in the Synthesis without Meta-analysis (SWiM) guideline [ 180 ] and methodological guidance is available in the Cochrane Handbook [ 160 , 181 ].

Familiarity with AMSTAR-2 and ROBIS makes sense for authors of systematic reviews as these appraisal tools will be used to judge their work; however, training is necessary for authors to truly appreciate and apply methodological rigor. Moreover, judgment of the potential contribution of a systematic review to the current knowledge base goes beyond meeting the standards of AMSTAR-2 and ROBIS. These tools do not explicitly address some crucial concepts involved in the development of a systematic review; this further emphasizes the need for author training.

We recommend that systematic review authors incorporate specific practices or exercises when formulating a research question at the protocol stage. These should be designed to raise the review team’s awareness of how to prevent research and resource waste [ 84 , 130 ] and to stimulate careful contemplation of the scope of the review [ 30 ]. Authors’ training should also focus on justifiably choosing a formal method for the synthesis of quantitative and/or qualitative data from primary research; both types of data require specific expertise. For typical reviews that involve syntheses of quantitative data, statistical expertise is necessary, initially for decisions about appropriate methods [ 160 , 161 ] and then to inform any meta-analyses [ 167 ] or other statistical methods applied [ 160 ].

Part 5. Rating overall certainty of evidence

Report of an overall certainty of evidence assessment in a systematic review is an important new reporting standard of the updated PRISMA 2020 guidelines [ 93 ]. Systematic review authors are well acquainted with assessing RoB in individual primary studies, but much less familiar with assessment of overall certainty across an entire body of evidence. Yet a reliable way to evaluate this broader concept is now recognized as a vital part of interpreting the evidence.

Historical systems for rating evidence are based on study design and usually involve hierarchical levels or classes of evidence that use numbers and/or letters to designate the level/class. These systems were endorsed by various EBM-related organizations. Professional societies and regulatory groups then widely adopted them, often with modifications for application to the available primary research base in specific clinical areas. In 2002, a report issued by the AHRQ identified 40 systems to rate quality of a body of evidence [ 182 ]. A critical appraisal of systems used by prominent health care organizations published in 2004 revealed limitations in sensibility, reproducibility, applicability to different questions, and usability to different end users [ 183 ]. Persistent use of hierarchical rating schemes to describe overall quality continues to complicate the interpretation of evidence. This is indicated by recent reports of poor interpretability of systematic review results by readers [ 184 , 185 , 186 ] and misleading interpretations of the evidence related to the “spin” systematic review authors may put on their conclusions [ 50 , 187 ].

Recognition of the shortcomings of hierarchical rating systems raised concerns that misleading clinical recommendations could result even if based on a rigorous systematic review. In addition, the number and variability of these systems were considered obstacles to quick and accurate interpretations of the evidence by clinicians, patients, and policymakers [ 183 ]. These issues contributed to the development of the GRADE approach. An international working group, which continues to actively evaluate and refine it, first introduced GRADE in 2004 [ 188 ]. Currently more than 110 organizations from 19 countries around the world have endorsed or are using GRADE [ 189 ].

GRADE approach to rating overall certainty

GRADE offers a consistent and sensible approach for two separate processes: rating the overall certainty of a body of evidence and the strength of recommendations. The former is the expected conclusion of a systematic review, while the latter is pertinent to the development of CPGs. As such, GRADE provides a mechanism to bridge the gap from evidence synthesis to application of the evidence for informed clinical decision-making [ 27 , 190 ]. We briefly examine the GRADE approach but only as it applies to rating overall certainty of evidence in systematic reviews.

In GRADE, use of “certainty” of a body of evidence is preferred over the term “quality.” [ 191 ] Certainty refers to the level of confidence systematic review authors have that, for each outcome, an effect estimate represents the true effect. The GRADE approach to rating confidence in estimates begins with identifying the study type (RCT or NRSI) and then systematically considers criteria to rate the certainty of evidence up or down (Table 5.1 ).

This process results in assignment of one of the four GRADE certainty ratings to each outcome; these are clearly conveyed with the use of basic interpretation symbols (Table 5.2 ) [ 192 ]. Notably, when multiple outcomes are reported in a systematic review, each outcome is assigned a unique certainty rating; thus different levels of certainty may exist in the body of evidence being examined.

GRADE’s developers acknowledge some subjectivity is involved in this process [ 193 ]. In addition, they emphasize that both the criteria for rating evidence up and down (Table 5.1 ) as well as the four overall certainty ratings (Table 5.2 ) reflect a continuum as opposed to discrete categories [ 194 ]. Consequently, deciding whether a study falls above or below the threshold for rating up or down may not be straightforward, and preliminary overall certainty ratings may be intermediate (eg, between low and moderate). Thus, the proper application of GRADE requires systematic review authors to take an overall view of the body of evidence and explicitly describe the rationale for their final ratings.
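
As a purely mechanical illustration of these starting points and movements between the four certainty levels, the sketch below (Python; the domain judgments are hypothetical) shows how a body of RCT evidence starts at high and a body of NRSI evidence starts at low before any rating down or up. It is not a substitute for the structured GRADE criteria, and, as noted above, real judgments reflect a continuum and require explicit justification.

```python
# Minimal sketch of the GRADE starting point and the mechanics of rating down/up.
# Real assessments apply the structured criteria in Table 5.1 with documented rationale.
LEVELS = ["very low", "low", "moderate", "high"]

def grade_certainty(study_type: str, downgrades: int, upgrades: int = 0) -> str:
    """study_type: 'RCT' or 'NRSI'; downgrades/upgrades: total levels moved."""
    start = LEVELS.index("high") if study_type == "RCT" else LEVELS.index("low")
    final = min(max(start - downgrades + upgrades, 0), len(LEVELS) - 1)
    return LEVELS[final]

# RCT evidence rated down one level each for risk of bias and imprecision
print(grade_certainty("RCT", downgrades=2))                # -> "low"

# NRSI evidence rated up one level for a large magnitude of effect
print(grade_certainty("NRSI", downgrades=0, upgrades=1))   # -> "moderate"
```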

Advantages of GRADE

Outcomes important to the individuals who experience the problem of interest maintain a prominent role throughout the GRADE process [ 191 ]. These outcomes must inform the research questions (eg, PICO [population, intervention, comparator, outcome]) that are specified a priori in a systematic review protocol. Evidence for these outcomes is then investigated and each critical or important outcome is ultimately assigned a certainty of evidence as the end point of the review. Notably, limitations of the included studies have an impact at the outcome level. Ultimately, the certainty ratings for each outcome reported in a systematic review are considered by guideline panels. They use a different process to formulate recommendations that involves assessment of the evidence across outcomes [ 201 ]. It is beyond our scope to describe the GRADE process for formulating recommendations; however, it is critical to understand how these two outcome-centric concepts of certainty of evidence in the GRADE framework are related and distinguished. An in-depth illustration using examples from recently published evidence syntheses and CPGs is provided in Additional File 5 A (Table AF5A-1).

The GRADE approach is applicable irrespective of whether the certainty of the primary research evidence is high or very low; in some circumstances, indirect evidence of higher certainty may be considered if direct evidence is unavailable or of low certainty [ 27 ]. In fact, most interventions and outcomes in medicine have low or very low certainty of evidence based on GRADE and there seems to be no major improvement over time [ 202 , 203 ]. This is still a very important (even if sobering) realization for calibrating our understanding of medical evidence. A major appeal of the GRADE approach is that it offers a common framework that enables authors of evidence syntheses to make complex judgments about evidence certainty and to convey these with unambiguous terminology. This prevents some common mistakes made by review authors, including overstating results (or under-reporting harms) [ 187 ] and making recommendations for treatment. This is illustrated in Table AF5A-2 (Additional File 5 A), which compares the concluding statements made about overall certainty in a systematic review with and without application of the GRADE approach.

Theoretically, application of GRADE should improve consistency of judgments about certainty of evidence, both between authors and across systematic reviews. In one empirical evaluation conducted by the GRADE Working Group, interrater reliability of two individual raters assessing certainty of the evidence for a specific outcome increased from ~ 0.3 without using GRADE to ~ 0.7 by using GRADE [ 204 ]. However, others report variable agreement among those experienced in GRADE assessments of evidence certainty [ 190 ]. Like any other tool, GRADE requires training in order to be properly applied. The intricacies of the GRADE approach and the necessary subjectivity involved suggest that improving agreement may require strict rules for its application; alternatively, use of general guidance and consensus among review authors may result in less consistency but provide important information for the end user [ 190 ].
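
For readers unfamiliar with agreement coefficients, the sketch below (Python with scikit-learn; the paired ratings are hypothetical) shows one way agreement between two raters on the ordinal GRADE scale might be quantified, here with a linearly weighted kappa; the study cited above may have used a different statistic, so this is illustrative only.

```python
# Minimal sketch quantifying interrater agreement on GRADE certainty ratings
# with a linearly weighted Cohen's kappa. The paired ratings are hypothetical.
from sklearn.metrics import cohen_kappa_score

levels = {"very low": 0, "low": 1, "moderate": 2, "high": 3}

rater_a = ["low", "moderate", "high", "low", "very low", "moderate", "low", "high"]
rater_b = ["low", "moderate", "moderate", "low", "low", "moderate", "very low", "high"]

kappa = cohen_kappa_score(
    [levels[r] for r in rater_a],
    [levels[r] for r in rater_b],
    weights="linear",  # gives partial credit for near-misses on the ordinal scale
)
print(f"Weighted kappa = {kappa:.2f}")
```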

GRADE caveats

Simply invoking “the GRADE approach” does not automatically ensure GRADE methods were employed by authors of a systematic review (or developers of a CPG). Table 5.3 lists the criteria the GRADE working group has established for this purpose. These criteria highlight the specific terminology and methods that apply to rating the certainty of evidence for outcomes reported in a systematic review [ 191 ], which is different from rating overall certainty across outcomes considered in the formulation of recommendations [ 205 ]. Modifications of standard GRADE methods and terminology are discouraged as these may detract from GRADE’s objectives to minimize conceptual confusion and maximize clear communication [ 206 ].

Nevertheless, GRADE is prone to misapplications [ 207 , 208 ], which can distort a systematic review’s conclusions about the certainty of evidence. Systematic review authors without proper GRADE training are likely to misinterpret the terms “quality” and “grade” and to misunderstand the constructs assessed by GRADE versus other appraisal tools. For example, review authors may reference the standard GRADE certainty ratings (Table 5.2 ) to describe evidence for their outcome(s) of interest. However, these ratings are invalidated if authors omit or inadequately perform RoB evaluations of each included primary study. Such deficiencies in RoB assessments are unacceptable but not uncommon, as reported in methodological studies of systematic reviews and overviews [ 104 , 186 , 209 , 210 ]. GRADE ratings are also invalidated if review authors do not formally address and report on the other criteria (Table 5.1 ) necessary for a GRADE certainty rating.

Other caveats pertain to application of a GRADE certainty of evidence rating in various types of evidence syntheses. Current adaptations of GRADE are described in Additional File 5 B and included in Table 6.3 , which is introduced in the next section.

The expected culmination of a systematic review should be a rating of overall certainty of a body of evidence for each outcome reported. The GRADE approach is recommended for making these judgments for outcomes reported in systematic reviews of interventions and can be adapted for other types of reviews. This represents the initial step in the process of making recommendations based on evidence syntheses. Peer reviewers should ensure authors meet the minimal criteria for supporting the GRADE approach when reviewing any evidence synthesis that reports certainty ratings derived using GRADE. Authors and peer reviewers of evidence syntheses unfamiliar with GRADE are encouraged to seek formal training and take advantage of the resources available on the GRADE website [ 211 , 212 ].

Part 6. Concise Guide to best practices

Accumulating data in recent years suggest that many evidence syntheses (with or without meta-analysis) are not reliable. This relates in part to the fact that their authors, who are often clinicians, can be overwhelmed by the plethora of ways to evaluate evidence. They tend to resort to familiar but often inadequate, inappropriate, or obsolete methods and tools and, as a result, produce unreliable reviews. These manuscripts may not be recognized as such by peer reviewers and journal editors who may disregard current standards. When such a systematic review is published or included in a CPG, clinicians and stakeholders tend to believe that it is trustworthy. A vicious cycle in which inadequate methodology is rewarded and potentially misleading conclusions are accepted is thus supported. There is no quick or easy way to break this cycle; however, increasing awareness of best practices among all these stakeholder groups, who often have minimal (if any) training in methodology, may begin to mitigate it. This is the rationale for inclusion of Parts 2 through 5 in this guidance document. These sections present core concepts and important methodological developments that inform current standards and recommendations. We conclude by taking a direct and practical approach.

Inconsistent and imprecise terminology used in the context of development and evaluation of evidence syntheses is problematic for authors, peer reviewers and editors, and may lead to the application of inappropriate methods and tools. In response, we endorse use of the basic terms (Table 6.1 ) defined in the PRISMA 2020 statement [ 93 ]. In addition, we have identified several problematic expressions and nomenclature. In Table 6.2 , we compile suggestions for preferred terms less likely to be misinterpreted.

We also propose a Concise Guide (Table 6.3 ) that summarizes the methods and tools recommended for the development and evaluation of nine types of evidence syntheses. Suggestions for specific tools are based on the rigor of their development as well as the availability of detailed guidance from their developers to ensure their proper application. The formatting of the Concise Guide addresses a well-known source of confusion by clearly distinguishing the underlying methodological constructs that these tools were designed to assess. Important clarifications and explanations follow in the guide’s footnotes; associated websites, if available, are listed in Additional File 6 .

To encourage uptake of best practices, journal editors may consider adopting or adapting the Concise Guide in their instructions to authors and peer reviewers of evidence syntheses. Given the evolving nature of evidence synthesis methodology, the suggested methods and tools are likely to require regular updates. Authors of evidence syntheses should monitor the literature to ensure they are employing current methods and tools. Some types of evidence syntheses (eg, rapid, economic, methodological) are not included in the Concise Guide; for these, authors are advised to obtain recommendations for acceptable methods by consulting with their target journal.

We encourage the appropriate and informed use of the methods and tools discussed throughout this commentary and summarized in the Concise Guide (Table 6.3 ). However, we caution against their application in a perfunctory or superficial fashion. This is a common pitfall among authors of evidence syntheses, especially as the standards of such tools become associated with acceptance of a manuscript by a journal. Consequently, published evidence syntheses may show improved adherence to the requirements of these tools without necessarily making genuine improvements in their performance.

In line with our main objective, the suggested tools in the Concise Guide address the reliability of evidence syntheses; however, we recognize that the utility of systematic reviews is an equally important concern. An unbiased and thoroughly reported evidence synthesis may still not be highly informative if the evidence itself that is summarized is sparse, weak and/or biased [ 24 ]. Many intervention systematic reviews, including those developed by Cochrane [ 203 ] and those applying GRADE [ 202 ], ultimately find no evidence, or find the evidence to be inconclusive (eg, “weak,” “mixed,” or of “low certainty”). This often reflects the primary research base; however, it is important to know what is known (or not known) about a topic when considering an intervention for patients and discussing treatment options with them.

Alternatively, the frequency of “empty” and inconclusive reviews published in the medical literature may relate to limitations of conventional methods that focus on hypothesis testing; these have emphasized the importance of statistical significance in primary research and effect sizes from aggregate meta-analyses [ 183 ]. It is becoming increasingly apparent that this approach may not be appropriate for all topics [ 130 ]. Development of the GRADE approach has facilitated a better understanding of significant factors (beyond effect size) that contribute to the overall certainty of evidence. Other notable responses include the development of integrative synthesis methods for the evaluation of complex interventions [ 230 , 231 ], the incorporation of crowdsourcing and machine learning into systematic review workflows (eg, the Cochrane Evidence Pipeline) [ 2 ], the shift in paradigm to living systematic review and NMA platforms [ 232 , 233 ], and the proposal of a new evidence ecosystem that fosters bidirectional collaborations and interactions among a global network of evidence synthesis stakeholders [ 234 ]. These evolutions in data sources and methods may ultimately make evidence syntheses more streamlined, less duplicative, and, more importantly, more useful for timely policy and clinical decision-making; however, that will only be the case if they are rigorously conducted and reported.

We look forward to others’ ideas and proposals for the advancement of methods for evidence syntheses. For now, we encourage dissemination and uptake of the currently accepted best tools and practices for their development and evaluation; at the same time, we stress that uptake of appraisal tools, checklists, and software programs cannot substitute for proper education in the methodology of evidence syntheses and meta-analysis. Authors, peer reviewers, and editors must strive to make accurate and reliable contributions to the present evidence knowledge base; online alerts, upcoming technology, and accessible education may make this more feasible than ever before. Our intention is to improve the trustworthiness of evidence syntheses across disciplines, topics, and types of evidence syntheses. All of us must continue to study, teach, and act cooperatively for that to happen.

Muka T, Glisic M, Milic J, Verhoog S, Bohlius J, Bramer W, et al. A 24-step guide on how to design, conduct, and successfully publish a systematic review and meta-analysis in medical research. Eur J Epidemiol. 2020;35(1):49–60.

Thomas J, McDonald S, Noel-Storr A, Shemilt I, Elliott J, Mavergames C, et al. Machine learning reduced workload with minimal risk of missing studies: development and evaluation of a randomized controlled trial classifier for cochrane reviews. J Clin Epidemiol. 2021;133:140–51.

Fontelo P, Liu F. A review of recent publication trends from top publishing countries. Syst Rev. 2018;7(1):147.

Whiting P, Savović J, Higgins JPT, Caldwell DM, Reeves BC, Shea B, et al. ROBIS: a new tool to assess risk of bias in systematic reviews was developed. J Clin Epidemiol. 2016;69:225–34.

Shea BJ, Grimshaw JM, Wells GA, Boers M, Andersson N, Hamel C, et al. Development of AMSTAR: a measurement tool to assess the methodological quality of systematic reviews. BMC Med Res Methodol. 2007;7:1–7.

Shea BJ, Reeves BC, Wells G, Thuku M, Hamel C, Moran J, et al. AMSTAR 2: a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. BMJ. 2017;358: j4008.

Goldkuhle M, Narayan VM, Weigl A, Dahm P, Skoetz N. A systematic assessment of Cochrane reviews and systematic reviews published in high-impact medical journals related to cancer. BMJ Open. 2018;8(3): e020869.

Ho RS, Wu X, Yuan J, Liu S, Lai X, Wong SY, et al. Methodological quality of meta-analyses on treatments for chronic obstructive pulmonary disease: a cross-sectional study using the AMSTAR (Assessing the Methodological Quality of Systematic Reviews) tool. NPJ Prim Care Respir Med. 2015;25:14102.

Tsoi AKN, Ho LTF, Wu IXY, Wong CHL, Ho RST, Lim JYY, et al. Methodological quality of systematic reviews on treatments for osteoporosis: a cross-sectional study. Bone. 2020;139(June): 115541.

Arienti C, Lazzarini SG, Pollock A, Negrini S. Rehabilitation interventions for improving balance following stroke: an overview of systematic reviews. PLoS ONE. 2019;14(7):1–23.

Kolaski K, Romeiser Logan L, Goss KD, Butler C. Quality appraisal of systematic reviews of interventions for children with cerebral palsy reveals critically low confidence. Dev Med Child Neurol. 2021;63(11):1316–26.

Almeida MO, Yamato TP, Parreira PCS, do Costa LOP, Kamper S, Saragiotto BT. Overall confidence in the results of systematic reviews on exercise therapy for chronic low back pain: a cross-sectional analysis using the Assessing the Methodological Quality of Systematic Reviews (AMSTAR) 2 tool. Braz J Phys Ther. 2020;24(2):103–17.

Mayo-Wilson E, Ng SM, Chuck RS, Li T. The quality of systematic reviews about interventions for refractive error can be improved: a review of systematic reviews. BMC Ophthalmol. 2017;17(1):1–10.

Matthias K, Rissling O, Pieper D, Morche J, Nocon M, Jacobs A, et al. The methodological quality of systematic reviews on the treatment of adult major depression needs improvement according to AMSTAR 2: a cross-sectional study. Heliyon. 2020;6(9): e04776.

Riado Minguez D, Kowalski M, Vallve Odena M, Longin Pontzen D, Jelicic Kadic A, Jeric M, et al. Methodological and reporting quality of systematic reviews published in the highest ranking journals in the field of pain. Anesth Analg. 2017;125(4):1348–54.

Churuangsuk C, Kherouf M, Combet E, Lean M. Low-carbohydrate diets for overweight and obesity: a systematic review of the systematic reviews. Obes Rev. 2018;19(12):1700–18.

Storman M, Storman D, Jasinska KW, Swierz MJ, Bala MM. The quality of systematic reviews/meta-analyses published in the field of bariatrics: a cross-sectional systematic survey using AMSTAR 2 and ROBIS. Obes Rev. 2020;21(5):1–11.

Franco JVA, Arancibia M, Meza N, Madrid E, Kopitowski K. [Clinical practice guidelines: concepts, limitations and challenges]. Medwave. 2020;20(3):e7887 ([Spanish]).

Brito JP, Tsapas A, Griebeler ML, Wang Z, Prutsky GJ, Domecq JP, et al. Systematic reviews supporting practice guideline recommendations lack protection against bias. J Clin Epidemiol. 2013;66(6):633–8.

Zhou Q, Wang Z, Shi Q, Zhao S, Xun Y, Liu H, et al. Clinical epidemiology in China series. Paper 4: the reporting and methodological quality of Chinese clinical practice guidelines published between 2014 and 2018: a systematic review. J Clin Epidemiol. 2021;140:189–99.

Lunny C, Ramasubbu C, Puil L, Liu T, Gerrish S, Salzwedel DM, et al. Over half of clinical practice guidelines use non-systematic methods to inform recommendations: a methods study. PLoS ONE. 2021;16(4):1–21.

Faber T, Ravaud P, Riveros C, Perrodeau E, Dechartres A. Meta-analyses including non-randomized studies of therapeutic interventions: a methodological review. BMC Med Res Methodol. 2016;16(1):1–26.

Ioannidis JPA. The mass production of redundant, misleading, and conflicted systematic reviews and meta-analyses. Milbank Q. 2016;94(3):485–514.

Møller MH, Ioannidis JPA, Darmon M. Are systematic reviews and meta-analyses still useful research? We are not sure. Intensive Care Med. 2018;44(4):518–20.

Moher D, Glasziou P, Chalmers I, Nasser M, Bossuyt PMM, Korevaar DA, et al. Increasing value and reducing waste in biomedical research: who’s listening? Lancet. 2016;387(10027):1573–86.

Barnard ND, Willet WC, Ding EL. The misuse of meta-analysis in nutrition research. JAMA. 2017;318(15):1435–6.

Guyatt G, Oxman AD, Akl EA, Kunz R, Vist G, Brozek J, et al. GRADE guidelines: 1. Introduction - GRADE evidence profiles and summary of findings tables. J Clin Epidemiol. 2011;64(4):383–94.

Page MJ, Shamseer L, Altman DG, Tetzlaff J, Sampson M, Tricco AC, et al. Epidemiology and reporting characteristics of systematic reviews of biomedical research: a cross-sectional study. PLoS Med. 2016;13(5):1–31.

World Health Organization. WHO handbook for guideline development, 2nd edn. WHO; 2014. Available from: https://www.who.int/publications/i/item/9789241548960 . Cited 2022 Jan 20

Higgins J, Lasserson T, Chandler J, Tovey D, Thomas J, Flemying E, et al. Methodological expectations of Cochrane intervention reviews. Cochrane; 2022. Available from: https://community.cochrane.org/mecir-manual/key-points-and-introduction . Cited 2022 Jul 19

Cumpston M, Chandler J. Chapter II: Planning a Cochrane review. In: Higgins J, Thomas J, Chandler J, Cumpston M, Li T, Page M, et al., editors. Cochrane handbook for systematic reviews of interventions. Cochrane; 2022. Available from: https://training.cochrane.org/handbook . Cited 2022 Jan 30

Henderson LK, Craig JC, Willis NS, Tovey D, Webster AC. How to write a cochrane systematic review. Nephrology. 2010;15(6):617–24.

Page MJ, Altman DG, Shamseer L, McKenzie JE, Ahmadzai N, Wolfe D, et al. Reproducible research practices are underused in systematic reviews of biomedical interventions. J Clin Epidemiol. 2018;94:8–18.

Lorenz RC, Matthias K, Pieper D, Wegewitz U, Morche J, Nocon M, et al. AMSTAR 2 overall confidence rating: lacking discriminating capacity or requirement of high methodological quality? J Clin Epidemiol. 2020;119:142–4.

Posadzki P, Pieper D, Bajpai R, Makaruk H, Könsgen N, Neuhaus AL, et al. Exercise/physical activity and health outcomes: an overview of Cochrane systematic reviews. BMC Public Health. 2020;20(1):1–12.

Wells G, Shea B, O’Connell D, Peterson J, Welch V, Losos M. The Newcastile-Ottawa Scale (NOS) for assessing the quality of nonrandomized studies in meta-analyses. The Ottawa Hospital; 2009. Available from: https://www.ohri.ca/programs/clinical_epidemiology/oxford.asp . Cited 2022 Jul 19

Stang A. Critical evaluation of the Newcastle-Ottawa scale for the assessment of the quality of nonrandomized studies in meta-analyses. Eur J Epidemiol. 2010;25(9):603–5.

Stang A, Jonas S, Poole C. Case study in major quotation errors: a critical commentary on the Newcastle-Ottawa scale. Eur J Epidemiol. 2018;33(11):1025–31.

Ioannidis JPA. Massive citations to misleading methods and research tools: Matthew effect, quotation error and citation copying. Eur J Epidemiol. 2018;33(11):1021–3.

Khalil H, Ameen D, Zarnegar A. Tools to support the automation of systematic reviews: a scoping review. J Clin Epidemiol. 2022;144:22–42.

Crequit P, Boutron I, Meerpohl J, Williams H, Craig J, Ravaud P. Future of evidence ecosystem series: 2. Current opportunities and need for better tools and methods. J Clin Epidemiol. 2020;123:143–52.

Shemilt I, Noel-Storr A, Thomas J, Featherstone R, Mavergames C. Machine learning reduced workload for the cochrane COVID-19 study register: development and evaluation of the cochrane COVID-19 study classifier. Syst Rev. 2022;11(1):15.

Nguyen P-Y, Kanukula R, McKensie J, Alqaidoom Z, Brennan SE, Haddaway N, et al. Changing patterns in reporting and sharing of review data in systematic reviews with meta-analysis of the effects of interventions: a meta-research study. medRxiv; 2022 Available from: https://doi.org/10.1101/2022.04.11.22273688v3 . Cited 2022 Nov 18

Afshari A, Møller MH. Broken science and the failure of academics—resignation or reaction? Acta Anaesthesiol Scand. 2018;62(8):1038–40.

Butler E, Granholm A, Aneman A. Trustworthy systematic reviews–can journals do more? Acta Anaesthesiol Scand. 2019;63(4):558–9.

Negrini S, Côté P, Kiekens C. Methodological quality of systematic reviews on interventions for children with cerebral palsy: the evidence pyramid paradox. Dev Med Child Neurol. 2021;63(11):1244–5.

Page MJ, Moher D. Mass production of systematic reviews and meta-analyses: an exercise in mega-silliness? Milbank Q. 2016;94(3):515–9.

Clarke M, Chalmers I. Reflections on the history of systematic reviews. BMJ Evid Based Med. 2018;23(4):121–2.

Alnemer A, Khalid M, Alhuzaim W, Alnemer A, Ahmed B, Alharbi B, et al. Are health-related tweets evidence based? Review and analysis of health-related tweets on twitter. J Med Internet Res. 2015;17(10): e246.

PubMed   PubMed Central   Google Scholar  

Haber N, Smith ER, Moscoe E, Andrews K, Audy R, Bell W, et al. Causal language and strength of inference in academic and media articles shared in social media (CLAIMS): a systematic review. PLoS ONE. 2018;13(5): e196346.

Swetland SB, Rothrock AN, Andris H, Davis B, Nguyen L, Davis P, et al. Accuracy of health-related information regarding COVID-19 on Twitter during a global pandemic. World Med Heal Policy. 2021;13(3):503–17.

Nascimento DP, Almeida MO, Scola LFC, Vanin AA, Oliveira LA, Costa LCM, et al. Letter to the editor – not even the top general medical journals are free of spin: a wake-up call based on an overview of reviews. J Clin Epidemiol. 2021;139:232–4.

Ioannidis JPA, Fanelli D, Dunne DD, Goodman SN. Meta-research: evaluation and improvement of research methods and practices. PLoS Biol. 2015;13(10):1–7.

Munn Z, Stern C, Aromataris E, Lockwood C, Jordan Z. What kind of systematic review should I conduct? A proposed typology and guidance for systematic reviewers in the medical and health sciences. BMC Med Res Methodol. 2018;18(1):1–9.

Pollock M, Fernandez R, Becker LA, Pieper D, Hartling L. Chapter V: overviews of reviews. Cochrane handbook for systematic reviews of interventions. In:  Higgins J, Thomas J, Chandler J, Cumpston M, Li T, Page M, et al., editors. Cochrane; 2022. Available from: https://training.cochrane.org/handbook/current/chapter-v . Cited 2022 Mar 7

Tricco AC, Lillie E, Zarin W, O’Brien K, Colquhoun H, Kastner M, et al. A scoping review on the conduct and reporting of scoping reviews. BMC Med Res Methodol. 2016;16(1):1–10.

Garritty C, Gartlehner G, Nussbaumer-Streit B, King VJ, Hamel C, Kamel C, et al. Cochrane rapid reviews methods group offers evidence-informed guidance to conduct rapid reviews. J Clin Epidemiol. 2021;130:13–22.

Elliott JH, Synnot A, Turner T, Simmonds M, Akl EA, McDonald S, et al. Living systematic review: 1. Introduction—the why, what, when, and how. J Clin Epidemiol. 2017;91:23–30.

Higgins JPT, Thomas J, Chandler J. Cochrane handbook for systematic reviews of interventions. Cochrane; 2022. Available from: https://training.cochrane.org/handbook . Cited 2022 Jan 25

Aromataris E, Munn Z. JBI Manual for Evidence Synthesis [internet]. JBI; 2020 [cited 2022 Jan 15]. Available from: https://synthesismanual.jbi.global .

Tufanaru C, Munn Z, Aromartaris E, Campbell J, Hopp L. Chapter 3: Systematic reviews of effectiveness. In Aromataris E, Munn Z, editors. JBI Manual for Evidence Synthesis [internet]. JBI; 2020 [cited 2022 Jan 25]. Available from: https://synthesismanual.jbi.global .

Leeflang MMG, Davenport C, Bossuyt PM. Defining the review question. In: Deeks JJ, Bossuyt PM, Leeflang MMG, Takwoingi Y, editors. Cochrane handbook for systematic reviews of diagnostic test accuracy [internet]. Cochrane; 2022 [cited 2022 Mar 30]. Available from: https://training.cochrane.org/6-defining-review-question .

Noyes J, Booth A, Cargo M, Flemming K, Harden A, Harris J, et al.Qualitative evidence. In: Higgins J, Tomas J, Chandler J, Cumpston M, Li T, Page M, et al., editors. Cochrane handbook for systematic reviews of interventions [internet]. Cochrane; 2022 [cited 2022 Mar 30]. Available from: https://training.cochrane.org/handbook/current/chapter-21#section-21-5 .

Lockwood C, Porritt K, Munn Z, Rittenmeyer L, Salmond S, Bjerrum M, et al. Chapter 2: Systematic reviews of qualitative evidence. In: Aromataris E, Munn Z, editors. JBI Manual for Evidence Synthesis [internet]. JBI; 2020 [cited 2022 Jul 11]. Available from: https://synthesismanual.jbi.global .

Debray TPA, Damen JAAG, Snell KIE, Ensor J, Hooft L, Reitsma JB, et al. A guide to systematic review and meta-analysis of prediction model performance. BMJ. 2017;356:i6460.

Moola S, Munn Z, Tufanaru C, Aromartaris E, Sears K, Sfetcu R, et al. Systematic reviews of etiology and risk. In: Aromataris E, Munn Z, editors. JBI Manual for Evidence Synthesis [internet]. JBI; 2020 [cited 2022 Mar 30]. Available from: https://synthesismanual.jbi.global/ .

Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, et al. The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Qual Life Res. 2010;19(4):539–49.

Prinsen CAC, Mokkink LB, Bouter LM, Alonso J, Patrick DL, de Vet HCW, et al. COSMIN guideline for systematic reviews of patient-reported outcome measures. Qual Life Res. 2018;27(5):1147–57.

Munn Z, Moola S, Lisy K, Riitano D, Tufanaru C. Chapter 5: Systematic reviews of prevalence and incidence. In: Aromataris E, Munn Z, editors. JBI Manual for Evidence Synthesis [internet]. JBI; 2020 [cited 2022 Mar 30]. Available from: https://synthesismanual.jbi.global/ .

Centre for Evidence-Based Medicine. Study designs. CEBM; 2016. Available from: https://www.cebm.ox.ac.uk/resources/ebm-tools/study-designs . Cited 2022 Aug 30

Hartling L, Bond K, Santaguida PL, Viswanathan M, Dryden DM. Testing a tool for the classification of study designs in systematic reviews of interventions and exposures showed moderate reliability and low accuracy. J Clin Epidemiol. 2011;64(8):861–71.

Crowe M, Sheppard L, Campbell A. Reliability analysis for a proposed critical appraisal tool demonstrated value for diverse research designs. J Clin Epidemiol. 2012;65(4):375–83.

Reeves BC, Wells GA, Waddington H. Quasi-experimental study designs series—paper 5: a checklist for classifying studies evaluating the effects on health interventions—a taxonomy without labels. J Clin Epidemiol. 2017;89:30–42.

Reeves BC, Deeks JJ, Higgins JPT, Shea B, Tugwell P, Wells GA. Chapter 24: including non-randomized studies on intervention effects.  In: Higgins J, Thomas J, Chandler J, Cumpston M, Li T, Page M, et al., editors. Cochrane handbook for systematic reviews of interventions. Cochrane; 2022. Available from: https://training.cochrane.org/handbook/current/chapter-24 . Cited 2022 Mar 1

Reeves B. A framework for classifying study designs to evaluate health care interventions. Forsch Komplementarmed Kl Naturheilkd. 2004;11(Suppl 1):13–7.

Google Scholar  

Rockers PC, Røttingen J, Shemilt I. Inclusion of quasi-experimental studies in systematic reviews of health systems research. Health Policy. 2015;119(4):511–21.

Mathes T, Pieper D. Clarifying the distinction between case series and cohort studies in systematic reviews of comparative studies: potential impact on body of evidence and workload. BMC Med Res Methodol. 2017;17(1):8–13.

Jhangiani R, Cuttler C, Leighton D. Single subject research. In: Jhangiani R, Cuttler C, Leighton D, editors. Research methods in psychology, 4th edn. Pressbooks KPU; 2019. Available from: https://kpu.pressbooks.pub/psychmethods4e/part/single-subject-research/ . Cited 2022 Aug 15

Higgins JP, Ramsay C, Reeves BC, Deeks JJ, Shea B, Valentine JC, et al. Issues relating to study design and risk of bias when including non-randomized studies in systematic reviews on the effects of interventions. Res Synth Methods. 2013;4(1):12–25.

Cumpston M, Lasserson T, Chandler J, Page M. 3.4.1 Criteria for considering studies for this review, Chapter III: Reporting the review. In: Higgins J, Thomas J, Chandler J, Cumpston M, Li T, Page M, et al., editors. Cochrane handbook for systematic reviews of interventions. Cochrane; 2022. Available from: https://training.cochrane.org/handbook/current/chapter-iii#section-iii-3-4-1 . Cited 2022 Oct 12

Kooistra B, Dijkman B, Einhorn TA, Bhandari M. How to design a good case series. J Bone Jt Surg. 2009;91(Suppl 3):21–6.

Murad MH, Sultan S, Haffar S, Bazerbachi F. Methodological quality and synthesis of case series and case reports. Evid Based Med. 2018;23(2):60–3.

Robinson K, Chou R, Berkman N, Newberry S, FU R, Hartling L, et al. Methods guide for comparative effectiveness reviews integrating bodies of evidence: existing systematic reviews and primary studies. AHRQ; 2015. Available from: https://archive.org/details/integrating-evidence-report-150226 . Cited 2022 Aug 7

Tugwell P, Welch VA, Karunananthan S, Maxwell LJ, Akl EA, Avey MT, et al. When to replicate systematic reviews of interventions: consensus checklist. BMJ. 2020;370: m2864.

Tsertsvadze A, Maglione M, Chou R, Garritty C, Coleman C, Lux L, et al. Updating comparative effectiveness reviews:current efforts in AHRQ’s effective health care program. J Clin Epidemiol. 2011;64(11):1208–15.

Cumpston M, Chandler J. Chapter IV: Updating a review. In: Higgins J, Thomas J, Chandler J, Cumpston M, Li T, Page M, et al., editors. Cochrane handbook for systematic reviews of interventions. Cochrane; 2022. Available from: https://training.cochrane.org/handbook . Cited 2022 Aug 2

Pollock M, Fernandes RM, Newton AS, Scott SD, Hartling L. A decision tool to help researchers make decisions about including systematic reviews in overviews of reviews of healthcare interventions. Syst Rev. 2019;8(1):1–8.

Pussegoda K, Turner L, Garritty C, Mayhew A, Skidmore B, Stevens A, et al. Identifying approaches for assessing methodological and reporting quality of systematic reviews: a descriptive study. Syst Rev. 2017;6(1):1–12.

Bhaumik S. Use of evidence for clinical practice guideline development. Trop Parasitol. 2017;7(2):65–71.

Moher D, Eastwood S, Olkin I, Drummond R, Stroup D. Improving the quality of reports of meta-analyses of randomised controlled trials: the QUOROM statement. Lancet. 1999;354:1896–900.

Stroup D, Berlin J, Morton S, Olkin I, Williamson G, Rennie D, et al. Meta-analysis of observational studies in epidemiology A proposal for reporting. JAMA. 2000;238(15):2008–12.

Moher D, Liberati A, Tetzlaff J, Altman DG. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. J Clin Epidemiol. 2009;62(10):1006–12.

Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372: n71.

Oxman AD, Guyatt GH. Validation of an index of the quality of review articles. J Clin Epidemiol. 1991;44(11):1271–8.

Centre for Evidence-Based Medicine. Critical appraisal tools. CEBM; 2015. Available from: https://www.cebm.ox.ac.uk/resources/ebm-tools/critical-appraisal-tools . Cited 2022 Apr 10

Page MJ, McKenzie JE, Higgins JPT. Tools for assessing risk of reporting biases in studies and syntheses of studies: a systematic review. BMJ Open. 2018;8(3):1–16.

Article   CAS   Google Scholar  

Ma LL, Wang YY, Yang ZH, Huang D, Weng H, Zeng XT. Methodological quality (risk of bias) assessment tools for primary and secondary medical studies: what are they and which is better? Mil Med Res. 2020;7(1):1–11.

Banzi R, Cinquini M, Gonzalez-Lorenzo M, Pecoraro V, Capobussi M, Minozzi S. Quality assessment versus risk of bias in systematic reviews: AMSTAR and ROBIS had similar reliability but differed in their construct and applicability. J Clin Epidemiol. 2018;99:24–32.

Swierz MJ, Storman D, Zajac J, Koperny M, Weglarz P, Staskiewicz W, et al. Similarities, reliability and gaps in assessing the quality of conduct of systematic reviews using AMSTAR-2 and ROBIS: systematic survey of nutrition reviews. BMC Med Res Methodol. 2021;21(1):1–10.

Pieper D, Puljak L, González-Lorenzo M, Minozzi S. Minor differences were found between AMSTAR 2 and ROBIS in the assessment of systematic reviews including both randomized and nonrandomized studies. J Clin Epidemiol. 2019;108:26–33.

Lorenz RC, Matthias K, Pieper D, Wegewitz U, Morche J, Nocon M, et al. A psychometric study found AMSTAR 2 to be a valid and moderately reliable appraisal tool. J Clin Epidemiol. 2019;114:133–40.

Leclercq V, Hiligsmann M, Parisi G, Beaudart C, Tirelli E, Bruyère O. Best-worst scaling identified adequate statistical methods and literature search as the most important items of AMSTAR2 (A measurement tool to assess systematic reviews). J Clin Epidemiol. 2020;128:74–82.

Bühn S, Mathes T, Prengel P, Wegewitz U, Ostermann T, Robens S, et al. The risk of bias in systematic reviews tool showed fair reliability and good construct validity. J Clin Epidemiol. 2017;91:121–8.

Gates M, Gates A, Duarte G, Cary M, Becker M, Prediger B, et al. Quality and risk of bias appraisals of systematic reviews are inconsistent across reviewers and centers. J Clin Epidemiol. 2020;125:9–15.

Perry R, Whitmarsh A, Leach V, Davies P. A comparison of two assessment tools used in overviews of systematic reviews: ROBIS versus AMSTAR-2. Syst Rev. 2021;10(1):273.

Gates M, Gates A, Guitard S, Pollock M, Hartling L. Guidance for overviews of reviews continues to accumulate, but important challenges remain: a scoping review. Syst Rev. 2020;9(1):1–19.

Aromataris E, Fernandez R, Godfrey C, Holly C, Khalil H, Tungpunkom P. Chapter 10: umbrella reviews. In: Aromataris E, Munn Z, editors. JBI Manual for Evidence Synthesis. JBI; 2020. Available from: https://synthesismanual.jbi.global . Cited 2022 Jul 11

Pieper D, Lorenz RC, Rombey T, Jacobs A, Rissling O, Freitag S, et al. Authors should clearly report how they derived the overall rating when applying AMSTAR 2—a cross-sectional study. J Clin Epidemiol. 2021;129:97–103.

Franco JVA, Meza N. Authors should also report the support for judgment when applying AMSTAR 2. J Clin Epidemiol. 2021;138:240.

Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gøtzsche PC, Ioannidis JPA, et al. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. PLoS Med. 2009;6(7): e1000100.

Page MJ, Moher D. Evaluations of the uptake and impact of the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) statement and extensions: a scoping review. Syst Rev. 2017;6(1):263.

Page MJ, Moher D, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. PRISMA 2020 explanation and elaboration: updated guidance and exemplars for reporting systematic reviews. BMJ. 2021;372: n160.

Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. Updating guidance for reporting systematic reviews: development of the PRISMA 2020 statement. J Clin Epidemiol. 2021;134:103–12.

Welch V, Petticrew M, Petkovic J, Moher D, Waters E, White H, et al. Extending the PRISMA statement to equity-focused systematic reviews (PRISMA-E 2012): explanation and elaboration. J Clin Epidemiol. 2016;70:68–89.

Beller EM, Glasziou PP, Altman DG, Hopewell S, Bastian H, Chalmers I, et al. PRISMA for abstracts: reporting systematic reviews in journal and conference abstracts. PLoS Med. 2013;10(4): e1001419.

Moher D, Shamseer L, Clarke M. Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015 statement. Syst Rev. 2015;4(1):1.

Hutton B, Salanti G, Caldwell DM, Chaimani A, Schmid CH, Cameron C, et al. The PRISMA extension statement for reporting of systematic reviews incorporating network meta-analyses of health care interventions: checklist and explanations. Ann Intern Med. 2015;162(11):777–84.

Stewart LA, Clarke M, Rovers M, Riley RD, Simmonds M, Stewart G, et al. Preferred reporting items for a systematic review and meta-analysis of individual participant data: The PRISMA-IPD statement. JAMA. 2015;313(16):1657–65.

Zorzela L, Loke YK, Ioannidis JP, Golder S, Santaguida P, Altman DG, et al. PRISMA harms checklist: Improving harms reporting in systematic reviews. BMJ. 2016;352: i157.

McInnes MDF, Moher D, Thombs BD, McGrath TA, Bossuyt PM, Clifford T, et al. Preferred Reporting Items for a Systematic Review and Meta-analysis of Diagnostic Test Accuracy studies The PRISMA-DTA statement. JAMA. 2018;319(4):388–96.

Tricco AC, Lillie E, Zarin W, O’Brien KK, Colquhoun H, Levac D, et al. PRISMA extension for scoping reviews (PRISMA-ScR): checklist and explanation. Ann Intern Med. 2018;169(7):467–73.

Wang X, Chen Y, Liu Y, Yao L, Estill J, Bian Z, et al. Reporting items for systematic reviews and meta-analyses of acupuncture: the PRISMA for acupuncture checklist. BMC Complement Altern Med. 2019;19(1):1–10.

Rethlefsen ML, Kirtley S, Waffenschmidt S, Ayala AP, Moher D, Page MJ, et al. PRISMA-S: An extension to the PRISMA statement for reporting literature searches in systematic reviews. J Med Libr Assoc. 2021;109(2):174–200.

Blanco D, Altman D, Moher D, Boutron I, Kirkham JJ, Cobo E. Scoping review on interventions to improve adherence to reporting guidelines in health research. BMJ Open. 2019;9(5): e26589.

Koster TM, Wetterslev J, Gluud C, Keus F, van der Horst ICC. Systematic overview and critical appraisal of meta-analyses of interventions in intensive care medicine. Acta Anaesthesiol Scand. 2018;62(8):1041–9.

Johnson BT, Hennessy EA. Systematic reviews and meta-analyses in the health sciences: best practice methods for research syntheses. Soc Sci Med. 2019;233:237–51.

Pollock A, Berge E. How to do a systematic review. Int J Stroke. 2018;13(2):138–56.

Gagnier JJ, Kellam PJ. Reporting and methodological quality of systematic reviews in the orthopaedic literature. J Bone Jt Surg. 2013;95(11):1–7.

Martinez-Monedero R, Danielian A, Angajala V, Dinalo JE, Kezirian EJ. Methodological quality of systematic reviews and meta-analyses published in high-impact otolaryngology journals. Otolaryngol Head Neck Surg. 2020;163(5):892–905.

Boutron I, Crequit P, Williams H, Meerpohl J, Craig J, Ravaud P. Future of evidence ecosystem series 1. Introduction-evidence synthesis ecosystem needs dramatic change. J Clin Epidemiol. 2020;123:135–42.

Ioannidis JPA, Bhattacharya S, Evers JLH, Der Veen F, Van SE, Barratt CLR, et al. Protect us from poor-quality medical research. Hum Reprod. 2018;33(5):770–6.

Lasserson T, Thomas J, Higgins J. Section 1.5 Protocol development, Chapter 1: Starting a review. In: Higgins J, Thomas J, Chandler J, Cumpston M, Li T, Page M, et al., editors. Cochrane handbook for systematic reviews of interventions. Cochrane; 2022. Available from: https://training.cochrane.org/handbook/archive/v6/chapter-01#section-1-5 . Cited 2022 Mar 20

Stewart L, Moher D, Shekelle P. Why prospective registration of systematic reviews makes sense. Syst Rev. 2012;1(1):7–10.

Allers K, Hoffmann F, Mathes T, Pieper D. Systematic reviews with published protocols compared to those without: more effort, older search. J Clin Epidemiol. 2018;95:102–10.

Ge L, Tian J, Li Y, Pan J, Li G, Wei D, et al. Association between prospective registration and overall reporting and methodological quality of systematic reviews: a meta-epidemiological study. J Clin Epidemiol. 2018;93:45–55.

Shamseer L, Moher D, Clarke M, Ghersi D, Liberati A, Petticrew M, et al. Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols (PRISMA-P) 2015: elaboration and explanation. BMJ. 2015;350: g7647.

Pieper D, Rombey T. Where to prospectively register a systematic review. Syst Rev. 2022;11(1):8.

PROSPERO. PROSPERO will require earlier registration. NIHR; 2022. Available from: https://www.crd.york.ac.uk/prospero/ . Cited 2022 Mar 20

Kirkham JJ, Altman DG, Williamson PR. Bias due to changes in specified outcomes during the systematic review process. PLoS ONE. 2010;5(3):3–7.

Victora CG, Habicht JP, Bryce J. Evidence-based public health: moving beyond randomized trials. Am J Public Health. 2004;94(3):400–5.

Peinemann F, Kleijnen J. Development of an algorithm to provide awareness in choosing study designs for inclusion in systematic reviews of healthcare interventions: a method study. BMJ Open. 2015;5(8): e007540.

Loudon K, Treweek S, Sullivan F, Donnan P, Thorpe KE, Zwarenstein M. The PRECIS-2 tool: designing trials that are fit for purpose. BMJ. 2015;350: h2147.

Junqueira DR, Phillips R, Zorzela L, Golder S, Loke Y, Moher D, et al. Time to improve the reporting of harms in randomized controlled trials. J Clin Epidemiol. 2021;136:216–20.

Hemkens LG, Contopoulos-Ioannidis DG, Ioannidis JPA. Routinely collected data and comparative effectiveness evidence: promises and limitations. CMAJ. 2016;188(8):E158–64.

Murad MH. Clinical practice guidelines: a primer on development and dissemination. Mayo Clin Proc. 2017;92(3):423–33.

Abdelhamid AS, Loke YK, Parekh-Bhurke S, Chen Y-F, Sutton A, Eastwood A, et al. Use of indirect comparison methods in systematic reviews: a survey of cochrane review authors. Res Synth Methods. 2012;3(2):71–9.

Jüni P, Holenstein F, Sterne J, Bartlett C, Egger M. Direction and impact of language bias in meta-analyses of controlled trials: empirical study. Int J Epidemiol. 2002;31(1):115–23.

Vickers A, Goyal N, Harland R, Rees R. Do certain countries produce only positive results? A systematic review of controlled trials. Control Clin Trials. 1998;19(2):159–66.

Jones CW, Keil LG, Weaver MA, Platts-Mills TF. Clinical trials registries are under-utilized in the conduct of systematic reviews: a cross-sectional analysis. Syst Rev. 2014;3(1):1–7.

Baudard M, Yavchitz A, Ravaud P, Perrodeau E, Boutron I. Impact of searching clinical trial registries in systematic reviews of pharmaceutical treatments: methodological systematic review and reanalysis of meta-analyses. BMJ. 2017;356: j448.

Fanelli D, Costas R, Ioannidis JPA. Meta-assessment of bias in science. Proc Natl Acad Sci USA. 2017;114(14):3714–9.

Hartling L, Featherstone R, Nuspl M, Shave K, Dryden DM, Vandermeer B. Grey literature in systematic reviews: a cross-sectional study of the contribution of non-English reports, unpublished studies and dissertations to the results of meta-analyses in child-relevant reviews. BMC Med Res Methodol. 2017;17(1):64.

Hopewell S, McDonald S, Clarke M, Egger M. Grey literature in meta-analyses of randomized trials of health care interventions. Cochrane Database Syst Rev. 2007;2:MR000010.

Shojania K, Sampson M, Ansari MT, Ji J, Garritty C, Radar T, et al. Updating systematic reviews. AHRQ Technical Reviews. 2007: Report 07–0087.

Tate RL, Perdices M, Rosenkoetter U, Wakim D, Godbee K, Togher L, et al. Revision of a method quality rating scale for single-case experimental designs and n-of-1 trials: The 15-item Risk of Bias in N-of-1 Trials (RoBiNT) Scale. Neuropsychol Rehabil. 2013;23(5):619–38.

Tate RL, Perdices M, McDonald S, Togher L, Rosenkoetter U. The design, conduct and report of single-case research: Resources to improve the quality of the neurorehabilitation literature. Neuropsychol Rehabil. 2014;24(3–4):315–31.

Sterne JAC, Savović J, Page MJ, Elbers RG, Blencowe NS, Boutron I, et al. RoB 2: a revised tool for assessing risk of bias in randomised trials. BMJ. 2019;366: l4894.

Sterne JA, Hernán MA, Reeves BC, Savović J, Berkman ND, Viswanathan M, et al. ROBINS-I: a tool for assessing risk of bias in non-randomised studies of interventions. BMJ. 2016;355: i4919.

Igelström E, Campbell M, Craig P, Katikireddi SV. Cochrane’s risk of bias tool for non-randomized studies (ROBINS-I) is frequently misapplied: a methodological systematic review. J Clin Epidemiol. 2021;140:22–32.

McKenzie JE, Brennan SE. Chapter 12: Synthesizing and presenting findings using other methods. In: Higgins J, Thomas J, Chandler J, Cumpston M, Li T, Page M, et al., editors. Cochrane handbook for systematic reviews of interventions. Cochrane; 2022. Available from: https://training.cochrane.org/handbook/current/chapter-12 . Cited 2022 Apr 10

Ioannidis J, Patsopoulos N, Rothstein H. Reasons or excuses for avoiding meta-analysis in forest plots. BMJ. 2008;336(7658):1413–5.

Stewart LA, Tierney JF. To IPD or not to IPD? Eval Health Prof. 2002;25(1):76–97.

Tierney JF, Stewart LA, Clarke M. Chapter 26: Individual participant data. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page M, et al., editors. Cochrane handbook for systematic reviews of interventions. Cochrane; 2022. Available from: https://training.cochrane.org/handbook/current/chapter-26 . Cited 2022 Oct 12

Chaimani A, Caldwell D, Li T, Higgins J, Salanti G. Chapter 11: Undertaking network meta-analyses. In: Higgins J, Thomas J, Chandler J, Cumpston M, Li T, Page M, et al., editors. Cochrane handbook for systematic reviews of interventions. Cochrane; 2022. Available from: https://training.cochrane.org/handbook . Cited 2022 Oct 12.

Cooper H, Hedges L, Valentine J. The handbook of research synthesis and meta-analysis. 3rd ed. Russell Sage Foundation; 2019.

Sutton AJ, Abrams KR, Jones DR, Sheldon T, Song F. Methods for meta-analysis in medical research. Methods for meta-analysis in medical research; 2000.

Deeks J, Higgins JPT, Altman DG. Chapter 10: Analysing data and undertaking meta-analyses. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page M, et al., editors. Cochrane handbook for systematic review of interventions. Cochrane; 2022. Available from: http://www.training.cochrane.org/handbook . Cited 2022 Mar 20.

Clarke MJ. Individual patient data meta-analyses. Best Pract Res Clin Obstet Gynaecol. 2005;19(1):47–55.

Catalá-López F, Tobías A, Cameron C, Moher D, Hutton B. Network meta-analysis for comparing treatment effects of multiple interventions: an introduction. Rheumatol Int. 2014;34(11):1489–96.

Debray T, Schuit E, Efthimiou O, Reitsma J, Ioannidis J, Salanti G, et al. An overview of methods for network meta-analysis using individual participant data: when do benefits arise? Stat Methods Med Res. 2016;27(5):1351–64.

Tonin FS, Rotta I, Mendes AM, Pontarolo R. Network meta-analysis : a technique to gather evidence from direct and indirect comparisons. Pharm Pract (Granada). 2017;15(1):943.

Tierney JF, Vale C, Riley R, Smith CT, Stewart L, Clarke M, et al. Individual participant data (IPD) metaanalyses of randomised controlled trials: guidance on their use. PLoS Med. 2015;12(7): e1001855.

Rouse B, Chaimani A, Li T. Network meta-analysis: an introduction for clinicians. Intern Emerg Med. 2017;12(1):103–11.

Cochrane Training. Review Manager RevMan Web. Cochrane; 2022. Available from: https://training.cochrane.org/online-learning/core-software/revman . Cited 2022 Jun 24

MetaXL. MetalXL. Epi Gear; 2016. Available from: http://epigear.com/index_files/metaxl.html . Cited 2022 Jun 24.

JBI. JBI SUMARI. JBI; 2019. Available from: https://sumari.jbi.global/ . Cited 2022 Jun 24.

Ryan R. Cochrane Consumers and Communication Review Group: data synthesis and analysis. Cochrane Consumers and Communication Review Group; 2013. Available from: http://cccrg.cochrane.org . Cited 2022 Jun 24

McKenzie JE, Beller EM, Forbes AB. Introduction to systematic reviews and meta-analysis. Respirology. 2016;21(4):626–37.

Campbell M, Katikireddi SV, Sowden A, Thomson H. Lack of transparency in reporting narrative synthesis of quantitative data: a methodological assessment of systematic reviews. J Clin Epidemiol. 2019;105:1–9.

Campbell M, McKenzie JE, Sowden A, Katikireddi SV, Brennan SE, Ellis S, et al. Synthesis without meta-analysis (SWiM) in systematic reviews: reporting guideline. BMJ. 2020;368: l6890.

McKenzie JE, Brennan S, Ryan R. Summarizing study characteristics and preparing for synthesis. In: Higgins J, Thomas J, Chandler J, Cumpston M, Li T, Page M, et al., editors. Cochrane handbook for systematic reviews of interventions. Cochrane; 2022. Available from: https://training.cochrane.org/handbook . Cited 2022 Oct 12

AHRQ. Systems to rate the strength of scientific evidence. Evidence report/technology assessment no. 47. AHRQ; 2002. Available from: https://archive.ahrq.gov/clinic/epcsums/strengthsum.htm . Cited 2022 Apr 10.

Atkins D, Eccles M, Flottorp S, Guyatt GH, Henry D, Hill S, et al. Systems for grading the quality of evidence and the strength of recommendations I: critical appraisal of existing approaches. BMC Health Serv Res. 2004;4(1):38.

Ioannidis JPA. Meta-research: the art of getting it wrong.  Res Synth Methods. 2010;1(3–4):169–84.

Lai NM, Teng CL, Lee ML. Interpreting systematic reviews:  are we ready to make our own conclusions? A cross sectional study. BMC Med. 2011;9(1):30.

Glenton C, Santesso N, Rosenbaum S, Nilsen ES, Rader T, Ciapponi A, et al. Presenting the results of Cochrane systematic reviews to a consumer audience: a qualitative study. Med Decis Making. 2010;30(5):566–77.

Yavchitz A, Ravaud P, Altman DG, Moher D, HrobjartssonA, Lasserson T, et al. A new classification of spin in systematic reviews and meta-analyses was developed and ranked according to the severity. J Clin Epidemiol. 2016;75:56–65.

Atkins D, Best D, Briss PA, Eccles M, Falck-Ytter Y, Flottorp S, et al. GRADE Working Group. Grading quality of evidence and strength of recommendations. BMJ. 2004;328:7454.

GRADE Working Group. Organizations. GRADE; 2022 [cited 2023 May 2].  Available from: www.gradeworkinggroup.org .

Hartling L, Fernandes RM, Seida J, Vandermeer B, Dryden DM. From the trenches: a cross-sectional study applying the grade tool in systematic reviews of healthcare interventions.  PLoS One. 2012;7(4):e34697.

Hultcrantz M, Rind D, Akl EA, Treweek S, Mustafa RA, Iorio A, et al. The GRADE working group clarifies the construct of certainty of evidence. J Clin Epidemiol. 2017;87:4–13.

Schünemann H, Brozek J, Guyatt G, Oxman AD, Editors. Section 6.3.2. Symbolic representation. GRADE Handbook [internet].  GRADE; 2013 [cited 2022 Jan 27]. Available from: https://gdt.gradepro.org/app/handbook/handbook.html#h.lr8e9vq954 .

Siemieniuk R, Guyatt G What is GRADE? [internet] BMJ Best Practice; 2017 [cited 2022 Jul 20]. Available from: https://bestpractice.bmj.com/info/toolkit/learn-ebm/what-is-grade/ .

Guyatt G, Oxman AD, Sultan S, Brozek J, Glasziou P, Alonso-Coello P, et al. GRADE guidelines: 11. Making an overall rating of confidence in effect estimates for a single outcome and for all outcomes. J Clin Epidemiol. 2013;66(2):151–7.

Guyatt GH, Oxman AD, Sultan S, Glasziou P, Akl EA, Alonso-Coello P, et al. GRADE guidelines: 9. Rating up the quality of evidence. J Clin Epidemiol. 2011;64(12):1311–6.

Guyatt GH, Oxman AD, Vist G, Kunz R, Brozek J, Alonso-Coello P, et al. GRADE guidelines: 4. Rating the quality of evidence - Study limitations (risk of bias). J Clin Epidemiol. 2011;64(4):407–15.

Guyatt GH, Oxman AD, Kunz R, Brozek J, Alonso-Coello P, Rind D, et al. GRADE guidelines 6. Rating the quality of evidence - Imprecision. J Clin Epidemiol. 2011;64(12):1283–93.

Guyatt GH, Oxman AD, Kunz R, Woodcock J, Brozek J, Helfand M, et al. GRADE guidelines: 7. Rating the quality of evidence - Inconsistency. J Clin Epidemiol. 2011;64(12):1294–302.

Guyatt GH, Oxman AD, Kunz R, Woodcock J, Brozek J, Helfand M, et al. GRADE guidelines: 8. Rating the quality of evidence - Indirectness. J Clin Epidemiol. 2011;64(12):1303–10.

Guyatt GH, Oxman AD, Montori V, Vist G, Kunz R, Brozek J, et al. GRADE guidelines: 5. Rating the quality of evidence - Publication bias. J Clin Epidemiol. 2011;64(12):1277–82.

Andrews JC, Schünemann HJ, Oxman AD, Pottie K, Meerpohl JJ, Coello PA, et al. GRADE guidelines: 15. Going from evidence to recommendation - Determinants of a recommendation’s direction and strength. J Clin Epidemiol. 2013;66(7):726–35.

Fleming PS, Koletsi D, Ioannidis JPA, Pandis N. High quality of the evidence for medical and other health-related interventions was uncommon in Cochrane systematic reviews. J Clin Epidemiol. 2016;78:34–42.

Howick J, Koletsi D, Pandis N, Fleming PS, Loef M, Walach H, et al. The quality of evidence for medical interventions does not improve or worsen: a metaepidemiological study of Cochrane reviews. J Clin Epidemiol. 2020;126:154–9.

Mustafa RA, Santesso N, Brozek J, Akl EA, Walter SD, Norman G, et al. The GRADE approach is reproducible in assessing the quality of evidence of quantitative evidence syntheses. J Clin Epidemiol. 2013;66(7):736-742.e5.

Schünemann H, Brozek J, Guyatt G, Oxman A, editors. Section 5.4: Overall quality of evidence. GRADE Handbook. GRADE; 2013. Available from: https://gdt.gradepro.org/app/handbook/handbook.html#h.lr8e9vq954a . Cited 2022 Mar 25.

GRADE Working Group. Criteria for using GRADE. GRADE; 2016. Available from: https://www.gradeworkinggroup.org/docs/Criteria_for_using_GRADE_2016-04-05.pdf . Cited 2022 Jan 26

Werner SS, Binder N, Toews I, Schünemann HJ, Meerpohl JJ, Schwingshackl L. Use of GRADE in evidence syntheses published in high-impact-factor nutrition journals: a methodological survey. J Clin Epidemiol. 2021;135:54–69.

Zhang S, Wu QJ, Liu SX. A methodologic survey on use of the GRADE approach in evidence syntheses published in high-impact factor urology and nephrology journals. BMC Med Res Methodol. 2022;22(1):220.

Li L, Tian J, Tian H, Sun R, Liu Y, Yang K. Quality and transparency of overviews of systematic reviews. J Evid Based Med. 2012;5(3):166–73.

Pieper D, Buechter R, Jerinic P, Eikermann M. Overviews of reviews often have limited rigor: a systematic review. J Clin Epidemiol. 2012;65(12):1267–73.

Cochrane Editorial Unit. Appendix 1: Checklist for auditing GRADE and SoF tables in protocols of intervention reviews. Cochrane Training; 2022. Available from: https://training.cochrane.org/gomo/modules/522/resources/8307/Checklist for GRADE and SoF methods in Protocols for Gomo.pdf. Cited 2022 Mar 12

Ryan R, Hill S. How to GRADE the quality of the evidence. Cochrane Consumers and Communication Group. Cochrane; 2016. Available from: https://cccrg.cochrane.org/author-resources .

Cunningham M, France EF, Ring N, Uny I, Duncan EA, Roberts RJ, et al. Developing a reporting guideline to improve meta-ethnography in health research: the eMERGe mixed-methods study. Heal Serv Deliv Res. 2019;7(4):1–116.

Tong A, Flemming K, McInnes E, Oliver S, Craig J. Enhancing transparency in reporting the synthesis of qualitative research: ENTREQ. BMC Med Res Methodol. 2012;12:181.

Gates M, Gates G, Pieper D, Fernandes R, Tricco A, Moher D, et al. Reporting guideline for overviews of reviews of healthcare interventions: development of the PRIOR statement. BMJ. 2022;378:e070849.

Whiting PF, Reitsma JB, Leeflang MMG, Sterne JAC, Bossuyt PMM, Rutjes AWSS, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155(4):529–36.

Hayden JA, van der Windt DA, Cartwright JL, Co P. Research and reporting methods assessing bias in studies of prognostic factors. Ann Intern Med. 2013;158(4):280–6.

Critical Appraisal Skills Programme. CASP qualitative checklist. CASP; 2018. Available from: https://casp-uk.net/images/checklist/documents/CASP-Qualitative-Studies-Checklist/CASP-Qualitative-Checklist-2018_fillable_form.pdf . Cited 2022 Apr 26

Hannes K, Lockwood C, Pearson A. A comparative analysis of three online appraisal instruments’ ability to assess validity in qualitative research. Qual Health Res. 2010;20(12):1736–43.

Munn Z, Moola S, Riitano D, Lisy K. The development of a critical appraisal tool for use in systematic reviews addressing questions of prevalence. Int J Heal Policy Manag. 2014;3(3):123–8.

Lewin S, Bohren M, Rashidian A, Munthe-Kaas H, Glenton C, Colvin CJ, et al. Applying GRADE-CERQual to qualitative evidence synthesis findings-paper 2: how to make an overall CERQual assessment of confidence and create a Summary of Qualitative Findings table. Implement Sci. 2018;13(suppl 1):10.

Munn Z, Porritt K, Lockwood C, Aromataris E, Pearson A.  Establishing confidence in the output of qualitative research synthesis: the ConQual approach. BMC Med Res Methodol. 2014;14(1):108.

Flemming K, Booth A, Hannes K, Cargo M, Noyes J. Cochrane Qualitative and Implementation Methods Group guidance series—paper 6: reporting guidelines for qualitative, implementation, and process evaluation evidence syntheses. J Clin Epidemiol. 2018;97:79–85.

Lockwood C, Munn Z, Porritt K. Qualitative research synthesis:  methodological guidance for systematic reviewers utilizing meta-aggregation. Int J Evid Based Health. 2015;13(3):179–87.

Schünemann HJ, Mustafa RA, Brozek J, Steingart KR, Leeflang M, Murad MH, et al. GRADE guidelines: 21 part 1.  Study design, risk of bias, and indirectness in rating the certainty across a body of evidence for test accuracy. J Clin Epidemiol. 2020;122:129–41.

Schünemann HJ, Mustafa RA, Brozek J, Steingart KR, Leeflang M, Murad MH, et al. GRADE guidelines: 21 part 2. Test accuracy: inconsistency, imprecision, publication bias, and other domains for rating the certainty of evidence and presenting it in evidence profiles and summary of findings tables. J Clin Epidemiol. 2020;122:142–52.

Foroutan F, Guyatt G, Zuk V, Vandvik PO, Alba AC, Mustafa R, et al. GRADE Guidelines 28: use of GRADE for the assessment of evidence about prognostic factors:  rating certainty in identification of groups of patients with different absolute risks. J Clin Epidemiol. 2020;121:62–70.

Janiaud P, Agarwal A, Belbasis L, Tzoulaki I. An umbrella review of umbrella reviews for non-randomized observational evidence on putative risk and protective factors [internet]. OSF protocol; 2021 [cited 2022 May 28]. Available from: https://osf.io/xj5cf/ .

Mokkink LB, Prinsen CA, Patrick DL, Alonso J, Bouter LM, et al. COSMIN methodology for systematic reviews of Patient-Reported Outcome Measures (PROMs) - user manual. COSMIN; 2018 [cited 2022 Feb 15]. Available from:  http://www.cosmin.nl/ .

Thomas J, M P, Noyes J, Chandler J, Rehfuess E, Tugwell P, et al. Chapter 17: Intervention complexity. In: Higgins J, Thomas J, Chandler J, Cumpston M, Li T, Page M, et al., editors. Cochrane handbook for systematic reviews of interventions. Cochrane; 2022. Available from: https://training.cochrane.org/handbook/current/chapter-17 . Cited 2022 Oct 12

Guise JM, Chang C, Butler M, Viswanathan M, Tugwell P. AHRQ series on complex intervention systematic reviews—paper 1: an introduction to a series of articles that provide guidance and tools for reviews of complex interventions. J Clin Epidemiol. 2017;90:6–10.

Riaz IB, He H, Ryu AJ, Siddiqi R, Naqvi SAA, Yao Y, et al. A living, interactive systematic review and network meta-analysis of first-line treatment of metastatic renal cell carcinoma [formula presented]. Eur Urol. 2021;80(6):712–23.

Créquit P, Trinquart L, Ravaud P. Live cumulative network meta-analysis: protocol for second-line treatments in advanced non-small-cell lung cancer with wild-type or unknown status for epidermal growth factor receptor. BMJ Open. 2016;6(8):e011841.

Ravaud P, Créquit P, Williams HC, Meerpohl J, Craig JC, Boutron I. Future of evidence ecosystem series: 3. From an evidence synthesis ecosystem to an evidence ecosystem. J Clin Epidemiol. 2020;123:153–61.

Download references

Acknowledgements

Michelle Oakman Hayes for her assistance with the graphics, Mike Clarke for his willingness to answer our seemingly arbitrary questions, and Bernard Dan for his encouragement of this project.

The work of John Ioannidis has been supported by an unrestricted gift from Sue and Bob O’Donnell to Stanford University.

Author information

Authors and Affiliations

Departments of Orthopaedic Surgery, Pediatrics, and Neurology, Wake Forest School of Medicine, Winston-Salem, NC, USA

Kat Kolaski

Department of Physical Medicine and Rehabilitation, SUNY Upstate Medical University, Syracuse, NY, USA

Lynne Romeiser Logan

Departments of Medicine, of Epidemiology and Population Health, of Biomedical Data Science, and of Statistics, and Meta-Research Innovation Center at Stanford (METRICS), Stanford University School of Medicine, Stanford, CA, USA

John P. A. Ioannidis

Contributions

All authors participated in the development of the ideas, writing, and review of this manuscript. The author(s) read and approved the final manuscript.

Corresponding author

Correspondence to Kat Kolaski.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article has been published simultaneously in BMC Systematic Reviews, Acta Anaesthesiologica Scandinavica, BMC Infectious Diseases, British Journal of Pharmacology, JBI Evidence Synthesis, the Journal of Bone and Joint Surgery Reviews, and the Journal of Pediatric Rehabilitation Medicine.

Supplementary Information

Additional file 2A.

Overviews, scoping reviews, rapid reviews and living reviews.

Additional file 2B.

Practical scheme for distinguishing types of research evidence.

Additional file 4.

Presentation of forest plots.

Additional file 5A.

Illustrations of the GRADE approach.

Additional file 5B.

 Adaptations of GRADE for evidence syntheses.

Additional file 6.

 Links to Concise Guide online resources.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Kolaski, K., Logan, L.R. & Ioannidis, J.P.A. Guidance to best tools and practices for systematic reviews. Syst Rev 12 , 96 (2023). https://doi.org/10.1186/s13643-023-02255-9

Received: 03 October 2022

Accepted: 19 February 2023

Published: 08 June 2023

DOI: https://doi.org/10.1186/s13643-023-02255-9

Keywords

  • Certainty of evidence
  • Critical appraisal
  • Methodological quality
  • Risk of bias
  • Systematic review

What is a systematic review?

Volume 14, Issue 3

  • Jane Clarke
  • Correspondence to Jane Clarke 4 Prime Road, Grey Lynn, Auckland, New Zealand; janeclarkehome{at}gmail.com

https://doi.org/10.1136/ebn.2011.0049

A high-quality systematic review is described as the most reliable source of evidence to guide clinical practice. The purpose of a systematic review is to deliver a meticulous summary of all the available primary research in response to a research question. A systematic review uses all the existing research and is sometimes called ‘secondary research’ (research on research). Systematic reviews are often required by research funders to establish the state of existing knowledge and are frequently used in guideline development. Systematic review findings are often used within the healthcare setting but may be applied elsewhere. For example, the Campbell Collaboration advocates the application of systematic reviews for policy-making in education, justice and social work.

Systematic reviews can be conducted on all types of primary research. Many are reviews of randomised trials (addressing questions of effectiveness), cross-sectional studies (addressing questions about prevalence or diagnostic accuracy, for example) or cohort studies (addressing questions about prognosis). When qualitative research is reviewed systematically, it may be described as a systematic review, but more often other terms such as meta-synthesis are used.

Systematic review methodology is explicit and precise and aims to minimise bias, thus enhancing the reliability of the conclusions drawn [1, 2]. The features of a systematic review include:

■ clear aims with predetermined eligibility and relevance criteria for studies;

■ transparent, reproducible methods;

■ rigorous search designed to locate all eligible studies;

■ an assessment of the validity of the findings of the included studies and

■ a systematic presentation, and synthesis, of the included studies [3].

The first step in a systematic review is a meticulous search of all sources of evidence for relevant studies. The databases and citation indexes searched are listed in the methodology section of the review. Next, titles and abstracts are screened for eligibility and relevance using predetermined, reproducible criteria. Each study is then assessed in terms of methodological quality.
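To make the idea of predetermined, reproducible criteria concrete, the following minimal sketch in Python is purely illustrative: the records, field names and criteria are invented for this example and are not taken from the article. The point is only that the same explicit rules, fixed in the protocol before screening begins, are applied to every retrieved record, so another reviewer running the same rules reaches the same screening decisions.

```python
# Purely illustrative example: title/abstract screening with predetermined,
# reproducible eligibility criteria (records, fields and criteria are invented).

records = [
    {"title": "Exercise therapy for chronic low back pain: a randomised trial",
     "year": 2019, "design": "randomised trial"},
    {"title": "A narrative commentary on back pain management",
     "year": 2021, "design": "commentary"},
    {"title": "Physiotherapy outcomes in adults with low back pain: a cohort study",
     "year": 2008, "design": "cohort study"},
]

# Criteria fixed in the review protocol before screening begins.
ELIGIBLE_DESIGNS = {"randomised trial"}
EARLIEST_YEAR = 2010
TOPIC_TERMS = ("low back pain",)

def is_eligible(record):
    """Apply every predefined inclusion criterion to a single record."""
    on_topic = any(term in record["title"].lower() for term in TOPIC_TERMS)
    return (on_topic
            and record["design"] in ELIGIBLE_DESIGNS
            and record["year"] >= EARLIEST_YEAR)

included = [r for r in records if is_eligible(r)]
excluded = [r for r in records if not is_eligible(r)]
print(f"Screened {len(records)} records: {len(included)} included, {len(excluded)} excluded")
```

In practice, screening is usually carried out in dedicated review software and by at least two independent reviewers; the sketch only illustrates that the criteria are defined in advance and applied uniformly.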

Finally, the evidence is synthesised. This process may or may not include a meta-analysis. A meta-analysis is a statistical summary of the findings of independent studies [4]. Meta-analyses can potentially present more precise estimates of the effects of interventions than those derived from the individual studies alone. Together, these strategies limit the bias and random error that may arise during the review process. Without these safeguards, reviews can mislead and give an unreliable summary of the available knowledge.
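The gain in precision from pooling can be shown with a small worked example. The sketch below is purely illustrative (the three effect estimates and standard errors are invented and are not taken from any study cited here); it performs a fixed-effect, inverse-variance meta-analysis, in which each study is weighted by the inverse of its variance, and prints a pooled estimate whose 95% confidence interval is narrower than that of any single study.

```python
# Purely illustrative fixed-effect, inverse-variance meta-analysis of three
# hypothetical studies (effect estimates and standard errors are invented).
import math

studies = [(-2.0, 1.2), (-1.5, 0.9), (-2.8, 1.5)]  # (effect estimate, standard error)

weights = [1 / se ** 2 for _, se in studies]        # inverse-variance weights
pooled = sum(w * est for (est, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

for est, se in studies:
    print(f"Study:  {est:5.2f} (95% CI {est - 1.96 * se:5.2f} to {est + 1.96 * se:5.2f})")
print(f"Pooled: {pooled:5.2f} (95% CI {pooled - 1.96 * pooled_se:5.2f} "
      f"to {pooled + 1.96 * pooled_se:5.2f})")
```

A random-effects model, which additionally allows for between-study heterogeneity, is often preferred in practice, but the weighting principle is the same.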

The Cochrane Collaboration is a leader in the production of systematic reviews. Cochrane reviews are published on a monthly basis in the Cochrane Database of Systematic Reviews in The Cochrane Library (see: http://www.thecochranelibrary.com ).

Competing interests: None.

  • Open access
  • Published: 29 April 2024

What is context in knowledge translation? Results of a systematic scoping review

  • Tugce Schmitt   ORCID: orcid.org/0000-0001-6893-6428 1 ,
  • Katarzyna Czabanowska 1 &
  • Peter Schröder-Bäck 1  

Health Research Policy and Systems volume  22 , Article number:  52 ( 2024 ) Cite this article

217 Accesses

4 Altmetric

Metrics details

Knowledge Translation (KT) aims to convey novel ideas to relevant stakeholders, motivating their response or action to improve people’s health. Initially, the KT literature focused on evidence-based medicine, applying findings from laboratory and clinical research to disease diagnosis and treatment. Since the early 2000s, the scope of KT has expanded to include decision-making with health policy implications.

This systematic scoping review aims to assess the evolving knowledge-to-policy concepts, that is, macro-level KT theories, models and frameworks (KT TMFs). While significant attention has been devoted to transferring knowledge to healthcare settings (i.e. implementing health policies, programmes or measures at the meso-level), the definition of 'context' in the realm of health policymaking at the macro-level remains underexplored in the KT literature. This study aims to close the gap.

A total of 32 macro-level KT TMFs were identified, with only a limited subset offering detailed insights into the contextual factors that matter in health policymaking. Notably, the majority of these studies aim to prompt policy changes in low- and middle-income countries and received support from international organisations, the European Union, development agencies or philanthropic entities.

Peer Review reports

Few concepts are used by health researchers as vaguely and yet as widely as Knowledge Translation (KT), a catch-all term that accommodates a broad spectrum of ambitions. Arguably, to truly understand the role of context in KT, we first need to clarify what KT means. The World Health Organization (WHO) defines KT as ‘the synthesis, exchange and application of knowledge by relevant stakeholders to accelerate the benefits of global and local innovation in strengthening health systems and improving people’s health’ [ 1 ]. Here, particular attention should be paid to ‘innovation’, given that without unpacking this term, the meaning of KT would still remain ambiguous. Rogers’ seminal work ‘Diffusion of Innovations’ [ 2 ] defines innovation as an idea, practice or object that is perceived as novel by individuals or groups adopting it. In this context, he argues that the objective novelty of an idea in terms of the amount of time passed after its discovery holds little significance [ 2 ]. Rather, it is the subjective perception of newness by the individual that shapes their response [ 2 ]. In other words, if an idea seems novel to individuals, and thereby relevant stakeholders according to the aforementioned WHO definition, it qualifies as an innovation. From this perspective, it can be stated that a fundamental activity of KT is to communicate ideas that could be perceived as original to the targeted stakeholders, with the aim of motivating their response to improve health outcomes. This leaves us with the question of who exactly these stakeholders might be and what kind of actions would be required from them.

The scope of stakeholders in KT has evolved over time, along with their prompted responses. Initially, during the early phases of KT, the focus primarily revolved around healthcare providers and their clinical decisions, emphasising evidence-based medicine. Nearly 50 years ago, the first scientific article on KT was published, introducing Tier 1 KT, which concentrated on applying laboratory discoveries to disease diagnosis or treatment, also known as bench-to-bedside KT [ 3 ]. The primary motivation behind this initial conceptualisation of KT was to engage healthcare providers as the end-users of specific forms of knowledge, primarily related to randomised controlled trials of pharmaceuticals and evidence-based medicine [ 4 ]. In the early 2000s, the second phase of KT (Tier 2) emerged under the term ‘campus-to-clinic KT’ [ 3 ]. This facet, also known as translational research, was concerned with using evidence from health services research in healthcare provision, both in practice and policy [ 4 ]. Consequently, by including decision-makers as relevant end-users, KT scholars expanded the realm of research-to-action from the clinical environment to policy-relevant decision-making [ 5 ]. Following this trajectory, additional KT schemes (Tier 3–Tier 5) have been introduced into academic discourse, encompassing the dissemination, implementation and broader integration of knowledge into public policies [ 6 , 7 ]. Notably, the latest scheme (Tier 5) is becoming increasingly popular and represents the broadest approach, which describes the translation of knowledge to global communities and aims to involve fundamental, universal change in attitudes, policies and social systems [ 7 ].

In other words, a noticeable shift in KT has occurred with time towards macro-level interventions, initially named evidence-based policymaking and later corrected to evidence-informed policymaking. In parallel with these significant developments, various alternative terms to KT have emerged, including ‘implementation science’, ‘knowledge transfer’, and ‘dissemination and research use’, often with considerable overlap [ 8 ]. Arguably, among the plethora of alternative terms proposed, implementation science stands out prominently. While initially centred on evidence-based medicine at the meso-level (e.g. implementing medical guidelines), it has since broadened its focus to ‘encompass all aspects of research relevant to the scientific study of methods to promote the uptake of research findings into routine settings in clinical, community and policy contexts’ [ 9 ], closely mirroring the definition of KT. Thus, KT, along with activities under different names that share the same objective, has evolved into an umbrella term over the years, encompassing a wide range of strategies aimed at enhancing the impact of research not only on clinical practice but also on public policies [ 10 ]. Following the adoption of such a comprehensive definition of KT, some researchers have asserted that using evidence in public policies is not merely commendable but essential [ 11 ].

In alignment with the evolution of KT from (bio-)medical sciences to public policies, an increasing number of scholars have offered explanations on how health policies should be developed [ 12 ], indicating a growing focus on exploring the mechanisms of health policymaking in the KT literature. However, unlike in the earlier phases of KT, which aimed to transfer knowledge from the laboratory to healthcare provision, decisions made for public policies may be less technical and more complex than those in clinical settings [ 3 , 13 , 14 ]. Indeed, social scientists point out that scholarly works on evidence use in health policies exhibit theoretical shortcomings as they lack engagement with political science and public administration theories and concepts [ 15 , 16 , 17 , 18 ]; only a few of these works employ policy theories and political concepts to guide data collection and make sense of their findings [ 19 ]. Similarly, contemporary literature that conceptualises KT as an umbrella term for both clinical and public policy decision-making, with calls for a generic ‘research-to-action’ [ 20 ], may fail to recognise the different types of actions required to change clinical practices and influence health policies. In many respects, such calls can even lead to a misconception that evidence-informed policymaking is simply a scaled-up version of evidence-based medicine [ 21 ].

In this study, we systematically review knowledge translation theories, models and frameworks (also known as KT TMFs) that were developed for health policies. Essentially, KT TMFs can be depicted as bridges that connect findings across diverse studies, as they establish a common language and standardise the measurement and assessment of desired policy changes [ 22 ]. This makes them essential for generalising implementation efforts and research findings [ 23 ]. While distinctions between a theory, a model or a framework are not always crystal-clear [ 24 ], the following definitions shed light on how they are interpreted in the context of KT. To start with, theory can be described as a set of analytical principles or statements crafted to structure our observations, enhance our understanding and explain the world [ 24 ]. Within implementation science, theories are encapsulated as either generalised models or frameworks. In other words, they are integrated into broader concepts, allowing researchers to form assumptions that help clarify phenomena and create hypotheses for testing [ 25 ].

Whereas theories in the KT literature are explanatory as well as descriptive, KT models are only descriptive with a more narrowly defined scope of explanation [ 24 ]; hence they have a more specific focus than theories [ 25 ]. KT models are created to facilitate the formulation of specific assumptions regarding a set of parameters or variables, which can subsequently be tested against outcomes using predetermined methods [ 25 ]. By offering simplified representations of complex situations, KT models can describe programme elements expected to produce desired results, or theoretical constructs believed to influence or moderate observed outcomes. In this way, they encompass theories related to change or explanation [ 22 ].

Lastly, frameworks in the KT language define a set of variables and the relations among them in a broad sense [ 25 ]. Frameworks, without the aim of providing explanations, solely describe empirical phenomena, representing a structure, overview, outline, system or plan consisting of various descriptive categories and the relations between them that are presumed to account for a phenomenon [ 24 ]. They portray loosely-structured constellations of theoretical constructs, without necessarily specifying their relationships; they can also offer practical methods for achieving implementation objectives [ 22 ]. Some scholars suggest sub-classifications and categorise a framework as ‘actionable’ if it has the potential to facilitate macro-level policy changes [ 11 ].

Context, which encompasses the entire environment in which policy decisions are made, is not peripheral but central to policymaking, playing a crucial role in its conceptualisation [ 26 , 27 , 28 , 29 , 30 , 31 , 32 , 33 , 34 ]. In the KT literature, the term ‘context’ is frequently employed, albeit often with a lack of precision [ 35 ]. It tends to serve as a broad term including various elements within a situation that are relevant to KT in some way but have not been explicitly identified [36]. However, there is a growing interest in delving deeper into what context refers to, as evidenced by increasing research attention [ 31 , 32 , 37 , 38 , 39 , 40 , 41 ]. While the definition of context in the transfer of knowledge to healthcare settings (i.e. implementing health policies, programmes or measures at the meso-level) has been systematically studied [ 36 , 37 , 42 , 43 ], the question of how KT scholars detail context in health policymaking remains unanswered. With our systematic scoping review, we aim to close this gap.

While KT TMFs, which emerged from evidence-based medicine, have historically depicted the use of evidence from laboratories or healthcare organisations as the gold standard, we aimed to assess in this study whether and to what extent the evolving face of KT, as it addresses health policies, has succeeded in foregrounding ‘context’. Our objective was thus not to evaluate the quality of these KT TMFs but rather to explore how scholars have incorporated contextual influences into their reasoning. We conducted a systematic scoping review to explore KT TMFs that are relevant to agenda-setting, policy formulation or policy adoption, in line with the aim of this study. Therefore, publications related to policy implementation in healthcare organisations or at the provincial level, as well as those addressing policy evaluation, did not meet our inclusion criteria. Consequently, given our focus on macro-level interventions, we excluded all articles that concentrate on translating clinical research into practice (meso-level interventions) and health knowledge to patients or citizens (micro-level interventions).

Prior systematic scoping reviews in the area of KT TMFs serve as a valuable foundation upon which to build further studies [ 44 , 45 ]. Using established methodologies may ensure a validated approach, allowing for a more nuanced understanding of KT TMFs in the context of existing scholarly work. Our review methodology employed a similar approach to that followed by Strifler et al. in 2018, who conducted a systematic scoping review of KT TMFs in the field of cancer prevention and management, as well as other chronic diseases [ 44 ]. Their search strategy was preferred over others for two primary reasons. First, Strifler et al. investigated KT TMFs altogether, systematically and comprehensively. Second, unlike many other review studies on KT, they focused on macro-level KT and included all relevant keywords useful for the purpose of our study in their Ovid/MEDLINE search query [ 44 ]. For our scoping review, we adapted their search query with the assistance of a specialist librarian. This process involved eliminating terms associated with cancer and chronic diseases, removing the time limitation on published papers, and including German as an additional language alongside English, owing to the authors’ proficiency in it. We included articles published in peer-reviewed journals until November 2022, excluding opinion papers, conference abstracts and study protocols, without any restriction on publication date or place. Our search query is presented in Table  1 .

Following a screening methodology similar to that employed by Votruba et al. [ 11 ], the first author conducted an initial screening of the titles and abstracts of 2918 unique citations. Full texts were selected and scrutinised if they appeared relevant to the topics of agenda-setting, policy formulation or policy adoption. Among these papers, the first author also identified those that conceptualised a KT TMF. Simultaneously, the last author independently screened a randomly selected 20% of the 2918 titles and abstracts to identify studies related to macro-level KT. Regarding papers that conceptualised a KT TMF, all those initially selected by the first author underwent a thorough examination by the last author as well. In the papers reviewed by these two authors, KT TMFs were typically presented as either tables or figures. In cases where these visual representations did not contain sufficient information about ‘context’, the main body of the study was carefully scrutinised by both reviewers to ensure no relevant information was missed. Any unclear cases were discussed and resolved to achieve 100% inter-rater agreement between the two reviewers. This strategy resulted in the inclusion of 32 relevant studies. The flow chart outlining our review process is provided in Fig.  1 .

figure 1

Flow chart of the review process
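For readers who want to script a comparable dual-screening step, the sketch below shows one way the reproducible 20% random sample and the comparison of reviewer decisions could be implemented. It is a minimal illustration rather than the authors' actual workflow: the file names, column names and random seed are hypothetical, and the agreement figure is purely descriptive, since disagreements in the review were resolved by discussion rather than by any statistical threshold.

# Minimal sketch (not the authors' code): draw a reproducible 20% sample of
# citations for independent second screening and flag disagreements to discuss.
import pandas as pd

# Hypothetical export of reviewer 1's decisions: citation_id, title, abstract, decision_r1
citations = pd.read_csv("screening_reviewer1.csv")

# A fixed seed makes the 20% sample reproducible.
sample = citations.sample(frac=0.20, random_state=42)
sample[["citation_id", "title", "abstract"]].to_csv("sample_for_reviewer2.csv", index=False)

# Once reviewer 2 returns decisions (citation_id, decision_r2), compare the calls.
r2 = pd.read_csv("screening_reviewer2.csv")
merged = sample.merge(r2, on="citation_id")
agreement = (merged["decision_r1"] == merged["decision_r2"]).mean()
to_discuss = merged[merged["decision_r1"] != merged["decision_r2"]]

print(f"Raw agreement: {agreement:.0%}; citations to discuss: {len(to_discuss)}")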

According to the results of our systematic scoping review (Table  2 ), the first KT TMF developed for health policies dates back to 2003, confirming the emergence of a trend that expanded the meaning of the term Knowledge Translation to include policymakers as end-users of evidence during approximately the same period. In their study, Jacobson et al. [ 46 ] present a framework derived from a literature review to enhance understanding of user groups by organising existing knowledge, identifying gaps and emphasising the importance of learning about new contexts. However, despite acknowledging the significance of the user group context, the paper lacks a thorough explanation of the authors’ understanding of this term. The second study in our scoping review provides some details. Recognising a shift from evidence-based medicine to evidence-based health policymaking in the KT literature, the article by Dobrow et al. from 2004 [ 30 ] emphasises the importance of considering contextual factors. They present a conceptual framework for evidence-based decision-making, highlighting the influence of context in KT. Illustrated through examples from colorectal cancer screening policy development, their conceptual framework emphasises the significance of context in the introduction, interpretation and application of evidence. Third, Lehoux et al. [ 47 ] examine the field of Health Technology Assessment (HTA) and its role in informing decision and policymaking in Canada. By developing a conceptual framework for HTA dissemination and use, they touch on the institutional environment and briefly describe contextual factors.

Notably, the first three publications in our scoping review are authored by scholars affiliated with Canada, which is hardly a coincidence, given the role of the Canadian Institutes of Health Research (CIHR), the federal funding agency for health research: The CIHR Act (Bill C-13) mandates CIHR to ensure that the translation of health knowledge permeates every aspect of its work [ 48 ]. Moreover, it was CIHR that coined the term Knowledge Translation, defining KT as ‘a dynamic and iterative process that includes the synthesis, dissemination, exchange and ethically sound application of knowledge to improve health, provide more effective health services and products, and strengthen the health care system’ [ 49 ]. This comprehensive definition has since been adapted by international organisations (IOs), including WHO. The first document published by WHO that utilised KT to influence health policies dates back to 2005, entitled ‘Bridging the “know-do” gap: Meeting on knowledge translation in global health’, an initiative that was supported by the Canadian Coalition for Global Health Research, the Canadian International Development Agency, the German Agency for Technical Cooperation and the WHO Special Programme on Research and Training in Tropical Diseases [ 1 ]. Following this official recognition by WHO, studies in our scoping review after 2005 indicate a noticeable expansion of KT, encompassing a wider geographical area than Canada.

The article by Ashford et al. from 2006 [ 50 ] discusses the challenge that health policy decisions in Kenya are often disconnected from scientific evidence and presents a model for translating knowledge into policy actions through agenda-setting, coalition building and policy learning. However, the framework lacks explicit incorporation of contextual factors influencing health policies. Bauman et al. [ 51 ] propose a six-step framework for successful dissemination of physical activity evidence, illustrated through four case studies from three countries (Canada, USA and Brazil) and a global perspective. They interpret contextual factors as barriers and facilitators to physical activity and public health innovations. Focusing on the USA, Gold [ 52 ] explains factors, processes and actors that shape pathways between research and its use in a summary diagram, including a reference to ‘other influences in process’ for context. Green et al. [ 4 ] examine the gap between health research and its application in public health without focusing on a specific geographical area. Their study comprehensively reviews various concepts of diffusion, dissemination and implementation in public health, proposing ways to blend diffusion theory with other theories. Their ‘utilization-focused surveillance framework’ interprets context as social determinants such as structures, economics, politics and culture.

Further, the article by Dhonukshe-Rutten et al. from 2010 [ 53 ] presents a general framework that outlines the process of translating nutritional requirements into policy applications from a European perspective. The framework incorporates scientific evidence, stakeholder interests and the socio-political context. The description of this socio-political context is rather brief, encompassing political and social priorities, legal context, ethical issues and economic implications. Ir et al. [ 54 ] analyse the use of knowledge in shaping policy on health equity funds in Cambodia, with the objective of understanding how KT contributes to the development of health policies that promote equity. Yet no information on context is available in the framework that they suggest. A notable exception among these early KT TMFs until 2010 is the conceptual framework for analysing integration of targeted health interventions into health systems by Atun et al. [ 55 ], in which the authors provide details about the factors that influence the process of bringing evidence to health policies. Focusing on the adoption, diffusion and assimilation of health interventions, their conceptual framework provides a systematic approach for evaluating and informing policies in this field. Compared to the previous studies discussed above, their definition of context for this framework is comprehensive (Table  2 ). Overall, most of the studies containing macro-level KT TMFs published until 2010 either do not fully acknowledge contextual factors or describe them only briefly with generic terms such as cultural, political and economic (9 out of 10; 90%).

Studies published after 2010 demonstrate a notable geographical shift, with a greater emphasis on low- and middle-income countries (LMICs). By taking the adoption of the directly observed treatment, short-course (DOTS) strategy for tuberculosis control in Mexico as a case study, Bissell et al. [ 56 ] examine policy transfer to Mexico and its relevance to operational research efforts and suggest a model for analysis of health policy transfer. The model interprets context as health system, including political, economic, social, cultural and technological features. Focusing on HIV/AIDS in India, Tran et al. [ 57 ] explore KT by considering various forms of evidence beyond scientific evidence, such as best practices derived from programme experience and disseminated through personal communication. Their proposed framework aims to offer an analytical tool for understanding how evidence-based influence is exerted. In their framework, no information is available on context. Next, Bertone et al. [ 58 ] report on the effectiveness of Communities of Practice (CoPs) in African countries and present a conceptual framework for analysing and assessing transnational CoPs in health policy. The framework organises the key elements of CoPs, linking available resources, knowledge management activities, policy and practice changes, and improvements in health outcomes. Context is only briefly included in this framework.

Some other studies include both European and global perspectives. The publication by Timotijevic et al. from 2013 [ 59 ] introduces an epistemological framework that examines the considerations influencing the policy-making process, with a specific focus on micronutrient requirements in Europe. They present case studies from several European countries, highlighting the relevance of the framework in understanding the policy context related to micronutrients. Context is interpreted in this framework as global trends, data, media, broader consumer beliefs, ethical considerations, and the wider social, legal, political, and economic environment. Next, funded by the European Union, the study by Onwujekwe et al. [ 60 ] examines the role of different types of evidence in health policy development in Nigeria. Although they cover the factors related to policy actors in their framework for assessing the role of evidence in policy development, they provide no information on context. Moreover, Redman et al. [ 61 ] present the SPIRIT Action Framework, which aims to enhance the use of research in policymaking. Context is interpreted in this framework as policy influences, i.e. public opinion, media, economic climate, legislative/policy infrastructure, political ideology and priorities, stakeholder interests, expert advice, and resources. From a global perspective, Spicer et al. [ 62 ] explore the contextual factors that influenced the scale-up of donor-funded maternal and newborn health innovations in Ethiopia, India and Nigeria, highlighting the importance of context in assessing and adapting innovations. Their suggested contextual factors influencing government decisions to accept, adopt and finance innovations at scale are relatively comprehensive (Table  2 ).

In terms of publication frequency, the reviewed KT studies peaked in 2017. Among six studies published in 2017, four lack details about context in their KT conceptualisations and one study touches on context very briefly. Bragge et al. [ 5 ] brought together an international terminology working group to develop a simplified framework of interventions to integrate evidence into health practices, systems, and policies, named the Aims, Ingredients, Mechanism, Delivery framework, albeit without providing details on contextual factors. Second, Mulvale et al. [ 63 ] present a conceptual framework that explores the impact of policy dialogues on policy development, illustrating how these dialogues can influence different stages of the policy cycle. Similar to the previous one, this study, too, lacks information on context. In a systematic review, Sarkies et al. [ 64 ] evaluate the effectiveness of research implementation strategies in promoting evidence-informed policy decisions in healthcare. The study explores the factors associated with effective strategies and their inter-relationship, yet without further information on context. Fourth, Houngbo et al. [ 65 ] focus on the development of a strategy to implement a good governance model for health technology management in the public health sector, drawing from their experience in Benin. They outline a six-phase model that includes preparatory analysis, stakeholder identification and problem analysis, shared analysis and visioning, development of policy instruments for pilot testing, policy development and validation, and policy implementation and evaluation. They provide no information about context in their model. Fifth, Mwendera et al. [ 66 ] present a framework for improving the use of malaria research in policy development in Malawi, which was developed based on case studies exploring the policymaking process, the use of local malaria research, and assessing facilitators and barriers to research utilisation. The contextual setting is considered to be the Ministry of Health (MoH), with its political set-up, the leadership system within the MoH, government policies and the cultural set-up. In contrast to these five studies, Ellen et al. [ 67 ] present a relatively comprehensive framework to support evidence-informed policymaking in ageing and health. The framework includes thought-provoking questions to discover contextual factors (Table  2 ).

Continuing the trend, studies published after 2017 focus increasingly on LMICs. In their embedded case study, Ongolo-Zogo et al. [ 68 ] examine the influence of two Knowledge Translation Platforms (KTPs) on policy decisions to achieve the health millennium development goals in Cameroon and Uganda. The study explores how these KTPs influenced policy through interactions within policy issue networks, engagement with interest groups, and the promotion of evidence-supported ideas, ultimately shaping the overall policy climate for evidence-informed health system policymaking. Contextual factors are thereby interpreted as institutions (structures, legacies, policy networks), interests, ideas (values, research evidence) and external factors (reports, commitments). Focusing on the ‘Global South’, Plamondon et al. [ 69 ] suggest blending integrated knowledge translation with global health governance as an approach for strengthening leadership for health equity action. In terms of contextual factors, they include some information, such as adapting knowledge to the local context, considering the composition of non-traditional actors (such as civil society and the private sector) in governance bodies, and providing guidance for meaningful engagement between actors, particularly in shared governance models. Further, Vincenten et al. [ 70 ] propose a conceptual model to enhance understanding of interlinking factors that influence the evidence implementation process. Their evidence implementation model for public health systems refers to ‘context setting’, albeit without providing further detail.

Similarly, the study by Motani et al. from 2019 [ 71 ] assesses the outcomes and lessons learned from the EVIDENT partnership that focused on knowledge management for evidence-informed decision-making in nutrition and health in Africa. Although they mention ‘contextualising evidence’ in their conceptual framework, information about context is lacking. Focusing on Latin America and the Caribbean, Varallyay et al. [ 72 ] introduce a conceptual framework for evaluating embedded implementation research in various contexts. The framework outlines key stages of evidence-informed decision-making and provides guidance on assessing embeddedness and critical contextual factors. Compared to others, their conceptual framework provides a relatively comprehensive elaboration on contextual factors. In addition, among all the studies reviewed, Leonard et al. [ 73 ] present an exceptionally comprehensive analysis, where they identify the facilitators and barriers to the sustainable implementation of evidence-based health innovations in LMICs. Through a systematic literature review, they scrutinise 79 studies and categorise the identified barriers and facilitators into seven groups: context, innovation, relations and networks, institutions, knowledge, actors, and resources. The first of these, context, contains rich information, as can be seen in Table  2 .

Continuing the focus on LMICs, Votruba et al. [ 74 ] present the EVITA (EVIdence To Agenda setting) conceptual framework for mental health research-policy interrelationships in LMICs, with some information about context, detailed as external influences and political context. In a follow-up study, they offer an updated framework for understanding evidence-based mental health policy agenda-setting [ 75 ]. In their revised framework, context is interpreted as external context and policy sphere, encompassing policy agenda, window of opportunity, political will and key individuals. Lastly, to develop a comprehensive monitoring and evaluation framework for evidence-to-policy networks, Kuchenmüller et al. [ 76 ] present the EVIPNet Europe Theory of Change and interpret contextual factors for evidence-informed policymaking as political, economic, logistic and administrative. Overall, it can be concluded that studies presenting macro-level KT TMFs from 2011 until 2022 focus mainly on LMICs (15 out of 22; close to 70%) and the majority of them were funded by international (development) organisations, the European Commission and global health donor agencies. An overwhelming majority of these studies (19 out of 22; close to 90%) either provide no contextual detail in their KT TMFs or include it only partially, using generic terms.

Our systematic scoping review suggests that KT, even as it has evolved from evidence-based medicine to evidence-informed policymaking, tends to remain closely tied to its clinical origins when TMFs are developed. In other words, macro-level KT TMFs place greater emphasis on the (public) health issue at hand rather than considering the broader decision-making context, a viewpoint shared by other scholars as well [ 30 ]. One reason could be that in the early stages of KT TMFs, the emphasis was primarily on implementing evidence-based practices within clinical settings. At that time, the spotlight was mostly on content, including aspects like clinical studies, checklists and guidelines serving as the evidence base. In those meso-level KT TMFs, a detailed description of context, i.e. the overall environment in which these practices should be implemented, might have been deemed less necessary, given that healthcare organisations, such as hospitals implementing medical guidelines or surgical safety checklists, show similar characteristics globally.

However, as the scope of KT TMFs continues to expand to include the influence on health policies, a deeper understanding of context-specific factors within different jurisdictions and the dynamics of the policy process is becoming increasingly crucial. This is even more important for KT scholars aiming to conceptualise large-scale changes, as described in KT Tier 5, which necessitate a thorough understanding of targeted behaviours within societies. As the complexity of interventions increases due to the growing number of stakeholders either affecting or being affected by them, the interventions are surrounded by a more intricate web of attitudes, incentives, relationships, rules of engagement and spheres of influence [ 7 ]. The persisting emphasis on content over context in the evolving field of KT may oversimplify the complex process of using evidence in policymaking and of understanding society [ 77 ]. Some scholars argue that this common observation in public health can be attributed to the dominance of experts primarily from medical sciences [ 78 , 79 , 80 ]. Our study confirms the potential limitation of not incorporating insights from political science and public policy studies, which can lead to what is often termed a ‘naïve’ conceptualisation of evidence-to-policy schemes [ 15 , 16 , 17 ]. It is therefore strongly encouraged that the emerging macro-level KT concepts draw on political science and public administration if KT scholars intend to effectively communicate new ideas to policymakers, with the aim of prompting their action or response. We summarise our findings in three points below.

Firstly, KT scholars may want to identify and pinpoint exactly where a change should occur within the policy process. The main confusion that we observed in the KT literature arises from a lack of understanding of how public policies are made. Notably, the term ‘evidence-informed policymaking’ can refer to any stage of the policy cycle, spanning from agenda-setting to policy formulation, adoption, implementation and evaluation. Understanding these steps will allow researchers to refine their language when advocating for policy changes across various jurisdictions; for instance, the word ‘implementation’ is often inappropriately used in KT literature. As commonly known, at the macro-level, public policies take the form of legislation, law-making and regulation, thereby shaping the practices or policies to be implemented at the meso- and micro-levels [ 81 ]. In other words, the process of using specific knowledge to influence health policies, however evidence-based it might be, falls mostly under the responsibility and jurisdiction of sovereign states. For this reason, macro-level KT TMFs should reflect the importance of understanding the policy context and the complexities associated with policymaking, rather than suggesting flawed or unrealistic top-down ‘implementation’ strategies in countries by foregrounding the content, or the (public) health issue at hand.

Our second observation from this systematic scoping review points towards a selective perception among researchers when reporting on policy interventions. Research on KT does not solely exist due to the perceived gap between scientific evidence and policy but also because of the pressures the organisations or researchers face in being accountable to their funding sources, ensuring the continuity of financial support for their activities and claiming output legitimacy to change public policies [ 8 ]. This situation indirectly compels researchers working to influence health policies in the field to provide ‘evidence-based’ feedback on the success of their projects to donors [ 82 ]. In doing so, researchers may overly emphasise the content of the policy intervention in their reporting to secure further funding, while underemphasising the contextual factors. These factors, often perceived as a given, might actually be the primary facilitators of their success. Such a lack of transparency regarding the definition of context is particularly visible in the field of global health, where LMICs often rely on external donors. It is important to note that this statement is not intended as a negative critique of their missions or an evaluation of health outcomes in countries following such missions. Rather, it seeks to explain the underlying reason why researchers, particularly those reliant on donors in LMICs, prioritise promoting the concept of KT from a technical standpoint, giving less attention to contextual factors in their reasoning.

Lastly, and connected to the previous point, it is our observation that the majority of macro-level KT TMFs fail to give adequate consideration to both power dynamics in countries (internal vs. external influences) and the actual role that government plays in public policies. Notably, although good policymaking entails an honest effort to use the best available evidence, the belief that this will completely negate the role of power and politics in decision-making is a technocratic illusion [ 83 ]. Among the studies reviewed, the framework put forth by Leonard et al. [ 73 ] offers the most comprehensive understanding of context and includes a broad range of factors (such as political, social, and economic) discovered also in other reviewed studies. Moreover, the framework, developed through an extensive systematic review, offers a more in-depth exploration of these contextual factors than merely listing them as a set of keywords. Indeed, within the domains of political science and public policy, such factors shaping health policies have received considerable scholarly attention for decades. To define what context entails, Walt refers in her book ‘Health Policy: An Introduction to Process and Power’ [ 84 ] to the work of Leichter from 1979 [ 85 ], who provides a scheme for analysing public policy. This includes i) situational factors, which are transient, impermanent, or idiosyncratic; ii) structural factors, which are relatively unchanging elements of the society and polity; iii) cultural factors, which are value commitments of groups; and iv) environmental factors, which are events, structures and values that exist outside the boundaries of a political system and influence decisions within it. His detailed sub-categories for context can be found in Table  3 . This flexible public policy framework may offer KT researchers a valuable approach to understanding contextual factors and provide some guidance to define the keywords to focus on. Scholars can adapt this framework to suit a wide range of KT topics, creating more context-sensitive and comprehensive KT TMFs.

Admittedly, our study has certain limitations. Despite choosing one of the most comprehensive bibliographic databases for our systematic scoping review, which includes materials from biomedicine, allied health fields, biological and physical sciences, humanities, and information science in relation to medicine and healthcare, we acknowledge that we may have missed relevant articles indexed in other databases. Hence, exclusively using Ovid/MEDLINE due to resource constraints may have narrowed the scope and diversity of scholarly literature examined in this study. Second, our review was limited to peer-reviewed publications in English and German. Future studies could extend our findings by examining the extent to which contextual factors are detailed in macro-level KT TMFs published in grey literature and in different languages. Given the abundance of KT reports, working papers or policy briefs published by IOs and development agencies, such an endeavour could enrich our findings and either support or challenge our conclusions. Nonetheless, to our knowledge, this study represents the first systematic review and critical appraisal of emerging knowledge-to-policy concepts, also known as macro-level KT TMFs. It successfully blends insights from both biomedical and public policy disciplines, and could serve as a roadmap for future research.

The translation of knowledge to policymakers involves more than technical skills commonly associated with (bio-)medical sciences, such as creating evidence-based guidelines or clinical checklists. Instead, evidence-informed policymaking reflects an ambition to engage in the political dimensions of states. Therefore, the evolving KT concepts addressing health policies should be seen as a political decision-making process, rather than a purely analytical one, as is the case with evidence-based medicine. To better understand the influence of power dynamics and governance structures in policymaking, we suggest that future macro-level KT TMFs draw on insights from political science and public administration. Collaborative, interdisciplinary research initiatives could be undertaken to bridge the gap between these fields. Technocratic KT TMFs that overlook contextual factors risk propagating misconceptions in academic circles about how health policies are made, as they become increasingly influential over time. Research, the systematic pursuit of knowledge, is neither inherently good nor bad; it can be sought after, used or misused, like any other tool in policymaking. What is needed in the KT discourse is not another generic call for ‘research-to-action’ but rather an understanding of the dividing line between research-to-clinical-action and research-to-political-action.

Availability of data and materials

Available upon reasonable request.

WHO. Bridging the ‘Know-Do’ Gap: Meeting on Knowledge Translation in Global Health, 10–12 October 2005, World Health Organization, Geneva, Switzerland [Internet]. 2005. https://www.measureevaluation.org/resources/training/capacity-building-resources/high-impact-research-training-curricula/bridging-the-know-do-gap.pdf

Rogers EM. Diffusion of innovations. 3rd ed. New York: Free Press; 1983.


Greenhalgh T, Wieringa S. Is it time to drop the ‘knowledge translation’ metaphor? A critical literature review. J R Soc Med. 2011;104(12):501–9.


Green LW, Ottoson JM, García C, Hiatt RA. Diffusion theory and knowledge dissemination, utilization, and integration in public health. Annu Rev Public Health. 2009;30(1):151–74.


Bragge P, Grimshaw JM, Lokker C, Colquhoun H, Albrecht L, Baron J, et al. AIMD—a validated, simplified framework of interventions to promote and integrate evidence into health practices, systems, and policies. BMC Med Res Methodol. 2017;17(1):38.

Zarbin M. What Constitutes Translational Research? Implications for the Scope of Translational Vision Science and Technology. Transl Vis Sci Technol 2020;9(8).

Hassmiller Lich K, Frerichs L, Fishbein D, Bobashev G, Pentz MA. Translating research into prevention of high-risk behaviors in the presence of complex systems: definitions and systems frameworks. Transl Behav Med. 2016;6(1):17–31.

Tetroe JM, Graham ID, Foy R, Robinson N, Eccles MP, Wensing M, et al. Health research funding agencies’ support and promotion of knowledge translation: an international study. Milbank Q. 2008;86(1):125–55.

Eccles MP, Mittman BS. Welcome to Implementation Science. Implement Sci. 2006;1(1):1.


Rychetnik L, Bauman A, Laws R, King L, Rissel C, Nutbeam D, et al. Translating research for evidence-based public health: key concepts and future directions. J Epidemiol Community Health. 2012;66(12):1187–92.

Votruba N, Ziemann A, Grant J, Thornicroft G. A systematic review of frameworks for the interrelationships of mental health evidence and policy in low- and middle-income countries. Health Res Policy Syst. 2018;16(1):85.

Delnord M, Tille F, Abboud LA, Ivankovic D, Van Oyen H. How can we monitor the impact of national health information systems? Results from a scoping review. Eur J Public Health. 2020;30(4):648–59.

Malterud K, Bjelland AK, Elvbakken KT. Evidence-based medicine—an appropriate tool for evidence-based health policy? A case study from Norway. Health Res Policy Syst. 2016;14(1):15.

Borst RAJ, Kok MO, O’Shea AJ, Pokhrel S, Jones TH, Boaz A. Envisioning and shaping translation of knowledge into action: a comparative case-study of stakeholder engagement in the development of a European tobacco control tool. Health Policy. 2019;123(10):917–23.

Liverani M, Hawkins B, Parkhurst JO. Political and institutional influences on the use of evidence in public health policy: a systematic review. PLoS ONE. 2013;8(10): e77404.


Cairney P. The politics of evidence-based policy making, 1st ed. London: Palgrave Macmillan UK: Imprint: Palgrave Pivot, Palgrave Macmillan; 2016.

Parkhurst J. The Politics of Evidence: From evidence-based policy to the good governance of evidence [Internet]. Routledge; 2016. https://www.taylorfrancis.com/books/9781315675008

Cairney P, Oliver K. Evidence-based policymaking is not like evidence-based medicine, so how far should you go to bridge the divide between evidence and policy? Health Res Policy Syst. 2017;15(1):35.

Verboom B, Baumann A. Mapping the Qualitative Evidence Base on the Use of Research Evidence in Health Policy-Making: A Systematic Review. Int J Health Policy Manag. 2020;16.

Ward V, House A, Hamer S. Developing a framework for transferring knowledge into action: a thematic analysis of the literature. J Health Serv Res Policy. 2009;14(3):156–64.

Swinburn B, Gill T, Kumanyika S. Obesity prevention: a proposed framework for translating evidence into action. Obes Rev. 2005;6(1):23–33.


Damschroder LJ. Clarity out of chaos: Use of theory in implementation research. Psychiatry Res. 2020;283: 112461.

Birken SA, Rohweder CL, Powell BJ, Shea CM, Scott J, Leeman J, et al. T-CaST: an implementation theory comparison and selection tool. Implement Sci. 2018;13(1):143.

Nilsen P. Making sense of implementation theories, models and frameworks. Implement Sci. 2015;10(1):53.

Rapport F, Clay-Williams R, Churruca K, Shih P, Hogden A, Braithwaite J. The struggle of translating science into action: foundational concepts of implementation science. J Eval Clin Pract. 2018;24(1):117–26.

Hagenaars LL, Jeurissen PPT, Klazinga NS. The taxation of unhealthy energy-dense foods (EDFs) and sugar-sweetened beverages (SSBs): An overview of patterns observed in the policy content and policy context of 13 case studies. Health Policy. 2017;121(8):887–94.

Sheikh K, Gilson L, Agyepong IA, Hanson K, Ssengooba F, Bennett S. Building the field of health policy and systems research: framing the questions. PLOS Med. 2011;8(8): e1001073.

Tran NT, Hyder AA, Kulanthayan S, Singh S, Umar RSR. Engaging policy makers in road safety research in Malaysia: a theoretical and contextual analysis. Health Policy. 2009;90(1):58–65.

Walt G, Gilson L. Reforming the health sector in developing countries: the central role of policy analysis. Health Policy Plan. 1994;9(4):353–70.

Dobrow MJ, Goel V, Upshur REG. Evidence-based health policy: context and utilisation. Soc Sci Med. 2004;58(1):207–17.

Barnfield A, Savolainen N, Lounamaa A. Health Promotion Interventions: Lessons from the Transfer of Good Practices in CHRODIS-PLUS. Int J Environ Res Public Health. 2020;17(4).

van de Goor I, Hämäläinen RM, Syed A, Juel Lau C, Sandu P, Spitters H, et al. Determinants of evidence use in public health policy making: results from a study across six EU countries. Health Policy Amst Neth. 2017;121(3):273–81.


Ornstein JT, Hammond RA, Padek M, Mazzucca S, Brownson RC. Rugged landscapes: complexity and implementation science. Implement Sci. 2020;15(1):85.

Seward N, Hanlon C, Hinrichs-Kraples S, Lund C, Murdoch J, Taylor Salisbury T, et al. A guide to systems-level, participatory, theory-informed implementation research in global health. BMJ Glob Health. 2021;6(12): e005365.

Pfadenhauer LM, Gerhardus A, Mozygemba K, Lysdahl KB, Booth A, Hofmann B, et al. Making sense of complexity in context and implementation: the Context and Implementation of Complex Interventions (CICI) framework. Implement Sci. 2017;12(1):21.

Rogers L, De Brún A, McAuliffe E. Defining and assessing context in healthcare implementation studies: a systematic review. BMC Health Serv Res. 2020;20(1):591.

Nilsen P, Bernhardsson S. Context matters in implementation science: a scoping review of determinant frameworks that describe contextual determinants for implementation outcomes. BMC Health Serv Res. 2019;19(1):189.

Arksey H, O’Malley L, Baldwin S, Harris J, Mason A, Golder S. Literature review report: services to support carers of people with mental health problems. 2002;182.

Tabak RG, Khoong EC, Chambers D, Brownson RC. Bridging research and practice. Am J Prev Med. 2012;43(3):337–50.

O’Donovan MA, McCallion P, McCarron M, Lynch L, Mannan H, Byrne E. A narrative synthesis scoping review of life course domains within health service utilisation frameworks. HRB Open Res. 2019.

Michie S, Johnston M, Abraham C, Lawton R, Parker D, Walker A, et al. Making psychological theory useful for implementing evidence based practice: a consensus approach. Qual Saf Health Care. 2005;14(1):26–33.

Bate P, Robert G, Fulop N, Øvretviet J, Dixon-Woods M. Perspectives on context: a collection of essays considering the role of context in successful quality improvement [Internet]. 2014. https://www.health.org.uk/sites/default/files/PerspectivesOnContext_fullversion.pdf

Ziemann A, Brown L, Sadler E, Ocloo J, Boaz A, Sandall J. Influence of external contextual factors on the implementation of health and social care interventions into practice within or across countries—a protocol for a ‘best fit’ framework synthesis. Syst Rev. 2019. https://doi.org/10.1186/s13643-019-1180-8 .

Strifler L, Cardoso R, McGowan J, Cogo E, Nincic V, Khan PA, et al. Scoping review identifies significant number of knowledge translation theories, models, and frameworks with limited use. J Clin Epidemiol. 2018;100:92–102.

Esmail R, Hanson HM, Holroyd-Leduc J, Brown S, Strifler L, Straus SE, et al. A scoping review of full-spectrum knowledge translation theories, models, and frameworks. Implement Sci. 2020;15(1):11.

Jacobson N, Butterill D, Goering P. Development of a framework for knowledge translation: understanding user context. J Health Serv Res Policy. 2003;8(2):94–9.

Lehoux P, Denis JL, Tailliez S, Hivon M. Dissemination of health technology assessments: identifying the visions guiding an evolving policy innovation in Canada. J Health Polit Policy Law. 2005;30(4):603–42.

Parliament of Canada. Government Bill (House of Commons) C-13 (36–2) - Royal Assent - Canadian Institutes of Health Research Act [Internet]. https://parl.ca/DocumentViewer/en/36-2/bill/C-13/royal-assent/page-31 . Accessed 1 Apr 2023.

Straus SE, Tetroe J, Graham I. Defining knowledge translation. CMAJ Can Med Assoc J. 2009;181(3–4):165–8.

Ashford L. Creating windows of opportunity for policy change: incorporating evidence into decentralized planning in Kenya. Bull World Health Organ. 2006;84(8):669–72.

Bauman AE, Nelson DE, Pratt M, Matsudo V, Schoeppe S. Dissemination of physical activity evidence, programs, policies, and surveillance in the international public health arena. Am J Prev Med. 2006;31(4):57–65.

Gold M. Pathways to the use of health services research in policy. Health Serv Res. 2009;44(4):1111–36.

Dhonukshe-Rutten RAM, Timotijevic L, Cavelaars AEJM, Raats MM, de Wit LS, Doets EL, et al. European micronutrient recommendations aligned: a general framework developed by EURRECA. Eur J Clin Nutr. 2010;64(2):S2-10.

Ir P, Bigdeli M, Meessen B, Van Damme W. Translating knowledge into policy and action to promote health equity: The Health Equity Fund policy process in Cambodia 2000–2008. Health Policy. 2010;96(3):200–9.

Atun R, de Jongh T, Secci F, Ohiri K, Adeyi O. Integration of targeted health interventions into health systems: a conceptual framework for analysis. Health Policy Plan. 2010;25(2):104–11.

Bissell K, Lee K, Freeman R. Analysing policy transfer: perspectives for operational research. Int J Tuberc Lung Dis Off J Int Union Tuberc Lung Dis. 2011;15(9).

Tran NT, Bennett SC, Bishnu R, Singh S. Analyzing the sources and nature of influence: how the Avahan program used evidence to influence HIV/AIDS prevention policy in India. Implement Sci. 2013;8(1):44.

Bertone MP, Meessen B, Clarysse G, Hercot D, Kelley A, Kafando Y, et al. Assessing communities of practice in health policy: a conceptual framework as a first step towards empirical research. Health Res Policy Syst. 2013;11(1):39.

Timotijevic L, Brown KA, Lähteenmäki L, de Wit L, Sonne AM, Ruprich J, et al. EURRECA—a framework for considering evidence in public health nutrition policy development. Crit Rev Food Sci Nutr. 2013;53(10):1124–34.

Onwujekwe O, Uguru N, Russo G, Etiaba E, Mbachu C, Mirzoev T, et al. Role and use of evidence in policymaking: an analysis of case studies from the health sector in Nigeria. Health Res Policy Syst. 2015;13(1):46.

Redman S, Turner T, Davies H, Williamson A, Haynes A, Brennan S, et al. The SPIRIT action framework: a structured approach to selecting and testing strategies to increase the use of research in policy. Soc Sci Med. 2015;136–137:147–55.

Spicer N, Berhanu D, Bhattacharya D, Tilley-Gyado RD, Gautham M, Schellenberg J, et al. ‘The stars seem aligned’: a qualitative study to understand the effects of context on scale-up of maternal and newborn health innovations in Ethiopia, India and Nigeria. Glob Health. 2016;12(1):75.

Mulvale G, McRae SA, Milicic S. Teasing apart “the tangled web” of influence of policy dialogues: lessons from a case study of dialogues about healthcare reform options for Canada. Implement Sci IS. 2017;12.

Sarkies MN, Bowles KA, Skinner EH, Haas R, Lane H, Haines TP. The effectiveness of research implementation strategies for promoting evidence-informed policy and management decisions in healthcare: a systematic review. Implement Sci. 2017;12(1):132.

Houngbo PTh, Coleman HLS, Zweekhorst M, De Cock Buning TJ, Medenou D, Bunders JFG. A Model for Good Governance of Healthcare Technology Management in the Public Sector: Learning from Evidence-Informed Policy Development and Implementation in Benin. PLoS ONE. 2017;12(1):e0168842.

Mwendera C, de Jager C, Longwe H, Hongoro C, Phiri K, Mutero CM. Development of a framework to improve the utilisation of malaria research for policy development in Malawi. Health Res Policy Syst. 2017;15(1):97.

Ellen ME, Panisset U, de Araujo Carvalho I, Goodwin J, Beard J. A knowledge translation framework on ageing and health. Health Policy. 2017;121(3):282–91.

Ongolo-Zogo P, Lavis JN, Tomson G, Sewankambo NK. Assessing the influence of knowledge translation platforms on health system policy processes to achieve the health millennium development goals in Cameroon and Uganda: a comparative case study. Health Policy Plan. 2018;33(4):539–54.

Plamondon KM, Pemberton J. Blending integrated knowledge translation with global health governance: an approach for advancing action on a wicked problem. Health Res Policy Syst. 2019;17(1):24.

Vincenten J, MacKay JM, Schröder-Bäck P, Schloemer T, Brand H. Factors influencing implementation of evidence-based interventions in public health systems—a model. Cent Eur J Public Health. 2019;27(3):198–203.

Motani P, Van de Walle A, Aryeetey R, Verstraeten R. Lessons learned from Evidence-Informed Decision-Making in Nutrition & Health (EVIDENT) in Africa: a project evaluation. Health Res Policy Syst. 2019;17(1):12.

Varallyay NI, Langlois EV, Tran N, Elias V, Reveiz L. Health system decision-makers at the helm of implementation research: development of a framework to evaluate the processes and effectiveness of embedded approaches. Health Res Policy Syst. 2020;18(1):64.

Leonard E, de Kock I, Bam W. Barriers and facilitators to implementing evidence-based health innovations in low- and middle-income countries: a systematic literature review. Eval Program Plann. 2020;82: 101832.

Votruba N, Grant J, Thornicroft G. The EVITA framework for evidence-based mental health policy agenda setting in low- and middle-income countries. Health Policy Plan. 2020;35(4):424–39.

Votruba N, Grant J, Thornicroft G. EVITA 2.0, an updated framework for understanding evidence-based mental health policy agenda-setting: tested and informed by key informant interviews in a multilevel comparative case study. Health Res Policy Syst. 2021;19(1):35.

Kuchenmüller T, Chapman E, Takahashi R, Lester L, Reinap M, Ellen M, et al. A comprehensive monitoring and evaluation framework for evidence to policy networks. Eval Program Plann. 2022;91: 102053.

Ettelt S. The politics of evidence use in health policy making in Germany—the case of regulating hospital minimum volumes. J Health Polit Policy Law. 2017;42(3):513–38.

Greer SL, Bekker M, de Leeuw E, Wismar M, Helderman JK, Ribeiro S, et al. Policy, politics and public health. Eur J Public Health. 2017;27(suppl 4):40–3.

Fafard P, Cassola A. Public health and political science: challenges and opportunities for a productive partnership. Public Health. 2020;186:107–9.

Löblová O. Epistemic communities and experts in health policy-making. Eur J Public Health. 2018;28(suppl 3):7–10.

Maddalena V. Evidence-Based Decision-Making 8: Health Policy, a Primer for Researchers. In: Parfrey PS, Barrett BJ, editors. Clinical Epidemiology: Practice and Methods. New York, NY: Springer; 2015. (Methods in Molecular Biology).

Louis M, Maertens L. Why international organizations hate politics - Depoliticizing the world [Internet]. London and New York: Routledge; 2021. (Global Institutions). https://library.oapen.org/bitstream/handle/20.500.12657/47578/1/9780429883279.pdf

Hassel A, Wegrich K. How to do public policy. 1st ed. Oxford: Oxford University Press; 2022.


Walt G. Health policy: an introduction to process and power. 7th ed. Johannesburg: Witwatersrand University Press; 2004.

Leichter HM. A comparative approach to policy analysis: health care policy in four nations. Cambridge: Cambridge University Press; 1979.


Acknowledgements

Not applicable.

Author information

Authors and Affiliations

Department of International Health, Care and Public Health Research Institute – CAPHRI, Faculty of Health, Medicine and Life Sciences, Maastricht University, Maastricht, The Netherlands

Tugce Schmitt, Katarzyna Czabanowska & Peter Schröder-Bäck


Contributions

TS: Conceptualization, Methodology, Formal analysis, Investigation, Writing—Original Draft; KC: Writing—Review & Editing; PSB: Validation, Formal analysis, Writing—Review & Editing, Supervision.

Corresponding author

Correspondence to Tugce Schmitt.

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Schmitt, T., Czabanowska, K. & Schröder-Bäck, P. What is context in knowledge translation? Results of a systematic scoping review. Health Res Policy Sys 22 , 52 (2024). https://doi.org/10.1186/s12961-024-01143-5


Received : 26 June 2023

Accepted : 11 April 2024

Published : 29 April 2024

DOI : https://doi.org/10.1186/s12961-024-01143-5


Keywords

  • Knowledge Translation
  • Evidence-informed policymaking
  • Health systems


systematic review of clinical research

  • Open access
  • Published: 27 April 2024

Assessing fragility of statistically significant findings from randomized controlled trials assessing pharmacological therapies for opioid use disorders: a systematic review

  • Leen Naji   ORCID: orcid.org/0000-0003-0994-1109 1 , 2 , 3 ,
  • Brittany Dennis 4 , 5 ,
  • Myanca Rodrigues 2 ,
  • Monica Bawor 6 ,
  • Alannah Hillmer 7 ,
  • Caroul Chawar 8 ,
  • Eve Deck 9 ,
  • Andrew Worster 2 , 4 ,
  • James Paul 10 ,
  • Lehana Thabane 11 , 2 &
  • Zainab Samaan 12 , 2  

Trials volume  25 , Article number:  286 ( 2024 ) Cite this article

202 Accesses

2 Altmetric

Metrics details

The fragility index is a statistical measure of the robustness or “stability” of a statistically significant result. It has been adapted to assess the robustness of statistically significant outcomes from randomized controlled trials. By hypothetically switching non-responders to responders, this metric measures how many individuals would need to have responded for a statistically significant finding to become non-significant. The purpose of this study is to assess the fragility index of randomized controlled trials evaluating opioid substitution and antagonist therapies for opioid use disorder. This will provide an indication of the robustness of trials in the field and the confidence that should be placed in their outcomes, potentially identifying ways to improve clinical research in the field. This is especially important as opioid use disorder has become a global epidemic, and the incidence of opioid-related fatalities has climbed 500% in the past two decades.

Six databases were searched from inception to September 25, 2021, for randomized controlled trials evaluating opioid substitution and antagonist therapies for opioid use disorder, and meeting the necessary requirements for fragility index calculation. Specifically, we included all parallel arm or two-by-two factorial design RCTs that assessed the effectiveness of any opioid substitution and antagonist therapies using a binary primary outcome and reported a statistically significant result. The fragility index of each study was calculated using methods described by Walsh and colleagues. The risk of bias of included studies was assessed using the Revised Cochrane Risk of Bias tool for randomized trials.

Ten studies with a median sample size of 82.5 (interquartile range (IQR) 58, 179, range 52–226) were eligible for inclusion. Overall risk of bias was deemed to be low in seven studies, have some concerns in two studies, and be high in one study. The median fragility index was 7.5 (IQR 4, 12, range 1–26).

Conclusions

Our results suggest that approximately eight participants are needed to overturn the conclusions of the majority of trials in opioid use disorder. Future work should focus on maximizing transparency in reporting of study results, by reporting confidence intervals, fragility indexes, and emphasizing the clinical relevance of findings.

Trial registration

PROSPERO CRD42013006507. Registered on November 25, 2013.

Peer Review reports

Introduction

Opioid use disorder (OUD) has become a global epidemic, and the incidence of opioid-related fatality, particularly at the rates observed in North America, is unparalleled, having climbed 500% in the past two decades [ 1 , 2 ]. There is a dire need to identify the most effective treatment modality to maintain patient engagement in treatment, mitigate high risk consumption patterns, and eliminate overdose risk. Numerous studies have aimed to identify the most effective treatment modality for OUD [ 3 , 4 , 5 ]. Unfortunately, this multifaceted disease is complicated by the interplay between neurobiological and social factors, impacting our current body of evidence and clinical decision making. Optimal treatment selection is further challenged by the rising number of pharmacological opioid substitution and antagonist therapies (OSAT) [ 6 ]. Despite this growing body of evidence and available therapies, we have yet to arrive at a consensus regarding the best treatment modality, given the substantial variability in research findings and directly conflicting results [ 6 , 7 , 8 , 9 ]. More concerning, international clinical practice guidelines rely on out-of-date systematic review evidence to inform guideline development [ 10 ]. In fact, these guidelines make strong recommendations based on a fraction of the available evidence, employing trials with restrictive eligibility criteria that fail to reflect the common OUD patients seen in clinical practice [ 10 ].

A major factor hindering our ability to advance the field of addiction medicine is our failure to apply the necessary critical lens to the growing body of evidence used to inform clinical practice. While distinct concerns exist regarding the external validity of randomized trials in addiction medicine, the robustness of the universally recognized “well designed” trials remains unknown [ 10 ]. The reliability of the results of clinical trials rests not only on the sample size of the study but also on the number of outcome events. In fact, a shift in the results of only a few events could in theory render the findings of a trial null, pushing traditional hypothesis tests above the standard threshold accepted as “statistical significance.” A metric of this fragility was first introduced in 1990, known formally as the fragility index (FI) [ 11 ]. In 2014, it was adapted for use as a tool to assess the robustness of findings from randomized controlled trials (RCTs) [ 12 ]. Briefly, the FI determines the minimum number of participants whose outcome would have to change from non-event to event in order for a statistically significant result to become non-significant. Larger FIs indicate more robust findings [ 11 , 13 ]. Additionally, when the number of study participants lost to follow-up exceeds the FI of the trial, this implies that the outcomes of these participants could have altered the statistical significance and final conclusions of the study. The FI has been applied across multiple fields, often yielding similar results such that a change in a small number of outcome events has been enough to overturn the statistical conclusions of many “well-designed” trials [ 13 ].

The concerning state of the OUD literature has left us with guidelines that fail to acknowledge the lack of external validity and go so far as to rank the quality of the evidence as good, despite the concerning limitations we have raised [ 10 ]. Such alarming practices necessitate vigilance on behalf of methodologists and practitioners to be critical and open to a thorough review of the evidence in the field of addiction medicine [ 12 ]. Given the complex nature of OUD treatment and the increasing number of available therapies, concentrated efforts are needed to ensure the reliability and internal validity of the results of clinical trials used to inform guidelines. Application of the FI can provide additional insight into the robustness of the evidence in addiction medicine. The purpose of this study is to assess the fragility of findings of RCTs assessing OSAT for OUD.

Systematic review protocol

We conducted a systematic review of the evidence surrounding OSATs for OUD [ 5 ]. The study protocol was registered with PROSPERO a priori (PROSPERO CRD42013006507). We searched Medline, EMBASE, PubMed, PsycINFO, Web of Science, and Cochrane Library for relevant studies from inception to September 25, 2021. We included all RCTs evaluating the effectiveness of any OSAT for OUD, which met the criteria required for FI calculation. Specifically, we included all parallel arm or two-by-two factorial design RCTs that allocated patients at a 1:1 ratio, assessed the effectiveness of any OSAT using a binary primary or co-primary outcome, and reported this outcome to be statistically significant ( p < 0.05).

All titles, abstracts, and full texts were screened for eligibility by two reviewers independently and in duplicate. Any discrepancies between the two reviewers were discussed for consensus, and a third reviewer was called upon when needed.

Data extraction and risk of bias assessment (ROB)

Two reviewers extracted the following data from the included studies in duplicate and independently using a pilot-tested Excel data extraction sheet: sample size, whether a sample size calculation was conducted, statistical test used, primary outcome, number of responders and non-responders in each arm, number lost to follow-up, and the p -value. The 2021 Thomson Reuters Journal Impact Factor for each included study was also recorded. The ROB of included studies for the dichotomous outcome used in the FI calculation was assessed using the Revised Cochrane ROB tool for randomized trials [ 14 ]. Two reviewers independently assessed the included studies based on the following domains for potential ROB: randomization process, deviations from the intended interventions, missing outcome data, measurement of the outcome, and selection of the reported results.

Statistical analyses

Study characteristics were summarized using descriptive statistics. Means and standard deviations (SD), as well as medians and interquartile ranges (IQR: Q 25 , Q 75 ) were used as measures of central tendency for continuous outcomes with normal and skewed distributions, respectively. Frequencies and percentages were used to summarize categorical variables. The FI was calculated using a publicly available free online calculator, using the methods described by Walsh et al. [ 12 , 15 ] In summary, the number of events and non-events in each treatment arm were entered into a two-by-two contingency table for each trial. An event was added to the treatment arm with the smaller number of events, while subtracting a non-event from the same arm, thus keeping the overall sample size the same. Each time this was done, the two-sided p -value for Fisher’s exact test was recalculated. The FI was defined as the number of non-events that needed to be switched to events for the p -value to reach non-statistical significance (i.e., ≥0.05).
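
To make the procedure above concrete, the following Python sketch shows one way to compute an FI of this kind with Fisher’s exact test from SciPy. It is an illustrative implementation of the definition described above, not the online calculator used in this review, and the example event counts are hypothetical.

```python
from scipy.stats import fisher_exact

def fragility_index(events_a, total_a, events_b, total_b, alpha=0.05):
    """Illustrative fragility index: switch non-events to events in the arm
    with fewer events, recompute the two-sided Fisher's exact p-value each
    time, and count the switches needed for p to reach alpha or above."""
    # Work on the arm with the smaller number of events, keeping arm sizes fixed.
    if events_a <= events_b:
        e, n, e_other, n_other = events_a, total_a, events_b, total_b
    else:
        e, n, e_other, n_other = events_b, total_b, events_a, total_a
    fi = 0
    while True:
        table = [[e, n - e], [e_other, n_other - e_other]]
        _, p = fisher_exact(table, alternative="two-sided")
        if p >= alpha or e == n:   # non-significant, or no non-events left to switch
            return fi
        e += 1                      # switch one non-event to an event
        fi += 1

# Hypothetical trial: 15/80 responders on comparator vs 30/80 on treatment.
print(fragility_index(15, 80, 30, 80))
```

Adding an event while removing a non-event in the same arm keeps each arm’s sample size constant, mirroring the description above.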

We intended to conduct linear regression and Spearman’s rank correlations to assess the association between the FI and journal impact factor, study sample size, and number of events. However, we were not powered to do so given the limited number of eligible studies included in this review and thus refrained from conducting any inferential statistics.

We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines for reporting (see Supplementary Material ) [ 16 ].

Study selection

Our search yielded 13,463 unique studies, of which 104 were RCTs evaluating OSAT for OUD. Among these, ten studies met the criteria required for FI calculation and were included in our analyses. Please refer to Fig. 1 for the search results, study inclusion flow diagram, and Table 1 for details on included studies.

figure 1

PRISMA flow diagram delineating study selection

Characteristics of included studies

The included studies were published between 1980 and 2018, in eight different journals with a median impact factor of 8.48 (IQR 6.53–56.27, range 3.77–91.25). Four studies reported on a calculated sample size [ 17 , 18 , 19 , 20 ], and only one study specified that reporting guidelines were used [ 21 ]. Treatment retention was the most commonly reported primary outcome ( k = 8). The median sample size of included studies was 82.5 (IQR 58–179, range 52–226).

Overall ROB was deemed to be low in seven studies [ 17 , 19 , 20 , 21 , 22 , 23 , 24 ], have some concerns in two studies [ 18 , 25 ], and be high in one study [ 26 ] due to a high proportion of missing outcome data that was not accounted for in the analyses. We present a breakdown of the ROB assessment of the included studies for the dichotomous outcome of interest in Table 2 .

Fragility index

The median FI of included studies was 7.5 (IQR 4–12; range 1–26). The FI of individual studies is reported in Table 1 . The number of participants lost to follow-up exceeded the FI in two studies [ 23 , 26 ]. The FI appeared to increase with sample size; however, no clear relationship was apparent between the FI and journal impact factor or number of events.

This is the first study to evaluate the FI in the field of addiction medicine, and more specifically in OUD trials. Among the ten RCTs evaluating OSAT for OUD, we found that, in some cases, changing the outcome of one or two participants could completely alter the study’s conclusions and render the results statistically non-significant.

We compared our findings to those of Holek et al. , wherein they examined the mean FI across all reviews published in PubMed between 2014 and 2019 that assessed the distribution of FIs, irrespective of discipline (though none were in addiction medicine) [ 13 ]. Among 24 included reviews with a median sample size of 134 (IQR 82, 207), they found a mean FI of 4 (95% CI 3, 5) [ 13 ]. This is slightly lower than our median FI of 7.5 (IQR 4–12; range 1–26). It is important to note that half of the reviews included in the study by Holek et al. were conducted in surgical disciplines, which are generally subject to more limitations to internal and external validity, as it is often not possible to conceal allocation or blind participants or operators, and the intervention is operator dependent [ 27 ]. To date, no study has directly applied the FI to the findings of trials in OUD. In the HIV/AIDS literature, however, a population commonly shared with addiction medicine owing to coexisting comorbidities, the median fragility across all trials assessing anti-retroviral therapies ( n = 39) was 6 (IQR = 1, 11) [ 28 ], which is closer to our calculated FI. Among the trials included in that review, only 3 were deemed to be at high risk of bias, whereas 13 and 20 were deemed to be at low and some risk of bias, respectively.

Loss to follow-up plays an important role in the interpretation of the FI. For instance, when the number of study participants lost to follow-up exceeds the FI of the trial, the outcomes of these participants could have altered the statistical significance and final conclusions of the study. While the number of participants lost to follow-up exceeded the FI in only two of the included studies [ 23 , 26 ], this metric is less important in our case given that the primary outcome assessed by the majority of trials was retention in treatment, rendering loss to follow-up an outcome in itself. In our report, we considered participants to be lost to follow-up if they left the study for reasons that were known and not necessarily indicative of treatment failure, such as factors beyond the participants’ control, including incarceration or transfer to another treatment location.

Findings from our analysis of the literature, together with the application of the FI to existing clinical trials in the field of addiction medicine, demonstrate significant concerns regarding the robustness of the evidence. This, in conjunction with the large differences between the clinical population of opioid-dependent patients and trial participants inherent in addiction medicine trials, raises larger concerns about a growing body of evidence with deficiencies in both internal and external validity. The findings from this study raise important clinical concerns regarding the applicability of the current evidence to treating patients in the context of the opioid epidemic. Are we recommending the appropriate treatments for patients with OUD based on robust and applicable evidence? Are we completing our due diligence and ensuring clinicians and researchers alike understand the critical issues rampant in the literature, including the fragility of the data and misconceptions of p -values? Are we putting our patients at risk by employing treatments based on fragile data? These questions cannot be answered until the appropriate re-evaluation of the evidence takes place, employing both pragmatic trial designs and transparent metrics that reflect the reliability and robustness of the findings.

Strengths and limitations

Our study is strengthened by a comprehensive search strategy, rigorous and systematic screening of studies, and the use of an objective measure to gauge the robustness of studies (i.e., the FI). The limitations of this study are inherent in the limitations of the FI itself: it can only be calculated for RCTs with a 1:1 allocation ratio, a parallel arm or two-by-two factorial design, and a dichotomous primary outcome. As a result, 94 RCTs evaluating OSAT for OUD were excluded for not meeting these criteria (Fig. 1 ). Nonetheless, the FI provides a general sense of the robustness of the available studies, and our data reflect studies published across almost four decades in journals of varying impact factor.

Future direction

This study serves as further evidence of the need to shift away from p -values [ 29 , 30 ]. Although statisticians increasingly advise against relying on statistical significance because it cannot convey clinical importance [ 31 ], the p -value remains the simplest and most commonly reported metric in manuscripts. p -values provide a simple statistical measure to confirm or refute a null hypothesis, by providing a measure of how likely the observed result would be if the null hypothesis were true. An arbitrary cutoff of 5% is traditionally used as a threshold for rejecting the null hypothesis. However, a major drawback of the p -value is that it does not take into account the effect size of the outcome measure, such that a small incremental change that may not be clinically significant may still be statistically significant in a large enough trial. Conversely, a very large effect size that has biological plausibility may not reach statistical significance if the trial is not large enough [ 29 , 30 ]. This is highly problematic given the common misconceptions surrounding the p -value. Increasing emphasis is being placed on transparency in outcome reporting and on the reporting of confidence intervals, which allow the reader to gauge the uncertainty in the evidence and make an informed decision about whether a finding is clinically significant. It has also been recommended that studies report the FI where possible to provide readers with a comprehensible way of gauging the robustness of their findings [ 12 , 13 ]. There is also a drive to make all data publicly available, allowing for replication of study findings as well as pooling of data among databases to generate more robust analyses using larger pragmatic samples [ 32 ]. Together, these efforts aim to increase the transparency of research and facilitate data sharing, allowing stronger and more robust evidence to be produced, advancing evidence-based medicine, and improving the quality of care delivered to patients.

Our results suggest that changing the outcomes of approximately eight participants would be enough to overturn the conclusions of the majority of trials in addiction medicine. Findings from our analysis of the literature and the application of the FI to existing clinical trials in the field demonstrate significant concerns regarding the overall quality, and specifically the robustness and stability, of the evidence and the conclusions of the trials. Findings from this work raise larger concerns about a growing body of evidence with deficiencies in both internal and external validity. To advance the field of addiction medicine, we must re-evaluate the quality of the evidence and consider employing pragmatic trial designs as well as transparent metrics that reflect the reliability and robustness of the findings. Placing emphasis on clinical relevance and reporting the FI along with confidence intervals may provide researchers, clinicians, and guideline developers with a transparent method to assess the outcomes of clinical trials, ensuring vigilance in decisions regarding the management and treatment of patients with substance use disorders.

Availability of data and materials

All data generated or analyzed during this study are included in this published article (and its supplementary information files).

Abbreviations

IQR: Interquartile range
OUD: Opioid use disorder
OSAT: Opioid substitution and antagonist therapies
RCT: Randomized controlled trial
ROB: Risk of bias
SD: Standard deviation
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses

Products - Vital Statistics Rapid Release - Provisional Drug Overdose Data. https://www.cdc.gov/nchs/nvss/vsrr/drug-overdose-data.htm . Accessed April 26, 2020.

Spencer MR, Miniño AM, Warner M. Drug overdose deaths in the United States, 2001–2021. NCHS Data Brief, no 457. Hyattsville, MD: National Center for Health Statistics. 2022. https://doi.org/10.15620/cdc:122556 .

Mattick RP, Breen C, Kimber J, Davoli M. Methadone maintenance therapy versus no opioid replacement therapy for opioid dependence. Cochrane Database Syst Rev. 2009;(3).  https://doi.org/10.1002/14651858.CD002209.PUB2/FULL .

Hedrich D, Alves P, Farrell M, Stöver H, Møller L, Mayet S. The effectiveness of opioid maintenance treatment in prison settings: a systematic review. Addiction. 2012;107(3):501–17. https://doi.org/10.1111/J.1360-0443.2011.03676.X .


Dennis BB, Naji L, Bawor M, et al. The effectiveness of opioid substitution treatments for patients with opioid dependence: a systematic review and multiple treatment comparison protocol. Syst Rev. 2014;3(1):105. https://doi.org/10.1186/2046-4053-3-105 .


Dennis BB, Sanger N, Bawor M, et al. A call for consensus in defining efficacy in clinical trials for opioid addiction: combined results from a systematic review and qualitative study in patients receiving pharmacological assisted therapy for opioid use disorder. Trials. 2020;21(1). https://doi.org/10.1186/s13063-019-3995-y .

British Columbia Centre on Substance Use. (2017). A Guideline for the Clinical Management of Opioid Use Disorder . http://www.bccsu.ca/care-guidance-publications/ . Accessed December 4, 2020.

Kampman  K, Jarvis M. American Society of Addiction Medicine (ASAM) national practice guideline for the use of medications in the treatment of addiction involving opioid use. J Addict Med. 2015;9(5):358–367.

Srivastava A, Wyman J, Fcfp MD, Mph D. Methadone treatment for people who use fentanyl: recommendations. 2021. www.metaphi.ca . Accessed November 14, 2023.

Dennis BB, Roshanov PS, Naji L, et al. Opioid substitution and antagonist therapy trials exclude the common addiction patient: a systematic review and analysis of eligibility criteria. Trials. 2015;16(1):1. https://doi.org/10.1186/s13063-015-0942-4 .


Feinstein AR. The unit fragility index: an additional appraisal of “statistical significance” for a contrast of two proportions. J Clin Epidemiol. 1990;43(2):201–9. https://doi.org/10.1016/0895-4356(90)90186-S .


Walsh M, Srinathan SK, McAuley DF, et al. The statistical significance of randomized controlled trial results is frequently fragile: a case for a fragility index. J Clin Epidemiol. 2014;67(6):622–8. https://doi.org/10.1016/j.jclinepi.2013.10.019 .

Holek M, Bdair F, Khan M, et al. Fragility of clinical trials across research fields: a synthesis of methodological reviews. Contemp Clin Trials. 2020;97. doi: https://doi.org/10.1016/j.cct.2020.106151

Sterne JAC, Savović J, Page MJ, et al. RoB 2: a revised tool for assessing risk of bias in randomised trials. BMJ. 2019;366. doi: https://doi.org/10.1136/bmj.l4898

Kane SP. Fragility Index Calculator. ClinCalc: https://clincalc.com/Stats/FragilityIndex.aspx . Updated July 19, 2018. Accessed October 17, 2023.

Page MJ, McKenzie JE, Bossuyt PM, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71. https://doi.org/10.1136/bmj.n71 .


Petitjean S, Stohler R, Déglon JJ, et al. Double-blind randomized trial of buprenorphine and methadone in opiate dependence. Drug Alcohol Depend. 2001;62(1):97–104. https://doi.org/10.1016/S0376-8716(00)00163-0 .

Sees KL, Delucchi KL, Masson C, et al. Methadone maintenance vs 180-day psychosocially enriched detoxification for treatment of opioid dependence: a randomized controlled trial. JAMA. 2000;283(10):1303–10. https://doi.org/10.1001/JAMA.283.10.1303 .

Kakko J, Dybrandt Svanborg K, Kreek MJ, Heilig M. 1-year retention and social function after buprenorphine-assisted relapse prevention treatment for heroin dependence in Sweden: a randomised, placebo-controlled trial. Lancet (London, England). 2003;361(9358):662–8. https://doi.org/10.1016/S0140-6736(03)12600-1 .

Oviedo-Joekes E, Brissette S, Marsh DC, et al. Diacetylmorphine versus methadone for the treatment of opioid addiction. N Engl J Med. 2009;361(8):777–86. https://doi.org/10.1056/NEJMoa0810635 .


Hulse GK, Morris N, Arnold-Reed D, Tait RJ. Improving clinical outcomes in treating heroin dependence: randomized, controlled trial of oral or implant naltrexone. Arch Gen Psychiatry. 2009;66(10):1108–15. https://doi.org/10.1001/ARCHGENPSYCHIATRY.2009.130 .

Krupitsky EM, Zvartau EE, Masalov DV, et al. Naltrexone for heroin dependence treatment in St. Petersburg, Russia. J Subst Abuse Treat. 2004;26(4):285–94. https://doi.org/10.1016/j.jsat.2004.02.002 .

Krook AL, Brørs O, Dahlberg J, et al. A placebo-controlled study of high dose buprenorphine in opiate dependents waiting for medication-assisted rehabilitation in Oslo, Norway. Addiction. 2002;97(5):533–42. https://doi.org/10.1046/J.1360-0443.2002.00090.X .

Hartnoll RL, Mitcheson MC, Battersby A, et al. Evaluation of heroin maintenance in controlled trial. Arch Gen Psychiatry. 1980;37(8):877–84. https://doi.org/10.1001/ARCHPSYC.1980.01780210035003 .

Fischer G, Gombas W, Eder H, et al. Buprenorphine versus methadone maintenance for the treatment of opioid dependence. Addiction. 1999;94(9):1337–47. https://doi.org/10.1046/J.1360-0443.1999.94913376.X .

Yancovitz SR, Des Jarlais DC, Peyser NP, et al. A randomized trial of an interim methadone maintenance clinic. Am J Public Health. 1991;81(9):1185–91. https://doi.org/10.2105/AJPH.81.9.1185 .

Demange MK, Fregni F. Limits to clinical trials in surgical areas. Clinics (Sao Paulo). 2011;66(1):159–61. https://doi.org/10.1590/S1807-59322011000100027 .

Wayant C, Meyer C, Gupton R, Som M, Baker D, Vassar M. The fragility index in a cohort of HIV/AIDS randomized controlled trials. J Gen Intern Med. 2019;34(7):1236–43. https://doi.org/10.1007/S11606-019-04928-5 .

Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature. 2019;567(7748):305–7. https://doi.org/10.1038/D41586-019-00857-9 .

Ioannidis JPA. Why most published research findings are false. PLoS Med. 2005;2(8):e124. https://doi.org/10.1371/journal.pmed.0020124 .

Goodman SN. Toward evidence-based medical statistics. 1: the p value fallacy. Ann Intern Med. 1999;130(12):995–1004. https://doi.org/10.7326/0003-4819-130-12-199906150-00008 .

Allison DB, Shiffrin RM, Stodden V. Reproducibility of research: issues and proposed remedies. Proc Natl Acad Sci U S A. 2018;115(11):2561–2. https://doi.org/10.1073/PNAS.1802324115 .


Acknowledgements

The authors received no funding for this work.

Author information

Authors and Affiliations

Department of Family Medicine, David Braley Health Sciences Centre, McMaster University, 100 Main St W, 3rdFloor, Hamilton, ON, L8P 1H6, Canada

Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, ON, Canada

Leen Naji, Myanca Rodrigues, Andrew Worster, Lehana Thabane & Zainab Samaan

Department of Medicine, Montefiore Medical Center, New York, NY, USA

Department of Medicine, McMaster University, Hamilton, ON, Canada

Brittany Dennis & Andrew Worster

Department of Medicine, University of British Columbia, Vancouver, Canada

Brittany Dennis

Department of Medicine, Imperial College Healthcare NHS Trust, London, UK

Monica Bawor

Department of Psychiatry and Behavioral Neurosciences, McMaster University, Hamilton, ON, Canada

Alannah Hillmer

Physician Assistant Program, University of Toronto, Toronto, ON, Canada

Caroul Chawar

Department of Family Medicine, Western University, London, ON, Canada

Department of Anesthesia, McMaster University, Hamilton, ON, Canada

Biostatistics Unit, Research Institute at St Joseph’s Healthcare, Hamilton, ON, Canada

Lehana Thabane

Department of Psychiatry and Behavioral Neurosciences, McMaster University, Hamilton, ON, Canada

Zainab Samaan


Contributions

LN, BD, MB, LT, and ZS conceived the research question and protocol. LN, BD, MR, and AH designed the search strategy and ran the literature search. LN, BD, MR, AH, CC, and ED contributed to screening studies for eligibility and data extraction. LN and LT analyzed data. All authors contributed equally to the writing and revision of the manuscript. All authors approved the final version of the manuscript.

Corresponding author

Correspondence to Leen Naji .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Material 1.


About this article

Cite this article.

Naji, L., Dennis, B., Rodrigues, M. et al. Assessing fragility of statistically significant findings from randomized controlled trials assessing pharmacological therapies for opioid use disorders: a systematic review. Trials 25 , 286 (2024). https://doi.org/10.1186/s13063-024-08104-x


Received : 11 December 2022

Accepted : 10 April 2024

Published : 27 April 2024

DOI : https://doi.org/10.1186/s13063-024-08104-x


Keywords: Research methods; Critical appraisal; Systematic review



Photodynamic Therapy for Colorectal Cancer: A Systematic Review of Clinical Research

Affiliations

  • 1 Department of Surgery, University of Toronto, Toronto, ON, Canada.
  • 2 Institute of Biomedical Engineering, University of Toronto, Toronto, ON, Canada.
  • 3 Princess Margaret Cancer Centre, Toronto, ON, Canada.
  • 4 University Health Network, Toronto, ON, Canada.
  • PMID: 35428418
  • PMCID: PMC9667091
  • DOI: 10.1177/15533506221083545

Background: Photodynamic therapy (PDT) is a therapeutic modality that can be used to ablate tumors using the localized generation of reactive oxygen species by combining a photosensitizer, light, and molecular oxygen. This modality holds promise as an adjunctive therapy in the management of colorectal cancer and could be incorporated into neoadjuvant treatment plans under the auspices of prospective clinical trials.

Methods: We conducted a search of primary literature published until January 2021, based on PRISMA guidelines. Primary clinical studies of PDT for the management of colorectal cancer were included. Screening, inclusion, quality assessment, and data collection were performed in duplicate. Analyses were descriptive or thematic.

Results: Nineteen studies were included, most of which were case series. The total number of patients reported to have received PDT for colorectal cancer was 137, almost all of whom received PDT with palliative intent. The most common photosensitizer was hematoporphyrin derivative or Photofrin. The light dose used varied from 32 J/cm² to 500 J/cm². Complete tumor response (cure) was reported in 40%, with partial response reported in 43.2%. Symptomatic improvement was reported in 51.9% of patients. In total, 32 complications were reported, the most common of which was a skin photosensitivity reaction.

Conclusions: PDT for the management of colorectal cancer has not been well studied, despite promising results in early clinical case series. New, well designed, prospective clinical trials are required to establish and define the role of PDT in the management of colorectal cancer.

Keywords: adjuvant therapy; colon cancer; colorectal cancer; neoadjuvant therapy; photodynamic therapy; photofrin; photosensitizer; phototherapy; rectal cancer.

Publication types

  • Systematic Review

MeSH terms

  • Colorectal Neoplasms* / drug therapy
  • Photochemotherapy* / adverse effects
  • Photochemotherapy* / methods
  • Photosensitizing Agents / therapeutic use
  • Prospective Studies

Substances

  • Photosensitizing Agents

Efficacy of psilocybin for treating symptoms of depression: systematic review and meta-analysis

Linked editorial.

Psilocybin for depression


This article has a correction. Please see:

  • EXPRESSION OF CONCERN: Efficacy of psilocybin for treating symptoms of depression: systematic review and meta-analysis - May 04, 2024
  • Athina-Marina Metaxa , masters graduate researcher 1 ,
  • Mike Clarke , professor 2
  • 1 Nuffield Department of Primary Care Health Sciences, University of Oxford, Oxford OX2 6GG, UK
  • 2 Northern Ireland Methodology Hub, Centre for Public Health, ICS-A Royal Hospitals, Belfast, Ireland, UK
  • Correspondence to: A-M Metaxa athina.metaxa{at}hmc.ox.ac.uk (or @Athina_Metaxa12 on X)
  • Accepted 6 March 2024

Objective To determine the efficacy of psilocybin as an antidepressant compared with placebo or non-psychoactive drugs.

Design Systematic review and meta-analysis.

Data sources Five electronic databases of published literature (Cochrane Central Register of Controlled Trials, Medline, Embase, Science Citation Index and Conference Proceedings Citation Index, and PsycInfo) and four databases of unpublished and international literature (ClinicalTrials.gov, WHO International Clinical Trials Registry Platform, ProQuest Dissertations and Theses Global, and PsycEXTRA), and handsearching of reference lists, conference proceedings, and abstracts.

Data synthesis and study quality Information on potential treatment effect moderators was extracted, including depression type (primary or secondary), previous use of psychedelics, psilocybin dosage, type of outcome measure (clinician rated or self-reported), and personal characteristics (eg, age, sex). Data were synthesised using a random effects meta-analysis model, and observed heterogeneity and the effect of covariates were investigated with subgroup analyses and metaregression. Hedges’ g was used as a measure of treatment effect size, to account for small sample effects and substantial differences between the included studies’ sample sizes. Study quality was appraised using Cochrane’s Risk of Bias 2 tool, and the quality of the aggregated evidence was evaluated using GRADE guidelines.

Eligibility criteria Randomised trials in which psilocybin was administered as a standalone treatment for adults with clinically significant symptoms of depression and change in symptoms was measured using a validated clinician rated or self-report scale. Studies with directive psychotherapy were included if the psychotherapeutic component was present in both experimental and control conditions. Participants with depression regardless of comorbidities (eg, cancer) were eligible.

Results Meta-analysis on 436 participants (228 female participants), average age 36-60 years, from seven of the nine included studies showed a significant benefit of psilocybin (Hedges’ g=1.64, 95% confidence interval (CI) 0.55 to 2.73, P<0.001) on change in depression scores compared with comparator treatment. Subgroup analyses and metaregressions indicated that having secondary depression (Hedges’ g=3.25, 95% CI 0.97 to 5.53), being assessed with self-report depression scales such as the Beck depression inventory (3.25, 0.97 to 5.53), and older age and previous use of psychedelics (metaregression coefficient 0.16, 95% CI 0.08 to 0.24 and 4.2, 1.5 to 6.9, respectively) were correlated with greater improvements in symptoms. All studies had a low risk of bias, but the change from baseline metric was associated with high heterogeneity and a statistically significant risk of small study bias, resulting in a low certainty of evidence rating.

Conclusion Treatment effects of psilocybin were significantly larger among patients with secondary depression, when self-report scales were used to measure symptoms of depression, and when participants had previously used psychedelics. Further research is thus required to delineate the influence of expectancy effects, moderating factors, and treatment delivery on the efficacy of psilocybin as an antidepressant.

Systematic review registration PROSPERO CRD42023388065.


Introduction

Depression affects an estimated 300 million people around the world, an increase of nearly 20% over the past decade. 1 Worldwide, depression is also the leading cause of disability. 2

Drugs for depression are widely available but these seem to have limited efficacy, can have serious adverse effects, and are associated with low patient adherence. 3 4 Importantly, the treatment effects of antidepressant drugs do not appear until 4-7 weeks after the start of treatment, and remission of symptoms can take months. 4 5 Additionally, the likelihood of relapse is high, with 40-60% of people with depression experiencing a further depressive episode, and the chance of relapse increasing with each subsequent episode. 6 7

Since the early 2000s, the naturally occurring serotonergic hallucinogen psilocybin, found in several species of mushrooms, has been widely discussed as a potential treatment for depression. 8 9 Psilocybin’s mechanism of action differs from that of classic selective serotonin reuptake inhibitors (SSRIs) and might improve the treatment response rate, decrease time to improvement of symptoms, and prevent relapse post-remission. Moreover, more recent assessments of harm have consistently reported that psilocybin generally has low addictive potential and toxicity and that it can be administered safely under clinical supervision. 10

The renewed interest in psilocybin’s antidepressive effects led to several clinical trials on treatment resistant depression, 11 12 major depressive disorder, 13 and depression related to physical illness. 14 15 16 17 These trials mostly reported positive efficacy findings, showing reductions in symptoms of depression within a few hours to a few days after one dose or two doses of psilocybin. 11 12 13 16 17 18 These studies reported only minimal adverse effects, however, and drug harm assessments in healthy volunteers indicated that psilocybin does not induce physiological toxicity, is not addictive, and does not lead to withdrawal. 19 20 Nevertheless, these findings should be interpreted with caution owing to the small sample sizes and open label design of some of these studies. 11 21

Several systematic reviews and meta-analyses since the early 2000s have investigated the use of psilocybin to treat symptoms of depression. Most found encouraging results, but as well as people with depression some included healthy volunteers, 22 and most combined data from studies of multiple serotonergic psychedelics, 23 24 25 even though each compound has unique neurobiological effects and mechanisms of action. 26 27 28 Furthermore, many systematic reviews included non-randomised studies and studies in which psilocybin was tested in conjunction with psychotherapeutic interventions, 25 29 30 31 32 which made it difficult to distinguish psilocybin’s treatment effects. Most systematic reviews and meta-analyses did not consider the impact of factors that could act as moderators to psilocybin’s effects, such as type of depression (primary or secondary), previous use of psychedelics, psilocybin dosage, type of outcome measure (clinician rated or self-reported), and personal characteristics (eg, age, sex). 25 26 29 30 31 32 Lastly, systematic reviews did not consider grey literature, 33 34 which might have led to a substantial overestimation of psilocybin’s efficacy as a treatment for depression. In this review we focused on randomised trials that contained an unconfounded evaluation of psilocybin in adults with symptoms of depression, regardless of country and language of publication.

In this systematic review and meta-analysis of indexed and non-indexed randomised trials we investigated the efficacy of psilocybin to treat symptoms of depression compared with placebo or non-psychoactive drugs. The protocol was registered in the International Prospective Register of Systematic Reviews (see supplementary Appendix A). The study overall did not deviate from the pre-registered protocol; one clarification was made to highlight that any non-psychedelic comparator was eligible for inclusion, including placebo, niacin, micro doses of psychedelics, and drugs that are considered the standard of care in depression (eg, SSRIs).

Inclusion and exclusion criteria

Double blind and open label randomised trials with a crossover or parallel design were eligible for inclusion. We considered only studies in humans and with a control condition, which could include any type of non-active comparator, such as placebo, niacin, or micro doses of psychedelics.

Eligible studies were those that included adults (≥18 years) with clinically significant symptoms of depression, evaluated using a clinically validated tool for depression and mood disorder outcomes. Such tools included the Beck depression inventory, Hamilton depression rating scale, Montgomery-Åsberg depression rating scale, profile of mood states, and quick inventory of depressive symptomatology. Studies of participants with symptoms of depression and comorbidities (eg, cancer) were also eligible. We excluded studies of healthy participants (without depressive symptomatology).

Eligible studies investigated the effect of psilocybin as a standalone treatment on symptoms of depression. Studies with an active psilocybin condition that involved micro dosing (ie, psilocybin <100 μg/kg, according to the commonly accepted convention 22 35 ) were excluded. We included studies with directive psychotherapy if the psychotherapeutic component was present in both the experimental and the control conditions, so that the effects of psilocybin could be distinguished from those of psychotherapy. Studies involving group therapy were also excluded. Any non-psychedelic comparator was eligible for inclusion, including placebo, niacin, and micro doses of psychedelics.

Changes in symptoms, measured by validated clinician rated or self-report scales, such as the Beck depression inventory, Hamilton depression rating scale, Montgomery-Åsberg depression rating scale, profile of mood states, and quick inventory of depressive symptomatology were considered. We excluded outcomes that were measured less than three hours after psilocybin had been administered because any reported changes could be attributed to the transient cognitive and affective effects of the substance being administered. Aside from this, outcomes were included irrespective of the time point at which measurements were taken.

Search strategy

We searched major electronic databases and trial registries of psychological and medical research, with no limits on the publication date. Databases were the Cochrane Central Register of Controlled Trials via the Cochrane Library, Embase via Ovid, Medline via Ovid, Science Citation Index and Conference Proceedings Citation Index-Science via Web of Science, and PsycInfo via Ovid. A search through multiple databases was necessary because each database includes unique journals. Supplementary Appendix B shows the search syntax used for the Cochrane Central Register of Controlled Trials, which was slightly modified to comply with the syntactic rules of the other databases.

Unpublished and grey literature were sought through registries of past and ongoing trials, databases of conference proceedings, government reports, theses, dissertations, and grant registries (eg, ClinicalTrials.gov, WHO International Clinical Trials Registry Platform, ProQuest Dissertations and Theses Global, and PsycEXTRA). The references and bibliographies of eligible studies were checked for relevant publications. The original search was done in January 2023 and an updated search was performed on 10 August 2023.

Data collection, extraction, and management

The results of the literature search were imported to the Endnote X9 reference management software, and the references were imported to the Covidence platform after removal of duplicates. Two reviewers (AM and DT) independently screened the title and abstract of each reference and then screened the full text of potentially eligible references. Any disagreements about eligibility were resolved through discussion. If information was insufficient to determine eligibility, the study’s authors were contacted. The reviewers were not blinded to the studies’ authors, institutions, or journal of publication.

The PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow diagram shows the study selection process and reasons for excluding studies that were considered eligible for full text screening. 36

Critical appraisal of individual studies and of aggregated evidence

The methodological quality of eligible studies was assessed using the Cochrane Risk of Bias 2 tool (RoB 2) for assessing risk of bias in randomised trials. 37 In addition to the criteria specified by RoB 2, we considered the potential impact of industry funding and conflicts of interest. The overall methodological quality of the aggregated evidence was evaluated using GRADE (Grading of Recommendations, Assessment, Development and Evaluation). 38

If we found evidence of heterogeneity among the trials, then small study biases, such as publication bias, were assessed using a funnel plot and asymmetry tests (eg, Egger’s test). 39
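
As a rough illustration of the asymmetry test mentioned above, Egger’s test can be run by regressing each study’s standardised effect (effect divided by its standard error) on its precision (one divided by the standard error) and testing whether the intercept differs from zero. The sketch below uses statsmodels and entirely made-up effect sizes; it is not the analysis code used in this review.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical Hedges' g values and their standard errors for k studies.
g = np.array([1.2, 0.8, 2.1, 0.5, 1.6, 0.9, 1.4])
se = np.array([0.45, 0.30, 0.60, 0.25, 0.50, 0.35, 0.40])

# Egger's regression: standardised effect on precision; an intercept that
# differs significantly from zero suggests funnel plot asymmetry.
y = g / se
X = sm.add_constant(1.0 / se)
fit = sm.OLS(y, X).fit()
print("intercept:", fit.params[0], "p-value:", fit.pvalues[0])
```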

We used a template for data extraction (see supplementary Appendix C) and summarised the extracted data in tabular form, outlining personal characteristics (age, sex, previous use of psychedelics), methodology (study design, dosage), and outcome related characteristics (mean change from baseline score on a depression questionnaire, response rates, and remission rates) of the included studies. Response conventionally refers to a 50% decrease in symptom severity based on scores on a depression rating scale, whereas remission scores are specific to a questionnaire (eg, score of ≤5 on the quick inventory of depressive symptomatology, score of ≤10 on the Montgomery-Åsberg depression rating scale, 50% or greater reduction in symptoms, score of ≤7 on the Hamilton depression rating scale, or score of ≤12 on the Beck depression inventory). Across depression scales, higher scores signify more severe symptoms of depression.

Continuous data synthesis

From each study we extracted the baseline and post-intervention means and standard deviations (SDs) of the scores between comparison groups for the depression questionnaires and calculated the mean differences and SDs of change. If means and SDs were not available for the included studies, we extracted the values from available graphs and charts using the Web Plot Digitizer application ( https://automeris.io/WebPlotDigitizer/ ). If it was not possible to calculate SDs from the graphs or charts, we generated values by converting standard errors (SEs) or confidence intervals (CIs), depending on availability, using formulas in the Cochrane Handbook (section 7.7.3.2). 40
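
For readers unfamiliar with these conversions, the Cochrane Handbook formulas referred to above reduce to simple arithmetic. A minimal sketch, assuming a 95% confidence interval and an approximately normal sampling distribution (for small samples the divisor should come from the t distribution instead), is:

```python
import math

def sd_from_se(se, n):
    # SD = SE * sqrt(n)
    return se * math.sqrt(n)

def sd_from_ci(lower, upper, n):
    # For a 95% CI based on the normal distribution:
    # SE = (upper - lower) / 3.92, then SD = SE * sqrt(n)
    return math.sqrt(n) * (upper - lower) / 3.92

# Hypothetical group: n = 25, 95% CI for the mean change of (-12.0, -4.0)
print(sd_from_ci(-12.0, -4.0, 25))
```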

Standardised mean differences were calculated for each study. We chose these rather than weighted mean differences because, although all the studies measured depression as the primary outcome, they did so with different questionnaires that score depression based on slightly different items. 41 If we had used weighted mean differences, any variability among studies would be assumed to reflect actual methodological or population differences and not differences in how the outcome was measured, which could be misleading. 40

The Hedges’ g effect size estimate was used because it tends to produce less biased results for studies with smaller samples (<20 participants) and when sample sizes differ substantially between studies, in contrast with Cohen’s d. 42 According to the Cochrane Handbook, the Hedges’ g effect size measure is synonymous with the standardised mean difference, 40 and the terms may be used interchangeably. Thus, a Hedges’ g of 0.2, 0.5, 0.8, or 1.2 corresponds to a small, medium, large, or very large effect, respectively. 40
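
To make the effect size measure concrete, Hedges’ g is the standardised mean difference multiplied by a small-sample correction factor. The sketch below, using hypothetical group summaries, shows the usual calculation:

```python
import math

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    # Pooled standard deviation across the two groups
    sp = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp                     # standardised mean difference (Cohen's d)
    j = 1 - 3 / (4 * (n1 + n2 - 2) - 1)    # Hedges' small-sample correction
    return j * d

# Hypothetical change-from-baseline scores: psilocybin vs comparator
print(hedges_g(-14.0, 7.5, 15, -5.0, 8.0, 14))
```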

Owing to variation in the participants’ personal characteristics, psilocybin dosage, type of depression investigated (primary or secondary), and type of comparators, we used a random effects model with a Hartung-Knapp-Sidik-Jonkman modification. 43 This model also allowed for heterogeneity and within study variability to be incorporated into the weighting of the results of the included studies. 44 Lastly, this model could help to generalise the findings beyond the studies and patient populations included, making the meta-analysis more clinically useful. 45 We chose the Hartung-Knapp-Sidik-Jonkman adjustment in favour of more widely used random effects models (eg, DerSimonian and Laird) because it allows for better control of type 1 errors, especially for studies with smaller samples, and provides a better estimation of between study variance by accounting for small sample sizes. 46 47
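
The model choice described above can be illustrated with a short numerical sketch: between-study variance is estimated (here with the DerSimonian-Laird method-of-moments estimator), study effects are pooled with random effects weights, and the Hartung-Knapp-Sidik-Jonkman adjustment replaces the usual variance with a weighted residual variance and uses a t distribution with k−1 degrees of freedom. The effect sizes below are hypothetical, and the code illustrates the general approach rather than the analysis code used in this review.

```python
import numpy as np
from scipy import stats

def random_effects_hksj(y, v, alpha=0.05):
    """Random effects pooling with DerSimonian-Laird tau^2 and the
    Hartung-Knapp-Sidik-Jonkman (HKSJ) variance adjustment."""
    k = len(y)
    w_fe = 1.0 / v                                  # fixed effect weights
    mu_fe = np.sum(w_fe * y) / np.sum(w_fe)
    q = np.sum(w_fe * (y - mu_fe) ** 2)             # Cochran's Q
    c = np.sum(w_fe) - np.sum(w_fe ** 2) / np.sum(w_fe)
    tau2 = max(0.0, (q - (k - 1)) / c)              # DL between-study variance
    w = 1.0 / (v + tau2)                            # random effects weights
    mu = np.sum(w * y) / np.sum(w)
    # HKSJ: weighted residual variance and a t-based confidence interval (k-1 df)
    var_hksj = np.sum(w * (y - mu) ** 2) / ((k - 1) * np.sum(w))
    half_width = stats.t.ppf(1 - alpha / 2, k - 1) * np.sqrt(var_hksj)
    return mu, (mu - half_width, mu + half_width), tau2

# Hypothetical Hedges' g values and within-study variances for seven studies
y = np.array([1.2, 0.8, 2.1, 0.5, 1.6, 0.9, 1.4])
v = np.array([0.20, 0.09, 0.36, 0.06, 0.25, 0.12, 0.16])
print(random_effects_hksj(y, v))
```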

For studies in which multiple treatment groups were compared with a single placebo group, we split the placebo group to avoid multiplicity. 48 Similarly, if studies included multiple primary outcomes (eg, change in depression at three weeks and at six weeks), we split the treatment groups to account for overlapping participants. 40

Prediction intervals (PIs) were calculated and reported to show the expected effect range of a similar future study, in a different setting. In a random effects model, within study measures of variability, such as CIs, can only show the range in which the average effect size could lie, but they are not informative about the range of potential treatment effects given the heterogeneity between studies. 49 Thus, we used PIs as an indication of variation between studies.
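
For completeness, the prediction interval referred to above is commonly approximated (assuming at least three studies) from the pooled estimate, its standard error, and the between-study variance as:

```latex
\hat{\mu} \;\pm\; t_{k-2,\,0.975}\,\sqrt{\hat{\tau}^{2} + \widehat{\operatorname{SE}}(\hat{\mu})^{2}}
```

where \(\hat{\mu}\) is the pooled effect, \(\hat{\tau}^{2}\) is the between-study variance, and k is the number of studies.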

Heterogeneity and sensitivity analysis

Statistical heterogeneity was tested using the χ 2 test (significance level P<0.1) and I 2 statistic, and heterogeneity among included studies was evaluated visually and displayed graphically using a forest plot. If substantial or considerable heterogeneity was found (I 2 ≥50% or P<0.1), 50 we considered the study design and characteristics of the included studies. Sources of heterogeneity were explored by subgroup analysis, and the potential effects on the results are discussed.
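
The heterogeneity statistics mentioned above follow the standard definitions, with \(w_i\) the fixed effect weights and \(\hat{\mu}_{FE}\) the fixed effect pooled estimate:

```latex
Q=\sum_{i=1}^{k} w_i\,(y_i-\hat{\mu}_{FE})^{2},
\qquad
I^{2}=\max\!\left(0,\;\frac{Q-(k-1)}{Q}\right)\times 100\%
```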

Planned sensitivity analyses to assess the effect of unpublished studies and studies at high risk of bias were not done because all included studies had been published and none were assessed as high risk of bias. Exclusion sensitivity plots were used to display graphically the impact of individual studies and to determine which studies had a particularly large influence on the results of the meta-analysis. All sensitivity analyses were carried out with Stata 16 software.

Subgroup analysis

To reduce the risk of errors caused by multiplicity and to avoid data fishing, we planned subgroup analyses a priori and limited to: (1) patient characteristics, including age and sex; (2) comorbidities, such as a serious physical condition (previous research indicates that the effects of psilocybin may be less strong for such participants, compared with participants with no comorbidities) 33 ; (3) number of doses and amount of psilocybin administered, because some previous meta-analyses found that a higher number of doses and a higher dose of psilocybin both predicted a greater reduction in symptoms of depression, 34 whereas others reported the opposite 33 ; (4) psilocybin administered alongside psychotherapeutic guidance or as a standalone treatment; (5) severity of depressive symptoms (clinical v subclinical symptomatology); (6) clinician versus patient rated scales; and (7) high versus low quality studies, as determined by RoB 2 assessment scores.

Metaregression

Given that enough studies were identified (≥10 distinct observations according to the Cochrane Handbook’s suggestion 40 ), we performed metaregression to investigate whether covariates, or potential effect modifiers, explained any of the statistical heterogeneity. The metaregression analysis was carried out using Stata 16 software.

Random effects metaregression analyses were used to determine whether continuous variables such as participants’ age, percentage of female participants, and percentage of participants who had previously used psychedelics modified the effect estimate, all of which have been implicated in differentially affecting the efficacy of psychedelics in modifying mood. 51 We chose this approach in favour of converting these continuous variables into categorical variables and conducting subgroup analyses for two primary reasons; firstly, the loss of any data and subsequent loss of statistical power would increase the risk of spurious significant associations, 51 and, secondly, no cut-offs have been agreed for these factors in literature on psychedelic interventions for mood disorders, 52 making any such divisions arbitrary and difficult to reconcile with the findings of other studies. The analyses were based on within study averages, in the absence of individual data points for each participant, with the potential for the results to be affected by aggregate bias, compromising their validity and generalisability. 53 Furthermore, a group level analysis may not be able to detect distinct interactions between the effect modifiers and participant subgroups, resulting in ecological bias. 54 As a result, this analysis should be considered exploratory.
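
As an indicative sketch of the kind of model described above, a study-level metaregression can be approximated with weighted least squares, weighting each study by the inverse of its total (within- plus between-study) variance; a full random effects metaregression would additionally re-estimate the residual between-study variance and apply a Knapp-Hartung adjustment. The moderator values below are hypothetical, and this is not the analysis code used in this review.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical study-level data: Hedges' g, total variance (within + between),
# and a continuous moderator (eg, mean participant age per study).
g = np.array([1.2, 0.8, 2.1, 0.5, 1.6, 0.9, 1.4])
var_total = np.array([0.28, 0.17, 0.44, 0.14, 0.33, 0.20, 0.24])
mean_age = np.array([40.0, 45.0, 58.0, 38.0, 52.0, 41.0, 47.0])

X = sm.add_constant(mean_age)
fit = sm.WLS(g, X, weights=1.0 / var_total).fit()
print(fit.params)    # intercept and metaregression coefficient for age
print(fit.pvalues)
```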

Sensitivity analysis

A sensitivity analysis was performed to determine whether the choice of analysis method affected the primary findings of the meta-analysis. Specifically, we reanalysed the data on change in depression score using a random effects DerSimonian and Laird model without the Hartung-Knapp-Sidik-Jonkman modification and compared the results with those of the originally used model. This comparison is particularly important in the presence of substantial heterogeneity and the potential of small study effects to influence the intervention effect estimate. 55

Patient and public involvement

Research on novel depression treatments is of great interest to both patients and the public. Although patients and members of the public were not directly involved in the planning or writing of this manuscript owing to a lack of available funding for recruitment and researcher training, patients and members of the public read the manuscript after submission.

Results

Figure 1 presents the flow of studies through the systematic review and meta-analysis. 56 A total of 4884 titles were retrieved from the five databases of published literature, and a further 368 titles were identified from the databases of unpublished and international literature in February 2023. After the removal of duplicate records, we screened the abstracts and titles of 875 reports. A further 12 studies were added after handsearching of reference lists and conference proceedings and abstracts. Overall, nine studies totalling 436 participants were eligible. The average age of the participants ranged from 36 to 60 years. During an updated search on 10 August 2023, no further studies were identified.

Fig 1

Flow of studies in systematic review and meta-analysis

After screening of the title and abstract, 61 titles remained for full text review. Native speakers helped to translate papers in languages other than English. The most common reasons for exclusion were the inclusion of healthy volunteers, absence of control groups, and use of a survey based design rather than an experimental design. After full text screening, nine studies were eligible for inclusion, and 15 clinical trials prospectively registered or underway as of August 2023 were noted for potential future inclusion in an update of this review (see supplementary Appendix D).

We sent requests for further information to the authors of studies by Griffiths et al, 57 Barrett, 58 and Benville et al, 59 because these studies appeared to meet the inclusion criteria but were only provided as summary abstracts online. A potentially eligible poster presentation from the 58th annual meeting of the American College of Neuropsychopharmacology was identified, but the lead author (Griffiths) clarified that all information from the presentation was included in the studies by Davis et al 13 and Gukasyan et al, 60 both of which we had already deemed ineligible.

Barrett 58 reported the effects of psilocybin on the cognitive flexibility and verbal reasoning of a subset of patients with major depressive disorder from Griffiths et al’s trial, 61 compared with a waitlist group, but when contacted, Barrett explained that the results were published in the study by Doss et al, 62 which we had already screened and judged ineligible (see supplementary Appendix E). Benville et al’s study 59 presented a follow-up of Ross et al’s study 17 on a subset of patients with cancer and high suicidal ideation and desire for hastened death at baseline. Measures of antidepressant effects of psilocybin treatment compared with niacin were taken before and after treatment crossover, but detailed results were not reported. Table 1 describes the characteristics of the included studies and table 2 lists the main findings of the studies.

Table 1 Characteristics of included studies

Table 2 Main findings of included studies

Side effects and adverse events

Side effects reported in the included studies were minor and transient (eg, short term increases in blood pressure, headache, and anxiety), and none were coded as serious. Carhart-Harris et al noted one instance of abnormal dreams and insomnia. 63 This side effect profile is consistent with findings from other meta-analyses. 30 68 Owing to the different scales and methods used to catalogue side effects and adverse events across trials, it was not possible to combine these data quantitatively (see supplementary Appendix F).

Risk of bias

The Cochrane RoB 2 tools were used to evaluate the included studies ( table 3 ). RoB 2 for randomised trials was used for the five reports of parallel randomised trials (Carhart-Harris et al 63 and its secondary analysis Barba et al, 64 Goodwin et al 18 and its secondary analysis Goodwin et al, 65 and von Rotz et al 66 ) and RoB 2 for crossover trials was used for the four reports of crossover randomised trials (Griffiths et al, 14 Grob et al, 15 and Ross et al 17 and its follow-up Ross et al 67 ). Supplementary Appendix G provides a detailed explanation of the assessment of the included studies.

Table 3 Summary risk of bias assessment of included studies, based on domains in Cochrane Risk of Bias 2 tool

Quality of included studies

Confidence in the quality of the evidence for the meta-analysis was assessed using GRADE, 38 through the GRADEpro GDT software program. Figure 2 shows the results of this assessment, along with our summary of findings.

Fig 2

GRADE assessment outputs for outcomes investigated in meta-analysis (change in depression scores and response and remission rates). The risk in the intervention group (and its 95% CI) is based on the assumed risk in the comparison group and the relative effect of the intervention (and its 95% CI). BDI=Beck depression inventory; CI=confidence interval; GRADE=Grading of Recommendations, Assessment, Development and Evaluation; HADS-D=hospital anxiety and depression scale; HAM-D=Hamilton depression rating scale; MADRS=Montgomery-Åsberg depression rating scale; QIDS=quick inventory of depressive symptomatology; RCT=randomised controlled trial; SD=standard deviation

Meta-analyses

Continuous data, change in depression scores —Using a Hartung-Knapp-Sidik-Jonkman modified random effects meta-analysis, change in depression scores was significantly greater after treatment with psilocybin compared with active placebo. The overall Hedges’ g (1.64, 95% CI 0.55 to 2.73) indicated a large effect size favouring psilocybin ( fig 3 ). Prediction intervals (PIs) were, however, wide and crossed the line of no difference (95% PI −1.72 to 5.03), indicating that there could be settings or populations in which psilocybin intervention would be less efficacious.
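For readers unfamiliar with the two quantities reported here, the sketch below shows, with entirely hypothetical summary statistics, how a study-level Hedges' g is commonly derived from group means and standard deviations, and how a 95% prediction interval is obtained from the pooled estimate, its standard error, and the between-study variance τ². It is not the code used for the published analysis.

```python
# Illustrative sketch only: Hedges' g for one study from summary statistics,
# and a 95% prediction interval around a pooled random effects estimate.
# All numbers are hypothetical; the review's analyses used Stata 16.
import numpy as np
from scipy import stats

# --- Hedges' g for a single study (change in depression score, psilocybin v placebo) ---
m_psi, sd_psi, n_psi = -12.0, 7.5, 15   # hypothetical mean change, SD, n (psilocybin)
m_pla, sd_pla, n_pla = -4.0, 8.0, 14    # hypothetical mean change, SD, n (placebo)
s_pooled = np.sqrt(((n_psi - 1) * sd_psi**2 + (n_pla - 1) * sd_pla**2) / (n_psi + n_pla - 2))
d = (m_pla - m_psi) / s_pooled          # sign chosen so that positive values favour psilocybin
j = 1 - 3 / (4 * (n_psi + n_pla - 2) - 1)   # small sample correction factor
g = j * d
var_g = j**2 * ((n_psi + n_pla) / (n_psi * n_pla) + d**2 / (2 * (n_psi + n_pla)))
print(f"Hedges' g = {g:.2f} (variance {var_g:.3f})")

# --- 95% prediction interval for a pooled random effects estimate ---
mu, se_mu, tau2, k = 1.64, 0.46, 1.5, 7     # illustrative pooled g, SE, tau^2, number of studies
half_width = stats.t.ppf(0.975, df=k - 2) * np.sqrt(tau2 + se_mu**2)
print(f"95% prediction interval: {mu - half_width:.2f} to {mu + half_width:.2f}")
```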

Fig 3

Forest plot for overall change in depression scores from before to after treatment. CI=confidence interval; DL=DerSimonian and Laird; HKSJ=Hartung-Knapp-Sidik-Jonkman

Exploring publication bias in continuous data —We used Egger’s test and a funnel plot to examine the possibility of small study biases, such as publication bias. Statistical significance of Egger’s test for small study effects, along with the asymmetry in the funnel plot ( fig 4 ), indicates the presence of bias against smaller studies with non-significant results, suggesting that the pooled intervention effect estimate is likely to be overestimated. 69 An alternative explanation, however, is that smaller studies conducted at the early stages of a new psychotherapeutic intervention tend to include more high risk or responsive participants, and psychotherapeutic interventions tend to be delivered more effectively in smaller trials; both of these factors can exaggerate treatment effects, resulting in funnel plot asymmetry. 70 Also, because of the relatively small number of included studies and the considerable heterogeneity observed, test power may be insufficient to distinguish real asymmetry from chance. 71 Thus, this analysis should be considered exploratory.
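The sketch below shows the classical form of Egger's regression test applied to hypothetical data: the standardised effect size is regressed on precision, and a non-zero intercept is taken as evidence of funnel plot asymmetry. It is illustrative only; the published test was run in Stata 16.

```python
# Illustrative sketch only: Egger's regression test for small study effects.
# Regress the standardised effect (effect / SE) on precision (1 / SE); a non-zero
# intercept suggests funnel plot asymmetry. Hypothetical data.
import numpy as np
from scipy import stats

y = np.array([0.9, 1.4, 2.2, 2.6, 0.8, 1.1, 1.9])             # hypothetical effect sizes
se = np.sqrt([0.12, 0.20, 0.25, 0.30, 0.15, 0.18, 0.22])      # standard errors

z = y / se                  # standardised effects
precision = 1 / se
X = np.column_stack([np.ones_like(precision), precision])

beta, *_ = np.linalg.lstsq(X, z, rcond=None)                  # ordinary least squares
resid = z - X @ beta
df = len(z) - 2
cov = (resid @ resid / df) * np.linalg.inv(X.T @ X)
t_intercept = beta[0] / np.sqrt(cov[0, 0])
p_value = 2 * stats.t.sf(abs(t_intercept), df)
print(f"Egger intercept = {beta[0]:.2f}, P = {p_value:.3f}")
```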

Fig 4

Funnel plot assessing publication bias among studies measuring change in depression scores from before to after treatment. CI=confidence interval; θ IV =estimated effect size under inverse variance random effects model

Dichotomous data

We extracted response and remission rates for each group when reported directly, or imputed information when presented graphically. Two studies did not measure response or remission and thus did not contribute data for this part of the analysis. 15 18 The random effects model with a Hartung-Knapp-Sidik-Jonkman modification was used to allow for heterogeneity to be incorporated into the weighting of the included studies’ results, and to provide a better estimation of between study variance accounting for small sample sizes.
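The sketch below illustrates, with hypothetical event counts, how per-arm response counts can be converted to log risk ratios, pooled with a DerSimonian-Laird random effects model with a Hartung-Knapp-Sidik-Jonkman style interval, and summarised with I². It is not the review's actual Stata code.

```python
# Illustrative sketch only: pooling response rates as risk ratios with a
# DerSimonian-Laird random effects model (HKSJ-style interval) and I^2.
# Hypothetical event counts; the review's analyses used Stata 16.
import numpy as np
from scipy import stats

# events / total per arm for each study (hypothetical)
events_psi = np.array([10, 12, 8, 14, 9, 11, 13])
n_psi      = np.array([15, 20, 12, 22, 14, 17, 19])
events_pla = np.array([4, 6, 3, 7, 5, 4, 6])
n_pla      = np.array([14, 19, 13, 21, 15, 16, 18])

# log risk ratios and their variances
log_rr = np.log((events_psi / n_psi) / (events_pla / n_pla))
v = 1 / events_psi - 1 / n_psi + 1 / events_pla - 1 / n_pla
k = len(log_rr)

# DerSimonian-Laird tau^2, Q, and I^2
w = 1 / v
q = np.sum(w * (log_rr - np.sum(w * log_rr) / np.sum(w)) ** 2)
tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
i2 = max(0.0, (q - (k - 1)) / q) * 100

# pooled risk ratio with HKSJ-style t interval
w_star = 1 / (v + tau2)
mu = np.sum(w_star * log_rr) / np.sum(w_star)
se_hksj = np.sqrt(np.sum(w_star * (log_rr - mu) ** 2) / ((k - 1) * np.sum(w_star)))
ci = np.exp(mu + np.array([-1, 1]) * stats.t.ppf(0.975, df=k - 1) * se_hksj)
print(f"pooled RR = {np.exp(mu):.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f}), I^2 = {i2:.1f}%")
```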

Response rate —Overall, the likelihood of psilocybin intervention leading to treatment response was about two times greater (risk ratio 2.02, 95% CI 1.33 to 3.07) than with placebo. Despite the use of different scales to measure response, the heterogeneity between studies was not significant (I²=25.7%, P=0.23). PIs were, however, wide and crossed the line of no difference (−0.94 to 3.88), indicating that there could be settings or populations in which psilocybin intervention would be less efficacious.

Remission rate —Overall, the likelihood of psilocybin intervention leading to remission of depression was nearly three times greater than with placebo (risk ratio 2.71, 95% CI 1.75 to 4.20). Despite the use of different scales to measure response, no statistical heterogeneity was found between studies (I²=0.0%, P=0.53). PIs were, however, wide and crossed the line of no difference (0.87 to 2.32), indicating that there could be settings or populations in which psilocybin intervention would be less efficacious.

Exploring publication bias in response and remission rates data —We used Egger’s test and a funnel plot to examine whether response and remission estimates were affected by small study biases. The result for Egger’s test was non-significant (P>0.05) for both response and remission estimates, and no substantial asymmetry was observed in the funnel plots, providing no indication for the presence of bias against smaller studies with non-significant results.

Heterogeneity: subgroup analyses and metaregression

Heterogeneity was considerable across studies exploring changes in depression scores (I²=89.7%, P<0.005), triggering subgroup analyses to explore contributory factors. Table 4 and table 5 present the results of the heterogeneity analyses (subgroup analyses and metaregression, respectively). Also see supplementary Appendix H for a more detailed description and graphical representation of these results.

Table 4 Subgroup analyses to explore potential causes of heterogeneity among included studies

Table 5 Metaregression analyses to explore potential causes of heterogeneity among included studies

Cumulative meta-analyses

We used cumulative meta-analyses to investigate how the overall estimates of the outcomes of interest changed as each study was added in chronological order 72 ; change in depression scores and likelihood of treatment response both increased as the percentage of participants with past use of psychedelics increased across studies, as expected based on the metaregression analysis (see supplementary Appendix I). No other significant time related patterns were found.
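The sketch below shows the basic mechanics of a cumulative meta-analysis: studies are ordered (here by hypothetical publication year) and the pooled estimate is recomputed each time a study is added. It is a simplified illustration, not the Stata 16 code used in this review.

```python
# Illustrative sketch only: cumulative meta-analysis, re-pooling the effect as
# each study is added in chronological order. Hypothetical data.
import numpy as np

def dl_pool(y, v):
    """DerSimonian-Laird random effects pooled estimate (tau^2 set to 0 for a single study)."""
    w = 1 / v
    q = np.sum(w * (y - np.sum(w * y) / np.sum(w)) ** 2)
    denom = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / denom) if denom > 0 else 0.0
    w_star = 1 / (v + tau2)
    return np.sum(w_star * y) / np.sum(w_star)

years = np.array([2011, 2016, 2016, 2020, 2021, 2022, 2023])   # hypothetical publication years
y = np.array([0.9, 1.4, 2.2, 2.6, 0.8, 1.1, 1.9])              # hypothetical effect sizes
v = np.array([0.12, 0.20, 0.25, 0.30, 0.15, 0.18, 0.22])

order = np.argsort(years)
for i in range(1, len(y) + 1):
    idx = order[:i]
    print(f"up to {years[order[i - 1]]}: pooled g = {dl_pool(y[idx], v[idx]):.2f} ({i} studies)")
```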

Sensitivity analysis

We reanalysed the data for change in depression scores using a random effects DerSimonian and Laird model without the Hartung-Knapp-Sidik-Jonkman modification and compared the results with those of the original model. All comparisons found to be significant using the DerSimonian and Laird model with the Hartung-Knapp-Sidik-Jonkman adjustment were also significant without the Hartung-Knapp-Sidik-Jonkman adjustment, and confidence intervals were only slightly narrower. Thus, small study effects do not appear to have played a major role in the treatment effect estimate.

Additionally, to estimate the accuracy and robustness of the estimated treatment effect, we excluded studies from the meta-analysis one by one; no important differences in the treatment effect, significance, and heterogeneity levels were observed after the exclusion of any study (see supplementary Appendix J).

Discussion

In our meta-analysis we found that psilocybin use showed a significant benefit on change in depression scores compared with placebo. This is consistent with other recent meta-analyses and trials of psilocybin as a standalone treatment for depression 73 74 or in combination with psychological support. 24 25 29 30 31 32 68 75 This review adds to those findings by exploring the considerable heterogeneity across the studies, with subsequent subgroup analyses showing that the type of depression (primary or secondary) and the depression scale used (Montgomery-Åsberg depression rating scale, quick inventory of depressive symptomatology, or Beck depression inventory) had a significant differential effect on the outcome.

High between study heterogeneity has been identified by some other meta-analyses of psilocybin (eg, Goldberg et al 29 ), with a higher treatment effect in studies with patients with comorbid life threatening conditions compared with patients with primary depression. 22 Although possible explanations, including personal factors (eg, patients with life threatening conditions being older) or depression related factors (eg, secondary depression being more severe than primary depression), could be considered, these hypotheses are not supported by baseline data (ie, patients with secondary depression do not differ substantially in age or symptom severity from patients with primary depression).

The differential effects of the assessment scales used have not been examined in other meta-analyses of psilocybin, but this review’s finding that studies using the Beck depression inventory showed a higher treatment effect than those using the Montgomery-Åsberg depression rating scale and quick inventory of depressive symptomatology is consistent with studies in the psychological literature that have shown larger treatment effects when self-report scales are used (eg, Beck depression inventory). 76 77 This finding may be because clinicians tend to overestimate the severity of depression symptoms at baseline assessments, leading to less pronounced differences between before and after treatment identified in clinician assessed scales (eg, Montgomery-Åsberg depression rating scale, quick inventory of depressive symptomatology). 78

Metaregression analyses further showed that a higher average age and a higher percentage of participants with past use of psychedelics both correlated with a greater improvement in depression scores with psilocybin use and explained a substantial amount of between study variability. However, the cumulative meta-analysis showed that the effects of age might be largely an artefact of the inclusion of one specific study, and alternative explanations are worth considering. For instance, Studerus et al 79 identified participants’ age as the only personal variable significantly associated with psilocybin response, with older participants reporting a higher “blissful state” experience. This might be because of older people’s increased experience in managing negative emotions and the decrease in 5-hydroxytryptamine type 2A receptor density associated with older age. 80 Furthermore, Rootman et al 81 reported that the cognitive performance of older participants (>55 years) improved significantly more than that of younger participants after microdosing with psilocybin. Therefore, the greater decrease in depressive symptoms associated with older age could be attributed to a decrease in cognitive difficulties experienced by older participants.

Interestingly, a clear pattern emerged for past use of psychedelics—the higher the proportion of study participants who had used psychedelics in the past, the higher the post-psilocybin treatment effect observed. Past use of psychedelics has been proposed to create an expectancy bias among participants and amplify the positive effects of psilocybin 82 83 84 ; however, this important finding has not been examined in other meta-analyses and may highlight the role of expectancy in psilocybin research.

Limitations of this study

Generalisability of the findings of this meta-analysis was limited by the lack of racial and ethnic diversity in the included studies—more than 90% of participants were white across all included trials, resulting in a homogeneous sample that is not representative of the general population. Moreover, it was not possible to distinguish between subgroups of participants who had never used psilocybin and those who had taken psilocybin more than a year before the start of the trial, as these data were not provided in the included studies. Such a distinction would be important, as the effects of psilocybin on mood may wane within a year after being administered. 21 85 Also, how psychological support was conceptualised was inconsistent across the studies of psilocybin interventions; many studies failed to clearly describe the type of psychological support participants received, and others used methods ranging from directive guidance throughout the treatment session to passive encouragement or reassurance (eg, Griffiths et al, 14 Carhart-Harris et al 63 ). The included studies also did not gather evidence on participants’ previous experiences with treatment approaches, which could influence their response to the trials’ intervention. Thus, differences between participant subgroups related to past use of psilocybin or psychotherapy may be substantial and could help interpret this study’s findings more accurately. Lastly, the use of graphical extraction software to estimate the findings of studies where exact numerical data were not available (eg, Goodwin et al, 18 Grob et al 15 ) may have affected the robustness of the analyses.

A common limitation in studies of psilocybin is the likelihood of expectancy effects augmenting the treatment effect observed. Although some studies used low dose psychedelics as comparators to deal with this problem (eg, Carhart-Harris et al, 63 Goodwin et al, 18 Griffiths et al 14 ) or used a niacin placebo that can induce effects similar to those of psilocybin (eg, Grob et al, 15 Ross et al 17 ), the extent to which these methods were effective in blinding participants is not known. Other studies have, however, reported that participants can accurately identify the study groups to which they had been assigned 70-85% of the time, 84 86 indicating a high likelihood of insufficient blinding. This is especially likely for studies in which a high proportion of participants had previously used psilocybin and other hallucinogens, making the identification of the drug’s acute effects easier (eg, Griffiths et al, 14 Grob et al, 15 Ross et al 17 ). Patients also have expectations related to the outcome of their treatment, expecting psilocybin to improve their symptoms of depression, and these positive expectancies are strong predictors of actual treatment effects. 87 88 Importantly, the effect of outcome expectations on treatment effect is particularly strong when patient reported measures are used as primary outcomes, 89 which was the case in several of the included studies (eg, Griffiths et al, 14 Grob et al, 15 Ross et al 17 ). Unfortunately, none of the included studies recorded expectations before treatment, so it is not possible to determine the extent to which this factor affected the findings.

Implications for clinical practice

Although this review’s findings are encouraging for psilocybin’s potential as an effective antidepressant, a few areas about its applicability in clinical practice remain unexplored. Firstly, it is unclear whether the protocols for psilocybin interventions in clinical trials can be reliably and safely implemented in clinical practice. In clinical trials, patients receive psilocybin in a non-traditional medical setting, such as a specially designed living room, while they may be listening to curated calming music and are isolated from most external stimuli by wearing eyeshades and external noise-cancelling earphones. A trained therapist closely supervises these sessions, and the patient usually receives one or more preparatory sessions before the treatment commences. Standardising an intervention setting with so many variables is unlikely to be achievable in routine practice, and there is little consensus on the psychotherapeutic training and accreditations needed for a therapist to deliver such treatment. 90 The combination of these elements makes this a relatively complex and expensive intervention, which could make it challenging to gain approval from regulatory agencies and reimbursement from insurance companies and other payers. Within publicly funded healthcare systems, the high cost of treatment may make psilocybin treatment inaccessible. The high cost associated with the intervention also increases the risk that unregulated clinics may attempt to cut costs by making alterations to the protocol and the therapeutic process, 91 92 which could have detrimental effects for patients. 92 93 94 Thus, avoiding the conflation of medical and commercial interests is a primary concern that needs to be dealt with before psilocybin enters mainstream practice.

Implications for future research

More large scale randomised trials with long follow-up are needed to fully understand psilocybin’s treatment potential, and future studies should aim to recruit a more diverse population. Another factor that would make clinical trials more representative of routine practice would be to recruit patients who are currently using or have used commonly prescribed serotonergic antidepressants. Clinical trials tend to exclude such participants because many antidepressants that act on the serotonin system modulate the 5-hydroxytryptamine type 2A receptor that psilocybin primarily acts upon, with prolonged use of tricyclic antidepressants associated with more intense psychedelic experiences and use of monoamine oxidase inhibitors or SSRIs inducing weaker responses to psychedelics. 95 96 97 Investigating psilocybin in such patients would, however, provide valuable insight on how psilocybin interacts with commonly prescribed drugs for depression and would help inform clinical practice.

Minimising the influence of expectancy effects is another core problem for future studies. One strategy would be to include expectancy measures and explore the level of expectancy as a covariate in statistical analysis. Researchers should also test the effectiveness of condition masking. Another proposed solution would be to adopt a 2×2 balanced placebo design, where both the drug (psilocybin or placebo) and the instructions given to participants (told they have received psilocybin or told they have received placebo) are crossed. 98 Alternatively, clinical trials could adopt a three arm design that includes both an inactive placebo (eg, saline) and an active placebo (eg, niacin, lower psilocybin dose), 98 allowing for the effects of psilocybin to be separated from those of the placebo.

Overall, future studies should explore psilocybin’s exact mechanism of treatment effectiveness and outline how its physiological effects, mystical experiences, dosage, treatment setting, psychological support, and relationship with the therapist all interact to produce a synergistic antidepressant effect. Although this may be difficult to achieve using an explanatory randomised trial design, pragmatic clinical trial designs may be better suited to psilocybin research, as their primary objective is to achieve high external validity and generalisability. Such studies may include multiple alternative treatments rather than simply an active and placebo treatment comparison (eg, psilocybin v SSRI v serotonin-noradrenaline reuptake inhibitor), and participants would be recruited from broader clinical populations. 99 100 Although such studies are usually conducted after a drug’s launch, 100 earlier use of such designs could help assess the clinical effectiveness of psilocybin more robustly and broaden patient access to a novel type of antidepressant treatment.

Conclusions

This review’s findings on psilocybin’s efficacy in reducing symptoms of depression are encouraging for its use in clinical practice as a drug intervention for patients with primary or secondary depression, particularly when combined with psychological support and administered in a supervised clinical environment. However, the highly standardised treatment setting, high cost, and lack of regulatory guidelines and legal safeguards associated with psilocybin treatment need to be dealt with before it can be established in clinical practice.

What is already known on this topic

Recent research on treatments for depression has focused on psychedelic agents that could have strong antidepressant effects without the drawbacks of classic antidepressants; psilocybin being one such substance

Over the past decade, several clinical trials, meta-analyses, and systematic reviews have investigated the use of psilocybin for symptoms of depression, and most have found that psilocybin can have antidepressant effects

Studies published to date have not investigated factors that may moderate psilocybin’s effects, including type of depression, past use of psychedelics, dosage, outcome measures, and publication biases

What this study adds

This review showed a significantly greater efficacy of psilocybin among patients with secondary depression, patients with past use of psychedelics, older patients, and studies using self-report measures for symptoms of depression

Efficacy did not appear to be homogeneous across patient types—for example, those with depression and a life threatening illness appeared to benefit more from treatment

Further research is needed to clarify the factors that maximise psilocybin’s treatment potential for symptoms of depression

Ethics statements

Ethical approval.

The ethics committee of the University of Oxford Nuffield Department of Medicine waived the need for ethical approval and for consent to collect, analyse, and publish the retrospectively obtained anonymised data for this non-interventional study.

Data availability statement

The relevant aggregated data and statistical code will be made available on reasonable request to the corresponding author.

Acknowledgments

We thank DT who acted as an independent secondary reviewer during the study selection and data review process.

Contributors: AMM contributed to the design and implementation of the research, analysis of the results, and writing of the manuscript. MC was involved in planning and supervising the work and contributed to the writing of the manuscript. AMM and MC are the guarantors. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.

Funding: None received.

Competing interests: All authors have completed the ICMJE uniform disclosure form at https://www.icmje.org/disclosure-of-interest/ and declare: no support from any organisation for the submitted work; AMM is employed by IDEA Pharma, which does consultancy work for pharmaceutical companies developing drugs for physical and mental health conditions; MC was the supervisor for AMM’s University of Oxford MSc dissertation, which forms the basis for this paper; no other relationships or activities that could appear to have influenced the submitted work.

Transparency: The corresponding author (AMM) affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as registered have been explained.

Dissemination to participants and related patient and public communities: To disseminate our findings and increase the impact of our research, we plan on writing several social media posts and blog posts outlining the main conclusions of our paper. These will include blog posts on the websites of the University of Oxford’s Department of Primary Care Health Sciences and Department for Continuing Education, as well as print publications, which are likely to reach a wider audience. Furthermore, we plan to present our findings and discuss them with the public in local mental health related events and conferences, which are routinely attended by patient groups and advocacy organisations.

Provenance and peer review: Not commissioned; externally peer reviewed.

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/ .

References

  • World Health Organization. Depressive disorder (depression). 2023. https://www.who.int/news-room/fact-sheets/detail/depression
  • Borenstein M, Hedges L, Rothstein H. Meta-analysis: fixed effect vs. random effects. Meta-analysis.com 2007:1-62.
  • Higgins JP, Green S. Identifying and measuring heterogeneity. In: Cochrane handbook for systematic reviews of interventions. 2011;5(0).
  • Iyengar S, Greenhouse J. Sensitivity analysis and diagnostics. In: Handbook of research synthesis and meta-analysis. Russell Sage Foundation, 2009:417-33.
  • Griffiths R, Barrett F, Johnson M, Mary C, Patrick F, Alan D. Psilocybin-assisted treatment of major depressive disorder: results from a randomized trial. Proceedings of the ACNP 58th Annual Meeting: Poster Session II. Neuropsychopharmacology 2019;44:230-384.
  • Barrett F. ACNP 58th Annual Meeting: panels, mini-panels and study groups [abstract]. Neuropsychopharmacology 2019;44:1-77. doi:10.1038/s41386-019-0544-z
