
Conducting Clinical Research During the COVID-19 Pandemic: Protecting Scientific Integrity

  • 1 Department of Biostatistics, University of Washington, Seattle
  • 2 Department of Biometrics and Data Sciences, Bristol Myers Squibb, Princeton, New Jersey
  • 3 Statistics Collaborative, Washington, DC

The current novel coronavirus disease 2019 (COVID-19) pandemic has led to substantial changes in health risks, access to health care, and daily interactions. Through these and other challenges, the pandemic is affecting ongoing clinical trials that are evaluating interventions aimed at preventing or treating diseases other than COVID-19. Meaningful alterations to the implementation of protocol-specified procedures for adherence and retention of study participants, without careful consideration of the consequences to statistical analysis, can compromise the generalizability of clinical trial results about efficacy and safety of studied interventions in the postpandemic setting.

Recent guidance from the US Food and Drug Administration¹ urges sponsors of clinical trials to be “assuring the safety of trial participants, maintaining compliance with good clinical practice (GCP), and minimizing the risks to trial integrity during the COVID-19 pandemic.” To achieve these goals, trialists should identify activities that do not place study participants at increased risk of COVID-19 due to study-specific procedures. While ensuring safety, trials should achieve timely recruitment, proper adherence to protocol-specified procedures, high retention of participants, and proper statistical analyses to avoid undue loss of statistical power and increased risk of bias due to informative missing data. This Viewpoint discusses procedures that would “ensure the rights, safety and wellbeing of participants,”² while mitigating risks to trial integrity.

Potentially Delaying or Pausing Enrollment

Trials may proceed essentially unchanged if enrolled participants can complete protocol procedures safely and thus contribute to important analyses. Sometimes, however, a wiser course is to delay initiation of enrollment in trials that have not yet started or to pause enrollment in ongoing trials, perhaps on a site-specific basis, until the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) viral burden in that setting is low. Later reinitiation of enrollment to achieve protocol-specified statistical power can begin after the study team judges that it can adequately manage risks of COVID-19. Such an approach is particularly important if concurrent illnesses, both directly and indirectly related to COVID-19, could confound the effect of study treatment on the main safety and efficacy outcomes.

Attaining Best Achievable Adherence to Study Interventions and High Levels of Retention

Careful attention to administration of study drugs is needed to reduce risk of bias from nonadherence to study products caused by the COVID-19 pandemic.³ Ideally, adherence to study drugs should be consistent with levels clinically achievable in nonpandemic settings. Approaches to increasing adherence without increasing risk of SARS-CoV-2 infection could include enabling study medications to be taken at home by the patient,⁴ having health care workers make home visits while wearing personal protective equipment, or enabling delivery of injections in clinical facilities capable of achieving adequate social distancing.

Methods that facilitate more complete data collection during the COVID-19 pandemic also are crucial to increasing the validity of assessments of efficacy and safety.⁵,⁶ To prevent disruption of data collection, trialists should consider approaches such as electronic data capture implemented at home by the patient or caregiver, telemedicine, or telephone interviews.⁴ Additional procedures that could increase the validity of critical outcome assessments include centralized data monitoring, digital technology, home nursing visits, or use of local instead of central laboratories. Some data, even though imperfectly collected, usually are more useful than no data.

If an outbreak of COVID-19 leads to interruption of delivery of the intervention and study assessments at a site, study staff should maintain contact with participants to enhance the likelihood of retention after the intensity of the outbreak has waned.

Study staff should maintain a list of patients whose participation in the trial has been adversely affected by COVID-19, along with the nature of those consequences. The list should capture the type of missing information, as well as the reasons. Insights about missingness may be used to inform modifications to the planned statistical analyses. All changes to data collection should be discussed with clinicians, statisticians, operational staff, and data management teams and should be well documented.
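As an illustration only, the list described above could be kept as structured records so that the type of missing information and the reasons can be tabulated later. The following is a minimal sketch; the field names, identifiers, and example entries are hypothetical and do not come from any particular trial or data standard.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DisruptionRecord:
    """One pandemic-related disruption for one participant (hypothetical schema)."""
    participant_id: str
    site_id: str
    event_date: date
    missing_item: str  # what was not collected or delivered
    reason: str        # why, as recorded by study staff


def summarize_by_reason(records):
    """Count records for each (missing_item, reason) pair."""
    return Counter((r.missing_item, r.reason) for r in records)


# Made-up entries for illustration.
log = [
    DisruptionRecord("P001", "S01", date(2020, 4, 2), "week 12 visit", "site closed"),
    DisruptionRecord("P002", "S01", date(2020, 4, 9), "week 12 visit", "site closed"),
    DisruptionRecord("P003", "S02", date(2020, 4, 15), "study drug dose", "participant quarantined"),
]
summary = summarize_by_reason(log)
for (item, reason), count in summary.items():
    print(f"{item} / {reason}: {count}")
```

A tabulation of this kind gives blinded reviewers a compact view of what is missing and why, without reference to treatment assignment.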

Prespecifying Analyses to Address Effects of the Pandemic on Trial Integrity

The pandemic may lead to the need to revise the statistical methods planned for the trial’s primary and secondary analyses. Individuals blinded to emerging trial data about efficacy and safety should identify and prespecify sensible revised approaches to analyses. In some cases, the primary analysis would exclude intermittent intervals of calendar time that meet prespecified site-specific criteria for severe disruption from the COVID-19 pandemic (eg, substantively reduced ability to deliver blinded study drug or to retain participants). In trials that were relatively near completion when severe disruption began, the study team (not the data monitoring committee) could decide to terminate the trial, thus sacrificing a small degree of statistical power in exchange for more interpretable inference. In other trials, the investigators could justify restarting enrollment after the period of severe disruption, enabling a trial to achieve its specified goals by successfully building on the prepandemic data.
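One way to picture the exclusion of prespecified calendar intervals is as a filter applied to dated observations. The sketch below assumes that the disruption windows have already been fixed, per site, by individuals blinded to outcome data; the window dates, site identifiers, and observation layout are hypothetical.

```python
from datetime import date

# Hypothetical prespecified windows of severe disruption, keyed by site.
# Each window is an inclusive (start, end) pair of calendar dates.
DISRUPTION_WINDOWS = {
    "S01": [(date(2020, 3, 20), date(2020, 5, 31))],
    "S02": [(date(2020, 4, 1), date(2020, 4, 30)),
            (date(2020, 11, 1), date(2020, 12, 15))],
}


def in_disruption(site_id, day, windows=DISRUPTION_WINDOWS):
    """True if `day` falls inside any prespecified window for the site."""
    return any(start <= day <= end for start, end in windows.get(site_id, []))


def split_by_disruption(observations, windows=DISRUPTION_WINDOWS):
    """Partition observations into (outside, within) disruption windows."""
    outside, within = [], []
    for obs in observations:
        if in_disruption(obs["site_id"], obs["date"], windows):
            within.append(obs)
        else:
            outside.append(obs)
    return outside, within


# Made-up observations for illustration.
observations = [
    {"participant_id": "P001", "site_id": "S01", "date": date(2020, 2, 10)},
    {"participant_id": "P001", "site_id": "S01", "date": date(2020, 4, 10)},
    {"participant_id": "P002", "site_id": "S02", "date": date(2020, 5, 5)},
    {"participant_id": "P003", "site_id": "S03", "date": date(2020, 4, 10)},
]
primary_set, excluded_set = split_by_disruption(observations)
```

Keeping the excluded observations in a separate set, rather than discarding them, leaves them available for supportive analyses.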

A protocol amendment or revised statistical analysis plan could specify additional modifications to planned study procedures, patient populations, and statistical methods made in response to the pandemic. The reasons for these modifications should be clearly and completely presented and dated. The appropriate protocol review committee, established by the sponsor and the relevant regulatory authorities, should review and approve these changes. The institutional review boards should be informed of operational changes to the protocol. Database lock should occur only after all of these steps are completed and the data quality is confirmed. These changes should be detailed in the Methods section of the research report, and any protocol amendment should be submitted to journals at the time of submission and highlighted in the cover letter.

Addressing Analytical Issues That Are Important in Protecting Trial Integrity

Valid statistical approaches should guide the presentation of results of clinical trials for which the conduct has been meaningfully influenced by the pandemic. For example, if data are collected during the period of severe disruption in a manner different from the approach originally planned, the analysis could stratify the data by method of collection.
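The article does not prescribe a particular stratified method. As one hedged example, a difference in means could be estimated within each collection method and then combined with inverse-variance weights. The sketch below uses made-up values, hypothetical stratum labels, and a normal approximation for the interval.

```python
from math import sqrt
from statistics import mean, variance


def stratum_estimate(treatment, control):
    """Difference in means and its estimated variance within one stratum."""
    diff = mean(treatment) - mean(control)
    var = variance(treatment) / len(treatment) + variance(control) / len(control)
    return diff, var


def pooled_estimate(strata, z=1.96):
    """Inverse-variance weighted difference in means across strata.

    `strata` maps a collection method to (treatment_values, control_values).
    Returns (estimate, lower, upper) using a normal approximation.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for treatment, control in strata.values():
        diff, var = stratum_estimate(treatment, control)
        weight = 1.0 / var
        weighted_sum += weight * diff
        total_weight += weight
    estimate = weighted_sum / total_weight
    se = sqrt(1.0 / total_weight)
    return estimate, estimate - z * se, estimate + z * se


# Made-up outcome values, grouped by how they were collected.
strata = {
    "clinic visit": ([5, 7, 9], [4, 6, 8]),
    "telephone": ([6, 8, 10], [4, 6, 8]),
}
estimate, lower, upper = pooled_estimate(strata)
print(f"pooled difference = {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Stratifying in this way compares treatment arms only among outcomes collected by the same method, so a systematic shift introduced by the method itself does not contaminate the treatment comparison.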

Presentation of the results should focus on prespecified primary analyses of the primary and key secondary end points, as defined by the version of the statistical analysis plan that was in place when the database was locked. Sensitivity analyses, prespecified and post hoc, should be presented for these end points to assess the robustness of results. The analyses should address the influence of informative missingness and of deviations from protocol-specified levels of adherence. Descriptions of analyses should clearly delineate which of these irregularities were due to the COVID-19 pandemic.
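A simple way to gauge how much informative missingness could matter for a binary end point is to bound the treatment effect under the two most extreme assumptions about the missing outcomes. This is only one of many possible sensitivity analyses and is not a method named in the article; the counts below are made up, and "event" is assumed to mean an unfavorable outcome.

```python
def extreme_case_bounds(treatment, control):
    """Risk difference (treatment minus control) under extreme imputations.

    Each arm is a dict with counts for 'events', 'non_events', and 'missing'.
    Returns (most_favorable, least_favorable) to treatment, where an event
    is assumed to be an unfavorable outcome.
    """
    n_t = treatment["events"] + treatment["non_events"] + treatment["missing"]
    n_c = control["events"] + control["non_events"] + control["missing"]
    # Most favorable: no missing treatment participant had an event,
    # every missing control participant did.
    most_favorable = (treatment["events"] / n_t
                      - (control["events"] + control["missing"]) / n_c)
    # Least favorable: the reverse assumption.
    least_favorable = ((treatment["events"] + treatment["missing"]) / n_t
                       - control["events"] / n_c)
    return most_favorable, least_favorable


# Made-up counts for illustration.
treatment_arm = {"events": 20, "non_events": 70, "missing": 10}
control_arm = {"events": 30, "non_events": 65, "missing": 5}
best, worst = extreme_case_bounds(treatment_arm, control_arm)
print(f"risk difference ranges from {best:+.3f} to {worst:+.3f}")
```

If the conclusion holds across the whole range, missing data are unlikely to explain the result; if it does not, more refined assumptions about the missingness mechanism are needed.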

Descriptive supportive analyses of treatment effects should present estimates and corresponding confidence intervals rather than P values. Traditional forest plots show estimated treatment effects across subgroups formed by baseline covariates; similar plots could explore the influence of the COVID-19 pandemic on trial results. For example, when prespecified primary analyses have excluded intermittent intervals of calendar time that meet prespecified criteria for severe disruption from the COVID-19 pandemic, forest plots could compare effects within and outside those intermittent intervals.
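The numbers behind such a forest plot are simply an estimate and a confidence interval per row. The sketch below computes a risk difference with a Wald interval for observations within and outside the disruption intervals; the counts are made up, and the choice of effect measure and interval method is an assumption made for illustration.

```python
from math import sqrt


def risk_difference_ci(events_t, n_t, events_c, n_c, z=1.96):
    """Risk difference (treatment minus control) with a Wald confidence interval."""
    p_t = events_t / n_t
    p_c = events_c / n_c
    rd = p_t - p_c
    se = sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    return rd, rd - z * se, rd + z * se


# Made-up counts: (treatment events, treatment n, control events, control n).
periods = {
    "outside disruption intervals": (30, 100, 20, 100),
    "within disruption intervals": (12, 40, 10, 40),
}
rows = {label: risk_difference_ci(*counts) for label, counts in periods.items()}
for label, (rd, lower, upper) in rows.items():
    print(f"{label:30s} RD = {rd:+.3f} (95% CI {lower:+.3f} to {upper:+.3f})")
```

Reporting the rows side by side, without P values, keeps the comparison descriptive, in line with the recommendation above.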

Trialists should present and interpret the results of clinical trials objectively, explicitly recognizing both the strengths of the analyses and the uncertainties resulting from the pandemic. The analyses should aim to make inferences relevant to the postpandemic period. If the COVID-19 pandemic has meaningfully compromised trial conduct, confirmatory trials to achieve targeted levels of reliability may be needed.

Corresponding Author: Thomas R. Fleming, PhD, Department of Biostatistics, University of Washington, PO Box 357232, Seattle, WA 98195-7232.

Published Online: May 28, 2020. doi:10.1001/jama.2020.9286

Conflict of Interest Disclosures: Dr Fleming reported receiving support from the National Institutes of Health and Bristol Myers Squibb, and having extensive interactions in coronavirus disease 2019 (COVID-19) research with the World Health Organization Research and Development Working Group. Dr Labriola reported receiving support from Global Drug Development/Biometrics and Data Sciences of Bristol Myers Squibb. Dr Wittes reported that her employer (Statistics Collaborative) has contracts for statistical collaborations with many companies that have ongoing studies potentially affected by the COVID-19 pandemic.


Fleming TR, Labriola D, Wittes J. Conducting Clinical Research During the COVID-19 Pandemic: Protecting Scientific Integrity. JAMA. 2020;324(1):33–34. doi:10.1001/jama.2020.9286

research articles clinical

Genetic associations of cardiovascular risk genes in European patients with coronary artery spasm

  • Roman Tremmel
  • Valeria Martínez Pereyra

research articles clinical

Fluid balance during acute phase extracorporeal cardiopulmonary resuscitation and outcomes in OHCA patients: a retrospective multicenter cohort study

  • Takuya Taira
  • Akihiko Inoue
  • The SAVE-J II study group

research articles clinical

The role of cardiac magnetic resonance in sports cardiology: results from a large cohort of athletes

  • Viviana Maestrini
  • Marco Penza
  • Antonio Pelliccia

research articles clinical

Mortality and rehospitalization in patients with pre-existing implantable pacemakers undergoing catheter ablation are related to increased comorbidity burden—data from the German Ablation Registry

  • Gerrit Frommeyer
  • Florian Reinke
  • Lars Eckardt

research articles clinical

Prognostic impact of prior LVEF in patients with heart failure with mildly reduced ejection fraction

  • Alexander Schmitt
  • Michael Behnes
  • Tobias Schupp

research articles clinical

Safety and feasibility of early discharge after transcatheter aortic valve implantation with ACURATE Neo—the POLESTAR trial

  • Joris F. Ooms
  • Kristoff Cornelis
  • Nicolas M. Van Mieghem

research articles clinical

Prevalence of elevated lipoprotein(a) in cardiac rehabilitation patients — results from a large-scale multicentre registry in Germany

  • Christoph Altmann
  • Nelu-Adrian Burlacu
  • on behalf of the MEDIAN Medical Board Cardiology

research articles clinical

Frequency, characteristics and risk assessment of pulmonary arterial hypertension with a left heart disease phenotype

  • Matteo Toma
  • Giulio Savonitto
  • Pietro Ameri

research articles clinical

Mitral annular disjunction in out-of-hospital cardiac arrest patients—a retrospective cardiac MRI study

  • Felix Troger

research articles clinical

Heart failure with preserved ejection fraction: diagnosis, risk assessment, and treatment

  • Stephan von Haehling
  • Birgit Assmus
  • Johann Bauersachs

research articles clinical

Evaluating the predictive value of late gadolinium enhancement assessed by cardiac magnetic resonance on sudden cardiac death in patients selected for implantable cardioverter defibrillator and cardiac resynchronization therapy implantation: a systematic review and meta-analysis

  • Richárd Masszi
  • Előd-János Zsigmond
  • Annamária Kosztin

research articles clinical

The role of coronary artery disease in lung transplantation: a propensity-matched analysis

  • Enzo Lüsebrink
  • Nikolaus Kneidinger

research articles clinical

Association of an impaired GH-IGF-I axis with cardiac wasting in patients with advanced cancer

  • Ann-Kathrin Fröhlich
  • Jan Porthun
  • Markus S. Anker

research articles clinical

Drivers and recent trends of hospitalisation costs related to acute pulmonary embolism

  • Katharina Mohr
  • Lukas Hobohm
  • Karsten Keller

research articles clinical

Impact of pulmonary hypertension on outcomes after TEER in patients suffering from mitral regurgitation

  • Philippa Jaeger
  • Ioannis Toskas
  • Dominik Rath

research articles clinical

Performance of risk models to predict mortality risk for patients with heart failure: evaluation in an integrated health system

  • Faraz S. Ahmad
  • Ted Ling Hu
  • Claudio Campagnari

research articles clinical

Giant coronary aneurysm and acute myocardial infarction: clinical case report and literature review

  • Barbara Pala
  • Giuliano Tocci
  • Domenico Gabrielli

research articles clinical

Global longitudinal strain in long-term risk prediction after acute coronary syndrome: an investigation of added prognostic value to ejection fraction

  • Joel Lenell
  • Bertil Lindahl
  • Tomasz Baron

research articles clinical

Short- and long-term outcomes of patients with active cancer presenting with an acute coronary syndrome

  • Inbar Nardi Agmon
  • Katia Orvin

research articles clinical

Age- and sex-specific physiological cardiac remodeling: the search for the Fountain of Youth

  • Philipp Markwirth
  • Bernhard Haring

research articles clinical

Lower revascularization rates after high-speed rotational atherectomy compared to modified balloons in calcified coronary lesions: 5-year outcomes of the randomized PREPARE-CALC trial

  • Nader Mankerious
  • Gert Richardt
  • Mohamed Abdel-Wahab

research articles clinical

Balloon technologies for pulmonary vein isolation—12-month outcome and comparison of the novel radiofrequency balloon with the cryoballoon in patients with paroxysmal atrial fibrillation

  • Jan-Hendrik van den Bruck
  • Jonas Wörmann
  • Daniel Steven

research articles clinical

Aortic regurgitation is associated with African American and Asian race, smoking, renal disease, and numerous autoimmune diseases in addition to traditional cardiovascular risk factors but has lower risk with alcohol intake

  • Brandon Timmerman
  • Mehrtash Hashemzadeh
  • Mohammad Reza Movahed

research articles clinical

Systematic underestimation of myocardial perfusion reserve by regadenoson stress perfusion CMR—when haste makes waste

  • Georgios Moutzoukis
  • Marie K. Lorenz
  • Andreas Seitz

research articles clinical

Transcatheter aortic valve implantation in patients with significant septal hypertrophy

  • Martin Beyer
  • Till Joscha Demal
  • Andreas Schaefer

research articles clinical

Sex-specific structural and functional cardiac remodeling during healthy aging assessed by cardiovascular magnetic resonance

  • Leonhard Grassow
  • Jan Gröschel
  • Jeanette Schulz-Menger

research articles clinical

Ultra-long-term efficacy and safety of catheter-based renal denervation in resistant hypertension: 10-year follow-up outcomes

  • Hussam Al Ghorani
  • Saarraaken Kulenthiran
  • Felix Mahfoud

research articles clinical

Effect of supervised exercise training on cardiovascular function in patients with intermittent claudication: a systematic review and meta-analysis of randomized controlled trials

  • Yu-Chen Xiao
  • Wan-Yang Li
  • Yang-Kai Wang

research articles clinical

Clinical benefit and limitations of CT imaging substrate visualization technology for VT ablation

  • Naoya Kataoka
  • Teruhiko Imamura

Unravelling gender differences in coronary artery disease: are we equal?

  • Kyriakos Dimitriadis
  • Panayiotis Iliakis
  • Konstantinos Tsioufis

Incidence and predictors of left atrial thrombus in patients with atrial fibrillation under anticoagulation therapy

  • Joong Min Lee
  • Myung-Jin Cha
  • Min Soo Cho

research articles clinical

Deferral of non-emergency cardiac interventions is associated with increased emergency hospitalizations up to 24 months post-procedure

  • Stefanie Andreß
  • Dominik Felbel
  • Tilman Stephan

research articles clinical

Role of preexisting right ventricular remodeling in symptoms and prognosis after transcatheter tricuspid valve repair

  • Marc-André Ehrenfels
  • Caroline Fretter
  • Christos Iliadis

research articles clinical

Risk of death, thrombotic and hemorrhagic events in anticoagulated patients with atrial fibrillation and systemic autoimmune diseases: an analysis from a global federated dataset

  • Tommaso Bucci
  • Chiara Cardamone
  • Gregory Y. H. Lip

research articles clinical

Clinical value of a comprehensive clinical- and echocardiography-based risk score on predicting cardiovascular outcomes in ischemic heart failure patients with reduced ejection fraction

  • Peter Nordbeck

research articles clinical

Trends of mortality rate in patients with congenital heart defects in Germany—analysis of nationwide data of the Federal Statistical Office of Germany

  • Hashim Abdul-Khaliq
  • Delphina Gomes
  • Martin Poryo

research articles clinical

Non-femoral focused transaxillary access in TAVI: GARY data analysis and future trends

  • Max M. Meertens
  • Sabine Bleiziffer

research articles clinical

90. Jahrestagung der Deutsche Gesellschaft für Kardiologie – Herz- und Kreislaufforschung e.V. (German Cardiac Society)

Human nerve distribution and density around the carotid artery bifurcation.

  • Helge Struthoff
  • Lucas Lauder

research articles clinical

Causal association between lipoproteins and risk of coronary artery disease—a systematic review and meta-analysis of Mendelian randomization studies

  • Rongyuan Yang

research articles clinical

Prognostic implication of heart failure stage and left ventricular ejection fraction for patients with in-hospital cardiac arrest: a 16-year retrospective cohort study

  • Chih-Hung Wang
  • Wen-Jone Chen

research articles clinical

Publisher Correction: Respiratory exchange ratio overshoot during exercise recovery: a promising prognostic marker in HFrEF

  • Marco Vecchiato
  • Daniel Neunhaeuserer
  • Andrea Ermolao

How does electrocardiography-derived compare with angiography-derived coronary microcirculatory resistance index in patients with takotsubo syndrome?

  • John E. Madias

Use of class IC antiarrhythmic drugs in patients with structural heart disease and implantable cardioverter defibrillator

  • Maura M. Zylla
  • Julian Wolfes
  • Patrick Lugenbiel

research articles clinical

Frailty, periinterventional complications and outcome in patients undergoing percutaneous mitral and tricuspid valve repair

  • Matthieu Schäfer
  • Hannah Nöth
  • Roman Pfister

research articles clinical

Incidence and clinical impact of renal failure and bleeding following transcatheter tricuspid valve annuloplasty

  • Thorsten Gietzen
  • Jan Althoff
  • Maria Isabel Körber

research articles clinical

Clinical and serological characterization of acute pleuropericarditis suggests an autoinflammatory pathogenesis and highlights risk factors for recurrent attacks

  • Dorothee Kaudewitz
  • Norbert Blank

research articles clinical

Comparison of Arctic Front Advance Pro and POLARx cryoballoons for ablation therapy of atrial fibrillation: an intraprocedural analysis

  • Vincent Knappe
  • Caroline Lahrmann
  • Thomas Beiert

research articles clinical

Efficacy of dapagliflozin in improving arrhythmia-related outcomes after ablation for atrial fibrillation: a retrospective single-center study

  • Hyeong Jun Noh
  • Sung Joo Cha
  • Jin Kyung Hwang

research articles clinical

Respiratory exchange ratio overshoot during exercise recovery: a promising prognostic marker in HFrEF

research articles clinical

Finding Research Articles in PubMed and CINAHL

Limit to research article, limit by publication type, limit by clinical queries, limit by evidence-based practice.

  • Find Nursing Authors

CINAHL Research Article Option

  • High Sensitivity is the broadest search, to include ALL relevant material. May also include less relevant materials.
  • High Specificity is the most targeted search to include only the most relevant result set. May miss some relevant materials.
  • Best Balance retrieves the best balance between Sensitivity and Specificity.

CINAHL Evidence-Based Practice Option

  • Last Updated: Jun 22, 2023 1:19 PM
  • URL: https://guides.lib.uw.edu/hsl/nmeth403

1959 NE Pacific Street | T334 Health Sciences Building | Box 357155 | Seattle, WA 98195-7155 | 206-543-3390

© 2024 University of Washington | Seattle, WA

CC BY-NC 4.0

Open Access

Peer-reviewed

Research Article

Large language models approach expert-level clinical knowledge and reasoning in ophthalmology: A head-to-head cross-sectional study

Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

* E-mail: [email protected] (AJT); [email protected] (DSJT)

Affiliations University of Cambridge School of Clinical Medicine, Cambridge, United Kingdom, Oxford University Clinical Academic Graduate School, University of Oxford, Oxford, United Kingdom

Roles Data curation, Investigation, Writing – review & editing

Affiliation University of Cambridge School of Clinical Medicine, Cambridge, United Kingdom

Affiliation Eye Institute, Cleveland Clinic Abu Dhabi, Abu Dhabi Emirate, United Arab Emirates

Roles Data curation, Investigation, Writing – original draft, Writing – review & editing

Affiliations University of Cambridge School of Clinical Medicine, Cambridge, United Kingdom, Department of Physiology, Development and Neuroscience, University of Cambridge, Cambridge, United Kingdom

Roles Data curation, Investigation

Affiliation West Suffolk NHS Foundation Trust, Bury St Edmunds, United Kingdom

Affiliation Manchester Royal Eye Hospital, Manchester University NHS Foundation Trust, Manchester, United Kingdom

Affiliation Birmingham and Midland Eye Centre, Sandwell and West Birmingham NHS Foundation Trust, Birmingham, United Kingdom

Affiliation Department of Ophthalmology, Chang Gung Memorial Hospital, Linkou Medical Center, Taoyuan, Taiwan

Affiliation Yong Loo Lin School of Medicine, National University of Singapore, Singapore

Roles Data curation, Investigation, Project administration, Writing – review & editing

Affiliation Bedfordshire Hospitals NHS Foundation Trust, Luton and Dunstable, United Kingdom

Affiliation Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore

Roles Writing – review & editing

Affiliations Birmingham and Midland Eye Centre, Sandwell and West Birmingham NHS Foundation Trust, Birmingham, United Kingdom, Academic Unit of Ophthalmology, Institute of Inflammation and Ageing, University of Birmingham, Birmingham, United Kingdom

Roles Funding acquisition, Project administration

Affiliations Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore, Duke-NUS Medical School, Singapore, Singapore, Byers Eye Institute, Stanford University, Palo Alto, California, United States of America

Roles Conceptualization, Formal analysis, Funding acquisition, Methodology, Project administration, Supervision, Writing – original draft, Writing – review & editing

Affiliations Birmingham and Midland Eye Centre, Sandwell and West Birmingham NHS Foundation Trust, Birmingham, United Kingdom, Academic Unit of Ophthalmology, Institute of Inflammation and Ageing, University of Birmingham, Birmingham, United Kingdom, Academic Ophthalmology, School of Medicine, University of Nottingham, Nottingham, United Kingdom

  • Arun James Thirunavukarasu, 
  • Shathar Mahmood, 
  • Andrew Malem, 
  • William Paul Foster, 
  • Rohan Sanghera, 
  • Refaat Hassan, 
  • Sean Zhou, 
  • Shiao Wei Wong, 
  • Yee Ling Wong, 

PLOS

  • Published: April 17, 2024
  • https://doi.org/10.1371/journal.pdig.0000341

Large language models (LLMs) underlie remarkable recent advances in natural language processing, and they are beginning to be applied in clinical contexts. We aimed to evaluate the clinical potential of state-of-the-art LLMs in ophthalmology using a more robust benchmark than raw examination scores. We trialled GPT-3.5 and GPT-4 on 347 ophthalmology questions before GPT-3.5, GPT-4, PaLM 2, LLaMA, expert ophthalmologists, and doctors in training were trialled on a mock examination of 87 questions. Performance was analysed with respect to question subject and type (first order recall and higher order reasoning). Masked ophthalmologists graded the accuracy, relevance, and overall preference of GPT-3.5 and GPT-4 responses to the same questions. The performance of GPT-4 (69%) was superior to GPT-3.5 (48%), LLaMA (32%), and PaLM 2 (56%). GPT-4 compared favourably with expert ophthalmologists (median 76%, range 64–90%), ophthalmology trainees (median 59%, range 57–63%), and unspecialised junior doctors (median 43%, range 41–44%). Low agreement between LLMs and doctors reflected idiosyncratic differences in knowledge and reasoning with overall consistency across subjects and types (p > 0.05). All ophthalmologists preferred GPT-4 responses over GPT-3.5 and rated the accuracy and relevance of GPT-4 as higher (p < 0.05). LLMs are approaching expert-level knowledge and reasoning skills in ophthalmology. In view of the comparable or superior performance to trainee-grade ophthalmologists and unspecialised junior doctors, state-of-the-art LLMs such as GPT-4 may provide useful medical advice and assistance where access to expert ophthalmologists is limited. Clinical benchmarks provide useful assays of LLM capabilities in healthcare before clinical trials can be designed and conducted.

Author summary

Large language models (LLMs) are the most sophisticated form of language-based artificial intelligence. LLMs have the potential to improve healthcare, and experiments and trials are ongoing to explore potential avenues for LLMs to improve patient care. Here, we test state-of-the-art LLMs on challenging questions used to assess the aptitude of eye doctors (ophthalmologists) in the United Kingdom before they can be deemed fully qualified. We compare the performance of these LLMs to fully trained ophthalmologists as well as doctors in training to gauge the aptitude of the LLMs for providing advice to patients about eye health. One of the LLMs, GPT-4, exhibits favourable performance when compared with fully qualified and training ophthalmologists; and comparisons with its predecessor model, GPT-3.5, indicate that this superior performance is due to improved accuracy and relevance of model responses. LLMs are approaching expert-level ophthalmological knowledge and reasoning, and may be useful for providing eye-related advice where access to healthcare professionals is limited. Further research is required to explore potential avenues of clinical deployment.

Citation: Thirunavukarasu AJ, Mahmood S, Malem A, Foster WP, Sanghera R, Hassan R, et al. (2024) Large language models approach expert-level clinical knowledge and reasoning in ophthalmology: A head-to-head cross-sectional study. PLOS Digit Health 3(4): e0000341. https://doi.org/10.1371/journal.pdig.0000341

Editor: Man Luo, Mayo Clinic Scottsdale, UNITED STATES

Received: July 31, 2023; Accepted: February 26, 2024; Published: April 17, 2024

Copyright: © 2024 Thirunavukarasu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All data are available as supplementary information , excluding copyrighted material from the textbook used for experiments.

Funding: DSWT is supported by the National Medical Research Council, Singapore (NMCR/HSRG/0087/2018; MOH-000655-00; MOH-001014-00), Duke-NUS Medical School (Duke-NUS/RSF/2021/0018; 05/FY2020/EX/15-A58), and Agency for Science, Technology and Research (A20H4g2141; H20C6a0032). DSJT is supported by a Medical Research Council / Fight for Sight Clinical Research Fellowship (MR/T001674/1). These funders were not involved in the conception, execution, or reporting of this review.

Competing interests: AM is a member of the Panel of Examiners of the Royal College of Ophthalmologists and performs unpaid work as an FRCOphth examiner. DSWT holds a patent on a deep learning system to detect retinal disease. DSJT authored the book used in the study and receives royalty from its sales. The other authors have no competing interests to declare.

Introduction

Generative Pre-trained Transformer 3.5 (GPT-3.5) and 4 (GPT-4) are large language models (LLMs) trained on datasets containing hundreds of billions of words from articles, books, and other internet sources [ 1 , 2 ]. ChatGPT is an online chatbot which uses GPT-3.5 or GPT-4 to provide bespoke responses to human users’ queries [ 3 ]. LLMs have revolutionised the field of natural language processing, and ChatGPT has attracted significant attention in medicine for attaining passing level performance in medical school examinations and providing more accurate and empathetic messages than human doctors in response to patient queries on a social media platform [ 3 , 4 , 5 , 6 ]. While GPT-3.5 performance in more specialised examinations has been inadequate, GPT-4 is thought to represent a significant advancement in terms of medical knowledge and reasoning [ 3 , 7 , 8 ]. Other LLMs in wide use include Pathways Language Model 2 (PaLM 2) and Large Language Model Meta AI 2 (LLaMA 2) [ 3 ], [ 9 , p. 2], [ 10 ].

Applications and trials of LLMs in ophthalmological settings have been limited despite ChatGPT’s performance in questions relating to ‘eyes and vision’ being superior to other subjects in an examination for general practitioners [ 7 , 11 ]. ChatGPT has been trialled on the North American Ophthalmology Knowledge Assessment Program (OKAP), and Fellowship of the Royal College of Ophthalmologists (FRCOphth) Part 1 and Part 2 examinations. In both cases, relatively poor results have been reported for GPT-3.5, with significant improvement exhibited by GPT-4 [ 12 , 13 , 14 , 15 , 16 ]. However, previous studies are afflicted by two important issues which may affect their validity and interpretability. First, so-called ‘contamination’, where test material features in the pretraining data used to develop LLMs, may result in inflated performance as models recall previously seen text rather than using clinical reasoning to provide an answer. Second, examination performance in and of itself provides little information regarding the potential of models to contribute to clinical practice as a medical-assistance tool [ 3 ]. Clinical benchmarks are required to understand the meaning and implications of scores in ophthalmological examinations attained by LLMs and are a necessary precursor to clinical trials of LLM-based interventions.

Here, we used FRCOphth Part 2 examination questions to gauge the ophthalmological knowledge base and reasoning capability of LLMs using fully qualified and currently training ophthalmologists as clinical benchmarks. These questions were not freely available online, minimising the risk of contamination. The FRCOphth Part 2 Written Examination tests the clinical knowledge and skills of ophthalmologists in training using multiple choice questions with no negative marking and must be passed to fully qualify as a specialist eye doctor in the United Kingdom.

Question extraction

FRCOphth Part 2 questions were sourced from a textbook for doctors preparing to take the examination [ 17 ]. This textbook is not freely available on the internet, making the possibility of its content being included in LLMs’ training datasets unlikely [ 1 ]. All 360 multiple-choice questions from the textbook’s six chapters were extracted, and a 90-question mock examination from the textbook was segregated for LLM and doctor comparisons. Two researchers matched the subject categories of the practice papers’ questions to those defined in the Royal College of Ophthalmologists’ documentation concerning the FRCOphth Part 2 written examination. Similarly, two researchers categorised each question as first order recall or higher order reasoning, corresponding to ‘remembering’ and ‘applying’ or ‘analysing’ in Bloom’s taxonomy, respectively [ 18 ]. Disagreement between classification decisions was resolved by a third researcher casting a deciding vote. Questions containing non-plain text elements such as images were excluded as these could not be inputted to the LLM applications.

Trialling large language models

Every eligible question was inputted into ChatGPT (GPT-3.5 and GPT-4 versions; OpenAI, San Francisco, California, United States of America) between April 29 and May 10, 2023. The answer selected by GPT-3.5 and GPT-4 for each question was recorded, and the whole reply was retained for further analysis. If ChatGPT failed to provide a definitive answer, the question was re-trialled up to three times, after which ChatGPT’s answer was recorded as ‘null’ if no answer was provided. Correct answers (‘ground truth’) were defined as the answers provided by the textbook and were recorded for every eligible question to facilitate calculation of performance. Upon their release, Bard (Google LLC, Mountain View, California, USA) and HuggingChat (Hugging Face, Inc., New York City, USA) were used to trial PaLM 2 (Google LLC) and LLaMA (Meta, Menlo Park, California, USA) respectively on the portion of the textbook corresponding to a 90-question examination, adhering to the same procedures between June 20 and July 2, 2023.
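The re-trial rule described above can be sketched in a few lines of Python. This is an illustration only: the study entered questions manually into chatbot interfaces, so `ask_model` is a hypothetical stand-in for a single chatbot query, and reading "up to three times" as three further attempts after the first one is an assumption.

```python
from typing import Callable, Optional, Sequence

NULL_ANSWER = "null"  # label recorded when no definitive answer is obtained


def trial_question(
    question: str,
    ask_model: Callable[[str], Optional[str]],
    max_retrials: int = 3,
) -> str:
    """Return the model's chosen option, or 'null' once re-trials are exhausted.

    ask_model is hypothetical: it returns an option letter such as "B",
    or None when the reply contains no definitive answer.
    """
    # One initial attempt plus up to max_retrials further attempts (assumed reading).
    for _ in range(1 + max_retrials):
        answer = ask_model(question)
        if answer is not None:
            return answer
    return NULL_ANSWER


def score(recorded: Sequence[str], ground_truth: Sequence[str]) -> float:
    """Proportion of recorded answers that match the textbook answers."""
    correct = sum(r == g for r, g in zip(recorded, ground_truth))
    return correct / len(ground_truth)
```

Under this sketch a 'null' answer never matches the ground truth, so it counts as incorrect when performance is calculated.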

Clinical benchmarks

To gauge the performance, accuracy, and relevance of LLM outputs, five expert ophthalmologists who had all passed the FRCOphth Part 2 (E1-E5), three trainees (residents) currently in ophthalmology training programmes (T1-T3), and two unspecialised (i.e. not in ophthalmology training) junior doctors (J1-J2) first answered the 90-question mock examination independently, without reference to textbooks, the internet, or LLMs’ recorded answers. As with the LLMs, doctors’ performance was calculated with reference to the correct answers provided by the textbook. After completing the examination, ophthalmologists graded the whole output of GPT-3.5 and GPT-4 on a Likert scale from 1–5 (very bad, bad, neutral, good, very good) to qualitatively appraise accuracy of information provided and relevance of outputs to the question used as an input prompt. For these appraisals, ophthalmologists were blind to the LLM source (which was presented in a randomised order) and to their previous answers to the same questions, but they could refer to the question text and correct answer and explanation provided by the textbook. Procedures are comprehensively described in the protocol issued to the ophthalmologists ( S1 Protocol ).

Our null hypothesis was that LLMs and doctors would exhibit similar performance, supported by results in a wide range of medical examinations [ 3 , 6 ]. Prospective power analysis was conducted which indicated that 63 questions were required to identify a 10% superior performance of an LLM to human performance at a 5% significance level (type 1 error rate) with 80% power (20% type 2 error rate). This indicated that the 90-question examination in our experiments was more than sufficient to detect ~10% differences in overall performance. The whole 90-question mock examination was used to avoid over- or under-sampling certain question types with respect to actual FRCOphth papers. To verify that the mock examination was representative of the FRCOphth Part 2 examination, expert ophthalmologists were asked to rate the difficulty of questions used here in comparison to official examinations on a 5-point Likert scale (“much easier”, “somewhat easier”, “similar”, “somewhat more difficult”, “much more difficult”).

Statistical analysis

Performance of doctors and LLMs was compared using chi-squared (χ²) tests. Agreement between answers provided by doctors and LLMs was quantified through calculation of Kappa statistics, interpreted in accordance with McHugh’s recommendations [ 19 ]. To further explore the strengths and weaknesses of the answer providers, performance was stratified by question type (first order fact recall or higher order reasoning) and subject using a chi-squared or Fisher’s exact test where appropriate. Likert scale data corresponding to the accuracy and relevance of GPT-3.5 and GPT-4 responses to the same questions were analysed with paired t-tests with the Bonferroni correction applied to mitigate the risk of false positive results due to multiple testing; parametric testing was justified by a sufficient sample size [ 20 ]. A chi-squared test was used to quantify the significance of any difference in overall preference of ophthalmologists choosing between GPT-3.5 and GPT-4 responses. Statistical significance was concluded where p < 0.05. For additional contextualisation, examination statistics corresponding to FRCOphth Part 2 written examinations taken between July 2017 and December 2022 were collected from Royal College of Ophthalmologists examiners’ reports [ 21 ]. These statistics facilitated comparisons between human and LLM performance in the mock examination with the performance of actual candidates in recent examinations. Failure cases where all LLMs provided an incorrect answer were appraised qualitatively to explore any specific weaknesses of the technology.

Statistical analysis was conducted in R (version 4.1.2; R Foundation for Statistical Computing, Vienna, Austria), and figures were produced in Affinity Designer (version 1.10.6; Serif Ltd, West Bridgford, Nottinghamshire, United Kingdom).
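As a rough illustration of the paired comparison with Bonferroni correction described in this section, the following Python sketch computes a paired t statistic and adjusts a family of p values. The study's analysis was run in R; Python is used here only for illustration, the function names are hypothetical, and all grades and p values below are made up rather than taken from the study.

```python
import math
import statistics
from typing import List, Sequence


def paired_t_statistic(first: Sequence[float], second: Sequence[float]) -> float:
    """t statistic for paired samples (second minus first), n - 1 degrees of freedom."""
    differences = [b - a for a, b in zip(first, second)]
    mean_diff = statistics.mean(differences)
    std_error = statistics.stdev(differences) / math.sqrt(len(differences))
    return mean_diff / std_error


def bonferroni_adjust(p_values: Sequence[float]) -> List[float]:
    """Multiply each p value by the number of tests in the family, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Made-up Likert grades (1-5) from one rater for five questions; not study data.
gpt35_grades = [3, 2, 4, 3, 2]
gpt4_grades = [4, 4, 5, 4, 3]
t_stat = paired_t_statistic(gpt35_grades, gpt4_grades)

# Made-up raw p values standing in for a family of three tests.
adjusted = bonferroni_adjust([0.01, 0.04, 0.5])
```

An adjusted p value is then compared against the usual 0.05 threshold, which is equivalent to comparing the raw p value against 0.05 divided by the number of tests.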

Question sources

Of 360 questions in the textbook, 347 questions (including 87 of the 90 questions from the mock examination chapter) were included [ 17 ]. Exclusions were all due to non-text elements such as images and tables which could not be inputted into LLM chatbot interfaces. The distribution of question types and subjects within the whole set and mock examination set of questions is summarised in Table 1 and S1 Table alongside performance.

Question subject and type distributions presented alongside scores attained by LLMs (GPT-3.5, GPT-4, LLaMA, and PaLM 2), expert ophthalmologists (E1-E5), ophthalmology trainees (T1-T3), and unspecialised junior doctors (J1-J2). Median scores do not necessarily sum to the overall median score, as fractional scores are impossible.

https://doi.org/10.1371/journal.pdig.0000341.t001

GPT-4 represents a significant advance on GPT-3.5 in ophthalmological knowledge and reasoning.

Overall performance over 347 questions was significantly higher for GPT-4 (61.7%) than GPT-3.5 (48.41%; χ² = 12.32, p < 0.01), with results detailed in S1 Fig and S1 Table . ChatGPT performance was consistent across question types and subjects ( S1 Table ). For GPT-4, no significant variation was observed with respect to first order and higher order questions (χ² = 0.22, p = 0.64), or subjects defined by the Royal College of Ophthalmologists (Fisher’s exact test over 2000 iterations, p = 0.23). Similar results were observed for GPT-3.5 with respect to first order and higher order questions (χ² = 0.08, p = 0.77), and subjects (Fisher’s exact test over 2000 iterations, p = 0.28). Performance and variation within the 87-question mock examination were very similar to the overall performance over 347 questions, and subsequent experiments were therefore restricted to that representative set of questions.
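The overall comparison above can be checked with a short Python sketch of a Pearson chi-squared test on a 2×2 table. Two assumptions are made here: the counts of correct answers (214 for GPT-4 and 168 for GPT-3.5) are inferred from the reported percentages of 347 questions rather than taken from the supplementary data, and no continuity correction is applied.

```python
import math
from typing import Tuple


def chi_squared_2x2(correct_a: int, correct_b: int, n_a: int, n_b: int) -> Tuple[float, float]:
    """Pearson chi-squared statistic and p value (1 degree of freedom, no correction)."""
    wrong_a = n_a - correct_a
    wrong_b = n_b - correct_b
    total = n_a + n_b
    col_correct = correct_a + correct_b
    col_wrong = wrong_a + wrong_b
    # Shortcut formula for a 2x2 table: N(ad - bc)^2 / (row and column totals).
    cross = correct_a * wrong_b - wrong_a * correct_b
    chi2 = total * cross ** 2 / (n_a * n_b * col_correct * col_wrong)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts inferred from 61.7% and 48.41% of 347 questions (an assumption).
chi2, p_value = chi_squared_2x2(correct_a=214, correct_b=168, n_a=347, n_b=347)
print(f"chi-squared = {chi2:.2f}, p = {p_value:.4f}")  # chi-squared is about 12.32
```

With these inferred counts the statistic comes out at about 12.32 with p below 0.01, in line with the reported values.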

GPT-4 compares well with other LLMs, junior and trainee doctors and ophthalmology experts.

Performance in the mock examination is summarised in Fig 1 : GPT-4 (69%) was the top-scoring model, performing to a significantly higher standard than GPT-3.5 (48%; χ² = 7.33, p < 0.01) and LLaMA (32%; χ² = 22.77, p < 0.01), but statistically similarly to PaLM 2 (56%) despite a superior score (χ² = 2.81, p = 0.09). LLaMA exhibited the lowest examination score, significantly weaker than GPT-3.5 (χ² = 4.58, p = 0.03) and PaLM 2 (χ² = 10.01, p < 0.01) as well as GPT-4.

Examination performance in the 87-question mock examination used to trial LLMs (GPT-3.5, GPT-4, LLaMA, and PaLM 2), expert ophthalmologists (E1-E5), ophthalmology trainees (T1-T3), and unspecialised junior doctors (J1-J2). Dotted lines depict the mean performance of expert ophthalmologists (66/87; 76%), ophthalmology trainees (51/87; 59%), and unspecialised junior doctors (37/87; 43%). The performance of GPT-4 lay within the range of expert ophthalmologists and ophthalmology trainees.

https://doi.org/10.1371/journal.pdig.0000341.g001

The performance of GPT-4 was statistically similar to the mean score attained by expert ophthalmologists ( Fig 1 ; χ² = 1.18, p = 0.28). Moreover, GPT-4’s performance exceeded the mean mark attained across FRCOphth Part 2 written examination candidates between 2017 and 2022 (66.06%), mean pass mark according to standard setting (61.31%), and the mean official mark required to pass the examination after adjustment (63.75%), as detailed in S2 Table . In individual comparisons with expert ophthalmologists, GPT-4 was equivalent in 3 cases (χ² tests, p > 0.05, S3 Table ), and inferior in 2 cases (χ² tests, p < 0.05; Table 2 ). In comparisons with ophthalmology trainees, GPT-4 was equivalent to all three ophthalmology trainees (χ² tests, p > 0.05; Table 2 ). GPT-4 was significantly superior to both unspecialised junior doctors (χ² tests, p < 0.05; Table 2 ). Doctors were anonymised in analysis, but their ophthalmological experience is summarised in S3 Table . Unsurprisingly, junior doctors (J1-J2) attained lower scores than expert ophthalmologists (E1-E5; t = 7.18, p < 0.01), and ophthalmology trainees (T1-T3; t = 11.18, p < 0.01), illustrated in Fig 1 . Ophthalmology trainees approached expert-level scores with no significant difference between the groups (t = 1.55, p = 0.18). None of the other LLMs matched any of the expert ophthalmologists, mean mark of real examination candidates, or FRCOphth Part 2 pass mark.

Expert ophthalmologists agreed that the mock examination was a faithful representation of actual FRCOphth Part 2 Written Examination papers with a mean and median score of 3/5 (range 2-4/5).

Results of pair-wise comparisons of examination performance between GPT-4 and the other answer providers. Significantly greater performance for GPT-4 is highlighted green, significantly inferior performance for GPT-4 is highlighted orange. GPT-4 was superior to all other LLMs and unspecialised junior doctors, and equivalent to most expert ophthalmologists and all ophthalmology trainees.

https://doi.org/10.1371/journal.pdig.0000341.t002

LLM strengths and weaknesses are similar to doctors.

Agreement between answers given by LLMs, expert ophthalmologists, and trainee doctors was generally absent (0 ≤ κ < 0.2), minimal (0.2 ≤ κ < 0.4), or weak (0.4 ≤ κ < 0.6), with moderate agreement only recorded for one pairing between the two highest performing ophthalmologists ( Fig 2 ; κ = 0.64) [ 19 ]. Disagreement was primarily the result of general differences in knowledge and reasoning ability, illustrated by strong negative correlation between Kappa statistic (quantifying agreement) and difference in examination performance (Pearson’s r = -0.63, p < 0.01). Answer providers with more similar scores exhibited greater agreement overall irrespective of their category (LLM, expert ophthalmologist, ophthalmology trainee, or junior doctor).
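To make the agreement measure concrete, the following Python sketch computes Cohen's kappa for two answer providers and maps it onto the bands quoted above. The answer sequences are made up, the function names are hypothetical, and the upper limit of 0.8 for 'moderate' is taken from McHugh's scheme rather than from the text of this section.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(answers_a: Sequence[str], answers_b: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two raters corrected for chance."""
    n = len(answers_a)
    observed = sum(a == b for a, b in zip(answers_a, answers_b)) / n
    freq_a = Counter(answers_a)
    freq_b = Counter(answers_b)
    # Chance agreement from each provider's marginal option frequencies.
    expected = sum(freq_a[option] * freq_b[option] for option in freq_a) / n ** 2
    # Undefined (division by zero) if both providers always give the same single option.
    return (observed - expected) / (1 - expected)


def agreement_band(kappa: float) -> str:
    """Bands as quoted in the text; values below zero are also treated as 'absent'."""
    if kappa < 0.2:
        return "absent"
    if kappa < 0.4:
        return "minimal"
    if kappa < 0.6:
        return "weak"
    if kappa < 0.8:  # upper limit assumed from McHugh's scheme
        return "moderate"
    return "strong or above"


# Made-up answers to eight four-option questions; not study data.
provider_1 = list("ABCDABCD")
provider_2 = list("ABCDABDC")
kappa = cohen_kappa(provider_1, provider_2)
print(f"kappa = {kappa:.2f} ({agreement_band(kappa)})")
```

In this toy example the two providers agree on six of eight questions, chance agreement is 0.25, and kappa is therefore (0.75 − 0.25) / 0.75, roughly 0.67.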

Agreement correlates strongly with overall performance and stratification analysis found no particular question type or subject was associated with better performance of LLMs or doctors, indicating that LLM knowledge and reasoning ability is general across ophthalmology rather than restricted to particular subspecialties or question types.

https://doi.org/10.1371/journal.pdig.0000341.g002

Stratification analysis was undertaken to identify any specific strengths and weaknesses of LLMs with respect to expert ophthalmologists and trainee doctors ( Table 1 and S4 Table ). No significant difference between performance in first order fact recall and higher order reasoning questions was observed among any of the LLMs, expert ophthalmologists, ophthalmology trainees, or unspecialised junior doctors ( S4 Table ; χ² tests, p > 0.05). Similarly, only J1 (junior doctor yet to commence ophthalmology training) exhibited statistically significant variation in performance between subjects ( S4 Table ; Fisher’s exact tests over 2000 iterations, p = 0.02); all other doctors and LLMs exhibited no significant variation (Fisher’s exact tests over 2000 iterations, p > 0.05). To explore whether consistency was due to an insufficient sample size, similar analyses were run for GPT-3.5 and GPT-4 performance over the larger set of 347 questions ( S1 Table ; S4 Table ). As with the mock examination, no significant differences in performance across question types ( S4 Table ; χ² tests, p > 0.05) or subjects ( S4 Table ; Fisher’s exact tests over 2000 iterations, p > 0.05) were observed.

LLM examination performance translates to subjective preference indicated by expert ophthalmologists.

Ophthalmologists’ appraisal of GPT-4 and GPT-3.5 outputs indicated a marked preference for the former over the latter, mirroring objective performance in the mock examination and over the whole textbook. GPT-4 exhibited significantly (t-test with Bonferroni correction, p < 0.05) higher accuracy and relevance than GPT-3.5 according to all five ophthalmologists’ grading ( Table 3 ). Differences were visually obvious, with GPT-4 exhibiting much higher rates of attaining the highest scores for accuracy and relevance than GPT-3.5 ( Fig 3 ). This superiority was reflected in ophthalmologists’ qualitative preference indications: GPT-4 responses were preferred to GPT-3.5 responses by every ophthalmologist with statistically significant skew in favour of GPT-4 (χ² test, p < 0.05; Table 3 ).

Accuracy (A) and relevance (B) ratings were provided by five expert ophthalmologists for ChatGPT (powered by GPT-3.5 and GPT-4) responses to 87 FRCOphth Part 2 mock examination questions. In every case, the accuracy and relevance of GPT-4 is significantly superior to GPT-3.5 (t-test with Bonferroni correction applied, p < 0.05). Pooled scores for accuracy (C) and relevance (D) from all five raters are presented in the bottom two plots, with GPT-3.5 (left bars) compared directly with GPT-4 (right bars).

https://doi.org/10.1371/journal.pdig.0000341.g003

t-test results with Bonferroni correction applied showing the superior accuracy and relevance of GPT-4 responses relative to GPT-3.5 responses in the opinion of five fully trained ophthalmologists (positive mean differences favour GPT-4), and χ² test showing that GPT-4 responses were preferred to GPT-3.5 responses by every ophthalmologist in their blinded qualitative appraisals.

https://doi.org/10.1371/journal.pdig.0000341.t003

Failure cases exhibit no association with subject, complexity, or human answers.

The LLM failure cases—where every LLM provided an incorrect answer—are summarised in Table 4 . While errors made by LLMs were occasionally similar to those made by trainee ophthalmologists and junior doctors, this association was not consistent ( Table 4 ). There was no preponderance of ophthalmological subject or first or higher order questions in the failure cases, and questions did not share a common theme, sentence structure, or grammatical construct ( Table 4 ). Examination questions are redacted here to avoid breaching copyright and prevent future LLMs accessing the test data during pretraining but can be provided on request.

Summary of LLM failure cases, where all models provided an incorrect answer to the FRCOphth Part 2 mock examination question. No associations were found with human answers, complexity, subject, theme, sentence structure, or grammatical constructs.

https://doi.org/10.1371/journal.pdig.0000341.t004

Here, we present a clinical benchmark to gauge the ophthalmological performance of LLMs, using a source of questions with very low risk of contamination as the utilised textbook is not freely available online [ 17 ]. Previous studies have suggested that ChatGPT can provide useful responses to ophthalmological queries, but often use online question sources which may have featured in LLMs’ pretraining datasets [ 7 , 12 , 15 , 22 ]. In addition, our employment of multiple LLMs as well as fully qualified and training doctors provides novel insight into the potential and limitations of state-of-the-art LLMs through head-to-head comparisons which provide clinical context and quantitative benchmarks of competence in ophthalmology. Subsequent research may leverage our questions and results to gauge the performance of new LLMs and applications as they emerge.

We make three primary observations. First, performance of GPT-4 compares well to expert ophthalmologists and ophthalmology trainees, and exhibits pass-worthy performance in an FRCOphth Part 2 mock examination. PaLM 2 did not attain pass-worthy performance or match expert ophthalmologists’ scores but was within the spread of trainee doctors’ performance. LLMs are approaching human expert-level knowledge and reasoning in ophthalmology, and significantly exceed the ability of non-specialist clinicians (represented here by unspecialised junior doctors) to answer ophthalmology questions. Second, clinician grading of model outputs suggests that GPT-4 exhibits improved accuracy and relevance when compared with GPT-3.5. Development is producing models which generate better outputs to ophthalmological queries in the opinion of expert human clinicians, which suggests that models are becoming more capable of providing useful assistance in clinical settings. Third, LLM performance was consistent across question subjects and types, distributed similarly to human performance, and exhibited comparable agreement between other LLMs and doctors when corrected for differences in overall performance. Together, this indicates that the ophthalmological knowledge and reasoning capability of LLMs is general rather than limited to certain subspecialties or tasks. LLM-driven natural language processing seems to facilitate similar—although idiosyncratic—clinical knowledge and reasoning to human clinicians, with no obvious blind spots precluding clinical use.

Similarly dramatic improvements in the performance of GPT-4 relative to GPT-3.5 have been reported in the context of the North American Ophthalmology Knowledge Assessment Program (OKAP) [ 13 , 15 ]. State-of-the-art models exhibit far more clinical promise than their predecessors, and expectations and development should be tailored accordingly. Results from the OKAP also suggest that improvement in performance is due to GPT-4 being more well-rounded than GPT-3.5 [ 13 ]. This increases the scope for potential applications of LLMs in ophthalmology, as development is eliminating weaknesses rather than optimising in narrow domains. This study shows that well-rounded LLM performance compares well with expert ophthalmologists, providing clinically relevant evidence that LLMs may be used to provide medical advice and assistance. Further improvement is expected as multimodal foundation models, perhaps based on LLMs such as GPT-4, emerge and facilitate compatibility with image-rich ophthalmological data [ 3 , 23 , 24 ].

Limitations

This study was limited by three factors. First, examination performance is an unvalidated indicator of clinical aptitude. We sought to ameliorate this limitation by employing expert ophthalmologists, ophthalmology trainees, and unspecialised junior doctors answering the same questions as clinical benchmarks; and compared LLM performance to real cohorts of candidates in recent FRCOphth examinations. However, it remains an issue that comparable performance to clinical experts in an examination does not necessarily demonstrate that an LLM can communicate with patients and practitioners or contribute to clinical decision making accurately and safely. Early trials of LLM chatbots have suggested that LLM responses may be equivalent or even superior to human doctors in terms of accuracy and empathy, and experiments using complicated case studies suggest that LLMs operate well even outside typical presentations and more common medical conditions [ 4 , 25 , 26 ]. In ophthalmology, GPT-3.5 and GPT-4 have been shown to be capable of providing precise and suitable triage decisions when queried with eye-related symptoms [ 22 , 27 ]. Further work is now warranted in conventional clinical settings.

Second, while the study was sufficiently powered to detect a less than 10% difference in overall performance, the relatively small number of questions in certain categories used for stratification analysis may mask significant differences in performance. Testing LLMs and clinicians with more questions may help establish where LLMs exhibit greater or lesser ability in ophthalmology. Furthermore, researchers using different ways to categorise questions may be able to identify specific strengths and weaknesses of LLMs and doctors which could help guide design of clinical LLM interventions.

Finally, experimental tasks were ‘zero-shot’ in that LLMs were not provided with any examples of correctly answered questions before they were queried with FRCOphth questions from the textbook. This mode of interrogation entails the maximal level of difficulty for LLMs, so it is conceivable that the ophthalmological knowledge and reasoning encoded within these models is actually even greater than indicated by results here [ 1 ]. Future research may seek to adapt LLMs by using more domain-specific text during pretraining and fine-tuning, or by providing examples of successfully completed tasks to further improve performance in that clinical task [ 3 ].

Future directions

Autonomous deployment of LLMs is currently precluded by inaccuracy and fact fabrication. Our study found that despite meeting expert standards, state-of-the-art LLMs such as GPT-4 do not match top-performing ophthalmologists [ 28 ]. Moreover, there remain controversial ethical questions about what roles should and should not be assigned to inanimate AI models, and to what extent human clinicians must remain responsible for their patients [ 3 ]. However, the remarkable performance of GPT-4 in ophthalmology examination questions suggests that LLMs may be able to provide useful input in clinical contexts, either to assist clinicians in their day-to-day work or with their education or preparation for examinations [ 3 , 13 , 14 , 27 ]. Further improvement in performance may be obtained by specific fine-tuning of models with high quality ophthalmological text data, requiring curation and deidentification [ 29 ]. GPT-4 may prove especially useful where access to ophthalmologists is limited: provision of advice, diagnosis, and management suggestions by a model with FRCOphth Part 2-level knowledge and reasoning ability is likely to be superior to non-specialist doctors and allied healthcare professionals working without support, as their exposure to and knowledge of eye care is limited [ 27 , 30 , 31 ].

However, close monitoring is essential to avoid mistakes caused by inaccuracy or fact fabrication [ 32 ]. Clinical applications would also benefit from an uncertainty indicator reducing the risk of erroneous decisions [ 7 ]. As LLM performance often correlates with the frequency of query terms’ representation in the model’s training dataset, a simple indicator of ‘familiarity’ could be engineered by calculating the relative frequency of query term representation in the training data [ 7 , 33 ]. Users could appraise familiarity to temper their confidence in answers provided by the LLM, perhaps reducing error. Moreover, ophthalmological applications require extensive validation, preferably with high quality randomised controlled trials to conclusively demonstrate benefit (or lack thereof) conferred to patients by LLM interventions [ 34 ]. Trials should be pragmatic so as not to inflate effect sizes beyond what may generalise to patients once interventions are implemented at scale [ 34 , 35 ]. In addition to patient outcomes, practitioner-related variables should also be considered: interventions aiming to improve efficiency should be specifically tested to ensure that they reduce rather than increase clinicians’ workload [ 3 ].
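The ‘familiarity’ indicator described above can be illustrated with a minimal sketch. This is not the authors' method: the function name `familiarity`, the whitespace tokenisation, the averaging of per-term relative frequencies, and the toy corpus standing in for training data are all assumptions made purely for illustration.

```python
from collections import Counter


def familiarity(query, corpus_counts, total_tokens):
    """Mean relative frequency of the query's terms in a reference corpus.

    Hypothetical indicator: higher values mean the query terms are better
    represented in the corpus; 0.0 means none of the terms were seen.
    """
    terms = query.lower().split()
    if not terms or total_tokens == 0:
        return 0.0
    per_term = [corpus_counts.get(term, 0) / total_tokens for term in terms]
    return sum(per_term) / len(per_term)


# Toy corpus standing in for (inaccessible) training data: an assumption.
corpus = "the retina is the light sensitive layer of the eye".split()
counts = Counter(corpus)

score_common = familiarity("the eye", counts, len(corpus))
score_rare = familiarity("keratoconus", counts, len(corpus))
print(score_common, score_rare)
```

In practice the training data of proprietary LLMs is rarely available, so any real indicator of this kind would probably have to rely on a proxy corpus, and its usefulness would need empirical validation.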

According to comparisons with expert and trainee doctors, state-of-the-art LLMs are approaching expert-level performance in advanced ophthalmology questions. GPT-4 attains pass-worthy performance in FRCOphth Part 2 questions and exceeds the scores of some expert ophthalmologists. As top-performing doctors exhibit superior scores, LLMs do not appear capable of replacing ophthalmologists, but state-of-the-art models could provide useful advice and assistance to non-specialists or patients where access to eye care professionals is limited [ 27 , 28 ]. Further research is required to design LLM-based interventions which may improve eye health outcomes, validate interventions in clinical trials, and engineer governance structures to regulate LLM applications as they begin to be deployed in clinical settings [ 36 ].

Supporting information

S1 Fig. ChatGPT performance in questions taken from the whole textbook.

Mosaic plot depicting the overall performance of ChatGPT versions powered by GPT-3.5 and GPT-4 in 360 FRCOphth Part 2 written examination questions. Performance was significantly higher for GPT-4 than GPT-3.5, and was close to mean human examination candidate performance and pass mark set by standard setting and after adjustment.

https://doi.org/10.1371/journal.pdig.0000341.s001

S1 Table. Question characteristics and performance of GPT-3.5 and GPT-4 over the whole textbook.

Similar observations were noted here to the smaller mock examination used for subsequent experiments. GPT-4 performs to a significantly higher standard than GPT-3.5.

https://doi.org/10.1371/journal.pdig.0000341.s002

S2 Table. Examination statistics corresponding to FRCOphth Part 2 written examinations sat between July 2017-December 2022.

https://doi.org/10.1371/journal.pdig.0000341.s003

S3 Table. Experience of expert ophthalmologists (E1-E5), ophthalmology trainees (T1-T3), and unspecialised junior doctors (J1-J2) involved in experiments.

https://doi.org/10.1371/journal.pdig.0000341.s004

S4 Table. Results of statistical tests of variation in performance between question subjects and types, for each trialled LLM, expert ophthalmologist, and trainee doctor.

Statistically significant results are highlighted in green.

https://doi.org/10.1371/journal.pdig.0000341.s005

S1 Protocol. Procedures followed by ophthalmologists to grade the output of GPT-3.5 and GPT-4 in terms of accuracy, relevance, and rater-preference of model outputs.

https://doi.org/10.1371/journal.pdig.0000341.s006

Acknowledgments

The authors extend their thanks to Mr Arunachalam Thirunavukarasu (Betsi Cadwaladr University Health Board) for his advice and assistance with recruitment.

  • 1. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language Models are Few-Shot Learners. In: Advances in Neural Information Processing Systems [Internet]. Curran Associates, Inc.; 2020 [cited 2023 Jan 30]. p. 1877–901. Available from: https://papers.nips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
  • 2. OpenAI. GPT-4 Technical Report [Internet]. arXiv; 2023 [cited 2023 Apr 11]. Available from: http://arxiv.org/abs/2303.08774
  • 9. Google. PaLM 2 Technical Report [Internet]. 2023 [cited 2023 May 11]. Available from: https://ai.google/static/documents/palm2techreport.pdf
  • 17. Ting DSJ, Steel D. MCQs for FRCOphth Part 2. Oxford University Press; 2020. 253 p.
  • 21. Part 2 Written FRCOphth Exam [Internet]. The Royal College of Ophthalmologists. [cited 2023 Jan 30]. Available from: https://www.rcophth.ac.uk/examinations/rcophth-exams/part-2-written-frcophth-exam/


In the brain, bursts of beta rhythms implement cognitive control

Bursts of brain rhythms with “beta” frequencies control where and when neurons in the cortex process sensory information and plan responses. Studying these bursts would improve understanding of cognition and clinical disorders, researchers argue in a new review.

The brain processes information on many scales. Individual cells electrochemically transmit signals in circuits, but at the large scale required to produce cognition, millions of cells act in concert, driven by rhythmic signals at varying frequencies. Studying one frequency range in particular, beta rhythms of about 14-30 Hz, holds the key to understanding how the brain controls cognitive processes—or loses control in some disorders—a team of neuroscientists argues in a new review article.

Drawing on experimental data, mathematical modeling and theory, the scientists make the case that bursts of beta rhythms control cognition in the brain by regulating where and when higher gamma frequency waves can coordinate neurons to incorporate new information from the senses or formulate plans of action. Beta bursts, they argue, quickly establish flexible but controlled patterns of neural activity for implementing intentional thought.

“Cognition depends on organizing goal-directed thought, so if you want to understand cognition, you have to understand that organization,” said co-author Earl K. Miller, Picower Professor in The Picower Institute for Learning and Memory and the Department of Brain and Cognitive Sciences at MIT. “Beta is the range of frequencies that can control neurons at the right spatial scale to produce organized thought.”

Miller and colleagues Mikael Lundqvist, Jonatan Nordmark and Johan Liljefors at the Karolinska Institutet and Pawel Herman at the KTH Royal Institute of Technology in Sweden, write that studying bursts of beta rhythms to understand how they emerge and what they represent would not only help explain cognition, but also aid in diagnosing and treating cognitive disorders.

“Given the relevance of beta oscillations in cognition, we foresee a major change in the practice for biomarker identification, especially given the prominence of beta bursting in inhibitory control processes … and their importance in ADHD, schizophrenia and Alzheimer’s disease,” they write in the journal Trends in Cognitive Sciences.

Experimental studies covering several species including humans, a variety of brain regions, and numerous cognitive tasks have revealed key characteristics of beta waves in the cortex, the authors write: Beta rhythms occur in quick but powerful bursts; they inhibit the power of higher frequency gamma rhythms; and though they originate in deeper brain regions, they travel within specific locations of cortex. Considering these properties together, the authors write that they are all consistent with precise and flexible regulation, in space and time, of the gamma rhythm activity that experiments show carries signals of sensory information and motor plans.

A chart from a study plots bursts of brain waves of varying frequency at specific times. The bursts are represented as warm colors against a blue background. When there are low frequency bursts there aren't high frequency bursts and vice versa.

“Beta bursts thus offer new opportunities for studying how sensory inputs are selectively processed, reshaped by inhibitory cognitive operations and ultimately result in motor actions,” the authors write.

For one example, Miller and colleagues have shown in animals that in the prefrontal cortex in working memory tasks, beta bursts direct when gamma activity can store new sensory information, read out the information when it needs to be used, and then discard it when it’s no longer relevant. For another example, other researchers have shown that beta rises when human volunteers are asked to suppress a previously learned association between word pairs, or to forget a cue because it will no longer be used in a task.

In a paper last year, Lundqvist, Herman, Miller and others cited several lines of experimental evidence to hypothesize that beta bursts implement cognitive control spatially in the brain, essentially constraining patches of the cortex to represent the general rules of a task even as individual neurons within those patches represent the specific contents of information. For example, if the working memory task is to remember a padlock combination, beta rhythms will implement patches of cortex for the general steps “turn left,” “turn right,” “turn left again,” allowing gamma to enable neurons within each patch to store and later recall the specific numbers of the combination. The two-fold value of such an organizing principle, they noted, is that the brain can rapidly apply task rules to many neurons at a time and do so without having to re-establish the overall structure of the task if the individual numbers change (i.e. you set a new combination).

Another important phenomenon of beta bursts, the authors write, is that they propagate across long distances in the brain, spanning multiple regions. Studying the direction of their spatial travels, as well as their timing, could shed further light on how cognitive control is implemented.

New ideas beget new questions

Beta rhythm bursts can differ not only in their frequency, but also their duration, amplitude, origin and other characteristics. This variety speaks to their versatility, the authors write, but also obliges neuroscientists to study and understand these many different forms of the phenomenon and what they represent to harness more information from these neural signals.

“It quickly becomes very complicated, but I think the most important aspect of beta bursts is the very simple and basic premise that they shed light on the transient nature of oscillations and neural processes associated with cognition,” Lundqvist said. “This changes our models of cognition and will impact everything we do. For a long time we implicitly or explicitly assumed oscillations are ongoing which has colored experiments and analyses. Now we see a first wave of studies based on this new thinking, with new hypothesis and ways to analyze data, and it should only pick up in years to come.”

The authors acknowledge another major issue that must be resolved by further research: how do beta bursts emerge in the first place to perform their apparent role in cognitive control?

“It is unknown how beta bursts arise as a mediator of an executive command that cascades to other regions of the brain,” the authors write.

The authors don’t claim to have all the answers. Instead, they write, because beta rhythms appear to have an integral role in controlling cognition, the as yet unanswered questions are worth asking.

“We propose that beta bursts provide both experimental and computational studies with a window through which to explore the real-time organization and execution of cognitive functions,” they conclude. “To fully leverage this potential there is a need to address the outstanding questions with new experimental paradigms, analytical methods and modeling approaches.”



Title: CT-ADE: An Evaluation Benchmark for Adverse Drug Event Prediction from Clinical Trial Results

Abstract: Adverse drug events (ADEs) significantly impact clinical research and public health, contributing to failures in clinical trials and leading to increased healthcare costs. The accurate prediction and management of ADEs are crucial for improving the development of safer, more effective medications, and enhancing patient outcomes. To support this effort, we introduce CT-ADE, a novel dataset compiled to enhance the predictive modeling of ADEs. Encompassing over 12,000 instances extracted from clinical trial results, the CT-ADE dataset integrates drug, patient population, and contextual information for multilabel ADE classification tasks in monopharmacy treatments, providing a comprehensive resource for developing advanced predictive models. To mirror the complex nature of ADEs, annotations are standardized at the system organ class level of the Medical Dictionary for Regulatory Activities (MedDRA) ontology. Preliminary analyses using baseline models have demonstrated promising results, achieving 73.33% F1 score and 81.54% balanced accuracy, highlighting CT-ADE's potential to advance ADE prediction. CT-ADE provides an essential tool for researchers aiming to leverage the power of artificial intelligence and machine learning to enhance patient safety and minimize the impact of ADEs on pharmaceutical research and development. Researchers interested in using the CT-ADE dataset can find all necessary resources at this https URL .



  • Open access
  • Published: 18 April 2024

Implementation of an audit and feedback module targeting low-value clinical practices in a provincial trauma quality assurance program: a cost-effectiveness study

  • Blanchard Conombo 1 , 2 ,
  • Jason R. Guertin 1 ,
  • Jeffrey S. Hoch 3 ,
  • Jeremy Grimshaw 4 ,
  • Mélanie Bérubé 2 , 5 ,
  • Christian Malo 5 ,
  • Simon Berthelot 2 , 6 , 7 ,
  • François Lauzier 8 ,
  • Henry T. Stelfox 9 ,
  • Alexis F. Turgeon 2 , 8 ,
  • Patrick Archambault 10 , 6 ,
  • Amina Belcaid 2 &
  • Lynne Moore 1 , 2  

BMC Health Services Research volume 24, Article number: 479 (2024)


Audit and Feedback (A&F) interventions based on quality indicators have been shown to lead to significant improvements in compliance with evidence-based care including de-adoption of low-value practices (LVPs). Our primary aim was to evaluate the cost-effectiveness of adding a hypothetical A&F module targeting LVPs for trauma admissions to an existing quality assurance intervention targeting high-value care and risk-adjusted outcomes. A secondary aim was to assess how certain A&F characteristics might influence its cost-effectiveness.

We conducted a cost-effectiveness analysis using a probabilistic static decision analytic model in the Québec trauma care continuum. We considered the Québec Ministry of Health perspective. Our economic evaluation compared a hypothetical scenario in which the A&F module targeting LVPs is implemented in a Canadian provincial trauma quality assurance program to a status quo scenario in which the A&F module is not implemented. In scenario analyses, we assessed the impact of A&F characteristics on its cost-effectiveness. Results are presented in terms of incremental costs per LVP avoided.

Results suggest that the implementation of the A&F module (Cost = $1,480,850; Number of LVPs = 6,005) is associated with higher costs and higher effectiveness compared to the status quo (Cost = $1,124,661; Number of LVPs = 8,228). The A&F module would cost $160 per LVP avoided compared to the status quo. The A&F module becomes more cost-effective with the addition of facilitation visits, with more frequent evaluation, and when only high-volume trauma centers are considered.

The A&F module targeting LVPs is associated with higher costs and higher effectiveness than the status quo and has the potential to be cost-effective if the decision-makers’ willingness-to-pay is at least $160 per LVP avoided. This likely represents an underestimate of the true ICER due to underestimated costs or missed opportunity costs. Results suggest that virtual facilitation visits, frequent evaluation, and implementing the module in high-volume centers can improve cost-effectiveness.
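As a worked check of the reported figure, the incremental cost per LVP avoided can be recomputed from the costs and LVP counts given in the abstract. The sketch below uses the standard incremental cost-effectiveness ratio (incremental cost divided by incremental effect). The function name is hypothetical, and the calculation uses only the point estimates quoted above, not the authors' probabilistic model.

```python
def icer_per_lvp_avoided(cost_new, lvps_new, cost_ref, lvps_ref):
    """Incremental cost per low-value practice (LVP) avoided.

    Effect is measured as LVPs avoided, i.e. the reduction in the number
    of LVPs under the new strategy relative to the reference strategy.
    """
    incremental_cost = cost_new - cost_ref
    lvps_avoided = lvps_ref - lvps_new
    if lvps_avoided == 0:
        raise ValueError("No difference in LVPs; the ratio is undefined.")
    return incremental_cost / lvps_avoided


# Point estimates quoted in the abstract.
icer = icer_per_lvp_avoided(
    cost_new=1_480_850, lvps_new=6_005,   # A&F module scenario
    cost_ref=1_124_661, lvps_ref=8_228,   # status quo scenario
)
print(round(icer))  # 160
```

The incremental cost is $356,189 for 2,223 LVPs avoided, which gives roughly $160 per LVP avoided, consistent with the reported result.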


Introduction

Low-value practices (LVPs) are tests and treatments that are not supported by evidence and may expose patients to physical and psychological harm [ 1 , 2 ]. They have been estimated to consume up to 30% of healthcare resources in Canada [ 3 ] and in the US [ 4 ]. In 2013, an estimated $270 billion was wasted on excess healthcare services in the US [ 2 ]. From a patient and caregiver perspective, LVPs expose patients to physical and psychological harms, delays to effective treatment, and direct and indirect expenses [ 2 , 5 , 6 , 7 , 8 ]. From a healthcare system perspective, they put strain on tight healthcare budgets and decrease the availability of scarce resources.

Recent literature suggests that interventions targeting the de-implementation of ineffective or harmful health interventions have the potential to reduce overuse and improve clinically important outcomes [ 9 ]. Among these are Audit and Feedback (A&F) interventions, defined as ‘a summary of clinical performance of healthcare over a specified period aimed at providing information to health professionals to allow them to assess and adjust their performance’ [ 10 ]. We now have extensive evidence of the effectiveness of A&F interventions, including those targeting de-implementation of LVPs. A systematic review including 140 randomized controlled trials (RCTs) estimated that A&F interventions resulted in close to 4.3% absolute increase in adherence to evidence-based care (IQR 0.5% to 16.0%) [ 11 ]. The effect of an A&F intervention appears to be larger when it targets de-implementation of low-value practices (absolute decrease of 10.5%). This review also revealed that A&F effectiveness is influenced by its design and delivery [ 11 ]. The World Health Organisation recently expressed concern about the major knowledge gap on the cost and cost-effectiveness of A&F interventions [ 12 ], and recommended that implementation of these interventions be informed by data on their cost-effectiveness [ 13 ]. Despite this, most A&F interventions, including those used across Canadian trauma systems, are implemented without evidence on their cost-effectiveness [ 12 , 14 ]. A 2022 systematic review summarized evidence on the economic value of A&F interventions in healthcare [ 15 ] and found that they have a high potential to be cost-effective. However, authors only identified economic evaluations for 6% of A&F trials, methodological quality of these evaluations was low, and authors concluded that model-based simulations were urgently needed to assess the impact of A&F characteristics on cost-effectiveness to inform optimal A&F design.

Trauma systems are a favorable setting for de-implementation interventions as they possess many documented facilitators including quality improvement teams with medical leadership, routinely-collected clinical data, and performance linked to accreditation [ 16 ]. Furthermore, potential gains are huge due to the resource-intensive nature of trauma care. Trauma systems are thus the ideal setting to advance knowledge on de-implementation. Our research team recently published a list of quality indicators targeting LVPs in acute trauma care [ 17 , 18 ]. We aim to evaluate the cost-effectiveness of an A&F module targeting the de-implementation of these LVPs in an integrated Canadian trauma system and to assess the impact of A&F characteristics on cost-effectiveness.

We conducted an economic evaluation according to the Canadian guidelines for the Economic Evaluation of Health Technologies [ 19 ], and results are reported following the CHEERS 2022 statement [ 20 ]. The study protocol was developed with a project advisory committee including two emergency physicians (CM, EM), two trauma surgeons (TR, NY), three critical care physicians (FL, AFT, HTS), a neurosurgeon (PLB), a spine surgeon (JP), an orthopedic surgeon (ML), two trauma service managers (MB, CR), a trauma registry co-ordinator (AB), an epidemiologist (LM), and two health economists (JRG, JSH). The protocol was approved a priori by all co-authors, members of the advisory committee, a granting agency peer-review committee (Canadian Institutes of Health Research project #353374) and the CHU de Québec – Université Laval research ethics committee.

Our economic evaluation is based on a hypothetical A&F module embedded in the Québec Trauma Care Continuum, a provincial regionalized trauma system comprising 57 adult trauma centers of which 3 are level I (highly specialized urban centers), 5 are level II (similar capacity to level I but in smaller cities), 21 are level III (hospitals in small towns transferring most major trauma to level I/II centers after stabilization), and 28 are level IV (rural community hospitals). All centers undergo mandatory, periodic verifications in line with designation, conducted by the provincial healthcare quality agency, Institut national d’excellence en santé et services sociaux (INESSS) and overseen by the Ministry of Health and Social Services [ 21 ]. Verification includes A&F on adherence to high-value care and risk-adjusted outcomes. Local trauma committees in each center are required to ensure the quality of the trauma program according to designation requirements. Committees include the program medical director (Chair), the program manager, heads of critical care, emergency and surgical departments, heads of multidisciplinary services, and a hospital administrator. Quality improvement activities include trimestral committee meetings with chart review, development of local care protocols, and discussions with clinical and administrative leads locally and at referring centers to identify improvement strategies. Formal letters of agreement are signed by heads of clinical departments to implement changes in their services when required.

Intervention and comparator

We compared a hypothetical scenario in which an A&F module targeting LVPs is implemented in the Québec trauma system to a status quo scenario in which the A&F module is not implemented.

Comparator (status quo scenario)

The study comparator is the A&F intervention currently in place in the Québec Trauma Care Continuum, designed by the provincial healthcare quality agency using the US Agency for Healthcare Research and Quality guidelines [ 22 ]. This A&F intervention targets trauma committees in each trauma center and, as explained above, currently includes modules for adherence to high-value practices (15 quality indicators) and optimal outcomes (3 quality indicators). The A&F intervention currently in place consists of:

Quality reports, produced using trauma registry data and disseminated via a Web platform to local trauma committees and hospital boards of directors.

Web links to user-friendly information sheets including definitions for quality indicators and references supporting each indicator.

Information sheets and Web capsules with guidelines on how the results should be interpreted and acted upon.

A case revision tool integrated into the trauma registry.

Within 6 months of receiving the report, committees are required to submit an action plan proposing improvement strategies for the quality indicators on which they are identified as negative outliers.

Intervention

The study intervention is an A&F module targeting LVPs (6 quality indicators) (http://www.ohri.ca/auditfeedback/laboratories/). The 6 quality indicators were selected using the results of an expert consensus study [ 23 ] and an indicator validation study using data from the Québec trauma registry [ 24 ].

In the base case scenario, the module includes the components already in place described in the status quo scenario, applied to quality indicators on LVPs. We attributed a 5-year lifespan to the A&F module as current literature recommends that quality indicators be updated every five years [ 25 ]. To account for the 5-year lifespan of the A&F module and its potential benefits one year beyond its lifespan, we used a 6-year time horizon.

Type of economic evaluation

For this early economic evaluation, i.e., an evaluation conducted before implementation of the module, we developed a probabilistic static decision-analytic model to estimate the incremental cost-effectiveness ratio (ICER) of the A&F module compared with the status quo scenario, in which the A&F module is not implemented, for patients with acute injury (Fig. 1). We adopted the perspective of the Québec Ministry of Health.

Fig. 1. Decision-analytic model. In the status quo scenario, there is no implementation of the A&F module targeting LVPs. In the intervention scenario, an A&F module targeting LVPs is implemented at baseline (at the beginning of the 1st year). “Use data from year 1” means that data on the effectiveness and costs of the A&F module from the 1st year are available at the beginning of the 2nd year, and so on for the following years.

Effectiveness

The incremental effectiveness of the A&F module was estimated as the incremental number of LVPs avoided. Plausible ranges of percent reductions in LVPs were obtained from the 2012 Cochrane review, which presented the effectiveness of A&F interventions as medians and interquartile ranges [ 26 ]. Specifically, we used the pooled estimate of effectiveness specific to de-implementation interventions, based on 29 studies. For the purposes of our analysis, these medians and interquartile ranges were converted to mean effectiveness and associated standard errors using a method based on highly cited recommendations [ 27 , 28 ].
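The conversion from a median and interquartile range to a mean and standard error can be sketched as follows. This is a minimal illustration, assuming the large-sample approximations commonly attributed to Wan et al. (mean as the average of the three quartiles, standard deviation as IQR/1.35) and assuming the standard error is the standard deviation divided by the square root of the number of studies. The source does not state which formulas were applied, and the input values below are hypothetical, not taken from the Cochrane review.

```python
import math


def mean_sd_from_quartiles(q1, median, q3):
    """Approximate mean and SD from quartiles (large-sample approximation).

    Assumption: roughly normal data, so the IQR spans about 1.35 SDs.
    """
    mean = (q1 + median + q3) / 3.0
    sd = (q3 - q1) / 1.35
    return mean, sd


def standard_error(sd, n_studies):
    """Assumed here: SE = SD / sqrt(number of studies)."""
    return sd / math.sqrt(n_studies)


if __name__ == "__main__":
    # Hypothetical percent reductions in LVPs (q1, median, q3).
    mean, sd = mean_sd_from_quartiles(2.0, 5.0, 11.0)
    se = standard_error(sd, n_studies=29)
    print(f"mean={mean:.2f}%  sd={sd:.2f}  se={se:.2f}")
```

The resulting mean and standard error could then parameterize a distribution for the probabilistic analysis; which distribution is appropriate is a separate modelling choice.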

Costs

The incremental costs of the A&F module over the comparator were estimated as the implementation costs of the A&F module over its 5-year lifespan minus the value of the potential reduction in resource utilization for all LVPs, estimated between years 2 and 6. The implementation costs were determined by identifying all non-recurrent and recurrent costs related to the implementation of the A&F module, including data validation and analyses, report production and validation, administrative costs, and follow-up in local trauma committees (Table 1). The potential reduction in resource utilization was estimated by multiplying the hypothesized reduction in the frequency of each LVP by its average cost. Detailed information on how practices were costed is available elsewhere [ 29 ]. Briefly, we estimated direct healthcare costs for each LVP from the Ministry of Health perspective using an activity-based costing approach. Activity-based costing involves multiplying the unit costs of specific activity centers by the corresponding units of resources used. This method provides an estimate of hospital resource use by activity center, consistent with Grading of Recommendations Assessment, Development and Evaluations guidelines [ 30 , 31 , 32 ]. All costs are expressed in 2020 Canadian dollars. We report our study following the Consolidated Health Economic Evaluation Reporting Standards statement [ 20 ].
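The cost calculation described above can be sketched in a few lines. This is an illustrative sketch only: the function name, the indicator names, and all numbers are hypothetical, and discounting is left out for brevity.

```python
def incremental_cost(implementation_costs_by_year, lvps_avoided, unit_costs):
    """Implementation costs minus the value of resources freed by avoided LVPs.

    implementation_costs_by_year: costs for each year of the module's lifespan
    lvps_avoided: {indicator: number of LVPs avoided}
    unit_costs:   {indicator: average cost per LVP}
    """
    implementation = sum(implementation_costs_by_year)
    savings = sum(lvps_avoided[name] * unit_costs[name] for name in lvps_avoided)
    return implementation - savings


if __name__ == "__main__":
    # Hypothetical inputs: one set-up year followed by four recurrent years.
    costs = [200_000, 50_000, 50_000, 50_000, 50_000]
    avoided = {"indicator_a": 1_000, "indicator_b": 500}
    prices = {"indicator_a": 30.0, "indicator_b": 28.0}
    print(incremental_cost(costs, avoided, prices))
```

A positive result means the module costs more than it saves; a negative result would indicate net savings.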

Incremental cost-effectiveness ratio

The incremental cost-effectiveness ratio (ICER) was estimated by dividing the incremental costs (or savings) of the A&F module by its incremental effectiveness. Results are reported as the incremental cost per LVP avoided.
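In symbols, the ratio can be written as below. The notation is introduced here for illustration and does not appear in the source: C denotes mean total cost, N denotes the mean number of LVPs, and the subscripts A&F and SQ denote the A&F module and status quo scenarios.

```latex
\[
\mathrm{ICER}
  = \frac{\Delta C}{\Delta E}
  = \frac{C_{\mathrm{A\&F}} - C_{\mathrm{SQ}}}{N_{\mathrm{SQ}} - N_{\mathrm{A\&F}}}
\]
```

The denominator is the number of LVPs avoided, so a positive ICER corresponds to an intervention that costs more and avoids more LVPs than the status quo.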

Discount rate

All future costs and benefits were discounted at a rate of 1.5% as recommended by current Canadian guidelines [ 19 ].
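A minimal sketch of discounting at 1.5% per year follows. The convention that the first year is left undiscounted is an assumption made here, the function names are hypothetical, and the amounts are placeholders rather than study data.

```python
def present_value(amount, years_from_baseline, rate=0.015):
    """Discount a future amount back to baseline at a constant annual rate."""
    return amount / (1.0 + rate) ** years_from_baseline


def discounted_total(amounts_by_year, rate=0.015):
    """Sum a stream of yearly amounts; index 0 (year 1) is assumed undiscounted."""
    return sum(present_value(a, t, rate) for t, a in enumerate(amounts_by_year))


if __name__ == "__main__":
    # Hypothetical constant yearly cost over a 6-year time horizon.
    stream = [50_000.0] * 6
    print(round(discounted_total(stream), 2))
    # The scenario analysis varies the rate between 0 and 5%.
    print(round(discounted_total(stream, rate=0.05), 2))
```

The same function applies to effects (LVPs avoided), since the guidelines cited call for discounting both costs and benefits.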

Scenario analyses

Our advisory committee identified 6 scenario-specific sensitivity analyses based on published evidence of A&F effectiveness and context-specific considerations (#1, 2, 3, 6) [ 24 , 26 ] as well as Canadian guidelines on economic evaluation (#4, 5):

1. Adding a virtual facilitation visit once per cycle to help trauma committees identify barriers and facilitators and use them to identify improvement strategies for their action plan [ 15 , 26 ];

2. Increasing feedback frequency from annually to monthly, as assessed in the systematic review [ 26 ];

3. Implementing the module only in high-volume trauma centers (i.e., level I and II);

4. Varying the discount rate between 0 and 5%, as recommended by current Canadian guidelines [ 19 ];

5. Increasing the lifespan of the A&F module to 10 years (Supplementary Fig. 1) to take into account the effect of time on module effectiveness;

6. Increasing the costs of LVPs by 100% to account for lack of complete data on physician billing and unit costs that underestimate market prices. This is based on evidence that physician billing represents approximately 56% of hospital costs in Canada (https://www.cihi.ca/sites/default/files/document/nhex-trends-2020-narrative-report-en.pdf).

We present the ICER based on the results of the probabilistic sensitivity analysis (PSA), as recommended in the Canadian Guidelines for the Economic Evaluation of Health Technologies [ 19 ]. In the PSA, model parameters were represented by distributions of possible values rather than point estimates to address parameter uncertainty. All parameters and their distributions are presented in Table 2. Parameter distributions were randomly sampled 10,000 times. Results were summarized using cost-effectiveness acceptability curves (CEACs) and the cost-effectiveness acceptability frontier (CEAF) [ 33 ]. We used Excel software (Microsoft Office 2019 Professional Plus) to construct the decision model, analyze base case results, and conduct the PSA.
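The logic of a PSA and of a CEAC can be illustrated with a short simulation. This is a sketch under stated assumptions, not the Excel model used in the study: the gamma distributions and their parameters are hypothetical (they are not the Table 2 values), and the function names are introduced here for illustration.

```python
import random


def run_psa(n_draws=10_000, seed=1):
    """Draw (incremental cost, incremental effect) pairs from assumed distributions."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Hypothetical gamma(shape, scale) distributions; mean = shape * scale.
        delta_cost = rng.gammavariate(25.0, 14_000.0)  # mean 350,000 dollars
        delta_effect = rng.gammavariate(16.0, 140.0)   # mean 2,240 LVPs avoided
        draws.append((delta_cost, delta_effect))
    return draws


def ceac_point(draws, wtp):
    """Share of draws with positive net monetary benefit at a given WTP."""
    favourable = sum(1 for dc, de in draws if wtp * de - dc > 0)
    return favourable / len(draws)


if __name__ == "__main__":
    draws = run_psa()
    for wtp in (0, 100, 160, 250, 1000):
        print(f"WTP ${wtp}: {ceac_point(draws, wtp):.3f}")
```

Evaluating `ceac_point` over a grid of WTP values traces the acceptability curve; with only two strategies, the frontier follows whichever strategy has the higher expected net benefit at each WTP.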

Results

The mean costs of the A&F module and the status quo scenario were $1,480,850 and $1,124,661, respectively. The associated mean numbers of LVPs were 6,005 for the A&F module and 8,228 for the status quo scenario. Implementation of the A&F module is therefore associated with a reduction of approximately 2,223 LVPs. The ICER for the A&F module versus the status quo scenario was $160 per LVP avoided (Table 3). The results of the PSA plotted on a cost-effectiveness plane (Fig. 2) show that most of the points in the scatter plot are located in the northeast quadrant (more costly and more effective), indicating that the A&F module has the potential to be cost-effective depending on the decision maker’s willingness-to-pay (WTP). The cost-effectiveness acceptability frontier indicates that the A&F module is cost-effective in 50% of our iterations at a WTP of $160 per LVP avoided (Fig. 3).
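The reported ICER can be checked directly from the mean costs and mean numbers of LVPs given above. The function name is introduced here for illustration; the inputs are the figures reported in the text.

```python
def icer(cost_new, cost_old, lvps_new, lvps_old):
    """Incremental cost per LVP avoided (new strategy versus old)."""
    delta_cost = cost_new - cost_old
    lvps_avoided = lvps_old - lvps_new
    if lvps_avoided == 0:
        raise ValueError("No difference in effectiveness; ICER is undefined.")
    return delta_cost / lvps_avoided


if __name__ == "__main__":
    # Reported means: A&F module versus status quo scenario.
    value = icer(1_480_850, 1_124_661, 6_005, 8_228)
    print(round(value))  # 160
```

The incremental cost is $356,189 and the number of LVPs avoided is 2,223, which gives approximately $160 per LVP avoided, matching the reported estimate.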

Fig. 2. Probabilistic sensitivity analysis comparing the A&F module and the status quo scenario (no A&F module targeting LVPs). The x-axis represents the incremental effectiveness (number of LVPs avoided). The y-axis represents the incremental costs between the A&F module and the status quo scenario. Each circle represents a single simulation, for a total of 10,000 simulations.

Fig. 3. Cost-effectiveness acceptability curve (CEAC) for the A&F module versus the status quo scenario, and cost-effectiveness acceptability frontier (CEAF). The x-axis represents the willingness-to-pay (WTP) for each LVP avoided. The y-axis represents the percentage of simulations in which the A&F module is cost-effective relative to the status quo scenario at different WTP thresholds. The switch point at which the A&F module became the cost-effective option corresponds to $160 per LVP avoided, equal to the ICER estimate. The A&F module was cost-effective in 100% of simulations at a WTP of $1000 per LVP avoided. The A&F module had the highest expected net benefit for all values of WTP greater than the ICER. At a WTP equal to our ICER estimate, the A&F module was cost-effective in 50% of simulations.

Adding a virtual facilitation visit to the A&F module (one visit per A&F cycle) would reduce the estimated ICER (improve its cost-effectiveness profile compared to the base case scenario) to $108 per LVP avoided (Table  4 ). More frequent feedback (monthly) is associated with a slight improvement in its cost-effectiveness profile ($154 per LVP avoided). The A&F module is more cost-effective ($48 per LVP avoided) when only high-volume trauma centers are considered for the implementation of the module. Similarly, an increase in the costs of LVPs by 100% and a longer time horizon would lead to a reduction in the ICER to $10 and $106 per LVP avoided, respectively. On the other hand, a discount rate of 5% increases the ICER to $199 per LVP avoided (Table  4 ).

Discussion

The results of this early economic evaluation suggest that the addition of an A&F module targeting LVPs to a provincial trauma quality assurance program over a time horizon of 6 years is associated with an ICER of $160 per LVP avoided. In analyses that simultaneously accounted for uncertainty in all key model parameters, 50% of simulations were cost-effective at a WTP of $160 per LVP avoided. The A&F module is more cost-effective with the addition of facilitation visits, with more frequent feedback, and if restricted to high-volume trauma centers.

Our study fills a major knowledge gap on the potential cost-effectiveness of A&F interventions to de-implement low-value care. Comparison of our results with the literature on acute trauma care is difficult because no studies have assessed the cost-effectiveness of A&F interventions in the context of acute injury care. However, a 2022 systematic review on the economic value of A&F interventions in various health areas summarized the results of 35 studies that compared different A&F strategies targeting health professionals’ compliance with desired practices or patient health outcomes [ 15 ]. The results of this systematic review mirror our findings. Of 14 cost-effectiveness analyses based on changes in compliance with desired practice from the public healthcare payer perspective, 12 (86%) found that the A&F interventions were more costly but more effective than the comparator [ 15 ]. In studies assessing de-implementation of LVPs, A&F interventions were associated with a reduction in the overuse of LVPs and had the potential to be cost-effective [ 34 , 35 ]. Four (28%) studies included in the review conducted simulations to assess the influence of A&F characteristics on cost-effectiveness in scenario analyses [ 34 , 35 , 36 , 37 ]. Despite having a different comparator (a do-nothing scenario), these studies also observed improved cost-effectiveness when facilitation visits were added to the A&F intervention [ 35 ] and when the time horizon of the intervention was increased from 4 to 9 months [ 34 ]. In addition, our study provides evidence that the cost-effectiveness of A&F interventions may be improved by increasing feedback frequency and restricting the intervention to high-volume hospitals.

Strengths and limitations

Our study is based on effectiveness parameters from a meta-analysis of over 140 RCTs on the effectiveness of A&F interventions in different healthcare settings [ 11 ], on observed data on the frequency of LVPs [ 24 ], and on costs based on a mature, province-wide quality assurance program (https://www.donneesquebec.ca/recherche/dataset/as-471-rapports-financiers-annuels-des-etablissements). In addition, we conducted extensive sensitivity analyses and a range of scenario analyses to evaluate the robustness of our results and to assess the influence of key A&F characteristics on the cost-effectiveness profile of the module. Despite these strengths, our results should be interpreted within the context of the study’s limitations.

First, our evaluation is based on estimates of effectiveness from a meta-analysis published in 2012. While this study represents the most up-to-date evidence synthesis available (the Cochrane review is yet to be updated [ 38 ] and systematic reviews published more recently have not included meta-analyses [ 39 ]), it does not include the most recent evidence. Furthermore, while we used estimates specific to de-implementation interventions from the review, none of the studies were specific to trauma, none evaluated an intervention delivered in the context of accreditation, and none compared a de-implementation module in a system with an A&F intervention already in place. In addition, risk of bias was low for only 31% of included studies. The estimate used may therefore underestimate or overestimate the true effectiveness.

Second, we conducted an early economic evaluation to assess whether a hypothetical A&F module could be cost-effective and, if so, under which conditions. As such, our results provide encouragement that the true ICER of the intervention, were it to be designed and implemented in the Québec trauma system, might be promising as well. We used a broad range of scenarios and parameter values within our probabilistic sensitivity analyses but attributed the same weight to all scenarios analyzed. The base-case scenario will not necessarily be the one that is implemented. However, intervention costs were based on resources currently used in the Québec Trauma Care Continuum, and opportunity costs related to LVPs avoided were based on observed baseline frequencies. We plan to conduct an economic evaluation after our cluster randomized trial (funded and currently underway) to assess the true (observed) cost-effectiveness of the intervention in a pragmatic setting.

Third, opportunity costs of LVPs avoided did not fully account for physician fees and were based on unit costs that are known to underestimate true costs. In addition, we did not account for potential downstream resource use following LVPs, for example, re-imaging due to uncertain findings or treatment of clinically nonsignificant incidental findings. Our scenario analysis in which costs associated with LVPs were increased by 100% probably better reflects the Québec Ministry of Health perspective; the large decrease in the ICER ($160 to $10) suggests that opportunity costs related to LVPs are an important determinant of the cost-effectiveness of an A&F module targeting de-implementation. Furthermore, we only considered direct healthcare costs associated with the two competing strategies and did not factor in indirect costs of LVPs (e.g., time off work for patients), which would also have led to an underestimation of the intervention’s cost-effectiveness.

Fourth, our study is based on a single-payer healthcare model, and it is uncertain whether our findings would be applicable to jurisdictions with other payer systems. Also, physicians in Canada are paid on a fee-for-service basis that is periodically negotiated [ 40 ], so our results depend on current unit costs in our system and may not apply well in non-universal health systems or in jurisdictions with different structures.

Fifth, in the absence of evidence indicating otherwise, our base case scenario was based on the strong assumption that effectiveness was the same for all 6 indicators. However, in probabilistic sensitivity analyses, we allowed the effectiveness for each LVP to vary independently.

Sixth, we were unable to account for uncertainty in the cost estimates of implementing the A&F module, which were derived from expert consultation and which we anticipate may have been underestimated.

Finally, we deliberately focused on adherence to desired practice (LVPs avoided) rather than health outcomes (e.g., adverse events) due to a lack of available data on the utility/disutility of LVPs for trauma patients. Nevertheless, a strong argument can be made for measuring LVPs avoided to assess the quality of our A&F intervention, as they relate most closely to actions that are within the control of healthcare professionals. Indeed, economic evaluations of similar A&F interventions have obtained more meaningful results with similar intermediate outcomes than with Quality-Adjusted Life Years (QALYs) [ 41 ]. Studies have also demonstrated that reducing LVPs will reduce physical harms and adverse events [ 42 , 43 , 44 ]. However, the focus on LVPs avoided probably led to an underestimation of the true cost-effectiveness of our A&F module, as health outcomes or negative health consequences of LVPs are not considered in the measure of effectiveness [ 15 ].

Potential impact

The outcome parameter used in the decision model (LVPs avoided) is unique and does not have an explicit cost-effectiveness threshold associated with it. Therefore, the decision to invest in the intervention will be based on the decision makers’ willingness-to-pay, i.e., would they be prepared to invest $160 per LVP avoided? However, the decision should also be based on other considerations: opportunity costs are likely to be greater than those estimated, and cost-effectiveness may be increased if virtual facilitation visits are added, if the frequency of feedback is increased, and if the intervention is restricted to high-volume trauma centers (level I and II). The intervention also has the potential to raise general awareness of healthcare overuse and therefore to decrease other LVPs [ 24 ].

Conclusion

Our economic evaluation suggests that an A&F module targeting de-implementation, integrated into a provincial quality-assurance program, has a high potential to reduce LVPs while increasing total healthcare costs, with an ICER of $160 per LVP avoided. Results suggest that virtual facilitation visits, more frequent feedback, and implementing the intervention only in high-volume centers increase cost-effectiveness. However, the economic potential of the module is likely underestimated in this study because opportunity costs were underestimated (costs of LVPs) or not accounted for (indirect costs, health outcomes, and long-term consequences). The findings of the present study may inform the development of A&F interventions targeting de-implementation, and they demonstrate the feasibility of conducting early economic evaluations to inform optimal A&F intervention design.

Availability of data and materials

Quebec Trauma Registry is subject to a third-party restriction (Quebec Ministry of Health and Social Services).

Abbreviations

LVP: Low-value practice

A&F: Audit and feedback

RCT: Randomized controlled trial

WTP: Willingness-to-pay

CT: Computed tomography

IQR: Interquartile range

CEAC: Cost-effectiveness acceptability curve

CEAF: Cost-effectiveness acceptability frontier

PSA: Probabilistic sensitivity analysis

References

Choosing Wisely. Home page. Accessed February 21, 2022. https://www.choosingwisely.org/ .

Brownlee S, Chalkidou K, Doust J, et al. Evidence for overuse of medical services around the world. Lancet. 2017;390(10090):156–68.

Canadian Institute for Health Information. Unnecessary care in Canada: technical report. Ottawa, ON: CIHI; 2017.

Lown Institute. Low-value care. https://lowninstitute.org/lown-issues/low-value-care/ . Accessed 20 Dec 2022.

Berwick DM, Hackbarth AD. Eliminating waste in US health care. JAMA. 2012;307(14):1513–6.

Boat TF, Chao SM, O’Neill PH. From waste to value in health care. JAMA. 2008;299(5):568–71.

Reilly BM, Evans AT. Much ado about (doing) nothing. Ann Intern Med. 2009;150(4):270–1.

Hauser CJ, Adams CA Jr, Eachempati SR. Surgical Infection Society guideline: prophylactic antibiotic use in open fractures: an evidence-based guideline. Surg Infect. 2006;7(4):379–405.

Niven DJ, Mrklas KJ, Holodinsky JK, et al. Towards understanding the de-adoption of low-value clinical practices: a scoping review. BMC Med. 2015;13:255.

World Health Organization. Using audit and feedback to health professionals to improve the quality and safety of health care. https://apps.who.int/iris/handle/10665/332014 . Accessed 20 Dec 2022.

Ivers N. Optimizing Audit and Feedback Interventions to Improve Quality in Primary Care. 2014.

Flottorp S, Jamtvedt G, Gibis B, McKee M. Using audit and feedback to health professionals to improve the quality and safety of health care. http://www.euro.who.int/__data/assets/pdf_file/0003/124419/e94296.pdf . Published 2010. Accessed.

Soleymani F, Rashidian A, Dinarvand R, Kebriaeezade A, Hosseini M, Abdollahi M. Assessing the effectiveness and cost-effectiveness of audit and feedback on physician’s prescribing indicators: study protocol of a randomized controlled trial with economic evaluation. Daru. 2012;20(1):88–88.

Avery AJ, Rodgers S, Cantrill JA, et al. A pharmacist-led information technology intervention for medication errors (PINCER): a multicentre, cluster randomised, controlled trial and cost-effectiveness analysis. Lancet. 2012;379(9823):1310–9.

Moore L, Guertin JR, Tardif P-A, et al. Economic evaluations of audit and feedback interventions: a systematic review. BMJ Qual Saf. 2022;31(10):754.

Squires JE, Sullivan K, Eccles MP, Worswick J, Grimshaw JM. Are multifaceted interventions more effective than single-component interventions in changing health-care professionals’ behaviours? An overview of systematic reviews. Implement Sci. 2014;9(1):152.

Fitch K, Bernstein JS, Aguilar MD, et al. The RAND/UCLA Appropriateness Method User's Manual. https://www.rand.org/pubs/monograph_reports/MR1269.html . Published 2001. Accessed 17 Sept 2018.

Heyworth J. Trauma services in a district general hospital. BMJ (Clinical research ed). 1990;300(6728):876–7.

CADTH. Guidelines for the economic evaluation of health technologies: Canada. 4th edition. Ottawa: CADTH; 2017.

Husereau D, Drummond M, Augustovski F, et al. Consolidated Health Economic Evaluation Reporting Standards 2022 (CHEERS 2022) statement: updated reporting guidance for health economic evaluations. BMJ (Clinical research ed). 2022;376:e067975.

Ministère de la Santé et des Services sociaux. Trajectoire - Traumatologie. https://www.msss.gouv.qc.ca/professionnels/soins-et-services/guide-urgences-trajectoire-traumatologie/ . Published 2022. Accessed 13 Sept 2022.

Developing a framework and research agenda for overuse and appropriateness measures. Agency for Healthcare Research and Quality Appropriateness Small Conference. Published 2009. Accessed 17 Sept 2018.

Moore L, Bérubé M, Tardif PA, et al. Quality Indicators targeting low-value clinical practices in trauma care. JAMA Surg. 2022;157(6):507–14.

Moore L, Bérubé M, Tardif PA, et al. Validation of quality indicators targeting low-value trauma care. JAMA Surg. 2022;157(11):1008–16.

Stelfox HT, Straus SE. Measuring quality of care: considering measurement frameworks and needs assessment to guide quality indicator development. J Clin Epidemiol. 2013;66(12):1320–7.

Ivers N, Jamtvedt G, Flottorp S, et al. Audit and feedback: effects on professional practice and healthcare outcomes. Cochrane Datab Syst Rev. 2012;6:Cd000259.

Hozo SP, Djulbegovic B, Hozo I. Estimating the mean and variance from the median, range, and the size of a sample. BMC Med Res Methodol. 2005;5(1):13.

Wan X, Wang W, Liu J, Tong T. Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range. BMC Med Res Methodol. 2014;14(1):135.

Conombo B, Guertin JR, Hoch JS, et al. Potential avoidable costs of low-value clinical practices in acute injury care in an integrated Canadian provincial trauma system. JAMA Surg. 2023;158(9):977–9.

Canby JBT. Applying activity-based costing to healthcare settings. Healthc Financ Manag. 1995;49(2):50–2, 54–6.

Chan YC. Improving hospital cost accounting with activity-based costing. Health Care Manage Rev. 1993;18(1):71–7.

Guyatt GH, Oxman AD, Kunz R, et al. Incorporating considerations of resources use into grading recommendations. BMJ. 2008;336(7654):1170–3.

Fenwick E, Claxton K, Sculpher M. Representing uncertainty: the role of cost-effectiveness acceptability curves. Health Econ. 2001;10(8):779–87.

Ling R, Giles M, Searles A. Administration of indwelling urinary catheters in four Australian Hospitals: cost-effectiveness analysis of a multifaceted nurse-led intervention. BMC Health Serv Res. 2021;21(1):897.

Gandjour A, Lauterbach KW. When is it worth reducing overuse of health care services? The example of prescribing expensive antihypertensives. Med Klin (Munich). 2005;100(9):535–41.

Barasa EW, Ayieko P, Cleary S, English M. A multifaceted intervention to improve the quality of care of children in district hospitals in Kenya: a cost-effectiveness analysis. PLoS Med. 2012;9(6):e1001238.

Huis A, Schoonhoven L, Grol R, Donders R, Hulscher M, van Achterberg T. Impact of a team and leaders-directed strategy to improve nurses’ adherence to hand hygiene guidelines: a cluster randomised trial. Int J Nurs Stud. 2013;50(4):464–74.

Crawshaw J, Meyer C, Antonopoulou V, et al. Identifying behaviour change techniques in 287 randomized controlled trials of audit and feedback interventions targeting practice change among healthcare professionals. Implement Sci. 2023;18(1):63.

Dunsmore J, Duncan E, MacLennan S, N’Dow J, MacLennan S. Effectiveness of de-implementation strategies for low-value prescribing in secondary care: a systematic review. Implement Sci Commun. 2023;4(1):115.

Ridic G, Gleason S, Ridic O. Comparisons of health care systems in the United States, Germany and Canada. Mater Sociomed. 2012;24(2):112–20.

Elliott RA, Putman KD, Franklin M, Annemans L, et al. Cost effectiveness of a pharmacist-led information technology intervention for reducing rates of clinically important errors in medicines management in general practices (PINCER). (Electronic ISSN 1179-2027).

Elliott RA, Putman KD, Franklin M, et al. Cost effectiveness of a pharmacist-led information technology intervention for reducing rates of clinically important errors in medicines management in general practices (PINCER). Pharmacoeconomics. 2014;32(6):573–90.

Johri M, Ng ESW, Bermudez-Tamayo C, Hoch JS, Ducruet T, Chaillet N. A cluster-randomized trial to reduce caesarean delivery rates in Quebec: cost-effectiveness analysis. BMC Med. 2017;15(1):96.

Rodriguez-Martinez CE, Sossa-Briceño MP, Castro-Rodriguez JA. Cost-effectiveness of the utilization of “good practice” or the lack thereof according to a bronchiolitis evidence-based clinical practice guideline. J Eval Clin Pract. 2019;25(4):682–8.

Acknowledgements

We thank Natalie Yanchar, Éric Mercier, Jérôme Paquet, Tarek Razek, Martin Lesieur, Paule Lessard Boneaventure and Christine Rizzo for their role as members of the advisory committee.

Role of funder/sponsor

The funders had no role in developing this manuscript.

This research was supported by the Canadian Institutes of Health Research (Foundation grant, #353374).

Author information

Authors and affiliations.

Department of Social and Preventative Medicine, Université Laval, Québec City, Québec, Canada

Blanchard Conombo, Jason R. Guertin & Lynne Moore

Population Health and Optimal Health Practices Research Unit, Trauma – Emergency – Critical Care Medicine, Quebec University Hospital, Centre de Recherche du CHU de Québec-Université Laval, 1401, 18e Rue, Local H-012a, Québec City, Québec, G1J 1Z4, Canada

Blanchard Conombo, Mélanie Bérubé, Simon Berthelot, Alexis F. Turgeon, Amina Belcaid & Lynne Moore

Division of Health Policy and Management, Department of Public Health Sciences, University of California at Davis, Davis, CA, USA

Jeffrey S. Hoch

Department of Medicine, University of Ottawa, Ottawa, ON, Canada

Jeremy Grimshaw

Faculty of Nursing, Université Laval, Québec City, Québec, Canada

Mélanie Bérubé & Christian Malo

Department of Family Medicine and Emergency Medicine, Université Laval, Québec City, Québec, Canada

Simon Berthelot & Patrick Archambault

Centre de Recherche Intégrée Pour Un Système Apprenant en Santé Et Services Sociaux, Centre Intégré de Santé Et de Services Sociaux de Chaudière-Appalaches, Lévis, Québec, Canada

Simon Berthelot

Department of Anesthesiology and Critical Care Medicine, Division of Critical Care Medicine, Université Laval, Québec City, Québec, Canada

François Lauzier & Alexis F. Turgeon

Department of Critical Care Medicine, Medicine and Community Health Sciences, O’Brien Institute for Public Health, University of Calgary, Calgary, AB, Canada

Henry T. Stelfox

VITAM-Centre de Recherche en Santé Durable, Québec City, Québec, Canada

Patrick Archambault

Contributions

BC, LM, JRG, JSH, HTS, PA, SB, FL, AFT, JG, MB, CM and AB developed the original research concept and the study design. BC, LM, FL, AFT, JSH, HTS, AB, MB and JRG contributed to data acquisition, analysis, or interpretation. JH, HTS and SB oversaw the analysis and provided feedback. BC and LM developed the draft manuscript. All authors made substantial contributions to manuscript development and critical revision, and approved the final version of the manuscript.

Corresponding author

Correspondence to Lynne Moore.

Ethics declarations

Ethics approval and consent to participate.

This study was approved by the CHU-de Québec – Université Laval research ethics committee, and patient consent was not required.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary material 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Conombo, B., Guertin, J.R., Hoch, J.S. et al. Implementation of an audit and feedback module targeting low-value clinical practices in a provincial trauma quality assurance program: a cost-effectiveness study. BMC Health Serv Res 24, 479 (2024). https://doi.org/10.1186/s12913-024-10969-2

Received: 09 October 2023

Accepted: 09 April 2024

Published: 18 April 2024

DOI: https://doi.org/10.1186/s12913-024-10969-2

  • Audit and feedback
  • Cost-effectiveness analysis
  • Low-value care

BMC Health Services Research

ISSN: 1472-6963

Int J Environ Res Public Health

Benefits of Participation in Clinical Trials: An Umbrella Review

Amira Bouzalmate-Hajjaj

1 Department of Preventive Medicine and Public Health, Faculty of Medicine, University of Granada, 18016 Granada, Spain

Paloma Massó Guijarro

2 Preventive Medicine Unit, Universitary Hospital Virgen de las Nieves, 18014 Granada, Spain

3 Instituto de Investigación Biosanitaria de Granada (IBS.GRANADA), 18012 Granada, Spain

Khalid Saeed Khan

4 CIBER de Epidemiología y Salud Pública (CIBERESP-Spain), 28029 Madrid, Spain

Aurora Bueno-Cavanillas

Naomi Cano-Ibáñez

Associated Data

Not applicable.

Participation in randomised clinical trials (RCTs) entails taking part in the discovery of effects of health care interventions. The question of whether participants’ outcomes are different to those of non-participants remains controversial. This umbrella review was aimed at assessing whether there are health benefits of participation in RCTs, compared to non-participation. After prospective registration (PROSPERO CRD42021287812), we searched the Medline, Scopus, Web of Science and Cochrane Library databases from inception to June 2022 to identify relevant systematic reviews with or without meta-analyses. Data extraction and study quality assessment (AMSTAR-2) were performed by two independent reviewers. Of 914 records, six systematic reviews summarising 380 comparisons of RCT participants with non-participants met the inclusion criteria. In two reviews, the majority of comparisons were in favour of participation in RCTs. Of the total of comparisons, 69 (18.7%) were in favour of participation, reporting statistically significant better outcomes for patients treated within RCTs, 264 (71.7%) comparisons were not statistically significant, and 35 (9.5%) comparisons were in favour of non-participation. None of the reviews found a harmful effect of participation in RCTs. Our findings suggest that taking part in RCTs may be beneficial compared to non-participation.

1. Introduction

Patients participating in randomised clinical trials (RCTs) take part in discovering the effects of healthcare interventions. Eligible participants enrol in RCTs voluntarily in the hope that, in addition to the possibility of improving their own health, their participation will benefit the health of future patients. In fact, given that RCT implementation requires approval by an ethics committee, requires oversight with regard to compliance with the protocol, and involves the support of extra research staff in the monitoring of care, it is likely that this surveillance and additional healthcare might result in the accrual of benefits compared to the usual care provided, regardless of the study arm allocation [ 1 ]. However, whether their outcomes are different to those of non-participants remains controversial [ 2 , 3 , 4 , 5 , 6 ].

Informed consent forms offered to patients before their enrolment into RCTs provide information about potential benefits and risks [ 7 ], but not those of participation per se, even for the control group. The successful recruitment of patients relies on active and personalised strategies [ 8 ] and depends on the confidence of patients and health professionals regarding the benefits and safety of RCTs. A recent review showed that the decision to participate in a surgical trial is influenced by patients’ abilities to make sense of the trial and trial processes, to weigh the risks and benefits of the treatment options, and to trust in the RCT staff [ 9 ]. In a meta-analysis of barriers to cancer clinical trial participation, physician and patient decision-making was identified as the reason for not enrolling by one out of four patients, beyond trial availability or clinical ineligibility [ 10 ]. In a cross-sectional study on attitudinal discordance between cancer patients and clinicians/research providers regarding RCT participation, patients more frequently reported negative beliefs, such as the belief that participation did not help patients personally (32.9% vs. 1.8%, p < 0.001), although they were more confident regarding the benefit-risk ratio (57% vs. 44%, p = 0.03) and less concerned about treatment toxicity (18% vs. 60%, p = 0.006) and randomisation or receiving a placebo (27% vs. 71%, p = 0.005) [ 11 ]. In a qualitative study on participation in oncological therapy RCTs, health professionals reported that misconceptions based on negative beliefs and attitudes towards research were the main patient-level barriers to participation [ 12 ]. According to one review, uncertainty about the risk-benefit ratio of clinical trial participation may magnify the perceived likelihood of suffering an adverse event and reduce patients’ willingness to participate. It may also make clinicians, especially oncologists, reluctant to offer their patients the opportunity to enrol in a clinical trial, so as not to jeopardise their long-term therapeutic relationship [ 13 ].

A patient and public involvement (PPI) approach to the trial development process, from the formulation of research questions to the dissemination of results, may help staff build trusting relationships with potential participants and foster mutual commitment [ 14 , 15 ]. If it can be demonstrated that participating in RCTs improves health status, this would encourage volunteers to take part in research and enable health professionals to be confident about inviting patients to engage in trials [ 16 , 17 ]. Evidence regarding the benefits of participating in RCTs may help to interpret the generalisability of research findings, aiding in the implementation of new interventions in clinical practice and healthcare policy [ 18 ]. In this umbrella review, we aimed to determine if there was a health benefit (outcome) among eligible people (population) from participation in RCTs (intervention) vs. non-participation (comparison group).

2. Materials and Methods

We performed this umbrella review after prospective registration (PROSPERO number: CRD42021287812) and reported it in accordance with the relevant guidelines [ 19 , 20 ]. We also adhered to the reporting guidelines for overviews of reviews of healthcare interventions (PRIOR) [ 21 ].

2.1. Literature Search and Selection

We conducted a sensitive literature search without language restrictions in electronic databases (Medline, Scopus, Web of Science and the Cochrane Library) from inception to June 2022. We used a combination of keywords and terms including “participation”, “non-participants”, “systematic reviews”, “meta-analysis”, “health changes”, “health status improvement”, “harmful”, and “randomized controlled trials”. All citations found were exported to EndNote software, where duplicates were removed. Two reviewers (ABH and PMG) carried out the search strategy independently using electronic databases and manual searches, and screened all abstracts and titles ( Table S1 ).

We included studies aimed at assessing benefits or hazards of participation in RCTs independently of the intervention or control group allocation, compared to similar non-participating patients receiving conventional care outside of trials. The exclusion criteria were: studies which did not report benefits or harmful effects in all participants; study designs other than systematic reviews or meta-analyses on RCT, i.e., narrative reviews and reviews on non-RCTs; and reviews on effectiveness comparing intervention groups versus control groups, without comparisons with those outside the RCT. Any disagreement regarding the inclusion of the citations was resolved by obtaining the opinion of a third researcher (NCI). We contacted authors to obtain full-text articles that were not available. Finally, the selection of records was based on an independent review of the full texts to ensure that the inclusion and exclusion criteria had been fulfilled.

2.2. Data Extraction and Risk-of-Bias Assessment

The characteristics of selected studies were extracted independently by two reviewers (ABH and PMG) after reading the full text. We used a predefined form for data extraction, including citation details (author and year); objectives; characteristics and number of participants; the number of databases sourced and searched; the date range of the database search; the publication date range of studies included in the review that informed each outcome of interest; the instrument used to appraise the primary studies; and the ratings of their quality, comparator, type of intervention, and outcomes reported that were relevant to the umbrella review question.

The quality of the included systematic reviews was independently assessed by two reviewers (ABH and NCI). We chose the 16-item questionnaire “A Measurement Tool to Assess Systematic Reviews” (AMSTAR-2) [ 22 , 23 ] because it is used more extensively than other tools to assess quality in umbrella reviews [ 24 ]. Disagreements were resolved via consultation with a third reviewer (PMG). According to the guidelines, the reviewers assigned one of four global quality ratings (i.e., high, moderate, low or critically low) after the consideration of 16 potential critical and noncritical weaknesses. High and moderate ratings reflected the presence of no or one noncritical weakness and of more than one noncritical weakness, respectively, whereas low and critically low ratings indicated one critical weakness and more than one critical weakness, respectively.
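The global rating rule can be illustrated with a short Python sketch. It follows the published AMSTAR-2 guidance (high: no or one noncritical weakness; moderate: more than one noncritical weakness; low: one critical flaw; critically low: more than one critical flaw). The function name and its two count inputs are assumptions introduced here for illustration; they are not part of the AMSTAR-2 tool or of the authors' workflow.

```python
def amstar2_overall_rating(critical_flaws: int, noncritical_weaknesses: int) -> str:
    """Map counts of weaknesses to an AMSTAR-2 style global rating.

    Hypothetical helper: the inputs are simple counts that a reviewer
    would derive after judging the 16 AMSTAR-2 items.
    """
    if critical_flaws < 0 or noncritical_weaknesses < 0:
        raise ValueError("counts must be non-negative")
    # Critical flaws dominate the rating, whatever else is present.
    if critical_flaws > 1:
        return "critically low"
    if critical_flaws == 1:
        return "low"
    # With no critical flaw, the number of noncritical weaknesses decides.
    if noncritical_weaknesses > 1:
        return "moderate"
    return "high"


if __name__ == "__main__":
    for flaws, weaknesses in [(0, 0), (0, 1), (0, 3), (1, 2), (2, 0)]:
        print(flaws, weaknesses, amstar2_overall_rating(flaws, weaknesses))
```

In practice, reviewers still have to judge which items count as critical for a given review question; the sketch only encodes the final mapping from counts to a rating.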

2.3. Data Synthesis

The extracted data in each review were structured according to the PICO framework, noting the participant characteristics, intervention, comparator and outcome of each study. The findings were tabulated, including the overall number of RCTs and participants, the number of studies in favour and not in favour of participation [ 25 ] and whether meta-analysis and heterogeneity assessments were performed.

We also calculated the corrected covered area (CCA), a validated method of quantifying the degree of overlap between two or more reviews to help the decision process. CCA is expressed as a percentage, and is calculated as (N − r)/(rc − r), where N is the number of publications included in the evidence synthesis, r is the number of rows and c is the number of columns. Overlap is categorised as very high (CCA >15%), high (CCA 11–15%), moderate (CCA 6–10%) or slight (CCA 0–5%) [ 26 ]. In overlapping cases, we planned to give preference to the most recent review that had the highest quality (AMSTAR-2 assessment), supplied pooled-effect estimates or conducted a meta-analysis and had the highest number of studies or participants [ 27 ].
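As a worked illustration of the CCA formula, the following Python sketch computes the percentage and assigns the overlap category. It assumes that N counts every inclusion of a primary study across the reviews (a study included in two reviews counts twice), that r is the number of unique primary studies and that c is the number of reviews. The function names and the example values are hypothetical and are not the figures behind this review. Values that fall between the stated integer bands (for example 10.5%) are assigned to the higher band, which is also an assumption.

```python
def corrected_covered_area_pct(n_inclusions: int, n_rows: int, n_cols: int) -> float:
    """Return the CCA as a percentage: 100 * (N - r) / (r * c - r)."""
    if n_rows <= 0 or n_cols <= 1:
        raise ValueError("need at least one row and two columns")
    if n_inclusions < n_rows:
        raise ValueError("N cannot be smaller than the number of rows")
    return 100.0 * (n_inclusions - n_rows) / (n_rows * n_cols - n_rows)


def overlap_category(cca_pct: float) -> str:
    """Classify overlap using the thresholds quoted in the text."""
    if cca_pct > 15:
        return "very high"
    if cca_pct > 10:
        return "high"
    if cca_pct > 5:
        return "moderate"
    return "slight"


if __name__ == "__main__":
    # Hypothetical matrix: 10 unique primary studies, 6 reviews,
    # 15 inclusions in total across all reviews.
    cca = corrected_covered_area_pct(n_inclusions=15, n_rows=10, n_cols=6)
    print(f"CCA = {cca:.1f}% ({overlap_category(cca)})")
```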

3. Results

3.1. Selection, Characteristics and Quality of Studies

A total of 914 records were initially identified. Six articles met the eligibility criteria (292 RCTs, 380 unique comparisons). The database search periods of the included reviews ranged from 1880 [ 28 ] to 2017 [ 29 ]. Figure 1 displays a PRISMA flow diagram of the selection process. We have also provided a list of studies that might appear to meet the inclusion criteria but were excluded, with the main reason for their exclusion ( Table S2 ). The main characteristics of the selected reviews and meta-analyses are summarised in Table 1 .

Figure 1. Flow chart of selected reviews and meta-analyses evaluating the benefits of participation in clinical trials.

Table 1. Characteristics of the selected reviews and meta-analyses evaluating the benefits of participation in Randomized Clinical Trials (RCTs).

Evidence maps of effect direction, association strength, evidence certainty of RCT participation and its benefits, heterogeneity and whether meta-analysis was performed are provided in Table S3 . Among the six reviews included in this study, three performed meta-analyses [ 28 , 29 , 30 ]. The primary studies included in the reviews were conducted in different medical areas and covered a wide range of interventions, such as medical, surgical or counselling interventions. A moderate degree of overlap was found (CCA = 10%) ( Table S4 ). The quality was high in three reviews [ 29 , 30 , 31 ], low in one [ 28 ] and critically low in two ( Figure 2 ). Among the AMSTAR-2 criteria, the most frequent critical weaknesses identified were the lack of a comprehensive literature search strategy (50%) and an inappropriate investigation of publication bias (65%).

Figure 2. Quality assessment of the selected reviews and meta-analyses evaluating the benefits of participation in clinical trials using AMSTAR-2 (percentage of systematic reviews meeting the 16 items).

3.2. Synthesis of Findings

In two of the included reviews [ 31 , 32 ], the majority (>50%) of comparisons were in favour of participation in RCTs. Of the total number of comparisons included, 69 (18.7%) were in favour of participation, reporting statistically significantly better outcomes for patients treated within RCTs, and 264 (71.7%) comparisons were not statistically significant, whereas 35 (9.5%) comparisons were in favour of non-participation ( Figure 3 ). None of the reviews showed a harmful effect of participation in RCTs in their overall synthesis.

Figure 3. Results of the selected reviews and meta-analyses evaluating the benefits of participation in clinical trials.

3.3. Findings from the High-Quality Subgroup of Reviews

In a cancer review [ 31 ], 27/49 (55.1%) comparisons reported statistically significantly better outcomes in RCT participants, 12/49 (24.5%) comparisons were in favour of non-participation and 10/49 (20.4%) comparisons were non-significant. A meta-analysis comparing women’s health outcomes in obstetrics and gynaecology trials [ 29 ] found 3/21 (14.3%) comparisons in favour of participation, 1/21 (4.8%) comparisons in favour of non-participation and 17/21 (81%) non-significant comparisons. In another review regarding general medicine [ 30 ], a total of 11/141 (7.8%) comparisons were in favour of participation, reporting statistically significantly better outcomes, with fewer complications and relapses, for patients treated within RCTs, whereas 10/141 (7.1%) comparisons were in favour of non-participation and 120/141 (85.1%) comparisons were not significant. In addition, 3/37 (8.1%) comparisons found a lower risk of mortality for patients treated inside of RCTs, whereas the remaining 34 (91.9%) comparisons were not statistically significant ( Figure 4 ).

Figure 4. Results of the selected reviews and meta-analysis evaluating risk of mortality.

3.4. Findings of the Subgroup of Reviews with Low and Critically Low Quality

A review of cancer patients [ 32 ] reported 15/27 (55.6%) comparisons in favour of participation and 12/27 (44.4%) comparisons in favour of non-participation. In a general medicine review [ 28 ], 10/117 (8.5%) comparisons were in favour of participation, 9/117 (7.7%) were not in favour and 98/117 (83.8%) were statistically non-significant. Differences in mortality were not statistically significant either. In a review [ 33 ] focused on the safety of random treatment assignment, 3/25 (12%) comparisons were in favour of participation, 3/25 (12%) were not in favour of participation and 19/25 (76%) were not significant. In addition, in mortality and cancer recurrence, 50% of non-participants died or had a 4-year disease compared to 26% of participants.
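The per-review counts reported in Sections 3.3 and 3.4 can be tallied as a simple arithmetic check. The Python sketch below is only an illustration: the dictionary labels and function names are introduced here, the counts are transcribed from the text above, the zero non-significant comparisons for review [ 32 ] are inferred from 15 + 12 = 27, and the separate mortality-only comparisons (3/37) are left out.

```python
# Each tuple: (favours participation, favours non-participation, not significant).
# Counts transcribed from Sections 3.3 and 3.4; labels are illustrative only.
COUNTS = {
    "[31] cancer": (27, 12, 10),
    "[29] women's health": (3, 1, 17),
    "[30] general medicine": (11, 10, 120),
    "[32] cancer": (15, 12, 0),  # zero inferred: 15 + 12 already equals 27
    "[28] general medicine": (10, 9, 98),
    "[33] random assignment": (3, 3, 19),
}


def tally(counts):
    """Sum each outcome category across all reviews."""
    favour = sum(row[0] for row in counts.values())
    against = sum(row[1] for row in counts.values())
    neutral = sum(row[2] for row in counts.values())
    return favour, against, neutral


def percentage(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


if __name__ == "__main__":
    favour, against, neutral = tally(COUNTS)
    total = favour + against + neutral
    for label, n in [("favour participation", favour),
                     ("favour non-participation", against),
                     ("not significant", neutral)]:
        print(f"{label}: {n}/{total} ({percentage(n, total):.1f}%)")
```

Under these transcribed counts the tally gives 380 comparisons in total, with 69 favouring participation and 264 non-significant, matching the figures in Sections 3.1 and 3.2. The comparisons favouring non-participation sum to 47, whereas Section 3.2 reports 35; the text does not explain this difference.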

4. Discussion

Our findings suggest that taking part in RCTs may be beneficial compared to non-participation. This was observed across women’s health, cancer and general medicine RCTs, with evidence from 380 unique comparisons collated in the synthesis. None of the reviews found a harmful effect of participation in RCTs. However, there was underlying heterogeneity and, given the observational nature of the comparisons, the findings should be interpreted with caution.

To our knowledge, this is the first umbrella review focusing on the benefits of participation in RCTs vs. non-participation. Our search was unrestricted, without limitations regarding the language or time period covered in the databases, to capture the highest possible number of relevant records. There was good reviewer agreement in the search, selection and quality assessments of studies, strengthening the review’s reliability.

In a study by Braunholtz et al. [ 32 ], 14 articles reported data from 21 trials, and they concluded that randomised trials tended to have beneficial effects rather than harmful effects on the patients who participated. In addition, a study included in this review showed that survival rates were significantly higher for children within RCTs than for those who were not participating [ 34 ]. Similarly, a study comparing survival among cancer patients found better survival in RCT participants compared to patients treated outside of RCTs in the first year after diagnosis [ 35 ]. This can be better appreciated in a women’s health meta-analysis [ 29 ], in which trial participants compared with non-participants showed improved health outcomes on average. In a high-quality review [ 30 ], although in some cases non-participants showed a benefit, a larger number of comparisons reported significantly better outcomes, as well as a lower risk of mortality, in RCT participants. In another review [ 33 ], the same number of studies in favour of participation and in support of non-participation was found; therefore, it cannot be claimed that participants in clinical trials derive a clear and significant benefit. These findings closely resembled those in another review investigating patients with the same disease, treated inside and outside of RCTs [ 28 ], in which most of the studies found no statistically significant differences in terms of benefits or harms between participants and non-participants.

The evidence supporting the safety and possible benefits of participation was consistent with the findings of two meta-analyses focused on control group weight changes within lifestyle RCTs. The most recent study showed a slight intragroup weight loss [ 6 ]. In a previous study, control groups receiving the usual care lost weight compared to those receiving no intervention, whereas the rest of the control group participants receiving other healthcare protocols did not gain weight [ 5 ]. The authors suggested including in future RCTs patient information sheets about the likelihood of weight loss or at least a prevention of weight gain for control group participants [ 6 ]. These findings are in alignment with those of a previous review [ 36 ], in which it was found that most of the comparisons from cancer studies showed an association of trial participation with health benefits, with no evidence of harm. Thus, it has been suggested that the chance of obtaining benefits of participation in clinical trials should be acknowledged to encourage the enrolment of patients in intervention research [ 37 , 38 ]. Patient engagement in healthcare research is likely to be feasible in many settings, although it entails challenges such as the need for increased time and funding [ 39 , 40 ]. Given that randomised trials are necessary in order to provide reliable and high-quality evidence about the effects of clinical interventions [ 41 ], it is important to conduct properly designed trials with sufficient sample sizes. It is imperative to inform the eligible population about the benefits or hazards of interventions before their enrolment [ 42 , 43 , 44 ]. This can be understood as a chance to enhance the health outcomes of participants and to contribute to advances in treatment and healthcare, independently of which participation group they are allocated to. However, this does not imply that intervention studies carry no risk at all, as the hazards and benefits may vary significantly between studies. Understanding why participants exhibited an improved health status would have considerable implications, not only in the interpretation of intervention effects, but also in the design of future intervention trials.

Nevertheless, we acknowledge some methodological limitations. The six selected reviews provided limited evidence, mainly because of the heterogeneity in terms of the quality and size of the comparisons. Clinical variations in the nature of participants, interventions and outcomes can be a strength in terms of generalisability. However, statistical heterogeneity can mask the beneficial effect of trial participation or trial effects. Furthermore, it should be noted that the treatment effect, that is, differences due to interventions received inside instead of outside RCTs, as well as the presence of unmeasured differences in sociodemographic or clinical characteristics between participants in RCTs and non-participants, may affect the interpretation of our findings [ 45 , 46 ]. Given the limited number of reviews available on this topic, and the moderate overlap observed (CCA = 10%), we agreed not to remove any of the included records. This meant that comparisons from systematic reviews shared 10% of their primary studies. This proportion represented an acceptable level of redundancy [ 26 ].

5. Conclusions

Our findings suggest that taking part in RCTs may be beneficial compared to non-participation. Participation in clinical trials should be encouraged and its health impact needs to be addressed in further intervention research. We recommend systematically reporting a comparison between the outcomes amongst participants in RCTs, combining those assigned to control and intervention groups, and those not participating and receiving usual healthcare in a similar setting.

Supplementary Materials

The following supporting information can be downloaded at www.mdpi.com/xxx/s1: Table S1: Search strings in the umbrella review concerning the benefits of participation in clinical trials; Table S2: Excluded studies, with the main reason for their exclusion; Table S3: Results of the selected reviews and meta-analyses evaluating the benefits of participation in clinical trials; Table S4: Overlaps of the selected reviews and meta-analyses evaluating the benefits of participation in clinical trials.

Funding Statement

This research has received funding from the Ministry of Science and Innovation, Instituto de Salud Carlos III, FEDER co-funding from European Union (PI20/01532 project), and the Centro de Investigación Biomédica en Red-Epidemiología y Salud Pública (CIBERESP/CB06/02/1014).

Author Contributions

A.B.-C. and K.S.K. conceived the research question. A.B.-C., K.S.K. and P.M.G. designed the study; A.B.-H. and P.M.G. conducted the literature search, study selection and data extraction; N.C.-I. was the third reviewer. A.B.-H. and N.C.-I. performed the quality assessment, and P.M.G. was the third reviewer. All the authors contributed to the design of the figures, tables and appendices, as well as the drafts and the final version of the manuscript. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

K.S.K. is the co-author of a systematic review included in this article [ 29 ].

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

  • Published: 23 April 2024

Strategies for the delivery of sex-based equity in cardiovascular clinical trials

  • Julie Sanders   ORCID: orcid.org/0000-0002-7335-2803 1 , 2 ,
  • Tim Clayton   ORCID: orcid.org/0000-0002-1266-3288 3 ,
  • Stacey Matthews   ORCID: orcid.org/0000-0002-9465-9580 4 ,
  • Sarah Murray 5 , 6 , 7 ,
  • Lynn Laidlaw 3 ,
  • Richard Evans 3 &
  • Rochelle Wynne   ORCID: orcid.org/0000-0003-1814-3416 8  

Nature Reviews Cardiology (2024)

  • Medical research

The under-representation of women in cardiovascular clinical trials persists across participant, clinician and research roles. This gap perpetuates health inequity and hampers the generation, translation and implementation of optimal evidence-based care. Urgent action is needed to address barriers, promote diversity, and ensure inclusive trial design and health-care delivery and dissemination, for more equitable cardiovascular health.

Author information

Authors and affiliations.

St Bartholomew’s Hospital, Barts Health NHS Trust, London, UK

Julie Sanders

William Harvey Research Institute, Queen Mary University of London, London, UK

Clinical Trials Unit, Faculty of Epidemiology and Public Health, London School of Hygiene and Tropical Medicine, London, UK

Tim Clayton, Lynn Laidlaw & Richard Evans

National Heart Foundation of Australia, Melbourne, Victoria, Australia

Stacey Matthews

National Patient & Public Involvement (PPI), Leicester, UK

Sarah Murray

British Heart Foundation Clinical Research Collaborative, London, UK

National Institute of Cardiovascular Research (NICOR) Community Representation Group, London, UK

School of Nursing & Midwifery, Centre for Quality and Patient Safety Research, Institute for Health Transformation, Deakin University, Geelong, Victoria, Australia

Rochelle Wynne

Corresponding author

Correspondence to Julie Sanders.

Ethics declarations

Competing interests.

The authors declare no competing interests.

About this article

Cite this article.

Sanders, J., Clayton, T., Matthews, S. et al. Strategies for the delivery of sex-based equity in cardiovascular clinical trials. Nat Rev Cardiol (2024). https://doi.org/10.1038/s41569-024-01025-x

Published : 23 April 2024

DOI : https://doi.org/10.1038/s41569-024-01025-x

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

Quick links

  • Explore articles by subject
  • Guide to authors
  • Editorial policies

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

research articles clinical

IMAGES

  1. Journal of Clinical Trials and Research Template

    research articles clinical

  2. Journal of Clinical Medical Research

    research articles clinical

  3. Click here

    research articles clinical

  4. (PDF) How to write a laboratory-based case study for the Journal

    research articles clinical

  5. (PDF) The top cited clinical research articles on sepsis: A

    research articles clinical

  6. (PDF) How To Write A Scientific Article For A Medical Journal?

    research articles clinical

VIDEO

  1. 1-3- Types of Clinical Research

  2. Clinical Trials Registration & Results Reporting & Data Sharing Part 4 of 4

  3. CASE SERIES IN A MEDICAL JOURNAL| PUBLISHING ARTICLES IN MEDICAL JOURNALS

  4. WRITING CASE REPORT IN A MEDICAL JOURNAL| PUBLISHING ARTICLES IN MEDICAL JOURNALS

  5. December 2023 JVS Journal Club

  6. Who can write a scientific article? #research #shorts

COMMENTS

  1. The New England Journal of Medicine

    The New England Journal of Medicine (NEJM) is a weekly general medical journal that publishes new medical research and review articles, and editorial opinion on a wide variety of topics of ...

  2. JAMA

    JAMA - The Latest Medical Research, Reviews, and Guidelines. Home New Online Issues For Authors. Editor's Choice: AI Tools to Improve Access to Reliable Health Information. Antibiotic Stewardship TrialsApril 19, 2024Original Investigation Stewardship Prompts to Improve Antibiotic Selection for Pneumonia: The INSPIRE Randomized Clinical Trial ...

  3. The BMJ original medical research articles

    Vitamin D supplementation and major cardiovascular events. June 28, 2023. Original research studies that can improve decision making in clinical medicine, public health, health care policy, medical education, or biomedical research.

  4. Clinical Trials and Clinical Research: A Comprehensive Review

    Clinical research is an alternative terminology used to describe medical research. Clinical research involves people, and it is generally carried out to evaluate the efficacy of a therapeutic drug, a medical/surgical procedure, or a device as a part of treatment and patient management. Moreover, any research that evaluates the aspects of a ...

  5. Planning and Conducting Clinical Research: The Whole Process

    Abstract. The goal of this review was to present the essential steps in the entire process of clinical research. Research should begin with an educated idea arising from a clinical practice issue. A research topic rooted in a clinical problem provides the motivation for the completion of the research and relevancy for affecting medical practice ...

  6. Clinical trials

    Clinical trials articles from across Nature Portfolio. A clinical trial involves the study of the safety, efficacy and/or dosage regimen of a therapeutic intervention (such as a drug) in humans ...

  7. The Changing Face of Clinical Trials

    L.D. Fiore and P.W. Lavori. N Engl J Med 2016;374:2152-2158. Clinical trials of interventions in common practice can be built into the workflow of an electronic medical record. The authors review ...

  8. Recently Published

    R.M. Conti, R.G. Frank, and D.M. Cutler DOI:10.1056/NEJMp2313400. Perspective; Apr 20, 2024; Beyond Code Status

  9. 11 clinical trials that will shape medicine in 2022

    Jennifer Litton is the vice president of clinical research and a professor in the Breast Medical Oncology department, ... Cite this article. Arnold, C. 11 clinical trials that will shape medicine ...

  10. 11 clinical trials that will shape medicine in 2024

    This article has been updated. Nature Medicine asks leading researchers to name their top clinical trial for 2024, from base editing and a vaccine against HIV to artificial intelligence tools for ...

  11. Clinical Trials: Sage Journals

    Clinical Trials is dedicated to advancing knowledge on the design and conduct of clinical trials related research methodologies. Covering the design, conduct, analysis, synthesis and evaluation of key methodologies, the journal remains on the cusp of the latest topics, including ethics, regulation and policy impact.

  12. Transforming Clinical Research to Meet Health Challenges

    The COVID-19 pandemic made "clinical trials" a household phrase, highlighting the critical value of clinical research in creating vaccines and treatments and demonstrating the need for large-scale, well-designed, and rapidly deployed clinical trials to address the public health emergency. As the largest public funder of clinical trials, the ...

  13. Clinical research nursing and factors influencing success: a

    Clinical research delivery is a term increasingly used to describe the work undertaken to implement studies which explore and test prevention, diagnosis and treatment in healthcare. Such studies range from multi-site clinical trials to single site observational projects. Whilst widely acknowledged as fundamental to effective healthcare ...

  14. Conducting Clinical Research During the COVID-19 Pandemic

    March 2020 (updated May 14, 2020). US Dept of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and Research, Center for Biologics Evaluation and Research, Center for Devices and Radiological Health, Oncology Center of Excellence, Office of Good Clinical Practice. Accessed May 21, 2020. Docket No. FDA-2020-D ...

  15. Clinical Nursing Research: Sage Journals

    Clinical Nursing Research (CNR) is a leading international nursing journal, published eight times a year. CNR aims to publish the best available evidence from multidisciplinary teams, with the goal of reporting clinically applicable nursing science and phenomena of interest to nursing. Part of CNR's mission is to bring to light clinically applicable solutions to some of the most complex ...

  16. Clinical trials articles within Scientific Reports

    Read the latest Research articles in Clinical trials from Scientific Reports. ... Clinical trials articles within Scientific Reports. Featured. Article 17 April 2024 | Open Access.

  17. Articles

    Valve unit instead of intensive or intermediate care unit admission following transcatheter edge-to-edge mitral valve repair is safe and reduces postprocedural complications. Clinical Research in Cardiology is an international journal dedicated to clinical cardiovascular research. Publishes articles from the field of clinical ...

  18. Cancer

    F. Castinetti and F. Borson-Chazot. N Engl J Med 2023;389:1916-1918. Although medullary thyroid cancer accounts for less than 5% of thyroid cancers, it deserves attention because of its phenotypic ...

  19. CINAHL

    Clinical Queries is designed for clinician use and uses filters to limit retrieval to research-based citations on clinical topics or systematic reviews. Select specific query type: Therapy, Prognosis, Review, Qualitative, Causation (Etiology). As research may require different emphasis, three strategies are provided for each area.

  20. Critical Appraisal of Clinical Research

    Critical appraisal is the course of action for watchfully and systematically examining research to assess its reliability, value and relevance in order to direct professionals in their vital clinical decision making [1]. Critical appraisal is essential to: Continuing Professional Development (CPD).

  21. The Patient Knows Best: PROs in RA Practice and Research

    April 22, 2024. Patient-reported outcomes (PROs) in rheumatology are not just personal lists of physical complaints or so-called "organ recitals." In fact, PROs can both guide treatment ...

  22. Large language models approach expert-level clinical knowledge and

    Introduction. Generative Pre-trained Transformer 3.5 (GPT-3.5) and 4 (GPT-4) are large language models (LLMs) trained on datasets containing hundreds of billions of words from articles, books, and other internet sources [1, 2]. ChatGPT is an online chatbot which uses GPT-3.5 or GPT-4 to provide bespoke responses to human users' queries []. LLMs have revolutionised the field of natural language ...

  23. In the brain, bursts of beta rhythms implement cognitive control

    Bursts of brain rhythms with "beta" frequencies control where and when neurons in the cortex process sensory information and plan responses. Studying these bursts would improve understanding of cognition and clinical disorders, researchers argue in a new review.

  24. Clinical Research Article

    Clinical Research Article. 19 Mar 2024. Overexpression of miR-22-3p and miR-29c-3p in CFU-Hill colonies is related to senescence process among children with low birth weight. Paula R. P. Souza

  25. [2404.12827] CT-ADE: An Evaluation Benchmark for Adverse Drug Event

    Adverse drug events (ADEs) significantly impact clinical research and public health, contributing to failures in clinical trials and leading to increased healthcare costs. The accurate prediction and management of ADEs are crucial for improving the development of safer, more effective medications, and enhancing patient outcomes. To support this effort, we introduce CT-ADE, a novel dataset ...

  26. Mental Health and Clinical Psychological Science in the Time of COVID

    The article concludes by discussing implications for new research directions, clinical approaches, and policy issues. Footnote 1: The term developed over a period of years (1964 to 1991) that saw an increasing belief among clinical psychologists that there was a "fundamental incompatibility of the roles of scientist and professional within ...

  27. OpenAI's model all but matches doctors in assessing eye problems

    OpenAI's latest artificial intelligence model has almost matched expert doctors in analysing eye conditions, according to research that highlights the technology's potential in medicine. The ...

  28. Implementation of an audit and feedback module targeting low-value

    Developing a framework and research agenda for overuse and appropriateness measures. Agency for Healthcare Research and Quality Appropriateness Small Conference. Published 2009. Accessed 17 Sept 2018. Moore L, Bérubé M, Tardif PA, et al. Quality Indicators targeting low-value clinical practices in trauma care. JAMA Surg. 2022;157(6):507-14.

  29. Benefits of Participation in Clinical Trials: An Umbrella Review

    Evidence regarding the benefits of participating in RCTs may help to interpret the generalisability of research findings, aiding in the implementation of new interventions in clinical practice and healthcare policy . In this umbrella review, we aimed to determine if there was a health benefit (outcome) among eligible people (population) from ...

  30. Strategies for the delivery of sex-based equity in ...

    The under-representation of women in cardiovascular clinical trials persists across participant, clinician and research roles. This gap perpetuates health inequity and hampers the generation ...