Real-world evidence research based on big data

Motivation—challenges—success factors


  • Open access
  • Published: 07 June 2018
  • Volume 24, pages 91–98 (2018)


  • Benedikt E. Maissenhaelter 1,
  • Ashley L. Woolmore 2 &
  • Peter M. Schlag 3


Abstract

Background

In recent years there has been an increasing, and in part also critical, interest in understanding the potential benefits of generating real-world evidence (RWE) in medicine.

Objectives

The benefits and limitations of RWE in the context of randomized controlled trials (RCTs) are described, along with a view on how the two may complement each other as partners in the generation of evidence for clinical oncology. Moreover, challenges and success factors in building an effective RWE network of cooperating cancer centers are analyzed and discussed.

Material and methods

This article is based on a selective literature search (predominantly 2015–2017) combined with our practical experience to date in establishing European oncology RWE networks.

Results

RWE studies can be highly valuable and complementary to RCTs owing to their high external validity. If cancer centers successfully address the various challenges in establishing an effective RWE study network and in the subsequent execution of studies, they can efficiently generate high-quality research findings on treatment effectiveness and safety. Concerns pertaining to data privacy are of utmost importance and are discussed accordingly. Securing data completeness, accuracy, and a common data structure for routinely collected disease- and treatment-related data of patients with cancer is a challenging task that requires a high level of engagement from all participants in the process.

Conclusion

Given the prerequisites discussed, the analysis of comprehensive and complex real-world data in the context of a RWE study network represents an important and promising complementary partner to RCTs. This enables research into the general quality of cancer care and can permit comparative effectiveness studies across partner centers. Moreover, it will provide insights into a broader optimization of cancer care, refined therapeutic strategies for patient subgroups, as well as avenues for further research in oncology.

Summary

Background

In recent years, a growing and in part also quite critical interest in real-world evidence (RWE) in medicine has become apparent.

Objectives

The advantages and disadvantages of RWE are analyzed, particularly in the context of randomized controlled trials (RCTs), and it is discussed how RWE and RCTs can complement each other as partners in the generation of evidence in clinical oncology. Furthermore, the challenges and success factors for building a high-performing RWE network of cooperating cancer centers are outlined.

Material and methods

The current literature (predominantly 2015–2017) was searched selectively and combined with our own practical experience to date in establishing European oncology real-world evidence networks.

Results

Owing to their high external validity, RWE studies can be an extremely valuable complement to RCTs. If cancer centers carefully address the numerous challenges both in building a high-performing RWE network and in the subsequent conduct of studies, high-quality research findings on treatment effectiveness and safety can be obtained. The data protection requirements arising from the use of networked, extensive, and multilayered data sets (big data) must be considered a priori. Securing quality-checked, adequately complete, and uniformly structured oncological treatment and follow-up data is a demanding task that requires the active and responsible participation of all those involved.

Conclusions

Under the prerequisites outlined, the analysis of extensive and complex real-world data within the framework of a RWE network represents an important and promising complement to RCTs. It enables analysis of the general quality of oncological care and permits (quality) comparisons across different institutions. In addition, it can provide indications for a broad optimization of oncological treatment, for improved therapeutic strategies for patient subgroups, and suggestions for new fields of oncological research.


“We have entered the era of big data in healthcare” [ 12 ] and this era will transform medicine and especially oncology [ 13 , 24 ]. In this article, we focus on a specific aspect: how and under which conditions can real-world evidence (RWE) enrich and improve outcome research in oncology?

The U.S. Food and Drug Administration (FDA) defines RWE as “the clinical evidence regarding the usage, and potential benefits or risks, of a medical product derived from analysis of real-world data” [ 34 ]. The British Academy of Medical Sciences employs a similar definition: “the evidence generated from clinically relevant data collected outside of the context of conventional randomised controlled trials” [ 33 ]. Common to these definitions, and others, is the focus on evidence that is clinically relevant and that stems from routine clinical practice [ 32 ]. Our understanding of RWE is the technology-facilitated collation of all routinely collected information on patients from clinical systems to a comprehensive, homogeneously analysable dataset (big data) that reflects the treatment reality in the best possible and comparable manner.

In recent years there has been a growing interest in the potential benefits and the relevance of RWE studies [ 29 , 30 ]. For example, Tannock et al. recently pointed out in The Lancet Oncology that RWE studies enable “crucial insights into quality of care and effectiveness” [ 32 ]. In particular, key healthcare institutions have joined the scientific debate about when and how RWE studies can enrich our understanding of medical evidence [ 33 , 34 ]. At the same time, the high value of traditional randomized controlled trials (RCTs) should not be challenged and the necessity to conduct RWE studies at a high level of methodological and scientific rigor needs to be emphasized [ 18 ].

In oncology, the American Society of Clinical Oncology (ASCO) has recently published a research statement that discusses the potential of RWE and provides recommendations on how RWE may be utilized in conjunction with RCTs [ 35 ]. We will follow this line of reasoning and assume that RWE and RCTs are principally complementary approaches in clinical research. Consequently, it follows that RWE studies can also be a valuable tool for clinical research in oncology. In the following, we first discuss the benefits of RWE studies. Subsequently, we analyze a series of specific challenges and success factors in the establishment of a RWE study network that enables relevant research studies.

Strengths and weaknesses of RWE studies versus RCTs

Only a small proportion of cancer patients are recruited into RCTs, and those who participate are typically younger and have fewer comorbidities than those who are not included in RCTs. The inclusion and exclusion criteria of RCTs usually create idealized conditions, whereas RWE studies, by definition, provide insights into the routine clinical setting [ 3 ]. As a result, RWE studies may benefit from greater generalizability and external validity compared to RCTs [ 26 , 27 , 32 ].

RWE studies are complementary to RCTs in the generation of scientific evidence

The lack of generalizability of RCTs may contribute to a limited uptake of novel treatments despite positive evidence within RCTs [ 3 , 26 ]. This may be due to uncertainty about how this evidence may transfer to broader patient groups and how to integrate these treatments into routine practice [ 10 , 30 ].

In addition to a higher external validity, RWE studies have the potential to address a number of further limitations of RCTs. For example, RCTs often underestimate (long-term) toxicity, and they rarely, or only with a delay, explore certain research topics such as head-to-head comparisons of novel medications or interventions. Analyses of various clinical outcomes, in particular long-term and quality of life parameters, are addressed relatively infrequently [ 10 , 26 , 32 ]. Moreover, a substantial number of RCTs focus on surrogate parameters instead of clinically more relevant endpoints [ 10 , 32 ]. Therefore, RWE studies can be used in a supplemental manner to establish surveillance of new therapies and to enable analyses of the differential benefits of therapies in routine clinical care or by patient subgroup [ 35 ]. Finally, RCTs are relatively time and resource intensive [ 10 , 29 ]. RWE studies, on the other hand, promise to be conducted significantly faster and in a more resource-efficient manner, but only once the necessary structures have been established in the centers and institutions. The financing sources for RWE studies are principally the same as for RCTs.

These cautionary remarks are not intended to challenge the high value of RCTs, especially in the assessment of the efficacy of novel therapies. We believe, however, that RWE studies based on data already captured in clinical systems can yield important additional insights for research and clinical care if they are conducted at a high level of quality. To achieve this, sophisticated planning and careful execution are needed to dispel a number of concerns about RWE studies. The large amount of apparently available electronic data may mislead researchers into conducting studies without elaborate attention to a stringent study design. This includes RWE studies that do not properly attend to data quality and thereby run the risk of biased data [ 19 , 35 ]. It further includes RWE studies that conduct 'data dredging' in disregard of scientific principles [ 2 ] or RWE studies that are initiated with a view towards commercial objectives instead of clinical or scientific insights [ 14 ].

In principle, the limited internal validity of RWE studies, primarily due to the general lack of randomization, is an important criticism and urges caution [ 2 , 17 ]. Certainly, the lower internal validity needs to be addressed in order to disentangle the effect of the treatment under investigation from other factors [ 3 ]. We discuss below how advanced statistical techniques may support researchers in responding to this challenge.

Internal and external validity are both vital cornerstones of good science. While RCTs have higher internal validity, RWE studies have higher external validity. Thus, there may be a complementarity in the generation of scientific evidence [ 3 , 29 , 33 ].

For example, RWE studies can help in setting the research direction and in generating hypotheses of future RCTs or serve as the foundation for future confirmatory RCTs [ 29 ]. On the other hand, RWE studies can extend our knowledge of treatment effectiveness and safety by generalizing the findings of prior RCTs [ 29 ]. They may further describe underutilization of therapies or reveal overtreatment [ 3 ], and also foster research of rare tumors because they may allow the use of data sets with sufficiently large patient cohorts [ 13 ].

By demonstrating that positive results of RCTs are also applicable in routine clinical practice, RWE will also increase the confidence of oncologists with respect to their use of anti-cancer therapies in routine clinical care. Along the way, such studies may uncover boundary conditions and yield adapted approaches and safety insights for subpopulations (Fig.  1 ).

Fig. 1: Strengths and weaknesses of RCTs vs. RWE studies, and their complementarity

Concept and examples of RWE studies

A good example demonstrating the complementarity of RCTs and RWE studies, and the potential therein, is a study recently published in the Journal of Clinical Oncology (JCO) [ 22 ]. The research question of this study resulted from disputed results of several international RCTs regarding the role of neoadjuvant chemotherapy (NACT) and primary cytoreductive surgery (PCS) in ovarian cancer of stages IIIC and IV. This study, which analyzed the comprehensive database of 1538 patients from 6 renowned American centers of the National Comprehensive Cancer Network (NCCN), found a survival benefit for patients in stage IIIC in the PCS group. This correlated with a subgroup analysis of a prior European Organisation for Research and Treatment of Cancer (EORTC) study and is in line with current treatment guidelines, e.g., in Germany, the S3 guideline of the Association of the Scientific Medical Societies in Germany (AWMF). The comparability of the two groups was ensured by means of refined propensity score matching (n = 594). The general increase in NACT indications for ovarian cancer during the analysis period of this study should be viewed critically, since the study also showed that interval cytoreductive surgery (ICS) did not improve outcomes. However, the study confirmed that NACT is noninferior to PCS in stage IV. Thus, several new research questions (hypotheses) with regard to optimizing the treatment algorithm for ovarian cancer in stages IIIC and IV may be derived from this RWE study.

Other examples for the complementarity of RCTs and RWE studies are long-term studies on treatment safety [ 6 ] or on topics for which RCTs were not feasible (especially for patients with a rare tumor) [ 23 ]. These examples demonstrate the direction that RWE studies could and should be taking. An increased level of data depth and data quality combined with stringent methods and processes should further reduce the limitations while increasing the quality and quantity of RWE studies [ 24 ].

Benefits of establishing an oncology RWE network

There are a range of additional benefits from the establishment of a network of cancer centers that collaborate with each other. In a partner network, cancer centers can learn from each other and exchange experiences in topics such as building the required infrastructure, creating high-quality data sets, or the design and execution of the RWE studies. In addition, the cancer centers can collectively analyze their data to form sufficiently large patient cohorts. Moreover, a network of cancer centers would ideally have centers that partially employ different processes and regimens in the treatment of patients. It is this variation in practices that may potentially help uncover novel insights into as yet insufficiently covered factors and their impact on treatment effectiveness. Furthermore, treatment alternatives (whose pros and cons have not been fully defined in treatment guidelines) can be tested [ 19 ].

Such questions can currently not, or only partially, be addressed using epidemiological [ 4 ] or clinical cancer registries [ 15 ], because these registries have a different primary objective. RWE studies as described in this article are of course not completely new but rather build on established observational study methods such as cohort studies [ 16 ], registry studies [ 25 ], and population studies [ 20 ]. These registries and types of studies provide different and crucial insights into the quality of cancer care but often capture data with a time delay and necessarily in limited depth. RWE studies are therefore also complementary to these efforts.

Challenges and success factors

Establishing an RWE network

In order to achieve the possibilities and objectives described above, several tasks need to be addressed. These relate predominantly to the current reality of fragmented clinical IT systems, the quality of the data therein, as well as to information governance and operations. Resolving these challenges prior to the initiation of studies establishes the technical and operational infrastructure needed to conduct RWE studies more efficiently, more reliably, and at a consistently high level of quality. In this context, continuous attention and efforts to utilize appropriate technologies and up-to-date management of the "big data" sets are of particular importance [ 13 ].

Identification of partner cancer centers

Establishing a network raises the question of how to identify suitable partner cancer centers. The high importance of an appropriate technical infrastructure, and especially of high data quality, necessitates the inclusion of partner cancer centers that are committed to investing time, resources, and determination into optimizing their data infrastructure. Furthermore, key personnel in the cancer centers should be convinced of the benefits of such a RWE network. Lastly, one should strive to connect cancer centers that complement each other with regard to their patient profiles, research areas, and geographical variation. On this basis, the participating centers are better equipped to cover a wide range of influencing parameters that may otherwise be neglected, and thereby also counteract potential confounders [ 18 ].

IT systems and databases

Experience shows that clinical data sets frequently reside in 'silos' [ 27 , 35 ]. In essence, the data are located in different domains because they are captured by organizationally distinct clinical units, by means of different systems, and are ultimately stored in different infrastructure units [ 9 ]. Hospitals and cancer centers typically do not have unified and integrated data warehouses. On the contrary, laboratory results are stored in a laboratory information system (LIMS), (radiological) imaging data in a picture archiving and communication system (PACS), prescription data in the pharmacy system, and so on. This fragmented landscape poses a challenge for any given study, even within a specific type of tumor at a single cancer center. In addition, a highly effective RWE network should strive to achieve data comparability across the partner cancer centers to enable multicenter studies. Data, however, are often not comparable even within a center, and comparability across cancer centers is thus even more challenging.

The commitment of decision makers in the cancer centers is essential

A key success factor in addressing these challenges is the construction of integrated data warehouses that collect, link and store all relevant data sets [ 9 , 35 ]. This infrastructure should be built in a manner that ensures comparability of the data for the purpose of conducting research studies [ 33 ]. Technically, this can be facilitated by common standards of electronic data exchange such as HL7 and tools designed to ‘extract-transform-load’ (ETL). The data can thereby be extracted from a source, transformed into the desired format, and loaded into a target infrastructure. This process should be supported by medical ontologies and in practice by the diagnosing physicians and the treating oncologists. Eventually, this can result, under the leadership of specialized IT personnel, in a common data model, which will also greatly advance data comparability across cancer centers in the RWE network. Of course, tools for data protection and pseudonymization/de-identification of patient data need to be embedded in the technical solution.
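
To make the ETL idea concrete, the following is a minimal sketch in Python (using pandas) of how a center-specific laboratory export might be transformed into a simplified common data model. The column names, local code mapping, pseudonymization scheme, and target schema are illustrative assumptions and do not represent the data model of any particular network or standard.

import hashlib
import pandas as pd

# Hypothetical "extract" step: a laboratory export with center-specific column names.
lab_export = pd.DataFrame({
    "PatID": ["A-001", "A-002"],
    "TestCode": ["HB", "LEUK"],
    "Result": ["13.1", "6.4"],
    "Unit": ["g/dl", "10^9/l"],
    "Date": ["2017-03-02", "2017-03-04"],
})

# Illustrative mapping from local test codes to a shared vocabulary (assumed values).
LOCAL_TO_COMMON = {"HB": "hemoglobin", "LEUK": "leukocytes"}

def pseudonymize(patient_id, salt="center-secret"):
    """Replace the local patient ID with a salted hash (simplified pseudonymization)."""
    return hashlib.sha256((salt + patient_id).encode()).hexdigest()[:16]

def transform(extract):
    """Transform a center-specific extract into the assumed common data model."""
    return pd.DataFrame({
        "patient_pseudonym": extract["PatID"].map(pseudonymize),
        "observation": extract["TestCode"].map(LOCAL_TO_COMMON),
        "value": pd.to_numeric(extract["Result"], errors="coerce"),
        "unit": extract["Unit"].str.lower(),
        "observation_date": pd.to_datetime(extract["Date"]),
    })

# "Load" step: append the harmonized rows to the research store (a CSV file as a stand-in).
transform(lab_export).to_csv("common_model_observations.csv", index=False)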

Besides these technological requirements, an essential criterion for the success of such a project is the commitment of key decision makers in the cancer centers [ 13 ]. Some cancer centers have already started to build such solutions but clearly further efforts and advances are necessary in order to fully utilize the potential [ 27 ].

Data quality

Another important aspect is the quality of the data. The systems that routinely collect and store patient data have usually not been designed with the objective of eventually using the data for research purposes. Moreover, data entry into documentation systems is rather unpopular (not only) with physicians because it takes time away from direct patient contact. However, the completeness and validity of the data collected in routine clinical care are of utmost importance for a meaningful and analytically ready RWE network [ 5 , 29 ]. Specifically, there are four distinct yet related challenges: the completeness, common structure, and accuracy of the data, as well as the availability of novel types of data.

Missing data is a common issue with health data in general [ 2 ]. This limits the size of the patient cohort that can be completely analyzed and may also introduce a bias into the data set that could potentially invalidate the findings [ 35 ]. Moreover, cancer patients are often treated by more than one department and unit within a center as well as by office-based oncologists. Cancer centers would need to strive to achieve a nearly complete follow-up to enable outcome research with a comprehensive and long-term view. It is obvious that the success of a high-quality RWE program depends on data sets that are nearly complete and in particular it requires that any incompleteness is not due to a systematic bias [ 34 ]. Within an institution, data completeness may be improved by a change in the front-end data capture coupled with illustrating the value of capturing full records. Across institutions, technology in the sense of common standards and interfaces may also contribute to the analytical integration of commonly collected data. In addition, the commitment among the decision makers and sufficient capacity of trained personnel are both instrumental in ensuring comprehensive follow-up together with partner institutions.

Some types of data in electronic medical records, such as anamnestic data, comorbidities, or toxicity, are frequently recorded as free text instead of being stored as structured data variables [ 24 ]. One advantage of real-world data is the ability to construct data sets with long follow-up periods; however, the potential to fully utilize the theoretically available data is constrained by historical, paper-based documents. Both challenges can increasingly be addressed by a mixture of technological and organizational measures, for example, by changing front-end systems so that key data points must be entered in a structured way, combined with measures to motivate the staff of a cancer center [ 35 ]. In addition, natural language processing (NLP) software solutions that utilize medical dictionaries (e.g., SNOMED, LOINC) can transform unstructured data both retrospectively and prospectively [ 12 ]. They should not be used in isolation but rather together with medical coding teams, as sketched below.
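
As a purely illustrative sketch of such dictionary-based structuring, the following Python snippet maps free-text anamnesis notes to structured comorbidity codes with a toy term list. A production system would draw on curated terminologies such as SNOMED CT or LOINC, handle negation and context, and route the output through medical coding review; the terms, codes, and notes below are invented for demonstration.

import re
import pandas as pd

# Toy term-to-code dictionary; real systems would use curated terminologies instead.
COMORBIDITY_TERMS = {
    "diabetes": "diabetes_mellitus",
    "hypertension": "hypertension",
    "copd": "copd",
    "renal insufficiency": "chronic_kidney_disease",
}

def extract_comorbidities(note):
    """Return comorbidity codes found in a free-text note (naive keyword matching)."""
    text = note.lower()
    # Note: this does not handle negation ("no COPD"), one reason coding review is needed.
    return [code for term, code in COMORBIDITY_TERMS.items()
            if re.search(r"\b" + re.escape(term) + r"\b", text)]

notes = pd.DataFrame({
    "patient_pseudonym": ["p1", "p2"],
    "anamnesis": [
        "Long-standing hypertension, type 2 diabetes; no COPD.",
        "Known renal insufficiency, otherwise unremarkable history.",
    ],
})
notes["comorbidities"] = notes["anamnesis"].apply(extract_comorbidities)
print(notes[["patient_pseudonym", "comorbidities"]])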

Various factors can endanger the accuracy of data. For example, data may be inaccurate due to errors in creating or entering patient records, or due to a change in classification schemes. Data accuracy can be fostered by a combination of electronic quality assurance measures, e.g., validity checks at the point of data entry or business quality rules, and periodic quality checks. Reviewing the distribution of data variables and conducting logic checks may uncover systematic inaccuracies; a small example follows.
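
The following is a minimal sketch of such logic checks in Python (using pandas) on a toy extract of treatment records; the column names, plausibility rules, and valid stage codes are assumptions chosen only to illustrate the principle.

import pandas as pd

# Toy extract; in practice the rules would come from the data dictionary of the network.
records = pd.DataFrame({
    "patient_pseudonym": ["p1", "p2", "p3"],
    "date_of_birth": pd.to_datetime(["1952-04-11", "1949-09-30", "2031-01-01"]),
    "date_of_diagnosis": pd.to_datetime(["2016-05-20", "2015-02-01", "2016-07-15"]),
    "tumor_stage": ["IIIC", "IV", "VII"],
})

VALID_STAGES = {"I", "II", "III", "IIIA", "IIIB", "IIIC", "IV"}

def run_logic_checks(df):
    """Flag records that violate simple plausibility rules (illustrative only)."""
    issues = pd.DataFrame(index=df.index)
    issues["birth_after_diagnosis"] = df["date_of_birth"] > df["date_of_diagnosis"]
    issues["birth_in_future"] = df["date_of_birth"] > pd.Timestamp.today()
    issues["invalid_stage"] = ~df["tumor_stage"].isin(VALID_STAGES)
    return df[issues.any(axis=1)].join(issues)

print(run_logic_checks(records))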

Some, primarily scientific, types of data such as new biomarkers, genomics data, or novel laboratory tests are currently not collected systematically, or only in a heterogeneous manner. This also applies to patient-reported outcomes (PRO) such as structured assessments of patients' quality of life. Increasingly, (psycho-)oncologists suggest that quality of life data should be key outcome parameters in oncology. Unfortunately, the clinical results of novel therapies in many cases still yield only marginal improvements in overall survival but may result in meaningful differences in patients' quality of life [ 28 ]. Considering PRO data may also help uncover symptomatic toxicities such as nausea or vomiting that are often captured incompletely [ 21 ]. A RWE network should therefore envisage incorporating these novel types of data systematically into routine clinical practice [ 12 ]. Of particular relevance is the establishment of a user-friendly process that captures patients' quality of life assessments [ 28 ].

Data privacy and protection

The fundamental importance and the indispensable cornerstones of the protection of patient data [ 5 ] were recently reiterated by the German Ethics Council (Deutscher Ethikrat) in a comprehensive report on big data in healthcare [ 7 ]. In parallel, the new General Data Protection Regulation (GDPR) will come into effect in the European Union as of May 2018. This updated regulation further expands the scope of data privacy and protection by including new principles such as data protection and privacy by design, which obligates the implementation of technical and organizational measures that secure patient data already at the design stage of systems [ 8 ]. Two legitimate claims must be balanced here: the individual's right to data privacy and the interest of the population in improvements in cancer therapy being developed on the basis of data analyses.

Sophisticated technical solutions have been developed for the pseudonymization or de-identification of patient data and for protecting the data more generally. They should be combined with well-established organizational processes and the training of personnel. Furthermore, the data sets can remain within the confines of each cancer center and be analyzed only by staff associated with that center; in this scheme, multicenter studies could be conducted by means of federated data analysis, as sketched below.
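
As a minimal sketch of the federated idea, the following Python snippet lets each center compute aggregate statistics locally so that only aggregates, never patient-level records, are pooled by a coordinating site. The endpoint, treatment labels, and data are illustrative assumptions, and real federated analyses involve considerably more statistical and security machinery.

import pandas as pd

def local_summary(center_data):
    """Computed inside each center: per-treatment patient count and number of events."""
    return (center_data.groupby("treatment")["one_year_survival"]
            .agg(n="count", events="sum")
            .reset_index())

# Toy patient-level data that never leave their respective centers.
center_a = pd.DataFrame({"treatment": ["PCS", "NACT", "PCS"], "one_year_survival": [1, 1, 0]})
center_b = pd.DataFrame({"treatment": ["PCS", "NACT"], "one_year_survival": [1, 0]})

# The coordinating site pools only the aggregates.
pooled = (pd.concat([local_summary(center_a), local_summary(center_b)])
          .groupby("treatment").sum().reset_index())
pooled["survival_rate"] = pooled["events"] / pooled["n"]
print(pooled)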

Integrating data from fragmented systems, ensuring high data quality and enhancing it further, while securing data privacy requires a strong governance framework within a cancer center. The framework needs to describe the decision rights and roles and responsibilities of the various departments within a cancer center and the hospital. This is not only crucial in the formation of the network but will also govern how RWE studies are to be conducted within a center and in partnership with other centers. All requirements and tasks described above also necessitate a sufficient capacity of dedicated, specialized personnel [ 5 ]. Only this interplay can enable a high level of scientific and methodological rigor (Table  1 ).

Conducting RWE studies

For high-quality RWE studies there are additional requirements with respect to design, management, and publication.

Study design and execution

A critical review of phase IV trial protocols suggests that many of them neglect recognized measures of quality assurance in the design of these studies [ 14 ]. A clinically relevant research question needs to be postulated on the basis of stringent medical theory, and the corresponding hypotheses need to be established. These may then be analyzed with an appropriate and sufficiently large dataset and by applying suitable methods [ 1 ]. This necessitates substantial medical expertise in oncology from the inception of the study [ 29 ]. Therefore, RWE studies should be conducted hand in hand with stakeholders who possess the requisite clinical and biometric expertise and who design and execute the studies independently, under primarily scientific aspects [ 2 ].

The tasks require sufficient and dedicated specialized personnel

Real-world data are vulnerable to a range of biases [ 35 ]. These include selection bias, information bias, measurement error, confounding, and Simpson's paradox [ 11 ], as well as performance, detection, or attrition bias [ 35 ]. It is thus necessary to employ stringent methods to assess and ascertain the data quality, for example, by integrating the Cochrane risk-of-bias approach [ 9 ].

As discussed above, the key limitation of RWE studies is their lower internal validity [ 2 , 3 ]. In contrast to RCTs and, more generally, studies with an experimental design, RWE studies need to apply methods that single out the effect of the treatment under investigation [ 27 ]. To this end, there are a number of statistical matching techniques, such as propensity score matching, inverse probability weighting, or stratification [ 17 ]. The propensity score method parallels controlled trials on some levels [ 17 ]. Conceptually, the method seeks to analytically generate a control group that resembles the treatment group as closely as possible with respect to the characteristics of the patients and other influencing factors.
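
To illustrate the idea, the following Python sketch (using scikit-learn and pandas) estimates propensity scores on simulated, confounded routine-care data and performs 1:1 nearest-neighbor matching; the simulated covariates, treatment model, and matching-with-replacement choice are assumptions for demonstration, not a recommended analysis plan.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Simulated cohort in which treatment assignment depends on age and comorbidity,
# i.e., exactly the kind of confounding the propensity score is meant to balance.
n = 2000
age = rng.normal(65, 10, n)
comorbidity = rng.integers(0, 4, n)
p_treat = 1 / (1 + np.exp(-(-4 + 0.05 * age + 0.3 * comorbidity)))
df = pd.DataFrame({"age": age, "comorbidity": comorbidity,
                   "treated": rng.binomial(1, p_treat)})

# 1) Estimate the propensity score with logistic regression.
covariates = df[["age", "comorbidity"]]
ps_model = LogisticRegression(max_iter=1000).fit(covariates, df["treated"])
df["ps"] = ps_model.predict_proba(covariates)[:, 1]

# 2) 1:1 nearest-neighbor matching (with replacement) of treated patients to controls.
treated_df = df[df["treated"] == 1]
control_df = df[df["treated"] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control_df[["ps"]])
_, idx = nn.kneighbors(treated_df[["ps"]])
matched_controls = control_df.iloc[idx.ravel()]

# 3) Check covariate balance before and after matching (difference in mean age).
print("age gap before matching:", treated_df["age"].mean() - control_df["age"].mean())
print("age gap after matching: ", treated_df["age"].mean() - matched_controls["age"].mean())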

Publication of RWE studies

A frequent criticism of post-marketing studies, which include RWE studies, is the partial practice of opaque reporting of findings and selective publication [ 2 ]. A recent British Medical Journal (BMJ) article reported that only a small proportion of post-marketing studies were published in scientific journals [ 31 ]. This development is problematic because it does not contribute to scientific progress and contravenes common principles of good scientific practice. This raises concerns about the motivation for and the scientific discussion of post-marketing studies. A general obligation to publish (completed and discontinued) RWE studies conducted in an oncological RWE network should already be agreed upon in the planning phase of a study [ 2 ]. The publications should be transparent with respect to the original research question, the design of the study, its analysis and interpretation ([ 35 ]; Table  2 ).

Practical conclusions

“Big data” can be considered the material basis for the realization of RWE studies. These may be a complementary partner of RCTs and thereby a valuable tool in clinical research in oncology.

Modern IT concepts and technologies enable the digital and structured capture of complex oncological information in addition to routine medical data.

This makes it possible to analyze data longitudinally, including data that have been collected with different methods.

The individual centers in a RWE study network may conduct national or international benchmarking, depending on the composition of the network, in addition to analyzing their internal clinical context.

This may not only yield clues about outcomes of current treatment pathways but also about alternative approaches or about new research-related hypotheses.

RWE studies are therefore a meaningful complement to RCTs, which typically analyze pre-selected patients, and to clinical registries, which usually operate on a reduced data scope.

The ambition to conduct high-quality RWE studies in oncology poses significant challenges for all stakeholders with regard to IT, personnel, organizational, financial, and data privacy aspects.

These challenges can only be overcome jointly in order to achieve the legitimate aim of a relevant improvement of quality, effectiveness and safety in oncological care.

References

Antes G (2016) Ist das Zeitalter der Kausalität vorbei? Z Evid Fortbild Qual Gesundhwes 112:S16–S22

Berger ML, Sox H, Willke RJ et al (2017) Good practices for real-world data studies of treatment and/or comparative effectiveness: recommendations from the joint ISPOR-ISPE special task force on real-world evidence in health care decision making. Pharmacoepidemiol Drug Saf 26:1033–1039

Booth CM, Tannock IF (2014) Randomised controlled trials and population-based observational research: partners in the evolution of medical evidence. Br J Cancer 110:551–555

Brenner H, Weberpals J, Jansen L (2017) Epidemiologische Forschung mit Krebsregisterdaten. Onkologe 23:272–279

Califf RM, Robb MA, Bindman AB et al (2016) Transforming evidence generation to support health and health care decisions. N Engl J Med 375:2395–2400

Darby SC, Ewertz M, McGale P et al (2013) Risk of ischemic heart disease in women after radiotherapy for breast cancer. N Engl J Med 368:987–998

Deutscher Ethikrat (2017) Big Data und Gesundheit – Datensouveränität als informationelle Freiheitsgestaltung. Vorabfassung vom 30.11.2017

E.U. Regulation (2016) 679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). Off J Eur Union L119:1–88

Elliott JH, Grimshaw J, Altman R et al (2015) Make sense of health data: develop the science of data synthesis to join up the myriad varieties of health information. Nature 527:31–33

Frieden TR (2017) Evidence for health decision making—beyond randomized, controlled trials. N Engl J Med 377:465–475

Hammer GP, du Prel JB, Blettner M (2009) Vermeidung verzerrter Ergebnisse in Beobachtungsstudien. Dtsch Arztebl 47:664–668

Hassett MJ (2017) Quality improvement in the era of big data. J Clin Oncol 35:3178–3180

Jaffee EM, Van Dang C, Agus DB et al (2017) Future cancer research priorities in the USA: a Lancet Oncology Commission. Lancet Oncol 18:e653–e706

von Jeinsen BK, Sudhop T (2013) A 1‑year cross-sectional analysis of non-interventional post-marketing study protocols submitted to the German Federal Institute for Drugs and Medical Devices (BfArM). Eur J Clin Pharmacol 69:1453–1466

Klinkhammer-Schalke M, Gerken M, Barlag H et al (2017) Bedeutung von Krebsregistern für die Versorgungsforschung. Onkologe 23:280–287

Kulkarni GS, Hermanns T, Wei Y et al (2017) Propensity score analysis of radical cystectomy versus bladder-sparing trimodal therapy in the setting of a multidisciplinary bladder cancer clinic. J Clin Oncol 35:2299–2307

Kuss O, Blettner M, Börgermann J (2016) Propensity score: an alternative method of analyzing treatment effects – part 23 of a series on evaluation of scientific publications. Dtsch Aerztebl Int 113:597–603

Lange S, Sauerland S, Lauterberg J, Windeler J (2017) The range and scientific value of randomized trials – part 24 of a series on evaluation of scientific publications. Dtsch Aerztebl Int 114:635–640

Lefering R (2016) Registerdaten zur Nutzenbewertung – Beispiel TraumaRegister DGU®. Z Evid Fortbild Qual Gesundhwes 112:S11–S15

van Maaren MC, de Munck L, de Bock GH et al (2016) 10 year survival after breast-conserving surgery plus radiotherapy compared with mastectomy in early breast cancer in the Netherlands: a population-based study. Lancet Oncol 17:1158–1170

Di Maio M, Basch E, Bryce J, Perrone F (2016) Patient-reported outcomes in the evaluation of toxicity of anticancer treatments. Nat Rev Clin Oncol 13:319–325

Meyer LA, Cronin AM, Sun CC et al (2016) Use and effectiveness of neoadjuvant chemotherapy for treatment of ovarian cancer. J Clin Oncol 34:3854–3863

Nussbaum DP, Rushing CN, Lane WO et al (2016) Preoperative or postoperative radiotherapy versus surgery alone for retroperitoneal sarcoma: a case-control, propensity score-matched analysis of a nationwide clinical oncology database. Lancet Oncol 17:966–975

Obermeyer Z, Emanuel EJ (2016) Predicting the future—big data, machine learning, and clinical medicine. N Engl J Med 375:1216

Reiss KA, Yu S, Mamtani R et al (2017) Starting dose of sorafenib for the treatment of hepatocellular carcinoma: a retrospective, multi-institutional study. J Clin Oncol 35:3575–3585

Rothwell PM (2005) External validity of randomised controlled trials: “to whom do the results of this trial apply?”. Lancet 365:82–93

Schneeweiss S (2014) Learning from big health care data. N Engl J Med 370:2161–2163

Secord AA, Coleman RL, Havrilesky LJ et al (2015) Patient-reported outcomes as end points and outcome indicators in solid tumours. Nat Rev Clin Oncol 12:358–370

Sherman RE, Anderson SA, Dal Pan GJ et al (2016) Real-world evidence—what is it and what can it tell us. N Engl J Med 375:2293–2297

Sherman RE, Davies KM, Robb MA et al (2017) Accelerating development of scientific evidence for medical products within the existing US regulatory framework. Nat Rev Drug Discov 16:297–298

Spelsberg A, Prugger C, Doshi P et al (2017) Contribution of industry funded post-marketing studies to drug safety: survey of notifications submitted to regulatory agencies. BMJ 356:j337

Tannock IF, Amir E, Booth CM et al (2016) Relevance of randomised controlled trials in oncology. Lancet Oncol 17:e560–e567

The Academy of Medical Sciences (2015) Real world evidence: summary of a joint meeting held on 17 September 2015 by the Academy of Medical Sciences and the Association of the British Pharmaceutical Industry

U.S. Food and Drug Administration (2017) Use of real-world evidence to support regulatory decision-making for medical devices. Guidance for industry and food and drug administration staff

Visvanathan K, Levit LA, Raghavan D et al (2017) Untapped potential of observational research to inform clinical decision making: American Society of Clinical Oncology research statement. J Clin Oncol 35(16):1845–1854

Author information

Authors and affiliations

IQVIA (formerly Quintiles & IMS Health), Landshuter Allee 10, 80637, Munich, Germany

Benedikt E. Maissenhaelter

IQVIA (formerly Quintiles & IMS Health), Paris, France

Ashley L. Woolmore

c/o Charité Comprehensive Cancer Center, Berlin, Germany

Peter M. Schlag

Corresponding author

Correspondence to Benedikt E. Maissenhaelter.

Ethics declarations

Conflict of interest

B.E. Maissenhaelter and A.L. Woolmore are employees of IQVIA and are responsible for establishing a European Oncology Data and Evidence Network. P.M. Schlag advises IQVIA in establishing a European Oncology Data and Evidence Network and receives consulting fees from IQVIA.

This article does not contain any studies with human participants or animals performed by any of the authors.

Rights and permissions

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Maissenhaelter, B.E., Woolmore, A.L. & Schlag, P.M. Real-world evidence research based on big data. Onkologe 24 (Suppl 2), 91–98 (2018). https://doi.org/10.1007/s00761-018-0358-3

Published : 07 June 2018

Issue Date : November 2018

DOI : https://doi.org/10.1007/s00761-018-0358-3


Keywords

  • Real-world data
  • Evidence network
  • Network of cancer centers
  • Outcome research
  • Quality of care


  • Open access
  • Published: 05 November 2022

Real-world data: a brief review of the methods, applications, challenges and opportunities

  • Fang Liu   ORCID: orcid.org/0000-0003-3028-5927 1 &
  • Demosthenes Panagiotakos 2  

BMC Medical Research Methodology, volume 22, article number 287 (2022)


A Correction to this article was published on 02 May 2023

This article has been updated

Abstract

Background

The increased adoption of the internet, social media, wearable devices, e-health services, and other technology-driven services in medicine and healthcare has led to the rapid generation of various types of digital data, providing a valuable data source beyond the confines of traditional clinical trials, epidemiological studies, and lab-based experiments.

Methods

We provide a brief overview of the types and sources of real-world data (RWD) and the common models and approaches used to utilize and analyze RWD, and we discuss the challenges and opportunities of using RWD for evidence-based decision making. This review does not aim to be comprehensive or to cover all aspects of this intriguing topic (from both the research and practical perspectives) but serves as a primer and provides useful sources for readers who are interested in the topic.

Results and Conclusions

Real-world data hold great potential for generating real-world evidence for designing and conducting confirmatory trials and for answering questions that may not be addressed otherwise. The voluminosity and complexity of real-world data also call for the development of more appropriate, sophisticated, and innovative data processing and analysis techniques, while maintaining scientific rigor in research findings and paying attention to data ethics, in order to harness the power of real-world data.

Introduction

Per the definition by the US FDA, real-world data (RWD) in the medical and healthcare field “are the data relating to patient health status and/or the delivery of health care routinely collected from a variety of sources” [ 1 ]. The wide usage of the internet, social media, wearable and mobile devices, claims and billing activities, product and disease registries, electronic health records (EHRs), e-health services, and other technology-driven services, together with increased capacity in data storage, has led to the rapid generation and availability of digital RWD [ 2 ].

The increasing accessibility of RWD and the fast development of artificial intelligence (AI) and machine learning (ML) techniques, together with the rising costs and recognized limitations of traditional trials, have spurred great interest in the use of RWD to enhance the efficiency of clinical research and discoveries and to bridge the evidence gap between clinical research and practice. For example, during the COVID-19 pandemic, RWD were used to generate or aid the generation of real-world evidence (RWE) on the effectiveness of COVID-19 vaccination [ 3 , 4 , 5 ], to model localized COVID-19 control strategies [ 6 ], to characterize COVID-19 and flu using data from smartphones and wearables [ 7 ], to study behavioral and mental health changes in relation to the lockdown of public life [ 8 ], and to assist in decision and policy making, among other uses.

In what follows, we provide a brief review of the types and sources of RWD (Section 2 ) and the common models and approaches used to utilize and analyze RWD (Section 3 ), and we discuss the challenges and opportunities of using RWD for evidence-based decision making (Section 4 ). This review does not aim to be comprehensive or to cover all aspects of this intriguing topic (from both the research and practical perspectives) but serves as a primer and provides useful sources for readers who are interested in the topic.

Characteristics, types and applications of RWD

RWD have several characteristics that distinguish them from data collected in randomized trials in controlled settings. First, RWD are observational, as opposed to data gathered in a controlled setting. Second, many types of RWD are unstructured (e.g., texts, imaging, networks) and at times inconsistent due to entry variations across providers and health systems. Third, RWD may be generated in a high-frequency manner (e.g., measurements at the millisecond level from wearables), resulting in voluminous and dynamic data. Fourth, RWD may be incomplete and lack key endpoints for an analysis, given that the original collection was not for such a purpose. For example, claims data usually do not have clinical endpoints, and registry data have limited follow-up. Fifth, RWD may be subject to bias and measurement errors (random and non-random). For example, data generated from the internet, mobile devices, and wearables can be subject to selection bias; a RWD dataset may be an unrepresentative sample of the underlying population that a study intends to understand; and claims data are known to contain fraudulent values. In summary, RWD are messy, incomplete, heterogeneous, and subject to different types of measurement errors and biases. A systematic scoping review of the literature suggests that the data quality of RWD is not consistent and that quality assessments are challenging due to the complex and heterogeneous nature of these data. The sub-optimal data quality of RWD is well recognized [ 9 , 10 , 11 , 12 ]; how to improve it (e.g., to regulatory grade) is work in progress [ 13 , 14 , 15 ].

There are many different types of RWD. Figure 1 [ 16 ] provides a list of the RWD types and sources in medicine. We also refer readers to [ 11 ] for a comprehensive overview of the RWD data types. Here we use a few common RWD types, i.e., EHRs, registry data, claims data, patient-reported outcome (PRO) data, and data collected from wearables, as examples to demonstrate the variety of RWD and the purposes for which they can be used.

Fig. 1: RWD types and sources (source: Fig. 1 in [ 16 ], used with written permission from Dr. Brandon Swift)

EHRs are collected as part of routine care across clinics, hospitals, and healthcare institutions. EHR data are typical RWD – noisy, heterogeneous, structured and unstructured (e.g., text, imaging), and dynamic – and require careful and intensive pre-processing efforts [ 17 ]. EHRs have created unprecedented opportunities for data-driven approaches to learn patterns, make new discoveries, and assist preoperative planning, diagnostics, and clinical prognostication, among others [ 18 , 19 , 20 , 21 , 22 , 23 , 24 , 25 , 26 , 27 ], to improve predictions of selected outcomes, especially if linked with administrative and claims data and analyzed with proper machine learning techniques [ 27 , 28 , 29 , 30 ], and to validate and replicate findings from clinical trials [ 31 ].

Registry data come in various types. For example, product registries include patients who have been exposed to a biopharmaceutical product or a medical device; health services registries consist of patients who have had a common procedure or hospitalization; and disease registries contain information about people diagnosed with a specific type of disease. Registry data enable the identification and sharing of best clinical practices, improve the accuracy of estimates, and provide valuable data for supporting regulatory decision making [ 32 , 33 , 34 , 35 ]. Especially for rare diseases, where clinical trials are often of small size and data are subject to high variability, registries provide a valuable data source to help understand the course of a disease and provide critical information for confirmatory clinical trial design and translational research to develop treatments and improve patient care [ 34 , 36 , 37 ]. Readers may refer to [ 38 ] for a comprehensive overview of registry data and how they help the understanding of patient outcomes.

Claims data refer to data generated during the processing of healthcare claims in health insurance plans or from practice management systems. Although claims data are collected and stored primarily for payment purposes, they have been used in healthcare to understand patients’ and prescribers’ behavior and how they interact, to estimate disease prevalence, to learn about disease progression, disease diagnosis, medication usage, and drug-drug interactions, and to validate and replicate findings from clinical trials [ 31 , 39 , 40 , 41 , 42 , 43 , 44 , 45 , 46 ]. A known pitfall of claims data, on top of the common data characteristics of RWD, is fraud, such as upcoding [ 47 ]. The data fraud problem can be mitigated with detailed audits and the adoption of modern statistical, data mining, and ML techniques for fraud detection [ 48 , 49 , 50 , 51 ].
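
As a purely illustrative sketch of one such data-driven screening step, the following Python snippet applies an unsupervised anomaly detector (scikit-learn's IsolationForest) to simulated per-provider claim profiles; the features, contamination rate, and injected outliers are assumptions for demonstration, and real fraud detection combines such screening with audits and domain rules.

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)

# Simulated per-provider claim profiles (features are illustrative assumptions).
claims = pd.DataFrame({
    "claims_per_patient": rng.normal(4, 1, 500),
    "mean_billed_amount": rng.normal(300, 50, 500),
})
# Inject a few providers with implausible billing patterns as stand-ins for upcoding.
claims.loc[:4, ["claims_per_patient", "mean_billed_amount"]] = [[15, 1200]] * 5

# Unsupervised anomaly detection flags providers whose profiles deviate strongly.
detector = IsolationForest(contamination=0.02, random_state=0).fit(claims)
claims["flagged_for_audit"] = detector.predict(claims) == -1
print(claims[claims["flagged_for_audit"]].head())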

PRO data refer to data reported directly by patients on their health status. PRO data have been used to provide RWE on the effectiveness of interventions, symptom monitoring, and relationships between exposures and outcomes, among others [ 52 , 53 , 54 , 55 ]. PRO data are subject to recall bias and large inter-individual variability.

Wearable devices generate continuous streams of data. When combined with contextual data (e.g., location data, social media), they provide an opportunity to conduct expansive research studies, large in scale and scope [ 56 ], that would otherwise be infeasible in controlled trials. Examples of using wearable RWD to generate RWE include applications in neuroscience and environmental health [ 57 , 58 , 59 , 60 ]. Wearables generate huge amounts of data, so advances in data storage, real-time processing capabilities, and efficient battery technology are essential for the full utilization of wearable data.
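
As a small illustration of how such high-frequency streams can be reduced to an analyzable form, the following Python snippet (using pandas) downsamples a simulated heart-rate stream sampled every 100 ms to 10-second summary statistics; the signal, sampling rate, and aggregation choices are illustrative assumptions.

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# One minute of simulated heart-rate samples at 100 ms resolution (wearable-style stream).
timestamps = pd.date_range("2022-01-01 08:00", periods=600, freq="100ms")
stream = pd.Series(70 + rng.normal(0, 2, len(timestamps)), index=timestamps, name="heart_rate")

# Downsample to 10-second summaries to make the stream tractable for storage and analysis.
summary = stream.resample("10s").agg(["mean", "min", "max"])
print(summary.head())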

Using and analyzing RWD

A wide range of research methods are available to make use of RWD. In what follows, we outline a few approaches, including pragmatic clinical trials, target trial emulation, and applications of ML and AI techniques.

Pragmatic clinical trials are trials designed to test the effectiveness of an intervention in the real-world clinical setting. Pragmatic trials leverage the increasingly integrated healthcare system and may use data from EHRs, claims, patient reminder systems, telephone-based care, etc. Due to the data characteristics of RWD, new guidelines and methodologies have been developed to mitigate bias in RWE generated from RWD for decision making and causal inference, especially for per-protocol analyses [ 61 , 62 ]. The research question under investigation in pragmatic trials is whether an intervention works in real life, and the trials are designed to maximize the applicability and generalizability of the intervention. Various types of outcomes can be measured in these trials, but they are mostly patient-centered, instead of the typical measurable symptoms or markers in explanatory trials. For example, the ADAPTABLE trial [ 63 , 64 ] is a high-profile pragmatic trial and the first large-scale, EHR-enabled clinical trial conducted within the U.S. It used EHR data to identify around 450,000 patients with established atherosclerotic cardiovascular disease (CVD) for recruitment and eventually enrolled about 15,000 individuals at 40 clinical centers, who were randomized to two aspirin dose arms. Electronic patient follow-up for patient-reported outcomes was completed every 3 to 6 months, with a median follow-up of 26.2 months, to determine the optimal dosage of aspirin in CVD patients, with the primary endpoint being the composite of all-cause mortality, hospitalization for nonfatal myocardial infarction, or hospitalization for nonfatal stroke. The cost of ADAPTABLE is estimated to be only 1/5 to 1/2 of that of a traditional RCT of that scale.

Target trial emulation is the application of trial design and analysis principles from (target) randomized trials to the analysis of observational data [ 65 ]. By precisely specifying the target trial’s inclusion/exclusion criteria, treatment strategies, treatment assignment, causal contrast, outcomes, follow-up period, and statistical analysis, one may draw valid causal inferences about an intervention from RWD. Target trial emulation can be an important tool, especially when comparative evaluation is not yet available or feasible in randomized trials. For example, [ 66 ] employs target trial emulation to evaluate real-world COVID-19 vaccine effectiveness, measured by protection against COVID-19 infection or related death, in racially and ethnically diverse, elderly populations by comparing newly vaccinated persons with matched unvaccinated controls using data from the US Department of Veterans Affairs health care system. The emulated trial was conducted with clearly defined inclusion/exclusion criteria and identification of matched controls, including matching based on propensity scores with careful selection of model covariates. Target trial emulation has also been used to evaluate the effect of colon cancer screening on cancer incidence over eight years of follow-up [ 67 ] and the risk of urinary tract infection among diabetic patients [ 68 ].
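
To make the emulation steps concrete, the following is a minimal Python sketch of specifying eligibility, exposure, time zero, and a fixed follow-up window on a toy patient table. All column names, eligibility rules, dates, and the simplified assignment of the same calendar time zero to unexposed patients are illustrative assumptions, not the protocol of any actual study.

import pandas as pd

# Toy patient table (illustrative values only).
patients = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "age": [72, 66, 80, 58, 75],
    "prior_infection": [False, False, True, False, False],
    "vaccination_date": pd.to_datetime(["2021-02-01", None, "2021-02-10", None, None]),
    "outcome_date": pd.to_datetime([None, "2021-03-05", None, None, "2021-02-20"]),
})

# 1) Eligibility, mirroring the inclusion/exclusion criteria of the target trial.
eligible = patients[(patients["age"] >= 65) & (~patients["prior_infection"])].copy()

# 2) Exposure definition and time-zero alignment: exposed patients start follow-up at
#    vaccination; unexposed patients get the same calendar time zero (simplified here).
time_zero = pd.Timestamp("2021-02-01")
eligible["exposed"] = eligible["vaccination_date"].notna()
eligible["followup_start"] = eligible["vaccination_date"].fillna(time_zero)

# 3) Outcome within a fixed 90-day follow-up window, summarized by exposure group.
window_end = eligible["followup_start"] + pd.Timedelta(days=90)
eligible["event"] = eligible["outcome_date"].between(eligible["followup_start"], window_end)
print(eligible.groupby("exposed")["event"].mean())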

RWD can also be used as historical controls and reference groups for controlled trials, provided that the quality and appropriateness of the RWD are assessed and proper statistical approaches are employed to analyze the data [ 69 ]. Controlling for selection bias and confounding is key to the validity of this approach because of the lack of randomization and potentially unrecognized baseline differences, and the control group needs to be comparable with the treated group. RWD also provide a great opportunity to study rare events given their sheer volume [ 70 , 71 , 72 ]. These studies also highlight the need for improving RWD data quality, developing surrogate endpoints, and standardizing data collection for outcome measures in registries.

In terms of analysis of RWD, statistical models and inferential approaches are necessary for making sense of RWD, obtaining causal relationships, testing/validating hypotheses, and generating regulatory-grade RWE to inform policymakers and regulators in decision making – just as in the controlled trial settings. In fact, the motivation for and the design and analysis principles in pragmatic trials and target trial emulation are to obtain causal inference, with more innovative methods beyond the traditional statistical methods to adjust for potential confounders and improve the capabilities of RWD for causal inference [ 73 , 74 , 75 , 76 ].
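
One of the standard adjustment techniques in this toolbox is inverse probability weighting. The following Python sketch (using scikit-learn and numpy) estimates a treatment effect on simulated, confounded data; the data-generating process, the single confounder, and the true effect of +0.10 are assumptions chosen purely for illustration.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Simulated observational data with confounding by age; the true treatment effect is +0.10.
n = 5000
age = rng.normal(60, 8, n)
p_treat = 1 / (1 + np.exp(-(-6 + 0.1 * age)))
treated = rng.binomial(1, p_treat)
p_outcome = np.clip(0.2 + 0.004 * (age - 60) + 0.10 * treated, 0, 1)
df = pd.DataFrame({"age": age, "treated": treated,
                   "outcome": rng.binomial(1, p_outcome)})

# Inverse probability weighting: weight each patient by 1 / P(observed treatment | covariates).
ps_model = LogisticRegression(max_iter=1000).fit(df[["age"]], df["treated"])
ps = ps_model.predict_proba(df[["age"]])[:, 1]
weights = np.where(df["treated"] == 1, 1 / ps, 1 / (1 - ps))

# Weighted outcome means approximate what would be observed if everyone were (un)treated.
mask = df["treated"].to_numpy() == 1
effect = (np.average(df["outcome"].to_numpy()[mask], weights=weights[mask])
          - np.average(df["outcome"].to_numpy()[~mask], weights=weights[~mask]))
print("IPW estimate of the treatment effect:", round(effect, 3))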

ML techniques are becoming increasingly popular and are powerful tools for predictive modeling. One reason for their popularity is that modern ML techniques are very capable of dealing with voluminous, messy, multi-modal, and unstructured data types without strong assumptions about the distribution of the data. For example, deep learning can learn abstract representations of large, complex, and unstructured data; natural language processing (NLP) and embedding methods can be used to process texts and clinical notes in EHRs and transform them into real-valued vectors for downstream learning tasks. Secondly, new and more powerful ML techniques are being developed rapidly, driven by high demand and the large group of researchers attracted to the field. Thirdly, there is a great deal of open-source code (e.g., on GitHub) and there are software libraries (e.g., TensorFlow, PyTorch, Keras) that facilitate the implementation of these techniques. Indeed, ML has enjoyed a rapid surge in the last decade or so for a wide range of applications in RWD, outperforming more conventional approaches [ 77 , 78 , 79 , 80 , 81 , 82 , 83 , 84 , 85 ]. For example, ML is widely applied in health informatics to generate RWE and formulate personalized healthcare [ 86 , 87 , 88 , 89 , 90 ] and was successfully employed on RWD collected during the COVID-19 pandemic to help understand the disease and evaluate its prevention and treatment strategies [ 91 , 92 , 93 , 94 , 95 ]. It should be noted that ML techniques are largely used for predictions and classification (e.g., disease diagnosis), variable selection (e.g., biomarker screening), data visualization, etc., rather than for generating regulatory-level RWE; but this may change soon, as regulatory agencies are actively evaluating ML/AI for generating RWE and engaging stakeholders on the topic [ 96 , 97 , 98 , 99 ].
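
As a minimal illustration of the kind of text-processing baseline mentioned above, the following Python sketch (using scikit-learn) turns a handful of invented clinical-note snippets into TF-IDF features and fits a linear classifier. The notes, labels, and task are entirely illustrative; real applications would use large curated corpora and, increasingly, embedding or deep learning models.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented corpus with assumed labels (1 = toxicity mentioned, 0 = not mentioned).
notes = [
    "Patient reports severe nausea and vomiting after second cycle.",
    "No adverse events reported, tolerating therapy well.",
    "Grade 3 diarrhea, therapy paused.",
    "Follow-up unremarkable, continues current regimen.",
]
labels = [1, 0, 1, 0]

# Bag-of-words/TF-IDF features plus a linear classifier: a simple baseline before moving
# to embeddings or deep learning for larger and messier corpora.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(notes, labels)
print(model.predict(["Complains of persistent vomiting since last infusion."]))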

It would be more effective and powerful to combine the expertise from statistical inference and ML when it comes to generating RWE and learning causal relationships. One of the recent methodological developments is indeed in that direction – leveraging the advances in semi-parametric and empirical process theory and incorporating the benefits of ML into comparative effectiveness using RWD. A well-known framework is targeted learning [ 100 , 101 , 102 ] that has been successfully applied in causal inference for dynamic treatment rules using EHR data [ 103 ] and efficacy of COVID-19 treatments [ 104 ], among others.

Regardless of whether a RWD project focuses on causal inference or on prediction and classification, it is critical that the RWD be representative of the population to which the project's conclusions will be generalized. Otherwise, estimation or prediction can be misleading or even harmful. The information in RWD may not be adequate to validate the appropriateness of the data for generalization; in that case, investigators should resist the temptation to generalize to groups about which they are unsure.
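
One simple, partial check of representativeness is to compare key covariates in the RWD sample against published characteristics of the intended target population, for example via standardized mean differences, as in the hypothetical sketch below (the file name, column names, and reference values are placeholders, not real figures).

```python
import numpy as np
import pandas as pd

rwd = pd.read_csv("rwd_cohort.csv")                        # hypothetical RWD extract
# Assumed target-population summaries (mean, SD) taken from an external reference.
target = {"age": (62.0, 11.0), "female": (0.52, 0.50)}

for var, (t_mean, t_sd) in target.items():
    s_mean, s_sd = rwd[var].mean(), rwd[var].std()
    pooled_sd = np.sqrt((s_sd**2 + t_sd**2) / 2)
    smd = (s_mean - t_mean) / pooled_sd                    # standardized mean difference
    flag = "  <-- check generalizability" if abs(smd) > 0.1 else ""
    print(f"{var}: SMD = {smd:.2f}{flag}")
```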

Challenges and opportunities

Various challenges, from data gathering to data quality control to decision making, still exist at all stages of the RWD life cycle despite the excitement around its transformative potential. We list some of these challenges below; each offers ample opportunity for improvement, and greater effort is needed to harness the power of RWD.

Data quality: RWD are often used for purposes other than those for which they were originally collected; they may therefore lack information on critical endpoints and may not always be positioned to generate regulatory-grade evidence. In addition, RWD are messy, heterogeneous, and subject to various measurement errors, all of which contribute to lower quality compared with data from controlled trials. As a result, the accuracy and precision of results based on RWD are negatively affected, and misleading results or false conclusions can be generated. While these issues do not preclude the use of RWD in evidence generation and decision making, data quality problems need to be consistently documented and addressed as much as possible through data cleaning and pre-processing (e.g., imputation to fill in missing values, over-sampling for imbalanced data, denoising, and combining disparate pieces of information across databases). If an issue cannot be resolved at the pre-processing stage, efforts should be made to correct for it during data analysis, or caution should be used when interpreting the results. Early engagement of key stakeholders (e.g., regulatory agencies if needed, research institutes, and industry) is encouraged to establish data quality standards and reduce unforeseen risks and issues.
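
The hedged sketch below shows the kind of routine pre-processing steps mentioned above (de-duplication, documenting missingness, simple imputation with an indicator flag). It is illustrative only, with hypothetical file and column names; real pipelines require clinically informed, documented choices and sensitivity analyses.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("ehr_extract.csv")            # hypothetical EHR extract

# Remove exact duplicate records (e.g., from repeated data pulls).
df = df.drop_duplicates()

# Document missingness before touching it, so the choices are auditable later.
print(df.isna().mean().sort_values(ascending=False))

# Impute numeric labs with the median, keeping an indicator that a value was imputed.
labs = ["hemoglobin", "creatinine", "ldl"]     # hypothetical lab columns
for col in labs:
    df[f"{col}_imputed"] = df[col].isna()
df[labs] = SimpleImputer(strategy="median").fit_transform(df[labs])
```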

Efficient and practical ML and statistical procedures: The rapid growth of digital medical data, together with the workforce and investment flowing into the field, is driving the rapid development and adoption of modern statistical procedures and ML algorithms for analyzing these data. The availability of open-source platforms and software greatly facilitates the application of these procedures in practice. On the other hand, the noisiness, heterogeneity, incompleteness, and imbalance of RWD may cause considerable under-performance of existing statistical and ML procedures and demand new procedures that specifically target RWD and can be effectively deployed in the real world. Further, the convenience of open-source platforms and software, while offered with good intentions, also increases the chance that practitioners will misuse the procedures if they lack proper training or do not understand the principles of the techniques before applying them to real-world situations. In addition, to maintain scientific rigor in generating RWE from RWD, results from statistical and ML procedures require medical validation, either through expert knowledge or through reproducibility and replicability studies, before they are used for decision making in the real world [ 105 ].
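
As a small example of the kind of misuse that proper training helps avoid, the sketch below contrasts raw accuracy with a rank-based metric under the heavy class imbalance typical of RWD outcomes; the data are synthetic and carry no signal, so the high accuracy is exactly the misleading summary to watch for.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 10))
y = (rng.random(2000) < 0.05).astype(int)        # ~5% event rate, independent of X

clf = RandomForestClassifier(class_weight="balanced", random_state=0)
acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

# Accuracy looks impressive simply because the majority class dominates;
# AUROC near 0.5 correctly reveals that the model has learned nothing.
print(f"accuracy = {acc:.3f} (misleadingly high), AUROC = {auc:.3f}")
```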

Explainability and interpretability: Modern ML approaches are often employed in a black-box fashion, with little understanding of the relationships between inputs and outputs or of causal effects. Model selection, parameter initialization, and hyper-parameter tuning are also often conducted in a trial-and-error manner without domain expert input. This is at odds with the medical and healthcare field, where interpretability is critical to building patient and user trust, and where doctors are unlikely to use technology they do not understand. Promising and encouraging research on this topic has already started [ 106 , 107 , 108 , 109 , 110 , 111 ], but more is warranted.
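
For illustration, the sketch below applies one generic, model-agnostic interpretability tool, permutation importance, to a model fitted on a public dataset. It is not one of the methods from the references above, only an example of inspecting which inputs a fitted model actually relies on.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature in turn and measure how much held-out performance drops;
# features whose permutation hurts performance the most are the most influential.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
top = sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1])[:5]
for name, imp in top:
    print(f"{name}: {imp:.3f}")
```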

Reproducibility and replicability: Reproducibility and replicability (Footnote 2) are major principles in scientific research, RWD included. If an analytical procedure is not robust and its output is not reproducible or replicable, the public will call into question the scientific rigor of the work and doubt the conclusions of a RWD-based study [ 113 , 114 , 115 ]. Result validation, reproducibility, and replicability can be challenging given the messiness, incompleteness, and unstructured nature of RWD, but they need to be established, especially because the generated evidence could be used in regulatory decisions and affect the lives of millions of people. Irreproducibility can be mitigated by sharing raw and processed data and code, provided privacy is not compromised in the process. For replicability, given that RWD are not generated from controlled trials and every data set may have its own unique characteristics, complete replicability can be difficult or even infeasible. Nevertheless, detailed documentation of data characteristics and pre-processing, pre-registration of analysis procedures, and adherence to open science principles (e.g., code repositories [ 116 ]) are critical for replicating findings on different RWD datasets, assuming they come from the same underlying population. Readers may refer to [ 117 , 118 , 119 ] for further suggestions and discussion on this topic.
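
In practice, a few lines of reproducibility hygiene go a long way: fix random seeds, record package versions, and fingerprint the input data so a run can be re-executed and audited later. The sketch below is one minimal way to do this; the input file name is hypothetical.

```python
import hashlib
import json
import random
import sys

import numpy as np
import sklearn

SEED = 2023
random.seed(SEED)
np.random.seed(SEED)

# Fingerprint the exact input file used for this analysis run.
with open("ehr_extract.csv", "rb") as f:          # hypothetical input file
    data_fingerprint = hashlib.sha256(f.read()).hexdigest()

run_metadata = {
    "python": sys.version,
    "numpy": np.__version__,
    "scikit-learn": sklearn.__version__,
    "seed": SEED,
    "input_sha256": data_fingerprint,
}
with open("run_metadata.json", "w") as f:
    json.dump(run_metadata, f, indent=2)          # store alongside the analysis code
```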

Privacy: Ethical issues arise whenever an RWD project is implemented; among these, privacy is a commonly discussed topic. Information in RWD is often sensitive, covering medical histories, disease status, financial situation, and social behaviors, among others. Privacy risk can increase dramatically when different databases (e.g., EHR, wearables, claims) are linked together, a common practice in the analysis of RWD. Data users and policymakers should make every effort to ensure that RWD collection, storage, sharing, and analysis follow established data privacy principles (i.e., lawfulness, fairness, purpose limitation, and data minimization). In addition, privacy-enhancing technology and privacy-preserving data sharing and analysis can be deployed; effective and well-accepted state-of-the-art concepts and approaches already exist, such as differential privacy (Footnote 3) [ 120 ] and federated learning (Footnote 4) [ 121 , 122 ]. Investigators and policymakers may consider integrating these concepts and technologies when collecting and analyzing RWD and when disseminating the results and RWE derived from them.
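
As a toy illustration of the differential privacy idea (see Footnote 3), the sketch below releases a count with Laplace noise scaled to the query sensitivity and a chosen privacy budget epsilon. The numbers are illustrative only; real deployments require careful accounting of the total privacy budget across all released queries.

```python
import numpy as np

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Return an epsilon-differentially-private count via the Laplace mechanism."""
    # Adding or removing one individual changes a count by at most `sensitivity`.
    noise = np.random.default_rng().laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# e.g., a noisy release of the number of patients with a sensitive diagnosis
print(dp_count(true_count=412, epsilon=0.5))
```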

Diversity, Equity, Algorithmic fairness, and Transparency (DEAT): DEAT is another important set of ethical issues to consider in an RWD project. RWD may contain information from various demographic groups, which can be used to generate RWE with improved generalizability compared to data collected in controlled settings. On the other hand, certain types of RWD may be heavily biased or unbalanced toward a particular group, may not be diverse or inclusive, and in some cases may even exacerbate disparities (e.g., wearables and access to facilities and treatment may be limited to certain demographic groups). Greater effort will be needed to gain access to RWD from underrepresented groups and to account effectively for the heterogeneity in RWD while remaining mindful of these limitations for diversity and equity. This topic also relates to algorithmic fairness, which aims to understand and prevent bias in ML models and is an increasingly popular research topic [ 123 , 124 , 125 , 126 , 127 ]. Incorrect and misleading conclusions may be drawn if trained models systematically disadvantage a particular group (e.g., a trained algorithm might be less likely to detect cancer in Black patients than in White patients, or in men than in women). Transparency means that information and communication concerning the processing of personal data must be easily accessible and easy to understand. Transparency ensures that data contributors are aware of how their data are being used and for what purposes, and that decision-makers can evaluate the quality of the methods and the applicability of the generated RWE [ 128 , 129 , 130 , 131 ]. Being transparent when working with RWD is critical for building trust among the key stakeholders throughout the RWD life cycle (the individuals who supply the data, those who collect and manage the data, the curators and analysts who design studies and analyze the data, and decision and policy makers).
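
A routine first step toward algorithmic fairness is simply to stratify model performance by group, as in the synthetic sketch below. Which fairness metric is appropriate is context-dependent and should be decided with domain and ethics expertise; the group labels and data here are invented.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import recall_score

# Synthetic predictions for two hypothetical groups.
df = pd.DataFrame({
    "group":  ["A"] * 500 + ["B"] * 500,
    "y_true": np.random.default_rng(0).binomial(1, 0.3, 1000),
    "y_pred": np.random.default_rng(1).binomial(1, 0.3, 1000),
})

# True-positive rate (sensitivity) per group; large gaps suggest the model may
# systematically miss cases in one group.
for g, sub in df.groupby("group"):
    tpr = recall_score(sub["y_true"], sub["y_pred"])
    print(f"group {g}: TPR = {tpr:.2f}")
```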

The above challenges are not isolated but interconnected, as depicted in Fig. 2. Data quality affects the performance of statistical and ML procedures; the data sources and the cleaning and pre-processing steps relate to result reproducibility and replicability, as do the choices of statistical and ML procedures. Whether privacy-preserving procedures are used during data collection and analysis, and how information is shared and released, relate to data privacy, DEAT, and explainability and interpretability, which in turn affect which ML procedures are applied and how new ML techniques are developed.

Fig. 2 Challenges in RWD and Their Relations

Conclusions

RWD provide a valuable and rich data source beyond the confines of traditional epidemiological studies, clinical trials, and lab-based experiments, with lower data collection costs than the latter. If used and analyzed appropriately, RWD have the potential to generate valid and unbiased RWE with savings in both cost and time compared with controlled trials, and to enhance the efficiency of medical and health-related research and decision making. Procedures that improve data quality and overcome the limitations of RWD, so as to make the best of them, have been and will continue to be developed. With enthusiasm, commitment, and investment in RWD from all key stakeholders, we hope the day when RWD unleash their full potential will come soon.

Availability of data and materials

Not applicable. This is a review article. No data or materials were generated or collected.

Change history

02 May 2023

A Correction to this paper has been published: https://doi.org/10.1186/s12874-023-01937-1

Footnote 1: Upcoding refers to instances in which a medical service provider obtains additional reimbursement from insurance by coding a service it provided as a more expensive service than what was actually performed.

Footnote 2: Reproducibility refers to “instances in which the original researcher’s data and computer codes are used to regenerate the results” and replicability refers to “instances in which a researcher collects new data to arrive at the same scientific findings as a previous study.” [ 112 ]

Footnote 3: Differential privacy provides a mathematically rigorous framework in which randomized procedures are used to guarantee individual privacy when releasing information.

Footnote 4: Federated learning enables local devices to collaboratively learn a shared model while keeping all training data on the local devices without sharing, mitigating privacy risks.

Abbreviations

AI: artificial intelligence
CVD: cardiovascular disease
COVID: coronavirus disease
DEAT: diversity, equity, algorithmic fairness, and transparency
EHR: electronic health records
ML: machine learning
NLP: natural language processing
PRO: patient-reported outcome
RWD: real-world data
RWE: real-world evidence

References

US Food and Drug Administration, et al. Real-World Evidence. 2022. https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence . Accessed 1 Sep 2022.

Wikipedia. Real world data. 2022. https://en.wikipedia.org/wiki/Real_world_data . Accessed 19 Mar 2022.

Powell AA, Power L, Westrop S, McOwat K, Campbell H, Simmons R, et al. Real-world data shows increased reactogenicity in adults after heterologous compared to homologous prime-boost COVID-19 vaccination, March- June 2021, England. Eurosurveillance. 2021;26(28):2100634.

Hunter PR, Brainard JS. Estimating the effectiveness of the Pfizer COVID-19 BNT162b2 vaccine after a single dose. A reanalysis of a study of ’real-world’ vaccination outcomes from Israel. medRxiv. 2021.02.01.21250957. https://doi.org/10.1101/2021.02.01.21250957 .

Henry DA, Jones MA, Stehlik P, Glasziou PP. Effectiveness of COVID-19 vaccines: findings from real world studies. Med J Aust. 2021;215(4):149.

Firth JA, Hellewell J, Klepac P, Kissler S, Kucharski AJ, Spurgin LG. Using a real-world network to model localized COVID-19 control strategies. Nat Med. 2020;26(10):1616–22.

Shapiro A, Marinsek N, Clay I, Bradshaw B, Ramirez E, Min J, et al. Characterizing COVID-19 and influenza illnesses in the real world via person-generated health data. Patterns. 2021;2(1):100188.

Ahrens KF, Neumann RJ, Kollmann B, Plichta MM, Lieb K, Tüscher O, et al. Differential impact of COVID-related lockdown on mental health in Germany. World Psychiatr. 2021;20(1):140.

Hernández MA, Stolfo SJ. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Min Knowl Disc. 1998;2(1):9–37.

Corrigan-Curay J, Sacks L, Woodcock J. Real-world evidence and real-world data for evaluating drug safety and effectiveness. Jama. 2018;320(9):867–8.

Makady A, de Boer A, Hillege H, Klungel O, Goettsch W, et al. What is real-world data? A review of definitions based on literature and stakeholder interviews. Value Health. 2017;20(7):858–65.

Franklin JM, Schneeweiss S. When and how can real world data analyses substitute for randomized controlled trials? Clin Pharmacol Ther. 2017;102(6):924–33.

Miksad RA, Abernethy AP. Harnessing the power of real-world evidence (RWE): a checklist to ensure regulatory-grade data quality. Clin Pharmacol Ther. 2018;103(2):202–5.

Curtis MD, Griffith SD, Tucker M, Taylor MD, Capra WB, Carrigan G, et al. Development and validation of a high-quality composite real-world mortality endpoint. Health Serv Res. 2018;53(6):4460–76.

Booth CM, Karim S, Mackillop WJ. Real-world data: towards achieving the achievable in cancer care. Nat Rev Clin Oncol. 2019;16(5):312–25.

Swift B, Jain L, White C, Chandrasekaran V, Bhandari A, Hughes DA, et al. Innovation at the intersection of clinical trials and real-world data science to advance patient care. Clin Transl Sci. 2018;11(5):450–60.

Sun W, Cai Z, Li Y, Liu F, Fang S, Wang G. Data processing and text mining technologies on electronic medical records: a review. J Healthc Eng. 2018;2018:4302425. https://doi.org/10.1155/2018/4302425 .

Wu J, Roy J, Stewart WF. Prediction modeling using EHR data: challenges, strategies, and a comparison of machine learning approaches. Med Care. 2010;48(6 Suppl):S106-13. https://doi.org/10.1097/MLR.0b013e3181de9e17 , https://pubmed.ncbi.nlm.nih.gov/20473190/ .

Botsis T, Hartvigsen G, Chen F, Weng C. Secondary use of EHR: data quality issues and informatics opportunities. Summit Transl Bioinforma. 2010;2010:1.

Kawaler E, Cobian A, Peissig P, Cross D, Yale S, Craven M. Learning to predict post-hospitalization VTE risk from EHR data. In: AMIA Annual Symposium Proceedings. vol. 2012. American Medical Informatics Association; 2012. p. 436.

Shickel B, Tighe PJ, Bihorac A, Rashidi P. Deep EHR: a survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE J Biomed Health Inform. 2017;22(5):1589–604.

Poirier C, Hswen Y, Bouzillé G, Cuggia M, Lavenu A, Brownstein JS, et al. Influenza forecasting for French regions combining EHR, web and climatic data sources with a machine learning ensemble approach. PloS ONE. 2021;16(5):e0250890.

Zheng T, Xie W, Xu L, He X, Zhang Y, You M, et al. A machine learning-based framework to identify type 2 diabetes through electronic health records. Int J Med Inform. 2017;97:120–7.

Pivovarov R, Perotte AJ, Grave E, Angiolillo J, Wiggins CH, Elhadad N. Learning probabilistic phenotypes from heterogeneous EHR data. J Biomed Inform. 2015;58:156–65.

Zhao D, Weng C. Combining PubMed knowledge and EHR data to develop a weighted bayesian network for pancreatic cancer prediction. J Biomed Informa. 2011;44(5):859–68.

Veturi Y, Lucas A, Bradford Y, Hui D, Dudek S, Theusch E, et al. A unified framework identifies new links between plasma lipids and diseases from electronic medical records across large-scale cohorts. Nat Genet. 2021;53(7):972–81.

Kwon BC, Choi MJ, Kim JT, Choi E, Kim YB, Kwon S, et al. Retainvis: Visual analytics with interpretable and interactive recurrent neural networks on electronic medical records. IEEE Trans Vis Comput Graph. 2018;25(1):299–309.

Mahmoudi E, Kamdar N, Kim N, Gonzales G, Singh K, Waljee AK. Use of electronic medical records in development and validation of risk prediction models of hospital readmission: systematic review. BMJ. 2020;369:m958.

Desai RJ, Wang SV, Vaduganathan M, Evers T, Schneeweiss S. Comparison of machine learning methods with traditional models for use of administrative claims with electronic medical records to predict heart failure outcomes. JAMA Netw Open. 2020;3(1):e1918962.

Huang L, Shea AL, Qian H, Masurkar A, Deng H, Liu D. Patient clustering improves efficiency of federated machine learning to predict mortality and hospital stay time using distributed electronic medical records. J Biomed Inform. 2019;99:103291.

Bartlett VL, Dhruva SS, Shah ND, Ryan P, Ross JS. Feasibility of using real-world data to replicate clinical trial evidence. JAMA Netw Open. 2019;2(10):e1912869.

Dreyer NA, Garner S. Registries for robust evidence. Jama. 2009;302(7):790–1.

Larsson S, Lawyer P, Garellick G, Lindahl B, Lundström M. Use of 13 disease registries in 5 countries demonstrates the potential to use outcome data to improve health care’s value. Health Affairs. 2012;31(1):220–7.

McGettigan P, Alonso Olmo C, Plueschke K, Castillon M, Nogueras Zondag D, Bahri P, et al. Patient registries: an underused resource for medicines evaluation. Drug Saf. 2019;42(11):1343–51.

Izmirly PM, Parton H, Wang L, McCune WJ, Lim SS, Drenkard C, et al. Prevalence of systemic lupus erythematosus in the United States: estimates from a meta-analysis of the Centers for Disease Control and Prevention National Lupus Registries. Arthritis Rheumatol. 2021;73(6):991–6.

Jansen-Van Der Weide MC, Gaasterland CM, Roes KC, Pontes C, Vives R, Sancho A, et al. Rare disease registries: potential applications towards impact on development of new drug treatments. Orphanet J Rare Dis. 2018;13(1):1–11.

Lacaze P, Millis N, Fookes M, Zurynski Y, Jaffe A, Bellgard M, et al. Rare disease registries: a call to action. Intern Med J. 2017;47(9):1075–9.

Gliklich RE, Dreyer NA, Leavy MB, editors. Registries for Evaluating Patient Outcomes: A User's Guide. 3rd ed. Rockville (MD): Agency for Healthcare Research and Quality (US); 2014 Apr. Report No.: 13(14)-EHC111. PMID: 24945055.

Svarstad BL, Shireman TI, Sweeney J. Using drug claims data to assess the relationship of medication adherence with hospitalization and costs. Psychiatr Serv. 2001;52(6):805–11.

Izurieta HS, Wu X, Lu Y, Chillarige Y, Wernecke M, Lindaas A, et al. Zostavax vaccine effectiveness among US elderly using real-world evidence: Addressing unmeasured confounders by using multiple imputation after linking beneficiary surveys with Medicare claims. Pharmacoepidemiol Drug Saf. 2019;28(7):993–1001.

Allen AM, Van Houten HK, Sangaralingham LR, Talwalkar JA, McCoy RG. Healthcare cost and utilization in nonalcoholic fatty liver disease: real-world data from a large US claims database. Hepatology. 2018;68(6):2230–8.

Sruamsiri R, Iwasaki K, Tang W, Mahlich J. Persistence rates and medical costs of biological therapies for psoriasis treatment in Japan: a real-world data study using a claims database. BMC Dermatol. 2018;18(1):1–11.

Quock TP, Yan T, Chang E, Guthrie S, Broder MS. Epidemiology of AL amyloidosis: a real-world study using US claims data. Blood Adv. 2018;2(10):1046–53.

Herland M, Bauder RA, Khoshgoftaar TM. Medical provider specialty predictions for the detection of anomalous medicare insurance claims. In: 2017 IEEE international conference on information reuse and integration (IRI). New York City: IEEE; 2017. p. 579–88.

Momo K, Kobayashi H, Sugiura Y, Yasu T, Koinuma M, Kuroda SI. Prevalence of drug–drug interaction in atrial fibrillation patients based on a large claims data. PLoS ONE. 2019;14(12):e0225297.

Ghiani M, Maywald U, Wilke T, Heeg B. RW1 Bridging The Gap Between Clinical Trials And Real World Data: Evidence On Replicability Of Efficacy Results Using German Claims Data. Value Health. 2020;23:S757–8.

Silverman E, Skinner J. Medicare upcoding and hospital ownership. J Health Econ. 2004;23(2):369–89.

Kirlidog M, Asuk C. A fraud detection approach with data mining in health insurance. Procedia-Soc Behav Sci. 2012;62:989–94.

Li J, Huang KY, Jin J, Shi J. A survey on statistical methods for health care fraud detection. Health Care Manag Sci. 2008;11(3):275–87.

Viaene S, Dedene G, Derrig RA. Auto claim fraud detection using Bayesian learning neural networks. Expert Syst Appl. 2005;29(3):653–66.

Phua C, Lee V, Smith K, Gayler R. A comprehensive survey of data mining-based fraud detection research. arXiv preprint arXiv:1009.6119 . 2010.

Roche N, Small M, Broomfield S, Higgins V, Pollard R. Real world COPD: association of morning symptoms with clinical and patient reported outcomes. COPD J Chronic Obstructive Pulm Dis. 2013;10(6):679–86.

Small M, Anderson P, Vickers A, Kay S, Fermer S. Importance of inhaler-device satisfaction in asthma treatment: real-world observations of physician-observed compliance and clinical/patient-reported outcomes. Adv Ther. 2011;28(3):202–12.

Pinsker JE, Müller L, Constantin A, Leas S, Manning M, McElwee Malloy M, et al. Real-world patient-reported outcomes and glycemic results with initiation of control-IQ technology. Diabetes Technol Ther. 2021;23(2):120–7.

Touma Z, Hoskin B, Atkinson C, Bell D, Massey O, Lofland JH, Berry P, Karyekar CS, Costenbader KH. Systemic lupus erythematosus symptom clusters and their association with Patient‐Reported outcomes and treatment: analysis of Real‐World data. Arthritis Care & Research. 2022;74(7):1079-88.

Martinez GJ, Mattingly SM, Mirjafari S, Nepal SK, Campbell AT, Dey AK, et al. On the quality of real-world wearable data in a longitudinal study of information workers. In: 2020 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops). New York City: IEEE; 2020. p. 1–6.

Christensen JH, Saunders GH, Porsbo M, Pontoppidan NH. The everyday acoustic environment and its association with human heart rate: evidence from real-world data logging with hearing aids and wearables. Royal Soc Open Sci. 2021;8(2):201345.

Johnson KT, Picard RW. Advancing neuroscience through wearable devices. Neuron. 2020;108(1):8–12.

Pickham D, Berte N, Pihulic M, Valdez A, Mayer B, Desai M. Effect of a wearable patient sensor on care delivery for preventing pressure injuries in acutely ill adults: A pragmatic randomized clinical trial (LS-HAPI study). Int J Nurs Stud. 2018;80:12–9.

Adams JL, Dinesh K, Snyder CW, Xiong M, Tarolli CG, Sharma S, et al. A real-world study of wearable sensors in Parkinson’s disease. NPJ Park Dis. 2021;7(1):1–8.

Hernán MA, Robins JM, et al. Per-protocol analyses of pragmatic trials. N Engl J Med. 2017;377(14):1391–8.

Murray EJ, Swanson SA, Hernán MA. Guidelines for estimating causal effects in pragmatic randomized trials. arXiv preprint arXiv:1911.06030 . 2019.

Hernandez AF, Fleurence RL, Rothman RL. The ADAPTABLE Trial and PCORnet: shining light on a new research paradigm. Ann Intern Med. 2015;163(8):635-6.

Baigent C. Pragmatic trials-need for ADAPTABLE design. N Engl J Med. 2021;384(21).

Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. Am J Epidemiol. 2016;183(8):758–64.

Ioannou GN, Locke ER, O’Hare AM, Bohnert AS, Boyko EJ, Hynes DM, et al. COVID-19 vaccination effectiveness against infection or death in a National US Health Care system: a target trial emulation study. Ann Intern Med. 2022;175(3):352–61.

García-Albéniz X, Hsu J, Hernán MA. The value of explicitly emulating a target trial when using real world evidence: an application to colorectal cancer screening. Eur J Epidemiol. 2017;32(6):495–500.

Takeuchi Y, Kumamaru H, Hagiwara Y, Matsui H, Yasunaga H, Miyata H, et al. Sodium-glucose cotransporter-2 inhibitors and the risk of urinary tract infection among diabetic patients in Japan: Target trial emulation using a nationwide administrative claims database. Diabetes Obes Metab. 2021;23(6):1379–88.

Jen EY, Xu Q, Schetter A, Przepiorka D, Shen YL, Roscoe D, et al. FDA approval: blinatumomab for patients with B-cell precursor acute lymphoblastic leukemia in morphologic remission with minimal residual disease. Clin Cancer Res. 2019;25(2):473–7.

Gross AM. Using real world data to support regulatory approval of drugs in rare diseases: A review of opportunities, limitations & a case example. Curr Probl Cancer. 2021;45(4):100769.

Wu J, Wang C, Toh S, Pisa FE, Bauer L. Use of real-world evidence in regulatory decisions for rare diseases in the United States—Current status and future directions. Pharmacoepidemiol Drug Saf. 2020;29(10):1213–8.

Hayeems RZ, Michaels-Igbokwe C, Venkataramanan V, Hartley T, Acker M, Gillespie M, et al. The complexity of diagnosing rare disease: An organizing framework for outcomes research and health economics based on real-world evidence. Genet Med. 2022;24(3):694–702.

Hernán MA, Robins JM. Causal inference. Boca Raton: CRC; 2010.

Ho M, van der Laan M, Lee H, Chen J, Lee K, Fang Y, et al. The current landscape in biostatistics of real-world data and evidence: Causal inference frameworks for study design and analysis. Stat Biopharm Res. 2021. https://www.tandfonline.com/doi/abs/10.1080/19466315.2021.1883475 .

Crown WH. Real-world evidence, causal inference, and machine learning. Value Health. 2019;22(5):587–92.

Cui P, Shen Z, Li S, Yao L, Li Y, Chu Z, et al. Causal inference meets machine learning. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. New York: Association for Computing Machinery; 2020. p. 3527–3528.

Xiong HY, Alipanahi B, Lee LJ, Bretschneider H, Merico D, Yuen RK, Hua Y, Gueroussov S, Najafabadi HS, Hughes TR, Morris Q. The human splicing code reveals new insights into the genetic determinants of disease. Science. 2015;347(6218):1254806.

Quang D, Chen Y, Xie X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics. 2015;31(5):761–3.

Anthimopoulos M, Christodoulidis S, Ebner L, Christe A, Mougiakakou S. Lung pattern classification for interstitial lung diseases using a deep convolutional neural network. IEEE Trans Med Imaging. 2016;35(5):1207–16.

Van Grinsven MJ, van Ginneken B, Hoyng CB, Theelen T, Sánchez CI. Fast convolutional neural network training using selective data sampling: Application to hemorrhage detection in color fundus images. IEEE Trans Med Imaging. 2016;35(5):1273–84.

Kleesiek J, Urban G, Hubert A, Schwarz D, Maier-Hein K, Bendszus M, et al. Deep MRI brain extraction: A 3D convolutional neural network for skull stripping. NeuroImage. 2016;129:460–9.

Gibson E, Li W, Sudre C, Fidon L, Shakir DI, Wang G, et al. NiftyNet: a deep-learning platform for medical imaging. Comput Methods Prog Biomed. 2018;158:113–22.

Coccia M. Deep learning technology for improving cancer care in society: New directions in cancer imaging driven by artificial intelligence. Technol Soc. 2020;60:101198.

Bien N, Rajpurkar P, Ball RL, Irvin J, Park A, Jones E, et al. Deep-learning-assisted diagnosis for knee magnetic resonance imaging: development and retrospective validation of MRNet. PLoS Med. 2018;15(11):e1002699.

Johansson FD, Collins JE, Yau V, Guan H, Kim SC, Losina E, et al. Predicting response to tocilizumab monotherapy in rheumatoid arthritis: a real-world data analysis using machine learning. J Rheumatol. 2021;48(9):1364–70.

Ravì D, Wong C, Deligianni F, Berthelot M, Andreu-Perez J, Lo B, et al. Deep learning for health informatics. IEEE J Biomed Health Informa. 2016;21(1):4–21.

Suzuki K. Overview of deep learning in medical imaging. Radiol Phys Technol. 2017;10(3):257–73.

Shen D, Wu G, Suk HI. Deep learning in medical image analysis. Annu Rev Biomed Eng. 2017;19:221–48.

Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, et al. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42:60–88.

Lee JG, Jun S, Cho YW, Lee H, Kim GB, Seo JB, et al. Deep learning in medical imaging: general overview. Korean J Radiol. 2017;18(4):570–84.

Amyar A, Modzelewski R, Li H, Ruan S. Multi-task deep learning based CT imaging analysis for COVID-19 pneumonia: Classification and segmentation. Comput Biol Med. 2020;126:104037.

Oh Y, Park S, Ye JC. Deep learning covid-19 features on cxr using limited training data sets. IEEE Trans Med Imaging. 2020;39(8):2688–700.

Hemdan EED, Shouman MA, Karar ME. Covidx-net: A framework of deep learning classifiers to diagnose covid-19 in x-ray images. arXiv preprint arXiv:2003.11055 . 2020.

Wang S, Zha Y, Li W, Wu Q, Li X, Niu M, et al. A fully automatic deep learning system for COVID-19 diagnostic and prognostic analysis. Eur Respir J. 2020;56(2).

Ardakani AA, Kanafi AR, Acharya UR, Khadem N, Mohammadi A. Application of deep learning technique to manage COVID-19 in routine clinical practice using CT images: Results of 10 convolutional neural networks. Comput Biol Med. 2020;121:103795.

US Food and Drug Administration. Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) - Discussion Paper and Request for Feedback. 2019. https://www.fda.gov/files/medical%20devices/published/US-FDA-Artificial-Intelligence-and-Machine-Learning-Discussion-Paper.pdf . Accessed 24 Mar 2022.

US Food and Drug Administration. Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) Action Plan. 2021. https://www.fda.gov/media/145022/download . Accessed 24 Mar 2022.

International Coalition of Medicines Regulatory Authorities. Informal Innovation Network Horizon Scanning Assessment Report - Artificial Intelligence. 2021. https://www.icmra.info/drupal/sites/default/files/2021-08/horizon_scanning_report_artificial_intelligence.pdf . Accessed 24 Mar 2022.

European Medicines Agency. Artificial intelligence in medicine regulation. 2021. https://www.ema.europa.eu/en/news/artificial-intelligence-medicine-regulation . Accessed 24 Mar 2022.

Van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. 2011. Springer-Verlag New York Inc., United States.

Van der Laan MJ, Rose S. Targeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies. Cham: Springer; 2018.

van der Laan MJ, Luedtke AR. Targeted learning of the mean outcome under an optimal dynamic treatment rule. J Causal Infer. 2015;3(1):61–95.

Sofrygin O, Zhu Z, Schmittdiel JA, Adams AS, Grant RW, van der Laan MJ, et al. Targeted learning with daily EHR data. Stat Med. 2019;38(16):3073–90.

Chakravarti P, Wilson A, Krikov S, Shao N, van der Laan M. PIN68 Estimating Effects in Observational Real-World Data, From Target Trials to Targeted Learning: Example of Treating COVID-Hospitalized Patients. Value Health. 2021;24:S118.

Eichler HG, Koenig F, Arlett P, Enzmann H, Humphreys A, Pétavy F, et al. Are novel, nonrandomized analytic methods fit for decision making? The need for prospective, controlled, and transparent validation. Clin Pharmacol Ther. 2020;107(4):773–9.

Chakraborty S, Tomsett R, Raghavendra R, Harborne D, Alzantot M, Cerutti F, et al. Interpretability of deep learning models: A survey of results. In: 2017 IEEE smartworld, ubiquitous intelligence & computing, advanced & trusted computed, scalable computing & communications, cloud & big data computing, Internet of people and smart city innovation. New York City: IEEE; 2017. p. 1–6.

Zhang Q, Zhu SC. Visual interpretability for deep learning: a survey. arXiv preprint arXiv:1802.00614 . 2018.

Hohman F, Park H, Robinson C, Chau DHP. Summit: Scaling deep learning interpretability by visualizing activation and attribution summarizations. IEEE Trans Vis Comput Graph. 2019;26(1):1096–106.

Ghoshal B, Tucker A. Estimating uncertainty and interpretability in deep learning for coronavirus (COVID-19) detection. arXiv preprint arXiv:2003.10769 . 2020.

Raghu M, Gilmer J, Yosinski J, Sohl-Dickstein J. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In: Advances in Neural Information Processing Systems 30 (NIPS 2017). Long Beach; 2017.

Cruz-Roa AA, Ovalle JEA, Madabhushi A, Osorio FAG. A deep learning architecture for image representation, visual interpretability and automated basal-cell carcinoma cancer detection. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham; Springer; 2013. p. 403–410.

Barba LA. Terminologies for reproducible research. arXiv preprint arXiv:1802.03311 . 2018.

Stupple A, Singerman D, Celi LA. The reproducibility crisis in the age of digital medicine. NPJ Digit Med. 2019;2(1):1–3.

Carter RE, Attia ZI, Lopez-Jimenez F, Friedman PA. Pragmatic considerations for fostering reproducible research in artificial intelligence. NPJ Digit Med. 2019;2(1):1–3.

Liu C, Gao C, Xia X, Lo D, Grundy J, Yang X. On the replicability and reproducibility of deep learning in software engineering. ACM Transactions on Software Engineering and Methodology. 2021;31(1):1–46.

Springate DA, Kontopantelis E, Ashcroft DM, Olier I, Parisi R, Chamapiwa E, et al. ClinicalCodes: an online clinical codes repository to improve the validity and reproducibility of research using electronic medical records. PloS ONE. 2014;9(6):e99825.

Wang SV, Schneeweiss S, Berger ML, Brown J, de Vries F, Douglas I, et al. Reporting to improve reproducibility and facilitate validity assessment for healthcare database studies V1. 0. Value health. 2017;20(8):1009–22.

Panagiotou OA, Heller R. Inferential challenges for real-world evidence in the era of routinely collected health data: many researchers, many more hypotheses, a single database. JAMA Oncol. 2021;7(11):1605–7.

Belbasis L, Panagiotou OA. Reproducibility of prediction models in health services research. BMC Res Notes. 2022;15(1):1–5.

Dwork C, McSherry F, Nissim K, Smith A. Calibrating noise to sensitivity in private data analysis. In: Theory of cryptography conference. Springer; 2006. p. 265–284.

Konečnỳ J, McMahan B, Ramage D. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575 . 2015.

Konečnỳ J, McMahan HB, Yu FX, Richtárik P, Suresh AT, Bacon D. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492 . 2016.

McCradden MD, Joshi S, Mazwi M, Anderson JA. Ethical limitations of algorithmic fairness solutions in health care machine learning. Lancet Digit Health. 2020;2(5):e221–3.

Mitchell S, Potash E, Barocas S, D’Amour A, Lum K. Algorithmic fairness: Choices, assumptions, and definitions. Ann Rev Stat Appl. 2021;8:141–63.

Mhasawade V, Zhao Y, Chunara R. Machine learning and algorithmic fairness in public and population health. Nat Mach Intell. 2021;3(8):659–66.

Wong PH. Democratizing algorithmic fairness. Philos Technol. 2020;33(2):225–44.

Paulus JK, Kent DM. Predictably unequal: understanding and addressing concerns that algorithmic clinical prediction may increase health disparities. NPJ Digit Med. 2020;3(1):1–8.

Orsini LS, Berger M, Crown W, Daniel G, Eichler HG, Goettsch W, et al. Improving transparency to build trust in real-world secondary data studies for hypothesis testing—why, what, and how: recommendations and a road map from the real-world evidence transparency initiative. Value Health. 2020;23(9):1128–36.

Patorno E, Schneeweiss S, Wang SV. Transparency in real-world evidence (RWE) studies to build confidence for decision-making: reporting RWE research in diabetes. Diabetes Obes Metab. 2020;22:45–59.

White R. Building trust in real-world evidence and comparative effectiveness research: the need for transparency. Future Med. 2017;6(1):5–7.

Rodriguez-Villa E, Torous J. Regulating digital health technologies with transparency: the case for dynamic and multi-stakeholder evaluation. BMC Med. 2019;17(1):1–5.

Acknowledgements

We thank the editor and two referees for reviewing the paper and providing suggestions.

Funding

Not applicable.

Author information

Authors and Affiliations

Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, 46530, Notre Dame, IN, USA

Fang Liu

School of Health Sciences and Education, Harokopio University, Athens, Greece

Demosthenes Panagiotakos

Contributions

FL and PD came up with the general idea for the article. FL did the literature review and wrote the manuscript. PD reviewed and revised the manuscript. Both authors have read and approved the manuscript.

Corresponding author

Correspondence to Fang Liu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

Both authors are Senior Editorial Board Members of BMC Medical Research Methodology.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised: the authors identified an error in the author name of Demosthenes Panagiotakos. The given name and family name were erroneously transposed. The incorrect author name is: Panagiotakos Demosthenes. The correct author name is: Demosthenes Panagiotakos.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Liu, F., Panagiotakos, D. Real-world data: a brief review of the methods, applications, challenges and opportunities. BMC Med Res Methodol 22 , 287 (2022). https://doi.org/10.1186/s12874-022-01768-6

Received : 08 April 2022

Accepted : 22 October 2022

Published : 05 November 2022

DOI : https://doi.org/10.1186/s12874-022-01768-6

Keywords

  • Real-world data (RWD)
  • Real-world evidence (RWE)
  • Electronic health records
  • Machine learning
  • Artificial intelligence
  • Causal inference

Strategies to Turn Real-world Data Into Real-world Knowledge

  • 1 Department of Radiation Oncology, University of California, San Francisco
  • 2 Bakar Computational Health Sciences Institute, University of California, San Francisco
  • Original Investigation: Wilkinson S, Gupta A, Scheuer N, Mackay E, Arora P, Thorlund K, Wasiak R, Ray J, Ramagopalan S, Subbiah V. Assessment of Alectinib vs Ceritinib in ALK-Positive NSCLC in Phase 2 Trials and Real-world Data. JAMA Network Open.

Real-world data (RWD), defined as “data regarding the usage, or the potential benefits or risks, of a drug derived from sources other than randomized clinical trials,” 1 have emerged as an important source of clinical information since the 21st Century Cures Act was signed in 2016. Although randomized clinical trials (RCTs) remain the highest standard of clinical evidence, RWD have offered the promise of generating insights from the vast clinical data aggregated in routine care. RWD build on the history of retrospective studies, filling knowledge gaps to supplement RCTs and generating hypotheses for future trials. RWD can address a number of limitations of RCTs, including (1) resource and time intensiveness; (2) issues with external generalizability due to stringent inclusion criteria, narrow practice settings, and patient disparities in access; and (3) insufficient power to detect rare events or study uncommon diseases. RWD have played an important role in US Food and Drug Administration regulatory review but also have serious limitations attributable to bias and data quality. These shortcomings can make it challenging to draw conclusions from comparative effectiveness studies. 2

Wilkinson et al 3 report their analysis of patients receiving alectinib and ceritinib for non–small cell lung cancer, investigating single-group phase 2 alectinib trials and real-world alectinib and ceritinib populations. Their study 3 focuses on evaluating uncertainty when using RWD. The authors should be commended for applying several approaches to characterize and manage the limitations of RWD. This presents an important opportunity to discuss best practices in analyzing RWD.

Many strategies have been critical in the evolution of RWD studies. One important standard that remains underutilized is the target trial framework, which emulates an RCT with observational data. 4 This approach includes specifying a time 0 (similar to the randomization time on an RCT) to facilitate assessment of eligibility criteria and appropriate end point definition. Although the term “target trial” is not explicitly stated, Wilkinson and colleagues 3 adopt 2 important strategies from this framework: definition of time 0 and assessing eligibility criteria based on this time. Although the authors do their best to approximate the intention-to-treat analysis leveraged in prospective trials, the use of treatment initiation as time 0 differs from the randomized setting, where events between randomization and treatment initiation (such as the development of a contraindication to a specific therapy or mortality) can occur. 4 This limitation is less impactful in this study, 3 because each comparison group is a systemic agent. In other studies, this can result in selection and immortal time bias (eg, patients who receive adjuvant chemotherapy after surgery must have survived long enough to receive the treatment). 4 Where possible, identifying the time at which a physician decided to initiate therapy would more closely approximate the randomization time of an RCT. However, this requires availability and manual review of clinical documentation.

Confounding is an important consideration, as comparisons frequently occur between biased populations. Confounders can impact either or both treatment selection and treatment outcomes. Wilkinson and colleagues 3 apply propensity weighting, where the propensity of treatment based on covariates is weighted to create balanced groups. Other strategies include matching, restriction, stratification, and regression; these offer complementary approaches to RWD. There is no single panacea—each approach has strengths and weaknesses 5 —and transparent reporting and interpretation are critical.

Unfortunately, these strategies are limited by the availability and quality of measured confounders and can be vulnerable to unmeasured confounders. 5 Wilkinson et al 3 demonstrate important strategies to mitigate 2 primary limitations in their available data. ECOG performance status was missing for 47.3% of patients in the ceritinib RWD group and 34.6% of patients in the alectinib RWD group, and important confounders such as socioeconomic characteristics and prior receipt of nonsystemic therapy (which is relevant given the role of local therapies in stage IIIB and oligometastatic non–small cell lung cancer) were not available. 3 To evaluate the potential impact of unmeasured confounders, the authors use quantitative bias analysis, which quantifies the required biasing effect of unmeasured confounders to impact study results. This can provide an understanding of the robustness of results against unmeasured confounders. Best practices in bias analysis, including conservative interpretation because of the underlying assumptions, have been previously described. 5 , 6 In oncology, future studies may also benefit from the integration of multiple data sources, such as oncology information systems and tumor registries, and the applications of computational approaches, such as natural language processing, 7 to improve the measurement of confounders.
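
One widely used form of quantitative bias analysis is the E-value, the minimum strength of association (on the risk ratio scale) that an unmeasured confounder would need to have with both treatment and outcome to explain away an observed effect. The sketch below computes it for a point estimate; this is a generic illustration, not necessarily the specific bias analysis performed by Wilkinson et al.

```python
import math

def e_value(rr: float) -> float:
    """E-value for an observed risk ratio (point estimate)."""
    if rr < 1:
        rr = 1 / rr        # work on the side away from the null
    return rr + math.sqrt(rr * (rr - 1))

# e.g., an observed RR of 1.8 would require confounder associations of about 3.0
# with both treatment and outcome to fully explain the result.
print(e_value(1.8))
```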

Data quality and missingness in analyzed covariates are also important challenges. In particular, routine clinical data are frequently acquired or missing because of intentional processes in the health care system (eg, greater frequency of obtaining vital signs for patients with more-acute conditions). This informative missingness can result in information bias. Therefore, limiting analyses to patients with complete data can bias results, and alternative approaches such as imputation can be helpful. Wilkinson and colleagues 3 highlight the limitations of missing baseline ECOG performance status and appropriately consider the possibility that performance status was missing from patients in a nonrandom fashion. The use of multiple approaches under varying assumptions to verify their findings strengthens their analyses. Overall, it is important for investigators and readers to understand the strengths and weaknesses of different imputation approaches. 8
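
A common alternative to complete-case analysis is multiple imputation: impute the missing covariate several times, analyze each completed dataset, and pool the estimates. The hedged sketch below illustrates the mechanics with hypothetical column names and a deliberately simplified placeholder analysis; a real analysis would also pool variances (Rubin's rules) and vary the missingness assumptions, as the commentary recommends.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

df = pd.read_csv("nsclc_cohort.csv")                 # hypothetical analysis dataset
covars = ["age", "ecog", "stage", "treated", "outcome"]  # assumed numeric-coded columns

estimates = []
for seed in range(5):                                # 5 imputed datasets
    imputer = IterativeImputer(random_state=seed, sample_posterior=True)
    completed = pd.DataFrame(imputer.fit_transform(df[covars]), columns=covars)
    # Placeholder analysis: crude difference in mean outcome by treatment.
    est = (completed.loc[completed["treated"] == 1, "outcome"].mean()
           - completed.loc[completed["treated"] == 0, "outcome"].mean())
    estimates.append(est)

pooled = np.mean(estimates)
print(f"pooled estimate across imputations: {pooled:.3f}")
```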

Differential data acquisition can also create discrepancies between data collected during routine clinical care vs clinical trials. In their study, Wilkinson et al 3 compare 2 populations receiving alectinib (patients enrolled in phase 2 trials and real-world patients) with real-world patients receiving ceritinib. Importantly, RWD have played an increasing role in the development of synthetic controls for single-group studies. As the authors describe, 3 data harmonization for this use remains an important barrier, and harmonization decisions should be clearly reported. Given this limitation and variations in study populations, the authors’ replication of results using both single-group and real-world populations is important to support confidence in their findings.

RWD and RCTs will continue to serve complementary roles. RWD can efficiently inform the next generation of RCTs and fill knowledge gaps when RCTs are not feasible (whether because of cost, time, or other considerations), while also providing large sample sizes and better external generalizability. RCTs are designed to maximize internal validity and the ability to make causal inferences within the confines of a well-controlled environment.

For clinicians to identify best practices, the challenge will be in synthesizing RCT results and RWD to optimize decisions for individual patients. It has been well documented that the results of observational studies can be challenging to reconcile with those of RCTs. 2 Discordance should be anticipated and often relates to the aforementioned challenges. At other times, this may reflect differences with real-world populations and clinical practice. It is important to consider analogs that exist in RCTs. Benefits have been identified in smaller-scale RCTs that are not reproduced when evaluated in the cooperative group setting. Moreover, a proportion of RCTs will have incorrect findings due to chance, which can be identified only when multiple trials address similar hypotheses. Overall, disagreement between RCTs and RWD may further inform the design of future RCTs.

We should look forward to the continued evolution of best practices in extracting, harmonizing, and analyzing RWD. Maximizing our ability to draw conclusions from these data and placing them in appropriate context with RCTs will be critical to advance patient care in a timely and resource-efficient manner.

Published: October 7, 2021. doi:10.1001/jamanetworkopen.2021.28045

Open Access: This is an open access article distributed under the terms of the CC-BY License . © 2021 Hong JC. JAMA Network Open .

Corresponding Author: Julian C. Hong, MD, MS, Department of Radiation Oncology, University of California, San Francisco, 1825 Fourth St, Ste L1101, San Francisco, CA 94158 ( [email protected] ).

Conflict of Interest Disclosures: Dr Hong reported being a coinventor on a pending patent that is unrelated to this article. His research has been funded by the American Cancer Society and he is supported by a Career Development Award from Conquer Cancer. No other disclosures were reported.

Hong JC. Strategies to Turn Real-world Data Into Real-world Knowledge. JAMA Netw Open. 2021;4(10):e2128045. doi:10.1001/jamanetworkopen.2021.28045

  • Open access
  • Published: 15 May 2024

Learning together for better health using an evidence-based Learning Health System framework: a case study in stroke

  • Helena Teede 1 , 2   na1 ,
  • Dominique A. Cadilhac 3 , 4   na1 ,
  • Tara Purvis 3 ,
  • Monique F. Kilkenny 3 , 4 ,
  • Bruce C.V. Campbell 4 , 5 , 6 ,
  • Coralie English 7 ,
  • Alison Johnson 2 ,
  • Emily Callander 1 ,
  • Rohan S. Grimley 8 , 9 ,
  • Christopher Levi 10 ,
  • Sandy Middleton 11 , 12 ,
  • Kelvin Hill 13 &
  • Joanne Enticott   ORCID: orcid.org/0000-0002-4480-5690 1  

BMC Medicine volume  22 , Article number:  198 ( 2024 ) Cite this article

1 Altmetric

Metrics details

In the context of expanding digital health tools, the health system is ready for Learning Health System (LHS) models. With proper governance and stakeholder engagement, these models enable digital infrastructure to be integrated so that all relevant parties, including clinicians and consumers, receive feedback on performance against best-practice standards, while also fostering innovation and aligning healthcare with patient needs. The LHS literature primarily comprises opinion- or consensus-based frameworks and lacks validation or evidence of benefit. Our aim was to outline a rigorously codesigned, evidence-based LHS framework and to present a national case study of an LHS-aligned national stroke program that has delivered clinical benefit.

Current core components of a LHS involve capturing evidence from communities and stakeholders (quadrant 1), integrating evidence from research findings (quadrant 2), leveraging evidence from data and practice (quadrant 3), and generating evidence from implementation (quadrant 4) for iterative system-level improvement. The Australian Stroke program was selected as the case study because it provides an exemplar of how an iterative LHS works in practice at a national level, encompassing and integrating evidence from all four LHS quadrants. Using this case study, we demonstrate how to apply evidence-based processes to healthcare improvement and how to embed real-world research for optimising healthcare improvement. We emphasize the transition from research as an endpoint to research as an enabler and a solution for impact in healthcare improvement.

Conclusions

The Australian Stroke program has nationally improved stroke care since 2007, showcasing the value of integrated LHS-aligned approaches for tangible impact on outcomes. This LHS case study is a practical example for other health conditions and settings to follow suit.

Internationally, health systems are facing a crisis, driven by an ageing population, increasing complexity, multi-morbidity, rapidly advancing health technology and rising costs that threaten sustainability and mandate transformation and improvement [ 1 , 2 ]. Although research has generated solutions to healthcare challenges, and the advent of big data and digital health holds great promise, entrenched siloes and poor integration of knowledge generation, knowledge implementation and healthcare delivery between stakeholders curtail momentum towards, and consistent attainment of, evidence- and value-based care [ 3 ]. This is compounded by the short supply of research and innovation leadership within the healthcare sector, and poorly integrated and often inaccessible health data systems, which have crippled the potential to deliver on digital-driven innovation [ 4 ]. Current approaches to healthcare improvement are also often isolated, with limited sustainability, scale-up and impact [ 5 ].

Evidence suggests that integration and partnership across academic and healthcare delivery stakeholders are key to progress, including those with lived experience and their families (referred to here as consumers and community), diverse disciplines (both research and clinical), policy makers and funders. Utilization of evidence from research and evidence from practice, including data from routine care, supported by implementation research, is key to sustainably embedding improvement and optimising health care and outcomes. A strategy to achieve this integration is the Learning Health System (LHS) (Fig. 1) [ 2 , 6 , 7 , 8 ]. Although there are numerous publications on LHS approaches [ 9 , 10 , 11 , 12 ], many focus on research perspectives and data, and most do not demonstrate tangible healthcare improvement or better health outcomes [ 6 ].

Figure 1. Monash Learning Health System: the Learn Together for Better Health Framework developed by Monash Partners and Monash University (from Enticott et al. 2021 [ 7 ]). Four evidence quadrants: Q1 (orange) is evidence from stakeholders; Q2 (green) is evidence from research; Q3 (light blue) is evidence from data; and Q4 (dark blue) is evidence from implementation and healthcare improvement.

In developed nations, it has been estimated that 60% of care provided aligns with the evidence base, 30% is low value and 10% is potentially harmful [ 13 ]. In some areas, clinical advances have been rapid and research and evidence have paved the way for dramatic improvement in outcomes, mandating rapid implementation of evidence into healthcare (e.g. polio and COVID-19 vaccines). However, healthcare improvement is challenging and slow [ 5 ]. Health systems are highly complex in their design, networks and interacting components, and change is difficult to enact, sustain and scale up [ 3 ]. New, effective strategies are needed to meet community needs and deliver evidence-based and value-based care, which reorients care from serving the provider, services and system towards serving community needs, based on evidence and quality. Value-based care goes beyond cost to encompass patient and provider experience, quality care and outcomes, efficiency and sustainability [ 2 , 6 ].

The costs of stroke care are expected to rise rapidly in the next decades, unless improvements in stroke care to reduce the disabling effects of strokes can be successfully developed and implemented [ 14 ]. Here, we briefly describe the Monash LHS framework (Fig.  1 ) [ 2 , 6 , 7 ] and outline an exemplar case in order to demonstrate how to apply evidence-based processes to healthcare improvement and embed real-world research for optimising healthcare. The Australian LHS exemplar in stroke care has driven nationwide improvement in stroke care since 2007.

An evidence-based Learning Health System framework

In Australia, members of this author group (HT, AJ, JE) have rigorously co-developed an evidence-based LHS framework, known simply as the Monash LHS [ 7 ]. The Monash LHS was designed to support sustainable, iterative and continuous improvement that delivers robust benefit in clinical outcomes. It was created with national engagement in order to be applicable to Australian settings. Through this rigorous approach, core LHS principles and components have been established (Fig. 1). Evidence shows that people/workforce, culture, standards, governance and resources are all key to an effective LHS [ 2 , 6 ]. Culture is vital, including trust, transparency, partnership and co-design. Key processes include legally compliant data sharing, linkage and governance, resources, and infrastructure [ 4 ]. The Monash LHS integrates disparate and often siloed stakeholders, infrastructure and expertise to ‘Learn Together for Better Health’ [ 7 ] (Fig. 1). This integrates (i) evidence from community and stakeholders, including priority areas and outcomes; (ii) evidence from research and guidelines; (iii) evidence from practice (from data) with advanced analytics and benchmarking; and (iv) evidence from implementation science and health economics. Importantly, it starts with the problem and priorities of key stakeholders, including the community, health professionals and services, and creates an iterative learning system to address these. The following case study was chosen as it is an exemplar of how a Monash LHS-aligned national stroke program has delivered clinical benefit.

Australian Stroke Learning Health System

Internationally, the application of LHS approaches in stroke has resulted in improved stroke care and outcomes [ 12 ]. For example, in Canada a sustained decrease in 30-day in-hospital mortality has been found commensurate with an increase in resources to establish the multifactorial stroke system intervention for stroke treatment and prevention [ 15 ]. Arguably, with rapid advances in evidence and in the context of an ageing population with high cost and care burden and substantive impacts on quality of life, stroke is an area with a need for rapid research translation into evidence-based and value-based healthcare improvement. However, a recent systematic review found that the existing literature had few comprehensive examples of LHS adoption [ 12 ]. Although healthcare improvement systems and approaches were described, less is known about patient-clinician and stakeholder engagement, governance and culture, or embedding of data informatics into everyday practice to inform and drive improvement [ 12 ]. For example, in a recent review of quality improvement collaborations, it was found that although clinical processes in stroke care are improved, their short-term nature means there is uncertainty about sustainability and impacts on patient outcomes [ 16 ]. Table  1 provides the main features of the Australian Stroke LHS based on the four core domains and eight elements of the Learning Together for Better Health Framework described in Fig.  1 . The features are further expanded on in the following sections.

Evidence from stakeholders (LHS quadrant 1, Fig.  1 )

Engagement, partners and priorities.

Within the stroke field, there have been various support mechanisms to facilitate an LHS approach, including partnership and broad stakeholder engagement encompassing clinical networks and policy makers from different jurisdictions. Since 2008, the Australian Stroke Coalition has been co-led by the Stroke Foundation, a charitable consumer advocacy organisation, and the Stroke Society of Australasia, a professional society with membership covering academics and multidisciplinary clinician networks, which are collectively working to improve stroke care ( https://australianstrokecoalition.org.au/ ). Surveys, focus groups and workshops have been used to identify priorities from stakeholders. Recent agreed priorities have been to improve stroke care and strengthen the voice for stroke care at a national ( https://strokefoundation.org.au/ ) and international level ( https://www.world-stroke.org/news-and-blog/news/world-stroke-organization-tackle-gaps-in-access-to-quality-stroke-care ), as well as to reduce duplication amongst stakeholders. This activity is built on a foundation and culture of research and innovation embedded within the stroke ‘community of practice’. Consumers, as people with lived experience of stroke, are important members of the Australian Stroke Coalition, as are representatives from different clinical colleges. Consumers also provide critical input to a range of LHS activities via the Stroke Foundation Consumer Council, Stroke Living Guidelines committees, and the Australian Stroke Clinical Registry (AuSCR) Steering Committee (described below).

Evidence from research (LHS quadrant 2, Fig.  1 )

Advancement of the evidence for stroke interventions and synthesis into clinical guidelines.

To implement best practice, it is crucial to distil the large volume of scientific and trial literature into actionable recommendations for clinicians to use in practice [ 24 ]. The first Australian clinical guidelines for acute stroke were produced in 2003, following the increasing evidence emerging for prevention interventions (e.g. carotid endarterectomy, blood pressure lowering), acute medical treatments (intravenous thrombolysis, aspirin within 48 h of ischemic stroke), and optimised hospital management (care in dedicated stroke units by a specialised and coordinated multidisciplinary team) [ 25 ]. Importantly, a number of the innovations were developed, researched and proven effective by key opinion leaders embedded in the Australian stroke care community. In 2005, the clinical guidelines for Stroke Rehabilitation and Recovery [ 26 ] were produced, with subsequent merged guidelines periodically updated. However, the traditional process of periodic guideline updates is challenging for end users when new research can render recommendations redundant, and this lack of currency erodes stakeholder trust [ 27 ]. In response to this challenge, the Stroke Foundation and Cochrane Australia entered a pioneering project to produce the first electronic ‘living’ guidelines globally [ 20 ]. Major shifts in the evidence for reperfusion therapies (e.g. extended time-window intravenous thrombolysis and endovascular clot retrieval), among other advances, were able to be converted into new recommendations, approved by the Australian National Health and Medical Research Council within a few months of publication. Feedback on this process confirmed the increased use of and trust in the guidelines by clinicians. The process informed other living guidelines programs, including the successful COVID-19 clinical guidelines [ 28 ].

However, best practice clinical guideline recommendations are necessary but insufficient for healthcare improvement; nesting them within an LHS with stakeholder partnership enables implementation via a range of proven methods, including audit and feedback strategies [ 29 ].

Evidence from data and practice (LHS quadrant 3, Fig.  1 )

Data systems and benchmarking: revealing the disparities in care between health services.

A national system for standardized stroke data collection was established as the National Stroke Audit program in 2007 by the Stroke Foundation [ 30 ], following various state-level programs (e.g. New South Wales Audit) [ 31 ], to identify evidence-practice gaps and prioritise improvement efforts to increase access to stroke units and other acute treatments [ 32 ]. The Audit program alternates each year between acute (commencing in 2007) and rehabilitation in-patient services (commencing in 2008). The Audit program provides a ‘deep dive’ on the majority of recommendations in the clinical guidelines, whereby participating hospitals provide audits of up to 40 consecutive patient medical records and respond to a survey about organizational resources to manage stroke. In 2009, the AuSCR was established to provide information on patients managed in acute hospitals based on a small subset of quality processes of care linked to benchmarked reports of performance (Fig. 2) [ 33 ]. In this way, high-priority processes of stroke care could be continuously collected and regularly reviewed to guide improvement to care [ 34 ]. In addition, clinical quality registry programs within Australia have shown a meaningful return on investment attributed to enhanced survival, improvements in quality of life and avoided costs of treatment or hospital stay [ 35 ].

Figure 2. Example performance report from the Australian Stroke Clinical Registry: average door-to-needle time in providing intravenous thrombolysis by different hospitals in 2021 [ 36 ]. Each bar in the figure represents a single hospital.
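To make the benchmarking idea concrete, the sketch below computes hospital-level door-to-needle summaries from registry-style records and flags sites above a target. It is a minimal illustration only: the column names, values, and the 60-minute benchmark are assumptions, not the actual AuSCR schema or indicator definitions.

```python
import pandas as pd

# Illustrative registry extract; column names and values are hypothetical.
records = pd.DataFrame({
    "hospital": ["A", "A", "B", "B", "C", "C"],
    "door_to_needle_min": [48, 72, 55, 61, 90, 85],
})

# Average door-to-needle time per hospital, as in a benchmarked report.
report = (records.groupby("hospital")["door_to_needle_min"]
          .agg(cases="count", mean_dtn="mean")
          .reset_index())

# Flag hospitals whose average exceeds an assumed 60-minute benchmark.
report["above_benchmark"] = report["mean_dtn"] > 60
print(report.sort_values("mean_dtn"))
```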

The Australian Stroke Coalition endorsed the creation of an integrated technological solution for collecting data through a single portal for multiple programs in 2013. In 2015, the Stroke Foundation, AuSCR consortium, and other relevant groups cooperated to design an integrated data management platform (the Australian Stroke Data Tool) to reduce duplication of effort for hospital staff in the collection of overlapping variables in the same patients [ 19 ]. Importantly, a national data dictionary then provided the common data definitions to facilitate standardized data capture. Another important feature of AuSCR is the collection of patient-reported outcome surveys between 90 and 180 days after stroke, and annual linkage with national death records to ascertain survival status [ 33 ]. To support a LHS approach, hospitals that participate in AuSCR have access to a range of real-time performance reports. In efforts to minimize the burden of data collection in the AuSCR, interoperability approaches to import data directly from hospital or state-level managed stroke databases have been established (Fig.  3 ); however, the application has been variable and 41% of hospitals still manually enter all their data.

Figure 3. Current status of automated data importing solutions in the Australian Stroke Clinical Registry, 2022, with ‘n’ representing the number of hospitals. AuSCR, Australian Stroke Clinical Registry; AuSDaT, Australian Stroke Data Tool; API, Application Programming Interface; ICD, International Classification of Diseases; REDCap, Research Electronic Data Capture; eMR, electronic medical records.
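The interoperability approaches summarised in Fig. 3 depend on the national data dictionary providing common data definitions onto which locally held fields can be mapped. The snippet below is a minimal sketch of that mapping step under invented field names and code lists; it does not reflect the actual AuSDaT data dictionary or import interfaces.

```python
# Minimal sketch of mapping a local hospital extract onto common
# data-dictionary fields; all names and codes below are hypothetical.
FIELD_MAP = {            # local field -> common field
    "pt_sex": "sex",
    "stroke_dx_icd": "stroke_type",
}
VALUE_MAP = {            # local codes -> common categories
    "stroke_type": {"I63": "ischaemic", "I61": "intracerebral haemorrhage"},
}

def to_common(record: dict) -> dict:
    """Rename fields and recode values according to the shared dictionary."""
    common = {FIELD_MAP.get(k, k): v for k, v in record.items()}
    for field, codes in VALUE_MAP.items():
        if field in common:
            common[field] = codes.get(common[field], "unknown")
    return common

print(to_common({"pt_sex": "F", "stroke_dx_icd": "I63"}))
```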

For acute stroke care, the Australian Commission on Quality and Safety in Health Care facilitated the co-design (clinicians, academics, consumers) and publication of the national Acute Stroke Clinical Care Standard in 2015 [ 17 ], and subsequent review [ 18 ]. The indicator set for the Acute Stroke Standard then informed the expansion of the minimum dataset for AuSCR so that hospitals could routinely track their performance. The national Audit program enabled hospitals not involved in the AuSCR to assess their performance every two years against the Acute Stroke Standard. Complementing these efforts, the Stroke Foundation, working with the sector, developed the Acute and Rehabilitation Stroke Services Frameworks to outline the principles, essential elements, models of care and staffing recommendations for stroke services ( https://informme.org.au/guidelines/national-stroke-services-frameworks ). The Frameworks are intended to guide where stroke services should be developed, and monitor their uptake with the organizational survey component of the Audit program.

Evidence from implementation and healthcare improvement (LHS quadrant 4, Fig.  1 )

Research to better utilize and augment data from registries through linkage [ 37 , 38 , 39 , 40 ], and to ensure that presentations of hospital- or service-level data are understood by clinicians, has advanced the field for the Australian Stroke LHS [ 41 ]. Importantly, greater insights into whole patient journeys, before and after a stroke, can now enable exploration of value-based care. The LHS and stroke data platform have enabled focused and time-limited projects to create a better understanding of the quality of care in acute or rehabilitation settings [ 22 , 42 , 43 ]. Within stroke, all the elements of an LHS culminate in the ready availability of benchmarked performance data and support for implementation of strategies to address gaps in care.

Implementation research to grow the evidence base for effective improvement interventions has also been a key pillar in the Australian context. These include multi-component implementation interventions to achieve behaviour change for particular aspects of stroke care [ 22 , 23 , 44 , 45 ], and real-world approaches to augmenting access to hyperacute interventions in stroke through the use of technology and telehealth [ 46 , 47 , 48 , 49 ]. The evidence from these studies feeds into the living guidelines program and the data collection systems, such as the Audit program or AuSCR, which are then amended to ensure data align with recommended care. For example, the use of ‘hyperacute aspirin within the first 48 h of ischemic stroke’ was modified to ‘hyperacute antiplatelet…’ to incorporate new evidence that other medications or combinations are appropriate to use. Additionally, new datasets have been developed to align with evidence, such as the Fever, Sugar, and Swallow variables [ 42 ]. Evidence on improvements in access to best practice care from the acute Audit program [ 50 ] and AuSCR is emerging [ 36 ]. For example, between 2007 and 2017, the odds of receiving intravenous thrombolysis after ischemic stroke increased by 16% (OR 1.16, 95% CI 1.13–1.18) and of being managed in a stroke unit by 18% (OR 1.18, 95% CI 1.17–1.20). Over this period, the median length of hospital stay for all patients decreased from 6.3 days in 2007 to 5.0 days in 2017 [ 51 ]. When considering the number of additional patients who received treatment in 2017 compared with 2007, it was estimated that, without this additional treatment, over 17,000 healthy years of life would have been lost in 2017 (17,786 disability-adjusted life years) [ 51 ]. There is evidence on the cost-effectiveness of different system-focussed strategies to augment treatment access for acute ischemic stroke (e.g. the Victorian Stroke Telemedicine program [ 52 ] and the Melbourne Mobile Stroke Unit ambulance [ 53 ]). Reciprocally, evidence from the national Rehabilitation Audit, where the LHS approach has been less complete or embedded, has shown fewer areas of healthcare improvement over time [ 51 , 54 ].
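As a reminder of how such effect estimates are derived, the sketch below computes an odds ratio with a Wald 95% confidence interval from a simple 2x2 table. The counts are invented for illustration and do not reproduce the registry analysis, which was based on far larger numbers and adjusted models.

```python
import math

# Hypothetical counts (not registry data): thrombolysed vs. not thrombolysed,
# in a later versus an earlier period.
a, b = 350, 2650   # later period
c, d = 305, 2695   # earlier period

odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```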

Within the field of stroke in Australia, there is indirect evidence that the collective efforts that align with establishing the components of an LHS have had an impact. Overall, the age-standardised rate of stroke events reduced by 27% between 2001 and 2020, from 169 to 124 events per 100,000 population. Substantial declines in mortality rates have been reported since 1980. Commensurate with national clinical guidelines being updated in 2007 and the first National Stroke Audit being undertaken in 2007, the mortality rates for men (37.4 deaths per 100,000) and women (36.1 deaths per 100,000) declined to 23.8 and 23.9 per 100,000, respectively, in 2021 [ 55 ].

The LHS is underpinned by the integration of the four quadrants of evidence (from stakeholders, from research and guidelines, from practice, and from implementation), and the core LHS principles have also been addressed. Leadership and governance have been important, and programs have been established to augment workforce training and capacity building in best practice professional development. Medical practitioners are able to undertake courses and mentoring through the Australasian Stroke Academy ( http://www.strokeacademy.com.au/ ), while nurses (and other health professionals) can access teaching modules in stroke care from the Acute Stroke Nurses Education Network ( https://asnen.org/ ). The Association of Neurovascular Clinicians offers distance-accessible education and certification to develop stroke expertise for interdisciplinary professionals, including advanced stroke co-ordinator certification ( www.anvc.org ). Consumer initiatives also inform the design of the AuSCR Public Summary Annual reports (available at https://auscr.com.au/about/annual-reports/ ) and consumer-related resources related to the Living Guidelines ( https://enableme.org.au/resources ).

The important success factors and lessons from stroke as a national exemplar LHS in Australia include leadership, culture, workforce and resources, integrated with (1) established and broad partnerships across the academic-clinical sector divide and stakeholder engagement; (2) the living guidelines program; (3) national data infrastructure, including a national data dictionary that provides the common data framework to support standardized data capture; (4) various implementation strategies, including benchmarking and feedback as well as engagement strategies targeting different levels of the health system; and (5) implementation and improvement research to advance stroke systems of care and reduce unwarranted variation in practice (Fig. 1). Priority opportunities now include advancing interoperability with electronic medical records, an area that all clinical quality registry programs need to address, as well as providing more dynamic and interactive data dashboards tailored to the needs of clinicians and health service executives.

There is a clear mandate to optimise healthcare improvement with big data offering major opportunities for change. However, we have lacked the approaches to capture evidence from the community and stakeholders, to integrate evidence from research, to capture and leverage data or evidence from practice and to generate and build on evidence from implementation using iterative system-level improvement. The LHS provides this opportunity and is shown to deliver impact. Here, we have outlined the process applied to generate an evidence-based LHS and provide a leading exemplar in stroke care. This highlights the value of moving from single-focus isolated approaches/initiatives to healthcare improvement and the benefit of integration to deliver demonstrable outcomes for our funders and key stakeholders — our community. This work provides insight into strategies that can both apply evidence-based processes to healthcare improvement as well as implementing evidence-based practices into care, moving beyond research as an endpoint, to research as an enabler, underpinning delivery of better healthcare.

Availability of data and materials

Not applicable

Abbreviations

AuSCR: Australian Stroke Clinical Registry

CI: Confidence interval

LHS: Learning Health System

World Health Organization. Delivering quality health services . OECD Publishing; 2018.

Enticott J, Braaf S, Johnson A, Jones A, Teede HJ. Leaders’ perspectives on learning health systems: A qualitative study. BMC Health Serv Res. 2020;20:1087.


Melder A, Robinson T, McLoughlin I, Iedema R, Teede H. An overview of healthcare improvement: Unpacking the complexity for clinicians and managers in a learning health system. Intern Med J. 2020;50:1174–84.


Alberto IRI, Alberto NRI, Ghosh AK, Jain B, Jayakumar S, Martinez-Martin N, et al. The impact of commercial health datasets on medical research and health-care algorithms. Lancet Digit Health. 2023;5:e288–94.


Dixon-Woods M. How to improve healthcare improvement—an essay by Mary Dixon-Woods. BMJ. 2019;367: l5514.

Enticott J, Johnson A, Teede H. Learning health systems using data to drive healthcare improvement and impact: A systematic review. BMC Health Serv Res. 2021;21:200.

Enticott JC, Melder A, Johnson A, Jones A, Shaw T, Keech W, et al. A learning health system framework to operationalize health data to improve quality care: An Australian perspective. Front Med (Lausanne). 2021;8:730021.

Dammery G, Ellis LA, Churruca K, Mahadeva J, Lopez F, Carrigan A, et al. The journey to a learning health system in primary care: A qualitative case study utilising an embedded research approach. BMC Prim Care. 2023;24:22.

Foley T, Horwitz L, Zahran R. The learning healthcare project: Realising the potential of learning health systems. 2021. Available from https://learninghealthcareproject.org/wp-content/uploads/2021/05/LHS2021report.pdf . Accessed Jan 2024.

Institute of Medicine. Best care at lower cost: The path to continuously learning health care in America. Washington: The National Academies Press; 2013.


Zurynski Y, Smith CL, Vedovi A, Ellis LA, Knaggs G, Meulenbroeks I, et al. Mapping the learning health system: A scoping review of current evidence - a white paper. 2020:63

Cadilhac DA, Bravata DM, Bettger J, Mikulik R, Norrving B, Uvere E, et al. Stroke learning health systems: A topical narrative review with case examples. Stroke. 2023;54:1148–59.

Braithwaite J, Glasziou P, Westbrook J. The three numbers you need to know about healthcare: The 60–30-10 challenge. BMC Med. 2020;18:1–8.


King D, Wittenberg R, Patel A, Quayyum Z, Berdunov V, Knapp M. The future incidence, prevalence and costs of stroke in the UK. Age Ageing. 2020;49:277–82.

Ganesh A, Lindsay P, Fang J, Kapral MK, Cote R, Joiner I, et al. Integrated systems of stroke care and reduction in 30-day mortality: A retrospective analysis. Neurology. 2016;86:898–904.

Lowther HJ, Harrison J, Hill JE, Gaskins NJ, Lazo KC, Clegg AJ, et al. The effectiveness of quality improvement collaboratives in improving stroke care and the facilitators and barriers to their implementation: A systematic review. Implement Sci. 2021;16:16.

Australian Commission on Safety and Quality in Health Care. Acute stroke clinical care standard. 2015. Available from https://www.safetyandquality.gov.au/our-work/clinical-care-standards/acute-stroke-clinical-care-standard . Accessed Jan 2024.

Australian Commission on Safety and Quality in Health Care. Acute stroke clinical care standard. Sydney: ACSQHC; 2019. Available from https://www.safetyandquality.gov.au/publications-and-resources/resource-library/acute-stroke-clinical-care-standard-evidence-sources . Accessed Jan 2024.

Ryan O, Ghuliani J, Grabsch B, Hill K, G CC, Breen S, et al. Development, implementation, and evaluation of the Australian Stroke Data Tool (AuSDaT): Comprehensive data capturing for multiple uses. Health Inf Manag. 2022:18333583221117184.

English C, Bayley M, Hill K, Langhorne P, Molag M, Ranta A, et al. Bringing stroke clinical guidelines to life. Int J Stroke. 2019;14:337–9.

English C, Hill K, Cadilhac DA, Hackett ML, Lannin NA, Middleton S, et al. Living clinical guidelines for stroke: Updates, challenges and opportunities. Med J Aust. 2022;216:510–4.

Cadilhac DA, Grimley R, Kilkenny MF, Andrew NE, Lannin NA, Hill K, et al. Multicenter, prospective, controlled, before-and-after, quality improvement study (Stroke123) of acute stroke care. Stroke. 2019;50:1525–30.

Cadilhac DA, Marion V, Andrew NE, Breen SJ, Grabsch B, Purvis T, et al. A stepped-wedge cluster-randomized trial to improve adherence to evidence-based practices for acute stroke management. Jt Comm J Qual Patient Saf. 2022.

Elliott J, Lawrence R, Minx JC, Oladapo OT, Ravaud P, Jeppesen BT, et al. Decision makers need constantly updated evidence synthesis. Nature. 2021;600:383–5.


National Stroke Foundation. National guidelines for acute stroke management. Melbourne: National Stroke Foundation; 2003.

National Stroke Foundation. Clinical guidelines for stroke rehabilitation and recovery. Melbourne: National Stroke Foundation; 2005.

Phan TG, Thrift A, Cadilhac D, Srikanth V. A plea for the use of systematic review methodology when writing guidelines and timely publication of guidelines. Intern Med J. 2012;42:1369–71; author reply 1371–2.

Tendal B, Vogel JP, McDonald S, Norris S, Cumpston M, White H, et al. Weekly updates of national living evidence-based guidelines: Methods for the Australian living guidelines for care of people with COVID-19. J Clin Epidemiol. 2021;131:11–21.

Grimshaw JM, Eccles MP, Lavis JN, Hill SJ, Squires JE. Knowledge translation of research findings. Implement Sci. 2012;7:50.

Harris D, Cadilhac D, Hankey GJ, Hillier S, Kilkenny M, Lalor E. National stroke audit: The Australian experience. Clin Audit. 2010;2:25–31.

Cadilhac DA, Purvis T, Kilkenny MF, Longworth M, Mohr K, Pollack M, et al. Evaluation of rural stroke services: Does implementation of coordinators and pathways improve care in rural hospitals? Stroke. 2013;44:2848–53.

Cadilhac DA, Moss KM, Price CJ, Lannin NA, Lim JY, Anderson CS. Pathways to enhancing the quality of stroke care through national data monitoring systems for hospitals. Med J Aust. 2013;199:650–1.

Cadilhac DA, Lannin NA, Anderson CS, Levi CR, Faux S, Price C, et al. Protocol and pilot data for establishing the Australian Stroke Clinical Registry. Int J Stroke. 2010;5:217–26.

Ivers N, Jamtvedt G, Flottorp S, Young J, Odgaard-Jensen J, French S, et al. Audit and feedback: Effects on professional practice and healthcare outcomes. Cochrane Database Syst Rev. 2012.

Australian Commission on Safety and Quality in Health Care. Economic evaluation of clinical quality registries: final report. 2016:79.

Cadilhac DA, Dalli LL, Morrison J, Lester M, Paice K, Moss K, et al. The Australian Stroke Clinical Registry annual report 2021. Melbourne; 2022. Available from https://auscr.com.au/about/annual-reports/ . Accessed 6 May 2024.

Kilkenny MF, Kim J, Andrew NE, Sundararajan V, Thrift AG, Katzenellenbogen JM, et al. Maximising data value and avoiding data waste: A validation study in stroke research. Med J Aust. 2019;210:27–31.

Eliakundu AL, Smith K, Kilkenny MF, Kim J, Bagot KL, Andrew E, et al. Linking data from the Australian Stroke Clinical Registry with ambulance and emergency administrative data in Victoria. Inquiry. 2022;59:469580221102200.


Andrew NE, Kim J, Cadilhac DA, Sundararajan V, Thrift AG, Churilov L, et al. Protocol for evaluation of enhanced models of primary care in the management of stroke and other chronic disease (PRECISE): A data linkage healthcare evaluation study. Int J Popul Data Sci. 2019;4:1097.


Mosalski S, Shiner CT, Lannin NA, Cadilhac DA, Faux SG, Kim J, et al. Increased relative functional gain and improved stroke outcomes: A linked registry study of the impact of rehabilitation. J Stroke Cerebrovasc Dis. 2021;30: 106015.

Ryan OF, Hancock SL, Marion V, Kelly P, Kilkenny MF, Clissold B, et al. Feedback of aggregate patient-reported outcomes (PROs) data to clinicians and hospital end users: Findings from an Australian codesign workshop process. BMJ Open. 2022;12:e055999.

Grimley RS, Rosbergen IC, Gustafsson L, Horton E, Green T, Cadigan G, et al. Dose and setting of rehabilitation received after stroke in Queensland, Australia: A prospective cohort study. Clin Rehabil. 2020;34:812–23.

Purvis T, Middleton S, Craig LE, Kilkenny MF, Dale S, Hill K, et al. Inclusion of a care bundle for fever, hyperglycaemia and swallow management in a national audit for acute stroke: Evidence of upscale and spread. Implement Sci. 2019;14:87.

Middleton S, McElduff P, Ward J, Grimshaw JM, Dale S, D’Este C, et al. Implementation of evidence-based treatment protocols to manage fever, hyperglycaemia, and swallowing dysfunction in acute stroke (QASC): A cluster randomised controlled trial. Lancet. 2011;378:1699–706.

Middleton S, Dale S, Cheung NW, Cadilhac DA, Grimshaw JM, Levi C, et al. Nurse-initiated acute stroke care in emergency departments. Stroke. 2019:STROKEAHA118020701.

Hood RJ, Maltby S, Keynes A, Kluge MG, Nalivaiko E, Ryan A, et al. Development and pilot implementation of TACTICS VR: A virtual reality-based stroke management workflow training application and training framework. Front Neurol. 2021;12:665808.

Bladin CF, Kim J, Bagot KL, Vu M, Moloczij N, Denisenko S, et al. Improving acute stroke care in regional hospitals: Clinical evaluation of the Victorian Stroke Telemedicine program. Med J Aust. 2020;212:371–7.

Bladin CF, Bagot KL, Vu M, Kim J, Bernard S, Smith K, et al. Real-world, feasibility study to investigate the use of a multidisciplinary app (Pulsara) to improve prehospital communication and timelines for acute stroke/STEMI care. BMJ Open. 2022;12:e052332.

Zhao H, Coote S, Easton D, Langenberg F, Stephenson M, Smith K, et al. Melbourne mobile stroke unit and reperfusion therapy: Greater clinical impact of thrombectomy than thrombolysis. Stroke. 2020;51:922–30.

Purvis T, Cadilhac DA, Hill K, Reyneke M, Olaiya MT, Dalli LL, et al. Twenty years of monitoring acute stroke care in Australia from the national stroke audit program (1999–2019): Achievements and areas of future focus. J Health Serv Res Policy. 2023.

Cadilhac DA, Purvis T, Reyneke M, Dalli LL, Kim J, Kilkenny MF. Evaluation of the national stroke audit program: 20-year report. Melbourne; 2019.

Kim J, Tan E, Gao L, Moodie M, Dewey HM, Bagot KL, et al. Cost-effectiveness of the Victorian Stroke Telemedicine program. Aust Health Rev. 2022;46:294–301.

Kim J, Easton D, Zhao H, Coote S, Sookram G, Smith K, et al. Economic evaluation of the Melbourne mobile stroke unit. Int J Stroke. 2021;16:466–75.

Stroke Foundation. National stroke audit – rehabilitation services report 2020. Melbourne; 2020.

Australian Institute of Health and Welfare. Heart, stroke and vascular disease: Australian facts. 2023. Webpage https://www.aihw.gov.au/reports/heart-stroke-vascular-diseases/hsvd-facts/contents/about (accessed Jan 2024).


Acknowledgements

The following authors hold National Health and Medical Research Council Research Fellowships: HT (#2009326), DAC (#1154273), SM (#1196352), MFK Future Leader Research Fellowship (National Heart Foundation #105737). The Funders of this work did not have any direct role in the design of the study, its execution, analyses, interpretation of the data, or decision to submit results for publication.

Author information

Helena Teede and Dominique A. Cadilhac contributed equally.

Authors and Affiliations

Monash Centre for Health Research and Implementation, 43-51 Kanooka Grove, Clayton, VIC, Australia

Helena Teede, Emily Callander & Joanne Enticott

Monash Partners Academic Health Science Centre, 43-51 Kanooka Grove, Clayton, VIC, Australia

Helena Teede & Alison Johnson

Stroke and Ageing Research, Department of Medicine, School of Clinical Sciences at Monash Health, Monash University, Level 2 Monash University Research, Victorian Heart Hospital, 631 Blackburn Rd, Clayton, VIC, Australia

Dominique A. Cadilhac, Tara Purvis & Monique F. Kilkenny

Stroke Theme, The Florey Institute of Neuroscience and Mental Health, University of Melbourne, Heidelberg, VIC, Australia

Dominique A. Cadilhac, Monique F. Kilkenny & Bruce C.V. Campbell

Department of Neurology, Melbourne Brain Centre, Royal Melbourne Hospital, Parkville, VIC, Australia

Bruce C.V. Campbell

Department of Medicine, Faculty of Medicine, Dentistry and Health Sciences, University of Melbourne, Victoria, Australia

School of Health Sciences, Heart and Stroke Program, University of Newcastle, Hunter Medical Research Institute, University Drive, Callaghan, NSW, Australia

Coralie English

School of Medicine and Dentistry, Griffith University, Birtinya, QLD, Australia

Rohan S. Grimley

Clinical Excellence Division, Queensland Health, Brisbane, Australia

John Hunter Hospital, Hunter New England Local Health District and University of Newcastle, Sydney, NSW, Australia

Christopher Levi

School of Nursing, Midwifery and Paramedicine, Australian Catholic University, Sydney, NSW, Australia

Sandy Middleton

Nursing Research Institute, St Vincent’s Health Network Sydney and and Australian Catholic University, Sydney, NSW, Australia

Stroke Foundation, Level 7, 461 Bourke St, Melbourne, VIC, Australia

Kelvin Hill


Contributions

HT: conception, design and initial draft, developed the theoretical formalism for learning health system framework, approved the submitted version. DAC: conception, design and initial draft, provided essential literature and case study examples, approved the submitted version. TP: revised the manuscript critically for important intellectual content, approved the submitted version. MFK: revised the manuscript critically for important intellectual content, provided essential literature and case study examples, approved the submitted version. BC: revised the manuscript critically for important intellectual content, provided essential literature and case study examples, approved the submitted version. CE: revised the manuscript critically for important intellectual content, provided essential literature and case study examples, approved the submitted version. AJ: conception, design and initial draft, developed the theoretical formalism for learning health system framework, approved the submitted version. EC: revised the manuscript critically for important intellectual content, approved the submitted version. RSG: revised the manuscript critically for important intellectual content, provided essential literature and case study examples, approved the submitted version. CL: revised the manuscript critically for important intellectual content, provided essential literature and case study examples, approved the submitted version. SM: revised the manuscript critically for important intellectual content, provided essential literature and case study examples, approved the submitted version. KH: revised the manuscript critically for important intellectual content, provided essential literature and case study examples, approved the submitted version. JE: conception, design and initial draft, developed the theoretical formalism for learning health system framework, approved the submitted version. All authors read and approved the final manuscript.

Authors’ Twitter handles

@HelenaTeede

@DominiqueCad

@Coralie_English

@EmilyCallander

@EnticottJo

Corresponding authors

Correspondence to Helena Teede or Dominique A. Cadilhac .

Ethics declarations

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Teede, H., Cadilhac, D.A., Purvis, T. et al. Learning together for better health using an evidence-based Learning Health System framework: a case study in stroke. BMC Med 22 , 198 (2024). https://doi.org/10.1186/s12916-024-03416-w


Received : 23 July 2023

Accepted : 30 April 2024

Published : 15 May 2024

DOI : https://doi.org/10.1186/s12916-024-03416-w


  • Evidence-based medicine
  • Person-centred care
  • Models of care
  • Healthcare improvement


Practical implications of using real-world evidence (RWE) in comparative effectiveness research: learnings from IMI-GetReal

Affiliations.

  • 1 The National Healthcare Institute (ZIN), Diemen, the Netherlands.
  • 2 Department of Pharmacoepidemiology & Clinical Pharmacotherapy, Utrecht University, Utrecht, the Netherlands.
  • 3 The National Institute for Health & Care Excellence (NICE), London, UK.
  • 4 International Alliance of Patients' Organizations, London, UK.
  • 5 Julius Center for Health Sciences & Primary Care, University Medical Centre Utrecht, Utrecht, the Netherlands.
  • 6 Cochrane Netherlands, University Medical Center Utrecht, Utrecht, the Netherlands.
  • 7 Bristol-Myers Squibb, Paris, France.
  • 8 Eli Lilly, Hamburg, Germany.
  • 9 Melanoma Patient Network Europe, Uppsala, Sweden.
  • 10 Uppsala Centre for Evolution & Genomics, Uppsala University, Uppsala, Sweden.
  • 11 Department of Health Science, University of Leicester, Leicester, UK.
  • 12 Takeda, London, UK.
  • PMID: 28857631
  • DOI: 10.2217/cer-2017-0044

In light of increasing attention towards the use of real-world evidence (RWE) in decision making in recent years, this commentary aims to reflect on the experiences gained in accessing and using RWE for comparative effectiveness research as a part of the Innovative Medicines Initiative GetReal Consortium and discuss their implications for RWE use in decision-making.

Keywords: comparative effectiveness research; health technology assessment; non-randomized trials; real-world evidence.

  • Clinical Decision-Making*
  • Comparative Effectiveness Research*
  • Data Collection
  • Evidence-Based Medicine
  • Technology Assessment, Biomedical

Real world data are not always big data: the case for primary data collection on medication use in pregnancy in the context of birth defects research.


Elizabeth C Ailes, Martha M Werler, Meredith M Howley, Mary M Jenkins, Jennita Reefhuis, Real World Data are Not Always Big Data: The Case for Primary Data Collection on Medication Use in Pregnancy in the Context of Birth Defects Research, American Journal of Epidemiology , 2024;, kwae060, https://doi.org/10.1093/aje/kwae060


Many examples of the use of real-world data in the area of pharmacoepidemiology include “big data” such as insurance claims, medical records, or hospital discharge databases. However, “big” is not always better, particularly when studying outcomes with narrow windows of etiologic relevance. Birth defects are one such outcome, where specificity of exposure timing is critical. Studies with primary data collection can be designed to query details on the timing of medication use, as well as type, dose, frequency, duration, and indication, that can better characterize the “real world”. Because birth defects are rare, etiologic studies are typically case-control in design, like the National Birth Defects Prevention Study, Birth Defects Study to Evaluate Pregnancy exposureS, and Slone Birth Defects Study. Recall bias can be a concern, but the ability to collect detailed information on both prescription and over-the-counter medication use and on other exposures such as diet, family history, and sociodemographic factors is a distinct advantage over claims and medical record data sources. Case-control studies with primary data collection are essential to advancing the pharmacoepidemiology of birth defects.
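Because the argument turns on narrow windows of etiologic relevance, the sketch below shows one way a study with primary data collection might classify a reported medication exposure as inside or outside a critical window. The function, field names, and the 84-day window are hypothetical illustrations, not the design of any of the studies named above.

```python
from datetime import date, timedelta

def exposed_in_window(med_start: date, med_stop: date,
                      conception: date, window_days: int = 84) -> bool:
    """True if reported medication use overlaps an assumed critical window
    (here, the first 84 days after conception); purely illustrative."""
    window_end = conception + timedelta(days=window_days)
    return med_start <= window_end and med_stop >= conception

# Example: use reported from two weeks before to four weeks after conception.
print(exposed_in_window(date(2020, 1, 1), date(2020, 2, 10),
                        conception=date(2020, 1, 15)))
```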


  • Data Descriptor
  • Open access
  • Published: 03 May 2024

A dataset for measuring the impact of research data and their curation

  • Libby Hemphill   ORCID: orcid.org/0000-0002-3793-7281 1 , 2 ,
  • Andrea Thomer 3 ,
  • Sara Lafia 1 ,
  • Lizhou Fan 2 ,
  • David Bleckley   ORCID: orcid.org/0000-0001-7715-4348 1 &
  • Elizabeth Moss 1  

Scientific Data volume  11 , Article number:  442 ( 2024 ) Cite this article


  • Research data
  • Social sciences

Science funders, publishers, and data archives make decisions about how to responsibly allocate resources to maximize the reuse potential of research data. This paper introduces a dataset developed to measure the impact of archival and data curation decisions on data reuse. The dataset describes 10,605 social science research datasets, their curation histories, and reuse contexts in 94,755 publications that cover 59 years from 1963 to 2022. The dataset was constructed from study-level metadata, citing publications, and curation records available through the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan. The dataset includes information about study-level attributes (e.g., PIs, funders, subject terms); usage statistics (e.g., downloads, citations); archiving decisions (e.g., curation activities, data transformations); and bibliometric attributes (e.g., journals, authors) for citing publications. This dataset provides information on factors that contribute to long-term data reuse, which can inform the design of effective evidence-based recommendations to support high-impact research data curation decisions.


Background & summary

Recent policy changes in funding agencies and academic journals have increased data sharing among researchers and between researchers and the public. Data sharing advances science and provides the transparency necessary for evaluating, replicating, and verifying results. However, many data-sharing policies do not explain what constitutes an appropriate dataset for archiving or how to determine the value of datasets to secondary users 1 , 2 , 3 . Questions about how to allocate data-sharing resources efficiently and responsibly have gone unanswered 4 , 5 , 6 . For instance, data-sharing policies recognize that not all data should be curated and preserved, but they do not articulate metrics or guidelines for determining what data are most worthy of investment.

Despite the potential for innovation and advancement that data sharing holds, the best strategies to prioritize datasets for preparation and archiving are often unclear. Some datasets are likely to have more downstream potential than others, and data curation policies and workflows should prioritize high-value data instead of being one-size-fits-all. Though prior research in library and information science has shown that the “analytic potential” of a dataset is key to its reuse value 7 , work is needed to implement conceptual data reuse frameworks 8 , 9 , 10 , 11 , 12 , 13 , 14 . In addition, publishers and data archives need guidance to develop metrics and evaluation strategies to assess the impact of datasets.

Several existing resources have been compiled to study the relationship between the reuse of scholarly products, such as datasets (Table  1 ); however, none of these resources include explicit information on how curation processes are applied to data to increase their value, maximize their accessibility, and ensure their long-term preservation. The CCex (Curation Costs Exchange) provides models of curation services along with cost-related datasets shared by contributors but does not make explicit connections between them or include reuse information 15 . Analyses on platforms such as DataCite 16 have focused on metadata completeness and record usage, but have not included related curation-level information. Analyses of GenBank 17 and FigShare 18 , 19 citation networks do not include curation information. Related studies of Github repository reuse 20 and Softcite software citation 21 reveal significant factors that impact the reuse of secondary research products but do not focus on research data. RD-Switchboard 22 and DSKG 23 are scholarly knowledge graphs linking research data to articles, patents, and grants, but largely omit social science research data and do not include curation-level factors. To our knowledge, other studies of curation work in organizations similar to ICPSR – such as GESIS 24 , Dataverse 25 , and DANS 26 – have not made their underlying data available for analysis.

This paper describes a dataset 27 compiled for the MICA project (Measuring the Impact of Curation Actions) led by investigators at ICPSR, a large social science data archive at the University of Michigan. The dataset was originally developed to study the impacts of data curation and archiving on data reuse. The MICA dataset has supported several previous publications investigating the intensity of data curation actions 28 , the relationship between data curation actions and data reuse 29 , and the structures of research communities in a data citation network 30 . Collectively, these studies help explain the return on various types of curatorial investments. The dataset that we introduce in this paper, which we refer to as the MICA dataset, has the potential to address research questions in the areas of science (e.g., knowledge production), library and information science (e.g., scholarly communication), and data archiving (e.g., reproducible workflows).

We constructed the MICA dataset 27 using records available at ICPSR, a large social science data archive at the University of Michigan. Data set creation involved: collecting and enriching metadata for articles indexed in the ICPSR Bibliography of Data-related Literature against the Dimensions AI bibliometric database; gathering usage statistics for studies from ICPSR’s administrative database; processing data curation work logs from ICPSR’s project tracking platform, Jira; and linking data in social science studies and series to citing analysis papers (Fig.  1 ).

Figure 1. Steps to prepare the MICA dataset for analysis: external sources are red, primary internal sources are blue, and internal linked sources are green.

Enrich paper metadata

The ICPSR Bibliography of Data-related Literature is a growing database of literature in which data from ICPSR studies have been used. Its creation was funded by the National Science Foundation (Award 9977984), and for the past 20 years it has been supported by ICPSR membership and multiple US federally-funded and foundation-funded topical archives at ICPSR. The Bibliography was originally launched in the year 2000 to aid in data discovery by providing a searchable database linking publications to the study data used in them. The Bibliography collects the universe of output based on the data shared in each study, which is made available through each ICPSR study’s webpage. The Bibliography contains both peer-reviewed and grey literature, which provides evidence for measuring the impact of research data. For an item to be included in the ICPSR Bibliography, it must contain an analysis of data archived by ICPSR or contain a discussion or critique of the data collection process, study design, or methodology 31 . The Bibliography is manually curated by a team of librarians and information specialists at ICPSR who enter and validate entries. Some publications are supplied to the Bibliography by data depositors, and some citations are submitted by authors who abide by ICPSR’s terms of use, which require them to submit citations to works in which they analyzed data retrieved from ICPSR. Most of the Bibliography is populated by Bibliography team members, who create custom queries for ICPSR studies performed across numerous sources, including Google Scholar, ProQuest, SSRN, and others. Each record in the Bibliography is one publication that has used one or more ICPSR studies. The version we used was captured on 2021-11-16 and included 94,755 publications.

To expand the coverage of the ICPSR Bibliography, we searched exhaustively for all ICPSR study names, unique numbers assigned to ICPSR studies, and DOIs 32 using a full-text index available through the Dimensions AI database 33 . We accessed Dimensions through a license agreement with the University of Michigan. ICPSR Bibliography librarians and information specialists manually reviewed and validated new entries that matched one or more search criteria. We then used Dimensions to gather enriched metadata and full-text links for items in the Bibliography with DOIs. We matched 43% of the items in the Bibliography to enriched Dimensions metadata including abstracts, field of research codes, concepts, and authors’ institutional information; we also obtained links to full text for 16% of Bibliography items. Based on licensing agreements, we included Dimensions identifiers and links to full text so that users with valid publisher and database access can construct an enriched publication dataset.
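As a rough sketch of this enrichment step, the code below merges Bibliography records with metadata exported from a bibliometric database on normalized DOIs and reports the match rate. The file names and columns are assumptions for illustration; the actual workflow used the licensed Dimensions database and manual validation by Bibliography staff.

```python
import pandas as pd

# Hypothetical exports standing in for the real sources.
biblio = pd.read_csv("icpsr_bibliography.csv")   # includes a 'doi' column
dims = pd.read_csv("dimensions_metadata.csv")    # includes 'doi', 'abstract'

# Normalize DOIs so case or whitespace differences do not block joins.
for df in (biblio, dims):
    df["doi"] = df["doi"].str.strip().str.lower()

enriched = biblio.merge(dims, on="doi", how="left", suffixes=("", "_dims"))
match_rate = enriched["abstract"].notna().mean()
print(f"Share of Bibliography items with enriched metadata: {match_rate:.0%}")
```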

Gather study usage data

ICPSR maintains a relational administrative database, DBInfo, that organizes study-level metadata and information on data reuse across separate tables. Studies at ICPSR consist of one or more files collected at a single time or for a single purpose; studies in which the same variables are observed over time are grouped into series. Each study at ICPSR is assigned a DOI, and its metadata are stored in DBInfo. Study metadata follows the Data Documentation Initiative (DDI) Codebook 2.5 standard. DDI elements included in our dataset are title, ICPSR study identification number, DOI, authoring entities, description (abstract), funding agencies, subject terms assigned to the study during curation, and geographic coverage. We also created variables based on DDI elements: total variable count, the presence of survey question text in the metadata, the number of author entities, and whether an author entity was an institution. We gathered metadata for ICPSR’s 10,605 unrestricted public-use studies available as of 2021-11-16 ( https://www.icpsr.umich.edu/web/pages/membership/or/metadata/oai.html ).

To link study usage data with study-level metadata records, we joined study metadata from DBInfo on study usage information, which included total study downloads (data and documentation), individual data file downloads, and cumulative citations from the ICPSR Bibliography. We also gathered descriptive metadata for each study and its variables, which allowed us to summarize and append recoded fields onto the study-level metadata, such as curation level, number and type of principal investigators, total variable count, and binary variables indicating whether the study data were made available for online analysis, whether survey question text was made searchable online, and whether the study variables were indexed for search. These characteristics describe aspects of the discoverability of the data to compare with other characteristics of the study. We used the study and series numbers included in the ICPSR Bibliography as unique identifiers to link papers to metadata and analyze the community structure of dataset co-citations in the ICPSR Bibliography 32 .
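A minimal sketch of the join and recoding described above, using hypothetical file and column names rather than the actual DBInfo schema:

```python
import pandas as pd

# Hypothetical extracts standing in for DBInfo tables.
studies = pd.read_csv("study_metadata.csv")  # includes 'study_num', 'pi_list'
usage = pd.read_csv("study_usage.csv")       # includes 'study_num', 'downloads'

merged = studies.merge(usage, on="study_num", how="left")

# Derive recoded fields analogous to those described in the text.
merged["num_pis"] = merged["pi_list"].str.split(";").str.len()
merged["single_pi"] = merged["num_pis"] == 1
merged["institutional_pi"] = merged["pi_list"].str.contains(
    r"University|Institute|Bureau", case=False, na=False)

print(merged[["study_num", "num_pis", "single_pi", "downloads"]].head())
```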

Process curation work logs

Researchers deposit data at ICPSR for curation and long-term preservation. Between 2016 and 2020, more than 3,000 research studies were deposited with ICPSR. Since 2017, ICPSR has organized curation work into a central unit that provides three levels of curation, which differ in the intensity and complexity of the data enhancement they provide. While the levels of curation are standardized as to effort (level one = least effort, level three = most effort), the specific curatorial actions undertaken for each dataset vary. These curation actions are captured in Jira, a work-tracking program that data curators at ICPSR use to collaborate and communicate their progress through tickets. We obtained access to a corpus of 669 completed Jira tickets corresponding to the curation of 566 unique studies between February 2017 and December 2019 28 .

To process the tickets, we focused only on their work log portions, which contained free-text descriptions of work that data curators had performed on a deposited study, along with the curators' identifiers and timestamps. To protect the confidentiality of the data curators and the processing steps they performed, we collaborated with ICPSR's curation unit to propose a classification scheme, which we used to train a Naive Bayes classifier and label the curation actions in each work log sentence. The eight curation action labels we proposed 28 were: (1) initial review and planning, (2) data transformation, (3) metadata, (4) documentation, (5) quality checks, (6) communication, (7) other, and (8) non-curation work. We note that these categories of curation work are specific to the curatorial processes and types of data stored at ICPSR and may not match the curation activities at other repositories. After applying the classifier to the work log sentences, we obtained summary-level curation actions for a subset of all ICPSR studies (5%), along with the total number of hours spent on data curation for each study and the proportion of time associated with each action during curation.
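
The sketch below shows the general shape of such a sentence classifier (bag-of-words features feeding a multinomial Naive Bayes model). The training sentences are invented placeholders, and the actual features and training data behind the released dataset may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Sketch of a sentence-level curation action classifier. Only the eight label
# names come from the classification scheme above; the sentences are invented.
LABELS = ["initial review and planning", "data transformation", "metadata",
          "documentation", "quality checks", "communication", "other",
          "non-curation work"]

train_sentences = [
    "Reviewed the deposit and drafted a processing plan",
    "Recoded missing values and converted files to SPSS format",
    "Added variable labels and updated the study description",
    "Compiled the final codebook PDF",
    "Ran consistency checks on the data files",
    "Emailed the PI about undocumented variables",
]
train_labels = LABELS[:6]  # one toy example per label, for illustration only

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(train_sentences, train_labels)

print(model.predict(["Checked frequencies against the codebook"]))
```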

Data Records

The MICA dataset 27 connects records for each of ICPSR’s archived research studies to the research publications that use them and related curation activities available for a subset of studies (Fig.  2 ). Each of the three tables published in the dataset is available as a study archived at ICPSR. The data tables are distributed as statistical files available for use in SAS, SPSS, Stata, and R as well as delimited and ASCII text files. The dataset is organized around studies and papers as primary entities. The studies table lists ICPSR studies, their metadata attributes, and usage information; the papers table was constructed using the ICPSR Bibliography and Dimensions database; and the curation logs table summarizes the data curation steps performed on a subset of ICPSR studies.

Studies (“ICPSR_STUDIES”): 10,605 social science research datasets available through ICPSR up to 2021-11-16 with variables for ICPSR study number, digital object identifier, study name, series number, series title, authoring entities, full-text description, release date, funding agency, geographic coverage, subject terms, topical archive, curation level, single principal investigator (PI), institutional PI, the total number of PIs, total variables in data files, question text availability, study variable indexing, level of restriction, total unique users downloading study data files and codebooks, total unique users downloading data only, and total unique papers citing data through November 2021. Studies map to the papers and curation logs tables through ICPSR study numbers as “STUDY”. However, not every study in this table will have records in the papers and curation logs tables.

Papers (“ICPSR_PAPERS”): 94,755 publications collected from 2000-08-11 to 2021-11-16 in the ICPSR Bibliography and enriched with metadata from the Dimensions database with variables for paper number, identifier, title, authors, publication venue, item type, publication date, input date, ICPSR series numbers used in the paper, ICPSR study numbers used in the paper, the Dimensions identifier, and the Dimensions link to the publication's full text. Papers map to the studies table through ICPSR study numbers in the “STUDY_NUMS” field. Each record represents a single publication, and because a researcher can use multiple datasets when creating a publication, each record may list multiple studies or series.

Curation logs (“ICPSR_CURATION_LOGS”): 649 curation logs for 563 ICPSR studies (although most studies in the subset had one curation log, some studies were associated with multiple logs, with a maximum of 10) curated between February 2017 and December 2019 with variables for study number, action labels assigned to work description sentences using a classifier trained on ICPSR curation logs, hours of work associated with a single log entry, and total hours of work logged for the curation ticket. Curation logs map to the study and paper tables through ICPSR study numbers as “STUDY”. Each record represents a single logged action, and future users may wish to aggregate actions to the study level before joining tables.

Figure 2. Entity-relation diagram.

Technical Validation

We report on the reliability of the dataset’s metadata in the following subsections. To support future reuse of the dataset, curation services provided through ICPSR improved data quality by checking for missing values, adding variable labels, and creating a codebook.

All 10,605 studies available through ICPSR have a DOI and a full-text description summarizing what the study is about, the purpose of the study, the main topics covered, and the questions the PIs attempted to answer when they conducted the study. Personal names (i.e., principal investigators) and organizational names (i.e., funding agencies) are standardized against an authority list maintained by ICPSR; geographic names and subject terms are also standardized and hierarchically indexed in the ICPSR Thesaurus 34 . Many of ICPSR’s studies (63%) are in a series and are distributed through the ICPSR General Archive (56%), a non-topical archive that accepts any social or behavioral science data. While study data have been available through ICPSR since 1962, the earliest digital release date recorded for a study was 1984-03-18, when ICPSR’s database was first employed, and the most recent date is 2021-10-28 when the dataset was collected.

Curation level information was recorded starting in 2017 and is available for 1,125 studies (11%); approximately 80% of studies with assigned curation levels received curation services, equally distributed among Levels 1 (least intensive), 2 (moderately intensive), and 3 (most intensive) (Fig. 3). Detailed descriptions of ICPSR's curation levels are available online 35 . Additional metadata are available for a subset of 421 studies (4%), including whether the study has a single PI, an institutional PI, the total number of PIs involved, the total number of variables recorded, and whether the study is available for online analysis, has searchable question text, has variables that are indexed for search, contains one or more restricted files, or is completely restricted. We provided additional metadata for this subset of ICPSR studies because they were released within the past five years and detailed curation and usage information were available for them. Usage statistics including total downloads and data file downloads are available for this subset of studies as well; citation statistics are available for 8,030 studies (76%). Most ICPSR studies have fewer than 500 users, as indicated by total downloads, or citations (Fig. 4).

Figure 3. ICPSR study curation levels.

Figure 4. ICPSR study usage.

A subset of 43,102 publications (45%) available in the ICPSR Bibliography had a DOI. Author metadata were entered as free text, meaning that variations may exist and require additional normalization and pre-processing prior to analysis. Although author information is recorded for each publication, individual names may appear in different formats and sort orders (e.g., “Earls, Felton J.” and “Stephen W. Raudenbush”). Most of the items in the ICPSR Bibliography as of 2021-11-16 were journal articles (59%), reports (14%), conference presentations (9%), or theses (8%) (Fig. 5). The number of publications collected in the Bibliography has increased each decade since the inception of ICPSR in 1962 (Fig. 6). Most ICPSR studies (76%) have one or more citations in a publication.

Figure 5. ICPSR Bibliography citation types.

Figure 6. ICPSR citations by decade.

Usage Notes

The dataset consists of three tables that can be joined using the “STUDY” key as shown in Fig.  2 . The “ICPSR_PAPERS” table contains one row per paper with one or more cited studies in the “STUDY_NUMS” column. We manipulated and analyzed the tables as CSV files with the Pandas library 36 in Python and the Tidyverse packages 37 in R.
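
A minimal sketch of that join, assuming toy rows, a ";" separator in STUDY_NUMS, and invented column names for everything except the keys:

```python
import pandas as pd

# Minimal sketch of joining the three tables on the "STUDY" key. Toy rows stand
# in for the released files; column names other than STUDY and STUDY_NUMS are
# invented placeholders.
studies = pd.DataFrame({"STUDY": [100, 200, 300], "TITLE": ["A", "B", "C"]})
papers = pd.DataFrame({"PAPER_ID": [1, 2], "STUDY_NUMS": ["100;200", "300"]})
logs = pd.DataFrame({"STUDY": [100, 100, 300], "HOURS": [2.5, 1.0, 4.0]})

# A paper may cite several studies, so split STUDY_NUMS into one row per
# (paper, study) pair before joining; the ";" separator is assumed.
paper_study = (papers.assign(STUDY=papers["STUDY_NUMS"].str.split(";"))
                     .explode("STUDY"))
paper_study["STUDY"] = paper_study["STUDY"].astype(int)

# Aggregate curation log entries to one row per study before joining.
log_hours = logs.groupby("STUDY", as_index=False)["HOURS"].sum()

merged = (studies.merge(paper_study[["PAPER_ID", "STUDY"]], on="STUDY", how="left")
                 .merge(log_hours, on="STUDY", how="left"))
print(merged)
```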

The present MICA dataset can be used independently to study the relationship between curation decisions and data reuse. Evidence of reuse for specific studies is available in several forms: usage information, including downloads and citation counts; and citation contexts within papers that cite data. Analysis may also be performed on the citation network formed between datasets and papers that use them. Finally, curation actions can be associated with properties of studies and usage histories.
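
As one illustration of the citation-network angle, the paper-study pairs can be treated as a bipartite graph and projected onto studies so that two studies are linked when the same paper cites both. The pairs and the choice of greedy modularity community detection below are assumptions made for the example:

```python
import pandas as pd
from networkx import Graph
from networkx.algorithms import bipartite, community

# Illustrative bipartite paper-study network built from exploded (paper, study)
# pairs; the rows below are invented.
paper_study = pd.DataFrame({
    "PAPER_ID": [1, 1, 2, 3, 3, 4],
    "STUDY": [100, 200, 100, 200, 300, 300],
})

G = Graph()
G.add_nodes_from(("paper", p) for p in paper_study["PAPER_ID"].unique())
G.add_nodes_from(("study", s) for s in paper_study["STUDY"].unique())
G.add_edges_from((("paper", p), ("study", s))
                 for p, s in paper_study.itertuples(index=False))

# Project onto studies: two studies are connected when the same paper cites both.
study_nodes = [n for n in G if n[0] == "study"]
co_citation = bipartite.weighted_projected_graph(G, study_nodes)
communities = community.greedy_modularity_communities(co_citation)
print(f"{co_citation.number_of_edges()} co-citation edges, "
      f"{len(communities)} communities")
```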

This dataset has several limitations of which users should be aware. First, Jira tickets can only be used to represent the intensiveness of curation for activities undertaken since 2017, when ICPSR started using both Curation Levels and Jira. Studies published before 2017 were all curated, but documentation of the extent of that curation was not standardized and therefore could not be included in these analyses. Second, the measure of publications relies upon the authors' clarity of data citation and the ICPSR Bibliography staff's ability to discover citations with varying formality and clarity. Thus, there is always a chance that some secondary-data-citing publications have been left out of the Bibliography. Finally, there may be some cases in which a paper in the ICPSR Bibliography did not actually obtain data from ICPSR. For example, PIs have often written about or even distributed their data prior to their archival at ICPSR. Therefore, those publications would not have cited ICPSR, but they are still collected in the Bibliography as being directly related to the data that were eventually deposited at ICPSR.

In summary, the MICA dataset contains relationships between two main types of entities – papers and studies – which can be mined. The tables in the MICA dataset have supported network analysis (community structure and clique detection) 30 ; natural language processing (NER for dataset reference detection) 32 ; visualizing citation networks (to search for datasets) 38 ; and regression analysis (on curation decisions and data downloads) 29 . The data are currently being used to develop research metrics and recommendation systems for research data. Given that DOIs are provided for ICPSR studies and articles in the ICPSR Bibliography, the MICA dataset can also be used with other bibliometric databases, including DataCite, Crossref, OpenAlex, and related indexes. Subscription-based services, such as Dimensions AI, are also compatible with the MICA dataset. In some cases, these services provide abstracts or full text for papers from which data citation contexts can be extracted for semantic content analysis.

Code availability

The code 27 used to produce the MICA project dataset is available on GitHub at https://github.com/ICPSR/mica-data-descriptor and through Zenodo with the identifier https://doi.org/10.5281/zenodo.8432666 . Data manipulation and pre-processing were performed in Python. Data curation for distribution was performed in SPSS.

He, L. & Han, Z. Do usage counts of scientific data make sense? An investigation of the Dryad repository. Library Hi Tech 35 , 332–342 (2017).

Brickley, D., Burgess, M. & Noy, N. Google dataset search: Building a search engine for datasets in an open web ecosystem. In The World Wide Web Conference - WWW ‘19 , 1365–1375 (ACM Press, San Francisco, CA, USA, 2019).

Buneman, P., Dosso, D., Lissandrini, M. & Silvello, G. Data citation and the citation graph. Quantitative Science Studies 2 , 1399–1422 (2022).

Chao, T. C. Disciplinary reach: Investigating the impact of dataset reuse in the earth sciences. Proceedings of the American Society for Information Science and Technology 48 , 1–8 (2011).

Parr, C. et al . A discussion of value metrics for data repositories in earth and environmental sciences. Data Science Journal 18 , 58 (2019).

Eschenfelder, K. R., Shankar, K. & Downey, G. The financial maintenance of social science data archives: Four case studies of long–term infrastructure work. J. Assoc. Inf. Sci. Technol. 73 , 1723–1740 (2022).

Palmer, C. L., Weber, N. M. & Cragin, M. H. The analytic potential of scientific data: Understanding re-use value. Proceedings of the American Society for Information Science and Technology 48 , 1–10 (2011).

Zimmerman, A. S. New knowledge from old data: The role of standards in the sharing and reuse of ecological data. Sci. Technol. Human Values 33 , 631–652 (2008).

Cragin, M. H., Palmer, C. L., Carlson, J. R. & Witt, M. Data sharing, small science and institutional repositories. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 368 , 4023–4038 (2010).

Fear, K. M. Measuring and Anticipating the Impact of Data Reuse . Ph.D. thesis, University of Michigan (2013).

Borgman, C. L., Van de Sompel, H., Scharnhorst, A., van den Berg, H. & Treloar, A. Who uses the digital data archive? An exploratory study of DANS. Proceedings of the Association for Information Science and Technology 52 , 1–4 (2015).

Pasquetto, I. V., Borgman, C. L. & Wofford, M. F. Uses and reuses of scientific data: The data creators’ advantage. Harvard Data Science Review 1 (2019).

Gregory, K., Groth, P., Scharnhorst, A. & Wyatt, S. Lost or found? Discovering data needed for research. Harvard Data Science Review (2020).

York, J. Seeking equilibrium in data reuse: A study of knowledge satisficing . Ph.D. thesis, University of Michigan (2022).

Kilbride, W. & Norris, S. Collaborating to clarify the cost of curation. New Review of Information Networking 19 , 44–48 (2014).

Robinson-Garcia, N., Mongeon, P., Jeng, W. & Costas, R. DataCite as a novel bibliometric source: Coverage, strengths and limitations. Journal of Informetrics 11 , 841–854 (2017).

Qin, J., Hemsley, J. & Bratt, S. E. The structural shift and collaboration capacity in GenBank networks: A longitudinal study. Quantitative Science Studies 3 , 174–193 (2022).

Acuna, D. E., Yi, Z., Liang, L. & Zhuang, H. Predicting the usage of scientific datasets based on article, author, institution, and journal bibliometrics. In Smits, M. (ed.) Information for a Better World: Shaping the Global Future. iConference 2022 ., 42–52 (Springer International Publishing, Cham, 2022).

Zeng, T., Wu, L., Bratt, S. & Acuna, D. E. Assigning credit to scientific datasets using article citation networks. Journal of Informetrics 14 , 101013 (2020).

Koesten, L., Vougiouklis, P., Simperl, E. & Groth, P. Dataset reuse: Toward translating principles to practice. Patterns 1 , 100136 (2020).

Du, C., Cohoon, J., Lopez, P. & Howison, J. Softcite dataset: A dataset of software mentions in biomedical and economic research publications. J. Assoc. Inf. Sci. Technol. 72 , 870–884 (2021).

Aryani, A. et al . A research graph dataset for connecting research data repositories using RD-Switchboard. Sci Data 5 , 180099 (2018).

Färber, M. & Lamprecht, D. The data set knowledge graph: Creating a linked open data source for data sets. Quantitative Science Studies 2 , 1324–1355 (2021).

Perry, A. & Netscher, S. Measuring the time spent on data curation. Journal of Documentation 78 , 282–304 (2022).

Trisovic, A. et al . Advancing computational reproducibility in the Dataverse data repository platform. In Proceedings of the 3rd International Workshop on Practical Reproducible Evaluation of Computer Systems , P-RECS ‘20, 15–20, https://doi.org/10.1145/3391800.3398173 (Association for Computing Machinery, New York, NY, USA, 2020).

Borgman, C. L., Scharnhorst, A. & Golshan, M. S. Digital data archives as knowledge infrastructures: Mediating data sharing and reuse. Journal of the Association for Information Science and Technology 70 , 888–904, https://doi.org/10.1002/asi.24172 (2019).

Lafia, S. et al . MICA Data Descriptor. Zenodo https://doi.org/10.5281/zenodo.8432666 (2023).

Lafia, S., Thomer, A., Bleckley, D., Akmon, D. & Hemphill, L. Leveraging machine learning to detect data curation activities. In 2021 IEEE 17th International Conference on eScience (eScience) , 149–158, https://doi.org/10.1109/eScience51609.2021.00025 (2021).

Hemphill, L., Pienta, A., Lafia, S., Akmon, D. & Bleckley, D. How do properties of data, their curation, and their funding relate to reuse? J. Assoc. Inf. Sci. Technol. 73 , 1432–44, https://doi.org/10.1002/asi.24646 (2021).

Lafia, S., Fan, L., Thomer, A. & Hemphill, L. Subdivisions and crossroads: Identifying hidden community structures in a data archive’s citation network. Quantitative Science Studies 3 , 694–714, https://doi.org/10.1162/qss_a_00209 (2022).

ICPSR. ICPSR Bibliography of Data-related Literature: Collection Criteria. https://www.icpsr.umich.edu/web/pages/ICPSR/citations/collection-criteria.html (2023).

Lafia, S., Fan, L. & Hemphill, L. A natural language processing pipeline for detecting informal data references in academic literature. Proc. Assoc. Inf. Sci. Technol. 59 , 169–178, https://doi.org/10.1002/pra2.614 (2022).

Hook, D. W., Porter, S. J. & Herzog, C. Dimensions: Building context for search and evaluation. Frontiers in Research Metrics and Analytics 3 , 23, https://doi.org/10.3389/frma.2018.00023 (2018).

ICPSR. ICPSR Thesaurus. https://www.icpsr.umich.edu/web/ICPSR/thesaurus (2002).

ICPSR. ICPSR Curation Levels. https://www.icpsr.umich.edu/files/datamanagement/icpsr-curation-levels.pdf (2020).

McKinney, W. Data Structures for Statistical Computing in Python. In van der Walt, S. & Millman, J. (eds.) Proceedings of the 9th Python in Science Conference , 56–61 (2010).

Wickham, H. et al . Welcome to the Tidyverse. Journal of Open Source Software 4 , 1686 (2019).

Fan, L., Lafia, S., Li, L., Yang, F. & Hemphill, L. DataChat: Prototyping a conversational agent for dataset search and visualization. Proc. Assoc. Inf. Sci. Technol. 60 , 586–591 (2023).

Acknowledgements

We thank the ICPSR Bibliography staff, the ICPSR Data Curation Unit, and the ICPSR Data Stewardship Committee for their support of this research. This material is based upon work supported by the National Science Foundation under grant 1930645. This project was made possible in part by the Institute of Museum and Library Services LG-37-19-0134-19.

Author information

Authors and Affiliations

Inter-university Consortium for Political and Social Research, University of Michigan, Ann Arbor, MI, 48104, USA

Libby Hemphill, Sara Lafia, David Bleckley & Elizabeth Moss

School of Information, University of Michigan, Ann Arbor, MI, 48104, USA

Libby Hemphill & Lizhou Fan

School of Information, University of Arizona, Tucson, AZ, 85721, USA

Andrea Thomer

Contributions

L.H. and A.T. conceptualized the study design, D.B., E.M., and S.L. prepared the data, S.L., L.F., and L.H. analyzed the data, and D.B. validated the data. All authors reviewed and edited the manuscript.

Corresponding author

Correspondence to Libby Hemphill.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Hemphill, L., Thomer, A., Lafia, S. et al. A dataset for measuring the impact of research data and their curation. Sci Data 11 , 442 (2024). https://doi.org/10.1038/s41597-024-03303-2

Received: 16 November 2023

Accepted: 24 April 2024

Published: 03 May 2024

DOI: https://doi.org/10.1038/s41597-024-03303-2

National Academies of Sciences, Engineering, and Medicine; Health and Medicine Division; Board on Health Sciences Policy; Forum on Drug Discovery, Development, and Translation. Real-World Evidence Generation and Evaluation of Therapeutics: Proceedings of a Workshop. Washington (DC): National Academies Press (US); 2017 Feb 15.

3 Opportunities for Real-World Data

Key messages identified by individual speakers.

  • Electronic health records and databases containing other health-related data (claims, pharmacy) can support observational studies and pragmatic clinical trials, both of which can be important sources of real-world evidence. (Dore, Rothman)
  • Integrating data from different sources creates a richer, more robust dataset than any one source alone can yield. However, combining data from different sources is currently a labor-intensive process due to challenges with data standardization and interoperability. (Dore, Rothman)
  • Patients and consumers have a significant role to play in the collection of real-world data and the generation of real-world evidence, but to be effective, patient and consumer engagement approaches should treat them as partners and capture outcomes that are important to them. (Foschini, Robinson Beale, Rothman, White)
  • Big data bring a number of challenges (high volume, high velocity, high variability). Greater investment in data science could support the health industry in realizing the potential of big data for health care and clinical research purposes. (Berger, Roddam, Shah, White)

Following the identification of stakeholder needs in the first workshop session, the second session was focused on answering a framing question: What can we learn from real-world data? The growing availability of rich clinical data provides opportunities to address a broad range of real-world questions on effectiveness and value. However, as noted by Simon, concerns regarding the quality of clinical data have impeded efforts to incorporate real-world data into the traditional clinical research paradigm. Panelists in this session discussed opportunities to leverage the “data exhaust” from clinical practice (e.g., data captured in electronic health records [EHRs] and claims and pharmacy databases during the course of clinical care) and mechanisms to overcome the challenges that arise when applying those data for the secondary purpose of research. The panelists also discussed the potential of data streams originating from mobile devices and other digital health technologies that capture data outside the clinical setting. Setting the tone for the session's discussions, Marc Berger, vice president, Real-World Data and Analytics, Pfizer Inc., stressed that it is “not a question about [whether] [these] real-world data [are] good enough. It's about how . . . we move to a learning health care system and use the data . . . for an appropriate purpose that drives us to where we want to get.” He reminded the audience that “real-world evidence is good evidence and people are using it every day to make decisions.”

LEVERAGING ELECTRONIC HEALTH RECORDS

The promise of a learning health system is dependent on the ability to digitally capture, aggregate, and analyze health data for research and quality improvement purposes. Over the past 15 years, significant progress has been made toward the vision set out in the 2001 Institute of Medicine report Crossing the Quality Chasm ( IOM, 2001 ), which underscored the importance of a robust health information technology infrastructure, observed Jon White, deputy national coordinator for health information technology, Office of the National Coordinator for Health Information Technology (ONC). In 2015, 96 percent of hospitals and 78 percent of office-based physicians used certified EHR technology. Califf emphasized that it is important to take advantage of this infrastructure to move the evidence generation system to a much more efficient model and to answer questions that are critical for people to make the right decisions about their health and health care.

Several workshop participants discussed barriers that arise when using EHRs for research. Andrew Roddam, vice president and head of Real-World Evidence, GlaxoSmithKline, noted that EHRs might not contain all of the data that researchers want, so it is important to consider whether the EHR can be expanded to become the repository of all desired information or, instead, to use what is there and then collect the missing information using simple data collection tools.

Other challenges noted by individual workshop participants included the following:

  • missing data
  • need for computable phenotypes
  • lack of standardization (e.g., data schemes and data transfer protocols)
  • interoperability issues with proprietary health information systems

Addressing the limitations of EHRs, Califf asked, “How much energy do you spend on the upfront regimentation of data collection versus curating data on the back end?” Several workshop participants noted a need for balance. Good evidence can come from back-end curation, although it may not be perfect, replied Sherman. This can help demonstrate the value of those data for other purposes, which can help drive improved data quality and collection for secondary use. Vallance suggested that some effort to improve quality of data entry on the front end is needed to improve back-end curation, citing as an example a study that found 120 different definitions of myocardial infarction. Califf observed that changing reimbursement practices may incentivize entry of more accurate data by providers, who will increasingly require such data to demonstrate the quality and value of care they are delivering.

The promise of EHRs inspires excitement, but also frustration, about the technology's unfulfilled potential. White noted that providers report to ONC not that they want to return to paper-based records systems, but that EHR systems need to work better for them. ONC is actively working on many of the barriers that are frequently noted, he said, including lack of standards and interoperability issues. Certified EHR technology is now required for participation in the Medicare incentive program and the newly released quality payment program, and in October 2015, ONC released the final version of its interoperability roadmap, Connecting Health and Care for the Nation: A Shared Nationwide Interoperability Roadmap Version 1.0 ( ONC, 2015 ). The private sector is also advancing opportunities to leverage EHR data for quality improvement and research, said Berger, citing as an example Pfizer's use of natural language processing to create very rich datasets by mining the wealth of EHR data residing in free text notes.
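
As a toy illustration of mining free-text notes, even a simple rule-based pass can lift structured facts (here, smoking status) out of narrative text; production systems rely on far richer natural language processing, and the notes, patterns, and labels below are invented:

```python
import re
import pandas as pd

# Toy rule-based extraction: flag smoking status, one of many facts that often
# live only in narrative clinical text. Notes and patterns are placeholders.
notes = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "note_text": [
        "Patient denies smoking. BP controlled on lisinopril.",
        "20 pack-year history, current smoker, counseled on cessation.",
        "Former smoker, quit 2010. No alcohol use.",
    ],
})

PATTERNS = {
    "never": re.compile(r"denies smoking|never smoker", re.I),
    "current": re.compile(r"current smoker", re.I),
    "former": re.compile(r"former smoker|quit", re.I),
}

def smoking_status(text: str) -> str:
    # Return the first matching label, or "unknown" when nothing matches.
    for label, pattern in PATTERNS.items():
        if pattern.search(text):
            return label
    return "unknown"

notes["smoking_status"] = notes["note_text"].map(smoking_status)
print(notes[["patient_id", "smoking_status"]])
```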

THE POWER OF LINKING AND MINING DISPARATE DATA SOURCES

The linking of multiple datasets provides a richness of data that cannot be achieved with any single data source. Combining EHR data with claims and pharmacy data, for example, captures a more complete picture of the continuity of care for a patient and a record of that person's interaction with the health care system, said David Dore, vice president, Epidemiology, and principal epidemiologist, Optum Life Sciences. Linking in data from other sources also may help to address data-quality issues by filling in missing data and validating data through checks for consistency across data sources. However, Califf pointed out, inconsistencies are not always an indicator of bad data. Inconsistencies may be real, reflecting different perceptions of different providers or variability in lab testing results, and may only be detectable by comparing across datasets. For example, it is possible to identify a patient population prescribed a particular drug using EHR data while claims data show that a certain percentage of those patients never filled the prescription. This has significant implications for any safety or effectiveness analyses conducted on those data and is a question that can only be answered with linked EHR and claims datasets, said Berger. The datasets that need to be linked depend on the question that must be answered, added Dore. One data source may be better at capturing certain data, but may miss others. This is why understanding the inherent biases of different datasets is important, cautioned Luca Foschini, co-founder and chief data scientist, Evidation Health. For example, claims datasets often tend to be more complete because payment serves as the incentive to enter data, but claims data have their own biases—for example, more expensive things are more likely to be captured there.
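
The unfilled-prescription question gives a concrete picture of what linked data enable: prescriptions recorded in the EHR are matched against pharmacy fills in claims. A minimal sketch with invented patients, drug names, and column names:

```python
import pandas as pd

# Minimal sketch: EHR prescriptions are matched to pharmacy fills in claims to
# estimate how many were never filled. All rows and column names are invented.
ehr_rx = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "drug": ["statin"] * 4,
    "prescribed_on": pd.to_datetime(
        ["2020-01-05", "2020-01-12", "2020-02-01", "2020-02-20"]),
})
claims_fills = pd.DataFrame({
    "patient_id": [1, 3],
    "drug": ["statin", "statin"],
    "filled_on": pd.to_datetime(["2020-01-07", "2020-02-03"]),
})

# Left join keeps every prescription; missing fill dates mark unfilled scripts.
linked = ehr_rx.merge(claims_fills, on=["patient_id", "drug"], how="left")
linked["filled"] = linked["filled_on"].notna()
print(f"Never filled: {1 - linked['filled'].mean():.0%} of prescriptions")
```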

Patient-level linking of datasets remains a challenge when there are no unique patient identifiers. Although this is an area of significant interest and some work has been done in the private sector, federal efforts to implement unique patient identifiers are currently prohibited by law, 1 explained White. Other efforts to facilitate data linking and aggregation include the use of claims data to link patient records across EHR systems and the development of common data models, which map concepts from different data sources into a common format with common definitions. As discussed by two panelists in this workshop session, these methods have enabled the development of large linked datasets to support both public-sector research and private-sector analyses.
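
One common pattern for linkage without exchanging plain-text identifiers is for each party to derive the same keyed hash (token) from normalized identifying fields and to join only on the tokens. The sketch below is illustrative only; real tokenization services manage salts, normalization, and match quality far more carefully:

```python
import hashlib
import pandas as pd

# Illustrative only: both parties apply the same keyed hash to normalized
# identifying fields and share only the resulting tokens for the join.
SALT = b"shared-secret-salt"

def tokenize(name: str, dob: str) -> str:
    key = f"{name.strip().lower()}|{dob}".encode("utf-8")
    return hashlib.sha256(SALT + key).hexdigest()

ehr = pd.DataFrame({"name": ["Ada Smith"], "dob": ["1970-01-01"], "a1c": [6.8]})
claims = pd.DataFrame({"name": ["ada smith "], "dob": ["1970-01-01"], "paid": [120.0]})

ehr["token"] = [tokenize(n, d) for n, d in zip(ehr["name"], ehr["dob"])]
claims["token"] = [tokenize(n, d) for n, d in zip(claims["name"], claims["dob"])]

# Join on tokens only; neither table exposes raw identifiers to the other party.
linked = ehr[["token", "a1c"]].merge(claims[["token", "paid"]], on="token")
print(linked)
```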

PCORnet's Clinical Data Research Networks

The Patient-Centered Outcomes Research Institute (PCORI) supports health-related decision making by patients, providers, payers, and policy makers by generating and examining evidence on the effectiveness of various medical treatments. Russell Rothman, director, Center for Health Services Research, Vanderbilt University, described how the National Patient-Centered Clinical Research Network (PCORnet), funded by PCORI, is advancing real-world evidence research by leveraging existing electronic health data sources to support national comparative effectiveness studies and pragmatic clinical trials. In addition to its 20 patient-powered research networks, PCORnet consists of 13 clinical data research networks (CDRNs) representing more than 100 health care systems and organizations across the country. PCORnet currently has EHR data from more than 110 million patients, and CDRNs are also working to link EHR data to data from other sources, including claims, vital statistics, registries, state health data, Medicare and Medicaid, and private health plans, in an effort to capture a more complete picture of patients for research purposes.

Because it incorporates standardized data from different sources using a common data model, the PCORnet infrastructure can now be used to identify potentially thousands of patients across the networks with particular conditions, to conduct observational studies that follow patient cohorts over time, and for interventional clinical research, including comparative effectiveness trials. Rothman also described tools that have been developed for PCORnet to support clinical trials, including electronic processes for patient identification and recruitment, consenting, and collecting patient-reported outcomes. These tools, along with some administrative simplification, have enabled the conduct of large pragmatic trials with great efficiency, he said. Rothman cited the ADAPTABLE trial on optimal aspirin dosing for patients with coronary heart disease as an example of the potential of the PCORnet infrastructure for conducting faster, cheaper, and more informative clinical research in the real-world space. In this pragmatic trial, which is still ongoing, patients were identified, recruited, and consented electronically and randomized to baby or regular strength aspirin. Data for follow-up were captured from EHRs and claims, and from patients directly using electronic survey tools. “The front door for PCORnet is now open,” said Rothman, for investigators interested in running queries or using the network for observational or interventional research.

Development and Use of Centralized Data Repositories in the Private Sector

In the private sector, efforts to aggregate and analyze data from EHRs, claims, and other sources are driven, in part, by demands from provider networks as they try to control financial risks for managing patient populations. Dore outlined how Optum, part of UnitedHealth Group, compiles data from electronic records (including medical, claims, and pharmacy records) for provider networks into a centralized repository. Data are linked using encrypted data linkage methods so that patient-identifying information is not shared across parties. To address interoperability issues across different record systems within provider networks, Optum uses an intensive manual process to extract information; validates, maps, and normalizes it; and iterates it to get to a standardized data format. Following a series of data quality checks at the end of the process, the company has generated a centralized repository containing data for those patients within a particular provider network. That repository can then be used for a range of analytics, including predictive modeling, quality benchmarking, and risk stratification (e.g., identifying patients who have high risk of rehospitalization). This process can be scaled up so that data from many provider networks are aggregated under a single ontology, capturing more than 70 million patients in a single centralized dataset (see Figure 3-1 ). Dore said Optum is in the process of onboarding other data, including those from clinical trials, registries, and wearables. He emphasized that, beyond supporting clinical decision making for provider networks, these data repositories also have value for clinical research and have been used for observational studies evaluating comparative effectiveness.

Figure 3-1. Optum's process for aggregating data from multiple provider networks into a centralized data repository. NOTE: EMR = electronic medical record; NLP = natural language processing. SOURCE: Dore presentation, 2016.
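
The "map and normalize" step can be pictured as translating each network's local codes into one shared vocabulary before records are pooled. The code lists, mapping, and column names below are invented placeholders, not Optum's actual ontology:

```python
import pandas as pd

# Illustrative "map and normalize" step: local diagnosis codes from two provider
# networks are translated to one shared vocabulary before pooling.
CODE_MAP = {
    "network_a": {"DM2": "E11", "HTN": "I10"},
    "network_b": {"250.00": "E11", "401.9": "I10"},
}

def normalize(records: pd.DataFrame, source: str) -> pd.DataFrame:
    out = records.copy()
    out["standard_code"] = out["local_code"].map(CODE_MAP[source])
    out["source"] = source
    return out

network_a = pd.DataFrame({"patient_id": [1, 2], "local_code": ["DM2", "HTN"]})
network_b = pd.DataFrame({"patient_id": [7, 8], "local_code": ["250.00", "401.9"]})

# Pool the normalized records into a single repository-style table.
repository = pd.concat(
    [normalize(network_a, "network_a"), normalize(network_b, "network_b")],
    ignore_index=True,
)
print(repository)
```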

COLLECTING REAL-WORLD DATA OUTSIDE THE CLINICAL SETTING USING DIGITAL HEALTH TOOLS

Speaking on the opportunities to engage and collect real-world data directly from patients and consumers outside of the clinical setting, Foschini described the tremendous recent growth of digital health technology in the consumer space. Not only has there been a proliferation of devices on the market, but the measurement capability of these devices is also expanding. Collectively, he estimated, wearables and other consumer devices can now measure physiological parameters at a level that is approaching what might be seen in a hospital intensive care unit.

Because many mobile health devices are commonly worn throughout the day and sometimes even during sleep, excitement regarding their potential stems from the ability to capture data from the 99 percent of patient and consumer activity that occurs outside the health care setting. This allows researchers to track the progression of an individual over time at a much finer level of resolution than ever before. Although these devices can be used to compare pre- and postevent or intervention data at the individual level, it is also possible to develop population-level outcome measures. Foschini cited as an example the measurement of recovery of mobility following surgery. Using data from a mobile health device, it is possible to calculate a mobility index and compare postsurgery levels to baseline to determine the time to recovery of full mobility following surgery. With population-level data, an outcome of interest may be the time it takes for an individual who received the surgery to return to 90 percent of his or her presurgery mobility level; in addition, the impact of variables such as age on the outcome measure can be examined to identify individuals at higher risk of not regaining full mobility.
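
A minimal sketch of such a recovery outcome, using simulated daily step counts and assuming a 7-day smoothing window and a 90%-of-baseline threshold:

```python
import numpy as np
import pandas as pd

# Sketch of a wearable-derived recovery outcome: days until the 7-day average
# step count regains 90% of its pre-surgery baseline. All numbers are simulated.
rng = np.random.default_rng(0)
days = pd.date_range("2021-03-01", periods=90, freq="D")
surgery_date = pd.Timestamp("2021-03-31")

steps = pd.Series(
    np.where(days < surgery_date,
             rng.normal(8000, 800, len(days)),           # pre-surgery baseline
             2000 + 100 * (days - surgery_date).days),   # gradual recovery
    index=days,
)

baseline = steps[steps.index < surgery_date].mean()
post = steps[steps.index >= surgery_date].rolling(7).mean()  # smooth daily noise
recovered = post[post >= 0.9 * baseline]
days_to_recovery = (recovered.index[0] - surgery_date).days if len(recovered) else None
print(f"Days to regain 90% of baseline mobility: {days_to_recovery}")
```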

In the context of clinical trials, the broad, consumer-driven distribution of digital health devices across large and diverse populations has important implications for trial design. For example, said Foschini, these devices can enable virtual study recruitment, which has the potential to increase the efficiency of clinical trials and reach subpopulations that might not be reached through traditional recruitment practices. When considering their use for data collection, however, Foschini emphasized that investigators should remember that these devices will be used in unsupervised settings and may therefore necessitate a more user-oriented approach than is typical in traditional trial design.

Although there is a great deal of interest in the emerging potential of digital health tools, as demonstrated by an exponential increase in the number of publications featuring analyses of data collected using these devices, a number of workshop participants raised questions about the reliability of data collected using these tools, both in terms of their accuracy and their ability to engage consumers over the long term. A lot of scientific work is needed to validate results from wearables and define wearable-oriented endpoints that will support regulatory approval, cautioned John Hernandez, head of Health Economics, Value, and Access, Verily Life Sciences.

CONSIDERATIONS FOR REALIZING THE POTENTIAL OF REAL-WORLD DATA

The ability to use real-world data to answer research questions regarding effectiveness and value is contingent on access to the full spectrum of health data and capability to transform the data into evidence using analytic tools. In discussions on realizing the potential of real-world data, two key themes emerged: partnering with patients and consumers, and investing in data science capabilities.

Partnering with Patients and the Public

A number of levers can be applied to realize the potential of real-world evidence, including certified EHR technology and regulations, but the fulcrum, said White, is patients and consumers, and specifically, their data and information. Several individual workshop participants commented that the research enterprise needs to do a better job of engaging those individuals as partners. Patients can be a source of important data not routinely collected for purposes of care—socioeconomic, cultural, and educational background factors—that significantly affect treatment outcomes. They can also help to link their own longitudinal care data (e.g., data from surgery and rehabilitation services), said Frankel, who suggested that proactively engaging patients and consumers to obtain such data needs to be part of a data strategy for any research study.

Several examples of patient engagement mechanisms were provided by workshop participants. Rothman described efforts at his institution to make it easy for patients to share their data and participate in research by offering research portals within patient portals. These portals can be used to upload information that could be used for research purposes or to enable patients to sign up to participate in research studies. Robinson Beale highlighted the success of PatientsLikeMe, a patient-powered effort to make data available for the purposes of finding similar patients and comparing outcomes of different treatments. More broadly, though, said Nigam Shah, associate professor of medicine, Stanford University, a culture of data sharing needs to be promoted to advance the public's understanding that to benefit from a learning health system, patients need to contribute their data.

Investing in Data Science

Data are increasingly becoming an asset for health care providers, with incentives for leveraging “big data” coming from CMS and pay-for-performance opportunities. These drivers are also generating opportunities to apply big data to clinical research, but expertise is a key component to support the necessary aggregation, curation, and analysis of data. Several individual workshop participants discussed the creation of a culture of data science within organizations and the importance of investing in data science experts to transform health care data into meaningful information. The health care industry lags behind others already adept at working with big data, such as dominant American corporations like Amazon and Walmart, said Califf, who added that efforts are needed to recruit that talent into the health care industry.

See Sec. 510, Consolidated Appropriations Act, 2016. Public Law 113, 114th Cong. (December 18, 2015): “None of the funds made available in this Act may be used to promulgate or adopt any final standard under section 1173(b) of the Social Security Act providing for, or providing for the assignment of, a unique health identifier for an individual (except in an individual's capacity as an employer or a health care provider), until legislation is enacted specifically approving the standard.”
