Development and validation of a questionnaire to measure research impact


Maite Solans-Domènech, Joan MV Pons, Paula Adam, Josep Grau, Marta Aymerich, Development and validation of a questionnaire to measure research impact, Research Evaluation, Volume 28, Issue 3, July 2019, Pages 253–262, https://doi.org/10.1093/reseval/rvz007


Although questionnaires are widely used in research impact assessment, their metric properties are not well known. Our aim is to test the internal consistency and content validity of an instrument designed to measure the perceived impacts of a wide range of research projects. To do so, we designed a questionnaire to be completed by principal investigators in a variety of disciplines (arts and humanities, social sciences, health sciences, and information and communication technologies). The impacts perceived and their associated characteristics were also assessed. This easy-to-use questionnaire demonstrated good internal consistency and acceptable content validity. However, its metric properties were more powerful in areas such as knowledge production, capacity building and informing policy and practice, in which the researchers had a degree of control and influence. In general, the research projects represented a stimulus for the production of knowledge and the development of research skills. Behavioural aspects such as engagement with potential users or mission-oriented projects (targeted at practical applications) were associated with higher social benefits. Considering the difficulties in assessing a wide array of research topics, and potential differences in the understanding of the concept of ‘research impact’, an analysis of the context can help to focus on research needs. Analyzing the metric properties of questionnaires can open up new possibilities for validating instruments used to measure research impact. Beyond the methodological utility of the current exercise, we see practical applicability in specific contexts where multidisciplinary research impact assessment is required.

Over the past three decades, increasing attention has been paid to the social role and impact of research carried out at universities. National research evaluation systems, such as the UK’s Research Excellence Framework (REF) (Higher Education Funding Council of England et al. 2015) and the Excellence in Research for Australia (Australian Research Council 2016), are examples of assessment tools that address these concerns. These systems identify and define how research funding is allocated based on a number of dimensions of the research process, including the impact of research (Berlemann and Haucap 2015).

Being explicit about the objective of the impact assessment is emphasized in the International School on Research Impact Assessment (ISRIA) statement (Adam et al. 2018), a ten-point guideline for an effective research impact assessment that includes four purposes: advocacy, analysis, allocation, and accountability. The last of these emphasizes transparency, efficiency, value to the public, and a return on investment. With mounting concern about the relevance of research outcomes, funding organizations are increasingly expecting researchers to demonstrate that investments result in tangible improvements for society (Hanney et al. 2004). This accountability is intended to ensure resources have been appropriately utilized and is strongly linked to the drive for value-for-money within health services and research (Panel on the return on investments in health research 2009). As policy-makers and society expect science to meet societal needs, scientists have to prioritize social impact, or risk losing public support (Poppy 2015).

To meet these expectations, the Universitat Oberta de Catalunya (UOC) has embraced a number of pioneering initiatives in its current Strategic Plan, which includes the promotion of Open Knowledge, a specific measure related to the social impact of research (Universitat Oberta de Catalunya 2017), and the development of an institution-wide action plan to incorporate it in research evaluation. The UOC is currently investigating how to implement the principles of the DORA Declaration in institutional evaluation processes, taking into account ‘a broad range of impact measures including qualitative indicators of research impact, such as influence on policy and practice’ (‘San Francisco Declaration on Research Assessment (DORA)’ n.d.). The UOC is also taking the lead in meeting the Sustainable Development Goals (SDGs) of the UN 2030 Agenda (Jørgensen and Claeys-Kulik 2018), having been selected by the International Association of Universities as one of the 16 university cluster leaders around the world to lead the SDGs (‘IAU HESD Cluster | HESD - Higher Education for Sustainable Development portal’ n.d.).

The term ‘research impact’ has many definitions. On a basic level, ‘academic impact’ is understood as benefits for further research, while ‘wider and societal impact’ includes outcomes that reach beyond academia. In our study we include both categories and refer to ‘research impact’ as any type of output or outcome of research activities that can be considered a ‘positive return or payback’ for a wide range of beneficiaries, including people, organizations, communities, regions, or other entities. The pathways linking science, practice, and outcomes are multifaceted and complex (Molas-Gallart et al. 2016). Indeed, the path from new knowledge to its practical application is neither linear nor simple; the stages may vary considerably in terms of duration, and many impacts of research may not be easily measurable or attributable to a concrete result of research (Figure 1). The outputs and outcomes generated by research characteristics (inputs and processes) are context-dependent (Pawson 2013). Therefore, a focus on process is fundamental to understanding the generation of impact.

Figure 1. Effects of research impact.

Surveys are among the most widely used tools in research impact evaluation. Quantitative approaches such as surveys are suggested for accountability purposes, as the most appropriate way of responding to calls for transparency (Guthrie et al. 2013). They provide a broad overview of the status of a body of research and supply comparable, easy-to-analyze data referring to a range of researchers and/or grants. Standardization of the approach enhances this comparability and minimizes researcher bias and subjectivity, particularly in the case of web or postal surveys. Careful wording and question construction increases the reliability of the resulting data (Guthrie et al. 2013). However, while ex-ante assessment instruments for research proposals have undergone significant study (Fogelholm et al. 2012; Van den Broucke et al. 2012), the metric properties of research evaluation instruments have received little attention (Aymerich et al. 2012). ‘Internal consistency’ is generally considered evidence of internal structure (Clark and Watson 1995), while the measurement of ‘content validity’ attempts to demonstrate that the elements of an assessment instrument are relevant to and representative of the targeted construct for a particular assessment purpose (Nunnally and Bernstein 1994).

As the demand for monitoring research impact increases across the world, so does the need for research impact measures that demonstrate validity. Therefore, the aim of this study is to develop and test the internal consistency and the content validity of an instrument designed for accountability purposes to measure the perceived impacts of a wide range of competitively funded research projects, according to the perspectives of the principal investigators (PIs). The study will also focus on the perceived impacts and their characteristics.

A cross-sectional survey was used to assess the research undertaken at UOC. This research originates from four knowledge areas: arts and humanities, social sciences, health sciences, and information and communication technologies (ICT). Research topics include ‘identity, culture, art and society’; ‘technology and social action’; ‘globalization, legal pluralism and human rights’; ‘taxation, labour relations and social benefits’; ‘internet, digital technologies and media’; ‘management, systems and services in information and communications’; and ‘eHealth’. UOC’s Ethics Committee approved this study.

Study population

The study population included all PIs with at least one competitively funded project (either public or private) at local, regional, national, or international level completed by 2017 (n = 159).

The questionnaire

An on-line questionnaire was designed for completion by project PIs in order to retrospectively determine the impacts directly attributed to the projects. The questions were prepared based on the team’s prior experience and questionnaires published in the scientific literature (Wooding et al. 2010; Hanney et al. 2013). The questionnaire was structured around the multidimensional categorization of impacts in the Payback Framework (Hanney et al. 2017).

The Payback Framework has been extensively tested and used to analyze the impact of research in various disciplines. It has three elements: first, a logic model which identifies the multiple elements that form part of the research process and contribute to achieving impact; second, two ‘interfaces’, one referring to the project specification and selection, the other referring to the dissemination of research results; and third, a consideration of five impact categories: knowledge production (represented by scientific publications or dissemination to non-scientific audiences); research capacity building (research training, new collaborations, the securing of additional funding or improvement of infrastructures); informing policy and product development (research used to inform policymaking in a wide range of circumstances); social benefits (application of the research within the discipline and topic sector); and broader economic benefits (commercial exploitation or employment) ( Hanney et al. 2013 ).

Our instrument included four sections. The first section recorded information on the PIs, including their sex, age, and the number of years they had been involved in research. The second focused on the nature of the project itself (or a body of work based on continuation/research progression projects). PIs involved in more than one project (or a set of projects within the same body of work) were instructed to select one, in order to reduce the time needed to complete the survey and thereby increase the response rate. This section included the discipline, the main topic of research, the original research drivers, interaction with potential users of the research during the research process, and funding bodies. The third section addressed the PIs’ perceptions of the impact of the research project, and was structured around the five impact categories of the aforementioned Payback Framework. The last section included general questions, one of which sought to capture other relevant impacts that might not fall within one of the previous five categories. The final question requested an evaluation (as a percentage) of the contribution/attribution of the research to the five impact categories. Respondents were required to rate the level of contribution/attribution of the impacts according to three answer categories: limited (contribution from 1 to 30%), moderate (contribution from 40 to 60%), and significant (contribution from 70 to 100%).
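As a minimal illustration of this final question, the following sketch (in Python, which the study did not use; the function name and example value are hypothetical) maps a reported contribution percentage onto the three answer categories offered by the questionnaire.

```python
# Hypothetical helper mirroring the three answer bands described above;
# percentages falling between the bands (e.g. 35%) were not offered as options.
def contribution_category(percent: int) -> str:
    if 1 <= percent <= 30:
        return "limited"
    if 40 <= percent <= 60:
        return "moderate"
    if 70 <= percent <= 100:
        return "significant"
    raise ValueError("value outside the bands offered in the questionnaire")

print(contribution_category(80))  # -> "significant"
```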

Questionnaire items included questions with dichotomous answers (yes/no) and additional open box questions for brief descriptions of the impacts perceived.

Prior to testing, we reviewed the abstracts of 72 REF2014 impact case studies (two per knowledge area). REF2014 (Higher Education Funding Council of England et al. 2015) is the first country-wide exercise to assess the impact of university research beyond academia and has a publicly available database of over 6,000 impact case studies, grouped into 34 subject-based units of assessment. Case studies were randomly selected and the impacts found in each were mapped onto the most appropriate items and dimensions of the questionnaire. This review helped to reformulate and add questions, especially in the sections on informing policy and practice and social benefits.

Data collection

The questionnaire was sent to experts in various disciplines with a request for feedback on the relevance of each item to the questionnaire’s aim (impact assessment), which they rated on a 4-point scale (0 = ‘not relevant’, 1 = ‘slightly relevant’, 2 = ‘quite relevant’, 3 = ‘very relevant’) according to the definition of research impact included in our study (defined above). The experts were also asked to evaluate whether the items covered the important aspects or whether certain components were missing. They could also add comments on any item.

The PIs were contacted by email. They were informed of the objectives of the study and assured that the data would be treated confidentially. They received two reminders, also by email.

A quality control exercise was performed prior to data analysis. The data were processed and the correct classification of the various impacts was checked by comparing the yes/no responses with the information provided in the additional open box questions. No alterations were required after these comparisons. Questionnaire results provided a measure of the number of research projects contributing to a particular type of impact; therefore, to estimate each level of impact we calculated the frequency of its occurrence in relation to the number of projects. A Chi-squared test was used to test for group differences.
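The two analytic steps described here (frequency of occurrence relative to the number of projects, and a Chi-squared test of group differences) can be sketched as follows. This is an illustrative Python snippet rather than the SPSS procedure actually used; the column names and toy data are assumptions.

```python
# Sketch: impact frequency and a Chi-squared test of independence.
import pandas as pd
from scipy.stats import chi2_contingency

# Toy yes/no answers per project (not the study data).
responses = pd.DataFrame({
    "impact_knowledge": ["yes", "yes", "no", "yes", "no", "yes"],
    "knowledge_area":   ["health", "ICT", "arts", "social", "arts", "health"],
})

# Frequency of a given impact in relation to the number of projects.
freq = (responses["impact_knowledge"] == "yes").mean()
print(f"Knowledge production impact reported in {freq:.1%} of projects")

# Chi-squared test for differences between groups (here, knowledge areas).
table = pd.crosstab(responses["knowledge_area"], responses["impact_knowledge"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
```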

Internal consistency was assessed by focusing on the inter-item correlations within the questionnaire, indicating how well the items fitted together theoretically. This was performed using Cronbach’s alpha ( α ). An alpha between 0.70 and 0.95 was considered acceptable ( Peterson 1994 ).
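For reference, Cronbach's alpha can be computed directly from an item-response matrix, as in the minimal Python sketch below; the 0/1 toy matrix is invented and does not represent the study's responses.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents in rows, questionnaire items in columns."""
    k = items.shape[1]
    sum_item_variances = items.var(axis=0, ddof=1).sum()
    total_score_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - sum_item_variances / total_score_variance)

# Toy dichotomous (yes=1/no=0) answers from four respondents to four items.
items = np.array([[1, 1, 0, 1],
                  [1, 0, 0, 1],
                  [0, 0, 0, 0],
                  [1, 1, 1, 1]])
print(f"alpha = {cronbach_alpha(items):.2f}")  # 0.70-0.95 considered acceptable
```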

An expert opinion index was used to estimate content validity at the item level. This index was calculated by dividing the number of experts providing a score of 2 or 3 by the total number of answers. Due to the diverse array of disciplines and topics under examination, values were calculated for all experts and for the experts of each discipline. These were considered acceptable if the level of endorsement was >0.5.
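The expert opinion index itself is a simple proportion, as the hedged Python sketch below shows; the panel of ratings is hypothetical.

```python
# Expert opinion index: number of experts scoring an item 2 ("quite relevant")
# or 3 ("very relevant") divided by the total number of answers for that item.
def expert_opinion_index(ratings: list[int]) -> float:
    return sum(r >= 2 for r in ratings) / len(ratings)

item_ratings = [3, 2, 1, 3, 2, 0, 3]          # hypothetical panel of 7 experts
index = expert_opinion_index(item_ratings)
print(f"index = {index:.2f}, acceptable = {index > 0.5}")  # acceptable if > 0.5
```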

All data were entered into the statistical programme SPSS 18, and the level of significance was set at 0.05 for all tests.

Sixty-eight PIs answered the questionnaire, a response rate of 42.8%. Respondents took an average of 26 minutes to complete the questionnaire. Table 1 shows the sample characteristics. Significant differences were found between the respondents and non-respondents for knowledge area (p = 0.014) and age group (p = 0.047). Arts and humanities investigators and PIs older than 50 years were more frequent among non-respondents. The proportion of women did not differ significantly between respondents and non-respondents (p = 0.083).

Sample characteristics

Answers could include more than one response. PI: principal investigator.

Impact and its characteristics

An impact on knowledge production was observed in 97.1% of the projects, and an impact on capacity building in 95.6%. Lower figures were recorded for informing policy and practice (64.7%), and lower still for economic benefits (33.8%) and social benefits (32.4%), although results were based on a formal evaluation in only 11.8% of the cases included in social benefits. The contribution of projects to the different impact levels was estimated as significant (between 70% and 100%) for knowledge production, moderate (between 40% and 60%) for capacity building, and limited (1–30%) for informing policy and practice, social benefits and economic benefits. No additional impacts were reported.

Figure 2 shows the different impact categories and the distribution of impact subcategories. The size of the bars indicates the percentage of projects in which this specific impact occurred, according to the PIs.

Figure 2. Achieved impact bars, according to level (n = 68).

Statistically significant differences were found according to the original impetus for the project: for projects intended to fill certain gaps in knowledge, the greatest impact was observed in knowledge production (p = 0.01) and capacity building (p = 0.03), while for projects targeting a practical application, the greatest impact was observed in informing policy and practice (p = 0.05) and in social benefits (p = 0.01). In general, projects that interacted with end users had more impact at the levels of knowledge production (p = 0.01), capacity building (p = 0.03), and social benefits (p = 0.05). Projects that had begun more than four years before the survey was completed were correlated with knowledge production (p = 0.04), and PIs over 40 years of age and those with over 3 years’ research experience were correlated with more frequent impacts on knowledge production and capacity building (p ≤ 0.01). No differences were found regarding the gender of the PIs. The size of the differences can be found in the Supplementary Table S1.

Internal consistency and content validity

The Cronbach’s alpha score, which measures the internal consistency of the questions, was satisfactory ( α  = 0.89). Table 2 shows its value in each domain (impact level). Internal consistency was satisfactory in all domains with the exception of economic benefits . However, the removal of any of the questions would have resulted in an equal or lower Cronbach's alpha.

Internal consistency for each domain (impact level)

Thirteen of the 17 experts contacted completed the content validity form and assessed whether the content of the questionnaire was appropriate and relevant to the purpose of the study. Seven were from the social sciences and humanities, four from the health sciences and two from ICT; 39% were women. All had longstanding experience as either researchers or research managers. The experts scored the 45 items according to their relevance, and 76% of the items (n = 34) had an index of 0.5 or greater. The results for each item are shown in Table 3. In accordance with the expert review, an item relating to ‘new academic networks’ was added.

Content validity of items according to experts (n = 13)

Items rated greater than or equal to 0.5; ICT: information and communication technologies.

Ninety-one percent of the items in knowledge production were rated acceptable (expert opinion index ≥ 0.5), as were 89% of the items in capacity building, 83% of the items in informing policy and practice, and 63% of the items in social benefits. In contrast, only 43% of the items (three out of seven) in the economic benefits domain achieved an acceptable rating. Some items were of higher relevance in specific fields: for example, items relating to health and social determinants were considered acceptable by health experts; training for final undergraduate projects was considered acceptable by ICT experts; influencing education systems and curricular assessments was considered acceptable by social sciences and humanities, and ICT experts; and commercialization items were considered acceptable by health and ICT experts (Table 3).

In this study, we tested the metric properties of a questionnaire designed to record the impact of university research originating from various disciplines. Tests of this kind, although rare in research impact assessment, are common in other study areas such as patient-reported outcome measures, education and psychology. The questionnaire displayed good internal consistency and acceptable content validity in our context. Internal consistency for all items on the instrument was excellent, demonstrating that they all measured the same construct. However, since ‘impact’ is a multidimensional concept and, by definition, Cronbach’s alpha ‘indicates the correlation among items that measure one single construct’ (Osburn 2000), the internal consistency of each of the five domains required evaluation; this was found to be excellent in all cases except economic benefits. Low internal consistency in this domain may be related to the fact that it contained relatively few items, and/or the fact that most of the researchers who answered the questionnaire worked in the social sciences and humanities, so impacts relating to transfer, commercialization and innovation were less likely to occur. An alternative possibility is that the items are, in fact, measuring more than one construct.

There is a consensus in the literature that content validity is largely a matter of judgment (Mastaglia et al. 2003), as content validity is not a property of the instrument but of the instrument’s interpretation. We therefore incorporated two distinct phases in our study. In the first phase of development, conceptualization was enhanced through the analysis and mapping of the impacts of the randomly selected REF case studies; in the second, the relevance of the scale’s content was evaluated through expert assessment. The expert assessment revealed that some items did not achieve acceptable content validity, especially in the domains of social benefits and economic benefits. However, it should be taken into account that while many of the items in the questionnaire were generic and thus relevant for all fields, a number were primarily specific to one field, and therefore more relevant for experts in that particular field. Content validity was stronger in the domains ‘closest’ to the investigators. This may be because the most frequently recognized impacts are both in areas where researchers have a degree of control and influence (Kalucy et al. 2009) and in those which have ‘traditionally’ been used to measure research. In other words, researchers’ understanding of the concept of impact displays greater homogeneity in the knowledge production, capacity building and informing policy and practice domains, that is, those at the intermediate level (secondary outputs) (Kalucy et al. 2009).

Use of an online questionnaire in this research impact study provided data on a wide range of benefits deriving from UOC’s funded projects at a particular moment, and its results convey a message of accountability. Questionnaires can provide insights into respondents’ viewpoints and can systematically enhance accountability. Although assuming that PIs will provide truthful responses about the impact of their research is clearly a potential limitation, Hanney et al. (2013) demonstrate that researchers do not routinely exaggerate the impacts of their research, at least in studies like this one, where there is no clear link between the replies given and future funding. International guidelines on research impact assessment studies recommend the use of a combination of methods to achieve comprehensive, robust results (Adam et al. 2018). However, the primary focus of this study was the quality and value of the survey instrument itself; therefore, the issue of triangulating the findings with other methods was not explored. The questionnaire could be applied in future studies to select projects that require a more in-depth and closer analysis, such as an examination of how scientific processes work in this context. Previous attempts have been made to assess the impact of university research in our context, but these have been restricted to the level of outputs (i.e. publications and patents) (Associació Catalana d’Universitats Públiques (ACUP) 2017) or the level of inputs (i.e. contributions to Catalan GDP) (Suriñach et al. 2017).

Evaluated as a whole, the research projects covered in this study were effective in the production of knowledge and the development of research skills in individuals and teams. This funded research has helped to generate new knowledge for other researchers and, to a lesser extent, for non-academic audiences. It has consolidated the position of UOC researchers (both experienced and novice) within national and international scientific communities, enabling them to develop and enhance their ability to conduct quality research (Trostle 1992).

Assessing the possible wider benefits of the research process (in terms of informing policy and practice, social benefits and economic benefits for society) proved more problematic. The relatively short period that had elapsed since the projects finished might have limited the assessment of impact. There was a striking disparity in our results between the return on research measured in terms of scientific impact (knowledge production and capacity building), which was notably high and uniform, and the limited and uneven contribution to wider benefits. This disparity is not a local phenomenon, but a recurrent finding in contemporary biomedical research worldwide. The Retrosight study (Wooding et al. 2014), which analyzed cardiovascular and stroke research in the United Kingdom, found no correlation between knowledge production and the broader social impact of research. Behavioural aspects such as researcher engagement with potential users of the research or mission-oriented projects (targeted at practical applications) were associated with higher social benefits. This might be interpreted as strategic thinking on the part of researchers, in the sense that they consider the potential ‘mechanisms’ that might enhance the impact of their work. These results do not appear to be exceptional, since the final impact of research is influenced by the extent to which the knowledge obtained is made available to those in a position to use it.

Although the response rate was lower than expected, 43% is within the normal range for on-line surveys (Shih and Xitao 2008). In addition, arts and humanities researchers were underrepresented among the PIs, but not among the experts consulted on content validity. One possible reason for this is that investigators are not fully aware of the influence of their research; another is the belief that research impact assessment studies are unable to provide valuable data about how arts and humanities research generates value (Molas-Gallart 2015). The arts and humanities are disciplines where, in some cases, the final objective of the research is not a practical application but rather a change in behaviours or people’s perspectives, which are therefore more difficult to measure. According to Ochsner et al. (2012) there is a missing link between indicators and humanities scholars’ notions of quality. However, questionnaires have been used successfully to measure the impact of arts and humanities research, including in an approach adapted from the Payback Framework (Levitt et al. 2010), and research impact analyses such as REF2014 (Higher Education Funding Council of England et al. 2015) and the special issue of Arts and Humanities in Higher Education on the public value of arts and humanities research (Benneworth 2015) have demonstrated that research in these disciplines may have many implications for society. Research results provide guidance and expertise and can be readily transferred to public debates, policies and institutional learning.

Weiss describes the rationale and conceptualization of assessment activities relating to the social impact of research as an open challenge (Weiss 2007). As well as the well-known problems of attributing impact to a single research project and the time-lag between the start of a research project and the attainment of a specific impact, in this study we also faced the challenge of assessing the impact of research spanning a wide variety of topics and disciplines. Research impact studies are prevalent in disciplines such as health sciences (Hanney et al. 2017) and agricultural research (Weißhuhn et al. 2018) but less common in the social sciences and humanities, despite the REF2014 results revealing a wide array of impacts associated with various disciplines (Higher Education Funding Council of England et al. 2015). Our challenge was to analyze projects from highly diverse disciplines—social sciences, humanities, health sciences, and ICTs—and assess their varied impacts on society. We have attempted to develop a flexible and adaptable approach to assessing research impacts by utilizing a diverse combination of indicators, including impact subcategories. However, due to ‘cultural’ differences between disciplines, we cannot guarantee that PIs from different knowledge areas have a homogeneous understanding of ‘research impact’: indeed, the diversity of responses when assessing the relevance of questionnaire items suggests otherwise. For this reason, an analysis of the context in which research is carried out and assessed, as described in the literature (Adam et al. 2018), may help to decide which questionnaire items or domains should be included or removed in future studies.

To conclude, this study demonstrates that the easy-to-use questionnaire developed here is capable of measuring a wide range of research impact benefits and shows good internal consistency. Analyzing the metric properties of instruments used to measure research impact and establishing their validity will contribute significantly to research impact assessment and stimulate and extend reflection on the definition of research impact. Therefore, this questionnaire can be a powerful instrument for measuring research impact when considered in context. The power of this instrument will be significantly improved when combined with other methodologies.

Surveys are widely used in research impact evaluation. They provide a broad overview of the state of a body of research, and supply comparable, easily analyzable data referring to a range of researchers and/or grants. The standardization of the approach enhances this comparability.

To our knowledge, the metric properties of impact assessment questionnaires have not been studied to date. The analysis of these properties can determine the internal consistency and content validity of these instruments and the extent to which they measure what they are intended to measure.

We thank the UOC principal investigators for providing us with their responses.

This project did not receive any specific grants from funding agencies in the public, commercial, or not-for-profit sectors.

Transparency

The lead authors (the manuscript’s guarantors) affirm that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned have been explained.

Adam P. et al.  ( 2018 ) ‘ ISRIA Statement: Ten-Point Guidelines for an Effective Process of Research Impact Assessment ’, Health Research Policy and Systems , 16 / 1 , DOI: 10.1186/s12961-018-0281-5.

Associació Catalana d’Universitats Públiques (ACUP) ( 2017 ) Research and Innovation Indicators of Catalan Public Universities . Report 2016.

Australian Research Council ( 2016 ) State of Australian University Research 2015-2016: Volume 1 ERA National Report . Canberra : Commonwealth of Australia .

Aymerich M. et al.  ( 2012 ) ‘ Measuring the Payback of Research Activities: A Feasible Ex-Post Evaluation Methodology in Epidemiology and Public Health ’, Social Science and Medicine , 75 / 3 : 505 – 10 .

Benneworth P. ( 2015 ) ‘ Putting Impact into Context: The Janus Face of the Public Value of Arts and Humanities Research ’, Arts and Humanities in Higher Education , 14 / 1 : 3 – 8 .

Berlemann M. , Haucap J. ( 2015 ) ‘ Which Factors Drive the Decision to Opt out of Individual Research Rankings? An Empirical Study of Academic Resistance to Change ’, Research Policy , 44 / 5 : 1108 – 15 .

Clark L. A. , Watson D. ( 1995 ) ‘ Constructing Validity: Basic Issues in Objective Scale Development ’, Psychological Assessment , 7 / 3 : 309 – 19 .

Fogelholm M. et al.  ( 2012 ) ‘ Panel Discussion Does Not Improve Reliability of Peer Review for Medical Research Grant Proposals ’, Journal of Clinical Epidemiology , 65 / 1 : 47 – 52 .

Guthrie S. et al.  ( 2013 ) Measuring Research: A Guide to Research Evaluation Frameworks and Tools . RAND Corporation .

Hanney S. et al.  ( 2004 ) ‘ Proposed Methods for Reviewing the Outcomes of Health Research: The Impact of Funding by the UK’s “Arthritis Research Campaign” ’, Health Research Policy and Systems , 2 / 1 : 4.

Hanney S. et al.  ( 2013 ) ‘ Conducting Retrospective Impact Analysis to Inform a Medical Research Charity’s Funding Strategies: The Case of Asthma UK ’, Allergy, Asthma, and Clinical Immunology: Official Journal of the Canadian Society of Allergy and Clinical Immunology , 9 / 1 : 17 .

Hanney S. et al.  ( 2017 ) ‘ The Impact on Healthcare, Policy and Practice from 36 Multi-Project Research Programmes: Findings from Two Reviews ’, Health Res Policy Syst , 15 / 1 : 26 .

Higher Education Funding Council of England, et al.  ( 2015 ) The Nature, Scale and Beneficiaries of Research Impact: An Initial Analysis of Research Excellence Framework (REF) 2014 Impact Case Studies . London : HEFCE.

‘IAU HESD Cluster | HESD - Higher Education for Sustainable Development portal’ (n.d.) < http://iau-hesd.net/en/contenu/4648-iau-hesd-cluster.html > accessed 17 Dec 2018.

Jørgensen T. E. , Claeys-Kulik A.-L. ( 2018 ) Universities’ Strategies and Approaches Towards Diversity, Equity and Inclusion. Examples from Across Europe . Brussels : European University Association.

Kalucy E. C. et al.  ( 2009 ) ‘ The Feasibility of Determining the Impact of Primary Health Care Research Projects Using the Payback Framework ’, Health Research Policy and Systems , 7 : 11 .

Levitt R. , Celia C. , Diepeveen S. ( 2010 ) Assessing the Impact of Arts and Humanities Research at the University of Cambridge . Technical Report. RAND Corporation , 104 .

Mastaglia B. , Toye C. , Kristjanson L. J. ( 2003 ) ‘ Ensuring Content Validity in Instrument Development: Challenges and Innovative Approaches ’, Contemporary Nurse , 14 / 3 : 281 – 91 .

Molas-Gallart J. ( 2015 ) ‘ Research Evaluation and the Assessment of Public Value ’, Arts and Humanities in Higher Education , 14 / 1 : 111 – 26 .

Molas-Gallart J. et al.  ( 2016 ) ‘ Towards an Alternative Framework for the Evaluation of Translational Research Initiatives ’, Research Evaluation , 25 / 3 : 235 – 43 .

Nunnally J. C. , Bernstein I. H. ( 1994 ) Psychometric Theory . New York: McGraw-Hill .

Ochsner M. , Hug S. E. , Daniel H.-D. ( 2012 ) ‘ Indicators for Research Quality in the Humanities: Opportunities and Limitations ’, Bibliometrie - Praxis und Forschung , 1 /4: 1-17. DOI: 10.5283/bpf.157.

Osburn H. G. ( 2000 ) ‘ Coefficient Alpha and Related Internal Consistency Reliability Coefficients ’, Psychological Methods , 5 / 3 : 343 – 55 .

Panel on the return on investments in health research ( 2009 ) Making an Impact: A Preferred Framework and Indicators to Measure Returns on Investment in Health Research . Ottawa, ON (Canada): Canadian Academy of Health Sciences (CAHS).

Pawson R. ( 2013 ) The Science of Evaluation: A Realist Manifesto . London: Sage Publications Ltd. http://dx.doi.org/10.4135/9781473913820

Peterson R. A. ( 1994 ) ‘ A Meta-Analysis of Cronbach’s Coefficient Alpha ’, Journal of Consumer Research , 21 / 2 : 381 .

Poppy G. ( 2015 ) ‘ Science Must Prepare for Impact ’, Nature , 526 / 7571 : 7.

‘San Francisco Declaration on Research Assessment (DORA)’. (n.d.) < https://sfdora.org/> accessed 17 Dec 2018.

Shih T. H. , Xitao F. ( 2008 ) ‘ Comparing Response Rates from Web and Mail Surveys: A Meta-Analysis ’, Field Methods , 20 / 3 : 249 – 71 .

Suriñach J. et al.  ( 2017 ) Socio-Economic Impacts of Catalan Public Universities and Research , Development and Innovation in Catalonia . Barcelona: Catalan Association of Public Universities (ACUP).

Trostle J. ( 1992 ) ‘ Research Capacity Building in International Health: Definitions, Evaluations and Strategies for Success ’, Social Science & Medicine , 35 / 11 : 1321 – 4 .

Universitat Oberta de Catalunya ( 2017 ) Strategic Plan Stage II 2017-2020 . Barcelona: UOC.

Van den Broucke S. , Dargent G. , Pletschette M. ( 2012 ) ‘ Development and Assessment of Criteria to Select Projects for Funding in the EU Health Programme ’, The European Journal of Public Health , 22 / 4 : 598 – 601 .

Weiss A. P. ( 2007 ) ‘ Reviews and Overviews Measuring the Impact of Medical Research: Moving from Outputs to Outcomes ’, Psychiatry: Interpersonal and Biological Processes , 164 / February : 206 – 14 .

Weißhuhn P. , Helming K. , Ferretti J. ( 2018 ) ‘ Research Impact Assessment in Agriculture—A Review of Approaches and Impact Areas ’, Research Evaluation , 27 / 1 : 36 – 42 .

Wooding S. et al.  ( 2010 ) Mapping the Impact: Exploring the Payback of Arthritis Research . Santa Monica, CA: RAND Corporation .

Wooding S. et al.  ( 2014 ) ‘ Understanding Factors Associated with the Translation of Cardiovascular Research: A Multinational Case Study Approach ’, Implementation Science , 9 / 1 : 47 .

Supplementary data


  • Open access
  • Published: 06 April 2022

Identification of tools used to assess the external validity of randomized controlled trials in reviews: a systematic review of measurement properties

  • Andres Jung (ORCID: 0000-0003-0201-0694),
  • Julia Balzer (ORCID: 0000-0001-7139-229X),
  • Tobias Braun (ORCID: 0000-0002-8851-2574) &
  • Kerstin Luedtke (ORCID: 0000-0002-7308-5469)

BMC Medical Research Methodology, volume 22, Article number: 100 (2022)


Internal and external validity are the most relevant components when critically appraising randomized controlled trials (RCTs) for systematic reviews. However, there is no gold standard to assess external validity. This might be related to the heterogeneity of the terminology as well as to unclear evidence on the measurement properties of available tools. The aim of this review was to identify tools to assess the external validity of RCTs. Further aims were to evaluate the quality of the identified tools and to recommend individual tools for assessing the external validity of RCTs in future systematic reviews.

A two-phase systematic literature search was performed in four databases: PubMed, Scopus, PsycINFO via OVID, and CINAHL via EBSCO. First, tools to assess the external validity of RCTs were identified. Second, studies investigating the measurement properties of these tools were selected. The measurement properties of each included tool were appraised using an adapted version of the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) guidelines.

Thirty-eight publications reporting on the development or validation of 28 tools were included. For 61% (17/28) of the included tools, there was no evidence for measurement properties. For the remaining tools, reliability was the most frequently assessed property. Reliability was judged as “sufficient” for three tools (very low certainty of evidence). Content validity was rated as “sufficient” for one tool (moderate certainty of evidence).

Conclusions

Based on these results, no available tool can be fully recommended to assess the external validity of RCTs in systematic reviews. Several steps are required to overcome the identified difficulties to either adapt and validate available tools or to develop a better suitable tool.

Trial registration

Prospective registration at Open Science Framework (OSF): https://doi.org/10.17605/OSF.IO/PTG4D .


Systematic reviews are powerful research formats to summarize and synthesize the evidence from primary research in health sciences [ 1 , 2 ]. In clinical practice, their results are often applied for the development of clinical guidelines and treatment recommendations [ 3 ]. Consequently, the methodological quality of systematic reviews is of great importance. In turn, the informative value of systematic reviews depends on the overall quality of the included controlled trials [ 3 , 4 ]. Accordingly, the evaluation of the internal and external validity is considered a key step in systematic review methodology [ 4 , 5 ].

Internal validity relates to the systematic error or bias in clinical trials [ 6 ] and expresses the methodological robustness with which the study was conducted. External validity is the inference about the extent to which “a causal relationship holds over variations in persons, settings, treatments and outcomes” [ 7 , 8 ]. There are plenty of definitions for external validity and a variety of different terms. Hence, external validity, generalizability, applicability, and transferability, among others, are used interchangeably in the literature [ 9 ]. Schünemann et al. [ 10 ] suggest that: (1) generalizability “may refer to whether or not the evidence can be generalized from the population from which the actual research evidence is obtained to the population for which a healthcare answer is required”; (2) applicability may be interpreted as “whether or not the research evidence answers the healthcare question asked by a clinician or public health practitioner”; and (3) transferability is often interpreted as “whether research evidence can be transferred from one setting to another”. Four essential dimensions are proposed to evaluate the external validity of controlled clinical trials in systematic reviews: patients, treatment (including comparator) variables, settings, and outcome modalities [ 4 , 11 ]. Its evaluation depends on the specificity of the reviewers' research question, the review's inclusion and exclusion criteria compared to the trial's population, the setting of the study, as well as the quality of reporting of these four dimensions.

In health research, however, external validity is often neglected when critically appraising clinical studies [ 12 , 13 ]. One possible explanation might be the lack of a gold standard for assessing the external validity of clinical trials. Systematic and scoping reviews have examined published frameworks and tools for assessing the external validity of clinical trials in health research [ 9 , 12 , 14 – 18 ]. A substantial heterogeneity of terminology and criteria as well as a lack of guidance on how to assess the external validity of intervention studies was found [ 9 , 12 , 15 – 18 ]. The results and conclusions of previous reviews were based on descriptive as well as content analysis of frameworks and tools on external validity [ 9 , 14 – 18 ]. Although the feasibility of some frameworks and tools was assessed [ 12 ], none of the previous reviews evaluated the quality of the development and validation processes of the frameworks and tools used.

RCTs are considered the most suitable research design for investigating cause and effect mechanisms of interventions [ 19 ]. However, the study design of RCTs is susceptible to a lack of external validity due to the randomization, the use of exclusion criteria and the poor willingness of eligible participants to participate [ 20 , 21 ]. There is evidence that the reliability of external validity evaluations with the same measurement tool differed between randomized and non-randomized trials [ 22 ]. In addition, due to differences in the information requested by reporting guidelines (e.g. the Consolidated Standards of Reporting Trials (CONSORT) statement and the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement), the respective items used for assessing external validity vary between research designs. Acknowledging the importance of RCTs in the medical field, this review focused only on tools developed to assess the external validity of RCTs. The aim was to identify tools to assess the external validity of RCTs in systematic reviews and to evaluate the quality of evidence regarding their measurement properties. Objectives: (1) to identify published measurement tools to assess the external validity of RCTs in systematic reviews; (2) to evaluate the quality of identified tools; (3) to recommend the use of tools to assess the external validity of RCTs in future systematic reviews.

This systematic review was reported in accordance with the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) 2020 Statement [ 23 ] and used an adapted version of the PRISMA flow diagram to illustrate the systematic search strategy used to identify clinimetric papers [ 24 ]. This study was conducted according to an adapted version of the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) methodology for systematic reviews of measurement instruments in health sciences [ 25 – 27 ] and followed recommendations of the JBI manual for systematic reviews of measurement properties [ 28 ]. The COSMIN methodology was chosen since this method is comprehensive and validation processes do not differ substantially between patient-reported outcome measures (PROMs) and measurement instruments of other latent constructs. According to the COSMIN authors, it is acceptable to use this methodology for non-PROMs [ 26 ]. Furthermore, because of its flexibility, it has already been used in systematic reviews assessing measurement tools which are not health measurement instruments [ 29 – 31 ]. However, adaptations or modifications may be necessary [ 26 ]. The type of measurement instrument of interest for the current study was reviewer-reported measurement tools. Pilot tests and adaptation processes of the COSMIN methodology are described below (see section “Quality assessment and evidence synthesis”). The definition of each measurement property evaluated in the present review is based on COSMIN's taxonomy, terminology and definition of measurement properties [ 32 ]. The review protocol was prospectively registered on March 6, 2020 in the Open Science Framework (OSF) with the registration DOI: https://doi.org/10.17605/OSF.IO/PTG4D [ 33 ].

Deviations from the preregistered protocol

One of the aims listed in the review protocol was to evaluate the characteristics and restrictions of measurement tools in terms of terminology and criteria for assessing external validity. This issue has been addressed in two recent reviews with a similar scope [ 9 , 17 ]. Although our eligibility criteria differed, it was concluded that there was no novel data for the present review to extract, since the authors of included tools either did not describe the definition or construct of interest or cited the same reports. Therefore, this objective was omitted.

Literature search and screening

A search of the literature was conducted in four databases: PubMed, Scopus, PsycINFO via OVID, and CINAHL via EBSCO. The eligibility criteria and search strategy were predefined in collaboration with a research librarian and are detailed in Table S1 (see Additional file 1 ). The search strategy was designed according to the COSMIN methodology and consists of the following four key elements: (1) construct (external validity of RCTs from the review authors' perspective), (2) population(s) (RCTs), (3) type of instrument(s) (measurement tools, checklists, surveys etc.), and (4) measurement properties (e.g. validity and reliability) [ 34 ]. The four key elements were divided into two main searches (adapted from previous reviews [ 24 , 35 , 36 ]): the phase 1 search contained the first three key elements to identify measurement tools to assess the external validity of RCTs. The phase 2 search aimed to identify studies evaluating the measurement properties of each tool that was identified and included during phase 1. For this second search, a sensitive PubMed search filter developed by Terwee et al. [ 37 ] was applied. Translations of this filter for the remaining databases were taken from the COSMIN website and from other published COSMIN reviews [ 38 , 39 ] with permission from the authors. Both searches were conducted until March 2021 without restriction regarding the time of publication (databases were searched from inception). In addition, forward citation tracking with Scopus (which is a specialized citation database) was conducted in phase 2 using the ‘cited by’ function. The Scopus search filter was then entered into the ‘search within results’ function. The results from the forward citation tracking with Scopus were added to the database search results in the Rayyan app for screening. Reference lists of the retrieved full-text articles and forward citations with PubMed were scanned manually for any additional studies by one reviewer (AJ) and checked by a second reviewer (KL).

Title and abstract screening for both searches and the full-text screening during phase 2 were performed independently by at least two of the three involved researchers (AJ, KL & TB). For pragmatic reasons, full-text screening and tool/data extraction in phase 1 were performed by one reviewer (AJ) and checked by a second reviewer (TB). This screening method is acceptable for full-text screening as well as data extraction [ 40 ]. Data extraction for both searches was performed with a predesigned extraction sheet based on the recommendations of the COSMIN user manual [ 34 ]. The Rayyan Qatar Computing Research Institute (QCRI) web app [ 41 ] was used to facilitate the screening process (both searches) according to a priori defined eligibility criteria. A pilot test was conducted for both searches in order to reach agreement between the reviewers during the screening process. For this purpose, the first 100 records in phase 1 and the first 50 records in phase 2 (sorted by date) in the Rayyan app were screened by two reviewers independently and, subsequently, issues regarding the feasibility of the screening methods were discussed in a meeting.

Eligibility criteria

Phase 1 search (identification of tools).

Records were considered for inclusion based on their title and abstract according to the following criteria: (1) records that described the development and/or implementation (application), e.g. a manual or handbook, of any tool to assess the external validity of RCTs; (2) systematic reviews that applied tools to assess the external validity of RCTs and which explicitly mentioned the tool in the title or abstract; (3) systematic reviews or any other publication potentially using a tool for external validity assessment, where the tool was not explicitly mentioned in the title or abstract; (4) records that gave other references to, or dealt with, tools for the assessment of the external validity of RCTs, e.g. method papers and commentaries.

The full-text screening was performed to extract or to find references to potential tools. If a tool was cited, but not presented or available in the full-text version, the internet was searched for websites on which this tool was presented, to extract and review for inclusion. Potential tools were extracted and screened for eligibility as follows: measurement tools aiming to assess the external validity of RCTs and designed for implementation in systematic reviews of intervention studies. Since the terms external validity, applicability, generalizability, relevance and transferability are used interchangeably in the literature [ 10 , 11 ], tools aiming to assess one of these constructs were eligible. Exclusion criteria: (1) The multidimensional tool included at least one item related to external validity, but it was not possible to assess and interpret external validity separately. (2) The tool was developed exclusively for study designs other than RCTs. (3) The tool contained items assessing information not requested in the CONSORT-Statement [ 42 ] (e.g. cost-effectiveness of the intervention, salary of health care provider) and these items could not be separated from items on external validity. (4) The tool was published in a language other than English or German. (5) The tool was explicitly designed for a specific medical profession or field and cannot be used in other medical fields.

Phase 2 search (identification of reports on the measurement properties of included tools)

For the phase 2 search, records evaluating the measurement properties of at least one of the included measurement tools were selected. Reports that only used a measurement tool as an outcome measure, or that did not evaluate at least one measurement property of a tool, were excluded. Hence, reports providing data only on the validity or reliability of the sum-scores of multidimensional tools were excluded if the dimension “external validity” was not evaluated separately.

If there was missing data or information (phase 1 or phase 2), the corresponding authors were contacted.

Quality assessment and evidence synthesis

All included reports were systematically evaluated: (1) for their methodological quality by using the adapted COSMIN Risk of Bias (RoB) checklist [ 25 ] and (2) against the updated criteria for good measurement properties [ 26 , 27 ]. Subsequently, all available evidence for each measurement property for the individual tool were summarized and rated against the updated criteria for good measurement properties and graded for their certainty of evidence, according to COSMIN´s modified GRADE approach [ 26 , 27 ]. The quality assessment was performed by two independent reviewers (AJ & JB). In case of irreconcilable disagreement, a third reviewer (TB) was consulted to reach consensus.

The COSMIN RoB checklist is a tool [ 25 , 27 , 32 , 43 ] designed for the systematic evaluation of the methodological quality of studies assessing the measurement properties of health measurement instruments [ 25 ]. Although this checklist was specifically developed for systematic reviews of PROMs, it can also be used for reviews of non-PROMs [ 26 ] or measurement tools of other latent constructs [ 28 , 29 ]. As mentioned in the COSMIN user manual, adaptations for some items in the COSMIN RoB checklist might be necessary, in relation to the construct being measured [ 34 ]. Therefore, pilot tests were performed for the assessment of measurement properties of tools assessing the quality of RCTs before data extraction, aiming to ensure feasibility during the planned evaluation of the included tools. The pilot tests were performed with a random sample of publications on measurement instruments of potentially relevant tools. After each pilot test, results and problems regarding the comprehensibility, relevance and feasibility of the instructions, items, and response options in relation to the construct of interest were discussed. Where necessary, adaptations and/or supplements were added to the instructions of the evaluation with the COSMIN RoB checklist. Saturation was reached after two rounds of pilot testing. Substantial adaptations or supplements were required for Box 1 (‘development process’) and Box 10 (‘responsiveness’) of the COSMIN RoB checklist. Minor adaptations were necessary for the remaining boxes. The specification list, including the adaptations, can be seen in Table S2 (see Additional file 2 ). The methodological quality of included studies was rated via the four-point rating scale of the COSMIN RoB checklist as “inadequate”, “doubtful”, “adequate”, or “very good” [ 25 ]. The lowest score of any item in a box is taken to determine the overall rating of the methodological quality of each single study on a measurement property [ 25 ].
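The "lowest score counts" rule described above lends itself to a one-line aggregation; the Python sketch below is illustrative only, using the four-point COSMIN RoB scale as an ordered ranking.

```python
# Overall methodological quality of a study on one measurement property:
# the lowest item rating within a COSMIN box determines the box rating.
RANK = {"inadequate": 0, "doubtful": 1, "adequate": 2, "very good": 3}

def box_rating(item_ratings: list[str]) -> str:
    return min(item_ratings, key=RANK.__getitem__)

print(box_rating(["very good", "adequate", "doubtful"]))  # -> "doubtful"
```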

After the RoB assessment, the result of each single study on a measurement property was rated against the updated criteria for good measurement properties for content validity [ 27 ] and for the remaining measurement properties [ 26 ] as “sufficient” (+), “insufficient” (-), or “indeterminate” (?). These ratings were summarized, and an overall rating for each measurement property was given as “sufficient” (+), “insufficient” (-), “inconsistent” (±), or “indeterminate” (?). However, the overall rating criteria for good content validity were adapted to the research topic of the present review. This rating usually requires an additional subjective judgement from the reviewers [ 44 ]. Since one of the biggest limitations in this field of research is the lack of consensus on terminology and criteria, as well as on how to assess external validity [ 9 , 12 ], a subjective judgement by the reviewers was considered inappropriate. After this issue had also been discussed with a leading member of the COSMIN steering committee, the reviewers’ rating was omitted. A “sufficient” (+) overall rating was given if there was evidence of face or content validity of the final version of the measurement tool, as assessed by a user or expert panel. Otherwise, content validity was rated “indeterminate” (?) or “insufficient” (-).
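
For illustration only, this summarizing step can be sketched as follows; this is a simplification of the COSMIN criteria, not their literal wording.

```python
# Simplified sketch of how single-study ratings on one measurement property
# ("+" sufficient, "-" insufficient, "?" indeterminate) can be summarized
# into an overall rating, including "±" for inconsistent results.

def overall_rating(study_ratings):
    informative = [r for r in study_ratings if r in ("+", "-")]
    if not informative:
        return "?"                      # only indeterminate results
    if all(r == "+" for r in informative):
        return "+"
    if all(r == "-" for r in informative):
        return "-"
    return "±"                          # conflicting results across studies

print(overall_rating(["+", "+", "?"]))  # -> "+"
print(overall_rating(["+", "-"]))       # -> "±"
```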

The summarized evidence for each measurement property of each individual tool was graded using COSMIN's modified GRADE approach [ 26 , 27 ]. The certainty (quality) of evidence was graded as “high”, “moderate”, “low”, or “very low” according to the approach for content validity [ 27 ] and for the remaining measurement properties [ 26 ]. COSMIN's modified GRADE approach distinguishes between four factors influencing the certainty of evidence: risk of bias, inconsistency, indirectness, and imprecision. The starting point for all measurement properties is high certainty of evidence, which is subsequently downgraded by one to three levels per factor when there is risk of bias, (unexplained) inconsistency, imprecision (not considered for content validity [ 27 ]), or indirect results [ 26 , 27 ]. If there is no study on the content validity of a tool, the starting point for this measurement property is “moderate” and is subsequently downgraded depending on the quality of the development process [ 27 ]. The grading process according to COSMIN [ 26 , 27 ] is described in Table S4. Selective reporting bias or publication bias is not taken into account in COSMIN's modified GRADE approach, because of a lack of registries for studies on measurement properties [ 26 ].
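
The downgrading logic can be sketched roughly as follows; the factor names and the number of levels deducted are illustrative inputs, not COSMIN's literal rules.

```python
# Rough sketch of the modified GRADE logic: start at "high" certainty
# (or "moderate" when content validity rests only on the development study)
# and downgrade by the levels assigned to risk of bias, inconsistency,
# indirectness and imprecision.

LEVELS = ["very low", "low", "moderate", "high"]

def grade_certainty(downgrades, start="high"):
    """downgrades: dict mapping factor -> levels deducted (0-3 per factor)."""
    index = LEVELS.index(start) - sum(downgrades.values())
    return LEVELS[max(index, 0)]

print(grade_certainty({"risk_of_bias": 2, "indirectness": 1}))  # -> "very low"
```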

The evidence synthesis was performed qualitatively according to the COSMIN methodology [ 26 ]. If several reports had revealed homogeneous quantitative data (e.g. same statistics, same population) on the internal consistency, reliability, measurement error or hypotheses testing of a measurement tool, pooling of the results would have been considered using generic inverse-variance (random-effects) methodology, with weighted means and 95% confidence intervals for each measurement property [ 34 ]. No subgroup analysis was planned. However, statistical pooling was not possible in the present review.
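
Although pooling was not possible here, the following sketch (with made-up numbers) shows how generic inverse-variance random-effects pooling (DerSimonian-Laird) could be applied to, for example, reliability coefficients and their variances if homogeneous data were available.

```python
import math

# Generic inverse-variance random-effects (DerSimonian-Laird) pooling of
# hypothetical estimates (e.g. ICCs) with their variances; returns the
# pooled estimate and a 95% confidence interval.

def pool_random_effects(estimates, variances):
    w = [1.0 / v for v in variances]                        # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(estimates) - 1)) / c)         # between-study variance
    w_star = [1.0 / (v + tau2) for v in variances]          # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)

print(pool_random_effects([0.70, 0.62, 0.81], [0.004, 0.006, 0.003]))
```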

We used three criteria for the recommendation of a measurement tool, in accordance with the COSMIN manual: (A) a tool with “evidence for sufficient content validity (any level) and at least low-quality evidence for sufficient internal consistency” can be recommended; (B) a tool “categorized not in A or C” requires further research on its quality before it can be recommended; and (C) a tool with “high quality evidence for an insufficient psychometric property” should not be recommended [ 26 ].
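
Read as a decision rule, the three categories can be sketched as follows; the boolean inputs are simplifications of the full COSMIN criteria.

```python
# Simplified sketch of the A/B/C recommendation categories.

def recommendation_category(sufficient_content_validity,
                            sufficient_internal_consistency,
                            high_quality_insufficient_property):
    if high_quality_insufficient_property:
        return "C"   # should not be recommended
    if sufficient_content_validity and sufficient_internal_consistency:
        return "A"   # can be recommended
    return "B"       # further research required

# In the present review, all included tools fell into category "B":
print(recommendation_category(False, False, False))  # -> "B"
```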

Literature search and selection process

Figure  1 shows the selection process. In the phase 1 search, 5020 of 5397 non-duplicate records were excluded as irrelevant. The remaining 377 reports were screened in full text, and 74 potential tools were extracted. After reaching consensus, 46 tools were excluded (reasons for exclusion are presented in Table S3; see Additional file 3 ) and 28 were finally included. Any disagreements during the screening process were resolved through discussion. In one case during full-text screening in the phase 1 search, the whole review team was involved to reach consensus on the inclusion or exclusion of two tools (the Agency for Healthcare Research and Quality (AHRQ) criteria for applicability and the TRANSFER approach, both listed in Table S3).

In the phase 2 search, 2191 non-duplicate records were screened by title and abstract, and 2146 were excluded because they did not assess any measurement property of the included tools. Of the remaining 45 reports, 8 were included. The most common reason for exclusion was that reports evaluating the measurement properties of multidimensional tools did not evaluate external validity as a separate dimension. For example, one study assessing the interrater reliability of the GRADE method [ 45 ] was identified during full-text screening but had to be excluded, since it did not provide separate data on the reliability of the indirectness domain (representing external validity). Two additional reports were included during reference screening. Any disagreements during the screening process were resolved through discussion.

In total, 38 publications on the development or evaluation of the measurement properties of the 28 included tools underwent quality appraisal according to the adapted COSMIN guidelines.

Figure 1. Flow diagram “of systematic search strategy used to identify clinimetric papers” [ 24 ].

We contacted the corresponding authors of three reports [ 46 – 48 ] for additional information; only one replied [ 48 ].

Methods to assess the external validity of RCTs

During full-text screening in phase 1, several approaches to assessing the external validity of RCTs were identified (Table  1 ). Two main concepts were identified: experimental/statistical methods and non-experimental methods. The experimental/statistical methods were summarized and collated into five subcategories, giving a descriptive overview of the different approaches used to assess external validity. However, according to our eligibility criteria, these methods were excluded, since they were not developed for use in systematic reviews of interventions. In addition, a comparison of these methods, as well as a risk-of-bias appraisal with the COSMIN RoB checklist, would not have been feasible. Therefore, the experimental/statistical methods summarized in Table  1 were not included for further evaluation.

Characteristics of included measurement tools

The included tools and their characteristics are listed in Table  2 . Overall, the tools were heterogeneous with respect to the number of items or dimensions, response options and development processes. The number of items varied between one and 26, and the response options ranged from 2-point to 5-point scales. Most tools used a 3-point scale ( n  = 20/28, 71%). For 14/28 (50%) of the tools, the development process was not described in detail [ 63 – 76 ]. Seven review authors appear to have developed their own tool but provided no information on the development process [ 63 – 68 , 71 ].

The constructs that the tools, or their dimensions of interest, aimed to measure were diverse. Two tools focused on characterizing RCTs on an efficacy-effectiveness continuum [ 47 , 86 ], three focused predominantly on the reporting quality of factors essential to external validity [ 69 , 75 , 88 ] (rather than on external validity itself), 18 aimed to assess the representativeness, generalizability or applicability of the population, setting, intervention, and/or outcome measures to usual practice [ 22 , 63 – 65 , 70 , 71 , 73 , 74 , 76 – 78 , 81 – 83 , 92 , 94 , 100 ], and five appeared to measure a mixture of these different constructs related to external validity [ 66 , 68 , 72 , 79 , 98 ]. However, the construct of interest was not adequately described for most tools (see below).

Measurement properties

The results of the methodological quality assessment according to the adapted COSMIN RoB checklist are detailed in Table 3 . If all data on hypotheses testing in an article had the same methodological quality rating, they were combined and summarized in Table 3  in accordance with the COSMIN manual [ 34 ]. The results of the ratings against the updated criteria for good measurement properties and the overall certainty of evidence, according to the modified GRADE approach, can be seen in Table 4 . The detailed grading is described in Table S4 (see Additional file 4 ). Disagreements between reviewers during the quality assessment were resolved through discussion.

Content validity

The methodological quality of the development process was “inadequate” for 19/28 (68%) of the included tools [ 63 – 66 , 68 – 74 , 76 , 78 , 81 , 88 , 98 , 100 ]. This was mainly due to an insufficient description of the construct to be measured or the target population, or to missing pilot tests. Six development studies had a “doubtful” methodological quality [ 22 , 75 , 77 , 79 , 82 , 83 ] and three an “adequate” methodological quality [ 47 , 48 , 94 ].

There was evidence for content validation of five tools [ 22 , 47 , 79 , 81 , 98 ]. However, the methodological quality of the content validity studies was “adequate” and “very good” only for the Rating of Included Trials on the Efficacy-Effectiveness Spectrum (RITES) tool [ 47 ] and “doubtful” for Cho's Clinical Relevance Instrument [ 79 ], the “external validity”-dimension of the Downs & Black-checklist [ 22 ], the “Selection Bias”-dimension of the Effective Public Health Practice Project (EPHPP) tool [ 98 ], and the “Clinical Relevance” tool [ 81 ]. The overall certainty of evidence for content validity was “very low” for 19 tools (mainly due to very serious risk of bias and serious indirectness) [ 63 – 76 , 78 , 82 , 86 , 88 , 100 ], “low” for three tools (mainly due to serious risk of bias or serious indirectness) [ 77 , 83 , 94 ] and “moderate” for six tools (mainly due to serious risk of bias or serious indirectness) [ 22 , 47 , 79 , 81 , 92 , 98 ]. All but one tool had an “indeterminate” content validity. The RITES tool [ 47 ] had “moderate” certainty of evidence for “sufficient” content validity.

Internal consistency

One study assessed the internal consistency for one tool (“external validity”-dimension of the Downs & Black-checklist) [ 22 ]. The methodological quality of this study was “doubtful” due to a lack of evidence on unidimensionality (or structural validity). Thus, this tool had a “very low” certainty of evidence for “indeterminate” internal consistency. Reasons for downgrading were a very serious risk of bias and imprecision.

Reliability

Of 13 studies assessing the reliability of nine tools, 11 evaluated the interrater reliability [ 80 ,  84 , 86 , 87 , 90 , 93 – 95 , 97 , 99 ], one the test-retest reliability [ 98 ], and one evaluated both [ 22 ]. Two studies had an “inadequate” [ 93 , 101 ], two a “doubtful” [ 98 , 99 ], three an “adequate” [ 80 ,  91 , 94 , 95 ], and six a “very good” methodological quality [ 22 , 84 , 86 , 87 ]. The overall certainty of evidence was “very low” for five tools (for reasons for downgrading, see Table S4) [ 47 , 73 , 88 , 92 , 94 ]. The certainty of evidence was “low” for the “Selection Bias”-dimension of the EPHPP tool (due to serious risk of bias and imprecision) [ 98 ] and “moderate” for Gartlehner's tool [ 86 ], the “external validity”-dimension of the Downs & Black-checklist [ 22 ], as well as the Clinical Relevance Instrument [ 79 ] (due to serious risk of bias and indirectness).

Out of nine evaluated tools, the Downs & Black-checklist [ 22 ] had “inconsistent” results on reliability. The Clinical Relevance Instrument [ 79 ], Gartlehner's tool [ 86 ], the “Selection Bias”-dimension of the EPHPP [ 98 ], the indirectness-dimension of the GRADE handbook [ 92 ] and the modified indirectness-checklist [ 94 ] had an “insufficient” rating for reliability. Green & Glasgow's tool [ 88 ], the external validity dimension of the U.S. Preventive Services Task Force (USPSTF) manual [ 73 ] and the RITES tool [ 47 ] had a “very low” certainty of evidence for “sufficient” reliability.

Measurement error

Measurement error was reported for three tools. Two studies on the measurement error of Gartlehner's tool [ 86 ] and Loyka's external validity framework [ 75 ] had an “adequate” methodological quality. Two studies on the measurement error of the external validity dimension of the Downs & Black-checklist [ 22 ] had an “inadequate” methodological quality. However, all three tools had a “very low” certainty of evidence for “indeterminate” measurement error. Reasons for downgrading were risk of bias, indirectness, and imprecision due to small sample sizes.
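
For orientation, measurement error is commonly expressed through the standard error of measurement (SEM) and the smallest detectable change (SDC); the formulas below are standard definitions (assuming a reliability design with an available ICC and score standard deviation SD) and are not taken from the included studies. In the COSMIN criteria for good measurement properties, a “sufficient” rating requires the SDC (or the limits of agreement) to be smaller than the minimal important change.

```latex
\mathrm{SEM} = SD \cdot \sqrt{1 - \mathrm{ICC}}, \qquad
\mathrm{SDC}_{95} = 1.96 \cdot \sqrt{2} \cdot \mathrm{SEM}
```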

Criterion validity

Criterion validity was reported only for Gartlehner's tool [ 86 ]. Although there was no gold standard available to assess the criterion validity of this tool, the authors used expert opinion as the reference standard. The study assessing this measurement property had an “adequate” methodological quality. The overall certainty of evidence was “very low” for “sufficient” criterion validity due to risk of bias, imprecision, and indirectness.

Construct validity (hypotheses testing)

Five studies [ 22 , 90 , 91 , 97 , 98 ] reported on the construct validity of four tools. Three studies had a “doubtful” [ 90 , 91 , 98 ], one an “adequate” [ 22 ] and one a “very good” [ 97 ] methodological quality. The overall certainty of evidence was “very low” for three tools (mainly due to serious risk of bias, imprecision and serious indirectness) [ 22 , 88 , 98 ] and “low” for one tool (due to imprecision and serious indirectness) [ 47 ]. The “Selection Bias”-dimension of the EPHPP tool [ 98 ] had “very low” certainty of evidence for “sufficient” construct validity, and the RITES tool [ 47 ] had “low” certainty of evidence for “sufficient” construct validity. Both Green & Glasgow's tool [ 88 ] and the Downs & Black-checklist [ 22 ] had “very low” certainty of evidence for “insufficient” construct validity.

Structural validity and cross-cultural validity were not assessed in any of the included studies.

Summary and interpretation of results

To our knowledge, this is the first systematic review identifying and evaluating the measurement properties of tools to assess the external validity of RCTs. A total of 28 tools were included. Overall, for more than half (n = 17/28, 61%) of the included tools, no measurement properties were reported. Only five tools had at least one “sufficient” measurement property. Moreover, the development process was not described for 14/28 (50%) of the included tools. Reliability was assessed most frequently (including interrater and/or test-retest reliability). Only three of the included tools had “sufficient” reliability (“very low” certainty of evidence) [ 47 , 73 , 88 ]. Hypotheses testing was evaluated for four tools, with half of them having “sufficient” construct validity (“low” and “very low” certainty of evidence) [ 47 , 98 ]. Measurement error was evaluated for three tools, all with an “indeterminate” quality rating (“low” and “very low” certainty of evidence) [ 22 , 75 , 86 ]. Criterion validity was evaluated for one tool and was rated “sufficient” with “very low” certainty of evidence [ 86 ]. The RITES tool [ 47 ] was the measurement tool with the strongest evidence for validity and reliability: its content validity, based on international expert consensus, was “sufficient” with “moderate” certainty of evidence, while its reliability and construct validity were rated “sufficient” with “very low” and “low” certainty of evidence, respectively.

Following the three criteria for the recommendation of a measurement tool, all included tools were categorized as ‘B’. Hence, further research is required before a recommendation for or against any of the included tools can be made [ 26 ]. Sufficient internal consistency may not be relevant for the assessment of external validity, as the measurement models might not be fully reflective. However, none of the authors/developers specified the measurement model of their tool.

Specification of the measurement model is considered a requirement for judging the appropriateness of a scale or tool for the latent construct of interest during its development [ 102 ]. It could be argued that researchers implicitly assume their tool to follow a reflective measurement model. For example, Downs and Black [ 22 ] assessed internal consistency without prior testing for unidimensionality or structural validity of the tool. Structural validity or unidimensionality is a prerequisite for internal consistency [ 26 ], and both measurement properties are only relevant for reflective measurement models [ 103 , 104 ]. Misspecification, as well as a lack of specification, of the measurement model can lead to limitations when developing and validating a scale or tool [ 102 , 105 ]. Hence, the specification of measurement models should be considered in future research.
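
As an illustration of this order of analysis, the sketch below (hypothetical data) checks a crude indicator of unidimensionality before computing coefficient alpha; a single dominant eigenvalue of the inter-item correlation matrix stands in here for a formal factor analysis.

```python
import numpy as np

def cronbach_alpha(items):
    """items: respondents x items matrix of scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

def looks_unidimensional(items, threshold=0.4):
    """Crude check: share of variance captured by the largest eigenvalue."""
    eigvals = np.linalg.eigvalsh(np.corrcoef(items, rowvar=False))
    return eigvals.max() / eigvals.sum() >= threshold

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
data = latent + rng.normal(scale=0.8, size=(200, 5))   # five correlated items
if looks_unidimensional(data):
    print(round(cronbach_alpha(data), 2))
```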

Content validity is the most important measurement property of health measurement instruments [ 27 ], and a lack of face validity is considered a strong argument for not using a measurement instrument, or for stopping its further evaluation [ 106 ]. Only the RITES tool [ 47 ] had evidence of “sufficient” content validity. Nevertheless, this tool does not directly measure the external validity of RCTs: it was developed to classify RCTs on an efficacy-effectiveness continuum. An RCT categorized as highly pragmatic or as having a “strong emphasis on effectiveness” [ 47 ] implies that the study design provides relatively applicable results, but it does not automatically imply high external validity or generalizability of the trial's characteristics to other specific contexts and settings [ 107 ]. Even a highly pragmatic/effectiveness study might have little applicability or generalizability to a specific research question of review authors. An individual assessment of external validity may therefore still be needed by review authors, in accordance with their research question and other contextual factors.

Another tool that might have some degree of content or face validity is the indirectness-dimension of the GRADE method [ 92 ], a widely used and accepted method for research synthesis in the health sciences [ 108 ]. It has evolved over the years based on the work of the GRADE Working Group and on feedback from users worldwide [ 108 ]. Thus, it might be assumed that this method has a high degree of face validity, although it has not been systematically tested for content validity.

If all tools in a review are categorized as ‘B’, the COSMIN guidelines suggest that the measurement instrument “with the best evidence for content validity could be the one to be provisionally recommended for use, until further evidence is provided” [ 34 ]. In accordance with this suggestion, provisional use of the RITES tool [ 47 ] might therefore be justified until more research on this topic is available. However, users should be aware of its limitations, as described above.

Implications for future research

This study affirms and supplements what is already known from previous reviews [ 9 , 12 , 14 – 18 ]. The heterogeneity in the characteristics of the tools included in those reviews was also observed in the present review. Although Dyrvig et al. [ 16 ] did not assess the measurement properties of available tools, they reported a lack of empirical support for the items included in measurement tools. The authors of previous reviews could not recommend a measurement tool; although their conclusions were based mainly on descriptive analyses rather than on an assessment of the quality of the tools, the conclusion of the present systematic review is consistent with theirs.

One major challenge in this field is the substantial heterogeneity in the terminology, criteria and guidance used to assess the external validity of RCTs. Developing new tools and/or further revising (and validating) available tools may not be appropriate before consensus-based standards are established. More generally, it may be questioned whether these methods for assessing external validity in systematic reviews of interventions are suitable at all [ 9 , 12 ]. The experimental/statistical methods presented in Table  1 may offer a more objective approach to evaluating the external validity of RCTs; however, they are not feasible to implement in the conduct of systematic reviews. Furthermore, they focus mainly on the characteristics and generalizability of the study populations, which is insufficient to assess the external validity of clinical trials [ 109 ], since they do not consider other relevant dimensions of external validity, such as intervention settings or treatment variables [ 4 , 109 ].

The methodological possibilities in tool/scale development and validation for this topic have not yet been fully exploited. More than 20 years ago, there was no consensus on the definition of the quality of RCTs. In 1998, Verhagen et al. [ 110 ] performed a Delphi study to reach consensus on this definition and to create a quality criteria list. This criteria list has since guided tool development, and its criteria are still implemented in methodological quality and risk-of-bias assessment tools (e.g. the Cochrane Collaboration risk of bias tools 1 and 2.0, and the Physiotherapy Evidence Database (PEDro) scale). Consequently, it seems necessary to seek consensus in a similar way in order to overcome the issues regarding the external validity of RCTs. After consensus is reached, further development and validation are needed, following standard guidelines for scale/tool development (e.g. de Vet et al. [ 106 ]; Streiner et al. [ 111 ]; DeVellis [ 112 ]). Since the assessment of external validity appears to be highly context-dependent [ 9 , 12 ], this should be taken into account in future research. A conventional checklist approach seems inappropriate [ 9 , 12 , 109 ]; a more comprehensive but flexible approach might be necessary. The experimental/statistical methods (Table  1 ) may offer a reference standard for convergent validity testing of the dimension “patient population” in future research.

This review has highlighted the necessity for more research in this area. Published studies and evaluation tools are important sources of information and should inform the development of a new tool or approach.

Strengths and limitations

One strength of the present review is the two-phase search method, which we believe adequately addressed the likelihood of missing relevant studies. The forward citation tracking using Scopus is another strength. The quality of the included measurement tools was assessed with an adapted and comprehensive methodology (COSMIN); none of the previous reviews has attempted such an assessment.

There are some limitations to the present review. First, a search for grey literature was not performed. Second, we focused on RCTs only and did not include assessment tools for non-randomized or other observational study designs. Third, due to the heterogeneity in terminology, we might have missed some tools with our electronic literature search strategy. Furthermore, it was challenging to find studies on the measurement properties of some included tools that did not have a specific name or abbreviation (such as EVAT). We tried to address this potential limitation by performing comprehensive reference screening and snowballing (including forward citation screening).

Based on the results of this review, no available measurement tool can be fully recommended for use in systematic reviews to assess the external validity of RCTs. Several steps are required to overcome the identified difficulties before a new tool is developed or available tools are further revised and validated.

Availability of data and materials

All data generated or analyzed during this study are included in this published article (and its supplementary information files).

Abbreviations

CASP: Critical Appraisal Skills Programme
CCBRG: Cochrane Collaboration Back Review Group
CCT: controlled clinical trial
COSMIN: COnsensus-based Standards for the selection of health Measurement INstruments
EPHPP: Effective Public Health Practice Project
EVAT: External Validity Assessment Tool
FAME: Feasibility, Appropriateness, Meaningfulness and Effectiveness
GATE: Graphical Appraisal Tool for Epidemiological Studies
GAP: Generalizability, Applicability and Predictability
GRADE: Grading of Recommendations Assessment, Development and Evaluation
HTA: Health Technology Assessment
ICC: intraclass correlation
LEGEND: Let Evidence Guide Every New Decision
NICE: National Institute for Health and Care Excellence
PEDro: Physiotherapy Evidence Database
PRECIS: PRagmatic Explanatory Continuum Indicator Summary
RCT: randomized controlled trial
RITES: Rating of Included Trials on the Efficacy-Effectiveness Spectrum
TREND: Transparent Reporting of Evaluations with Nonrandomized Designs
USPSTF: U.S. Preventive Services Task Force

References

Bastian H, Glasziou P, Chalmers I. Seventy-five trials and eleven systematic reviews a day: how will we ever keep up? PLoS Med. 2010;7:e1000326.

Aromataris E, Munn Z (eds). JBI Manual for Evidence Synthesis. JBI Man Evid Synth. 2020.  https://doi.org/10.46658/jbimes-20-01

Knoll T, Omar MI, Maclennan S, et al. Key Steps in Conducting Systematic Reviews for Underpinning Clinical Practice Guidelines: Methodology of the European Association of Urology. Eur Urol. 2018;73:290–300.

Jüni P, Altman DG, Egger M. Systematic reviews in health care: Assessing the quality of controlled clinical trials. BMJ. 2001;323:42–6.

Büttner F, Winters M, Delahunt E, Elbers R, Lura CB, Khan KM, Weir A, Ardern CL. Identifying the ’incredible’! Part 1: assessing the risk of bias in outcomes included in systematic reviews. Br J Sports Med. 2020;54:798–800.

Boutron I, Page MJ, Higgins JPT, Altman DG, Lundh A, Hróbjartsson A, Group CBM. Considering bias and conflicts of interest among the included studies. Cochrane Handb. Syst. Rev. Interv. 2021; version 6.2 (updated Febr. 2021)

Cook TD, Campbell DT, Shadish W. Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin; 2002.

Avellar SA, Thomas J, Kleinman R, Sama-Miller E, Woodruff SE, Coughlin R, Westbrook TR. External Validity: The Next Step for Systematic Reviews? Eval Rev. 2017;41:283–325.

Weise A, Büchter R, Pieper D, Mathes T. Assessing context suitability (generalizability, external validity, applicability or transferability) of findings in evidence syntheses in healthcare-An integrative review of methodological guidance. Res Synth Methods. 2020;11:760–79.

Schunemann HJ, Tugwell P, Reeves BC, Akl EA, Santesso N, Spencer FA, Shea B, Wells G, Helfand M. Non-randomized studies as a source of complementary, sequential or replacement evidence for randomized controlled trials in systematic reviews on the effects of interventions. Res Synth Methods. 2013;4:49–62.

Atkins D, Chang SM, Gartlehner G, Buckley DI, Whitlock EP, Berliner E, Matchar D. Assessing applicability when comparing medical interventions: AHRQ and the Effective Health Care Program. J Clin Epidemiol. 2011;64:1198–207.

Burchett HED, Blanchard L, Kneale D, Thomas J. Assessing the applicability of public health intervention evaluations from one setting to another: a methodological study of the usability and usefulness of assessment tools and frameworks. Heal Res policy Syst. 2018;16:88.

Dekkers OM, von Elm E, Algra A, Romijn JA, Vandenbroucke JP. How to assess the external validity of therapeutic trials: a conceptual approach. Int J Epidemiol. 2010;39:89–94.

Burchett H, Umoquit M, Dobrow M. How do we know when research from one setting can be useful in another? A review of external validity, applicability and transferability frameworks. J Health Serv Res Policy. 2011;16:238–44.

Cambon L, Minary L, Ridde V, Alla F. Transferability of interventions in health education: a review. BMC Public Health. 2012;12:497.

Dyrvig A-K, Kidholm K, Gerke O, Vondeling H. Checklists for external validity: a systematic review. J Eval Clin Pract. 2014;20:857–64.

Munthe-Kaas H, Nøkleby H, Nguyen L. Systematic mapping of checklists for assessing transferability. Syst Rev. 2019;8:22.

Nasser M, van Weel C, van Binsbergen JJ, van de Laar FA. Generalizability of systematic reviews of the effectiveness of health care interventions to primary health care: concepts, methods and future research. Fam Pract. 2012;29(Suppl 1):i94–103.

Hariton E, Locascio JJ. Randomised controlled trials - the gold standard for effectiveness research: Study design: randomised controlled trials. BJOG. 2018;125:1716.

Pressler TR, Kaizar EE. The use of propensity scores and observational data to estimate randomized controlled trial generalizability bias. Stat Med. 2013;32:3552–68.

Rothwell PM. External validity of randomised controlled trials: “to whom do the results of this trial apply?” Lancet. 2005;365:82–93.

Downs SH, Black N. The feasibility of creating a checklist for the assessment of the methodological quality both of randomised and non-randomised studies of health care interventions. J Epidemiol Community Health. 1998;52:377–84.

Page MJ, Moher D, Bossuyt PM, et al. PRISMA 2020 explanation and elaboration: updated guidance and exemplars for reporting systematic reviews. BMJ. 2021;372:n160.

Clark R, Locke M, Hill B, Wells C, Bialocerkowski A. Clinimetric properties of lower limb neurological impairment tests for children and young people with a neurological condition: A systematic review. PLoS One. 2017;12:e0180031.

Mokkink LB, de Vet HCW, Prinsen CAC, Patrick DL, Alonso J, Bouter LM, Terwee CB. COSMIN Risk of Bias checklist for systematic reviews of Patient-Reported Outcome Measures. Qual Life Res. 2018;27:1171–9.

Prinsen CAC, Mokkink LB, Bouter LM, Alonso J, Patrick DL, de Vet HCW, Terwee CB. COSMIN guideline for systematic reviews of patient-reported outcome measures. Qual Life Res. 2018;27:1147–57.

Terwee CB, Prinsen CAC, Chiarotto A, Westerman MJ, Patrick DL, Alonso J, Bouter LM, de Vet HCW, Mokkink LB. COSMIN methodology for evaluating the content validity of patient-reported outcome measures: a Delphi study. Qual Life Res. 2018;27:1159–70.

Stephenson M, Riitano D, Wilson S, Leonardi-Bee J, Mabire C, Cooper K, Monteiro da Cruz D, Moreno-Casbas MT, Lapkin S. Chap. 12: Systematic Reviews of Measurement Properties. JBI Man Evid Synth. 2020  https://doi.org/10.46658/JBIMES-20-13

Glover PD, Gray H, Shanmugam S, McFadyen AK. Evaluating collaborative practice within community-based integrated health and social care teams: a systematic review of outcome measurement instruments. J Interprof Care. 2021;1–15.  https://doi.org/10.1080/13561820.2021.1902292 . Epub ahead of print.

Maassen SM, Weggelaar Jansen AMJW, Brekelmans G, Vermeulen H, van Oostveen CJ. Psychometric evaluation of instruments measuring the work environment of healthcare professionals in hospitals: a systematic literature review. Int J Qual Heal care J Int Soc Qual Heal Care. 2020;32:545–57.

Al Jabri FYM, Kvist T, Azimirad M, Turunen H. A systematic review of healthcare professionals’ core competency instruments. Nurs Health Sci. 2021;23:87–102.

Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, Bouter LM, de Vet HCW. The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. J Clin Epidemiol. 2010;63:737–45.

Jung A, Balzer J, Braun T, Luedtke K. Psychometric properties of tools to measure the external validity of randomized controlled trials: a systematic review protocol. 2020;  https://doi.org/10.17605/OSF.IO/PTG4D

Mokkink LB, Prinsen CAC, Patrick DL, Alonso J, Bouter LM, de Vet HCW, Terwee CB COSMIN manual for systematic reviews of PROMs, user manual. 2018;1–78. https://www.cosmin.nl/wp-content/uploads/COSMIN-syst-review-for-PROMs-manual_version-1_feb-2018-1.pdf . Accessed 3 Feb 2020.

Bialocerkowski A, O’shea K, Pin TW. Psychometric properties of outcome measures for children and adolescents with brachial plexus birth palsy: a systematic review. Dev Med Child Neurol. 2013;55:1075–88.

Matthews J, Bialocerkowski A, Molineux M. Professional identity measures for student health professionals - a systematic review of psychometric properties. BMC Med Educ. 2019;19:308.

Terwee CB, Jansma EP, Riphagen II, De Vet HCW. Development of a methodological PubMed search filter for finding studies on measurement properties of measurement instruments. Qual Life Res. 2009;18:1115–23.

Sierevelt IN, Zwiers R, Schats W, Haverkamp D, Terwee CB, Nolte PA, Kerkhoffs GMMJ. Measurement properties of the most commonly used Foot- and Ankle-Specific Questionnaires: the FFI, FAOS and FAAM. A systematic review. Knee Surg Sports Traumatol Arthrosc. 2018;26:2059–73.

van der Hout A, Neijenhuijs KI, Jansen F, et al. Measuring health-related quality of life in colorectal cancer patients: systematic review of measurement properties of the EORTC QLQ-CR29. Support Care Cancer. 2019;27:2395–412.

Whiting P, Savović J, Higgins JPT, Caldwell DM, Reeves BC, Shea B, Davies P, Kleijnen J, Churchill R. ROBIS: A new tool to assess risk of bias in systematic reviews was developed. J Clin Epidemiol. 2016;69:225–34.

Ouzzani M, Hammady H, Fedorowicz Z, Elmagarmid A. Rayyan-a web and mobile app for systematic reviews. Syst Rev. 2016;5:210.

Moher D, Hopewell S, Schulz KF, Montori V, Gøtzsche PC, Devereaux PJ, Elbourne D, Egger M, Altman DG. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. Int J Surg. 2012;10:28–55.

Mokkink LB, Terwee CB, et al. The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Qual Life Res. 2010;19:539–49.

Terwee CB, Prinsen CA, Chiarotto A, De Vet H, Bouter LM, Alonso J, Westerman MJ, Patrick DL, Mokkink LB. COSMIN methodology for assessing the content validity of PROMs–user manual. Amsterdam VU Univ. Med. Cent. 2018;  https://cosmin.nl/wp-content/uploads/COSMIN-methodology-for-content-validity-user-manual-v1.pdf . Accessed 3 Feb 2020.

Mustafa RA, Santesso N, Brozek J, et al. The GRADE approach is reproducible in assessing the quality of evidence of quantitative evidence syntheses. J Clin Epidemiol. 2013;66:735–6.

Jennings H, Hennessy K, Hendry GJ. The clinical effectiveness of intra-articular corticosteroids for arthritis of the lower limb in juvenile idiopathic arthritis: A systematic review. Pediatr Rheumatol. 2014. https://doi.org/10.1186/1546-0096-12-23 .

Wieland LS, Berman BM, Altman DG, et al. Rating of Included Trials on the Efficacy-Effectiveness Spectrum: development of a new tool for systematic reviews. J Clin Epidemiol. 2017;84:95–104.

Atkins D, Briss PA, Eccles M, et al. Systems for grading the quality of evidence and the strength of recommendations II: pilot study of a new system. BMC Health Serv Res. 2005;5:25.

Abraham NS, Wieczorek P, Huang J, Mayrand S, Fallone CA, Barkun AN. Assessing clinical generalizability in sedation studies of upper GI endoscopy. Gastrointest Endosc. 2004;60:28–33.

Arabi YM, Cook DJ, Zhou Q, et al. Characteristics and Outcomes of Eligible Nonenrolled Patients in a Mechanical Ventilation Trial of Acute Respiratory Distress Syndrome. Am J Respir Crit Care Med. 2015;192:1306–13.

Williams AC de C, Nicholas MK, Richardson PH, Pither CE, Fernandes J. Generalizing from a controlled trial: The effects of patient preference versus randomization on the outcome of inpatient versus outpatient chronic pain management. Pain. 1999;83:57–65.

De Jong Z, Munneke M, Jansen LM, Ronday K, Van Schaardenburg DJ, Brand R, Van Den Ende CHM, Vliet Vlieland TPM, Zuijderduin WM, Hazes JMW. Differences between participants and nonparticipants in an exercise trial for adults with rheumatoid arthritis. Arthritis Care Res. 2004;51:593–600.

Hordijk-Trion M, Lenzen M, Wijns W, et al. Patients enrolled in coronary intervention trials are not representative of patients in clinical practice: Results from the Euro Heart Survey on Coronary Revascularization. Eur Heart J. 2006;27:671–8.

Wilson A, Parker H, Wynn A, Spiers N. Performance of hospital-at-home after a randomised controlled trial. J Heal Serv Res Policy. 2003;8:160–4.

Smyth B, Haber A, Trongtrakul K, Hawley C, Perkovic V, Woodward M, Jardine M. Representativeness of Randomized Clinical Trial Cohorts in End-stage Kidney Disease: A Meta-analysis. JAMA Intern Med. 2019;179:1316–24.

Leinonen A, Koponen M, Hartikainen S. Systematic Review: Representativeness of Participants in RCTs of Acetylcholinesterase Inhibitors. PLoS One. 2015;10:e0124500–e0124500.

Chari A, Romanus D, Palumbo A, Blazer M, Farrelly E, Raju A, Huang H, Richardson P. Randomized Clinical Trial Representativeness and Outcomes in Real-World Patients: Comparison of 6 Hallmark Randomized Clinical Trials of Relapsed/Refractory Multiple Myeloma. Clin Lymphoma Myeloma Leuk. 2020;20:8.

Susukida R, Crum RM, Ebnesajjad C, Stuart EA, Mojtabai R. Generalizability of findings from randomized controlled trials: application to the National Institute of Drug Abuse Clinical Trials Network. Addiction. 2017;112:1210–9.

Zarin DA, Young JL, West JC. Challenges to evidence-based medicine: a comparison of patients and treatments in randomized controlled trials with patients and treatments in a practice research network. Soc Psychiatry Psychiatr Epidemiol. 2005;40:27–35.

Gheorghe A, Roberts T, Hemming K, Calvert M. Evaluating the Generalisability of Trial Results: Introducing a Centre- and Trial-Level Generalisability Index. Pharmacoeconomics. 2015;33:1195–214.

He Z, Wang S, Borhanian E, Weng C. Assessing the Collective Population Representativeness of Related Type 2 Diabetes Trials by Combining Public Data from ClinicalTrials.gov and NHANES. Stud Health Technol Inform. 2015;216:569–73.

Schmidt AF, Groenwold RHH, van Delden JJM, van der Does Y, Klungel OH, Roes KCB, Hoes AW, van der Graaf R. Justification of exclusion criteria was underreported in a review of cardiovascular trials. J Clin Epidemiol. 2014;67:635–44.

Carr DB, Goudas LC, Balk EM, Bloch R, Ioannidis JP, Lau J. Evidence report on the treatment of pain in cancer patients. J Natl Cancer Inst Monogr. 2004;32:23–31.

Clegg A, Bryant J, Nicholson T, et al. Clinical and cost-effectiveness of donepezil, rivastigmine and galantamine for Alzheimer’s disease: a rapid and systematic review. Health Technol Assess (Rockv). 2001;5:1–136.

Foy R, Hempel S, Rubenstein L, Suttorp M, Seelig M, Shanman R, Shekelle PG. Meta-analysis: effect of interactive communication between collaborating primary care physicians and specialists. Ann Intern Med. 2010;152:247–58.

Haraldsson BG, Gross AR, Myers CD, Ezzo JM, Morien A, Goldsmith C, Peloso PM, Bronfort G. Massage for mechanical neck disorders. Cochrane database Syst Rev. 2006.  https://doi.org/10.1002/14651858.CD004871.pub3 .

Hawk C, Khorsan R, Lisi AJ, Ferrance RJ. Chiropractic care for nonmusculoskeletal conditions: a systematic review with implications for whole systems research. J Altern Complement Med. 2007;13:491–512.

Karjalainen K, Malmivaara A, van Tulder M, et al. Multidisciplinary rehabilitation for fibromyalgia and musculoskeletal pain in working age adults. Cochrane Database Syst Rev. 2000. https://doi.org/10.1002/14651858.CD001984 .

Liberati A, Himel HN, Chalmers TC. A quality assessment of randomized control trials of primary treatment of breast cancer. J Clin Oncol. 1986;4:942–51.

Averis A, Pearson A. Filling the gaps: identifying nursing research priorities through the analysis of completed systematic reviews. Jbi Reports. 2003;1:49–126.

Sorg C, Schmidt J, Büchler MW, Edler L, Märten A. Examination of external validity in randomized controlled trials for adjuvant treatment of pancreatic adenocarcinoma. Pancreas. 2009;38:542–50.

National Institute for Health and Care Excellence. Methods for the development of NICE public health guidance, Third edit. National Institute for Health and Care Excellence. 2012;  https://www.nice.org.uk/process/pmg4/chapter/introduction . Accessed 15 Apr 2020

U.S. Preventive Services Task Force. Criteria for Assessing External Validity (Generalizability) of Individual Studies. US Prev Serv Task Force Appendix VII. 2017;  https://uspreventiveservicestaskforce.org/uspstf/about-uspstf/methods-and-processes/procedure-manual/procedure-manual-appendix-vii-criteria-assessing-external-validity-generalizability-individual-studies . Accessed 15 Apr 2020.

National Health and Medical Research Council NHMRC handbooks. https://www.nhmrc.gov.au/about-us/publications/how-prepare-and-present-evidence-based-information-consumers-health-services#block-views-block-file-attachments-content-block-1 . Accessed 15 Apr 2020.

Loyka CM, Ruscio J, Edelblum AB, Hatch L, Wetreich B, Zabel A. Weighing people rather than food: A framework for examining external validity. Perspect Psychol Sci. 2020;15:483–96.

Fernandez-Hermida JR, Calafat A, Becoña E, Tsertsvadze A, Foxcroft DR. Assessment of generalizability, applicability and predictability (GAP) for evaluating external validity in studies of universal family-based prevention of alcohol misuse in young people: systematic methodological review of randomized controlled trials. Addiction. 2012;107:1570–9.

Clark E, Burkett K, Stanko-Lopp D. Let Evidence Guide Every New Decision (LEGEND): an evidence evaluation system for point-of-care clinicians and guideline development teams. J Eval Clin Pract. 2009;15:1054–60.

Bornhöft G, Maxion-Bergemann S, Wolf U, Kienle GS, Michalsen A, Vollmar HC, Gilbertson S, Matthiessen PF. Checklist for the qualitative evaluation of clinical studies with particular focus on external validity and model validity. BMC Med Res Methodol. 2006;6:56.

Cho MK, Bero LA. Instruments for assessing the quality of drug studies published in the medical literature. JAMA J Am Med Assoc. 1994;272:101–4.

Cho MK, Bero LA. The quality of drug studies published in symposium proceedings. Ann Intern Med 1996;124:485–489

van Tulder M, Furlan A, Bombardier C, Bouter L. Updated method guidelines for systematic reviews in the cochrane collaboration back review group. Spine (Phila Pa 1976). 2003;28:1290–9.

Estrada F, Atienzo EE, Cruz-Jiménez L, Campero L. A Rapid Review of Interventions to Prevent First Pregnancy among Adolescents and Its Applicability to Latin America. J Pediatr Adolesc Gynecol. 2021;34:491–503.

Khorsan R, Crawford C. How to assess the external validity and model validity of therapeutic trials: a conceptual approach to systematic review methodology. Evid Based Complement Alternat Med. 2014;2014:694804.

O’Connor SR, Tully MA, Ryan B, Bradley JM, Baxter GD, McDonough SM. Failure of a numerical quality assessment scale to identify potential risk of bias in a systematic review: a comparison study. BMC Res Notes. 2015;8:224.

Chalmers TC, Smith H, Blackburn B, Silverman B, Schroeder B, Reitman D, Ambroz A. A method for assessing the quality of a randomized control trial. Control Clin Trials. 1981;2:31–49.

Gartlehner G, Hansen RA, Nissman D, Lohr KN, Carey TS. A simple and valid tool distinguished efficacy from effectiveness studies. J Clin Epidemiol. 2006;59:1040–8.

Zettler LL, Speechley MR, Foley NC, Salter KL, Teasell RW. A scale for distinguishing efficacy from effectiveness was adapted and applied to stroke rehabilitation studies. J Clin Epidemiol. 2010;63:11–8.

Green LW, Glasgow RE. Evaluating the relevance, generalization, and applicability of research: issues in external validation and translation methodology. Eval Health Prof. 2006;29:126–53.

Glasgow RE, Vogt TM, Boles SM. Evaluating the public health impact of health promotion interventions: the RE-AIM framework. Am J Public Health. 1999;89:1322–7.

Mirza NA, Akhtar-Danesh N, Staples E, Martin L, Noesgaard C. Comparative Analysis of External Validity Reporting in Non-randomized Intervention Studies. Can J Nurs Res. 2014;46:47–64.

Laws RA, St George AB, Rychetnik L, Bauman AE. Diabetes prevention research: a systematic review of external validity in lifestyle interventions. Am J Prev Med. 2012;43:205–14.

Schünemann H, Brożek J, Guyatt G, Oxman A. Handbook for grading the quality of evidence and the strength of recommendations using the GRADE approach (updated October 2013). GRADE Work. Gr. 2013;  https://gdt.gradepro.org/app/handbook/handbook.html . Accessed 15 Apr 2020.

Wu XY, Chung VCH, Wong CHL, Yip BHK, Cheung WKW, Wu JCY. CHIMERAS showed better inter-rater reliability and inter-consensus reliability than GRADE in grading quality of evidence: A randomized controlled trial. Eur J Integr Med. 2018;23:116–22.

Meader N, King K, Llewellyn A, Norman G, Brown J, Rodgers M, Moe-Byrne T, Higgins JPT, Sowden A, Stewart G. A checklist designed to aid consistency and reproducibility of GRADE assessments: Development and pilot validation. Syst Rev. 2014. https://doi.org/10.1186/2046-4053-3-82 .

Llewellyn A, Whittington C, Stewart G, Higgins JP, Meader N. The Use of Bayesian Networks to Assess the Quality of Evidence from Research Synthesis: 2. Inter-Rater Reliability and Comparison with Standard GRADE Assessment. PLoS One. 2015;10:e0123511.

Jackson R, Ameratunga S, Broad J, Connor J, Lethaby A, Robb G, Wells S, Glasziou P, Heneghan C. The GATE frame: critical appraisal with pictures. Evid Based Med. 2006;11:35–38.

Aves T. The Role of Pragmatism in Explaining Heterogeneity in Meta-Analyses of Randomized Trials: A Methodological Review. 2017; McMaster University. http://hdl.handle.net/11375/22212 . Accessed 12 Jan 2021.

Thomas BH, Ciliska D, Dobbins M, Micucci S. A process for systematically reviewing the literature: providing the research evidence for public health nursing interventions. Worldviews Evidence-Based Nurs. 2004;1:176–84.

Armijo-Olivo S, Stiles CR, Hagen NA, Biondo PD, Cummings GG. Assessment of study quality for systematic reviews: a comparison of the Cochrane Collaboration Risk of Bias Tool and the Effective Public Health Practice Project Quality Assessment Tool: methodological research. J Eval Clin Pract. 2012;18:12–8.

Critical Appraisal Skills Programme. CASP Randomised Controlled Trial Standard Checklist. 2020;  https://casp-uk.net/casp-tools-checklists/ . Accessed 10 Dec 2020.

Aves T, Allan KS, Lawson D, Nieuwlaat R, Beyene J, Mbuagbaw L. The role of pragmatism in explaining heterogeneity in meta-analyses of randomised trials: a protocol for a cross-sectional methodological review. BMJ Open. 2017;7:e017887.

Diamantopoulos A, Riefler P, Roth KP. Advancing formative measurement models. J Bus Res. 2008;61:1203–18.

Fayers PM, Hand DJ. Factor analysis, causal indicators and quality of life. Qual Life Res. 1997. https://doi.org/10.1023/A:1026490117121 .

Streiner DL. Being Inconsistent About Consistency: When Coefficient Alpha Does and Doesn’t Matter. J Pers Assess. 2003;80:217–22.

MacKenzie SB, Podsakoff PM, Jarvis CB. The Problem of Measurement Model Misspecification in Behavioral and Organizational Research and Some Recommended Solutions. J Appl Psychol. 2005;90:710–30.

De Vet HCW, Terwee CB, Mokkink LB, Knol DL. Measurement in medicine: a practical guide. 2011;  https://doi.org/10.1017/CBO9780511996214

Dekkers OM, Bossuyt PM, Vandenbroucke JP. How trial results are intended to be used: is PRECIS-2 a step forward? J Clin Epidemiol. 2017;84:25–6.

Brozek JL, Canelo-Aybar C, Akl EA, et al. GRADE Guidelines 30: the GRADE approach to assessing the certainty of modeled evidence-An overview in the context of health decision-making. J Clin Epidemiol. 2021;129:138–50.

Burchett HED, Kneale D, Blanchard L, Thomas J. When assessing generalisability, focusing on differences in population or setting alone is insufficient. Trials. 2020;21:286.

Verhagen AP, de Vet HCW, de Bie RA, Kessels AGH, Boers M, Bouter LM, Knipschild PG. The Delphi List: A Criteria List for Quality Assessment of Randomized Clinical Trials for Conducting Systematic Reviews Developed by Delphi Consensus. J Clin Epidemiol. 1998;51:1235–41.

Streiner DL, Norman GR, Cairney J. Health measurement scales: a practical guide to their development and use, Fifth edit. Oxford: Oxford University Press; 2015.

DeVellis RF. Scale development: Theory and applications, Fourth edi. Los Angeles: Sage publications; 2017.

Acknowledgements

We would like to thank Sven Bossmann and Sarah Tiemann for their assistance with the elaboration of the search strategy.

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and affiliations

Institute of Health Sciences, Department of Physiotherapy, Pain and Exercise Research Luebeck (P.E.R.L), Universität zu Lübeck, Ratzeburger Allee 160, 23562, Lübeck, Germany

Andres Jung & Kerstin Luedtke

Faculty of Applied Public Health, European University of Applied Sciences, Werftstr. 5, 18057, Rostock, Germany

Julia Balzer

Division of Physiotherapy, Department of Applied Health Sciences, Hochschule für Gesundheit (University of Applied Sciences), Gesundheitscampus 6‑8, 44801, Bochum, Germany

Tobias Braun

Department of Health, HSD Hochschule Döpfer (University of Applied Sciences), Waidmarkt 9, 50676, Cologne, Germany

Contributions

All authors contributed to the design of the study. AJ designed the search strategy and conducted the systematic search. AJ and TB screened titles and abstracts as well as full-text reports in phase 1; AJ and KL screened titles and abstracts as well as full-text reports in phase 2. Data extraction was performed by AJ and checked by TB. Quality appraisal and data analysis were performed by AJ and JB. AJ drafted the manuscript. JB, TB and KL critically revised the manuscript for important intellectual content. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Andres Jung.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Additional file 2.

Additional file 3.

Additional file 4.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Jung, A., Balzer, J., Braun, T. et al. Identification of tools used to assess the external validity of randomized controlled trials in reviews: a systematic review of measurement properties. BMC Med Res Methodol 22 , 100 (2022). https://doi.org/10.1186/s12874-022-01561-5

Received : 20 August 2021

Accepted : 28 February 2022

Published : 06 April 2022

DOI : https://doi.org/10.1186/s12874-022-01561-5


Keywords
  • External validity
  • Generalizability
  • Applicability
  • Randomized controlled trial



The Classroom

How to Validate a Research Instrument


In the field of psychology, research is a necessary component of determining whether a given treatment is effective and whether our current understanding of human behavior is accurate. The instruments used to collect and evaluate research data must therefore be valid and precise. If they are not, the information collected in a study is likely to be biased or factually flawed, doing more harm than good.

Protect construct validity. A construct is the behavior or outcome a researcher seeks to measure within a study, typically captured by the dependent variable. It is therefore important to operationalize, or precisely define, the construct. For example, if you are studying depression but only measure the number of times a person cries, your measure does not capture the construct and your research will likely be skewed.

Protect internal validity. Internal validity refers to how free your experiment is from outside influences that could taint its results. A research instrument that takes students’ grades into account but not their developmental age, for example, is not a valid determinant of intelligence: because test scores vary across age brackets, a valid instrument must control for such differences and isolate true scores.

Protect external validity. External validity refers to how well your study reflects the real world and not just an artificial situation. An instrument may work perfectly with a group of white male college students but this does not mean its results are generalizable to children, blue-collar adults or those of varied gender and ethnicity. For an instrument to have high external validity, it must be applicable to a diverse group of people and a wide array of natural environments.

Protect conclusion validity. When the study is complete, researchers may still invalidate their data by making a conclusion error. Essentially, there are two types to guard against. A Type I error is concluding that a relationship exists between experimental variables when, in fact, there is none (for example, when an apparent correlation is merely the result of chance or flawed data). Conversely, a Type II error is concluding that no relationship exists when, in fact, there is one.
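
As a toy illustration of these two error types (made-up data; assumes SciPy is installed), consider a simple two-group comparison at the conventional alpha level of 0.05:

```python
import numpy as np
from scipy import stats

# With alpha = 0.05: rejecting a true null hypothesis would be a Type I
# error (false positive); failing to reject a false null hypothesis would
# be a Type II error (false negative).

rng = np.random.default_rng(42)
control = rng.normal(loc=0.0, scale=1.0, size=30)
treated = rng.normal(loc=0.5, scale=1.0, size=30)   # a true effect exists here

t_stat, p_value = stats.ttest_ind(treated, control)
if p_value < 0.05:
    print(f"Reject H0 (p = {p_value:.3f}): wrong only if no true effect existed (Type I).")
else:
    print(f"Fail to reject H0 (p = {p_value:.3f}): wrong here, since a true effect exists (Type II).")
```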

Validating instruments and conducting experimental research is an extensive area within the mental health profession and should never be taken lightly. For an in-depth treatment of this topic, see Research Design in Counseling (3rd edition) by Heppner, Wampold, & Kivlighan.



Uncomplicated Reviews of Educational Research Methods

  • Instrument, Validity, Reliability


Part I: The Instrument

Instrument is the general term that researchers use for a measurement device (survey, test, questionnaire, etc.). To help distinguish between instrument and instrumentation, consider that the instrument is the device and instrumentation is the course of action (the process of developing, testing, and using the device).

Instruments fall into two broad categories, researcher-completed and subject-completed, distinguished by whether the researcher administers the instrument or participants complete it themselves. Researchers choose which type of instrument, or instruments, to use based on the research question.

Usability refers to the ease with which an instrument can be administered, interpreted by the participant, and scored/interpreted by the researcher. Example usability problems include:

  • Students are asked to rate a lesson immediately after class, but there are only a few minutes before the next class begins (problem with administration).
  • Students are asked to keep self-checklists of their after school activities, but the directions are complicated and the item descriptions confusing (problem with interpretation).
  • Teachers are asked about their attitudes regarding school policy, but some questions are worded poorly which results in low completion rates (problem with scoring/interpretation).

Validity and reliability concerns (discussed below) will help alleviate usability issues. For now, we can identify five usability considerations:

  • How long will it take to administer?
  • Are the directions clear?
  • How easy is it to score?
  • Do equivalent forms exist?
  • Have any problems been reported by others who used it?

It is best to use an existing instrument, one that has been developed and tested numerous times, such as can be found in the Mental Measurements Yearbook . We will turn to why next.

Part II: Validity

Validity is the extent to which an instrument measures what it is supposed to measure and performs as it is designed to perform. It is rare, if not impossible, for an instrument to be 100% valid, so validity is generally measured in degrees. As a process, validation involves collecting and analyzing data to assess the accuracy of an instrument. There are numerous statistical tests and measures to assess the validity of quantitative instruments, a process that generally involves pilot testing. The remainder of this discussion focuses on external validity and content validity.

External validity is the extent to which the results of a study can be generalized from a sample to a population. Establishing external validity for an instrument, then, follows directly from sampling. Recall that a sample should be an accurate representation of a population, because the total population may not be available. An instrument that is externally valid helps obtain population generalizability, or the degree to which a sample represents the population.

Content validity refers to the appropriateness of the content of an instrument. In other words, do the measures (questions, observation logs, etc.) accurately assess what you want to know? This is particularly important with achievement tests. Consider that a test developer wants to maximize the validity of a unit test for 7th grade mathematics. This would involve taking representative questions from each of the sections of the unit and evaluating them against the desired outcomes.

Part III: Reliability

Reliability can be thought of as consistency. Does the instrument consistently measure what it is intended to measure? Reliability cannot be calculated exactly; it is estimated, and there are four general estimators that you may encounter in reading research:

  • Inter-Rater/Observer Reliability : The degree to which different raters/observers give consistent answers or estimates.
  • Test-Retest Reliability : The consistency of a measure evaluated over time.
  • Parallel-Forms Reliability: The reliability of two tests constructed the same way, from the same content.
  • Internal Consistency Reliability: The consistency of results across items, often measured with Cronbach’s Alpha.
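As a rough illustration of how Cronbach’s Alpha is estimated from a respondents-by-items matrix, here is a minimal sketch in Python; the response data are hypothetical and the function is only a bare-bones implementation of the standard formula.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Estimate internal consistency for a respondents-by-items matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses: 5 respondents answering 4 Likert-type items
responses = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
])
print(round(cronbach_alpha(responses), 2))
```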

Relating Reliability and Validity

Reliability is directly related to the validity of the measure. There are several important principles. First, a test can be reliable but not valid. Consider the SAT, used as a predictor of success in college: it is a reliable test (high scores relate consistently to high GPA), but only a moderately valid indicator of success, because it cannot capture factors such as class attendance, parent-regulated study, and sleeping habits, each of which also relates to success.

Second, validity is more important than reliability. Using the above example, college admissions may consider the SAT a reliable test, but not necessarily a valid measure of other quantities colleges seek, such as leadership capability, altruism, and civic involvement. The combination of these aspects, alongside the SAT, is a more valid measure of the applicant’s potential for graduation, later social involvement, and generosity (alumni giving) toward the alma mater.

Finally, the most useful instrument is both valid and reliable. Proponents of the SAT argue that it is both. It is a moderately reliable predictor of future success and a moderately valid measure of a student’s knowledge in Mathematics, Critical Reading, and Writing.

Part IV: Validity and Reliability in Qualitative Research

Thus far, we have discussed instrumentation as related to mostly quantitative measurement. Establishing validity and reliability in qualitative research can be less precise, though participant/member checks, peer evaluation (another researcher checks the researcher’s inferences based on the instrument; Denzin & Lincoln, 2005 ), and multiple methods (keyword: triangulation ) are convincingly used. Some qualitative researchers reject the concept of validity due to the constructivist viewpoint that reality is unique to the individual and cannot be generalized. These researchers argue for a different standard for judging research quality. For a more complete discussion of trustworthiness, see Lincoln and Guba’s (1985) chapter .


Research-Methodology

Research validity in surveys relates to the extent to which the survey measures the elements that need to be measured. In simple terms, validity refers to how well an instrument measures what it is intended to measure.

Reliability alone is not enough; measures need to be reliable as well as valid. For example, if a weighing scale is consistently wrong by 4 kg (it deducts 4 kg from the actual weight), it can be described as reliable, because it displays the same weight every time a specific item is measured. However, the scale is not valid, because it does not display the actual weight of the item.

Research validity can be divided into two groups: internal and external. It can be specified that “internal validity refers to how the research findings match reality, while external validity refers to the extent to which the research findings can be replicated to other environments” (Pelissier, 2008, p. 12).

Validity can also be divided into five types:

1. Face Validity is the most basic type of validity and is associated with the highest level of subjectivity, because it is not based on any scientific approach. In other words, a test may be judged valid by a researcher simply because it seems valid, without any in-depth scientific justification.

Example: a questionnaire designed for a study that analyses issues of employee performance can be assessed as valid because each individual question seems to address specific and relevant aspects of employee performance.

2. Construct Validity relates to the assessment of the suitability of a measurement tool for measuring the phenomenon being studied. Establishing construct validity can be effectively facilitated by involving a panel of ‘experts’ closely familiar with both the measure and the phenomenon.

Example: with the application of construct validity, the level of leadership competency in a given organisation could be assessed by devising a questionnaire for operational-level employees that asks about their motivation to perform their duties on a daily basis.

3. Criterion-Related Validity involves comparing test results with an outcome. This type of validity correlates the results of one assessment with another criterion of assessment.

Example: customer perceptions of a specific company’s brand image can be assessed by organising a focus group. The same issue can also be assessed through a questionnaire answered by current and potential customers of the brand. The higher the correlation between the focus group and questionnaire findings, the higher the level of criterion-related validity.
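In practice, criterion-related validity is usually quantified as a correlation coefficient between scores on the instrument and scores on the criterion measure. A minimal sketch, assuming hypothetical scores for ten participants on both measures:

```python
from scipy.stats import pearsonr

# Hypothetical scores for the same ten participants on the new questionnaire
# and on an established criterion measure (e.g., ratings derived from a focus group)
questionnaire = [12, 15, 9, 20, 17, 11, 14, 18, 10, 16]
criterion     = [14, 16, 10, 21, 15, 12, 13, 19, 11, 18]

r, p_value = pearsonr(questionnaire, criterion)
print(f"criterion-related validity coefficient r = {r:.2f} (p = {p_value:.3f})")
```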

4. Formative Validity refers to the assessment of the effectiveness of a measure in terms of providing information that can be used to improve specific aspects of the phenomenon.

Example: when developing initiatives to increase the effectiveness of organisational culture, if the measure is able to identify specific weaknesses such as employee-manager communication barriers, then its formative validity can be assessed as adequate.

5. Sampling Validity (similar to content validity) ensures that the measure covers the broad range of areas relevant to the phenomenon under study. No measure can cover all items and elements within a phenomenon, so important items and elements are selected using a sampling method appropriate to the aims and objectives of the study.

Example: when assessing the leadership style exercised in a specific organisation, an assessment of decision-making style alone would not suffice; other issues related to leadership style, such as organisational culture, the personality of leaders and the nature of the industry, need to be taken into account as well.

John Dudovskiy


Validation of a new assessment tool for qualitative research articles

Affiliation.

  • 1 Telemedicine Research Unit, Frederiksberg University Hospital, Copenhagen, Denmark. [email protected]
  • PMID: 22168459
  • DOI: 10.1111/j.1365-2648.2011.05898.x

Aim: This paper presents the development and validation of a new assessment tool for qualitative research articles, which could assess trustworthiness of qualitative research articles as defined by Guba and at the same time aid clinicians in their assessment.

Background: There are more than 100 sets of proposals for quality criteria for qualitative research. However, we are not aware of an assessment tool that is validated and applicable, not only for researchers but also for clinicians with different levels of training and experience in reading research articles.

Method: In three phases from 2007 to 2009, we developed and tested such an assessment tool, called VAKS, which is the Danish acronym for appraisal of qualitative studies. Phase 1 was to develop the tool based on a literature review and consultation with qualitative researchers. Phase 2 was an inter-rater reliability test in which 40 health professionals participated. Phase 3 was an inter-rater reliability test among the five authors using five qualitative articles.

Results: The new assessment tool was based on Guba's four criteria for assessing the trustworthiness of qualitative inquiries. The nurses found the assessment tool simple to use and helpful in assessing the quality of the articles. The inter-rater agreement was acceptable, but disagreement was seen for some items.

Conclusion: We have developed an assessment tool for appraisal of qualitative research studies. Nurses with a range of formal education and experience in reading research articles are able to appraise, relatively consistently, articles based on different qualitative research designs. We hope that VAKS will be used and further developed.

© 2011 Blackwell Publishing Ltd.

Publication types

  • Research Support, Non-U.S. Gov't
  • Validation Study

MeSH terms

  • Confidence Intervals
  • Health Services Research / standards*
  • Qualitative Research*
  • Reproducibility of Results
  • Research Design

  • Data Descriptor
  • Open access
  • Published: 03 May 2024

A dataset for measuring the impact of research data and their curation

  • Libby Hemphill   ORCID: orcid.org/0000-0002-3793-7281 1 , 2 ,
  • Andrea Thomer 3 ,
  • Sara Lafia 1 ,
  • Lizhou Fan 2 ,
  • David Bleckley   ORCID: orcid.org/0000-0001-7715-4348 1 &
  • Elizabeth Moss 1  

Scientific Data volume  11 , Article number:  442 ( 2024 ) Cite this article


  • Research data
  • Social sciences

Science funders, publishers, and data archives make decisions about how to responsibly allocate resources to maximize the reuse potential of research data. This paper introduces a dataset developed to measure the impact of archival and data curation decisions on data reuse. The dataset describes 10,605 social science research datasets, their curation histories, and reuse contexts in 94,755 publications that cover 59 years from 1963 to 2022. The dataset was constructed from study-level metadata, citing publications, and curation records available through the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan. The dataset includes information about study-level attributes (e.g., PIs, funders, subject terms); usage statistics (e.g., downloads, citations); archiving decisions (e.g., curation activities, data transformations); and bibliometric attributes (e.g., journals, authors) for citing publications. This dataset provides information on factors that contribute to long-term data reuse, which can inform the design of effective evidence-based recommendations to support high-impact research data curation decisions.


Background & Summary

Recent policy changes in funding agencies and academic journals have increased data sharing among researchers and between researchers and the public. Data sharing advances science and provides the transparency necessary for evaluating, replicating, and verifying results. However, many data-sharing policies do not explain what constitutes an appropriate dataset for archiving or how to determine the value of datasets to secondary users 1 , 2 , 3 . Questions about how to allocate data-sharing resources efficiently and responsibly have gone unanswered 4 , 5 , 6 . For instance, data-sharing policies recognize that not all data should be curated and preserved, but they do not articulate metrics or guidelines for determining what data are most worthy of investment.

Despite the potential for innovation and advancement that data sharing holds, the best strategies to prioritize datasets for preparation and archiving are often unclear. Some datasets are likely to have more downstream potential than others, and data curation policies and workflows should prioritize high-value data instead of being one-size-fits-all. Though prior research in library and information science has shown that the “analytic potential” of a dataset is key to its reuse value 7 , work is needed to implement conceptual data reuse frameworks 8 , 9 , 10 , 11 , 12 , 13 , 14 . In addition, publishers and data archives need guidance to develop metrics and evaluation strategies to assess the impact of datasets.

Several existing resources have been compiled to study the relationship between the reuse of scholarly products, such as datasets (Table  1 ); however, none of these resources include explicit information on how curation processes are applied to data to increase their value, maximize their accessibility, and ensure their long-term preservation. The CCex (Curation Costs Exchange) provides models of curation services along with cost-related datasets shared by contributors but does not make explicit connections between them or include reuse information 15 . Analyses on platforms such as DataCite 16 have focused on metadata completeness and record usage, but have not included related curation-level information. Analyses of GenBank 17 and FigShare 18 , 19 citation networks do not include curation information. Related studies of Github repository reuse 20 and Softcite software citation 21 reveal significant factors that impact the reuse of secondary research products but do not focus on research data. RD-Switchboard 22 and DSKG 23 are scholarly knowledge graphs linking research data to articles, patents, and grants, but largely omit social science research data and do not include curation-level factors. To our knowledge, other studies of curation work in organizations similar to ICPSR – such as GESIS 24 , Dataverse 25 , and DANS 26 – have not made their underlying data available for analysis.

This paper describes a dataset 27 compiled for the MICA project (Measuring the Impact of Curation Actions) led by investigators at ICPSR, a large social science data archive at the University of Michigan. The dataset was originally developed to study the impacts of data curation and archiving on data reuse. The MICA dataset has supported several previous publications investigating the intensity of data curation actions 28 , the relationship between data curation actions and data reuse 29 , and the structures of research communities in a data citation network 30 . Collectively, these studies help explain the return on various types of curatorial investments. The dataset that we introduce in this paper, which we refer to as the MICA dataset, has the potential to address research questions in the areas of science (e.g., knowledge production), library and information science (e.g., scholarly communication), and data archiving (e.g., reproducible workflows).

We constructed the MICA dataset 27 using records available at ICPSR, a large social science data archive at the University of Michigan. Data set creation involved: collecting and enriching metadata for articles indexed in the ICPSR Bibliography of Data-related Literature against the Dimensions AI bibliometric database; gathering usage statistics for studies from ICPSR’s administrative database; processing data curation work logs from ICPSR’s project tracking platform, Jira; and linking data in social science studies and series to citing analysis papers (Fig.  1 ).

Figure 1. Steps to prepare the MICA dataset for analysis: external sources are red, primary internal sources are blue, and internal linked sources are green.

Enrich paper metadata

The ICPSR Bibliography of Data-related Literature is a growing database of literature in which data from ICPSR studies have been used. Its creation was funded by the National Science Foundation (Award 9977984), and for the past 20 years it has been supported by ICPSR membership and multiple US federally funded and foundation-funded topical archives at ICPSR. The Bibliography was originally launched in the year 2000 to aid in data discovery by providing a searchable database linking publications to the study data used in them. The Bibliography collects the universe of output based on the data shared in each study, and it is made available through each ICPSR study’s webpage. The Bibliography contains both peer-reviewed and grey literature, which provides evidence for measuring the impact of research data. For an item to be included in the ICPSR Bibliography, it must contain an analysis of data archived by ICPSR or contain a discussion or critique of the data collection process, study design, or methodology 31 . The Bibliography is manually curated by a team of librarians and information specialists at ICPSR who enter and validate entries. Some publications are supplied to the Bibliography by data depositors, and some citations are submitted by authors who abide by ICPSR’s terms of use, which require them to submit citations to works in which they analyzed data retrieved from ICPSR. Most of the Bibliography is populated by Bibliography team members, who create custom queries for ICPSR studies across numerous sources, including Google Scholar, ProQuest, SSRN, and others. Each record in the Bibliography is one publication that has used one or more ICPSR studies. The version we used was captured on 2021-11-16 and included 94,755 publications.

To expand the coverage of the ICPSR Bibliography, we searched exhaustively for all ICPSR study names, unique numbers assigned to ICPSR studies, and DOIs 32 using a full-text index available through the Dimensions AI database 33 . We accessed Dimensions through a license agreement with the University of Michigan. ICPSR Bibliography librarians and information specialists manually reviewed and validated new entries that matched one or more search criteria. We then used Dimensions to gather enriched metadata and full-text links for items in the Bibliography with DOIs. We matched 43% of the items in the Bibliography to enriched Dimensions metadata including abstracts, field of research codes, concepts, and authors’ institutional information; we also obtained links to full text for 16% of Bibliography items. Based on licensing agreements, we included Dimensions identifiers and links to full text so that users with valid publisher and database access can construct an enriched publication dataset.

Gather study usage data

ICPSR maintains a relational administrative database, DBInfo, that organizes study-level metadata and information on data reuse across separate tables. Studies at ICPSR consist of one or more files collected at a single time or for a single purpose; studies in which the same variables are observed over time are grouped into series. Each study at ICPSR is assigned a DOI, and its metadata are stored in DBInfo. Study metadata follows the Data Documentation Initiative (DDI) Codebook 2.5 standard. DDI elements included in our dataset are title, ICPSR study identification number, DOI, authoring entities, description (abstract), funding agencies, subject terms assigned to the study during curation, and geographic coverage. We also created variables based on DDI elements: total variable count, the presence of survey question text in the metadata, the number of author entities, and whether an author entity was an institution. We gathered metadata for ICPSR’s 10,605 unrestricted public-use studies available as of 2021-11-16 ( https://www.icpsr.umich.edu/web/pages/membership/or/metadata/oai.html ).

To link study usage data with study-level metadata records, we joined study metadata from DBInfo with study usage information, which included total study downloads (data and documentation), individual data file downloads, and cumulative citations from the ICPSR Bibliography. We also gathered descriptive metadata for each study and its variables, which allowed us to summarize and append recoded fields onto the study-level metadata, such as curation level, number and type of principal investigators, total variable count, and binary variables indicating whether the study data were made available for online analysis, whether survey question text was made searchable online, and whether the study variables were indexed for search. These characteristics describe aspects of the discoverability of the data to compare with other characteristics of the study. We used the study and series numbers included in the ICPSR Bibliography as unique identifiers to link papers to metadata and analyze the community structure of dataset co-citations in the ICPSR Bibliography 32 .

Process curation work logs

Researchers deposit data at ICPSR for curation and long-term preservation. Between 2016 and 2020, more than 3,000 research studies were deposited with ICPSR. Since 2017, ICPSR has organized curation work into a central unit that provides several levels of curation, which vary in the intensity and complexity of the data enhancement they involve. While the levels of curation are standardized as to effort (level one = least effort, level three = most effort), the specific curatorial actions undertaken for each dataset vary. The specific curation actions are captured in Jira, a work tracking program, which data curators at ICPSR use to collaborate and communicate their progress through tickets. We obtained access to a corpus of 669 completed Jira tickets corresponding to the curation of 566 unique studies between February 2017 and December 2019 28 .

To process the tickets, we focused only on their work log portions, which contained free text descriptions of work that data curators had performed on a deposited study, along with the curators’ identifiers, and timestamps. To protect the confidentiality of the data curators and the processing steps they performed, we collaborated with ICPSR’s curation unit to propose a classification scheme, which we used to train a Naive Bayes classifier and label curation actions in each work log sentence. The eight curation action labels we proposed 28 were: (1) initial review and planning, (2) data transformation, (3) metadata, (4) documentation, (5) quality checks, (6) communication, (7) other, and (8) non-curation work. We note that these categories of curation work are very specific to the curatorial processes and types of data stored at ICPSR, and may not match the curation activities at other repositories. After applying the classifier to the work log sentences, we obtained summary-level curation actions for a subset of all ICPSR studies (5%), along with the total number of hours spent on data curation for each study, and the proportion of time associated with each action during curation.
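For illustration only, the sketch below shows how a sentence-level Naive Bayes classifier of this general kind could be trained with scikit-learn. The work-log sentences, labels, and feature choices are hypothetical and do not reproduce ICPSR’s actual training data or pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical work-log sentences paired with curation-action labels
sentences = [
    "Reviewed deposit and drafted processing plan",
    "Recoded missing values and dropped duplicate records",
    "Added subject terms and geographic coverage to the study record",
    "Emailed the PI to confirm variable labels",
]
labels = [
    "initial review and planning",
    "data transformation",
    "metadata",
    "communication",
]

# TF-IDF features feeding a multinomial Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(sentences, labels)

# Label a new (hypothetical) work-log sentence
print(model.predict(["Standardized date variables and re-exported the data file"]))
```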

Data Records

The MICA dataset 27 connects records for each of ICPSR’s archived research studies to the research publications that use them and related curation activities available for a subset of studies (Fig.  2 ). Each of the three tables published in the dataset is available as a study archived at ICPSR. The data tables are distributed as statistical files available for use in SAS, SPSS, Stata, and R as well as delimited and ASCII text files. The dataset is organized around studies and papers as primary entities. The studies table lists ICPSR studies, their metadata attributes, and usage information; the papers table was constructed using the ICPSR Bibliography and Dimensions database; and the curation logs table summarizes the data curation steps performed on a subset of ICPSR studies.

Studies (“ICPSR_STUDIES”): 10,605 social science research datasets available through ICPSR up to 2021-11-16 with variables for ICPSR study number, digital object identifier, study name, series number, series title, authoring entities, full-text description, release date, funding agency, geographic coverage, subject terms, topical archive, curation level, single principal investigator (PI), institutional PI, the total number of PIs, total variables in data files, question text availability, study variable indexing, level of restriction, total unique users downloading study data files and codebooks, total unique users downloading data only, and total unique papers citing data through November 2021. Studies map to the papers and curation logs table through ICPSR study numbers as “STUDY”. However, not every study in this table will have records in the papers and curation logs tables.

Papers (“ICPSR_PAPERS”): 94,755 publications collected from 2000-08-11 to 2021-11-16 in the ICPSR Bibliography and enriched with metadata from the Dimensions database with variables for paper number, identifier, title, authors, publication venue, item type, publication date, input date, ICPSR series numbers used in the paper, ICPSR study numbers used in the paper, the Dimension identifier, and the Dimensions link to the publication’s full text. Papers map to the studies table through ICPSR study numbers in the “STUDY_NUMS” field. Each record represents a single publication, and because a researcher can use multiple datasets when creating a publication, each record may list multiple studies or series.

Curation logs (“ICPSR_CURATION_LOGS”): 649 curation logs for 563 ICPSR studies (although most studies in the subset had one curation log, some studies were associated with multiple logs, with a maximum of 10) curated between February 2017 and December 2019 with variables for study number, action labels assigned to work description sentences using a classifier trained on ICPSR curation logs, hours of work associated with a single log entry, and total hours of work logged for the curation ticket. Curation logs map to the study and paper tables through ICPSR study numbers as “STUDY”. Each record represents a single logged action, and future users may wish to aggregate actions to the study level before joining tables.

Figure 2. Entity-relation diagram.

Technical Validation

We report on the reliability of the dataset’s metadata in the following subsections. To support future reuse of the dataset, curation services provided through ICPSR improved data quality by checking for missing values, adding variable labels, and creating a codebook.

All 10,605 studies available through ICPSR have a DOI and a full-text description summarizing what the study is about, the purpose of the study, the main topics covered, and the questions the PIs attempted to answer when they conducted the study. Personal names (i.e., principal investigators) and organizational names (i.e., funding agencies) are standardized against an authority list maintained by ICPSR; geographic names and subject terms are also standardized and hierarchically indexed in the ICPSR Thesaurus 34 . Many of ICPSR’s studies (63%) are in a series and are distributed through the ICPSR General Archive (56%), a non-topical archive that accepts any social or behavioral science data. While study data have been available through ICPSR since 1962, the earliest digital release date recorded for a study was 1984-03-18, when ICPSR’s database was first employed, and the most recent date is 2021-10-28 when the dataset was collected.

Curation level information was recorded starting in 2017 and is available for 1,125 studies (11%); approximately 80% of studies with assigned curation levels received curation services, equally distributed between Levels 1 (least intensive), 2 (moderately intensive), and 3 (most intensive) (Fig.  3 ). Detailed descriptions of ICPSR’s curation levels are available online 35 . Additional metadata are available for a subset of 421 studies (4%), including information about whether the study has a single PI, has an institutional PI, the total number of PIs involved, the total number of variables recorded, and whether the study is available for online analysis, has searchable question text, has variables that are indexed for search, contains one or more restricted files, or is completely restricted. We provided additional metadata for this subset of ICPSR studies because they were released within the past five years and detailed curation and usage information were available for them. Usage statistics including total downloads and data file downloads are available for this subset of studies as well; citation statistics are available for 8,030 studies (76%). Most ICPSR studies have fewer than 500 users, as indicated by total downloads, or citations (Fig.  4 ).

Figure 3. ICPSR study curation levels.

Figure 4. ICPSR study usage.

A subset of 43,102 publications (45%) available in the ICPSR Bibliography had a DOI. Author metadata were entered as free text, meaning that variations may exist and require additional normalization and pre-processing prior to analysis. While author information is standardized for each publication, individual names may appear in different sort orders (e.g., “Earls, Felton J.” and “Stephen W. Raudenbush”). Most of the items in the ICPSR Bibliography as of 2021-11-16 were journal articles (59%), reports (14%), conference presentations (9%), or theses (8%) (Fig.  5 ). The number of publications collected in the Bibliography has increased each decade since the inception of ICPSR in 1962 (Fig.  6 ). Most ICPSR studies (76%) have one or more citations in a publication.

Figure 5. ICPSR Bibliography citation types.

Figure 6. ICPSR citations by decade.

Usage Notes

The dataset consists of three tables that can be joined using the “STUDY” key as shown in Fig.  2 . The “ICPSR_PAPERS” table contains one row per paper with one or more cited studies in the “STUDY_NUMS” column. We manipulated and analyzed the tables as CSV files with the Pandas library 36 in Python and the Tidyverse packages 37 in R.
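As a minimal sketch of this joining step in Python, assuming CSV exports of the three tables, a semicolon-delimited “STUDY_NUMS” field, and an “HOURS” column in the curation log table (these file names and details are assumptions rather than documented specifics of the released files):

```python
import pandas as pd

# Assumed CSV exports of the three MICA tables; the published files are also
# distributed as SAS, SPSS, Stata, R, and delimited text files.
studies = pd.read_csv("ICPSR_STUDIES.csv", dtype={"STUDY": str})
papers = pd.read_csv("ICPSR_PAPERS.csv", dtype=str)
curation = pd.read_csv("ICPSR_CURATION_LOGS.csv", dtype={"STUDY": str})

# A paper can cite several studies, so split STUDY_NUMS (delimiter assumed to
# be a semicolon) into one row per cited study.
papers_long = (
    papers.assign(STUDY=papers["STUDY_NUMS"].str.split(";"))
    .explode("STUDY")
)
papers_long["STUDY"] = papers_long["STUDY"].str.strip()

# Aggregate logged curation hours to the study level before joining
# ("HOURS" is an assumed name for the hours-per-log-entry variable).
curation["HOURS"] = pd.to_numeric(curation["HOURS"], errors="coerce")
hours_by_study = curation.groupby("STUDY", as_index=False)["HOURS"].sum()

# Join citing papers and curation summaries onto study-level metadata.
merged = (
    studies.merge(papers_long, on="STUDY", how="left")
    .merge(hours_by_study, on="STUDY", how="left")
)
print(merged.shape)
```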

The present MICA dataset can be used independently to study the relationship between curation decisions and data reuse. Evidence of reuse for specific studies is available in several forms: usage information, including downloads and citation counts; and citation contexts within papers that cite data. Analysis may also be performed on the citation network formed between datasets and papers that use them. Finally, curation actions can be associated with properties of studies and usage histories.

This dataset has several limitations of which users should be aware. First, Jira tickets can only be used to represent the intensiveness of curation for activities undertaken since 2017, when ICPSR started using both Curation Levels and Jira. Studies published before 2017 were all curated, but documentation of the extent of that curation was not standardized and therefore could not be included in these analyses. Second, the measure of publications relies upon the authors’ clarity of data citation and the ICPSR Bibliography staff’s ability to discover citations with varying formality and clarity. Thus, there is always a chance that some secondary-data-citing publications have been left out of the Bibliography. Finally, there may be some cases in which a paper in the ICPSR Bibliography did not actually obtain data from ICPSR. For example, PIs have often written about or even distributed their data prior to their archival at ICPSR. Those publications would not have cited ICPSR, but they are still collected in the Bibliography as being directly related to the data that were eventually deposited at ICPSR.

In summary, the MICA dataset contains relationships between two main types of entities – papers and studies – which can be mined. The tables in the MICA dataset have supported network analysis (community structure and clique detection) 30 ; natural language processing (NER for dataset reference detection) 32 ; visualizing citation networks (to search for datasets) 38 ; and regression analysis (on curation decisions and data downloads) 29 . The data are currently being used to develop research metrics and recommendation systems for research data. Given that DOIs are provided for ICPSR studies and articles in the ICPSR Bibliography, the MICA dataset can also be used with other bibliometric databases, including DataCite, Crossref, OpenAlex, and related indexes. Subscription-based services, such as Dimensions AI, are also compatible with the MICA dataset. In some cases, these services provide abstracts or full text for papers from which data citation contexts can be extracted for semantic content analysis.

Code availability

The code 27 used to produce the MICA project dataset is available on GitHub at https://github.com/ICPSR/mica-data-descriptor and through Zenodo with the identifier https://doi.org/10.5281/zenodo.8432666 . Data manipulation and pre-processing were performed in Python. Data curation for distribution was performed in SPSS.

He, L. & Han, Z. Do usage counts of scientific data make sense? An investigation of the Dryad repository. Library Hi Tech 35 , 332–342 (2017).


Brickley, D., Burgess, M. & Noy, N. Google dataset search: Building a search engine for datasets in an open web ecosystem. In The World Wide Web Conference - WWW ‘19 , 1365–1375 (ACM Press, San Francisco, CA, USA, 2019).

Buneman, P., Dosso, D., Lissandrini, M. & Silvello, G. Data citation and the citation graph. Quantitative Science Studies 2 , 1399–1422 (2022).

Chao, T. C. Disciplinary reach: Investigating the impact of dataset reuse in the earth sciences. Proceedings of the American Society for Information Science and Technology 48 , 1–8 (2011).


Parr, C. et al . A discussion of value metrics for data repositories in earth and environmental sciences. Data Science Journal 18 , 58 (2019).

Eschenfelder, K. R., Shankar, K. & Downey, G. The financial maintenance of social science data archives: Four case studies of long–term infrastructure work. J. Assoc. Inf. Sci. Technol. 73 , 1723–1740 (2022).

Palmer, C. L., Weber, N. M. & Cragin, M. H. The analytic potential of scientific data: Understanding re-use value. Proceedings of the American Society for Information Science and Technology 48 , 1–10 (2011).

Zimmerman, A. S. New knowledge from old data: The role of standards in the sharing and reuse of ecological data. Sci. Technol. Human Values 33 , 631–652 (2008).

Cragin, M. H., Palmer, C. L., Carlson, J. R. & Witt, M. Data sharing, small science and institutional repositories. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 368 , 4023–4038 (2010).


Fear, K. M. Measuring and Anticipating the Impact of Data Reuse . Ph.D. thesis, University of Michigan (2013).

Borgman, C. L., Van de Sompel, H., Scharnhorst, A., van den Berg, H. & Treloar, A. Who uses the digital data archive? An exploratory study of DANS. Proceedings of the Association for Information Science and Technology 52 , 1–4 (2015).

Pasquetto, I. V., Borgman, C. L. & Wofford, M. F. Uses and reuses of scientific data: The data creators’ advantage. Harvard Data Science Review 1 (2019).

Gregory, K., Groth, P., Scharnhorst, A. & Wyatt, S. Lost or found? Discovering data needed for research. Harvard Data Science Review (2020).

York, J. Seeking equilibrium in data reuse: A study of knowledge satisficing . Ph.D. thesis, University of Michigan (2022).

Kilbride, W. & Norris, S. Collaborating to clarify the cost of curation. New Review of Information Networking 19 , 44–48 (2014).

Robinson-Garcia, N., Mongeon, P., Jeng, W. & Costas, R. DataCite as a novel bibliometric source: Coverage, strengths and limitations. Journal of Informetrics 11 , 841–854 (2017).

Qin, J., Hemsley, J. & Bratt, S. E. The structural shift and collaboration capacity in GenBank networks: A longitudinal study. Quantitative Science Studies 3 , 174–193 (2022).


Acuna, D. E., Yi, Z., Liang, L. & Zhuang, H. Predicting the usage of scientific datasets based on article, author, institution, and journal bibliometrics. In Smits, M. (ed.) Information for a Better World: Shaping the Global Future. iConference 2022 ., 42–52 (Springer International Publishing, Cham, 2022).

Zeng, T., Wu, L., Bratt, S. & Acuna, D. E. Assigning credit to scientific datasets using article citation networks. Journal of Informetrics 14 , 101013 (2020).

Koesten, L., Vougiouklis, P., Simperl, E. & Groth, P. Dataset reuse: Toward translating principles to practice. Patterns 1 , 100136 (2020).

Du, C., Cohoon, J., Lopez, P. & Howison, J. Softcite dataset: A dataset of software mentions in biomedical and economic research publications. J. Assoc. Inf. Sci. Technol. 72 , 870–884 (2021).

Aryani, A. et al . A research graph dataset for connecting research data repositories using RD-Switchboard. Sci Data 5 , 180099 (2018).

Färber, M. & Lamprecht, D. The data set knowledge graph: Creating a linked open data source for data sets. Quantitative Science Studies 2 , 1324–1355 (2021).

Perry, A. & Netscher, S. Measuring the time spent on data curation. Journal of Documentation 78 , 282–304 (2022).

Trisovic, A. et al . Advancing computational reproducibility in the Dataverse data repository platform. In Proceedings of the 3rd International Workshop on Practical Reproducible Evaluation of Computer Systems , P-RECS ‘20, 15–20, https://doi.org/10.1145/3391800.3398173 (Association for Computing Machinery, New York, NY, USA, 2020).

Borgman, C. L., Scharnhorst, A. & Golshan, M. S. Digital data archives as knowledge infrastructures: Mediating data sharing and reuse. Journal of the Association for Information Science and Technology 70 , 888–904, https://doi.org/10.1002/asi.24172 (2019).

Lafia, S. et al . MICA Data Descriptor. Zenodo https://doi.org/10.5281/zenodo.8432666 (2023).

Lafia, S., Thomer, A., Bleckley, D., Akmon, D. & Hemphill, L. Leveraging machine learning to detect data curation activities. In 2021 IEEE 17th International Conference on eScience (eScience) , 149–158, https://doi.org/10.1109/eScience51609.2021.00025 (2021).

Hemphill, L., Pienta, A., Lafia, S., Akmon, D. & Bleckley, D. How do properties of data, their curation, and their funding relate to reuse? J. Assoc. Inf. Sci. Technol. 73 , 1432–44, https://doi.org/10.1002/asi.24646 (2021).

Lafia, S., Fan, L., Thomer, A. & Hemphill, L. Subdivisions and crossroads: Identifying hidden community structures in a data archive’s citation network. Quantitative Science Studies 3 , 694–714, https://doi.org/10.1162/qss_a_00209 (2022).

ICPSR. ICPSR Bibliography of Data-related Literature: Collection Criteria. https://www.icpsr.umich.edu/web/pages/ICPSR/citations/collection-criteria.html (2023).

Lafia, S., Fan, L. & Hemphill, L. A natural language processing pipeline for detecting informal data references in academic literature. Proc. Assoc. Inf. Sci. Technol. 59 , 169–178, https://doi.org/10.1002/pra2.614 (2022).

Hook, D. W., Porter, S. J. & Herzog, C. Dimensions: Building context for search and evaluation. Frontiers in Research Metrics and Analytics 3 , 23, https://doi.org/10.3389/frma.2018.00023 (2018).

ICPSR. ICPSR Thesaurus. https://www.icpsr.umich.edu/web/ICPSR/thesaurus (2002).

ICPSR. ICPSR Curation Levels. https://www.icpsr.umich.edu/files/datamanagement/icpsr-curation-levels.pdf (2020).

McKinney, W. Data Structures for Statistical Computing in Python. In van der Walt, S. & Millman, J. (eds.) Proceedings of the 9th Python in Science Conference , 56–61 (2010).

Wickham, H. et al . Welcome to the Tidyverse. Journal of Open Source Software 4 , 1686 (2019).

Fan, L., Lafia, S., Li, L., Yang, F. & Hemphill, L. DataChat: Prototyping a conversational agent for dataset search and visualization. Proc. Assoc. Inf. Sci. Technol. 60 , 586–591 (2023).


Acknowledgements

We thank the ICPSR Bibliography staff, the ICPSR Data Curation Unit, and the ICPSR Data Stewardship Committee for their support of this research. This material is based upon work supported by the National Science Foundation under grant 1930645. This project was made possible in part by the Institute of Museum and Library Services LG-37-19-0134-19.

Author information

Authors and Affiliations

Inter-university Consortium for Political and Social Research, University of Michigan, Ann Arbor, MI, 48104, USA

Libby Hemphill, Sara Lafia, David Bleckley & Elizabeth Moss

School of Information, University of Michigan, Ann Arbor, MI, 48104, USA

Libby Hemphill & Lizhou Fan

School of Information, University of Arizona, Tucson, AZ, 85721, USA

Andrea Thomer


Contributions

L.H. and A.T. conceptualized the study design, D.B., E.M., and S.L. prepared the data, S.L., L.F., and L.H. analyzed the data, and D.B. validated the data. All authors reviewed and edited the manuscript.

Corresponding author

Correspondence to Libby Hemphill .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Hemphill, L., Thomer, A., Lafia, S. et al. A dataset for measuring the impact of research data and their curation. Sci Data 11 , 442 (2024). https://doi.org/10.1038/s41597-024-03303-2


Received : 16 November 2023

Accepted : 24 April 2024

Published : 03 May 2024

DOI : https://doi.org/10.1038/s41597-024-03303-2


  • Open access
  • Published: 17 August 2023

Data visualisation in scoping reviews and evidence maps on health topics: a cross-sectional analysis

  • Emily South   ORCID: orcid.org/0000-0003-2187-4762 1 &
  • Mark Rodgers 1  

Systematic Reviews volume  12 , Article number:  142 ( 2023 ) Cite this article


Scoping reviews and evidence maps are forms of evidence synthesis that aim to map the available literature on a topic and are well-suited to visual presentation of results. A range of data visualisation methods and interactive data visualisation tools exist that may make scoping reviews more useful to knowledge users. The aim of this study was to explore the use of data visualisation in a sample of recent scoping reviews and evidence maps on health topics, with a particular focus on interactive data visualisation.

Ovid MEDLINE ALL was searched for recent scoping reviews and evidence maps (June 2020-May 2021), and a sample of 300 papers that met basic selection criteria was taken. Data were extracted on the aim of each review and the use of data visualisation, including types of data visualisation used, variables presented and the use of interactivity. Descriptive data analysis was undertaken of the 238 reviews that aimed to map evidence.

Of the 238 scoping reviews or evidence maps in our analysis, around one-third (37.8%) included some form of data visualisation. Thirty-five different types of data visualisation were used across this sample, although most data visualisations identified were simple bar charts (standard, stacked or multi-set), pie charts or cross-tabulations (60.8%). Most data visualisations presented a single variable (64.4%) or two variables (26.1%). Almost a third of the reviews that used data visualisation did not use any colour (28.9%). Only two reviews presented interactive data visualisation, and few reported the software used to create visualisations.

Conclusions

Data visualisation is currently underused by scoping review authors. In particular, there is potential for much greater use of more innovative forms of data visualisation and interactive data visualisation. Where more innovative data visualisation is used, scoping reviews have made use of a wide range of different methods. Increased use of these more engaging visualisations may make scoping reviews more useful for a range of stakeholders.


Scoping reviews are “a type of evidence synthesis that aims to systematically identify and map the breadth of evidence available on a particular topic, field, concept, or issue” ([ 1 ], p. 950). While they include some of the same steps as a systematic review, such as systematic searches and the use of predetermined eligibility criteria, scoping reviews often address broader research questions and do not typically involve the quality appraisal of studies or synthesis of data [ 2 ]. Reasons for conducting a scoping review include the following: to map types of evidence available, to explore research design and conduct, to clarify concepts or definitions and to map characteristics or factors related to a concept [ 3 ]. Scoping reviews can also be undertaken to inform a future systematic review (e.g. to assure authors there will be adequate studies) or to identify knowledge gaps [ 3 ]. Other evidence synthesis approaches with similar aims have been described as evidence maps, mapping reviews or systematic maps [ 4 ]. While this terminology is used inconsistently, evidence maps can be used to identify evidence gaps and present them in a user-friendly (and often visual) way [ 5 ].

Scoping reviews are often targeted to an audience of healthcare professionals or policy-makers [ 6 ], suggesting that it is important to present results in a user-friendly and informative way. Until recently, there was little guidance on how to present the findings of scoping reviews. In recent literature, there has been some discussion of the importance of clearly presenting data for the intended audience of a scoping review, with creative and innovative use of visual methods if appropriate [ 7 , 8 , 9 ]. Lockwood et al. suggest that innovative visual presentation should be considered over dense sections of text or long tables in many cases [ 8 ]. Khalil et al. suggest that inspiration could be drawn from the field of data visualisation [ 7 ]. JBI guidance on scoping reviews recommends that reviewers carefully consider the best format for presenting data at the protocol development stage and provides a number of examples of possible methods [ 10 ].

Interactive resources are another option for presentation in scoping reviews [ 9 ]. Researchers without the relevant programming skills can now use several online platforms (such as Tableau [ 11 ] and Flourish [ 12 ]) to create interactive data visualisations. The benefits of using interactive visualisation in research include the ability to easily present more than two variables [ 13 ] and increased engagement of users [ 14 ]. Unlike static graphs, interactive visualisations can allow users to view hierarchical data at different levels, exploring both the “big picture” and looking in more detail ([ 15 ], p. 291). Interactive visualizations are often targeted at practitioners and decision-makers [ 13 ], and there is some evidence from qualitative research that they are valued by policy-makers [ 16 , 17 , 18 ].

Given their focus on mapping evidence, we believe that scoping reviews are particularly well-suited to visually presenting data and the use of interactive data visualisation tools. However, it is unknown how many recent scoping reviews visually map data or which types of data visualisation are used. The aim of this study was to explore the use of data visualisation methods in a large sample of recent scoping reviews and evidence maps on health topics. In particular, we were interested in the extent to which these forms of synthesis use any form of interactive data visualisation.

This study was a cross-sectional analysis of studies labelled as scoping reviews or evidence maps (or synonyms of these terms) in the title or abstract.

The search strategy was developed with help from an information specialist. Ovid MEDLINE® ALL was searched in June 2021 for studies added to the database in the previous 12 months. The search was limited to English language studies only.

The search strategy was as follows:

Ovid MEDLINE(R) ALL

1. (scoping review or evidence map or systematic map or mapping review or scoping study or scoping project or scoping exercise or literature mapping or evidence mapping or systematic mapping or literature scoping or evidence gap map).ab,ti.

2. limit 1 to english language

3. (202006* or 202007* or 202008* or 202009* or 202010* or 202011* or 202012* or 202101* or 202102* or 202103* or 202104* or 202105*).dt.

The search returned 3686 records. Records were de-duplicated in EndNote 20 software, leaving 3627 unique records.

A sample of these reviews was taken by screening the search results against basic selection criteria (Table 1 ). These criteria were piloted and refined after discussion between the two researchers. A single researcher (E.S.) screened the records in EPPI-Reviewer Web software using the machine-learning priority screening function. Where a second opinion was needed, decisions were checked by a second researcher (M.R.).

Our initial plan for sampling, informed by pilot searching, was to screen and data extract records in batches of 50 included reviews at a time. We planned to stop screening when a batch of 50 reviews had been extracted that included no new types of data visualisation or after screening time had reached 2 days. However, once data extraction was underway, we found the sample to be richer in terms of data visualisation than anticipated. After the inclusion of 300 reviews, we took the decision to end screening in order to ensure the study was manageable.

Data extraction

A data extraction form was developed in EPPI-Reviewer Web, piloted on 50 reviews and refined. Data were extracted by one researcher (E. S. or M. R.), with a second researcher (M. R. or E. S.) providing a second opinion when needed. The data items extracted were as follows: type of review (term used by authors), aim of review (mapping evidence vs. answering specific question vs. borderline), number of visualisations (if any), types of data visualisation used, variables/domains presented by each visualisation type, interactivity, use of colour and any software requirements.

When categorising review aims, we considered “mapping evidence” to incorporate all of the six purposes for conducting a scoping review proposed by Munn et al. [ 3 ]. Reviews were categorised as “answering a specific question” if they aimed to synthesise study findings to answer a particular question, for example on effectiveness of an intervention. We were inclusive with our definition of “mapping evidence” and included reviews with mixed aims in this category. However, some reviews were difficult to categorise (for example where aims were unclear or the stated aims did not match the actual focus of the paper) and were considered to be “borderline”. It became clear that a proportion of identified records that described themselves as “scoping” or “mapping” reviews were in fact pseudo-systematic reviews that failed to undertake key systematic review processes. Such reviews attempted to integrate the findings of included studies rather than map the evidence, and so reviews categorised as “answering a specific question” were excluded from the main analysis. Data visualisation methods for meta-analyses have been explored previously [ 19 ]. Figure  1 shows the flow of records from search results to final analysis sample.

Figure 1. Flow diagram of the sampling process

Data visualisation was defined as any graph or diagram that presented results data, including tables with a visual mapping element, such as cross-tabulations and heat maps. However, tables which displayed data at a study level (e.g. tables summarising key characteristics of each included study) were not included, even if they used symbols, shading or colour. Flow diagrams showing the study selection process were also excluded. Data visualisations in appendices or supplementary information were included, as well as any in publicly available dissemination products (e.g. visualisations hosted online) if mentioned in papers.

The typology used to categorise data visualisation methods was based on an existing online catalogue [20]. Specific types of data visualisation were categorised in five broad categories: graphs, diagrams, tables, maps/geographical and other. If a data visualisation appeared in our sample that did not feature in the original catalogue, we checked a second online catalogue [21] for an appropriate term, followed by wider Internet searches. These additional visualisation methods were added to the appropriate section of the typology. The final typology can be found in Additional file 1.

We conducted descriptive data analysis in Microsoft Excel 2019 and present frequencies and percentages. Where appropriate, data are presented using graphs or other data visualisations created using Flourish. We also link to interactive versions of some of these visualisations.
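As a rough illustration of this descriptive analysis (which was carried out in Microsoft Excel 2019), the frequency and percentage of each visualisation type could equally be tabulated in a few lines of Python; the file and column names below are assumptions, not part of the study.

    # Illustrative re-creation of the descriptive analysis done in Excel:
    # frequency and percentage of each data visualisation type.
    import pandas as pd

    df = pd.read_csv("extraction.csv")          # hypothetical export of the extraction data
    counts = df["visualisation_type"].value_counts()
    summary = pd.DataFrame({
        "n": counts,
        "percent": (counts / counts.sum() * 100).round(1),
    })
    print(summary)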

Results

Almost all of the 300 reviews in the total sample were labelled by review authors as “scoping reviews” (n = 293, 97.7%). There were also four “mapping reviews”, one “scoping study”, one “evidence mapping” and one described as a “scoping review and evidence map”. Included reviews were all published in 2020 or 2021, with the exception of one review published in 2018. Just over one-third of these reviews (n = 105, 35.0%) included some form of data visualisation. However, 62 reviews that did not focus on mapping evidence were excluded from the following analysis (see the “Methods” section). Of the 238 remaining reviews (those that either clearly aimed to map evidence or were judged to be “borderline”), 90 (37.8%) included at least one data visualisation. The references for these reviews can be found in Additional file 2.

Number of visualisations

Thirty-six (40.0%) of these 90 reviews included just one example of data visualisation (Fig. 2). Less than a third (n = 28, 31.1%) included three or more visualisations. The greatest number of data visualisations in one review was 17 (all bar or pie charts). In total, 222 individual data visualisations were identified across the sample of 238 reviews.

Figure 2. Number of data visualisations per review

Categories of data visualisation

Graphs were the most frequently used category of data visualisation in the sample. Over half of the reviews with data visualisation included at least one graph (n = 59, 65.6%). The least frequently used category was maps, with 15.6% (n = 14) of these reviews including a map.

Of the 222 individual data visualisations, 102 were graphs (45.9%), 34 were tables (15.3%), 23 were diagrams (10.4%), 15 were maps (6.8%) and 48 were classified as “other” in the typology (21.6%).

Types of data visualisation

All of the types of data visualisation identified in our sample are reported in Table 2. In total, 35 different types were used across the sample of reviews.

The most frequently used data visualisation type was the bar chart. Of the 222 data visualisations, 78 (35.1%) were a variation on a bar chart (standard, stacked or multi-set). There were also 33 pie charts (14.9% of data visualisations) and 24 cross-tabulations (10.8%). Together, these five types (the three bar chart variants, pie charts and cross-tabulations) accounted for 60.8% (n = 135) of all data visualisations. Figure 3 shows the frequency of each data visualisation category and type; an interactive online version of this treemap is also available (https://public.flourish.studio/visualisation/9396133/). Figure 4 shows how users can further explore the data using the interactive treemap.

Figure 3. Data visualisation categories and types. An interactive version of this treemap is available online: https://public.flourish.studio/visualisation/9396133/. Through the interactive version, users can further explore the data (see Fig. 4). The unit of this treemap is the individual data visualisation, so multiple data visualisations within the same scoping review are represented in this map. Created with flourish.studio (https://flourish.studio)

Figure 4. Screenshots showing how users of the interactive treemap can explore the data further. Users can explore each level of the hierarchical treemap (A: visualisation category > B: visualisation subcategory > C: variables presented in visualisation > D: individual references reporting this category/subcategory/variable permutation). Created with flourish.studio (https://flourish.studio)
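The treemap in Fig. 3 was built with Flourish, a no-code tool. For readers who prefer a scripted route, the sketch below produces a broadly similar interactive treemap with plotly; it is an assumed alternative rather than the authors' workflow, and only the bar chart, pie chart and cross-tabulation counts are taken from the results reported above.

    # Assumed alternative to Flourish: an interactive treemap with plotly.
    import pandas as pd
    import plotly.express as px

    # Counts for bar charts, pie charts and cross-tabulations come from the results;
    # the full figure covers 35 visualisation types across five categories.
    data = pd.DataFrame({
        "category": ["Graphs", "Graphs", "Tables"],
        "type": ["Bar chart (all variants)", "Pie chart", "Cross-tabulation"],
        "count": [78, 33, 24],
    })
    fig = px.treemap(data, path=["category", "type"], values="count")
    fig.write_html("treemap.html")   # self-contained interactive HTML file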

Data presented

Around two-thirds of data visualisations in the sample presented a single variable (n = 143, 64.4%). The most frequently presented single variables were themes (n = 22, 9.9% of data visualisations), population (n = 21, 9.5%), country or region (n = 21, 9.5%) and year (n = 20, 9.0%). There were 58 visualisations (26.1%) that presented two different variables. The remaining 21 data visualisations (9.5%) presented three or more variables. Figure 5 shows the variables presented by each different type of data visualisation (an interactive version of this figure is available online).

Figure 5. Variables presented by each data visualisation type. Darker cells indicate a larger number of reviews. An interactive version of this heat map is available online: https://public.flourish.studio/visualisation/10632665/. Users can hover over each cell to see the number of data visualisations for that combination of data visualisation type and variable. The unit of this heat map is the individual data visualisation, so multiple data visualisations within a single scoping review are represented in this map. Created with flourish.studio (https://flourish.studio)
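The heat map in Fig. 5 was also produced with Flourish. A minimal scripted sketch of the same idea, cross-tabulating visualisation type against the variable presented, is shown below; the input file and column names are assumptions.

    # Assumed sketch of a type-by-variable heat map (the study used Flourish).
    import pandas as pd
    import matplotlib.pyplot as plt

    # "long.csv" is a hypothetical long-format file: one row per (visualisation, variable) pair.
    pairs = pd.read_csv("long.csv")
    table = pd.crosstab(pairs["visualisation_type"], pairs["variable"])

    fig, ax = plt.subplots(figsize=(8, 6))
    im = ax.imshow(table.values, cmap="Blues")      # darker cells = more data visualisations
    ax.set_xticks(range(len(table.columns)))
    ax.set_xticklabels(table.columns, rotation=90)
    ax.set_yticks(range(len(table.index)))
    ax.set_yticklabels(table.index)
    fig.colorbar(im, ax=ax, label="Number of data visualisations")
    fig.tight_layout()
    fig.savefig("heatmap.png", dpi=200)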

Most reviews presented at least one data visualisation in colour (n = 64, 71.1%). However, almost a third (n = 26, 28.9%) used only black and white or greyscale.

Interactivity

Only two of the reviews included data visualisations with any level of interactivity. One scoping review on music and serious mental illness [ 22 ] linked to an interactive bubble chart hosted online on Tableau. Functionality included the ability to filter the studies displayed by various attributes.

The other review was an example of evidence mapping from the environmental health field [ 23 ]. All four of the data visualisations included in the paper were available in an interactive format hosted either by the review management software or on Tableau. The interactive versions linked to the relevant references so users could directly explore the evidence base. This was the only review that provided this feature.

Software requirements

Nine reviews clearly reported the software used to create data visualisations. Three reviews used Tableau (one of them also used review management software, as discussed above) [22, 23, 24]. Two reviews generated maps using ArcGIS [25] or ArcMap [26]. One review used Leximancer for a lexical analysis [27]. One review undertook a bibliometric analysis using VOSviewer [28], and another explored citation patterns using CitNetExplorer [29]. Other reviews used Excel [30] or R [26].

Discussion

To our knowledge, this is the first systematic and in-depth exploration of the use of data visualisation techniques in scoping reviews. Our findings suggest that the majority of scoping reviews do not use any data visualisation at all, and, in particular, more innovative examples of data visualisation are rare. Around 60% of data visualisations in our sample were simple bar charts, pie charts or cross-tabulations. There appears to be very limited use of interactive online visualisation, despite the potential this has for communicating results to a range of stakeholders. While it is not always appropriate to use data visualisation (or a simple bar chart may be the most user-friendly way of presenting the data), these findings suggest that data visualisation is being underused in scoping reviews. In a large minority of reviews, visualisations were not published in colour, potentially limiting how user-friendly and attractive papers are to decision-makers and other stakeholders. Also, very few reviews clearly reported the software used to create data visualisations. However, 35 different types of data visualisation were used across the sample, highlighting the wide range of methods that are potentially available to scoping review authors.

Our results build on the limited research that has previously been undertaken in this area. Two previous publications also found limited use of graphs in scoping reviews. Results were “mapped graphically” in 29% of scoping reviews in any field in one 2014 publication [31] and 17% of healthcare scoping reviews in a 2016 article [6]. Our results suggest that the use of data visualisation has increased somewhat since these reviews were conducted. Scoping review methods have also evolved in the last 10 years; formal guidance on scoping review conduct was published in 2014 [32], and an extension of the PRISMA checklist for scoping reviews was published in 2018 [33]. It is possible that an overall increase in use of data visualisation reflects increased quality of published scoping reviews. There is also some literature supporting our findings on the wide range of data visualisation methods that are used in evidence synthesis. An investigation of methods to identify, prioritise or display health research gaps (25/139 included studies were scoping reviews; 6/139 were evidence maps) identified 14 different methods used to display gaps or priorities, with half being “more advanced” (e.g. treemaps, radial bar plots) ([34], p. 107). A review of data visualisation methods used in papers reporting meta-analyses found over 200 different ways of displaying data [19].

Only two reviews in our sample used interactive data visualisation, and one of these was an example of systematic evidence mapping from the environmental health field rather than a scoping review (in environmental health, systematic evidence mapping explicitly involves producing a searchable database [35]). A scoping review of papers on the use of interactive data visualisation in population health or health services research found a range of examples but still limited use overall [13]. For example, the authors noted the currently underdeveloped potential for using interactive visualisation in research on health inequalities. It is possible that the use of interactive data visualisation in academic papers is restricted by academic publishing requirements; for example, it is currently difficult to incorporate an interactive figure into a journal article without linking to an external host or platform. However, we believe that there is a lot of potential to add value to future scoping reviews by using interactive data visualisation software. Few reviews in our sample presented three or more variables in a single visualisation, something which can easily be achieved using interactive data visualisation tools. We have previously used EPPI-Mapper [36] to present results of a scoping review of systematic reviews on behaviour change in disadvantaged groups, with links to the maps provided in the paper [37]. These interactive maps allowed policy-makers to explore the evidence on different behaviours and disadvantaged groups and access full publications of the included studies directly from the map.

We acknowledge that there are barriers to using some of the available data visualisation software. EPPI-Mapper and some of the software used by reviews in our sample incur a cost. Some software requires a certain level of knowledge and skill in its use. However, numerous free online data visualisation tools and resources exist. We have used Flourish to present data for this review, a basic version of which is currently freely available and easy to use. Previous health research has been found to use a range of different interactive data visualisation software, much of which does not require advanced knowledge or skills [13].

There are likely to be other barriers to the use of data visualisation in scoping reviews. Journal guidelines and policies may present barriers to using innovative data visualisation; for example, some journals charge a fee for publication of figures in colour. As previously mentioned, there are limited options for incorporating interactive data visualisation into journal articles. Authors may also be unaware of the data visualisation methods and tools that are available. Producing data visualisations can be time-consuming, particularly for authors who lack experience and skills in this area. It is possible that many authors prioritise speed of publication over spending time producing innovative data visualisations, particularly in a context where there is pressure to publish.

Limitations

A limitation of this study was that we did not assess how appropriate the use of data visualisation was in our sample, as this would have been highly subjective. Simple descriptive or tabular presentation of results may be the most appropriate approach for some scoping review objectives [7, 8, 10], and the scoping review literature cautions against “over-using” different visual presentation methods [7, 8]. It cannot be assumed that all of the reviews that did not include data visualisation should have done so. Likewise, we do not know how many reviews used methods of data visualisation that were not well suited to their data.

We initially relied on authors’ own use of the term “scoping review” (or equivalent) to sample reviews but identified a relatively large number of papers labelled as scoping reviews that did not meet the basic definition, despite the availability of guidance and reporting guidelines [10, 33]. It has previously been noted that scoping reviews may be undertaken inappropriately because they are seen as “easier” to conduct than a systematic review ([3], p. 6), and that reviews are often labelled as “scoping reviews” while not appearing to follow any established framework or guidance [2]. We therefore took the decision to remove these reviews from our main analysis. However, decisions on how to classify review aims were subjective, and we did include some reviews that were of borderline relevance.

A further limitation is that this was a sample of published reviews, rather than a comprehensive systematic scoping review such as those previously undertaken [6, 31]. The number of scoping reviews published has increased rapidly, and a comprehensive review would now be difficult to undertake. As this was a sample, not all relevant scoping reviews or evidence maps that would have met our criteria were included. We used machine learning to screen our search results for pragmatic reasons (to reduce screening time), but we see no reason why our sample would not be broadly reflective of the wider literature.

Conclusions

Data visualisation, and in particular its more innovative forms, is currently underused in published scoping reviews on health topics. The examples that we have found highlight the wide range of methods that scoping review authors could draw upon to present their data in an engaging way. In particular, we believe that interactive data visualisation has significant potential for mapping the available literature on a topic. Appropriate use of data visualisation may increase the usefulness, and thus uptake, of scoping reviews as a way of identifying existing evidence or research gaps by decision-makers, researchers and commissioners of research. We recommend that scoping review authors explore the extensive free resources and online tools available for data visualisation. We also think it would be useful for publishers to allow easier integration of interactive tools into academic publishing, given that papers are now predominantly accessed online. Future research could usefully explore which data visualisation methods are most useful to scoping review users.

Availability of data and materials

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Abbreviations

JBI: Organisation formerly known as Joanna Briggs Institute

PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses

References

1. Munn Z, Pollock D, Khalil H, Alexander L, McInerney P, Godfrey CM, Peters M, Tricco AC. What are scoping reviews? Providing a formal definition of scoping reviews as a type of evidence synthesis. JBI Evid Synth. 2022;20:950–2.

2. Peters MDJ, Marnie C, Colquhoun H, Garritty CM, Hempel S, Horsley T, Langlois EV, Lillie E, O’Brien KK, Tunçalp Ӧ, et al. Scoping reviews: reinforcing and advancing the methodology and application. Syst Rev. 2021;10:263.

3. Munn Z, Peters MDJ, Stern C, Tufanaru C, McArthur A, Aromataris E. Systematic review or scoping review? Guidance for authors when choosing between a systematic or scoping review approach. BMC Med Res Methodol. 2018;18:143.

4. Sutton A, Clowes M, Preston L, Booth A. Meeting the review family: exploring review types and associated information retrieval requirements. Health Info Libr J. 2019;36:202–22.

5. Miake-Lye IM, Hempel S, Shanman R, Shekelle PG. What is an evidence map? A systematic review of published evidence maps and their definitions, methods, and products. Syst Rev. 2016;5:28.

6. Tricco AC, Lillie E, Zarin W, O’Brien K, Colquhoun H, Kastner M, Levac D, Ng C, Sharpe JP, Wilson K, et al. A scoping review on the conduct and reporting of scoping reviews. BMC Med Res Methodol. 2016;16:15.

7. Khalil H, Peters MDJ, Tricco AC, Pollock D, Alexander L, McInerney P, Godfrey CM, Munn Z. Conducting high quality scoping reviews-challenges and solutions. J Clin Epidemiol. 2021;130:156–60.

8. Lockwood C, dos Santos KB, Pap R. Practical guidance for knowledge synthesis: scoping review methods. Asian Nurs Res. 2019;13:287–94.

9. Pollock D, Peters MDJ, Khalil H, McInerney P, Alexander L, Tricco AC, Evans C, de Moraes ÉB, Godfrey CM, Pieper D, et al. Recommendations for the extraction, analysis, and presentation of results in scoping reviews. JBI Evid Synth. 2022;10:11124.

10. Peters MDJ, Godfrey C, McInerney P, Munn Z, Tricco AC, Khalil H. Chapter 11: Scoping reviews (2020 version). In: Aromataris E, Munn Z, editors. JBI Manual for Evidence Synthesis. JBI; 2020. Available from https://synthesismanual.jbi.global. Accessed 1 Feb 2023.

11. Tableau Public. https://www.tableau.com/en-gb/products/public. Accessed 24 January 2023.

12. flourish.studio. https://flourish.studio/. Accessed 24 January 2023.

13. Chishtie J, Bielska IA, Barrera A, Marchand J-S, Imran M, Tirmizi SFA, Turcotte LA, Munce S, Shepherd J, Senthinathan A, et al. Interactive visualization applications in population health and health services research: systematic scoping review. J Med Internet Res. 2022;24:e27534.

14. Isett KR, Hicks DM. Providing public servants what they need: revealing the “unseen” through data visualization. Public Adm Rev. 2018;78:479–85.

15. Carroll LN, Au AP, Detwiler LT, Fu T-c, Painter IS, Abernethy NF. Visualization and analytics tools for infectious disease epidemiology: a systematic review. J Biomed Inform. 2014;51:287–98.

16. Lundkvist A, El-Khatib Z, Kalra N, Pantoja T, Leach-Kemon K, Gapp C, Kuchenmüller T. Policy-makers’ views on translating burden of disease estimates in health policies: bridging the gap through data visualization. Arch Public Health. 2021;79:17.

17. Zakkar M, Sedig K. Interactive visualization of public health indicators to support policymaking: an exploratory study. Online J Public Health Inform. 2017;9:e190.

18. Park S, Bekemeier B, Flaxman AD. Understanding data use and preference of data visualization for public health professionals: a qualitative study. Public Health Nurs. 2021;38:531–41.

19. Kossmeier M, Tran US, Voracek M. Charting the landscape of graphical displays for meta-analysis and systematic reviews: a comprehensive review, taxonomy, and feature analysis. BMC Med Res Methodol. 2020;20:26.

20. Ribecca S. The Data Visualisation Catalogue. https://datavizcatalogue.com/index.html. Accessed 23 November 2021.

21. Ferdio. Data Viz Project. https://datavizproject.com/. Accessed 23 November 2021.

22. Golden TL, Springs S, Kimmel HJ, Gupta S, Tiedemann A, Sandu CC, Magsamen S. The use of music in the treatment and management of serious mental illness: a global scoping review of the literature. Front Psychol. 2021;12:649840.

23. Keshava C, Davis JA, Stanek J, Thayer KA, Galizia A, Keshava N, Gift J, Vulimiri SV, Woodall G, Gigot C, et al. Application of systematic evidence mapping to assess the impact of new research when updating health reference values: a case example using acrolein. Environ Int. 2020;143:105956.

24. Jayakumar P, Lin E, Galea V, Mathew AJ, Panda N, Vetter I, Haynes AB. Digital phenotyping and patient-generated health data for outcome measurement in surgical care: a scoping review. J Pers Med. 2020;10:282.

25. Qu LG, Perera M, Lawrentschuk N, Umbas R, Klotz L. Scoping review: hotspots for COVID-19 urological research: what is being published and from where? World J Urol. 2021;39:3151–60.

26. Rossa-Roccor V, Acheson ES, Andrade-Rivas F, Coombe M, Ogura S, Super L, Hong A. Scoping review and bibliometric analysis of the term “planetary health” in the peer-reviewed literature. Front Public Health. 2020;8:343.

27. Hewitt L, Dahlen HG, Hartz DL, Dadich A. Leadership and management in midwifery-led continuity of care models: a thematic and lexical analysis of a scoping review. Midwifery. 2021;98:102986.

28. Xia H, Tan S, Huang S, Gan P, Zhong C, Lu M, Peng Y, Zhou X, Tang X. Scoping review and bibliometric analysis of the most influential publications in achalasia research from 1995 to 2020. Biomed Res Int. 2021;2021:8836395.

29. Vigliotti V, Taggart T, Walker M, Kusmastuti S, Ransome Y. Religion, faith, and spirituality influences on HIV prevention activities: a scoping review. PLoS ONE. 2020;15:e0234720.

30. van Heemskerken P, Broekhuizen H, Gajewski J, Brugha R, Bijlmakers L. Barriers to surgery performed by non-physician clinicians in sub-Saharan Africa-a scoping review. Hum Resour Health. 2020;18:51.

31. Pham MT, Rajić A, Greig JD, Sargeant JM, Papadopoulos A, McEwen SA. A scoping review of scoping reviews: advancing the approach and enhancing the consistency. Res Synth Methods. 2014;5:371–85.

32. Peters MDJ, Marnie C, Tricco AC, Pollock D, Munn Z, Alexander L, McInerney P, Godfrey CM, Khalil H. Updated methodological guidance for the conduct of scoping reviews. JBI Evid Synth. 2020;18:2119–26.

33. Tricco AC, Lillie E, Zarin W, O’Brien KK, Colquhoun H, Levac D, Moher D, Peters MDJ, Horsley T, Weeks L, et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): checklist and explanation. Ann Intern Med. 2018;169:467–73.

34. Nyanchoka L, Tudur-Smith C, Thu VN, Iversen V, Tricco AC, Porcher R. A scoping review describes methods used to identify, prioritize and display gaps in health research. J Clin Epidemiol. 2019;109:99–110.

35. Wolffe TAM, Whaley P, Halsall C, Rooney AA, Walker VR. Systematic evidence maps as a novel tool to support evidence-based decision-making in chemicals policy and risk management. Environ Int. 2019;130:104871.

36. Digital Solution Foundry and EPPI-Centre. EPPI-Mapper, Version 2.0.1. EPPI-Centre, UCL Social Research Institute, University College London. 2020. https://eppi.ioe.ac.uk/cms/Default.aspx?tabid=3790.

37. South E, Rodgers M, Wright K, Whitehead M, Sowden A. Reducing lifestyle risk behaviours in disadvantaged groups in high-income countries: a scoping review of systematic reviews. Prev Med. 2022;154:106916.


Acknowledgements

We would like to thank Melissa Harden, Senior Information Specialist, Centre for Reviews and Dissemination, for advice on developing the search strategy.

Funding

This work received no external funding.

Author information

Authors and Affiliations

Centre for Reviews and Dissemination, University of York, York, YO10 5DD, UK

Emily South & Mark Rodgers


Contributions

Both authors conceptualised and designed the study and contributed to screening, data extraction and the interpretation of results. ES undertook the literature searches, analysed data, produced the data visualisations and drafted the manuscript. MR contributed to revising the manuscript, and both authors read and approved the final version.

Corresponding author

Correspondence to Emily South.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Typology of data visualisation methods.

Additional file 2.

References of scoping reviews included in main dataset.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

South, E., Rodgers, M. Data visualisation in scoping reviews and evidence maps on health topics: a cross-sectional analysis. Syst Rev 12, 142 (2023). https://doi.org/10.1186/s13643-023-02309-y

Received: 21 February 2023

Accepted: 07 August 2023

Published: 17 August 2023

DOI: https://doi.org/10.1186/s13643-023-02309-y


Keywords

  • Scoping review
  • Evidence map
  • Data visualisation

