Validity – Types, Examples and Guide
Validity is a fundamental concept in research, referring to the extent to which a test, measurement, or study accurately reflects or assesses the specific concept that the researcher is attempting to measure. Ensuring validity is crucial as it determines the trustworthiness and credibility of the research findings.
Research Validity
Research validity pertains to the accuracy and truthfulness of the research. It examines whether the research truly measures what it claims to measure. Without validity, research results can be misleading or erroneous, leading to incorrect conclusions and potentially flawed applications.
How to Ensure Validity in Research
Ensuring validity in research involves several strategies:
- Clear Operational Definitions: Define variables clearly and precisely.
- Use of Reliable Instruments: Employ measurement tools that have been tested for reliability.
- Pilot Testing: Conduct preliminary studies to refine the research design and instruments.
- Triangulation: Use multiple methods or sources to cross-verify results.
- Control of Variables: Control extraneous variables that might influence the outcomes.
Types of Validity
Validity is categorized into several types, each addressing different aspects of measurement accuracy.
Internal Validity
Internal validity refers to the degree to which the results of a study can be attributed to the treatments or interventions rather than other factors. It is about ensuring that the study is free from confounding variables that could affect the outcome.
External Validity
External validity concerns the extent to which the research findings can be generalized to other settings, populations, or times. High external validity means the results are applicable beyond the specific context of the study.
Construct Validity
Construct validity evaluates whether a test or instrument measures the theoretical construct it is intended to measure. It involves ensuring that the test is truly assessing the concept it claims to represent.
Content Validity
Content validity examines whether a test covers the entire range of the concept being measured. It ensures that the test items represent all facets of the concept.
Criterion Validity
Criterion validity assesses how well one measure predicts an outcome based on another measure. It is divided into two types:
- Predictive Validity: How well a test predicts future performance.
- Concurrent Validity: How well a test correlates with a currently existing measure.
Face Validity
Face validity refers to the extent to which a test appears to measure what it is supposed to measure, based on superficial inspection. While it is the least scientific measure of validity, it is important for ensuring that stakeholders believe in the test’s relevance.
Importance of Validity
Validity is crucial because it directly affects the credibility of research findings. Valid results ensure that conclusions drawn from research are accurate and can be trusted. This, in turn, influences the decisions and policies based on the research.
Examples of Validity
- Internal Validity: A randomized controlled trial (RCT) in which random assignment of participants helps eliminate biases.
- External Validity: A study on educational interventions whose results can be applied to different schools across various regions.
- Construct Validity: A psychological test that accurately measures depression levels.
- Content Validity: An exam that covers all topics taught in a course.
- Criterion Validity: A job performance test that predicts future job success.
Where to Write About Validity in a Thesis
In a thesis, the methodology section should include discussions about validity. Here, you explain how you ensured the validity of your research instruments and design. Additionally, you may discuss validity in the results section, interpreting how the validity of your measurements affects your findings.
Applications of Validity
Validity has wide applications across various fields:
- Education: Ensuring assessments accurately measure student learning.
- Psychology: Developing tests that correctly diagnose mental health conditions.
- Market Research: Creating surveys that accurately capture consumer preferences.
Limitations of Validity
While ensuring validity is essential, it has its limitations:
- Complexity: Achieving high validity can be complex and resource-intensive.
- Context-Specificity: Some validity types may not be universally applicable across all contexts.
- Subjectivity: Certain types of validity, like face validity, involve subjective judgments.
By understanding and addressing these aspects of validity, researchers can enhance the quality and impact of their studies, leading to more reliable and actionable results.
About the author
Muhammad Hassan
Researcher, Academic Writer, Web developer
Validity In Psychology Research: Types & Examples
Saul McLeod, PhD
Editor-in-Chief for Simply Psychology
BSc (Hons) Psychology, MRes, PhD, University of Manchester
Saul McLeod, PhD., is a qualified psychology teacher with over 18 years of experience in further and higher education. He has been published in peer-reviewed journals, including the Journal of Clinical Psychology.
Olivia Guy-Evans, MSc
Associate Editor for Simply Psychology
BSc (Hons) Psychology, MSc Psychology of Education
Olivia Guy-Evans is a writer and associate editor for Simply Psychology. She has previously worked in healthcare and educational sectors.
In psychology research, validity refers to the extent to which a test or measurement tool accurately measures what it’s intended to measure. It ensures that the research findings are genuine and not due to extraneous factors.
Validity can be categorized into different types based on internal and external validity.
The concept of validity was formulated by Kelley (1927, p. 14), who stated that a test is valid if it measures what it claims to measure. For example, a test of intelligence should measure intelligence and not something else (such as memory).
Internal and External Validity In Research
Internal validity refers to whether the effects observed in a study are due to the manipulation of the independent variable and not some other confounding factor.
In other words, there is a causal relationship between the independent and dependent variables.
Internal validity can be improved by controlling extraneous variables, using standardized instructions, counterbalancing, and eliminating demand characteristics and investigator effects.
External validity refers to the extent to which the results of a study can be generalized to other settings (ecological validity), other people (population validity), and over time (historical validity).
External validity can be improved by conducting experiments in more natural settings and by using random sampling to select participants.
Types of Validity In Psychology
Two main categories of validity are used to assess the validity of a test (i.e., questionnaire, interview, IQ test, etc.): content and criterion.
- Content validity refers to the extent to which a test or measurement represents all aspects of the intended content domain. It assesses whether the test items adequately cover the topic or concept.
- Criterion validity assesses the performance of a test based on its correlation with a known external criterion or outcome. It can be further divided into concurrent (measured at the same time) and predictive (measuring future performance) validity.
Face Validity
Face validity is simply whether the test appears (at face value) to measure what it claims to. This is the least sophisticated measure of content-related validity, and is a superficial and subjective assessment based on appearance.
Tests wherein the purpose is clear, even to naïve respondents, are said to have high face validity. Accordingly, tests wherein the purpose is unclear have low face validity (Nevo, 1985).
A direct measurement of face validity is obtained by asking people to rate the validity of a test as it appears to them. This rater could use a Likert scale to assess face validity.
For example:
- The test is extremely suitable for a given purpose.
- The test is very suitable for that purpose.
- The test is adequate.
- The test is inadequate.
- The test is irrelevant and, therefore, unsuitable.
It is important to select suitable people to rate a test (e.g., questionnaire, interview, IQ test, etc.). For example, individuals who actually take the test would be well placed to judge its face validity.
Also, people who work with the test could offer their opinion (e.g., employers, university administrators). Finally, the researcher could use members of the general public with an interest in the test (e.g., parents of testees, politicians, teachers, etc.).
The face validity of a test can be considered a robust construct only if a reasonable level of agreement exists among raters.
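As a toy illustration (the ratings and the agreement index below are invented for this sketch, not taken from Nevo, 1985), Likert-style face-validity ratings can be summarized with a mean and a crude check of how closely raters agree:

```python
from statistics import mean

# Hypothetical face-validity ratings from 8 raters on a 5-point scale
# (5 = "extremely suitable" ... 1 = "irrelevant and therefore unsuitable").
ratings = [5, 4, 4, 5, 3, 4, 5, 4]

mean_rating = mean(ratings)

# Simple agreement index: proportion of raters within one scale point of
# the modal rating -- a crude stand-in for formal agreement statistics
# such as Kendall's W.
modal = max(set(ratings), key=ratings.count)
agreement = sum(abs(r - modal) <= 1 for r in ratings) / len(ratings)

print(round(mean_rating, 2))  # 4.25
print(agreement)              # 1.0
```

In practice a formal agreement statistic would be preferable; the sketch only shows the shape of the computation.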
It should be noted that the term face validity should be avoided when the rating is done by an “expert,” as content validity is more appropriate.
Having face validity does not mean that a test really measures what the researcher intends to measure, only that, in the judgment of raters, it appears to do so. Consequently, it is a crude and basic measure of validity.
A test item such as “ I have recently thought of killing myself ” has obvious face validity as an item measuring suicidal cognitions and may be useful when measuring symptoms of depression.
However, the implication of items on tests with clear face validity is that they are more vulnerable to social desirability bias. Individuals may manipulate their responses to deny or hide problems or exaggerate behaviors to present a positive image of themselves.
It is possible for a test item to lack face validity but still have general validity and measure what it claims to measure. This is good because it reduces demand characteristics and makes it harder for respondents to manipulate their answers.
For example, the test item “ I believe in the second coming of Christ ” would lack face validity as a measure of depression (as the purpose of the item is unclear).
This item appeared on the first version of The Minnesota Multiphasic Personality Inventory (MMPI) and loaded on the depression scale.
Because most of the original normative sample of the MMPI were good Christians, only a depressed Christian would think Christ is not coming back. Thus, for this particular religious sample, the item does have general validity but not face validity.
Construct Validity
Construct validity assesses how well a test or measure represents and captures an abstract theoretical concept, known as a construct. It indicates the degree to which the test accurately reflects the construct it intends to measure, often evaluated through relationships with other variables and measures theoretically connected to the construct.
The concept of construct validity was introduced by Cronbach and Meehl (1955). It refers to the extent to which a test captures a specific theoretical construct or trait, and it overlaps with some of the other aspects of validity.
Construct validity does not concern the simple, factual question of whether a test measures an attribute.
Instead, it is about the complex question of whether test score interpretations are consistent with a nomological network involving theoretical and observational terms (Cronbach & Meehl, 1955).
To test for construct validity, it must be demonstrated that the phenomenon being measured actually exists. So, the construct validity of a test of intelligence, for example, depends on a model or theory of intelligence.
Construct validity entails demonstrating the power of such a construct to explain a network of research findings and to predict further relationships.
The more evidence a researcher can demonstrate for a test’s construct validity, the better. However, there is no single method of determining the construct validity of a test.
Instead, different methods and approaches are combined to present the overall construct validity of a test. For example, factor analysis and correlational methods can be used.
Convergent validity
Convergent validity is a subtype of construct validity. It assesses the degree to which two measures that theoretically should be related are related.
It demonstrates that measures of similar constructs are highly correlated. It helps confirm that a test accurately measures the intended construct by showing its alignment with other tests designed to measure the same or similar constructs.
For example, suppose there are two different scales used to measure self-esteem:
Scale A and Scale B. If both scales effectively measure self-esteem, then individuals who score high on Scale A should also score high on Scale B, and those who score low on Scale A should score similarly low on Scale B.
If the scores from these two scales show a strong positive correlation, then this provides evidence for convergent validity because it indicates that both scales seem to measure the same underlying construct of self-esteem.
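The Scale A/Scale B logic can be made concrete by computing a Pearson correlation between the two sets of scores. The data below are hypothetical, and the function is a minimal pure-Python sketch:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical self-esteem scores for 6 participants on two scales
# intended to measure the same construct.
scale_a = [12, 18, 25, 30, 22, 15]
scale_b = [14, 20, 24, 32, 21, 13]

r = pearson_r(scale_a, scale_b)
print(round(r, 2))  # a strong positive r supports convergent validity
```

A correlation near +1 here would count as convergent evidence; in real validation work this would be combined with other evidence rather than relied on alone.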
Concurrent Validity (i.e., occurring at the same time)
Concurrent validity evaluates how well a test’s results correlate with the results of a previously established and accepted measure, when both are administered at the same time.
It helps in determining whether a new measure is a good reflection of an established one without waiting to observe outcomes in the future.
If the new test is validated by comparison with a currently existing criterion, we have concurrent validity.
Very often, a new IQ or personality test might be compared with an older but similar test known to have good validity already.
Predictive Validity
Predictive validity assesses how well a test predicts a criterion that will occur in the future. It measures the test’s ability to foresee the performance of an individual on a related criterion measured at a later point in time. It gauges the test’s effectiveness in predicting subsequent real-world outcomes or results.
For example, a prediction may be made on the basis of a new intelligence test that high scorers at age 12 will be more likely to obtain university degrees several years later. If the prediction is borne out, then the test has predictive validity.
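A rough numeric sketch of this kind of predictive-validity check (all data hypothetical): split test takers at the median score and compare how often each group later attains the criterion.

```python
# Hypothetical data: test scores at age 12, and whether each person
# later obtained a university degree (1 = yes, 0 = no).
scores = [95, 110, 130, 85, 125, 140, 100, 118]
degree = [0,   1,   1,  0,   1,   1,   0,   1]

# Split at the median score and compare criterion rates between groups --
# a crude stand-in for a point-biserial correlation or regression.
cutoff = sorted(scores)[len(scores) // 2]
high = [d for s, d in zip(scores, degree) if s >= cutoff]
low = [d for s, d in zip(scores, degree) if s < cutoff]

print(sum(high) / len(high))  # 1.0  (degree rate among high scorers)
print(sum(low) / len(low))    # 0.25 (degree rate among low scorers)
```

A clearly higher criterion rate among high scorers is the pattern predictive validity looks for; formal work would use correlation or regression instead of a median split.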
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
Hathaway, S. R., & McKinley, J. C. (1943). Manual for the Minnesota Multiphasic Personality Inventory. New York: Psychological Corporation.
Kelley, T. L. (1927). Interpretation of educational measurements. New York: Macmillan.
Nevo, B. (1985). Face validity revisited. Journal of Educational Measurement, 22(4), 287–293.
Reliability and Validity
Yori Gidron
These two concepts are the basis for assessment in most scientific work in medical and social sciences. Reliability refers to the degree of consistency in measurement and to the lack of error. There are several types of indices of reliability. Internal reliability (measured by Cronbach’s alpha) is a measure of repeatability of a measure. In psychometrics, a questionnaire of, for example, 10 items, is said to be reliable if its internal reliability coefficient is at least 0.70. This reflects approximately the mean correlation between each score on each item, with all remaining item scores, repeated across all items. Methodologically, this reflects a measure of repeatability, a basic premise of science. Another type of reliability is inter-rater reliability, which refers to the degree of agreement between two or more observers, evaluating a patient’s behavior, for example. Thus, in the original type A behavior interview, which currently places more emphasis on hostility,...
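The internal-reliability coefficient mentioned above, Cronbach's alpha, can be computed directly from item-level scores. A minimal Python sketch with invented data, using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), with population variances:

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of k item-score columns.

    item_scores: list of k lists, each holding one item's scores
    across all respondents (population variances are used).
    """
    k = len(item_scores)
    item_vars = sum(pvariance(item) for item in item_scores)
    totals = [sum(resp) for resp in zip(*item_scores)]
    return (k / (k - 1)) * (1 - item_vars / pvariance(totals))

# Hypothetical 3-item questionnaire answered by 5 respondents
# (each inner list is one item's scores across respondents).
items = [
    [3, 4, 5, 2, 4],
    [3, 5, 4, 2, 5],
    [2, 4, 5, 1, 4],
]

alpha = cronbach_alpha(items)
print(round(alpha, 2))   # 0.94
print(alpha >= 0.70)     # True: meets the conventional threshold
```

The 0.70 cutoff checked at the end is the conventional minimum cited in the entry; the item data themselves are illustrative only.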
Gidron, Y. (2013). Reliability and Validity. In: Gellman, M.D., Turner, J.R. (eds) Encyclopedia of Behavioral Medicine. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-1005-9_1549
CBE Life Sciences Education, 15(1), Spring 2016
Contemporary Test Validity in Theory and Practice: A Primer for Discipline-Based Education Researchers
Todd D. Reeves
*Educational Technology, Research and Assessment, Northern Illinois University, DeKalb, IL 60115
Gili Marbach-Ad
† College of Computer, Mathematical and Natural Sciences, University of Maryland, College Park, MD 20742
This essay offers a contemporary social science perspective on test validity and the validation process. The instructional piece explores the concepts of test validity, the validation process, validity evidence, and key threats to validity. The essay also includes an in-depth example of a validity argument and validation approach for a test of student argument analysis. In addition to discipline-based education researchers, this essay should benefit practitioners (e.g., lab directors, faculty members) in the development, evaluation, and/or selection of instruments for their work assessing students or evaluating pedagogical innovations.
INTRODUCTION
The field of discipline-based education research ( Singer et al ., 2012 ) has emerged in response to long-standing calls to advance the status of U.S. science education at the postsecondary level (e.g., Boyer Commission on Educating Undergraduates in the Research University, 1998 ; National Research Council, 2003 ; American Association for the Advancement of Science, 2011 ). Discipline-based education research applies scientific principles to study postsecondary science education processes and outcomes systematically to improve the scientific enterprise. In particular, this field has made significant progress with respect to the study of 1) active-learning pedagogies (e.g., Freeman et al. , 2014 ); 2) interventions to support those pedagogies among both faculty (e.g., Brownell and Tanner, 2012 ) and graduate teaching assistants (e.g., Schussler et al. , 2015 ); and 3) undergraduate research experiences (e.g., Auchincloss et al. , 2014 ).
Most discipline-based education researchers (DBERs) were formally trained in the methods of scientific disciplines such as biology, chemistry, and physics, rather than social science disciplines such as psychology and education. As a result, DBERs may have never taken specific courses in the social science research methodology—either quantitative or qualitative—on which their scholarship often relies so heavily ( Singer et al. , 2012 ). While the same principles of science ground all these fields, the specific methods used and some criteria for methodological and scientific rigor differ along disciplinary lines.
One particular aspect of (quantitative) social science research that differs markedly from research in disciplines such as biology and chemistry is the instrumentation used to quantify phenomena. Instrumentation is a critical aspect of research methodology, because it provides the raw materials input to statistical analyses and thus serves as a basis for credible conclusions and research-based educational practice ( Opfer et al ., 2012 ; Campbell and Nehm, 2013 ). A notable feature of social science instrumentation is that it generally targets variables that are latent, that is, variables that are not directly observable but instead must be inferred through observable behavior ( Bollen, 2002 ). For example, to elicit evidence of cognitive beliefs, which are not observable directly, respondents are asked to report their level of agreement (e.g., “strongly disagree,” “disagree,” “agree,” “strongly agree”) with textually presented statements (e.g., “I like science,” “Science is fun,” and “I look forward to science class”). Even a multiple-choice final examination does not directly observe the phenomenon of interest (e.g., student knowledge). As such, compared with work in traditional scientific disciplines, in the social sciences, more of an inferential leap is often required between the derivation of a score and its intended interpretation ( Opfer et al ., 2012 ).
Instruments designed to elicit evidence of variables of interest to DBERs have proliferated in recent years. Some well-known examples include the Experimental Design Ability Test (EDAT; Sirum and Humburg, 2011 ); the Genetics Concept Assessment ( Smith et al. , 2008 ); the Classroom Undergraduate Research Experience survey ( Denofrio et al. , 2007 ); and the Classroom Observation Protocol for Undergraduate STEM ( Smith et al. , 2013 ). However, available instruments vary widely in their quality and nuance ( Opfer et al. , 2012 ; Singer et al. , 2012 ; Campbell and Nehm, 2013 ), necessitating understanding on the part of DBERs of how to evaluate instruments for use in their research. Practitioners, too, should know how to evaluate and select high-quality instruments for program evaluation and/or assessment purposes. Where high-quality instruments do not already exist for use in one’s context, which is commonplace ( Opfer et al ., 2012 ), they need to be developed, and corresponding empirical validity evidence needs to be gathered in accord with contemporary standards.
In response, this Research Methods essay offers a contemporary social science perspective on test validity and the validation process. It is intended to offer a primer for DBERs who may not have received formal training on the subject. Using examples from discipline-based education research, the instructional piece explores the concepts of test validity, the validation process, validity evidence, and key threats to validity. The essay also includes an in-depth example of a validity argument and validation approach for a test of student argument analysis. In addition to DBERs, this essay should benefit practitioners (e.g., lab directors, faculty members) in the development, evaluation, and/or selection of instruments for their work assessing students or evaluating pedagogical innovations.
TEST VALIDITY AND THE TEST VALIDATION PROCESS
A test is a sample of behavior gathered in order to draw an inference about some domain or construct within a particular population (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education [AERA, APA, and NCME], 2014). In the social sciences, the domain about which an inference is desired is typically a latent (unobservable) variable. For example, the STEM GTA-Teaching Self-Efficacy Scale (DeChenne et al., 2012) was developed to support inferences about the degree to which a graduate teaching assistant believes he or she is capable of 1) cultivating an effective learning environment and 2) implementing particular instructional strategies. As another example, the inference drawn from an introductory biology final exam is typically about the degree to which a student understands content covered over some extensive unit of instruction. While beliefs or conceptual knowledge are not directly accessible, what can be observed is the sample of behavior the test elicits, such as test-taker responses to questions or responses to rating scales. Diverse forms of instrumentation are used in discipline-based education research (Singer et al., 2012). Notable subcategories of instruments include self-report (e.g., attitudinal and belief scales) and more objective measures (e.g., concept inventories, standardized observation protocols, and final exams). By the definition of “test” above, any of these instrument types can be conceived as tests—though the focus here is only on instruments that yield quantitative data, that is, scores.
The paramount consideration in the evaluation of any test’s quality is validity: “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” (Angoff, 1988; AERA, APA, and NCME, 2014, p. 11). In evaluating test validity, the focus is not on the test itself, but rather the proposed inference(s) drawn on the basis of the test’s score(s). Noteworthy in the validity definition above is that validity is a matter of degree (“the inferences supported by this test have a high or low degree of validity”), rather than a dichotomous character (e.g., “the inferences supported by this test are or are not valid”).
Assessment validation is theorized as an iterative process in which the test developer constructs an evidence-based argument for the intended test-based score interpretations in a particular population ( Kane, 1992 ; Messick, 1995 ). An example validity argument claim is that the test’s content (e.g., questions, items) is representative of the domain targeted by the test (e.g., body of knowledge/skills). With this argument-based approach, claims within the validity argument are substantiated with various forms of relevant evidence. Altogether, the goal of test validation is to accumulate over time a comprehensive body of relevant evidence to support each intended score interpretation within a particular population (i.e., whether the scores should in fact be interpreted to mean what the developer intends them to mean).
CATEGORIES OF TEST VALIDITY EVIDENCE
Historically, test validity theory in the social sciences recognized several categorically different “types” of validity (e.g., “content validity,” “criterion validity”). However, contemporary validity theory posits that test validity is a unitary (single) concept. Rather than providing evidence of each “type” of validity, the charge for test developers is to construct a cohesive argument for the validity of test score–based inferences that integrates different forms of validity evidence. The categories of validity evidence include evidence based on test content, evidence based on response processes, evidence based on internal structure, evidence based on relations with other variables, and evidence based on the consequences of testing ( AERA, APA, and NCME, 2014 ). Figure 1 provides a graphical representation of the categories and subcategories of validity evidence.
Figure 1. Categories of evidence used to argue for the validity of test score interpretations and uses (AERA, APA, and NCME, 2014).
Validity evidence based on test content concerns “the relationship between the content of a test and the construct it is intended to measure” (AERA, APA, and NCME, 2014, p. 14). Such validity evidence concerns the match between the domain purportedly measured by the test (e.g., diagnostic microscopy skills) and the content of the test (e.g., the specific slides examined by the test taker). For example, if a test is intended to elicit evidence of students’ understanding of the key principles of evolution by means of natural selection (e.g., variation, heredity, differential fitness), the test should fully represent those principles in the sample of behavior it elicits. As a concrete example from the literature, in the development of the Host-Pathogen Interaction (HPI) concept inventory, Marbach-Ad et al. (2009) explicitly mapped each test item to one of 13 HPI concepts intended to be assessed by their instrument. Content validity evidence alone is insufficient for establishing a high degree of validity; it should be combined with other forms of evidence to yield a strong evidence-based validity argument marked by relevancy, accuracy, and sufficiency.
In practice, providing validity evidence based on test content involves evaluating and documenting content representativeness. One standard approach to collecting evidence of content representativeness is to submit the test to external systematic review by subject matter–area experts (e.g., biology faculty) and to document such reviews (as well as revisions made on their basis). External reviews focus on the adequacy of the test’s overall elicited sample of behavior in representing the domain assessed and any corresponding subdomains, as well as the relevance or irrelevance of particular questions/items to the domain. We refer the reader to Webb (2006) for a comprehensive and sophisticated framework for evaluating different dimensions of domain–test content alignment.
Another approach to designing a test that supports and documents construct representativeness is to employ a “table of specifications” (e.g., Fives and DiDonato-Barnes, 2013 ). A table of specifications (or test blueprint) is a tool for designing a test that classifies test content along two dimensions: a content dimension and a cognitive dimension. The content dimension pertains to the different aspects of the construct one intends to measure. In a classroom setting, aspects of the construct are typically defined by behavioral/instructional objectives (e.g., “Students will analyze phylogenetic trees”). The cognitive dimension represents the level of cognitive processing or thinking called for by test components (e.g., knowledge, comprehension, analysis). Within a table of specifications, one indicates the number/percent of test questions or items for each aspect of the construct at each cognitive level. Often, one also provides a summary measure of the number of items pertaining to each content area (regardless of cognitive demand) and at each cognitive level (regardless of content). Instead of, or in addition to, the number of items, one can also indicate the number/percent of available points for each content area and cognitive level. Because a table of specifications indicates how test components represent the construct one intends to measure, it serves as one source of validity evidence based on test content. Table 1 presents an example table of specifications for a test concerning the principles of evolution by means of natural selection.
Table 1. Example table of specifications for an evolution by means of natural selection test, showing the number of test items pertaining to each content area at each cognitive level (columns) and the total number of items per content area and cognitive level

| Content (behavioral objective) | Comprehension | Application | Analysis | Total |
|---|---|---|---|---|
| 1. Students will define evolution by means of natural selection. | 1 | | | 1 |
| 2. Students will define key principles of evolution by means of natural selection (e.g., heredity, differential fitness). | 5 | | | 5 |
| 3. Students will compute measures of absolute and relative fitness. | | 5 | | 5 |
| 4. Students will compare evolution by means of natural selection with earlier evolution theories. | | | 3 | 3 |
| 5. Students will analyze phylogenetic trees. | | | 4 | 4 |
| Total | 6 | 5 | 7 | 18 |
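For readers who keep such a blueprint in electronic form, the marginal totals in a table of specifications are simple row and column sums. The following sketch illustrates the bookkeeping; the abbreviated content labels and function names are ours, and the counts mirror the example natural selection test above.

```python
# Sketch: computing the marginal totals of a table of specifications.
# Rows are content areas (behavioral objectives); columns are cognitive levels.
# Counts mirror the example natural selection test blueprint.

blueprint = {
    "1. Define evolution by natural selection":   {"Comprehension": 1},
    "2. Define key principles":                   {"Comprehension": 5},
    "3. Compute absolute and relative fitness":   {"Application": 5},
    "4. Compare with earlier evolution theories": {"Analysis": 3},
    "5. Analyze phylogenetic trees":              {"Analysis": 4},
}

def content_totals(bp):
    """Number of items per content area, regardless of cognitive level."""
    return {area: sum(cells.values()) for area, cells in bp.items()}

def cognitive_totals(bp):
    """Number of items per cognitive level, regardless of content area."""
    totals = {}
    for cells in bp.values():
        for level, n in cells.items():
            totals[level] = totals.get(level, 0) + n
    return totals

print(cognitive_totals(blueprint))  # {'Comprehension': 6, 'Application': 5, 'Analysis': 7}
print(sum(content_totals(blueprint).values()))  # 18 items in total
```

Either the number of items (as here) or available points can be tallied this way, matching the summary row and column of Table 1.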
Evidence of validity based on response processes concerns “the fit between the construct and the detailed nature of the performance or response actually engaged in by test takers” ( AERA, APA, and NCME, 2014 , p. 15). For example, if a test purportedly elicits evidence of undergraduate students’ critical evaluative thinking concerning evidence-based scientific arguments, then during the test the student should be engaged in the cognitive process of examining argument claims, evidence, and warrants, and the relevance, accuracy, and sufficiency of that evidence. Most often one gathers such evidence through respondent think-aloud procedures, in which respondents verbally explain and rationalize their thought processes and responses as they complete the test. One particular method commonly used by professional test vendors to gather response process–based validity evidence is the cognitive lab, which involves both concurrent and retrospective verbal reporting by respondents ( Willis, 1999 ; Zucker et al., 2004 ). As an example from the literature, developers of the HPI concept inventory asked respondents to provide open-ended responses to ensure that their reasons for selecting a particular response option (e.g., “B”) were consistent with the developers’ intentions, that is, that the student indeed held the particular alternative conception presented in response option B ( Marbach-Ad et al., 2009 ). Think-alouds are formalized via structured protocols, and the elicited data are recorded, transcribed, analyzed, and interpreted to shed light on validity.
Evidence based on internal structure concerns “the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based” ( AERA, APA, and NCME, 2014 , p. 16). 4 For instance, suppose a professor plans to teach one topic (eukaryotes) using small-group active-learning instruction and another topic (prokaryotes) through lecture instruction, and wants to make within-class comparisons of the effectiveness of these methods. As an outcome measure, a test may be designed to support inferences about the two specific aspects of biology content (e.g., characteristics of prokaryotic and eukaryotic cells). Collection of evidence based on internal structure seeks to confirm empirically whether the scores reflect the (in this case two) distinct domains targeted by the test ( Messick, 1995 ). In practice, one can formally establish the fidelity of test scores to their theorized internal structure through methodological techniques such as factor analysis, item response theory, and Rasch modeling ( Harman, 1960 ; Rasch, 1960 ; Embretson and Reise, 2013 ). With factor analysis, for example, item intercorrelations are analyzed to determine whether particular item responses cluster together, that is, whether scores from components of the test related to one aspect of the domain (e.g., questions about prokaryotes) are more interrelated with one another than they are with scores derived from other components of the test (e.g., questions about eukaryotes).
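The clustering logic behind such an internal-structure check can be illustrated with a toy computation. This is only an illustration of the idea, not a substitute for formal factor analysis; the item scores below are hypothetical and the helper function is ours.

```python
# Toy illustration of the clustering logic behind internal-structure checks:
# do items about the same topic correlate more with one another than with
# items about the other topic? (A stand-in for formal factor analysis.)
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical item scores for five students (1 = correct, 0 = incorrect):
# two prokaryote items and two eukaryote items.
prok1 = [1, 1, 0, 1, 0]
prok2 = [1, 1, 0, 1, 1]
euk1  = [0, 1, 1, 0, 1]
euk2  = [0, 1, 1, 0, 0]

within  = (pearson(prok1, prok2) + pearson(euk1, euk2)) / 2
between = (pearson(prok1, euk1) + pearson(prok2, euk2)) / 2
print(within > between)  # True here: items cohere by topic, as theorized
```

In a real analysis, the full item intercorrelation matrix would be submitted to factor analysis, and one would examine whether the recovered factors align with the theorized domains.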
Item response theory and Rasch models hypothesize that the probability of a particular response to a test item is a function of the respondent’s ability (in terms of what is being measured) and characteristics of the item (e.g., difficulty, discrimination, pseudo-guessing). Examining test score internal structure with such models involves examining whether these model-based predictions bear out in the observed data. A variety of such models exist for test questions with different response formats (or combinations thereof), such as the Rasch rating-scale model ( Andrich, 1978 ) and the Rasch partial-credit model ( Wright and Masters, 1982 ).
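In the simplest (one-parameter, or Rasch) case, the predicted probability of a correct response is a logistic function of the difference between respondent ability and item difficulty. A minimal sketch (the function name is ours):

```python
import math

def rasch_p(theta, b):
    """Rasch (one-parameter logistic) model: probability that a respondent
    with ability theta answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals item difficulty, the predicted probability is 0.5; it
# rises toward 1 as ability exceeds difficulty and falls toward 0 otherwise.
print(rasch_p(0.0, 0.0))                                            # 0.5
print(rasch_p(2.0, 0.0) > rasch_p(0.0, 0.0) > rasch_p(-2.0, 0.0))   # True
```

Internal-structure validation with such models asks whether these predicted probabilities bear out in observed response patterns; more elaborate models add item parameters such as discrimination and pseudo-guessing.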
Validity evidence based on relations with other variables concerns “the relationship of test scores to variables external to the test” ( AERA, APA, and NCME, 2014 , p. 16). The collection of this form of validity evidence centers on examining how test scores are related both to measures of the same or similar constructs and to measures of distinct, different constructs (respectively termed “convergent validity” and “discriminant validity” 5 evidence). In other words, such evidence pertains to whether scores relate to other variables as would be theoretically expected. For example, if a new self-report instrument purports to measure experimental design skills, scores should correlate highly with an existing measure of experimental design ability such as the EDAT ( Sirum and Humburg, 2011 ). On the other hand, scores derived from this self-report instrument should be considerably less correlated or uncorrelated with scores from a personality measure such as the Minnesota Multiphasic Personality Inventory ( Greene, 2000 ). As another discipline-based education research example, Nehm and Schonfeld (2008) collected discriminant validity evidence by relating scores from the Conceptual Inventory of Natural Selection (CINS) and the Open Response Instrument (ORI), both of which purport to assess understanding of and conceptions concerning natural selection, to scores from a geology test of knowledge about rocks.
A subcategory of evidence based on relations with other variables is evidence of test-criterion relationships, which concerns how test scores relate to some other, nontest indicator or outcome either at the same time (so-called concurrent validity evidence) or in the future (so-called predictive validity evidence). For instance, developers of a new biostatistics test might examine how scores from the test correlate, as expected, with professors’ judgments of student ability or with mathematics course grade point average at the same point in time; alternatively, the developers might follow tested individuals over time to examine how scores relate to the probability of successfully completing biostatistics course work. As another example, given prior research on self-efficacy, scores from instruments that probe teaching self-efficacy should be related to respondents’ levels of teacher training and experience ( Prieto and Altmaier, 1994 ; Prieto and Meyers, 1999 ).
Examination of how test scores relate (or not) to other variables as expected is often associational in nature (e.g., correlational analysis). There are also two other specific methods for eliciting such validity evidence. The first is to examine score differences between theoretically different groups, the “known groups method” (e.g., whether scientists’ and nonscientists’ scores on an experimental design test differ systematically on average). The second is to examine whether scores increase or decrease as expected in response to an intervention ( Hattie and Cooksey, 1984 ; AERA, APA, and NCME, 2014 ). For example, Marbach-Ad et al. (2009 , 2010 ) examined HPI concept inventory score differences between majors and nonmajors and between students in introductory and upper-level courses. To inform the collection of validity evidence based on relations with other variables, developers should consult the literature to formulate a theory of how good measures of the construct should relate to different variables. One should also note that the quality of such validity evidence hinges on the quality (e.g., validity) of the measures of the external variables.
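The known-groups logic can be sketched as a comparison of group means on hypothetical data. The scores below and the choice of effect-size summary are ours; a real study would apply an appropriate inferential test.

```python
# Sketch of the "known groups" method: scores from a group expected to be
# high on the construct (scientists) should exceed, on average, scores from
# a group expected to be lower (nonscientists). Data are hypothetical.
from statistics import mean, stdev

scientists    = [18, 17, 19, 16, 18, 20]
nonscientists = [12, 14, 11, 13, 15, 12]

diff = mean(scientists) - mean(nonscientists)

# A standardized effect size (Cohen's d with a pooled SD) aids interpretation
# of how large the group difference is relative to score variability.
pooled_sd = ((stdev(scientists) ** 2 + stdev(nonscientists) ** 2) / 2) ** 0.5
d = diff / pooled_sd

print(round(diff, 2), round(d, 2))  # positive difference, as theorized
```

A systematic, sizable difference in the theorized direction supports the intended score interpretation; no difference (or a reversed one) would count as disconfirming evidence.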
Finally, validity evidence based on the consequences of testing concerns the “soundness of proposed interpretations [of test scores] for their intended uses” ( AERA, APA, and NCME, 2014 , p. 19) and the value implications and social consequences of testing ( Messick, 1994 , 1995 ). Such evidence pertains to both the intended and unintended effects of test score interpretation and use ( Linn, 1991 ; Messick, 1995 ). Intended consequences of test use might include motivating students, better-targeted instruction, and populating a special program with only those students who need the program (if those are the intended purposes of test use). Unintended consequences might include a significant reduction in instructional time because of overly time-consuming test administration (assuming, of course, that this is not a desired outcome) or dropout among particular student populations because of an excessively difficult test administered early in a course. In K–12 settings, a classic example of an unintended consequence of testing is the “narrowing of the curriculum” that occurred as a result of the No Child Left Behind Act testing regime: faced with annual tests focused only on particular content areas (i.e., English/language arts and mathematics), schools and teachers focused more on tested content and less on nontested content such as science, social studies, art, and music (e.g., Berliner, 2011 ). Evidence based on the consequences of a test is often gathered via surveys, interviews, and focus groups administered to test users.
TEST VALIDITY ARGUMENT EXAMPLE
In this section, we provide an example validity argument for a test designed to elicit evidence of students’ skills in analyzing the elements of evidence-based scientific arguments. This hypothetical test presents text-based arguments concerning scientific topics (e.g., global climate change, natural selection) to students, who then directly interact with the texts to identify their elements (i.e., claims, reasons, and warrants). The test is intended to support inferences about 1) students’ overall evidence-based science argument-element analysis skills; 2) students’ skills in identifying particular evidence-based science argument elements (e.g., claims); and 3) errors made when students identify particular argument elements (e.g., evidence). Additionally, the test is intended to 4) support instructional decision-making to improve science teaching and learning. The validity argument claims undergirding this example test’s score interpretations and uses (and the categories of validity evidence advanced to substantiate each) are shown in Table 2 .
Table 2. Example validity argument and validation approach for a test of students’ ability to analyze the elements of evidence-based scientific arguments, showing argument claims and subclaims concerning the validity of the intended test score interpretations and uses, and the relevant validity evidence used to substantiate those claims

| Validity argument claims and subclaims | Relevant validity evidence based on |
|---|---|
| 1. The overall score represents a student’s current level of argument-element analysis skills, because: | – |
| a single higher-order construct (i.e., argument-element analysis skill) underlies all item responses. | Internal structure |
| the overall score is distinct from content knowledge and thinking dispositions. | Relations with other variables |
| the items represent a variety of arguments and argument elements. | Test content |
| items engage respondents in the cognitive process of argument-element analysis. | Response processes |
| the overall score is highly related to other argument analysis measures and less related to content knowledge and thinking disposition measures. | Relations with other variables |
| 2. A subscore (e.g., claim identification) represents a student’s current level of argument-element identification skill, because: | – |
| each subscore is distinct from other subscores and the total score (the internal structure is multidimensional and hierarchical). | Internal structure |
| the items represent a variety of arguments and particular argument elements (e.g., claims). | Test content |
| the subscore is distinct from content knowledge and thinking dispositions. | Relations with other variables |
| items engage respondents in the cognitive process of identifying a particular argument element (e.g., claims). | Response processes |
| subscores are highly related to other argument analysis measures and less related to content knowledge and thinking disposition measures. | Relations with other variables |
| 3. Error indicators can be interpreted to represent students’ current errors made in identifying particular argument elements, because when students misclassify an element in the task, they are making cognitive errors. | Response processes |
| 4. Use of the test will facilitate improved argument instruction and student learning, because: | – |
| teachers report that the test is useful and easy to use and have positive attitudes toward it. | Consequences of testing |
| teachers report using the test to improve argument instruction. | Consequences of testing |
| teachers report that the provided information is timely. | Consequences of testing |
| teachers learn about argumentation with test use. | Consequences of testing |
| students learn about argumentation with test use. | Consequences of testing |
| any unintended consequences of test use do not outweigh intended consequences. | Consequences of testing |
ANALYSIS OF CINS VALIDITY EVIDENCE
The example validity argument provided in the preceding section was intended to model the validity argument formulation process for readers who intend to develop an instrument. However, in many cases, an existing instrument (or one of several existing instruments) needs to be selected for use in one’s context. The use of an existing instrument for research or practice requires thoughtful analysis of extant validity evidence available for an instrument’s score interpretations and uses. Therefore, in this section, we use validity theory as outlined in the Standards for Educational and Psychological Testing to analyze the validity evidence for a particular instrument, the CINS.
As reported in Anderson et al . (2002) , the CINS is purported to measure “conceptual understanding of natural selection” (as well as alternative conceptions of particular relevant ideas diagnostically) in undergraduate non–biology majors before instruction (p. 953). In their initial publication of the instrument, the authors supplied several forms of validity evidence for the intended score interpretations and uses. In terms of validity evidence related to test content, the authors argued that test content was aligned with Mayr’s (1982) five facts and three inferences about evolution by means of natural selection, and two other key concepts, the origin of variation and the origin of species. Two test items were mapped to each of these 10 concepts. Similarly, distractor (incorrect) multiple-choice responses were based on theory and research about students’ nonscientific, or alternative, conceptions of these ideas. Content-related validity evidence was also provided through reviews of test items by biology professors.
Evidence based on test-taker response processes was elicited through cognitive interviews (think alouds) conducted with a small sample of students ( Anderson et al. , 2002 ). The authors provided validity evidence based on internal structure using principal components analysis, which is similar to factor analysis. In terms of validity evidence based on test-score relations with other variables, the authors examined correlations between CINS scores and scores derived from interviews. While Anderson and colleagues did note that a paper and pencil–based test would be more logistically feasible than interview-based assessment methods, validity evidence based on the consequences of testing was not formally provided.
Anderson et al.’s (2002) paper thus presented a variety of forms of validity evidence concerning the CINS. However, underscoring the continuous nature of test validation, subsequent work has built on that foundation and provided additional evidence. For example, in light of concerns that the primary earlier source of validity evidence was correlations between CINS scores and scores based on oral interviews in a very small sample, Nehm and Schonfeld (2008) provided additional validity evidence based on relations with other variables: they examined CINS score relations with two other instruments purported to assess the same construct (convergent validity evidence) and with a measure of an unrelated construct (discriminant validity evidence). Nehm and Schonfeld also expanded the body of CINS validity evidence based on internal structure by analyzing data using the Rasch model, and their reporting of CINS administration times shed light on possible consequences of testing. The evolution of validity evidence for the CINS speaks to the iterative and ongoing nature of instrument validation. With this in mind, future work might examine the internal structure of CINS scores vis-à-vis diagnostic classification models (see Rupp and Templin, 2008 ), since the CINS is purportedly a diagnostic test.
TEST VALIDITY THREATS
The two primary threats to test score validity are construct underrepresentation and construct-irrelevant variance. Construct underrepresentation is “the degree to which a test fails to capture important aspects of the construct” ( AERA, APA, and NCME, 2014 , p. 12). In other words, construct underrepresentation involves failing to elicit a representative sample of behavior from test takers (e.g., responses to multiple-choice questions) relative to the universe of possible relevant behaviors that might be observed. While it is neither necessary nor feasible to ask respondents to engage in every possible relevant behavior, it is crucial that the behavior sampled by the test be sufficiently representative of the construct at large. If a test does not fully and adequately sample behavior related to the targeted domain, the meaning of the test score would, in actuality, be narrower than intended.
Construct underrepresentation can be mitigated by initiating test design with a thorough analysis and conception of the domain targeted by the test ( Mislevy et al., 2003 ; Opfer et al., 2012 ). Knowledge of the construct, and of the variables that are or are not related to it, can also inform the validation process ( Mislevy et al., 2003 ). Beginning test design with a thorough conception of the construct one intends to measure is analogous to the course design approach known as “backward design”: one first identifies what one wants students to know and be able to do after instruction (learning outcomes) and then designs a course to get students there ( Wiggins and McTighe, 2005 ). Other strategies to promote construct representation include building a test based on a table of specifications, submitting a test to external expert content review (both noted above), and employing a sufficient number of test items to ensure good representation of domain content.
Besides construct underrepresentation, the other primary threat to test score validity is construct-irrelevant variance, that is, “the degree to which test scores are affected by processes that are extraneous to the test’s intended purpose” ( AERA, APA, and NCME, 2014 , p. 12). Construct-irrelevant variance is test score variation caused systematically by factors other than (or in addition to) those intended; in other words, part of the reason one received a “high” or “low” score is irrelevant to the construct. Two common examples are English-language skills affecting the scores of non–native English speakers on tests written in English, and computer skills affecting scores on tests administered via computer. Another example would be items on a science teaching self-efficacy self-report instrument written so generally that the scores represent not science teaching–specific self-efficacy but self-efficacy in general. It is critical to mitigate such threats through test design processes (e.g., minimizing a test’s linguistic load). One can often identify potential threats in the course of a thorough analysis of the construct/domain at early design stages. During test validation, one should also seek to disconfirm such threats (i.e., that scores are driven by irrelevant factors); practitioners often conduct factor, correlational, and differential item functioning analyses toward this end.
Systematic research on postsecondary science teaching and learning, and practitioners’ evaluation of local innovations, hinge on the availability and use of sound instrumentation. Unfortunately, the field of discipline-based education research lacks sufficient high-quality instruments for use in all of these efforts ( Opfer et al., 2012 ; Singer et al., 2012 ; Campbell and Nehm, 2013 ). Moreover, discipline-based education researchers (DBERs) and practitioners do not typically have formal training that equips them to evaluate and select existing instruments or to develop and validate their own for needed purposes. This essay reviewed contemporary test validity and validation theory for DBERs and practitioners in hopes of equipping them with such knowledge. 6
This essay was chiefly intended for two audiences: 1) those who will develop new instruments; and 2) those who will evaluate and select from among existing instruments. Here, we summarize the implications of this essay for members of these two populations. First, it behooves those developing and publishing their own instruments to explicitly frame, construct, and report an evidence-based validity argument for their proposed instruments’ intended score interpretations and uses. This argument should rely on multiple forms of validity evidence and specify the test-taker and user populations to which it pertains. If faced with space constraints in journal articles, test manuals or technical reports can be written to detail such validity evidence and made available to the scholarly community.
Like any argument, an evidence-based argument formulated during test validation should be characterized by relevancy, accuracy, and sufficiency. As such, validity arguments should be held up to scientific scrutiny before a test’s operational use. The quality of a validity argument hinges on a number of factors discussed in this essay. Examples include the alignment of the validity argument claims with intended test score interpretations and uses; the representativeness of the samples from which validity evidence is gathered to the intended test-taker population; the relevance of the expertise held by content reviewers; and the technical quality of external measures. A final point to emphasize is that validation is an ongoing process; additional validity evidence may need to be gathered as theory concerning a construct evolves or as the community seeks to use an instrument with new populations.
Second, potential test users should be critical in their evaluation of existing instruments and should not merely assume that a strong validity argument exists for an instrument’s score interpretations and uses with a particular population. Potential users should look to the instrumentation (or methods) sections of published articles for key information, such as whether the test was developed based on a sound theoretical conception of the construct, whether the test underwent external content review, and whether scores correlate with other measures as they theoretically should, among other things. One may have to contact an author for such information. Altogether, such practices should advance the quality of measurement within the realm of discipline-based education research.
Acknowledgments
The authors thank Drs. Beth Schussler and Ross Nehm and two anonymous reviewers for their constructive feedback on an earlier version of this article.
1 A test cannot be “stamped” valid for all purposes and test-taker populations; validity evidence needs to be gathered with respect to all intended instrument uses.
2 While other key dimensions for evaluating an instrument’s quality include reliability (i.e., test score consistency) and utility (i.e., feasibility; AERA, APA, and NCME, 2014 ), the focus here is on validity only.
3 While this essay allies with test validity theory as codified in the Standards for Educational and Psychological Testing ( AERA, APA, and NCME, 2014 ), the reader will note that there are alternative conceptions of validity as well ( Lissitz and Samuelsen, 2007 ).
4 This source of evidence has been termed “substantive validity” ( Messick, 1995 ).
5 This is not to be confused with item discrimination, a test item property pertaining to how an item’s scores relate to overall test performance.
6 While our focus is on instruments comprising sets of questions or items intended to elicit evidence of a particular construct or constructs, many of the ideas here apply also to questionnaire (survey) validation. For example, the developer of a questionnaire may interrogate how respondents interpret and formulate a response to a particular question as validity evidence based on response processes.
- American Association for the Advancement of Science. Vision and Change in Undergraduate Biology Education: A Call to Action. Washington, DC: 2011.
- American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: 2014.
- Anderson DL, Fisher KM, Norman GJ. Development and evaluation of the conceptual inventory of natural selection. J Res Sci Teach. 2002;39:952–978.
- Andrich DA. A rating formulation for ordered response categories. Psychometrika. 1978;43:561–573.
- Angoff WH. Validity: an evolving concept. In: Wainer H, Braun H, editors. Test Validity. Hillsdale, NJ: Erlbaum; 1988. pp. 19–32.
- Auchincloss LC, Laursen SL, Branchaw JL, Eagan K, Graham M, Hanauer DI, Lawrie G, McLinn CM, Palaez N, Rowland S, et al. Assessment of course-based undergraduate research experiences: a meeting report. CBE Life Sci Educ. 2014;13:29–40.
- Berliner D. Rational responses to high stakes testing: the case of curriculum narrowing and the harm that follows. Cambridge J Educ. 2011;41:287–302.
- Bollen KA. Latent variables in psychology and the social sciences. Annu Rev Psychol. 2002;53:605–634.
- Boyer Commission on Educating Undergraduates in the Research University. Reinventing Undergraduate Education: A Blueprint for America’s Research Universities. Stony Brook: State University of New York; 1998.
- Brownell SE, Tanner KD. Barriers to faculty pedagogical change: lack of training, time, incentives, and … tensions with professional identity. CBE Life Sci Educ. 2012;11:339–346.
- Campbell CE, Nehm RH. A critical analysis of assessment quality in genomics and bioinformatics education research. CBE Life Sci Educ. 2013;12:530–541.
- DeChenne SE, Enochs LG, Needham M. Science, technology, engineering, and mathematics graduate teaching assistants teaching self-efficacy. J Scholarship Teach Learn. 2012;12:102–123.
- Denofrio LA, Russell B, Lopatto D, Lu Y. Linking student interests to science curricula. Science. 2007;318:1872–1873.
- Embretson SE, Reise SP. Item Response Theory. Mahwah, NJ: 2013.
- Fives H, DiDonato-Barnes N. Classroom test construction: the power of a table of specifications. Pract Assess Res Eval. 2013;18:2–7.
- Freeman S, Eddy SL, McDonough M, Smith MK, Wenderoth MP, Okoroafor N, Jordt H. Active learning increases student performance in science, engineering, and mathematics. Proc Natl Acad Sci USA. 2014;111:8410–8415.
- Greene RL. The MMPI-2: An Interpretive Manual. Boston, MA: 2000.
- Harman HH. Modern Factor Analysis. Oxford, UK: 1960.
- Hattie J, Cooksey RW. Procedures for assessing the validities of tests using the “known-groups” method. Appl Psychol Meas. 1984;8:295–305.
- Kane MT. An argument-based approach to validity. Psychol Bull. 1992;112:527–535.
- Linn RL. Complex, performance-based assessment: expectations and validation criteria. Educ Researcher. 1991;20:15–21.
- Lissitz RW, Samuelsen K. A suggested change in terminology and emphasis regarding validity and education. Educ Researcher. 2007;36:437–448.
- Marbach-Ad G, Briken V, El-Sayed NM, Frauwirth K, Fredericksen B, Hutcheson S, Gao LY, Joseph S, Lee VT, McIver KS, et al. Assessing student understanding of host pathogen interactions using a concept inventory. J Microbiol Biol Educ. 2009;10:43–50.
- Marbach-Ad G, McAdams K, Benson S, Briken V, Cathcart L, Chase M, El-Sayed N, Frauwirth K, Fredericksen B, Joseph S, et al. A model for using a concept inventory as a tool for students’ assessment and faculty professional development. CBE Life Sci Educ. 2010;9:408–436.
- Mayr E. The Growth of Biological Thought: Diversity, Evolution and Inheritance. Cambridge, MA: Harvard University Press; 1982.
- Messick S. The interplay of evidence and consequences in the validation of performance assessments. Educ Researcher. 1994;23:13–23.
- Messick S. Standards of validity and the validity of standards in performance assessment. Educ Meas. 1995;14:5–8.
- Mislevy RJ, Steinberg LS, Almond RG. On the structure of educational assessments. Measurement. 2003;1:3–62.
- National Research Council. BIO2010: Transforming Undergraduate Education for Future Research Biologists. Washington, DC: National Academies Press; 2003.
- Nehm RH, Schonfeld IS. Measuring knowledge of natural selection: a comparison of the CINS, an open-response instrument, and an oral interview. J Res Sci Teach. 2008;45:1131–1160.
- Opfer JE, Nehm RH, Ha M. Cognitive foundations for science assessment design: knowing what students know about evolution. J Res Sci Teach. 2012;49:744–777.
- Prieto LR, Altmaier EM. The relationship of prior training and previous teaching experience to self-efficacy among graduate teaching assistants. Res High Educ. 1994;35:481–497.
- Prieto LR, Meyers SA. Effects of training and supervision on the self-efficacy of psychology graduate teaching assistants. Teach Psychol. 1999;26:264–266.
- Rasch G. Probabilistic Models for Some Intelligence and Achievement Tests. Copenhagen: Danish Institute for Educational Research; 1960.
- Rupp AA, Templin JL. Unique characteristics of diagnostic classification models: a comprehensive review of the current state-of-the-art. Measurement. 2008;6:219–262.
- Schussler EE, Reed Q, Marbach-Ad G, Miller K, Ferzli M. Preparing biology graduate teaching assistants for their roles as instructors: an assessment of institutional approaches. CBE Life Sci Educ. 2015;14:ar31.
- Singer SR, Nielsen NR, Schweingruber HA. Discipline-based Education Research: Understanding and Improving Learning in Undergraduate Science and Engineering. Washington, DC: National Academies Press; 2012.
- Sirum K, Humburg J. The Experimental Design Ability Test (EDAT). Bioscene. 2011;37:8–16.
- Smith MK, Jones FHM, Gilbert SL, Wieman CE. The Classroom Observation Protocol for Undergraduate STEM (COPUS): A new instrument to characterize university STEM classroom practices. CBE Life Sci Educ. 2013; 12 :618–627. [ PMC free article ] [ PubMed ] [ Google Scholar ]
- Smith MK, Wood WB, Knight JK. The Genetics Concept Assessment: a new concept inventory for gauging student understanding of genetics. CBE Life Sci Educ. 2008; 7 :422–430. [ PMC free article ] [ PubMed ] [ Google Scholar ]
- Webb N. Identifying content for student achievement tests. In: Downing SM, Haladyna TM, editors. Handbook of Test Development. Mahwah, NJ: Erlbaum; 2006. pp. 155–180. [ Google Scholar ]
- Wiggins GP, McTighe J. Understanding by Design. Alexandria, VA: 2005. [ Google Scholar ]
- Willis GB. Cognitive Interviewing: A “How To” Guide. Research Triangle Park, NC: 1999. [ Google Scholar ]
- Wright BD, Masters GN. Rating Scale Analysis. Chicago, IL: MESA; 1982. [ Google Scholar ]
- Zucker S, Sassman S, Case BJ. Cognitive Labs. 2004. http://images.pearsonassessments.com/images/tmrs/tmrs_rg/CognitiveLabs.pdf (accessed 29 August 2015)
Reliability vs. Validity in Research | Difference, Types and Examples
Published on July 3, 2019 by Fiona Middleton . Revised on June 22, 2023.
Reliability and validity are concepts used to evaluate the quality of research. They indicate how well a method, technique, or test measures something. Reliability is about the consistency of a measure, and validity is about the accuracy of a measure.
It’s important to consider reliability and validity when you are creating your research design , planning your methods, and writing up your results, especially in quantitative research . Failing to do so can lead to several types of research bias and seriously affect your work.
| | Reliability | Validity |
|---|---|---|
| What does it tell you? | The extent to which the results can be reproduced when the research is repeated under the same conditions. | The extent to which the results really measure what they are supposed to measure. |
| How is it assessed? | By checking the consistency of results across time, across different observers, and across parts of the test itself. | By checking how well the results correspond to established theories and other measures of the same concept. |
| How do they relate? | A reliable measurement is not always valid: the results might be reproducible, but they're not necessarily correct. | A valid measurement is generally reliable: if a test produces accurate results, they should be reproducible. |
Table of contents
- Understanding reliability vs validity
- How are reliability and validity assessed?
- How to ensure validity and reliability in your research
- Where to write about reliability and validity in a thesis
- Other interesting articles
Reliability and validity are closely related, but they mean different things. A measurement can be reliable without being valid. However, if a measurement is valid, it is usually also reliable.
What is reliability?
Reliability refers to how consistently a method measures something. If the same result can be consistently achieved by using the same methods under the same circumstances, the measurement is considered reliable.
What is validity?
Validity refers to how accurately a method measures what it is intended to measure. If research has high validity, that means it produces results that correspond to real properties, characteristics, and variations in the physical or social world.
High reliability is one indicator that a measurement is valid. If a method is not reliable, it probably isn’t valid.
For example, suppose you use a thermometer to measure the temperature of a liquid sample. If the thermometer shows different temperatures each time, even though you have carefully controlled conditions to ensure the sample's temperature stays the same, the thermometer is probably malfunctioning, and therefore its measurements are not valid.
However, reliability on its own is not enough to ensure validity. Even if a test is reliable, it may not accurately reflect the real situation.
Validity is harder to assess than reliability, but it is even more important. To obtain useful results, the methods you use to collect data must be valid: the research must be measuring what it claims to measure. This ensures that your discussion of the data and the conclusions you draw are also valid.
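The thermometer analogy can be made concrete with a few lines of code. In the sketch below, the readings and the 5 °C calibration bias are invented for illustration: repeated measurements that cluster tightly are reliable, but if they cluster around the wrong value, they are not valid.

```python
import statistics

# Hypothetical thermometer with a constant calibration bias:
# the true sample temperature is 25.0 degrees C, but it reads about 30.
true_temp = 25.0
readings = [30.1, 29.9, 30.0, 30.2, 29.8]  # repeated measurements

spread = statistics.stdev(readings)           # small spread -> reliable
bias = statistics.mean(readings) - true_temp  # large offset -> not valid

print(f"spread = {spread:.2f}, bias = {bias:.2f}")
```

Here the spread is about 0.16 while the bias is 5.0: the instrument is highly consistent (reliable) yet systematically wrong (not valid).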
Reliability can be estimated by comparing different versions of the same measurement. Validity is harder to assess, but it can be estimated by comparing the results to other relevant data or theory. Methods of estimating reliability and validity are usually split up into different types.
Types of reliability
Different types of reliability can be estimated through various statistical methods.
| Type of reliability | What does it assess? | Example |
|---|---|---|
| Test-retest | The consistency of a measure across time: do you get the same results when you repeat the measurement? | A group of participants complete a questionnaire designed to measure personality traits. If they repeat the questionnaire days, weeks or months apart and give the same answers, this indicates high test-retest reliability. |
| Interrater | The consistency of a measure across raters or observers: do you get the same results when different people conduct the same measurement? | Based on an assessment criteria checklist, five examiners submit substantially different results for the same student project. This indicates that the assessment checklist has low interrater reliability (for example, because the criteria are too subjective). |
| Internal consistency | The consistency of the measurement itself: do you get the same results from different parts of a test that are designed to measure the same thing? | You design a questionnaire to measure self-esteem. If you randomly split the results into two halves, there should be a strong correlation between the two sets of results. If the two results are very different, this indicates low internal consistency. |
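The test-retest type above boils down to correlating two administrations of the same measure. A minimal Python sketch with invented participant scores (the data and the idea that values near 1 indicate high reliability are illustrative, not a fixed standard):

```python
import statistics

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores for six participants measured twice, weeks apart.
time1 = [12, 15, 11, 18, 14, 16]
time2 = [13, 15, 10, 19, 14, 17]

r = pearson(time1, time2)
print(f"test-retest r = {r:.2f}")
```

The same correlation applied to two halves of one test gives a split-half estimate of internal consistency.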
Types of validity
The validity of a measurement can be estimated based on three main types of evidence. Each type can be evaluated through expert judgement or statistical methods.
| Type of validity | What does it assess? | Example |
|---|---|---|
| Construct validity | The adherence of a measure to existing theory and knowledge of the concept being measured. | A self-esteem questionnaire could be assessed by measuring other traits known or assumed to be related to the concept of self-esteem (such as social skills). Strong correlation between the scores for self-esteem and associated traits would indicate high construct validity. |
| Content validity | The extent to which the measurement covers all aspects of the concept being measured. | A test that aims to measure a class of students' level of Spanish contains reading, writing and speaking components, but no listening component. Experts agree that listening comprehension is an essential aspect of language ability, so the test lacks content validity for measuring the overall level of ability in Spanish. |
| Criterion validity | The extent to which the result of a measure corresponds to other valid measures of the same concept. | A survey is conducted to measure the political opinions of voters in a region. If the results accurately predict the later outcome of an election in that region, this indicates that the survey has high criterion validity. |
To assess the validity of a cause-and-effect relationship, you also need to consider internal validity (the design of the experiment ) and external validity (the generalizability of the results).
The reliability and validity of your results depends on creating a strong research design , choosing appropriate methods and samples, and conducting the research carefully and consistently.
Ensuring validity
If you use scores or ratings to measure variations in something (such as psychological traits, levels of ability or physical properties), it’s important that your results reflect the real variations as accurately as possible. Validity should be considered in the very earliest stages of your research, when you decide how you will collect your data.
- Choose appropriate methods of measurement
Ensure that your method and measurement technique are high quality and targeted to measure exactly what you want to know. They should be thoroughly researched and based on existing knowledge.
For example, to collect data on a personality trait, you could use a standardized questionnaire that is considered reliable and valid. If you develop your own questionnaire, it should be based on established theory or findings of previous studies, and the questions should be carefully and precisely worded.
- Use appropriate sampling methods to select your subjects
To produce valid and generalizable results, clearly define the population you are researching (e.g., people from a specific age range, geographical location, or profession). Ensure that you have enough participants and that they are representative of the population. Failing to do so can lead to sampling bias and selection bias .
Ensuring reliability
Reliability should be considered throughout the data collection process. When you use a tool or technique to collect data, it’s important that the results are precise, stable, and reproducible .
- Apply your methods consistently
Plan your method carefully to make sure you carry out the same steps in the same way for each measurement. This is especially important if multiple researchers are involved.
For example, if you are conducting interviews or observations , clearly define how specific behaviors or responses will be counted, and make sure questions are phrased the same way each time. Failing to do so can lead to errors such as omitted variable bias or information bias .
- Standardize the conditions of your research
When you collect your data, keep the circumstances as consistent as possible to reduce the influence of external factors that might create variation in the results.
For example, in an experimental setup, make sure all participants are given the same information and tested under the same conditions, preferably in a properly randomized setting. Failing to do so can lead to a placebo effect , Hawthorne effect , or other demand characteristics . If participants can guess the aims or objectives of a study, they may attempt to act in more socially desirable ways.
It's appropriate to discuss reliability and validity in various sections of your thesis, dissertation, or research paper. Showing that you have taken them into account in planning your research and interpreting the results makes your work more credible and trustworthy.
| Section | Discuss |
|---|---|
| Literature review | What have other researchers done to devise and improve methods that are reliable and valid? |
| Methodology | How did you plan your research to ensure reliability and validity of the measures used? This includes the chosen sample set and size, sample preparation, external conditions and measuring techniques. |
| Results | If you calculate reliability and validity, state these values alongside your main results. |
| Discussion | This is the moment to talk about how reliable and valid your results actually were. Were they consistent, and did they reflect true values? If not, why not? |
| Conclusion | If reliability and validity were a big problem for your findings, it might be helpful to mention this here. |
If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.
- Normal distribution
- Degrees of freedom
- Null hypothesis
- Discourse analysis
- Control groups
- Mixed methods research
- Non-probability sampling
- Quantitative research
- Ecological validity
Research bias
- Rosenthal effect
- Implicit bias
- Cognitive bias
- Selection bias
- Negativity bias
- Status quo bias
Middleton, F. (2023, June 22). Reliability vs. Validity in Research | Difference, Types and Examples. Scribbr. Retrieved August 27, 2024, from https://www.scribbr.com/methodology/reliability-vs-validity/
Reliability and Validity – Definitions, Types & Examples
Published by Alvin Nicolas on August 16th, 2021. Revised on October 26, 2023.
A researcher must evaluate the collected data before drawing any conclusions. Every research design needs to be concerned with reliability and validity, which together indicate the quality of the research.
What is Reliability?
Reliability refers to the consistency of the measurement: it shows how trustworthy the score of the test is. If the collected data shows the same results after being tested using various methods and sample groups, the information is reliable. Reliability is a precondition for validity, but it does not by itself make the results valid.
Example: If you weigh yourself on a weighing scale throughout the day, you’ll get the same results. These are considered reliable results obtained through repeated measures.
Example: If a teacher conducts the same math test of students and repeats it next week with the same questions. If she gets the same score, then the reliability of the test is high.
What is Validity?
Validity refers to the accuracy of the measurement. Validity shows how a specific test is suitable for a particular situation. If the results are accurate according to the researcher’s situation, explanation, and prediction, then the research is valid.
If the method of measuring is accurate, then it will produce accurate results. A method that is not reliable cannot be valid; however, a reliable method is not automatically valid.
Example: Your weighing scale shows different results each time you weigh yourself within a day, even after handling it carefully and weighing before and after meals. Your weighing machine might be malfunctioning. This means your method has low reliability, and the inaccurate, inconsistent results it produces are not valid.
Example: Suppose a questionnaire is distributed among a group of people to check the quality of a skincare product, and the same questionnaire is repeated with many groups. If you get the same responses from the various participants, the questionnaire has high reliability, which is a necessary foundation for its validity.
Most of the time, validity is difficult to measure even when the process of measurement is reliable, because it is hard to establish whether the scores reflect the real situation.
Example: If the weighing scale shows the same result, let's say 70 kg, each time, even though your actual weight is 55 kg, then the weighing scale is malfunctioning. It shows consistent results, so it is reliable, but because those results are far from your true weight, it is not valid.
Internal Vs. External Validity
One of the key features of randomised designs is that they have high internal and external validity.
Internal validity is the ability to draw a causal link between your treatment and the dependent variable of interest. It means the observed changes should be due to the experiment conducted, and no external factor should influence the variables.
Examples of such external factors: age, level, height, and grade.
External validity is the ability to identify and generalise your study outcomes to the population at large. The relationship between the study’s situation and the situations outside the study is considered external validity.
Threats to Internal Validity
| Threat | Definition | Example |
|---|---|---|
| Confounding factors | Unexpected events during the experiment that are not part of the treatment. | You attribute the increased weight of your participants to a lack of physical activity, but it was actually due to their consumption of coffee with sugar. |
| Maturation | Changes in participants over the passage of time that influence the outcome. | During a long-term experiment, subjects may become tired, bored, or hungry. |
| Testing | The results of one test affect the results of another test. | Participants of the first experiment may react differently during the second experiment. |
| Instrumentation | Changes in the instrument's calibration during the study. | A change in the measuring instrument's calibration may give different results than expected. |
| Statistical regression | Groups selected for their extreme scores are not as extreme on subsequent testing. | Students who failed the pre-final exam are likely to score better in the final exam, simply because extreme scores tend to move toward the average on retesting. |
| Selection bias | Choosing comparison groups without randomisation. | A group of trained and efficient teachers is selected to teach children communication skills, instead of selecting teachers randomly. |
| Experimental mortality | Participants leave the experiment before it ends. | When an experiment runs longer than expected, participants may drop out because they are dissatisfied with the time extension, even if they were doing well. |
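The statistical-regression threat is easy to demonstrate by simulation: if each score is a stable ability plus independent luck, a group selected for extremely low scores will, on average, score higher next time with no intervention at all. A sketch with made-up parameters (ability mean 70, luck standard deviation 10):

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

# Each student's exam score = stable "true ability" + independent luck.
true_ability = [random.gauss(70, 10) for _ in range(1000)]
exam1 = [a + random.gauss(0, 10) for a in true_ability]
exam2 = [a + random.gauss(0, 10) for a in true_ability]

# Select the students who did worst on exam 1 (e.g. "failed the pre-final").
failed = [i for i, score in enumerate(exam1) if score < 55]

mean1 = sum(exam1[i] for i in failed) / len(failed)
mean2 = sum(exam2[i] for i in failed) / len(failed)
print(f"failing group: exam1 mean = {mean1:.1f}, exam2 mean = {mean2:.1f}")
# The group's exam-2 mean drifts back toward 70 with no treatment at all --
# an improvement that could be wrongly credited to any intervention in between.
```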
Threats to External Validity

| Threat | Definition | Example |
|---|---|---|
| Reactive/interactive effects of testing | The pre-test makes participants aware of the upcoming experiment, and the treatment may not be effective without the pre-test. | Students who failed the pre-final exam may become more confident and conscientious as a result, and are therefore likely to pass the final exam. |
| Selection of participants | A group of participants is selected with specific characteristics, and the treatment of the experiment may work only on participants possessing those characteristics. | If an experiment is conducted specifically on the health issues of pregnant women, the same treatment cannot be applied to male participants. |
How to Assess Reliability and Validity?
Reliability can be measured by checking the consistency of the procedure and its results. There are various methods to measure reliability and validity. Reliability can be measured through various statistical methods depending on the type of reliability, as explained below:
Types of Reliability
| Type of reliability | What does it measure? | Example |
|---|---|---|
| Test-retest | It measures the consistency of the results at different points in time, identifying whether the results are the same after repeated measures. | Suppose a questionnaire is distributed among a group of people to check the quality of a skincare product, and the same questionnaire is repeated with many groups. If you get the same responses from the various participants, the questionnaire has high test-retest reliability. |
| Inter-rater | It measures the consistency of the results when the same thing is rated at the same time by different raters (researchers). | Suppose five researchers measure the academic performance of the same student, incorporating various questions from all the academic subjects, and submit substantially different results. This shows that the assessment has low inter-rater reliability. |
| Parallel forms | It measures equivalence, using different forms of the same test performed on the same participants. | Suppose the same researcher conducts two different forms of a test on the same topic with the same students, such as a written and an oral test. If the results are the same, the parallel-forms reliability of the test is high; otherwise it is low. |
| Internal consistency | It measures the consistency of the items within the measurement. | The results of the same test are split into two halves and compared with each other. If there is a large difference between the results, the internal consistency of the test is low. |
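Internal consistency, in the last row, is commonly quantified with Cronbach's alpha, which compares item-level variance to total-score variance. A self-contained sketch with an invented 5-respondent, 4-item questionnaire:

```python
# Hypothetical responses: rows = 5 respondents, columns = 4 questionnaire items.
scores = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [4, 4, 5, 4],
]

def variance(xs):
    """Sample variance (divides by n - 1)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

k = len(scores[0])                                  # number of items
item_vars = [variance([row[i] for row in scores]) for i in range(k)]
total_var = variance([sum(row) for row in scores])  # variance of total scores

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")  # here: 0.93, indicating high consistency
```

Values of alpha closer to 1 indicate that the items vary together and so are measuring the same underlying construct consistently.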
Types of Validity
As discussed above, the reliability of a measurement alone cannot determine its validity, and validity is difficult to measure even if the method is reliable. The following types of tests are conducted to measure validity:
| Type of validity | What does it measure? | Example |
|---|---|---|
| Content validity | It shows whether all aspects of the test/measurement are covered. | A language test designed to measure reading, writing, listening, and speaking skills covers all essential components of language ability, indicating that the test has high content validity. |
| Face validity | It concerns whether a test or procedure appears, on the face of it, to measure what it is supposed to. | The type of questions included in the question paper, the time and marks allotted, and the number and categories of questions: do they make it look like a good paper for measuring the academic performance of students? |
| Construct validity | It shows whether the test is measuring the correct construct (ability, attribute, trait, or skill). | Is a test conducted to measure communication skills actually measuring communication skills? |
| Criterion validity | It shows whether the test scores obtained are similar to other measures of the same concept. | The results obtained from a pre-final exam accurately predict the results of the later final exam, showing that the test has high criterion validity. |
How to Increase Reliability?
- Use an appropriate questionnaire to measure the competency level.
- Ensure a consistent environment for participants.
- Make the participants familiar with the criteria of assessment.
- Train the participants appropriately.
- Analyse the research items regularly to avoid poor performance.
How to Increase Validity?
Ensuring validity is also not an easy job. Some ways to improve it are given below:
- Minimise reactivity as a first concern.
- Reduce the Hawthorne effect.
- Keep the respondents motivated.
- Avoid lengthy intervals between the pre-test and post-test.
- Keep dropout rates low.
- Ensure inter-rater reliability.
- Match the control and experimental groups with each other.
How to Implement Reliability and Validity in your Thesis?
According to experts, it is helpful to address reliability and validity explicitly in your write-up, and these concepts receive particular attention in theses and dissertations. Discuss how you ensured both when describing your methodology, report any reliability and validity measures alongside your results, and reflect on them in your discussion.