
Validity – Types, Examples and Guide


Validity

Validity is a fundamental concept in research, referring to the extent to which a test, measurement, or study accurately reflects or assesses the specific concept that the researcher is attempting to measure. Ensuring validity is crucial as it determines the trustworthiness and credibility of the research findings.

Research Validity

Research validity pertains to the accuracy and truthfulness of the research. It examines whether the research truly measures what it claims to measure. Without validity, research results can be misleading or erroneous, leading to incorrect conclusions and potentially flawed applications.

How to Ensure Validity in Research

Ensuring validity in research involves several strategies:

  • Clear Operational Definitions : Define variables clearly and precisely.
  • Use of Reliable Instruments : Employ measurement tools that have been tested for reliability.
  • Pilot Testing : Conduct preliminary studies to refine the research design and instruments.
  • Triangulation : Use multiple methods or sources to cross-verify results.
  • Control Variables : Control extraneous variables that might influence the outcomes.

Types of Validity

Validity is categorized into several types, each addressing different aspects of measurement accuracy.

Internal Validity

Internal validity refers to the degree to which the results of a study can be attributed to the treatments or interventions rather than other factors. It is about ensuring that the study is free from confounding variables that could affect the outcome.

External Validity

External validity concerns the extent to which the research findings can be generalized to other settings, populations, or times. High external validity means the results are applicable beyond the specific context of the study.

Construct Validity

Construct validity evaluates whether a test or instrument measures the theoretical construct it is intended to measure. It involves ensuring that the test is truly assessing the concept it claims to represent.

Content Validity

Content validity examines whether a test covers the entire range of the concept being measured. It ensures that the test items represent all facets of the concept.

Criterion Validity

Criterion validity assesses how well scores on a measure relate to an external criterion or outcome. It is divided into two types (a short illustrative sketch follows the list below):

  • Predictive Validity : How well a test predicts future performance.
  • Concurrent Validity : How well a test correlates with a currently existing measure.
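
To make the distinction concrete, here is a minimal sketch, using made-up numbers and Python's scipy library, of how criterion validity is typically quantified: by correlating test scores with an external criterion (a later criterion for predictive validity, a simultaneous one for concurrent validity). The data and variable names are hypothetical.

```python
# Criterion validity as a correlation between test scores and an external
# criterion (all numbers below are made up for illustration).
from scipy.stats import pearsonr

# Hypothetical selection-test scores and later job-performance ratings for
# the same ten people. A criterion collected later illustrates predictive
# validity; one collected at the same time would illustrate concurrent validity.
test_scores = [62, 75, 58, 90, 81, 70, 66, 85, 73, 78]
job_ratings = [3.1, 3.8, 2.9, 4.6, 4.2, 3.5, 3.2, 4.4, 3.6, 4.0]

r, p = pearsonr(test_scores, job_ratings)
print(f"criterion validity coefficient r = {r:.2f} (p = {p:.3f})")
```

A large positive correlation would support the test's criterion validity for that purpose; a weak one would count against it.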

Face Validity

Face validity refers to the extent to which a test appears to measure what it is supposed to measure, based on superficial inspection. While it is the least scientific measure of validity, it is important for ensuring that stakeholders believe in the test’s relevance.

Importance of Validity

Validity is crucial because it directly affects the credibility of research findings. Valid results ensure that conclusions drawn from research are accurate and can be trusted. This, in turn, influences the decisions and policies based on the research.

Examples of Validity

  • Internal Validity : A randomized controlled trial (RCT) where the random assignment of participants helps eliminate biases.
  • External Validity : A study on educational interventions that can be applied to different schools across various regions.
  • Construct Validity : A psychological test that accurately measures depression levels.
  • Content Validity : An exam that covers all topics taught in a course.
  • Criterion Validity : A job performance test that predicts future job success.

Where to Write About Validity in a Thesis

In a thesis, the methodology section should include discussions about validity. Here, you explain how you ensured the validity of your research instruments and design. Additionally, you may discuss validity in the results section, interpreting how the validity of your measurements affects your findings.

Applications of Validity

Validity has wide applications across various fields:

  • Education : Ensuring assessments accurately measure student learning.
  • Psychology : Developing tests that correctly diagnose mental health conditions.
  • Market Research : Creating surveys that accurately capture consumer preferences.

Limitations of Validity

While ensuring validity is essential, it has its limitations:

  • Complexity : Achieving high validity can be complex and resource-intensive.
  • Context-Specific : Some validity types may not be universally applicable across all contexts.
  • Subjectivity : Certain types of validity, like face validity, involve subjective judgments.

By understanding and addressing these aspects of validity, researchers can enhance the quality and impact of their studies, leading to more reliable and actionable results.

About the author


Muhammad Hassan

Researcher, Academic Writer, Web developer


Validity In Psychology Research: Types & Examples

Saul McLeod, PhD

Editor-in-Chief for Simply Psychology

BSc (Hons) Psychology, MRes, PhD, University of Manchester

Saul McLeod, PhD., is a qualified psychology teacher with over 18 years of experience in further and higher education. He has been published in peer-reviewed journals, including the Journal of Clinical Psychology.


Olivia Guy-Evans, MSc

Associate Editor for Simply Psychology

BSc (Hons) Psychology, MSc Psychology of Education

Olivia Guy-Evans is a writer and associate editor for Simply Psychology. She has previously worked in healthcare and educational sectors.

In psychology research, validity refers to the extent to which a test or measurement tool accurately measures what it’s intended to measure. It ensures that the research findings are genuine and not due to extraneous factors.

Validity can be categorized into different types, beginning with the broad distinction between internal and external validity.

The concept of validity was formulated by Kelley (1927, p. 14), who stated that a test is valid if it measures what it claims to measure. For example, a test of intelligence should measure intelligence and not something else (such as memory).

Internal and External Validity In Research

Internal validity refers to whether the effects observed in a study are due to the manipulation of the independent variable and not some other confounding factor.

In other words, high internal validity lets us infer a causal relationship between the independent and dependent variables.

Internal validity can be improved by controlling extraneous variables, using standardized instructions, counterbalancing, and eliminating demand characteristics and investigator effects.

External validity refers to the extent to which the results of a study can be generalized to other settings (ecological validity), other people (population validity), and over time (historical validity).

External validity can be improved by conducting experiments in more natural, real-world settings and by using random sampling to select participants.
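
As a rough illustration of the two design moves just described, the sketch below draws a simple random sample from a hypothetical participant pool (supporting generalization to that population) and then randomly assigns the sampled participants to conditions (a standard way of controlling extraneous participant variables, which supports internal validity). The pool size, sample size, and names are assumptions made for the example.

```python
# Random sampling (external validity) and random assignment (internal
# validity) with hypothetical participants.
import random

random.seed(42)  # fixed seed so the illustration is reproducible

population = [f"person_{i}" for i in range(1000)]  # assumed sampling frame
sample = random.sample(population, 40)             # simple random sample

random.shuffle(sample)                             # random assignment
treatment_group = sample[:20]
control_group = sample[20:]

print(len(treatment_group), len(control_group))    # 20 20
```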

Types of Validity In Psychology

Two main categories of validity are used to assess the validity of a test (e.g., a questionnaire, interview, or IQ test): content and criterion.

  • Content validity refers to the extent to which a test or measurement represents all aspects of the intended content domain. It assesses whether the test items adequately cover the topic or concept.
  • Criterion validity assesses the performance of a test based on its correlation with a known external criterion or outcome. It can be further divided into concurrent (measured at the same time) and predictive (measuring future performance) validity.


Face Validity

Face validity is simply whether the test appears (at face value) to measure what it claims to. This is the least sophisticated measure of content-related validity, and is a superficial and subjective assessment based on appearance.

Tests wherein the purpose is clear, even to naïve respondents, are said to have high face validity. Accordingly, tests wherein the purpose is unclear have low face validity (Nevo, 1985).

A direct measurement of face validity is obtained by asking people to rate the validity of a test as it appears to them. These raters could use a Likert scale to assess face validity.

For example:

  • The test is extremely suitable for a given purpose
  • The test is very suitable for that purpose
  • The test is adequate
  • The test is inadequate
  • The test is irrelevant and, therefore, unsuitable

It is important to select suitable people to rate a test (e.g., questionnaire, interview, IQ test, etc.). For example, individuals who actually take the test would be well placed to judge its face validity.

Also, people who work with the test could offer their opinion (e.g., employers, university administrators). Finally, the researcher could use members of the general public with an interest in the test (e.g., parents of testees, politicians, teachers, etc.).

The face validity of a test can be considered a robust construct only if a reasonable level of agreement exists among raters.
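
One informal way to check that "reasonable level of agreement" is to summarize the raters' scale responses, for example the mean rating and the share of raters who judge the test at least adequate. The sketch below assumes the five-point scale listed earlier has been coded 5 (extremely suitable) down to 1 (irrelevant); the raters and the cutoff are hypothetical.

```python
# Summarizing hypothetical face-validity ratings from five raters,
# coded 5 = extremely suitable ... 1 = irrelevant/unsuitable.
ratings = {
    "rater_1": 5,
    "rater_2": 4,
    "rater_3": 4,
    "rater_4": 3,
    "rater_5": 2,
}

values = list(ratings.values())
mean_rating = sum(values) / len(values)
share_adequate_or_better = sum(v >= 3 for v in values) / len(values)

print(f"mean rating = {mean_rating:.1f}")
print(f"rated adequate or better by {share_adequate_or_better:.0%} of raters")
```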

It should be noted that the term face validity should be avoided when the rating is done by an “expert,” as content validity is more appropriate.

Having face validity does not mean that a test really measures what the researcher intends to measure; it means only that, in the judgment of raters, the test appears to do so. Consequently, it is a crude and basic measure of validity.

A test item such as “ I have recently thought of killing myself ” has obvious face validity as an item measuring suicidal cognitions and may be useful when measuring symptoms of depression.

However, an implication of clear face validity is that such items are more vulnerable to social desirability bias. Individuals may manipulate their responses to deny or hide problems, or exaggerate behaviors, to present a positive image of themselves.

It is possible for a test item to lack face validity but still have general validity and measure what it claims to measure. This is good because it reduces demand characteristics and makes it harder for respondents to manipulate their answers.

For example, the test item “ I believe in the second coming of Christ ” would lack face validity as a measure of depression (as the purpose of the item is unclear).

This item appeared on the first version of The Minnesota Multiphasic Personality Inventory (MMPI) and loaded on the depression scale.

Because most of the original normative sample of the MMPI were good Christians, only a depressed Christian would think Christ is not coming back. Thus, for this particular religious sample, the item does have general validity but not face validity.

Construct Validity

Construct validity assesses how well a test or measure represents and captures an abstract theoretical concept, known as a construct. It indicates the degree to which the test accurately reflects the construct it intends to measure, often evaluated through relationships with other variables and measures theoretically connected to the construct.

The concept of construct validity was introduced by Cronbach and Meehl (1955). This type of validity refers to the extent to which a test captures a specific theoretical construct or trait, and it overlaps with some of the other aspects of validity.

Construct validity does not concern the simple, factual question of whether a test measures an attribute.

Instead, it is about the complex question of whether test score interpretations are consistent with a nomological network involving theoretical and observational terms (Cronbach & Meehl, 1955).

To test for construct validity, it must be demonstrated that the phenomenon being measured actually exists. So, the construct validity of a test for intelligence, for example, depends on a model or theory of intelligence.

Construct validity entails demonstrating the power of such a construct to explain a network of research findings and to predict further relationships.

The more evidence a researcher can demonstrate for a test’s construct validity, the better. However, there is no single method of determining the construct validity of a test.

Instead, different methods and approaches are combined to present the overall construct validity of a test. For example, factor analysis and correlational methods can be used.

Convergent validity

Convergent validity is a subtype of construct validity. It assesses the degree to which two measures that theoretically should be related are related.

It demonstrates that measures of similar constructs are highly correlated. It helps confirm that a test accurately measures the intended construct by showing its alignment with other tests designed to measure the same or similar constructs.

For example, suppose there are two different scales used to measure self-esteem:

Scale A and Scale B. If both scales effectively measure self-esteem, then individuals who score high on Scale A should also score high on Scale B, and those who score low on Scale A should score similarly low on Scale B.

If the scores from these two scales show a strong positive correlation, then this provides evidence for convergent validity because it indicates that both scales seem to measure the same underlying construct of self-esteem.
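
A minimal sketch of that check, with made-up total scores for Scale A and Scale B from the same ten respondents, might look like this (Python with scipy assumed):

```python
# Convergent validity: correlate two hypothetical self-esteem scales
# completed by the same respondents.
from scipy.stats import pearsonr

scale_a = [32, 25, 40, 18, 36, 29, 22, 38, 27, 34]  # Scale A total scores
scale_b = [30, 27, 41, 20, 35, 28, 24, 39, 26, 33]  # Scale B total scores

r, p = pearsonr(scale_a, scale_b)
print(f"convergent validity: r = {r:.2f} (p = {p:.3f})")
# A strong positive correlation is consistent with both scales measuring
# the same underlying construct; a weak correlation would argue against it.
```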

Concurrent Validity (i.e., occurring at the same time)

Concurrent validity evaluates how well a test’s results correlate with the results of a previously established and accepted measure, when both are administered at the same time.

It helps in determining whether a new measure is a good reflection of an established one without waiting to observe outcomes in the future.

If the new test is validated by comparison with a currently existing criterion, we have concurrent validity.

Very often, a new IQ or personality test might be compared with an older but similar test known to have good validity already.

Predictive Validity

Predictive validity assesses how well a test predicts a criterion that will occur in the future. It measures the test’s ability to foresee the performance of an individual on a related criterion measured at a later point in time. It gauges the test’s effectiveness in predicting subsequent real-world outcomes or results.

For example, a prediction may be made on the basis of a new intelligence test that high scorers at age 12 will be more likely to obtain university degrees several years later. If the prediction is borne out, then the test has predictive validity.

References

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.

Hathaway, S. R., & McKinley, J. C. (1943). Manual for the Minnesota Multiphasic Personality Inventory. New York: Psychological Corporation.

Kelley, T. L. (1927). Interpretation of educational measurements. New York: Macmillan.

Nevo, B. (1985). Face validity revisited. Journal of Educational Measurement, 22(4), 287–293.


Reliability and Validity


Yori Gidron


These two concepts are the basis for assessment in most scientific work in medical and social sciences. Reliability refers to the degree of consistency in measurement and to the lack of error. There are several types of indices of reliability. Internal reliability (measured by Cronbach’s alpha) is a measure of repeatability of a measure. In psychometrics, a questionnaire of, for example, 10 items, is said to be reliable if its internal reliability coefficient is at least 0.70. This reflects approximately the mean correlation between each score on each item, with all remaining item scores, repeated across all items. Methodologically, this reflects a measure of repeatability, a basic premise of science. Another type of reliability is inter-rater reliability, which refers to the degree of agreement between two or more observers, evaluating a patient’s behavior, for example. Thus, in the original type A behavior interview, which currently places more emphasis on hostility,...
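
Since the excerpt leans on Cronbach's alpha and the conventional 0.70 benchmark, a brief sketch of how alpha is computed from an item-response matrix may help; the data below are invented, and the 0.70 cutoff is the convention cited above rather than a universal rule.

```python
# Cronbach's alpha for a small, made-up item-response matrix
# (rows = respondents, columns = questionnaire items).
import numpy as np

responses = np.array([
    [4, 5, 4, 4, 3],
    [2, 3, 2, 3, 2],
    [5, 5, 4, 5, 5],
    [3, 4, 3, 3, 4],
    [1, 2, 2, 1, 2],
    [4, 4, 5, 4, 4],
])

k = responses.shape[1]                         # number of items
item_vars = responses.var(axis=0, ddof=1)      # variance of each item
total_var = responses.sum(axis=1).var(ddof=1)  # variance of total scores

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")  # >= 0.70 is the conventional benchmark
```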






Source: Gidron, Y. (2013). Reliability and Validity. In Gellman, M. D., & Turner, J. R. (Eds.), Encyclopedia of Behavioral Medicine. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-1005-9_1549


CBE Life Sciences Education, 15(1), Spring 2016

Contemporary Test Validity in Theory and Practice: A Primer for Discipline-Based Education Researchers

Todd D. Reeves

*Educational Technology, Research and Assessment, Northern Illinois University, DeKalb, IL 60115

Gili Marbach-Ad

† College of Computer, Mathematical and Natural Sciences, University of Maryland, College Park, MD 20742

This essay offers a contemporary social science perspective on test validity and the validation process. The instructional piece explores the concepts of test validity, the validation process, validity evidence, and key threats to validity. The essay also includes an in-depth example of a validity argument and validation approach for a test of student argument analysis. In addition to discipline-based education researchers, this essay should benefit practitioners (e.g., lab directors, faculty members) in the development, evaluation, and/or selection of instruments for their work assessing students or evaluating pedagogical innovations.


INTRODUCTION

The field of discipline-based education research (Singer et al., 2012) has emerged in response to long-standing calls to advance the status of U.S. science education at the postsecondary level (e.g., Boyer Commission on Educating Undergraduates in the Research University, 1998; National Research Council, 2003; American Association for the Advancement of Science, 2011). Discipline-based education research applies scientific principles to study postsecondary science education processes and outcomes systematically to improve the scientific enterprise. In particular, this field has made significant progress with respect to the study of 1) active-learning pedagogies (e.g., Freeman et al., 2014); 2) interventions to support those pedagogies among both faculty (e.g., Brownell and Tanner, 2012) and graduate teaching assistants (e.g., Schussler et al., 2015); and 3) undergraduate research experiences (e.g., Auchincloss et al., 2014).

Most discipline-based education researchers (DBERs) were formally trained in the methods of scientific disciplines such as biology, chemistry, and physics, rather than social science disciplines such as psychology and education. As a result, DBERs may have never taken specific courses in the social science research methodology—either quantitative or qualitative—on which their scholarship often relies so heavily ( Singer et al. , 2012 ). While the same principles of science ground all these fields, the specific methods used and some criteria for methodological and scientific rigor differ along disciplinary lines.

One particular aspect of (quantitative) social science research that differs markedly from research in disciplines such as biology and chemistry is the instrumentation used to quantify phenomena. Instrumentation is a critical aspect of research methodology, because it provides the raw materials input to statistical analyses and thus serves as a basis for credible conclusions and research-based educational practice ( Opfer et al ., 2012 ; Campbell and Nehm, 2013 ). A notable feature of social science instrumentation is that it generally targets variables that are latent, that is, variables that are not directly observable but instead must be inferred through observable behavior ( Bollen, 2002 ). For example, to elicit evidence of cognitive beliefs, which are not observable directly, respondents are asked to report their level of agreement (e.g., “strongly disagree,” “disagree,” “agree,” “strongly agree”) with textually presented statements (e.g., “I like science,” “Science is fun,” and “I look forward to science class”). Even a multiple-choice final examination does not directly observe the phenomenon of interest (e.g., student knowledge). As such, compared with work in traditional scientific disciplines, in the social sciences, more of an inferential leap is often required between the derivation of a score and its intended interpretation ( Opfer et al ., 2012 ).

Instruments designed to elicit evidence of variables of interest to DBERs have proliferated in recent years. Some well-known examples include the Experimental Design Ability Test (EDAT; Sirum and Humburg, 2011 ); the Genetics Concept Assessment ( Smith et al. , 2008 ); the Classroom Undergraduate Research Experience survey ( Denofrio et al. , 2007 ); and the Classroom Observation Protocol for Undergraduate STEM ( Smith et al. , 2013 ). However, available instruments vary widely in their quality and nuance ( Opfer et al. , 2012 ; Singer et al. , 2012 ; Campbell and Nehm, 2013 ), necessitating understanding on the part of DBERs of how to evaluate instruments for use in their research. Practitioners, too, should know how to evaluate and select high-quality instruments for program evaluation and/or assessment purposes. Where high-quality instruments do not already exist for use in one’s context, which is commonplace ( Opfer et al ., 2012 ), they need to be developed, and corresponding empirical validity evidence needs to be gathered in accord with contemporary standards.

In response, this Research Methods essay offers a contemporary social science perspective on test validity and the validation process. It is intended to offer a primer for DBERs who may not have received formal training on the subject. Using examples from discipline-based education research, the instructional piece explores the concepts of test validity, the validation process, validity evidence, and key threats to validity. The essay also includes an in-depth example of a validity argument and validation approach for a test of student argument analysis. In addition to DBERs, this essay should benefit practitioners (e.g., lab directors, faculty members) in the development, evaluation, and/or selection of instruments for their work assessing students or evaluating pedagogical innovations.

TEST VALIDITY AND THE TEST VALIDATION PROCESS

A test is a sample of behavior gathered in order to draw an inference about some domain or construct within a particular population (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education [ AERA, APA, and NCME], 2014 ). 1 In the social sciences, the domain about which an inference is desired is typically a latent (unobservable) variable. For example, the STEM GTA-Teaching Self-Efficacy Scale ( DeChenne et al. , 2012 ) was developed to support inferences about the degree to which a graduate teaching assistant believes he or she is capable of 1) cultivating an effective learning environment and 2) implementing particular instructional strategies. As another example, the inference drawn from an introductory biology final exam is typically about the degree to which a student understands content covered over some extensive unit of instruction. While beliefs or conceptual knowledge are not directly accessible, what can be observed is the sample of behavior the test elicits, such as test-taker responses to questions or responses to rating scales. Diverse forms of instrumentation are used in discipline-based education research ( Singer et al. , 2012 ). Notable subcategories of instruments include self-report (e.g., attitudinal and belief scales) and more objective measures (e.g., concept inventories, standardized observation protocols, and final exams). By the definition of “test” above, any of these instrument types can be conceived as tests—though the focus here is only on instruments that yield quantitative data, that is, scores.

The paramount consideration in the evaluation of any test’s quality is validity: “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” ( Angoff, 1988 ; AERA, APA, and NCME, 2014 , p. 11). 2 , 3 In evaluating test validity, the focus is not on the test itself, but rather the proposed inference(s) drawn on the basis of the test’s score(s). Noteworthy in the validity definition above is that validity is a matter of degree (“the inferences supported by this test have a high or low degree of validity”), rather than a dichotomous character (e.g., “the inferences supported by this test are or are not valid”).

Assessment validation is theorized as an iterative process in which the test developer constructs an evidence-based argument for the intended test-based score interpretations in a particular population ( Kane, 1992 ; Messick, 1995 ). An example validity argument claim is that the test’s content (e.g., questions, items) is representative of the domain targeted by the test (e.g., body of knowledge/skills). With this argument-based approach, claims within the validity argument are substantiated with various forms of relevant evidence. Altogether, the goal of test validation is to accumulate over time a comprehensive body of relevant evidence to support each intended score interpretation within a particular population (i.e., whether the scores should in fact be interpreted to mean what the developer intends them to mean).

CATEGORIES OF TEST VALIDITY EVIDENCE

Historically, test validity theory in the social sciences recognized several categorically different “types” of validity (e.g., “content validity,” “criterion validity”). However, contemporary validity theory posits that test validity is a unitary (single) concept. Rather than providing evidence of each “type” of validity, the charge for test developers is to construct a cohesive argument for the validity of test score–based inferences that integrates different forms of validity evidence. The categories of validity evidence include evidence based on test content, evidence based on response processes, evidence based on internal structure, evidence based on relations with other variables, and evidence based on the consequences of testing ( AERA, APA, and NCME, 2014 ). Figure 1 provides a graphical representation of the categories and subcategories of validity evidence.

Figure 1. Categories of evidence used to argue for the validity of test score interpretations and uses (AERA, APA, and NCME, 2014).

Validity evidence based on test content concerns “the relationship between the content of a test and the construct it is intended to measure” (AERA, APA, and NCME, 2014, p. 14). Such validity evidence concerns the match between the domain the test purportedly measures (e.g., diagnostic microscopy skills) and the content of the test (e.g., the specific slides examined by the test taker). For example, if a test is intended to elicit evidence of students’ understanding of the key principles of evolution by means of natural selection (e.g., variation, heredity, differential fitness), the test should fully represent those principles in the sample of behavior it elicits. As a concrete example from the literature, in the development of the Host-Pathogen Interaction (HPI) concept inventory, Marbach-Ad et al. (2009) explicitly mapped each test item to one of 13 HPI concepts intended to be assessed by their instrument. Content validity evidence alone is insufficient for establishing a high degree of validity; it should be combined with other forms of evidence to yield a strong evidence-based validity argument marked by relevancy, accuracy, and sufficiency.

In practice, providing validity evidence based on test content involves evaluating and documenting content representativeness. One standard approach to collecting evidence of content representativeness is to submit the test to external systematic review by subject matter–area experts (e.g., biology faculty) and to document such reviews (as well as revisions made on their basis). External reviews focus on the adequacy of the test’s overall elicited sample of behavior in representing the domain assessed and any corresponding subdomains, as well as the relevance or irrelevance of particular questions/items to the domain. We refer the reader to Webb (2006) for a comprehensive and sophisticated framework for evaluating different dimensions of domain–test content alignment.

Another approach used to design a test, so as to support and document construct representativeness, is to employ a “table of specifications” (e.g., Fives and DiDonato-Barnes, 2013 ). A table of specifications (or test blueprint) is a tool for designing a test that classifies test content along two dimensions, a content dimension and a cognitive dimension. The content dimension pertains to the different aspects of the construct one intends to measure. In a classroom setting, aspects of the construct are typically defined by behavioral/instructional objectives (i.e., students will analyze phylogenetic trees). The cognitive dimension represents the level of cognitive processing or thinking called for by test components (e.g., knowledge, comprehension, analysis). Within a table of specifications, one indicates the number/percent of test questions or items for each aspect of the construct at each cognitive level. Often, one also provides a summary measure of the number of items pertaining to each content area (regardless of cognitive demand) and at each cognitive level (regardless of content). Instead of or in addition to the number of items, one can also indicate the number/percent of available points for each content area and cognitive level. Because a table of specifications indicates how test components represent the construct one intends to measure, it serves as one source of validity evidence based on test content. Table 1 presents an example table of specifications for a test concerning the principles of evolution by means of natural selection.

Table 1. Example table of specifications for an evolution by means of natural selection test, showing the number of test items for each content area (behavioral objective) at each cognitive level, with row and column totals.

Content (behavioral objective) | Comprehension | Application | Analysis | Total
1. Students will define evolution by means of natural selection. | 1 | – | – | 1
2. Students will define key principles of evolution by means of natural selection (e.g., heredity, differential fitness). | 5 | – | – | 5
3. Students will compute measures of absolute and relative fitness. | – | 5 | – | 5
4. Students will compare evolution by means of natural selection with earlier evolution theories. | – | – | 3 | 3
5. Students will analyze phylogenetic trees. | – | – | 4 | 4
Total | 6 | 5 | 7 | 18
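
A test blueprint like Table 1 can also be kept as a small data structure so that row and column totals are computed rather than typed by hand, which helps catch planning errors. The sketch below simply mirrors Table 1; the objective labels are shortened and hypothetical.

```python
# A test blueprint (table of specifications) mirroring Table 1, with totals
# computed from the cell counts.
blueprint = {
    "define natural selection":      {"Comprehension": 1},
    "define key principles":         {"Comprehension": 5},
    "compute fitness measures":      {"Application": 5},
    "compare with earlier theories": {"Analysis": 3},
    "analyze phylogenetic trees":    {"Analysis": 4},
}

levels = ["Comprehension", "Application", "Analysis"]
col_totals = {lvl: sum(row.get(lvl, 0) for row in blueprint.values()) for lvl in levels}
total_items = sum(sum(row.values()) for row in blueprint.values())

print(col_totals)   # {'Comprehension': 6, 'Application': 5, 'Analysis': 7}
print(total_items)  # 18
```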

Evidence of validity based on response processes concerns “the fit between the construct and the detailed nature of the performance or response actually engaged in by test takers” ( AERA, APA, and NCME, 2014 , p. 15). For example, if a test purportedly elicits evidence of undergraduate students’ critical evaluative thinking concerning evidence-based scientific arguments, during the test the student should be engaged in the cognitive process of examining argument claims, evidence, and warrants, and the relevance, accuracy, and sufficiency of evidence. Most often one gathers such evidence through respondent think-aloud procedures. During think alouds, respondents verbally explain and rationalize their thought processes and responses concurrently during test completion. One particular method commonly used by professional test vendors to gather response process–based validity evidence is cognitive labs, which involve both concurrent and retrospective verbal reporting by respondents ( Willis, 1999 ; Zucker et al ., 2004 ). As an example from the literature, developers of the HPI concept inventory asked respondents to provide open-ended responses to ensure that their reasons for selecting a particular response option (e.g., “B”) were consistent with the developer’s intentions, that is, the student indeed held the particular alternative conception presented in response option B ( Marbach-Ad et al. , 2009 ). Think alouds are formalized via structured protocols, and the elicited think-aloud data are recorded, transcribed, analyzed, and interpreted to shed light on validity.

Evidence based on internal structure concerns “the degree to which the relationships among test item and test components conform to the construct on which the proposed test score interpretations are based” ( AERA, APA, and NCME, 2014 , p. 16). 4 For instance, suppose a professor plans to teach one topic (eukaryotes) using small-group active-learning instruction and another topic (prokaryotes) through lecture instruction; and he or she wants to make within-class comparisons of the effectiveness of these methods. As an outcome measure, a test may be designed to support inferences about the two specific aspects of biology content (e.g., characteristics of prokaryotic and eukaryotic cells). Collection of evidence based on internal structure seeks to confirm empirically whether the scores reflect the (in this case two) distinct domains targeted by the test ( Messick, 1995 ). In practice, one can formally establish the fidelity of test scores to their theorized internal structure through methodological techniques such as factor analysis, item response theory, and Rasch modeling ( Harman, 1960 ; Rasch, 1960 ; Embretson and Reise, 2013 ). With factor analysis, for example, item intercorrelations are analyzed to determine whether particular item responses cluster together (i.e., whether scores from components of the test related to one aspect of the domain [e.g., questions about prokaryotes] are more interrelated with one another than they are with scores derived from other components of the test [e.g., questions about eukaryotes]).
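
To make the factor-analytic idea concrete, the sketch below simulates item scores driven by two latent abilities (one for prokaryote items, one for eukaryote items) and fits a two-factor model with scikit-learn. The data are simulated, and the loading pattern described in the final comment is what one would hope to see, not a guaranteed outcome with real data.

```python
# Probing internal structure with factor analysis on simulated data:
# do "prokaryote" items and "eukaryote" items load on separate factors?
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 300
prok_ability = rng.normal(size=n)  # latent understanding of prokaryotes
euk_ability = rng.normal(size=n)   # latent understanding of eukaryotes

# Three items per domain, each loading mainly on its own latent variable.
prok_items = prok_ability[:, None] + 0.5 * rng.normal(size=(n, 3))
euk_items = euk_ability[:, None] + 0.5 * rng.normal(size=(n, 3))
scores = np.hstack([prok_items, euk_items])

fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0).fit(scores)
print(np.round(fa.components_, 2))
# After rotation, items 0-2 and items 3-5 should load mainly on different
# factors, supporting the intended two-domain score interpretation.
```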

Item response theory and Rasch models hypothesize that the probability of a particular response to a test item is a function of the respondent’s ability (in terms of what is being measured) and characteristics of the item (e.g., difficulty, discrimination, pseudo-guessing). Examining test score internal structure with such models involves examining whether such model-based predictions bear out in the observed data. There are a variety of such models for use with test questions with different (or different combinations of) response formats, such as the Rasch rating-scale model (Andrich, 1978) and the Rasch partial-credit model (Wright and Masters, 1982).
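
A minimal sketch of the prediction at the heart of the one-parameter (Rasch) model, with purely illustrative ability and difficulty values, is shown below; checking internal structure with such models amounts to comparing predictions like these against the observed item responses.

```python
# The Rasch model's item response function: probability of a correct answer
# as a function of person ability and item difficulty (logit scale).
import math

def rasch_p_correct(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the one-parameter (Rasch) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

for ability in (-1.0, 0.0, 1.0):
    p = rasch_p_correct(ability, difficulty=0.5)  # illustrative item difficulty
    print(f"ability = {ability:+.1f} -> P(correct) = {p:.2f}")
```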

Validity evidence based on relations with other variables concerns “the relationship of test scores to variables external to the test” (AERA, APA, and NCME, 2014, p. 16). The collection of this form of validity evidence centers on examining how test scores are related to both measures of the same or similar constructs and measures of distinct and different constructs (i.e., respectively termed “convergent validity” and “discriminant validity” 5 evidence). In other words, such evidence pertains to how scores relate to other variables as would be theoretically expected. For example, if a new self-report instrument purports to measure experimental design skills, scores should correlate highly with an existing measure of experimental design ability such as the EDAT (Sirum and Humburg, 2011). On the other hand, scores derived from this self-report instrument should be considerably less correlated or uncorrelated with scores from a personality measure such as the Minnesota Multiphasic Personality Inventory (Greene, 2000). As another discipline-based education research example, Nehm and Schonfeld (2008) collected discriminant validity evidence by relating scores from both the Conceptual Inventory of Natural Selection (CINS) and the Open Response Instrument (ORI), which both purport to assess understanding of and conceptions concerning natural selection, and a geology test of knowledge about rocks.

A subcategory of evidence based on relations with other variables is evidence related to test-criterion relationships, which concerns how test scores are related to some other nontest indicator or outcome either at the same time (so-called concurrent validity evidence) or in the future (so-called predictive validity evidence). For instance, developers of a new biostatistics test might examine how scores from the test correlate as expected with professor ability judgments or mathematics course grade point average at the same point in time; alternatively, the developer might follow tested individuals over time to examine how scores relate to the probability of successfully completing biostatistics course work. As another example, given prior research on self-efficacy, scores from instruments that probe teaching self-efficacy should be related to respondents’ levels of teacher training and experience ( Prieto and Altmaier, 1994 ; Prieto and Meyers, 1999 ).

Examination of how test scores are related or not to other variables as expected is often associational in nature (e.g., correlational analysis). There are also two other specific methods for eliciting such validity evidence. The first is to examine score differences between theoretically different groups (e.g., whether scientists’ and nonscientists’ scores from an experimental design test differ systematically on average)—the “known groups method.” The second is to examine whether scores increase or decrease as expected in response to an intervention ( Hattie and Cooksey, 1984 ; AERA, APA, and NCME, 2014 ). For example, Marbach-Ad et al. (2009 , 2010 ) examined HPI concept inventory score differences between majors and nonmajors and students in introductory and upper-level courses. To inform the collection of validity evidence based on relations with other variables, individuals should consult the literature to formulate a theory around how good measures of the construct should relate to different variables. One should also note that the quality of such validity evidence hinges on the quality (e.g., validity) of measures of external variables.
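
A rough sketch of the "known groups" check described above, with made-up experimental-design test scores for two groups expected to differ, might look like this:

```python
# Known-groups evidence: compare score distributions for groups that should
# differ on the construct (all scores below are hypothetical).
from scipy.stats import ttest_ind

scientists = [17, 19, 16, 18, 20, 15, 19, 18]
nonscientists = [12, 14, 11, 15, 13, 10, 14, 12]

t, p = ttest_ind(scientists, nonscientists)
print(f"t = {t:.2f}, p = {p:.4f}")
# A clear difference in the expected direction supports validity based on
# relations with other variables; its absence would count against it.
```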

Finally, validity evidence based on the consequences of testing concerns the “soundness of proposed interpretations [of test scores] for their intended uses” ( AERA, APA, and NCME, 2014 , p. 19) and the value implications and social consequences of testing ( Messick, 1994 , 1995 ). Such evidence pertains to both the intended and unintended effects of test score interpretation and use ( Linn, 1991 ; Messick, 1995 ). Example intended consequences of test use would include motivating students, better-targeted instruction, and populating a special program with only those students who are in need of the program (if those are the intended purposes of test use). An example of an unintended consequence of test use would be significant reduction in instructional time because of overly time-consuming test administration (assuming, of course, that this would not be a desired outcome) or drop out of particular student populations because of an excessively difficult test administered early in a course. In K–12 settings, a classic example of an unintended consequence of testing is the “narrowing of the curriculum” that occurred as a result of the No Child Left Behind Act testing regime; when faced with annual tests focused only on particular content areas (i.e., English/language arts and mathematics), schools and teachers focused more on tested content and less on nontested content such as science, social studies, art, and music (e.g., Berliner, 2011 ). Evidence based on the consequences of a test is often gathered via surveys, interviews, and focus groups administered with test users.

TEST VALIDITY ARGUMENT EXAMPLE

In this section, we provide an example validity argument for a test designed to elicit evidence of students’ skills in analyzing the elements of evidence-based scientific arguments. This hypothetical test presents text-based arguments concerning scientific topics (e.g., global climate change, natural selection) to students, who then directly interact with the texts to identify their elements (i.e., claims, reasons, and warrants). The test is intended to support inferences about 1) students’ overall evidence-based science argument-element analysis skills; 2) students’ skills in identifying particular evidence-based science argument elements (e.g., claims); and 3) errors made when students identify particular argument elements (e.g., evidence). Additionally, the test is intended to 4) support instructional decision-making to improve science teaching and learning. The validity argument claims undergirding this example test’s score interpretations and uses (and the categories of validity evidence advanced to substantiate each) are shown in Table 2 .

Table 2. Example validity argument and validation approach for a test of students’ ability to analyze the elements of evidence-based scientific arguments, showing argument claims and subclaims concerning the validity of the intended test score interpretations and uses, and the relevant validity evidence used to substantiate those claims.

Validity argument claims and subclaims | Relevant validity evidence based on
1. The overall score represents a student’s current level of argument-element analysis skills, because:
   a single higher-order construct (i.e., argument-element analysis skill) underlies all item responses. | Internal structure
   the overall score is distinct from content knowledge and thinking dispositions. | Relations with other variables
   the items represent a variety of arguments and argument elements. | Test content
   items engage respondents in the cognitive process of argument-element analysis. | Response processes
   the overall score is highly related to other argument analysis measures and less related to content knowledge and thinking disposition measures. | Relations with other variables
2. A subscore (e.g., claim identification) represents a student’s current level of argument-element identification skill, because:
   each subscore is distinct from other subscores and the total score (the internal structure is multidimensional and hierarchical). | Internal structure
   the items represent a variety of arguments and particular argument elements (e.g., claims). | Test content
   the subscore is distinct from content knowledge and thinking dispositions. | Relations with other variables
   items engage respondents in the cognitive process of identifying a particular argument element (e.g., claims). | Response processes
   subscores are highly related to other argument analysis measures and less related to content knowledge and thinking disposition measures. | Relations with other variables
3. Error indicators can be interpreted to represent students’ current errors made in identifying particular argument elements, because when students misclassify an element in the task, they are making cognitive errors. | Response processes
4. Use of the test will facilitate improved argument instruction and student learning, because:
   teachers report that the test is useful and easy to use and have positive attitudes toward it. | Consequences of testing
   teachers report using the test to improve argument instruction. | Consequences of testing
   teachers report that the provided information is timely. | Consequences of testing
   teachers learn about argumentation with test use. | Consequences of testing
   students learn about argumentation with test use. | Consequences of testing
   any unintended consequences of test use do not outweigh intended consequences. | Consequences of testing

ANALYSIS OF CINS VALIDITY EVIDENCE

The example validity argument provided in the preceding section was intended to model the validity argument formulation process for readers who intend to develop an instrument. However, in many cases, an existing instrument (or one of several existing instruments) needs to be selected for use in one’s context. The use of an existing instrument for research or practice requires thoughtful analysis of extant validity evidence available for an instrument’s score interpretations and uses. Therefore, in this section, we use validity theory as outlined in the Standards for Educational and Psychological Testing to analyze the validity evidence for a particular instrument, the CINS.

As reported in Anderson et al . (2002) , the CINS is purported to measure “conceptual understanding of natural selection” (as well as alternative conceptions of particular relevant ideas diagnostically) in undergraduate non–biology majors before instruction (p. 953). In their initial publication of the instrument, the authors supplied several forms of validity evidence for the intended score interpretations and uses. In terms of validity evidence related to test content, the authors argued that test content was aligned with Mayr’s (1982) five facts and three inferences about evolution by means of natural selection, and two other key concepts, the origin of variation and the origin of species. Two test items were mapped to each of these 10 concepts. Similarly, distractor (incorrect) multiple-choice responses were based on theory and research about students’ nonscientific, or alternative, conceptions of these ideas. Content-related validity evidence was also provided through reviews of test items by biology professors.

Evidence based on test-taker response processes was elicited through cognitive interviews (think alouds) conducted with a small sample of students ( Anderson et al. , 2002 ). The authors provided validity evidence based on internal structure using principal components analysis, which is similar to factor analysis. In terms of validity evidence based on test-score relations with other variables, the authors examined correlations between CINS scores and scores derived from interviews. While Anderson and colleagues did note that a paper and pencil–based test would be more logistically feasible than interview-based assessment methods, validity evidence based on the consequences of testing was not formally provided.

Anderson et al. ’s (2002) paper did present a variety of forms of validity evidence concerning the CINS instrument. However, and underscoring the continuous nature of test validation, subsequent work has built upon their work and provided additional evidence. For example, in light of concerns that the primary earlier source of validity evidence was correlations between CINS scores and scores based on oral interviews in a very small sample, Nehm and Schonfeld (2008) provided additional validity evidence based on relations with other variables. For example, Nehm and Schonfeld (2008) examined CINS score relations with two other instruments purported to assess the same construct (convergent validity evidence) and with a measure of an unrelated construct (discriminant validity evidence). Nehm and Schonfeld also expanded the body of CINS validity evidence based on internal structure by analyzing data using the Rasch model. The authors’ reporting of CINS administration times similarly shed light on possible consequences of testing. The evolution of validity evidence for the CINS noted here certainly speaks to the iterative and ongoing nature of instrument validation processes. With this in mind, future work might examine CINS scores’ internal structure vis-à-vis diagnostic classification models (see Rupp and Templin, 2008 ), since CINS is purportedly a diagnostic test.

TEST VALIDITY THREATS

The two primary threats to test score validity are construct underrepresentation and construct-irrelevant variance. Construct underrepresentation is “the degree to which a test fails to capture important aspects of the construct” ( AERA, APA, and NCME, 2014 ; p. 12). In other words, construct underrepresentation involves failing to elicit a representative sample of behavior from test takers (e.g., responses to multiple-choice questions) relative to the universe of possible relevant behaviors that might be observed. While it is neither necessary nor feasible to ask respondents to engage in every single possible relevant behavior, it is crucial that the behavior sampled by the test is sufficiently representative of the construct at large. If a test does not fully and adequately sample behavior related to the targeted domain, the test score’s meaning in actuality would be narrower than is intended.

Construct underrepresentation can be mitigated by initiating test design with a thorough analysis and conception of the domain targeted by the test (Mislevy et al., 2003; Opfer et al., 2012). Knowledge of the construct, and variables that are related or not related to the construct, can also inform the validation process (Mislevy et al., 2003). Beginning test design with a thorough conception of the construct one intends to measure is analogous to the course design approach known as “backward design”; with backward design one first identifies what one wants students to know/be able to do after instruction (learning outcomes) and then designs a course to get students there (Wiggins and McTighe, 2005). Other strategies to promote construct representation include building a test based on a table of specifications; submitting a test to external expert content review (as both noted above); and employing a sufficient number of test items to ensure good representation of domain content.

Besides construct underrepresentation, the other primary threat to test score validity is construct-irrelevant variance—“the degree to which test scores are affected by processes that are extraneous to the test’s intended purpose” (AERA, APA, and NCME, 2014, p. 12). Construct-irrelevant variance is test score variation caused systematically by factors other than (or in addition to) those intended; in other words, some part of the reason why one received a “high” or “low” score is due to irrelevant reasons. Two common examples of this are: English skills affecting test scores for non–native English speakers on tests written in English; and computer skills affecting test scores for tests administered via computer. Another example would be if items on a science teaching self-efficacy self-report instrument are written so generally that the scores represent not science teaching–specific self-efficacy but self-efficacy in general. It is critical to mitigate such threats through test design processes (e.g., minimizing test linguistic load). One can often identify potential threats in the course of a thorough analysis of the construct/domain done at early design stages. During test validation one should also seek to disconfirm such threats, wherein scores are driven by irrelevant factors; practitioners often conduct factor, correlational, and differential item functioning analyses toward this end.

Systematic research on postsecondary science teaching and learning and evaluation of local innovations by practitioners hinges on the availability and use of sound instrumentation. Unfortunately, the field of discipline-based education research lacks sufficient existing and high-quality instruments for use in all of these efforts ( Opfer et al. , 2012 ; Singer et al. , 2012 ; Campbell and Nehm, 2013 ). DBERs and practitioners furthermore do not typically have formal training that equips them to evaluate and select existing instruments or develop and validate their own instruments for needed purposes. This essay reviewed contemporary test validity and validation theory for DBERs and practitioners in hopes of equipping them with such knowledge. 6

This essay was chiefly intended for two audiences: 1) those who will develop new instruments; and 2) those who will evaluate and select from among existing instruments. Here, we summarily denote the implications of this essay for members of these two populations. First, it behooves those developing and publishing their own instruments to explicitly frame, construct, and report an evidence-based validity argument for their proposed instruments’ intended score interpretations and uses. This argument should rely on multiple forms of validity evidence and specify the test-taker and user populations for which that argument pertains. If faced with space constraints in journal articles, test manuals or technical reports can be written to detail such validity evidence and made available to the scholarly community.

Like any argument, an evidence-based argument formulated during test validation should be characterized by relevancy, accuracy, and sufficiency. As such, validity arguments should be held up to scientific scrutiny before a test’s operational use. The quality of a validity argument hinges on a number of factors discussed in this essay. Examples include the alignment of the validity argument claims with intended test score interpretations and uses; the representativeness of the samples from which validity evidence is gathered to the intended test-taker population; the relevance of the expertise held by content reviewers; and the technical quality of external measures. A final point to emphasize is that validation is an ongoing process; additional validity evidence may need to be gathered as theory concerning a construct evolves or as the community seeks to use an instrument with new populations.

Second, potential test users should be critical in their evaluation of existing instruments and should not merely assume that a strong validity argument exists for an instrument’s score interpretations and uses with a particular population. Potential users should look to the instrumentation (or methods) sections of published articles for key information, such as whether the test was developed based on a sound theoretical conception of the construct, whether the test underwent external content review, and whether scores correlate with other measures as they theoretically should, among other things. One may have to contact an author for such information. Altogether, such practices should advance the quality of measurement within the realm of discipline-based education research.

Acknowledgments

The authors thank Drs. Beth Schussler and Ross Nehm and two anonymous reviewers for their constructive feedback on an earlier version of this article.

1 A test cannot be “stamped” valid for all purposes and test-taker populations; validity evidence needs to be gathered with respect to all intended instrument uses.

2 While other key dimensions for evaluating an instrument’s quality include reliability (i.e., test score consistency) and utility (i.e., feasibility; AERA, APA, and NCME, 2014), the focus here is on validity only.

3 While this essay allies with test validity theory as codified in the Standards for Educational and Psychological Testing (AERA, APA, and NCME, 2014), the reader will note that there are alternative conceptions of validity as well (Lissitz and Samuelsen, 2007).

4 This source of evidence has been termed “substantive validity” (Messick, 1995).

5 This is not to be confused with item discrimination, a test item property pertaining to how an item’s scores relate to overall test performance.

6 While our focus is on instruments comprising sets of questions or items intended to elicit evidence of a particular construct or constructs, many of the ideas here apply also to questionnaire (survey) validation. For example, the developer of a questionnaire may interrogate how respondents interpret and formulate a response to a particular question as validity evidence based on response processes.

References

  • American Association for the Advancement of Science. Vision and Change in Undergraduate Biology Education: A Call to Action. Washington, DC: 2011.
  • American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: 2014.
  • Anderson DL, Fisher KM, Norman GJ. Development and evaluation of the conceptual inventory of natural science. J Res Sci Teach. 2002;39:952–978.
  • Andrich DA. A rating formulation for ordered response categories. Psychometrika. 1978;43:561–573.
  • Angoff WH. Validity: an evolving concept. In: Wainer H, Braun H, editors. Test Validity. Hillsdale, NJ: Erlbaum; 1988. pp. 19–32.
  • Auchincloss LC, Laursen SL, Branchaw JL, Eagan K, Graham M, Hanauer DI, Lawrie G, McLinn CM, Palaez N, Rowland S, et al. Assessment of course-based undergraduate research experiences: a meeting report. CBE Life Sci Educ. 2014;13:29–40.
  • Berliner D. Rational responses to high stakes testing: the case of curriculum narrowing and the harm that follows. Cambridge J Educ. 2011;41:287–302.
  • Bollen KA. Latent variables in psychology and the social sciences. Annu Rev Psychol. 2002;53:605–634.
  • Boyer Commission on Educating Undergraduates in the Research University. Reinventing Undergraduate Education: A Blueprint for America’s Research Universities. Stony Brook: State University of New York; 1998.
  • Brownell SE, Tanner KD. Barriers to faculty pedagogical change: lack of training, time, incentives, and … tensions with professional identity. CBE Life Sci Educ. 2012;11:339–346.
  • Campbell CE, Nehm RH. A critical analysis of assessment quality in genomics and bioinformatics education research. CBE Life Sci Educ. 2013;12:530–541.
  • DeChenne SE, Enochs LG, Needham M. Science, technology, engineering, and mathematics graduate teaching assistants teaching self-efficacy. J Scholarship Teach Learn. 2012;12:102–123.
  • Denofrio LA, Russell B, Lopatto D, Lu Y. Linking student interests to science curricula. Science. 2007;318:1872–1873.
  • Embretson SE, Reise SP. Item Response Theory. Mahwah, NJ: 2013.
  • Fives H, DiDonato-Barnes N. Classroom test construction: the power of a table of specifications. Pract Assess Res Eval. 2013;18:2–7.
  • Freeman S, Eddy SL, McDonough M, Smith MK, Wenderoth MP, Okoroafor N, Jordt H. Active learning increases student performance in science, engineering, and mathematics. Proc Natl Acad Sci USA. 2014;111:8410–8415.
  • Greene RL. The MMPI-2: An Interpretive Manual. Boston, MA: 2000.
  • Harman HH. Modern Factor Analysis. Oxford, UK: 1960.
  • Hattie J, Cooksey RW. Procedures for assessing the validities of tests using the “known-groups” method. Appl Psychol Meas. 1984;8:295–305.
  • Kane MT. An argument-based approach to validity. Psychol Bull. 1992;112:527–535.
  • Linn RL. Complex, performance-based assessment: expectations and validation criteria. Educ Researcher. 1991;20:15–21.
  • Lissitz RW, Samuelsen K. A suggested change in terminology and emphasis regarding validity and education. Educ Researcher. 2007;36:437–448.
  • Marbach-Ad G, Briken V, El-Sayed NM, Frauwirth K, Fredericksen B, Hutcheson S, Gao LY, Joseph S, Lee VT, McIver KS, et al. Assessing student understanding of host pathogen interactions using a concept inventory. J Microbiol Biol Educ. 2009;10:43–50.
  • Marbach-Ad G, McAdams K, Benson S, Briken V, Cathcart L, Chase M, El-Sayed N, Frauwirth K, Fredericksen B, Joseph S, et al. A model for using a concept inventory as a tool for students’ assessment and faculty professional development. CBE Life Sci Educ. 2010;9:408–436.
  • Mayr E. The Growth of Biological Thought: Diversity, Evolution and Inheritance. Cambridge, MA: Harvard University Press; 1982.
  • Messick S. The interplay of evidence and consequences in the validation of performance assessments. Educ Researcher. 1994;23:13–23.
  • Messick S. Standards of validity and the validity of standards in performance assessment. Educ Meas. 1995;14:5–8.
  • Mislevy RJ, Steinberg LS, Almond RG. On the structure of educational assessments. Measurement. 2003;1:3–62.
  • National Research Council. BIO2010: Transforming Undergraduate Education for Future Research Biologists. Washington, DC: National Academies Press; 2003.
  • Nehm RH, Schonfeld IS. Measuring knowledge of natural selection: a comparison of the CINS, an open-response instrument, and an oral interview. J Res Sci Teach. 2008;45:1131–1160.
  • Opfer JE, Nehm RH, Ha M. Cognitive foundations for science assessment design: knowing what students know about evolution. J Res Sci Teach. 2012;49:744–777.
  • Prieto LR, Altmaier EM. The relationship of prior training and previous teaching experience to self-efficacy among graduate teaching assistants. Res High Educ. 1994;35:481–497.
  • Prieto LR, Meyers SA. Effects of training and supervision on the self-efficacy of psychology graduate teaching assistants. Teach Psychol. 1999;26:264–266.
  • Rasch G. Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danish Institute for Educational Research; 1960.
  • Rupp AA, Templin JL. Unique characteristics of diagnostic classification models: a comprehensive review of the current state-of-the-art. Measurement. 2008;6:219–262.
  • Schussler EE, Reed Q, Marbach-Ad G, Miller K, Ferzli M. Preparing biology graduate teaching assistants for their roles as instructors: an assessment of institutional approaches. CBE Life Sci Educ. 2015;14:ar31.
  • Singer SR, Nielsen NR, Schweingruber HA. Discipline-based Education Research: Understanding and Improving Learning in Undergraduate Science and Engineering. Washington, DC: National Academies Press; 2012.
  • Sirum K, Humburg J. The Experimental Design Ability Test (EDAT). Bioscene. 2011;37:8–16.
  • Smith MK, Jones FHM, Gilbert SL, Wieman CE. The Classroom Observation Protocol for Undergraduate STEM (COPUS): a new instrument to characterize university STEM classroom practices. CBE Life Sci Educ. 2013;12:618–627.
  • Smith MK, Wood WB, Knight JK. The Genetics Concept Assessment: a new concept inventory for gauging student understanding of genetics. CBE Life Sci Educ. 2008;7:422–430.
  • Webb N. Identifying content for student achievement tests. In: Downing SM, Haladyna TM, editors. Handbook of Test Development. Mahwah, NJ: Erlbaum; 2006. pp. 155–180.
  • Wiggins GP, McTighe J. Understanding by Design. Alexandria, VA: 2005.
  • Willis GB. Cognitive Interviewing: A “How To” Guide. Research Triangle Park, NC: 1999.
  • Wright BD, Masters GN. Rating Scale Analysis. Chicago, IL: MESA; 1982.
  • Zucker S, Sassman S, Case BJ. Cognitive Labs. 2004. http://images.pearsonassessments.com/images/tmrs/tmrs_rg/CognitiveLabs.pdf (accessed 29 August 2015).


Reliability vs. Validity in Research | Difference, Types and Examples

Published on July 3, 2019 by Fiona Middleton . Revised on June 22, 2023.

Reliability and validity are concepts used to evaluate the quality of research. They indicate how well a method, technique, or test measures something. Reliability is about the consistency of a measure, and validity is about the accuracy of a measure.

It’s important to consider reliability and validity when you are creating your research design , planning your methods, and writing up your results, especially in quantitative research . Failing to do so can lead to several types of research bias and seriously affect your work.

Reliability vs validity
  • What does it tell you? Reliability: the extent to which the results can be reproduced when the research is repeated under the same conditions. Validity: the extent to which the results really measure what they are supposed to measure.
  • How is it assessed? Reliability: by checking the consistency of results across time, across different observers, and across parts of the test itself. Validity: by checking how well the results correspond to established theories and other measures of the same concept.
  • How do they relate? A reliable measurement is not always valid: the results might be reproducible, but they’re not necessarily correct. A valid measurement is generally reliable: if a test produces accurate results, they should be reproducible.

Table of contents

  • Understanding reliability vs validity
  • How are reliability and validity assessed?
  • How to ensure validity and reliability in your research
  • Where to write about reliability and validity in a thesis
  • Other interesting articles

Reliability and validity are closely related, but they mean different things. A measurement can be reliable without being valid. However, if a measurement is valid, it is usually also reliable.

What is reliability?

Reliability refers to how consistently a method measures something. If the same result can be consistently achieved by using the same methods under the same circumstances, the measurement is considered reliable.

What is validity?

Validity refers to how accurately a method measures what it is intended to measure. If research has high validity, that means it produces results that correspond to real properties, characteristics, and variations in the physical or social world.

High reliability is one indicator that a measurement is valid. If a method is not reliable, it probably isn’t valid.

For example, if a thermometer shows a different temperature each time you measure the same sample, even though you have carefully controlled conditions to ensure the sample’s temperature stays the same, the thermometer is probably malfunctioning, and therefore its measurements are not valid.

However, reliability on its own is not enough to ensure validity. Even if a test is reliable, it may not accurately reflect the real situation.

Validity is harder to assess than reliability, but it is even more important. To obtain useful results, the methods you use to collect data must be valid: the research must be measuring what it claims to measure. This ensures that your discussion of the data and the conclusions you draw are also valid.
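
To make the thermometer example above concrete, here is a minimal, hypothetical sketch (all readings are invented): the spread of repeated readings speaks to reliability, while whether the average matches the true temperature is a separate question of validity.

    import statistics

    # Hypothetical repeated readings of the same controlled sample.
    readings = [36.9, 37.1, 36.8, 37.2, 37.0, 36.9, 37.1]

    mean_reading = statistics.mean(readings)
    spread = statistics.stdev(readings)   # low spread suggests a consistent (reliable) instrument

    print(f"Mean reading: {mean_reading:.2f} C, spread (SD): {spread:.2f} C")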


Reliability can be estimated by comparing different versions of the same measurement. Validity is harder to assess, but it can be estimated by comparing the results to other relevant data or theory. Methods of estimating reliability and validity are usually split up into different types.

Types of reliability

Different types of reliability can be estimated through various statistical methods.

Type of reliability | What does it assess? | Example
  • Test-retest: the consistency of a measure across time (do you get the same results when you repeat the measurement?). Example: a group of participants complete a questionnaire designed to measure personality traits. If they repeat the questionnaire days, weeks or months apart and give the same answers, this indicates high test-retest reliability.
  • Interrater: the consistency of a measure across raters or observers (do you get the same results when different people conduct the same measurement?). Example: based on an assessment criteria checklist, five examiners submit substantially different results for the same student project. This indicates that the assessment checklist has low inter-rater reliability (for example, because the criteria are too subjective).
  • Internal consistency: the consistency of the measurement itself (do you get the same results from different parts of a test that are designed to measure the same thing?). Example: you design a questionnaire to measure self-esteem. If you randomly split the results into two halves, there should be a strong correlation between the two sets of results. If the two results are very different, this indicates low internal consistency. (A computational sketch of test-retest and split-half estimates follows this table.)
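
The following minimal sketch (hypothetical scores, not real data) illustrates two of these estimates: a test-retest correlation and a split-half correlation with the Spearman-Brown correction.

    import numpy as np

    # Hypothetical questionnaire scores for eight participants, measured twice.
    time_1 = np.array([24, 30, 18, 27, 22, 35, 29, 20])
    time_2 = np.array([25, 29, 19, 26, 23, 34, 30, 21])
    test_retest_r = np.corrcoef(time_1, time_2)[0, 1]
    print("Test-retest reliability (r):", round(test_retest_r, 2))

    # Split-half internal consistency: correlate scores on two halves of the items,
    # then apply the Spearman-Brown correction for the full test length.
    half_a = np.array([12, 15, 9, 14, 11, 18, 15, 10])
    half_b = np.array([12, 15, 9, 13, 11, 17, 14, 10])
    r_half = np.corrcoef(half_a, half_b)[0, 1]
    split_half = (2 * r_half) / (1 + r_half)
    print("Split-half reliability (Spearman-Brown):", round(split_half, 2))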

Types of validity

The validity of a measurement can be estimated based on three main types of evidence. Each type can be evaluated through expert judgement or statistical methods.

Type of validity | What does it assess? | Example
  • Construct validity: the adherence of a measure to existing theory and knowledge of the concept being measured. Example: a self-esteem questionnaire could be assessed by measuring other traits known or assumed to be related to the concept of self-esteem (such as social skills). Strong correlation between the scores for self-esteem and associated traits would indicate high construct validity.
  • Content validity: the extent to which the measurement covers all aspects of the concept being measured. Example: a test that aims to measure a class of students’ level of Spanish contains reading, writing and speaking components, but no listening component. Experts agree that listening comprehension is an essential aspect of language ability, so the test lacks content validity for measuring the overall level of ability in Spanish.
  • Criterion validity: the extent to which the result of a measure corresponds to other valid measures of the same concept. Example: a survey is conducted to measure the political opinions of voters in a region. If the results accurately predict the later outcome of an election in that region, this indicates that the survey has high criterion validity. (A computational sketch of this check follows the table.)
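
As a hypothetical sketch of the criterion validity check described in the last row (the poll figures and election results below are invented), one can compare a survey's predictions against the later outcome:

    import numpy as np

    # Hypothetical poll predictions vs. actual election vote shares (%) for four candidates.
    predicted = np.array([42.0, 31.0, 18.0, 9.0])
    actual    = np.array([44.0, 30.0, 17.0, 9.0])

    r = np.corrcoef(predicted, actual)[0, 1]
    mae = np.mean(np.abs(predicted - actual))

    print("Correlation between poll and outcome:", round(r, 3))
    print("Mean absolute error (percentage points):", round(mae, 2))
    # A high correlation and a small error are evidence of criterion validity.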

To assess the validity of a cause-and-effect relationship, you also need to consider internal validity (the design of the experiment ) and external validity (the generalizability of the results).

The reliability and validity of your results depend on creating a strong research design, choosing appropriate methods and samples, and conducting the research carefully and consistently.

Ensuring validity

If you use scores or ratings to measure variations in something (such as psychological traits, levels of ability or physical properties), it’s important that your results reflect the real variations as accurately as possible. Validity should be considered in the very earliest stages of your research, when you decide how you will collect your data.

  • Choose appropriate methods of measurement

Ensure that your method and measurement technique are high quality and targeted to measure exactly what you want to know. They should be thoroughly researched and based on existing knowledge.

For example, to collect data on a personality trait, you could use a standardized questionnaire that is considered reliable and valid. If you develop your own questionnaire, it should be based on established theory or findings of previous studies, and the questions should be carefully and precisely worded.

  • Use appropriate sampling methods to select your subjects

To produce valid and generalizable results, clearly define the population you are researching (e.g., people from a specific age range, geographical location, or profession).  Ensure that you have enough participants and that they are representative of the population. Failing to do so can lead to sampling bias and selection bias .

Ensuring reliability

Reliability should be considered throughout the data collection process. When you use a tool or technique to collect data, it’s important that the results are precise, stable, and reproducible .

  • Apply your methods consistently

Plan your method carefully to make sure you carry out the same steps in the same way for each measurement. This is especially important if multiple researchers are involved.

For example, if you are conducting interviews or observations , clearly define how specific behaviors or responses will be counted, and make sure questions are phrased the same way each time. Failing to do so can lead to errors such as omitted variable bias or information bias .

  • Standardize the conditions of your research

When you collect your data, keep the circumstances as consistent as possible to reduce the influence of external factors that might create variation in the results.

For example, in an experimental setup, make sure all participants are given the same information and tested under the same conditions, preferably in a properly randomized setting. Failing to do so can lead to a placebo effect , Hawthorne effect , or other demand characteristics . If participants can guess the aims or objectives of a study, they may attempt to act in more socially desirable ways.
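
A minimal sketch of the random assignment mentioned above (participant labels are invented): shuffling the participant list before splitting it removes systematic differences between the groups.

    import random

    random.seed(42)  # fixed seed only so the example is reproducible

    participants = [f"P{i:02d}" for i in range(1, 21)]
    random.shuffle(participants)

    # Split the shuffled list into equal-sized treatment and control groups.
    half = len(participants) // 2
    treatment, control = participants[:half], participants[half:]

    print("Treatment:", treatment)
    print("Control:  ", control)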

It’s appropriate to discuss reliability and validity in various sections of your thesis or dissertation or research paper . Showing that you have taken them into account in planning your research and interpreting the results makes your work more credible and trustworthy.

Reliability and validity in a thesis
Section | Discuss
  • Literature review: What have other researchers done to devise and improve methods that are reliable and valid?
  • Methodology: How did you plan your research to ensure reliability and validity of the measures used? This includes the chosen sample set and size, sample preparation, external conditions and measuring techniques.
  • Results: If you calculate reliability and validity, state these values alongside your main results.
  • Discussion: This is the moment to talk about how reliable and valid your results actually were. Were they consistent, and did they reflect true values? If not, why not?
  • Conclusion: If reliability and validity were a big problem for your findings, it might be helpful to mention this here.

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Degrees of freedom
  • Null hypothesis
  • Discourse analysis
  • Control groups
  • Mixed methods research
  • Non-probability sampling
  • Quantitative research
  • Ecological validity

Research bias

  • Rosenthal effect
  • Implicit bias
  • Cognitive bias
  • Selection bias
  • Negativity bias
  • Status quo bias

Cite this Scribbr article


Middleton, F. (2023, June 22). Reliability vs. Validity in Research | Difference, Types and Examples. Scribbr. Retrieved August 27, 2024, from https://www.scribbr.com/methodology/reliability-vs-validity/



Reliability and Validity – Definitions, Types & Examples

Published by Alvin Nicolas at August 16th, 2021 , Revised On October 26, 2023

A researcher must test the collected data before drawing any conclusions. Every research design needs to be concerned with reliability and validity to measure the quality of the research.

What is Reliability?

Reliability refers to the consistency of the measurement. Reliability shows how trustworthy the score of the test is. If the collected data show the same results after being tested using various methods and sample groups, the information is reliable. Reliability on its own does not guarantee that results are valid, but it is a precondition for validity.

Example: If you weigh yourself on a weighing scale throughout the day, you’ll get the same results. These are considered reliable results obtained through repeated measures.

Example: If a teacher gives her students the same math test and repeats it the next week with the same questions, and the students obtain roughly the same scores, then the reliability of the test is high.

What is Validity?

Validity refers to the accuracy of the measurement. Validity shows how a specific test is suitable for a particular situation. If the results are accurate according to the researcher’s situation, explanation, and prediction, then the research is valid. 

If the method of measuring is accurate, then it will produce accurate results. A reliable method is not automatically valid, but if a method is not reliable, it cannot be valid.

Example: Your weighing scale shows different results each time you weigh yourself within a day, even though you handle it carefully and weigh yourself under the same conditions. The weighing machine might be malfunctioning. It means your method has low reliability, and hence you are getting inaccurate and inconsistent results that are not valid.

Example: Suppose a questionnaire is distributed among a group of people to check the quality of a skincare product, and the same questionnaire is then repeated with many other groups. If you get the same responses from the various participants, the questionnaire has high reliability, which is a necessary basis for drawing valid conclusions about the product.

Most of the time, validity is difficult to assess even when the process of measurement is reliable, because it is hard to know whether the measurement reflects the real situation.

Example: If the weighing scale shows the same result, let’s say 70 kg, each time even though your actual weight is 55 kg, then the weighing scale is malfunctioning. It shows consistent results, so its reliability is high, but the results are wrong, so the measurement is not valid. The method has high reliability but low validity.

Internal Vs. External Validity

One of the key features of randomised designs is that they have significantly high internal and external validity.

Internal validity  is the ability to draw a causal link between your treatment and the dependent variable of interest. It means the observed changes should be due to the experiment conducted, and any external factor should not influence the  variables .

Examples of variables that should be controlled include age, ability level, height, and grade.

External validity  is the ability to identify and generalise your study outcomes to the population at large. The relationship between the study’s situation and the situations outside the study is considered external validity.


Threats to Internal Validity

Threat | Definition | Example
  • Confounding factors: unexpected events during the experiment that are not part of the treatment. Example: you attribute the increased weight of your experiment’s participants to a lack of physical activity, but it was actually due to their consumption of coffee with sugar.
  • Maturation: changes in participants over the passage of time that influence the outcome. Example: during a long-term experiment, subjects may become tired, bored, or hungry.
  • Testing: the results of one test affect the results of another test. Example: participants of the first experiment may react differently during the second experiment.
  • Instrumentation: changes in the measuring instrument or its calibration. Example: a change in the measuring instrument partway through the study may give different results instead of the expected results.
  • Statistical regression: groups selected on the basis of extreme scores are not as extreme on subsequent testing. Example: students who failed the pre-final exam are likely to pass the final exam; they might be more confident and careful than before.
  • Selection bias: choosing comparison groups without randomisation. Example: a group of trained and efficient teachers is selected to teach children communication skills instead of selecting teachers randomly.
  • Experimental mortality: participants may leave the experiment as its duration is extended. Example: because of multitasking and varying levels of competition, participants may drop out because they are dissatisfied with the time extension, even if they were doing well.

Threats to External Validity

Threat | Definition | Example
  • Reactive/interactive effects of testing: participants in a pre-test may become aware of the upcoming experiment, and the treatment may not be effective without the pre-test. Example: students who failed the pre-final exam may prepare and behave differently for the final exam; they might be more confident and careful than before.
  • Selection of participants: a group of participants is selected with specific characteristics, and the treatment may work only on participants possessing those characteristics. Example: if an experiment is conducted specifically on the health issues of pregnant women, the same treatment cannot be applied to male participants.

How to Assess Reliability and Validity?

Reliability can be measured by comparing the consistency of the procedure and its results. There are various methods to measure validity and reliability. Reliability can be measured through various statistical methods, depending on the type of reliability, as explained below:

Types of Reliability

Type of reliability | What does it measure? | Example
  • Test-retest: it measures the consistency of results at different points in time, identifying whether the results are the same after repeated measures. Example: suppose a questionnaire is distributed among a group of people to check the quality of a skincare product, and the same questionnaire is repeated with the same group at a later time. If you get the same responses both times, the questionnaire has high test-retest reliability.
  • Inter-rater: it measures the consistency of results obtained at the same time by different raters (researchers). Example: suppose five researchers measure the academic performance of the same student, using questions drawn from all the academic subjects, and submit substantially different results. This shows that the assessment has low inter-rater reliability. (A computational sketch of inter-rater agreement follows this table.)
  • Parallel forms: it measures equivalence, using different forms of the same test administered to the same participants. Example: suppose the same researcher conducts two different forms of a test on the same topic with the same students, for instance a written and an oral test. If the results are the same, the parallel-forms reliability of the test is high; otherwise it is low.
  • Internal consistency (split-half): it measures the consistency of the measurement itself. The results of the same test are split into two halves and compared with each other. If there is a large difference between the two halves, the internal consistency of the test is low.
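
Inter-rater reliability is often summarized with an agreement statistic such as Cohen's kappa; the sketch below uses hypothetical ratings and implements the standard kappa formula directly (no particular library is assumed), correcting raw agreement for the agreement expected by chance.

    from collections import Counter

    # Hypothetical ratings (1-5) given by two raters to the same ten student projects.
    rater_a = [1, 2, 2, 3, 5, 4, 2, 1, 3, 4]
    rater_b = [1, 2, 3, 3, 5, 4, 2, 2, 3, 4]

    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement from each rater's marginal category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    chance = sum((counts_a[c] / n) * (counts_b[c] / n) for c in set(rater_a) | set(rater_b))

    kappa = (observed - chance) / (1 - chance)
    print("Observed agreement:", observed)
    print("Cohen's kappa:", round(kappa, 2))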

Types of Validity

As we discussed above, the reliability of a measurement alone cannot determine its validity, and validity is difficult to measure even if the method is reliable. The following types of tests are conducted to measure validity.

Type of validity | What does it measure? | Example
  • Content validity: it shows whether all aspects of the test/measurement are covered. Example: a language test designed to measure writing, reading, listening, and speaking skills covers all the relevant aspects of language ability, indicating high content validity.
  • Face validity: it concerns how suitable a test or procedure appears on its surface. Example: the type of questions included in the question paper, the time and marks allotted, and the number and categories of questions. Does it look like a good question paper for measuring the academic performance of students?
  • Construct validity: it shows whether the test is measuring the correct construct (ability, attribute, trait, or skill). Example: is a test designed to measure communication skills actually measuring communication skills?
  • Criterion validity: it shows whether the test scores obtained are similar to other measures of the same concept. Example: the results obtained from a pre-final exam of graduates accurately predict the results of the later final exam, showing that the test has high criterion validity. (A computational sketch of this predictive check follows the table.)
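
The criterion validity example in the last row can be illustrated with a small predictive check (hypothetical exam scores; a least-squares fit simply stands in for whatever analysis a real validation study would use):

    import numpy as np

    # Hypothetical pre-final and final exam scores for the same eight students.
    prefinal = np.array([55, 62, 70, 48, 80, 66, 74, 59])
    final    = np.array([58, 65, 72, 50, 84, 70, 78, 61])

    slope, intercept = np.polyfit(prefinal, final, 1)   # simple linear prediction rule
    r = np.corrcoef(prefinal, final)[0, 1]

    print(f"Prediction rule: final ~ {slope:.2f} * prefinal + {intercept:.2f}")
    print("Correlation (criterion validity evidence):", round(r, 3))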


How to Increase Reliability?

  • Use an appropriate questionnaire to measure the competency level.
  • Ensure a consistent environment for participants
  • Make the participants familiar with the criteria of assessment.
  • Train the participants appropriately.
  • Analyse the research items regularly to avoid poor performance.

How to Increase Validity?

Ensuring validity is also not an easy job. Some measures that help to ensure validity are given below:

  • Reactivity should be minimised as a first concern.
  • The Hawthorne effect should be reduced.
  • The respondents should be motivated.
  • The intervals between the pre-test and post-test should not be lengthy.
  • Dropout rates should be avoided.
  • The inter-rater reliability should be ensured.
  • Control and experimental groups should be matched with each other.

How to Implement Reliability and Validity in your Thesis?

According to the experts, it is helpful to address the concepts of reliability and validity explicitly, especially in a thesis or a dissertation. A method for doing so is given below:

Segment | Explanation
  • All the planning about reliability and validity will be discussed here, including the chosen samples and size and the techniques used to measure reliability and validity.
  • Talk about the level of reliability and validity of your results and their influence on the values.
  • Discuss the contribution of other researchers to improving reliability and validity.

Frequently Asked Questions

What is reliability and validity in research?

Reliability in research refers to the consistency and stability of measurements or findings. Validity relates to the accuracy and truthfulness of results, measuring what the study intends to. Both are crucial for trustworthy and credible research outcomes.

What is validity?

Validity in research refers to the extent to which a study accurately measures what it intends to measure. It ensures that the results are truly representative of the phenomena under investigation. Without validity, research findings may be irrelevant, misleading, or incorrect, limiting their applicability and credibility.

What is reliability?

Reliability in research refers to the consistency and stability of measurements over time. If a study is reliable, repeating the experiment or test under the same conditions should produce similar results. Without reliability, findings become unpredictable and lack dependability, potentially undermining the study’s credibility and generalisability.

What is reliability in psychology?

In psychology, reliability refers to the consistency of a measurement tool or test. A reliable psychological assessment produces stable and consistent results across different times, situations, or raters. It ensures that an instrument’s scores are not due to random error, making the findings dependable and reproducible in similar conditions.

What is test-retest reliability?

Test-retest reliability assesses the consistency of measurements taken by a test over time. It involves administering the same test to the same participants at two different points in time and comparing the results. A high correlation between the scores indicates that the test produces stable and consistent results over time.

How to improve reliability of an experiment?

  • Standardise procedures and instructions.
  • Use consistent and precise measurement tools.
  • Train observers or raters to reduce subjective judgments.
  • Increase sample size to reduce random errors.
  • Conduct pilot studies to refine methods.
  • Repeat measurements or use multiple methods.
  • Address potential sources of variability.

What is the difference between reliability and validity?

Reliability refers to the consistency and repeatability of measurements, ensuring results are stable over time. Validity indicates how well an instrument measures what it’s intended to measure, ensuring accuracy and relevance. While a test can be reliable without being valid, a valid test must inherently be reliable. Both are essential for credible research.

Are interviews reliable and valid?

Interviews can be both reliable and valid, but they are susceptible to biases. The reliability and validity depend on the design, structure, and execution of the interview. Structured interviews with standardised questions improve reliability. Validity is enhanced when questions accurately capture the intended construct and when interviewer biases are minimised.

Are IQ tests valid and reliable?

IQ tests are generally considered reliable, producing consistent scores over time. Their validity, however, is a subject of debate. While they effectively measure certain cognitive skills, whether they capture the entirety of “intelligence” or predict success in all life areas is contested. Cultural bias and over-reliance on tests are also concerns.

Are questionnaires reliable and valid?

Questionnaires can be both reliable and valid if well-designed. Reliability is achieved when they produce consistent results over time or across similar populations. Validity is ensured when questions accurately measure the intended construct. However, factors like poorly phrased questions, respondent bias, and lack of standardisation can compromise their reliability and validity.



Understanding Reliability and Validity

These related research issues ask us to consider whether we are studying what we think we are studying and whether the measures we use are consistent.

Reliability

Reliability is the extent to which an experiment, test, or any measuring procedure yields the same result on repeated trials. Without the agreement of independent observers able to replicate research procedures, or the ability to use research tools and procedures that yield consistent measurements, researchers would be unable to satisfactorily draw conclusions, formulate theories, or make claims about the generalizability of their research. In addition to its important role in research, reliability is critical for many parts of our lives, including manufacturing, medicine, and sports.

Reliability is such an important concept that it has been defined in terms of its application to a wide range of activities. For researchers, four key types of reliability are:

Equivalency Reliability

Equivalency reliability is the extent to which two items measure identical concepts at an identical level of difficulty. Equivalency reliability is determined by relating two sets of test scores to one another to highlight the degree of relationship or association. In quantitative studies, and particularly in experimental studies, a correlation coefficient, statistically referred to as r, is used to show the strength of the correlation between a dependent variable (the subject under study) and one or more independent variables, which are manipulated to determine effects on the dependent variable. An important consideration is that equivalency reliability is concerned with correlational, not causal, relationships.

For example, a researcher studying university English students happened to notice that when some students were studying for finals, their holiday shopping began. Intrigued by this, the researcher attempted to observe how often, or to what degree, these two behaviors co-occurred throughout the academic year. The researcher used the results of the observations to assess the correlation between studying throughout the academic year and shopping for gifts. The researcher concluded there was poor equivalency reliability between the two actions. In other words, studying was not a reliable predictor of shopping for gifts.
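
A minimal sketch of the correlation coefficient r used in such an equivalency check (the study-hours and gift-spending figures below are invented), computed directly from its definition:

    import math

    # Hypothetical weekly hours spent studying and money spent on holiday gifts
    # for the same eight students.
    studying = [12, 15, 9, 20, 7, 14, 11, 17]
    shopping = [20, 35, 18, 25, 30, 15, 40, 22]

    n = len(studying)
    mean_x, mean_y = sum(studying) / n, sum(shopping) / n
    cov   = sum((x - mean_x) * (y - mean_y) for x, y in zip(studying, shopping))
    var_x = sum((x - mean_x) ** 2 for x in studying)
    var_y = sum((y - mean_y) ** 2 for y in shopping)

    r = cov / math.sqrt(var_x * var_y)
    print("Correlation coefficient r:", round(r, 2))
    # A value near zero indicates poor equivalency: one behavior is not a stand-in for the other.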

Stability Reliability

Stability reliability (sometimes called test-retest reliability) is the agreement of measuring instruments over time. To determine stability, a measure or test is repeated on the same subjects at a future date. Results are compared and correlated with the initial test to give a measure of stability.

An example of stability reliability would be the method of maintaining weights used by the U.S. Bureau of Standards. Platinum objects of fixed weight (one kilogram, one pound, etc...) are kept locked away. Once a year they are taken out and weighed, allowing scales to be reset so they are "weighing" accurately. Keeping track of how much the scales are off from year to year establishes a stability reliability for these instruments. In this instance, the platinum weights themselves are assumed to have a perfectly fixed stability reliability.
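
A tiny sketch of that bookkeeping (the yearly readings below are invented): tracking how far the scale drifts from the reference weight each year is what establishes its stability reliability.

    # Hypothetical yearly readings (grams) of the same 1,000 g reference weight.
    yearly_readings = {2019: 1000.3, 2020: 999.8, 2021: 1000.1, 2022: 999.6, 2023: 1000.2}
    reference = 1000.0

    for year, reading in sorted(yearly_readings.items()):
        drift = reading - reference
        print(f"{year}: reading = {reading:.1f} g, drift = {drift:+.1f} g")
    # Small, stable drift indicates good stability reliability; large or erratic
    # drift would prompt recalibration of the scales.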

Internal Consistency

Internal consistency is the extent to which tests or procedures assess the same characteristic, skill or quality. It is a measure of the precision between the observers or of the measuring instruments used in a study. This type of reliability often helps researchers interpret data and predict the value of scores and the limits of the relationship among variables.

For example, a researcher designs a questionnaire to find out about college students' dissatisfaction with a particular textbook. Analyzing the internal consistency of the survey items dealing with dissatisfaction will reveal the extent to which items on the questionnaire focus on the notion of dissatisfaction.
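
Internal consistency of such a set of items is commonly summarized with Cronbach's alpha; the sketch below uses invented responses and computes alpha directly from the usual formula, (k / (k - 1)) * (1 - sum of item variances / variance of total scores):

    import numpy as np

    # Hypothetical responses (1-5) of six students to four items, all intended to
    # measure dissatisfaction with the textbook.
    responses = np.array([
        [4, 5, 4, 5],
        [2, 2, 3, 2],
        [5, 4, 5, 5],
        [3, 3, 2, 3],
        [1, 2, 1, 1],
        [4, 4, 5, 4],
    ])

    k = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1)
    total_variance = responses.sum(axis=1).var(ddof=1)

    alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
    print("Cronbach's alpha:", round(alpha, 2))   # values near 1 suggest the items hang together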

Interrater Reliability

Interrater reliability is the extent to which two or more individuals (coders or raters) agree. Interrater reliability addresses the consistency of the implementation of a rating system.

A test of interrater reliability would be the following scenario: Two or more researchers are observing a high school classroom. The class is discussing a movie that they have just viewed as a group. The researchers have a sliding rating scale (1 being most positive, 5 being most negative) with which they are rating the students' oral responses. Interrater reliability assesses the consistency of how the rating system is implemented. For example, if one researcher gives a "1" to a student response, while another researcher gives a "5," obviously the interrater reliability would be inconsistent. Interrater reliability is dependent upon the ability of two or more individuals to be consistent. Training, education and monitoring skills can enhance interrater reliability.
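
A minimal sketch of checking that consistency (hypothetical ratings on the 1-to-5 scale described above): simple agreement rates make gross disagreements, such as a 1 against a 5, immediately visible.

    # Hypothetical ratings of ten oral responses by two observers
    # (1 = most positive, 5 = most negative).
    rater_1 = [1, 3, 2, 5, 4, 1, 2, 3, 4, 2]
    rater_2 = [1, 3, 3, 5, 4, 2, 2, 3, 5, 2]

    n = len(rater_1)
    exact_agreement  = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    within_one_point = sum(abs(a - b) <= 1 for a, b in zip(rater_1, rater_2)) / n

    print(f"Exact agreement: {exact_agreement:.0%}")
    print(f"Agreement within one scale point: {within_one_point:.0%}")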

Related Information: Reliability Example

An example of the importance of reliability is the use of measuring devices in Olympic track and field events. For the vast majority of people, ordinary measuring rulers and their degree of accuracy are reliable enough. However, for an Olympic event, such as the discus throw, the slightest variation in a measuring device -- whether it is a tape, clock, or other device -- could mean the difference between the gold and silver medals. Additionally, it could mean the difference between a new world record and outright failure to qualify for an event. Olympic measuring devices, then, must be reliable from one throw or race to another and from one competition to another. They must also be reliable when used in different parts of the world, as temperature, air pressure, humidity, interpretation, or other variables might affect their readings.

Validity

Validity refers to the degree to which a study accurately reflects or assesses the specific concept that the researcher is attempting to measure. While reliability is concerned with the consistency of the actual measuring instrument or procedure, validity is concerned with the study's success at measuring what the researchers set out to measure.

Researchers should be concerned with both external and internal validity. External validity refers to the extent to which the results of a study are generalizable or transferable. (Most discussions of external validity focus solely on generalizability; see Campbell and Stanley, 1966. We include a reference here to transferability because many qualitative research studies are not designed to be generalized.)

Internal validity refers to (1) the rigor with which the study was conducted (e.g., the study's design, the care taken to conduct measurements, and decisions concerning what was and wasn't measured) and (2) the extent to which the designers of a study have taken into account alternative explanations for any causal relationships they explore (Huitt, 1998). In studies that do not explore causal relationships, only the first of these definitions should be considered when assessing internal validity.

Scholars discuss several more specific types of validity. Brief discussions of several of these types follow:

Face Validity

Face validity is concerned with how a measure or procedure appears. Does it seem like a reasonable way to gain the information the researchers are attempting to obtain? Does it seem well designed? Does it seem as though it will work reliably? Unlike content validity, face validity does not depend on established theories for support (Fink, 1995).

Criterion Related Validity

Criterion related validity, also referred to as instrumental validity, is used to demonstrate the accuracy of a measure or procedure by comparing it with another measure or procedure which has been demonstrated to be valid.

For example, imagine a hands-on driving test has been shown to be an accurate test of driving skills. A written driving test can then be validated using a criterion-related strategy, by comparing scores on the written test with scores on the already-validated hands-on test.

Construct Validity

Construct validity seeks agreement between a theoretical concept and a specific measuring device or procedure. For example, a researcher inventing a new IQ test might spend a great deal of time attempting to "define" intelligence in order to reach an acceptable level of construct validity.

Construct validity can be broken down into two sub-categories: convergent validity and discriminant validity. Convergent validity is the actual general agreement among ratings, gathered independently of one another, where measures should be theoretically related. Discriminant validity is the lack of a relationship among measures which theoretically should not be related.

To understand whether a piece of research has construct validity, three steps should be followed. First, the theoretical relationships must be specified. Second, the empirical relationships between the measures of the concepts must be examined. Third, the empirical evidence must be interpreted in terms of how it clarifies the construct validity of the particular measure being tested (Carmines & Zeller, 1991, p. 23).
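
A small illustration of the second step, examining empirical relationships (all scores below are invented): a new measure should correlate strongly with measures it is theoretically related to and weakly with measures it is not.

    import numpy as np

    # Hypothetical scores for eight people on a new IQ test, an established
    # reasoning test (theoretically related), and shoe size (theoretically unrelated).
    new_iq    = np.array([98, 112, 105, 120, 90, 108, 115, 101])
    reasoning = np.array([95, 115, 102, 122, 88, 110, 118, 99])
    shoe_size = np.array([44, 44, 38, 41, 43, 45, 42, 38])

    r_convergent   = np.corrcoef(new_iq, reasoning)[0, 1]   # expected to be high
    r_discriminant = np.corrcoef(new_iq, shoe_size)[0, 1]   # expected to be near zero

    print("Convergent correlation (new IQ vs. reasoning):", round(r_convergent, 2))
    print("Discriminant correlation (new IQ vs. shoe size):", round(r_discriminant, 2))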

Content Validity

Content Validity is based on the extent to which a measurement reflects the specific intended domain of content (Carmines & Zeller, 1991, p.20).

Content validity is illustrated using the following examples: Researchers aim to study mathematical learning and create a survey to test for mathematical skill. If these researchers only tested for multiplication and then drew conclusions from that survey, their study would not show content validity because it excludes other mathematical functions. Although the establishment of content validity for placement-type exams seems relatively straight-forward, the process becomes more complex as it moves into the more abstract domain of socio-cultural studies. For example, a researcher needing to measure an attitude like self-esteem must decide what constitutes a relevant domain of content for that attitude. For socio-cultural studies, content validity forces the researchers to define the very domains they are attempting to study.

Related Information: Validity Example

Many recreational activities of high school students involve driving cars. A researcher, wanting to measure whether recreational activities have a negative effect on grade point average in high school students, might conduct a survey asking how many students drive to school and then attempt to find a correlation between these two factors. Because many students might use their cars for purposes other than or in addition to recreation (e.g., driving to work after school, driving to school rather than walking or taking a bus), this research study might prove invalid. Even if a strong correlation was found between driving and grade point average, driving to school in and of itself would seem to be an invalid measure of recreational activity.

The challenges of achieving reliability and validity are among the most difficult faced by researchers. In this section, we offer commentaries on these challenges.

Difficulties of Achieving Reliability

It is important to understand some of the problems concerning reliability which might arise. It would be ideal to reliably measure, every time, exactly those things which we intend to measure. However, researchers can go to great lengths and make every attempt to ensure accuracy in their studies, and still deal with the inherent difficulties of measuring particular events or behaviors. Sometimes, and particularly in studies of natural settings, the only measuring device available is the researcher's own observations of human interaction or human reaction to varying stimuli. As these methods are ultimately subjective in nature, results may be unreliable and multiple interpretations are possible. Three of these inherent difficulties are quixotic reliability, diachronic reliability and synchronic reliability.

Quixotic reliability refers to the situation where a single manner of observation consistently, yet erroneously, yields the same result. It is often a problem when research appears to be going well. This consistency might seem to suggest that the experiment was demonstrating perfect stability reliability. This, however, would not be the case.

For example, if a measuring device used in an Olympic competition always read 100 meters for every discus throw, this would be an example of an instrument consistently, yet erroneously, yielding the same result. However, quixotic reliability is often more subtle in its occurrences than this. For example, suppose a group of German researchers doing an ethnographic study of American attitudes ask questions and record responses. Parts of their study might produce responses which seem reliable, yet turn out to measure felicitous verbal embellishments required for "correct" social behavior. Asking Americans, "How are you?" for example, would in most cases, elicit the token, "Fine, thanks." However, this response would not accurately represent the mental or physical state of the respondents.

Diachronic reliability refers to the stability of observations over time. It is similar to stability reliability in that it deals with time. While this type of reliability is appropriate to assess features that remain relatively unchanged over time, such as landscape benchmarks or buildings, the same level of reliability is more difficult to achieve with socio-cultural phenomena.

For example, in a follow-up study one year later of reading comprehension in a specific group of school children, diachronic reliability would be hard to achieve. If the test were given to the same subjects a year later, many confounding variables would have impacted the researchers' ability to reproduce the same circumstances present at the first test. The final results would almost assuredly not reflect the degree of stability sought by the researchers.

Synchronic reliability refers to the similarity of observations within the same time frame; it is not about the similarity of things observed. Synchronic reliability, unlike diachronic reliability, rarely involves observations of identical things. Rather, it concerns itself with particularities of interest to the research.

For example, a researcher studies the actions of a duck's wing in flight and the actions of a hummingbird's wing in flight. Despite the fact that the researcher is studying two distinctly different kinds of wings, the action of the wings and the phenomenon produced is the same.

Comments on a Flawed, Yet Influential Study

An example of the dangers of generalizing from research that is inconsistent, invalid, unreliable, and incomplete is found in the Time magazine article, "On A Screen Near You: Cyberporn" (De Witt, 1995). This article relies on a study done at Carnegie Mellon University to determine the extent and implications of online pornography. Inherent to the study are methodological problems of unqualified hypotheses and conclusions, unsupported generalizations and a lack of peer review.

Ignoring the functional problems that manifest themselves later in the study, it seems that there are a number of ethical problems within the article. The article claims to be an exhaustive study of pornography on the Internet, but it was anything but exhaustive; it resembles a case study more than anything else. Marty Rimm, author of the undergraduate paper that Time used as a basis for the article, claims the paper was an "exhaustive study" of online pornography when, in fact, the study based most of its conclusions about pornography on the Internet on the "descriptions of slightly more than 4,000 images" (Meeks, 1995, p. 1). Some USENET groups see hundreds of postings in a day.

Considering the thousands of USENET groups, 4,000 images no longer carries the authoritative weight that its author intended. The real problem is that the study (an undergraduate paper similar to a second-semester composition assignment) was based not on pornographic images themselves, but on the descriptions of those images. This kind of reduction detracts significantly from the integrity of the final claims made by the author. In fact, this kind of research is comparable to studying the content of pornographic movies based only on the titles of the movies, then making sociological generalizations based on what those titles indicate. (This is obviously a problem with a number of types of validity, because Rimm is not studying what he thinks he is studying, but instead something quite different.)

The author of the Time article, Philip Elmer De Witt, writes, "The research team at CMU has undertaken the first systematic study of pornography on the Information Superhighway" (Godwin, 1995, p. 1). His statement is problematic in at least three ways. First, the research team actually consisted of a few of Rimm's undergraduate friends with no methodological training whatsoever; additionally, no mention of the degree of interrater reliability is made. Second, this "systematic study" is actually merely a "non-randomly selected subset of commercial bulletin-board systems that focus on selling porn" (Godwin, p. 6). As pornography vending is actually just a small part of the whole concerning the use of pornography on the Internet, the entire premise of this study's content validity is firmly called into question. Finally, the use of the term "Information Superhighway" is a false assessment of what in actuality is only a few USENET groups and BBSs (bulletin board systems), which make up only a small fraction of the entire "Information Superhighway" traffic. Essentially, this is yet another violation of content validity.

De Witt is quoted as saying: "In an 18-month study, the team surveyed 917,410 sexually-explicit pictures, descriptions, short-stories and film clips. On those USENET newsgroups where digitized images are stored, 83.5 percent of the pictures were pornographic" (De Witt 40).

Statistically, some interesting contradictions arise. The figure 917,410 was taken from adult-oriented BBSs--none came from actual USENET groups or the Internet itself. This is a glaring discrepancy. Out of the 917,410 files, 212,114 are only descriptions (Hoffman & Novak, 1995, p.2). The question is, how many actual images did the "researchers" see?

"Between April and July 1994, the research team downloaded all available images (3,254)...the team encountered technical difficulties with 13 percent of these images...This left a total of 2,830 images for analysis" (p. 2). This means that out of 917,410 files discussed in this study, 914,580 of them were not even pictures! As for the 83.5 percent figure, this is actually based on "17 alt.binaries groups that Rimm considered pornographic" (p. 2).

In real terms, 17 USENET groups is a fraction of a percent of all USENET groups available. Worse yet, Time claimed that "...only about 3 percent of all messages on the USENET [represent pornographic material], while the USENET itself represents 11.5 percent of the traffic on the Internet" (De Witt, p. 40).

Time neglected to carry the interpretation of this data to its logical conclusion, which is that less than half of 1 percent (3 percent of 11.5 percent) of the messages on the Internet are associated with newsgroups that contain pornographic imagery. Furthermore, of this half percent, an unknown but even smaller percentage of the messages in newsgroups "associated with pornographic imagery" actually contained pornographic material (Hoffman & Novak, p. 3).
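For readers who want to check the arithmetic behind these critiques, the cited figures can be reproduced directly. The snippet below is a minimal sketch that uses only the numbers reported above; the variable names are ours, not the study's.

```python
# Figures as reported in the critiques cited above (Hoffman & Novak, 1995; De Witt, 1995).
total_files = 917_410            # files counted, almost all from adult-oriented BBSs
images_analyzed = 2_830          # images actually available for analysis
usenet_porn_share = 0.03         # ~3 percent of USENET messages (Time's own figure)
usenet_share_of_traffic = 0.115  # USENET as ~11.5 percent of Internet traffic

print(total_files - images_analyzed)                 # 914,580 files that were never analyzed as images
print(usenet_porn_share * usenet_share_of_traffic)   # ~0.00345, i.e. about a third of one percent
```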

Another blunder can be seen in the avoidance of peer review, which suggests that political interests were being served in making the study a Time cover story. Marty Rimm contracted with the Georgetown Law Review and Time in an agreement to publish his study as long as they kept it under lock and key. During the months before publication, many interested scholars and professionals tried in vain to obtain a copy of the study in order to check it for flaws. De Witt justified bypassing such peer review, and also defended the reliability and validity of the study, on the grounds that because the Georgetown Law Review had accepted it, it was therefore reliable and valid and needed no peer review. What he did not know was that law reviews are edited not by professionals but by "third year law students" (Godwin, p. 4).

There are many consequences of the failure to subject such a study to the scrutiny of peer review. If it was Rimm's desire to publish an article about online pornography in a way that legitimized it while escaping the kind of critical review the piece would have undergone in a scholarly journal of computer science, engineering, marketing, psychology, or communications, what better venue than a law journal? A law journal article would have the added advantage of being taken seriously by law professors, lawyers, and legally trained policymakers. By virtue of where it appeared, it would automatically be catapulted into the center of the policy debate surrounding online censorship and freedom of speech (Godwin).

Herein lies the dangerous implication of such a study: because the questions surrounding pornography are of such immediate political concern, the study was placed at the forefront of the U.S. domestic policy debate over censorship on the Internet (an integral aspect of current anti-First Amendment legislation) with little regard for its validity or reliability.

On June 26, the day the article came out, Senator Grassley (co-sponsor of the anti-porn bill, along with Senator Dole) began drafting a speech that was to be delivered that very day in the Senate, using the study as evidence. The same day, at the same time, Mike Godwin posted on the WELL (Whole Earth 'Lectronic Link, a forum for professionals on the Internet) what turned out to be the understatement of the year: "Philip's story is an utter disaster, and it will damage the debate about this issue because we will have to spend lots of time correcting misunderstandings that are directly attributable to the story" (Meeks, p. 7).

As Godwin was writing this, Senator Grassley was speaking to the Senate: "Mr. President, I want to repeat that: 83.5 percent of the 900,000 images reviewed--these are all on the Internet--are pornographic, according to the Carnegie-Mellon study" (p. 7). Several days later, Senator Dole was waving the magazine in front of the Senate like a battle flag.

Donna Hoffman, professor at Vanderbilt University, summed up the dangerous political implications by saying, "The critically important national debate over First Amendment rights and restrictions of information on the Internet and other emerging media requires facts and informed opinion, not hysteria" (p. 1).

In addition to the hysteria, Hoffman sees a plethora of other problems with the study. "Because the content analysis and classification scheme are 'black boxes,'" Hoffman said, "because no reliability and validity results are presented, because no statistical testing of the differences both within and among categories for different types of listings has been performed, and because not a single hypothesis has been tested, formally or otherwise, no conclusions should be drawn until the issues raised in this critique are resolved" (p. 4).
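Hoffman's point about the absence of statistical testing can be made concrete. The sketch below is purely illustrative and uses invented counts, not Rimm's data; it shows the kind of test (here a chi-square test of independence via SciPy) a reviewer would expect to see before any claim that categories or sources differ.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts for illustration only: images classified as pornographic vs. not,
# from two different sources (e.g., commercial BBSs vs. USENET newsgroups).
table = [
    [700, 300],   # source A: pornographic, not pornographic
    [150, 850],   # source B: pornographic, not pornographic
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.1f}, dof = {dof}, p = {p:.3g}")
```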

However, the damage has already been done. This questionable research by an undergraduate engineering major has been generalized to such an extent that even the U.S. Senate, and in particular Senators Grassley and Dole, was duped, albeit through the strength of their own desire to see only what they wanted to see.

Annotated Bibliography

American Psychological Association. (1985). Standards for educational and psychological testing. Washington, DC: Author.

This work focuses on reliability, validity, and the standards that testers need to meet in order to ensure accuracy.

Babbie, E. R. & Huitt, R. E. (1979). The practice of social research (2nd ed.). Belmont, CA: Wadsworth Publishing.

An overview of social research and its applications.

Beauchamp, T. L., Faden, R. R., Wallace, R. J., Jr., & Walters, L. (1982). Ethical issues in social science research. Baltimore and London: The Johns Hopkins University Press.

A systematic overview of ethical issues in social science research, written by researchers with firsthand familiarity with the situations and problems researchers face in their work. This book raises several questions about how reliability and validity can be affected by ethics.

Borman, K. M., et al. (1986). Ethnographic and qualitative research design and why it doesn't work. American Behavioral Scientist, 30, 42-57.

The authors pose questions concerning threats to qualitative research and suggest solutions.

Bowen, K. A. (1996, Oct. 12). The sin of omission -punishable by death to internal validity: An argument for integration of quantitative research methods to strengthen internal validity. Available: http://trochim.human.cornell.edu/gallery/bowen/hss691.htm

An entire Web site that examines the merits of integrating qualitative and quantitative research methodologies through triangulation. The author argues that improving the internal validity of social science will be the result of such a union.

Brinberg, D. & McGrath, J.E. (1985). Validity and the research process . Beverly Hills: Sage Publications.

The authors investigate validity as value and propose the Validity Network Schema, a process by which researchers can infuse validity into their research.

Bussières, J-F. (1996, Oct.12). Reliability and validity of information provided by museum Web sites. Available: http://www.oise.on.ca/~jfbussieres/issue.html

This Web page examines the validity of museum Web sites, which in turn calls into question the validity of Web-based resources in general. It argues that all Web sites should be examined with skepticism about the validity of the information they contain.

Campbell, D. T. & Stanley, J.C. (1963). Experimental and quasi-experimental designs for research. Boston: Houghton Mifflin.

An overview of experimental research that includes pre-experimental designs, controls for internal validity, and tables listing sources of invalidity in quasi-experimental designs. Reference list and examples.

Carmines, E. G. & Zeller, R.A. (1991). Reliability and validity assessment . Newbury Park: Sage Publications.

An introduction to research methodology that includes classical test theory, validity, and methods of assessing reliability.

Carroll, K. M. (1995, Sep.). Methodological issues and problems in the assessment of substance use. Psychological Assessment, 7(3), 349-358.

Discusses methodological issues in research involving the assessment of substance abuse. Introduces strategies for avoiding problems with the reliability and validity of methods.

Connelly, F. M. & Clandinin, D. J. (1990). Stories of experience and narrative inquiry. Educational Researcher, 19(5), 2-12.

A survey of narrative inquiry that outlines criteria, methods, and writing forms. It includes a discussion of risks and dangers in narrative studies, as well as a research agenda for curricula and classroom studies.

De Witt, P.E. (1995, July 3). On a screen near you: Cyberporn. Time, 38-45.

The Time cover story based on the Carnegie Mellon study of online pornography by Marty Rimm, an undergraduate electrical engineering student.

Fink, A. (Ed.). (1995). The survey handbook (Vol. 1). Thousand Oaks, CA: Sage.

A guide to surveys; this is the first volume in a series referred to as the "survey kit." It includes bibliographical references and addresses survey design, analysis, reporting, and how to measure the validity and reliability of surveys.

Fink, A. (Ed.). (1995). How to measure survey reliability and validity (Vol. 7). Thousand Oaks, CA: Sage.

This volume shows how to select and apply reliability and validity criteria. The fundamental principles of scaling and scoring are also considered.

Godwin, M. (1995, July). JournoPorn, dissection of the Time article. Available: http://www.hotwired.com

A detailed critique of Time magazine's Cyberporn , outlining flaws of methodology as well as exploring the underlying assumptions of the article.

Hambleton, R. K. & Zaal, J. N. (Eds.). (1991). Advances in educational and psychological testing. Boston: Kluwer Academic.

Information on the concepts of reliability and validity in psychology and education.

Harnish, D. L. (1992). Human judgment and the logic of evidence: A critical examination of research methods in special education transition literature. In D. L. Harnish et al. (Eds.), Selected readings in transition.

This article investigates threats to validity in special education research.

Haynes, N. M. (1995). How skewed is 'the bell curve'? Book Product Reviews . 1-24.

This paper claims that R.J. Herrnstein and C. Murray's The Bell Curve: Intelligence and Class Structure in American Life does not have scientific merit and claims that the bell curve is an unreliable measure of intelligence.

Healey, J. F. (1993). Statistics: A tool for social research (3rd ed.). Belmont: Wadsworth Publishing.

Inferential statistics, measures of association, and multivariate techniques in statistical analysis for social scientists are addressed.

Helberg, C. (1996, Oct. 12). Pitfalls of data analysis (or how to avoid lies and damned lies). Available: http://maddog/fammed.wisc.edu/pitfalls/

A discussion of things researchers often overlook in their data analysis and of how statistics are often used to skew reliability and validity for the researcher's purposes.

Hoffman, D. L. and Novak, T.P. (1995, July). A detailed critique of the Time article: Cyberporn. Available: http://www.hotwired.com

A methodological critique of the Time article that uncovers some of the fundamental flaws in the statistics and the conclusions made by De Witt.

Huitt, William G. (1998). Internal and External Validity . http://www.valdosta.peachnet.edu/~whuitt/psy702/intro/valdgn.html

A Web document addressing key issues of external and internal validity.

Jones, J. E. & Bearley, W.L. (1996, Oct 12). Reliability and validity of training instruments. Organizational Universe Systems. Available: http://ous.usa.net/relval.htm

The authors discuss the reliability and validity of training design in a business setting. Basic terms are defined and examples provided.

Cultural Anthropology Methods Journal. (1996, Oct. 12). Available: http://www.lawrence.edu/~bradleyc/cam.html

An online journal containing articles on the practical application of research methods when conducting qualitative and quantitative research. Reliability and validity are addressed throughout.

Kirk, J. & Miller, M. M. (1986). Reliability and validity in qualitative research. Beverly Hills: Sage Publications.

This text describes objectivity in qualitative research by focusing on the issues of validity and reliability in terms of their limitations and applicability in the social and natural sciences.

Krakower, J. & Niwa, S. (1985). An assessment of validity and reliability of the institutional performance survey. Boulder, CO: National Center for Higher Education Management Systems.

Addresses educational surveys, higher education research, and organizational effectiveness.

Lauer, J. M. & Asher, J.W. (1988). Composition Research. New York: Oxford University Press.

A discussion of empirical designs in the context of composition research as a whole.

Laurent, J., et al. (1992, Mar.). Review of validity research on the Stanford-Binet Intelligence Scale: 4th ed. Psychological Assessment, 102-112.

This paper looks at the results of construct and criterion-related validity studies to determine whether the SB:FE is a valid measure of intelligence.

LeCompte, M. D., Millroy, W.L., & Preissle, J. eds. (1992). The handbook of qualitative research in education. San Diego: Academic Press.

A compilation of the range of methodological and theoretical qualitative inquiry in the human sciences and education research. Numerous contributing authors apply their expertise to discussing a wide variety of issues pertaining to educational and humanities research as well as suggestions about how to deal with problems when conducting research.

McDowell, I. & Newell, C. (1987). Measuring health: A guide to rating scales and questionnaires . New York: Oxford University Press.

This gives a variety of examples of health measurement techniques and scales and discusses the validity and reliability of important health measures.

Meeks, B. (1995, July). Muckraker: How Time failed. Available: http://www.hotwired.com

A step-by-step outline of the events which took place during the researching, writing, and negotiating of the Time article of 3 July, 1995 titled: On A Screen Near You: Cyberporn .

Merriam, S. B. (1995). What can you tell from an N of 1?: Issues of validity and reliability in qualitative research. Journal of Lifelong Learning v4 , 51-60.

Addresses issues of validity and reliability in qualitative research for education. Discusses philosophical assumptions underlying the concepts of internal validity, reliability, and external validity or generalizability. Presents strategies for ensuring rigor and trustworthiness when conducting qualitative research.

Morris, L.L, Fitzgibbon, C.T., & Lindheim, E. (1987). How to measure performance and use tests. In J.L. Herman (Ed.), Program evaluation kit (2nd ed.). Newbury Park, CA: Sage.

Discussion of reliability and validity as they pertain to measuring students' performance.

Murray, S., et al. (1979, April). Technical issues as threats to internal validity of experimental and quasi-experimental designs. San Francisco: University of California. 8-12.

(From Yang et al. bibliography--unavailable as of this writing.)

Russ-Eft, D. F. (1980, August). Validity and reliability in survey research. American Institutes for Research in the Behavioral Sciences. 227 151.

An investigation of validity and reliability in survey research with an overview of the concepts of reliability and validity. Specific procedures for measuring sources of error are suggested, as well as general suggestions for improving the reliability and validity of survey data. An extensive annotated bibliography is provided.

Ryser, G. R. (1994). Developing reliable and valid authentic assessments for the classroom: Is it possible? Journal of Secondary Gifted Education, 6(1), 62-66.

Defines the meanings of reliability and validity as they apply to standardized measures of classroom assessment. The article defines reliability as scorability and stability, while validity is seen as students' ability to use knowledge authentically in the field.

Schmidt, W., et al. (1982). Validity as a variable: Can the same certification test be valid for all students? Institute for Research on Teaching July, ED 227 151.

A technical report that presents specific criteria for judging content, instructional and curricular validity as related to certification tests in education.

Scholfield, P. (1995). Quantifying language. A researcher's and teacher's guide to gathering language data and reducing it to figures . Bristol: Multilingual Matters.

A guide to categorizing, measuring, testing, and assessing aspects of language. A source for language-related practitioners and researchers, to be used in conjunction with other resources on research methods and statistics. Questions of reliability and validity are also explored.

Scriven, M. (1993). Hard-Won Lessons in Program Evaluation . San Francisco: Jossey-Bass Publishers.

A common sense approach for evaluating the validity of various educational programs and how to address specific issues facing evaluators.

Shou, P. (1993, Jan.). The Singer-Loomis Inventory of Personality: A review and critique. [Paper presented at the Annual Meeting of the Southwest Educational Research Association.]

Evidence for reliability and validity is reviewed. A summary evaluation suggests that the SLIP (developed by two Jungian analysts to allow examination of personality from the perspective of Jung's typology) appears to be a useful tool for educators and counselors.

Sutton, L.R. (1992). Community college teacher evaluation instrument: A reliability and validity study . Diss. Colorado State University.

Studies of reliability and validity in occupational and educational research.

Thompson, B. & Daniel, L. G. (1996, Oct.). Seminal readings on reliability and validity: A "hit parade" bibliography. Educational and Psychological Measurement, 56, 741-745.

Editorial board members of Educational and Psychological Measurement generated this bibliography of definitive publications in measurement research. Many of the articles are directly related to reliability and validity.

Thompson, E. Y., et al. (1995). Overview of qualitative research . Diss. Colorado State University.

A discussion of strengths and weaknesses of qualitative research and its evolution and adaptation. Appendices and annotated bibliography.

Traver, C. et al. (1995). Case Study . Diss. Colorado State University.

This presentation gives an overview of case study research, providing definitions and a brief history and explanation of how to design research.

Trochim, William M. K. (1996). External validity. Available: http://trochim.human.cornell.edu/kb/EXTERVAL.htm

A comprehensive treatment of external validity found in William Trochim's online text about research methods and issues.

Trochim, William M. K. (1996). Introduction to validity. Available: http://trochim.human.cornell.edu/kb/INTROVAL.htm

An introduction to validity found in William Trochim's online text about research methods and issues.

Trochim, William M. K. (1996). Reliability. Available: http://trochim.human.cornell.edu/kb/reltypes.htm

A comprehensive treatment of reliability found in William Trochim's online text about research methods and issues.

Validity. (1996, Oct. 12). Available: http://vislab-www.nps.navy.mil/~haga/validity.html

A source for definitions of various forms and types of reliability and validity.

Vinsonhaler, J. F., et al. (1983, July). Improving diagnostic reliability in reading through training. Institute for Research on Teaching ED 237 934.

This technical report investigates the practical application of a program intended to improve the diagnoses of reading deficient students. Here, reliability is assumed and a pragmatic answer to a specific educational problem is suggested as a result.

Wentland, E. J. & Smith, K.W. (1993). Survey responses: An evaluation of their validity . San Diego: Academic Press.

This book looks at the factors affecting response validity (or the accuracy of self-reports in surveys) and provides several examples with varying accuracy levels.

Wiget, A. (1996). Father Juan Greyrobe: Reconstructing tradition histories, and the reliability and validity of uncorroborated oral tradition. Ethnohistory, 43(3), 459-482.

This paper presents a convincing argument for the validity of oral histories in ethnographic research where at least some of the evidence can be corroborated through written records.

Yang, G. H., et al. (1995). Experimental and quasi-experimental educational research . Diss. Colorado State University.

This discussion defines experimentation and considers the rhetorical issues and advantages and disadvantages of experimental research. Annotated bibliography.

Yarroch, W. L. (1991, Sept.). The implications of content versus validity on science tests. Journal of Research in Science Teaching, 619-629.

The use of content validity as the primary assurance of the measurement accuracy for science assessment examinations is questioned. An alternative accuracy measure, item validity, is proposed to look at qualitative comparisons between different factors.

Yin, R. K. (1989). Case study research: Design and methods . London: Sage Publications.

This book discusses the design process of case study research, including collection of evidence, composing the case study report, and designing single and multiple case studies.

Related Links

Internal Validity Tutorial. An interactive tutorial on internal validity.

http://server.bmod.athabascau.ca/html/Validity/index.shtml

Howell, Jonathan, Paul Miller, Hyun Hee Park, Deborah Sattler, Todd Schack, Eric Spery, Shelley Widhalm, & Mike Palmquist. (2005). Reliability and Validity. Writing@CSU . Colorado State University. https://writing.colostate.edu/guides/guide.cfm?guideid=66

Validity in research: a guide to measuring the right things

Last updated: 27 February 2023. Reviewed by Cathy Heath.


Validity is necessary for all types of studies ranging from market validation of a business or product idea to the effectiveness of medical trials and procedures. So, how can you determine whether your research is valid? This guide can help you understand what validity is, the types of validity in research, and the factors that affect research validity.


  • What is validity?

In the most basic sense, validity is the quality of being based on truth or reason. Valid research strives to eliminate the effects of unrelated information and of the circumstances under which evidence is collected.

Validity in research is the ability to conduct an accurate study with the right tools and conditions to yield acceptable and reliable data that can be reproduced. Researchers rely on carefully calibrated tools for precise measurements. However, collecting accurate information can be more of a challenge.

To achieve and maintain validity, studies must be conducted in environments that don't sway the results. Validity can be compromised by asking the wrong questions or relying on limited data.

Why is validity important in research?

Research is used to improve human life. Every discovery and product, from innovative medical breakthroughs to advanced consumer goods, depends on accurate research to be dependable. Without it, the results couldn't be trusted and products would likely fail; businesses would lose money, and patients couldn't rely on medical treatments.

While wasting money on a lousy product is a concern, a lack of validity paints a much grimmer picture in fields such as medicine or the manufacture of automobiles and airplanes. Whether you're launching an exciting new product or conducting scientific research, validity can determine success or failure.

  • What is reliability?

Reliability is the ability of a method to yield consistent results. If the same result can be consistently achieved by using the same method to measure something, the measurement method is said to be reliable. For example, a thermometer that shows the same temperature each time in a controlled environment is reliable.

While high reliability is a part of measuring validity, it's only part of the puzzle. If the reliable thermometer hasn't been properly calibrated and reliably measures temperatures two degrees too high, it doesn't provide a valid (accurate) measure of temperature. 

Similarly, if a researcher uses a thermometer to measure weight, the results won't be accurate because it's the wrong tool for the job. 
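A short simulation can make this distinction concrete. The sketch below uses illustrative numbers only: it contrasts a consistently miscalibrated thermometer, which is reliable but not valid, with a calibrated one that is both.

```python
import numpy as np

rng = np.random.default_rng(0)
true_temp = 20.0  # temperature of a controlled environment, in degrees C (hypothetical)

# Reliable but not valid: readings cluster tightly, but two degrees too high.
miscalibrated = true_temp + 2.0 + rng.normal(0, 0.05, size=10)
# Reliable and valid: readings cluster tightly around the true value.
calibrated = true_temp + rng.normal(0, 0.05, size=10)

print(f"miscalibrated: mean {miscalibrated.mean():.2f}, sd {miscalibrated.std(ddof=1):.3f}")
print(f"calibrated:    mean {calibrated.mean():.2f}, sd {calibrated.std(ddof=1):.3f}")
```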

  • How are reliability and validity assessed?

While measuring reliability is a part of measuring validity, there are distinct ways to assess both measurements for accuracy. 

How is reliability measured?

The following measures of consistency and stability help assess reliability (a short sketch follows the list):

Consistency and stability of the same measure when repeated multiple times and under different conditions

Consistency and stability of the measure across different test subjects

Consistency and stability of results from different parts of a test designed to measure the same thing
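As a rough illustration of how these ideas translate into numbers, the sketch below computes a test-retest correlation and Cronbach's alpha (a common internal-consistency statistic) on small invented datasets. It is a sketch only, not a prescription for any particular study.

```python
import numpy as np

# Test-retest reliability: correlation between the same measure applied twice (toy data).
test1 = np.array([12, 15, 11, 18, 14, 16, 13, 17])
test2 = np.array([13, 14, 11, 19, 15, 15, 12, 18])
print("test-retest r:", round(np.corrcoef(test1, test2)[0, 1], 2))

# Internal consistency: Cronbach's alpha for a respondents-by-items score matrix (toy data).
def cronbach_alpha(scores: np.ndarray) -> float:
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

items = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
])
print("Cronbach's alpha:", round(cronbach_alpha(items), 2))
```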

How is validity measured?

Since validity refers to how accurately a method measures what it is intended to measure, it can be difficult to assess directly. Validity can be estimated by comparing research results to other relevant data or theories, considering the points below (a brief criterion-validity sketch follows the list):

The adherence of a measure to existing knowledge of how the concept is measured

The ability to cover all aspects of the concept being measured

The relation of the result in comparison with other valid measures of the same concept
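The third point, comparing a result with other valid measures of the same concept, is often summarized as a correlation. The sketch below uses simulated scores (all names and numbers are invented) to show how a criterion validity coefficient might be estimated.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical scores: an established ("gold standard") measure and a new, shorter instrument.
gold_standard = rng.normal(50, 10, size=40)
new_measure = 0.8 * gold_standard + rng.normal(0, 6, size=40)

# Concurrent criterion validity is commonly reported as the correlation with the criterion measure.
r = np.corrcoef(new_measure, gold_standard)[0, 1]
print(f"criterion validity coefficient r = {r:.2f}")
```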

  • What are the types of validity in a research design?

Research validity is broadly grouped into two categories: internal and external. Yet this grouping doesn't clearly define all the different types of validity. Research validity can be divided into seven distinct types.

Face validity : A test that appears valid simply because of the appropriateness or relevance of the testing method, included information, or tools used.

Content validity : The determination that the measure used in research covers the full domain of the content.

Construct validity : The assessment of the suitability of the measurement tool to measure the activity being studied.

Internal validity : The assessment of how your research environment affects measurement results. This is where other factors can’t explain the extent of an observed cause-and-effect response.

External validity : The extent to which the study will be accurate beyond the sample and the level to which it can be generalized in other settings, populations, and measures.

Statistical conclusion validity: The determination of whether a relationship exists between procedures and outcomes (appropriate sampling and measuring procedures along with appropriate statistical tests).

Criterion-related validity : A measurement of the quality of your testing methods against a criterion measure (like a “gold standard” test) that is measured at the same time.

  • Examples of validity

Like different types of research and the various ways to measure validity, examples of validity can vary widely. These include:

A questionnaire may be considered valid because each question addresses specific and relevant aspects of the study subject.

In a brand assessment study, researchers can use comparison testing to verify the results of an initial study. For example, the results from a focus group response about brand perception are considered more valid when the results match that of a questionnaire answered by current and potential customers.

A test to measure a class of students' understanding of the English language contains reading, writing, listening, and speaking components to cover the full scope of how language is used.

  • Factors that affect research validity

Certain factors can affect research validity in both positive and negative ways. By understanding the factors that improve validity and those that threaten it, you can enhance the validity of your study. These include the following (a brief random-assignment sketch follows the list):

Random selection of participants vs. the selection of participants that are representative of your study criteria

Blinding with interventions the participants are unaware of (like the use of placebos)

Manipulating the experiment by inserting a variable that will change the results

Randomly assigning participants to treatment and control groups to avoid bias

Following specific procedures during the study to avoid unintended effects

Conducting a study in the field instead of a laboratory for more accurate results

Replicating the study with different factors or settings to compare results

Using statistical methods to adjust for inconclusive data
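To illustrate one of these factors, random assignment, the sketch below shuffles a hypothetical participant list and splits it into treatment and control groups. The IDs and group sizes are invented for the example.

```python
import random

participants = [f"P{i:02d}" for i in range(1, 21)]  # hypothetical participant IDs

random.seed(42)                 # fixed seed so the illustration is repeatable
random.shuffle(participants)    # random order removes systematic selection effects

half = len(participants) // 2
treatment, control = participants[:half], participants[half:]
print("treatment:", treatment)
print("control:  ", control)
```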

What are the common validity threats in research, and how can their effects be minimized or nullified?

Research validity can be difficult to achieve because of internal and external threats that produce inaccurate results. These factors can jeopardize validity.

History: Events that occur between an early and later measurement

Maturation: With the passage of time, a study can pick up changes that would have occurred naturally outside of the study's settings

Repeated testing: The outcome of earlier tests can change the outcome of subsequent tests

Selection of subjects: Unconscious bias that can result in the selection of non-equivalent comparison groups

Statistical regression: Choosing subjects based on extreme scores doesn't yield an accurate picture of the majority of individuals (see the sketch after this list)

Attrition: When the sample group is diminished significantly during the course of the study

Maturation: When subjects mature during the study and that natural maturation is attributed to the effects of the study
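The statistical regression threat can be demonstrated with a small simulation: when subjects are selected for extreme scores on a first test, their average on a second test drifts back toward the mean even with no intervention at all. The numbers below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
true_ability = rng.normal(100, 15, size=10_000)          # hypothetical stable trait
test1 = true_ability + rng.normal(0, 10, size=10_000)    # observed score = trait + measurement error
test2 = true_ability + rng.normal(0, 10, size=10_000)    # retest with fresh, independent error

selected = test1 > 130                                    # keep only extreme scorers on the first test
print("selected group, test 1 mean:", round(test1[selected].mean(), 1))
print("same group,     test 2 mean:", round(test2[selected].mean(), 1))  # closer to 100, with no treatment
```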

While some validity threats can be minimized or wholly nullified, removing all threats from a study is impossible. For example, random selection can remove unconscious bias and statistical regression. 

Researchers can even hope to avoid attrition by using smaller study groups. Yet smaller study groups could potentially affect the research in other ways. The best practice for researchers to prevent validity threats is careful environmental planning and reliable data-gathering methods.

  • How to ensure validity in your research

Researchers should be mindful of the importance of validity in the early planning stages of any study to avoid inaccurate results. Researchers must take the time to consider tools and methods as well as how the testing environment matches closely with the natural environment in which results will be used.

The following steps can be used to ensure validity in research:

Choose appropriate methods of measurement

Use appropriate sampling to choose test subjects

Create an accurate testing environment

How do you maintain validity in research?

Accurate research is usually conducted over a period of time with different test subjects. To maintain validity across an entire study, you must take specific steps to ensure that gathered data has the same levels of accuracy. 

Consistency is crucial for maintaining validity in research. When researchers apply methods consistently and standardize the circumstances under which data is collected, validity can be maintained across the entire study.

Is there a need for validation of the research instrument before its implementation?

An essential part of validity is choosing the right research instrument or method for accurate results. Consider the thermometer that is reliable but still produces inaccurate results. You're unlikely to achieve research validity without validation activities such as calibration and checks of content and construct validity.

  • Understanding research validity for more accurate results

Without validity, research can't provide the accuracy necessary to deliver a useful study. By getting a clear understanding of validity in research, you can take steps to improve your research skills and achieve more accurate results.
