
Reliability and Validity – Definitions, Types & Examples

Published by Alvin Nicolas on August 16th, 2021; revised on October 26, 2023

A researcher must test the collected data before drawing any conclusions. Every research design needs to address reliability and validity, as these measure the quality of the research.

What is Reliability?

Reliability refers to the consistency of a measurement: it shows how trustworthy the scores of a test are. If the collected data show the same results after being tested using various methods and sample groups, the information is reliable. Note, however, that reliability alone does not make results valid; it is a necessary but not sufficient condition for validity.

Example: If you weigh yourself on a weighing scale throughout the day, you’ll get the same results. These are considered reliable results obtained through repeated measures.

Example: A teacher gives her students a math test and repeats it the next week with the same questions. If the students obtain the same scores, the reliability of the test is high.

What is Validity?

Validity refers to the accuracy of the measurement. Validity shows how suitable a specific test is for a particular situation. If the results are accurate according to the researcher’s situation, explanation, and prediction, then the research is valid.

If the method of measuring is accurate, it will produce accurate results. A valid method must be reliable, but the reverse does not hold: a method can be reliable without being valid.

Example: Your weighing scale shows a different result each time you weigh yourself during the day, even though you handle it carefully and weigh yourself before and after meals. The scale is probably malfunctioning, which means your method has low reliability; the inconsistent results it produces cannot be valid.

Example: Suppose a questionnaire checking the quality of a skincare product is distributed to one group of people and then repeated with several other groups. If you get the same responses from the various participants, the questionnaire has high reliability, which supports (though does not by itself establish) its validity.

Validity is often difficult to establish even when the measurement process is reliable, because a consistent measurement can still misrepresent the real situation.

Example: If the weighing scale shows the same result, let’s say 70 kg, each time even though your actual weight is 55 kg, then the scale is malfunctioning. It shows consistent results and is therefore reliable, but because those results are wrong, the method has low validity.

Internal vs. External Validity

One of the key features of randomised designs is that they tend to have high internal and external validity.

Internal validity  is the ability to draw a causal link between your treatment and the dependent variable of interest. The observed changes should be due to the experiment conducted, and no external factor should influence the  variables .

Example: extraneous participant characteristics such as age, ability level, height, or grade should not be driving the observed changes.

External validity  is the ability to identify and generalise your study outcomes to the population at large. The relationship between the study’s situation and the situations outside the study is considered external validity.



Threats to Internal Validity

Threats to External Validity

How to Assess Reliability and Validity

Reliability can be measured by comparing the consistency of the procedure and its results. There are various methods to measure validity and reliability. Reliability can be measured through  various statistical methods  depending on the type of reliability, as explained below:

Types of Reliability

Types of Validity

As discussed above, the reliability of a measurement alone cannot determine its validity; validity is difficult to measure even when the method is reliable. The following types of tests are conducted to measure validity.


How to Increase Reliability?

  • Use an appropriate questionnaire to measure competency levels.
  • Ensure a consistent environment for participants.
  • Familiarise participants with the assessment criteria.
  • Train participants appropriately.
  • Analyse research items regularly to avoid poor performance.

How to Increase Validity?

Ensuring validity is not an easy job either. Methods that help ensure validity are given below:

  • Minimise reactivity as a first concern.
  • Reduce the Hawthorne effect.
  • Keep respondents motivated.
  • Keep the intervals between the pre-test and post-test short.
  • Minimise participant dropout.
  • Ensure inter-rater reliability.
  • Match the control and experimental groups with each other.

How to Implement Reliability and Validity in your Thesis?

According to experts, it is helpful to implement the concepts of reliability and validity in your research; they are adopted especially widely in theses and dissertations. Address them explicitly when describing your methodology.

Frequently Asked Questions

What are reliability and validity in research?

Reliability in research refers to the consistency and stability of measurements or findings. Validity relates to the accuracy and truthfulness of results, i.e., whether the study measures what it intends to measure. Both are crucial for trustworthy and credible research outcomes.

What is validity?

Validity in research refers to the extent to which a study accurately measures what it intends to measure. It ensures that the results are truly representative of the phenomena under investigation. Without validity, research findings may be irrelevant, misleading, or incorrect, limiting their applicability and credibility.

What is reliability?

Reliability in research refers to the consistency and stability of measurements over time. If a study is reliable, repeating the experiment or test under the same conditions should produce similar results. Without reliability, findings become unpredictable and lack dependability, potentially undermining the study’s credibility and generalisability.

What is reliability in psychology?

In psychology, reliability refers to the consistency of a measurement tool or test. A reliable psychological assessment produces stable and consistent results across different times, situations, or raters. It ensures that an instrument’s scores are not due to random error, making the findings dependable and reproducible in similar conditions.

What is test-retest reliability?

Test-retest reliability assesses the consistency of measurements taken by a test over time. It involves administering the same test to the same participants at two different points in time and comparing the results. A high correlation between the scores indicates that the test produces stable and consistent results over time.

How to improve reliability of an experiment?

  • Standardise procedures and instructions.
  • Use consistent and precise measurement tools.
  • Train observers or raters to reduce subjective judgments.
  • Increase sample size to reduce random errors.
  • Conduct pilot studies to refine methods.
  • Repeat measurements or use multiple methods.
  • Address potential sources of variability.

What is the difference between reliability and validity?

Reliability refers to the consistency and repeatability of measurements, ensuring results are stable over time. Validity indicates how well an instrument measures what it’s intended to measure, ensuring accuracy and relevance. While a test can be reliable without being valid, a valid test must inherently be reliable. Both are essential for credible research.

Are interviews reliable and valid?

Interviews can be both reliable and valid, but they are susceptible to biases. The reliability and validity depend on the design, structure, and execution of the interview. Structured interviews with standardised questions improve reliability. Validity is enhanced when questions accurately capture the intended construct and when interviewer biases are minimised.

Are IQ tests valid and reliable?

IQ tests are generally considered reliable, producing consistent scores over time. Their validity, however, is a subject of debate. While they effectively measure certain cognitive skills, whether they capture the entirety of “intelligence” or predict success in all life areas is contested. Cultural bias and over-reliance on tests are also concerns.

Are questionnaires reliable and valid?

Questionnaires can be both reliable and valid if well-designed. Reliability is achieved when they produce consistent results over time or across similar populations. Validity is ensured when questions accurately measure the intended construct. However, factors like poorly phrased questions, respondent bias, and lack of standardisation can compromise their reliability and validity.


Reliability – Types, Examples and Guide


Definition:

Reliability refers to the consistency, dependability, and trustworthiness of a system, process, or measurement to perform its intended function or produce consistent results over time. It is a desirable characteristic in various domains, including engineering, manufacturing, software development, and data analysis.

Reliability In Engineering

In engineering and manufacturing, reliability refers to the ability of a product, equipment, or system to function without failure or breakdown under normal operating conditions for a specified period. A reliable system consistently performs its intended functions, meets performance requirements, and withstands various environmental factors, stress, or wear and tear.

Reliability In Software Development

In software development, reliability relates to the stability and consistency of software applications or systems. A reliable software program operates consistently without crashing, produces accurate results, and handles errors or exceptions gracefully. Reliability is often measured by metrics such as mean time between failures (MTBF) and mean time to repair (MTTR).
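
To make these metrics concrete, here is a minimal sketch of how MTBF and MTTR might be computed from failure records; the figures and variable names are invented for illustration.

```python
# Minimal sketch: MTBF and MTTR from hypothetical failure records.
# `uptimes` are hours of operation between consecutive failures;
# `repair_times` are hours spent restoring service after each failure.
uptimes = [120.0, 340.5, 95.2, 410.0]
repair_times = [1.5, 0.75, 2.0, 1.25]

mtbf = sum(uptimes) / len(uptimes)            # mean time between failures
mttr = sum(repair_times) / len(repair_times)  # mean time to repair

# Availability follows directly from the two metrics.
availability = mtbf / (mtbf + mttr)
print(f"MTBF: {mtbf:.1f} h, MTTR: {mttr:.2f} h, availability: {availability:.4f}")
```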

Reliability In Data Analysis and Statistics

In data analysis and statistics, reliability refers to the consistency and repeatability of measurements or assessments. For example, if a measurement instrument consistently produces similar results when measuring the same quantity or if multiple raters consistently agree on the same assessment, it is considered reliable. Reliability is often assessed using statistical measures such as test-retest reliability, inter-rater reliability, or internal consistency.

Research Reliability

Research reliability refers to the consistency, stability, and repeatability of research findings. It indicates the extent to which a research study produces consistent and dependable results when conducted under similar conditions. In other words, research reliability assesses whether the same results would be obtained if the study were replicated with the same methodology, sample, and context.

What Affects Reliability in Research

Several factors can affect the reliability of research measurements and assessments. Here are some common factors that can impact reliability:

Measurement Error

Measurement error refers to the variability or inconsistency in the measurements that is not due to the construct being measured. It can arise from various sources, such as the limitations of the measurement instrument, environmental factors, or the characteristics of the participants. Measurement error reduces the reliability of the measure by introducing random variability into the data.

Rater/Observer Bias

In studies involving subjective assessments or ratings, the biases or subjective judgments of the raters or observers can affect reliability. If different raters interpret and evaluate the same phenomenon differently, it can lead to inconsistencies in the ratings, resulting in lower inter-rater reliability.

Participant Factors

Characteristics or factors related to the participants themselves can influence reliability. For example, factors such as fatigue, motivation, attention, or mood can introduce variability in responses, affecting the reliability of self-report measures or performance assessments.

Instrumentation

The quality and characteristics of the measurement instrument can impact reliability. If the instrument lacks clarity, has ambiguous items or instructions, or is prone to measurement errors, it can decrease the reliability of the measure. Poorly designed or unreliable instruments can introduce measurement error and decrease the consistency of the measurements.

Sample Size

Sample size can affect reliability, especially in studies where the reliability coefficient is based on correlations or variability within the sample. A larger sample size generally provides more stable estimates of reliability, while smaller samples can yield less precise estimates.

Time Interval

The time interval between test administrations can impact test-retest reliability. If the time interval is too short, participants may recall their previous responses and answer in a similar manner, artificially inflating the reliability coefficient. On the other hand, if the time interval is too long, true changes in the construct being measured may occur, leading to lower test-retest reliability.

Content Sampling

The specific items or questions included in a measure can affect reliability. If the measure does not adequately sample the full range of the construct being measured or if the items are too similar or redundant, it can result in lower internal consistency reliability.

Scoring and Data Handling

Errors in scoring, data entry, or data handling can introduce variability and impact reliability. Inaccurate or inconsistent scoring procedures, data entry mistakes, or mishandling of missing data can affect the reliability of the measurements.

Context and Environment

The context and environment in which measurements are obtained can influence reliability. Factors such as noise, distractions, lighting conditions, or the presence of others can introduce variability and affect the consistency of the measurements.

Types of Reliability

There are several types of reliability that are commonly discussed in research and measurement contexts. Here are some of the main types of reliability:

Test-Retest Reliability

This type of reliability assesses the consistency of a measure over time. It involves administering the same test or measure to the same group of individuals on two separate occasions and then comparing the results. If the scores are similar or highly correlated across the two testing points, it indicates good test-retest reliability.
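
A minimal sketch of this procedure in Python, using a Pearson correlation and hypothetical scores for the same eight participants at two testing points:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores for the same 8 participants at two testing points.
time1 = np.array([12, 15, 9, 20, 17, 11, 14, 18])
time2 = np.array([13, 14, 10, 19, 18, 10, 15, 17])

# Correlate the two administrations; values near 1 suggest stable scores.
r, p = pearsonr(time1, time2)
print(f"test-retest r = {r:.2f}")
```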

Inter-Rater Reliability

Inter-rater reliability examines the degree of agreement or consistency between different raters or observers who are assessing the same phenomenon. It is commonly used in subjective evaluations or assessments where judgments are made by multiple individuals. High inter-rater reliability suggests that different observers are likely to reach the same conclusions or make consistent assessments.
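
One common statistic for this is Cohen’s kappa, which corrects raw agreement for the agreement expected by chance. Below is a brief sketch using scikit-learn and invented ratings from two hypothetical raters:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical pass/fail judgments by two raters on ten essays.
rater_a = ["pass", "fail", "pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
rater_b = ["pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass", "pass"]

# Cohen's kappa corrects the observed agreement for chance agreement.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")
```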

Internal Consistency Reliability

Internal consistency reliability assesses the extent to which the items or questions within a measure are consistent with each other. It is commonly measured using techniques such as Cronbach’s alpha. High internal consistency reliability indicates that the items within a measure are measuring the same construct or concept consistently.
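
Cronbach’s alpha can be computed directly from a respondents-by-items score matrix. The following is a minimal sketch with invented data:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents x items score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-point responses from six respondents to four related items.
scores = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 4],
    [4, 4, 5, 4],
    [1, 2, 1, 2],
])
print(f"alpha = {cronbach_alpha(scores):.2f}")  # above 0.8 is usually read as good
```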

Parallel Forms Reliability

Parallel forms reliability assesses the consistency of different versions or forms of a test that are intended to measure the same construct. Two equivalent versions of a test are administered to the same group of individuals, and the scores are compared to determine the level of agreement between the forms.

Split-Half Reliability

Split-half reliability involves splitting a measure into two halves and examining the consistency between the two halves. It can be done by dividing the items into odd-even pairs or by randomly splitting the items. The scores from the two halves are then compared to assess the degree of consistency.
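
A sketch of the odd-even approach, with the Spearman-Brown correction that projects the half-test correlation up to the full test length (the survey data are invented):

```python
import numpy as np

# Hypothetical 1-5 responses from six respondents to a 10-item survey.
items = np.array([
    [4, 4, 5, 4, 4, 5, 4, 4, 5, 4],
    [2, 3, 2, 2, 3, 2, 2, 3, 2, 2],
    [5, 5, 4, 5, 5, 5, 4, 5, 5, 5],
    [3, 3, 3, 4, 3, 3, 3, 4, 3, 3],
    [1, 2, 1, 1, 2, 1, 1, 2, 1, 1],
    [4, 3, 4, 4, 3, 4, 4, 3, 4, 4],
])

odd_half = items[:, 0::2].sum(axis=1)   # odd-numbered items
even_half = items[:, 1::2].sum(axis=1)  # even-numbered items

r_halves = np.corrcoef(odd_half, even_half)[0, 1]
# Spearman-Brown: estimate full-length reliability from the half-test correlation.
split_half = 2 * r_halves / (1 + r_halves)
print(f"split-half reliability = {split_half:.2f}")
```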

Alternate Forms Reliability

Alternate forms reliability is similar to parallel forms reliability, but it involves administering two different versions of a test to the same group of individuals. The two forms should be equivalent and measure the same construct. The scores from the two forms are then compared to assess the level of agreement.

Applications of Reliability

Reliability has several important applications across various fields and disciplines. Here are some common applications of reliability:

Psychological and Educational Testing

Reliability is crucial in psychological and educational testing to ensure that the scores obtained from assessments are consistent and dependable. It helps to determine the accuracy and stability of measures such as intelligence tests, personality assessments, academic exams, and aptitude tests.

Market Research

In market research, reliability is important for ensuring consistent and dependable data collection. Surveys, questionnaires, and other data collection instruments need to have high reliability to obtain accurate and consistent responses from participants. Reliability analysis helps researchers identify and address any issues that may affect the consistency of the data.

Health and Medical Research

Reliability is essential in health and medical research to ensure that measurements and assessments used in studies are consistent and trustworthy. This includes the reliability of diagnostic tests, patient-reported outcome measures, observational measures, and psychometric scales. High reliability is crucial for making valid inferences and drawing reliable conclusions from research findings.

Quality Control and Manufacturing

Reliability analysis is widely used in industries such as manufacturing and quality control to assess the reliability of products and processes. It helps to identify and address sources of variation and inconsistency, ensuring that products meet the required standards and specifications consistently.

Social Science Research

Reliability plays a vital role in social science research, including fields such as sociology, anthropology, and political science. It is used to assess the consistency of measurement tools, such as surveys or observational protocols, to ensure that the data collected is reliable and can be trusted for analysis and interpretation.

Performance Evaluation

Reliability is important in performance evaluation systems used in organizations and workplaces. Whether it’s assessing employee performance, evaluating the reliability of scoring rubrics, or measuring the consistency of ratings by supervisors, reliability analysis helps ensure fairness and consistency in the evaluation process.

Psychometrics and Scale Development

Reliability analysis is a fundamental step in psychometrics, which involves developing and validating measurement scales. Researchers assess the reliability of items and subscales to ensure that the scale measures the intended construct consistently and accurately.

Examples of Reliability

Here are some examples of reliability in different contexts:

Test-Retest Reliability Example: A researcher administers a personality questionnaire to a group of participants and then administers the same questionnaire to the same participants after a certain period, such as two weeks. The scores obtained from the two administrations are highly correlated, indicating good test-retest reliability.

Inter-Rater Reliability Example: Multiple teachers assess the essays of a group of students using a standardized grading rubric. The ratings assigned by the teachers show a high level of agreement or correlation, indicating good inter-rater reliability.

Internal Consistency Reliability Example: A researcher develops a questionnaire to measure job satisfaction. The researcher administers the questionnaire to a group of employees and calculates Cronbach’s alpha to assess internal consistency. The calculated value of Cronbach’s alpha is high (e.g., above 0.8), indicating good internal consistency reliability.

Parallel Forms Reliability Example: Two versions of a mathematics exam are created, which are designed to measure the same mathematical skills. Both versions of the exam are administered to the same group of students, and the scores from the two versions are highly correlated, indicating good parallel forms reliability.

Split-Half Reliability Example: A researcher develops a survey to measure self-esteem. The survey consists of 20 items, and the researcher randomly divides the items into two halves. The scores obtained from each half of the survey show a high level of agreement or correlation, indicating good split-half reliability.

Alternate Forms Reliability Example: A researcher develops two versions of a language proficiency test, which are designed to measure the same language skills. Both versions of the test are administered to the same group of participants, and the scores from the two versions are highly correlated, indicating good alternate forms reliability.

Where to Write About Reliability in A Thesis

When writing about reliability in a thesis, there are several sections where you can address this topic. Here are some common sections in a thesis where you can discuss reliability:

Introduction:

In the introduction section of your thesis, you can provide an overview of the study and briefly introduce the concept of reliability. Explain why reliability is important in your research field and how it relates to your study objectives.

Theoretical Framework:

If your thesis includes a theoretical framework or a literature review, this is a suitable section to discuss reliability. Provide an overview of the relevant theories, models, or concepts related to reliability in your field. Discuss how other researchers have measured and assessed reliability in similar studies.

Methodology:

The methodology section is crucial for addressing reliability. Describe the research design, data collection methods, and measurement instruments used in your study. Explain how you ensured the reliability of your measurements or data collection procedures. This may involve discussing pilot studies, inter-rater reliability, test-retest reliability, or other techniques used to assess and improve reliability.

Data Analysis:

In the data analysis section, you can discuss the statistical techniques employed to assess the reliability of your data. This might include measures such as Cronbach’s alpha, Cohen’s kappa, or intraclass correlation coefficients (ICC), depending on the nature of your data and research design. Present the results of reliability analyses and interpret their implications for your study.
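
For instance, an intraclass correlation can be computed with the third-party pingouin package (assumed here; the subjects, raters, and scores are hypothetical):

```python
import pandas as pd
import pingouin as pg  # third-party package with common reliability statistics

# Hypothetical long-format data: four subjects, each scored by three raters.
df = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "rater":   ["A", "B", "C"] * 4,
    "score":   [8, 7, 8, 5, 5, 6, 9, 9, 8, 4, 5, 4],
})

# Returns a table of ICC variants (ICC1, ICC2, ICC3 and their average-rater forms).
icc = pg.intraclass_corr(data=df, targets="subject", raters="rater", ratings="score")
print(icc[["Type", "ICC"]])
```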

Discussion:

In the discussion section, analyze and interpret the reliability results in relation to your research findings and objectives. Discuss any limitations or challenges encountered in establishing or maintaining reliability in your study. Consider the implications of reliability for the validity and generalizability of your results.

Conclusion:

In the conclusion section, summarize the main points discussed in your thesis regarding reliability. Emphasize the importance of reliability in research and highlight any recommendations or suggestions for future studies to enhance reliability.

Importance of Reliability

Reliability is of utmost importance in research, measurement, and various practical applications. Here are some key reasons why reliability is important:

  • Consistency : Reliability ensures consistency in measurements and assessments. Consistent results indicate that the measure or instrument is stable and produces similar outcomes when applied repeatedly. This consistency allows researchers and practitioners to have confidence in the reliability of the data collected and the conclusions drawn from it.
  • Accuracy : Reliability is closely linked to accuracy. A reliable measure produces results that are close to the true value or state of the phenomenon being measured. When a measure is unreliable, it introduces error and uncertainty into the data, which can lead to incorrect interpretations and flawed decision-making.
  • Trustworthiness : Reliability enhances the trustworthiness of measurements and assessments. When a measure is reliable, it indicates that it is dependable and can be trusted to provide consistent and accurate results. This is particularly important in fields where decisions and actions are based on the data collected, such as education, healthcare, and market research.
  • Comparability : Reliability enables meaningful comparisons between different groups, individuals, or time points. When measures are reliable, differences or changes observed can be attributed to true differences in the underlying construct, rather than measurement error. This allows for valid comparisons and evaluations, both within a study and across different studies.
  • Validity : Reliability is a prerequisite for validity. Validity refers to the extent to which a measure or assessment accurately captures the construct it is intended to measure. If a measure is unreliable, it cannot be valid, as it does not consistently reflect the construct of interest. Establishing reliability is an important step in establishing the validity of a measure.
  • Decision-making : Reliability is crucial for making informed decisions based on data. Whether it’s evaluating employee performance, diagnosing medical conditions, or conducting research studies, reliable measurements and assessments provide a solid foundation for decision-making processes. They help to reduce uncertainty and increase confidence in the conclusions drawn from the data.
  • Quality Assurance : Reliability is essential for maintaining quality assurance in various fields. It allows organizations to assess and monitor the consistency and dependability of their processes, products, and services. By ensuring reliability, organizations can identify areas of improvement, address sources of variation, and deliver consistent and high-quality outcomes.

Limitations of Reliability

Here are some limitations of reliability:

  • Limited to consistency: Reliability primarily focuses on the consistency of measurements and findings. However, it does not guarantee the accuracy or validity of the measurements. A measurement can be consistent but still systematically biased or flawed, leading to inaccurate results. Reliability alone cannot address validity concerns.
  • Context-dependent: Reliability can be influenced by the specific context, conditions, or population under study. A measurement or instrument that demonstrates high reliability in one context may not necessarily exhibit the same level of reliability in a different context. Researchers need to consider the specific characteristics and limitations of their study context when interpreting reliability.
  • Inadequate for complex constructs: Reliability is often based on the assumption of unidimensionality, which means that a measurement instrument is designed to capture a single construct. However, many real-world phenomena are complex and multidimensional, making it challenging to assess reliability accurately. Reliability measures may not adequately capture the full complexity of such constructs.
  • Susceptible to systematic errors: Reliability focuses on minimizing random errors, but it may not detect or address systematic errors or biases in measurements. Systematic errors can arise from flaws in the measurement instrument, data collection procedures, or sample selection. Reliability assessments may not fully capture or address these systematic errors, leading to biased or inaccurate results.
  • Relies on assumptions: Reliability assessments often rely on certain assumptions, such as the assumption of measurement invariance or the assumption of stable conditions over time. These assumptions may not always hold true in real-world research settings, particularly when studying dynamic or evolving phenomena. Failure to meet these assumptions can compromise the reliability of the research.
  • Limited to quantitative measures: Reliability is typically applied to quantitative measures and instruments, which can be problematic when studying qualitative or subjective phenomena. Reliability measures may not fully capture the richness and complexity of qualitative data, limiting their applicability in certain research domains.




Understanding Reliability and Validity

These related research issues ask us to consider whether we are studying what we think we are studying and whether the measures we use are consistent.

Reliability

Reliability is the extent to which an experiment, test, or any measuring procedure yields the same result on repeated trials. Without the agreement of independent observers able to replicate research procedures, or the ability to use research tools and procedures that yield consistent measurements, researchers would be unable to satisfactorily draw conclusions, formulate theories, or make claims about the generalizability of their research. In addition to its important role in research, reliability is critical for many parts of our lives, including manufacturing, medicine, and sports.

Reliability is such an important concept that it has been defined in terms of its application to a wide range of activities. For researchers, four key types of reliability are:

Equivalency Reliability

Equivalency reliability is the extent to which two items measure identical concepts at an identical level of difficulty. Equivalency reliability is determined by relating two sets of test scores to one another to highlight the degree of relationship or association. In quantitative studies, and particularly in experimental studies, a correlation coefficient, statistically referred to as r, is used to show the strength of the correlation between a dependent variable (the subject under study) and one or more independent variables, which are manipulated to determine effects on the dependent variable. An important consideration is that equivalency reliability is concerned with correlational, not causal, relationships.

For example, a researcher studying university English students happened to notice that when some students were studying for finals, their holiday shopping began. Intrigued by this, the researcher attempted to observe how often, or to what degree, these two behaviors co-occurred throughout the academic year. The researcher used the results of the observations to assess the correlation between studying throughout the academic year and shopping for gifts. The researcher concluded there was poor equivalency reliability between the two actions. In other words, studying was not a reliable predictor of shopping for gifts.
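
As a brief sketch, the strength of association between two such sets of scores can be computed with NumPy; the scores below are hypothetical:

```python
import numpy as np

# Hypothetical scores of the same seven students on two items intended to
# measure the same concept at the same level of difficulty.
item_a = np.array([70, 85, 60, 92, 78, 66, 88])
item_b = np.array([72, 83, 58, 95, 80, 63, 90])

# Pearson's r between the two sets of scores; a strong positive r
# suggests good equivalency reliability.
r = np.corrcoef(item_a, item_b)[0, 1]
print(f"r = {r:.2f}")
```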

Stability Reliability

Stability reliability (sometimes called test-retest reliability) is the agreement of measuring instruments over time. To determine stability, a measure or test is repeated on the same subjects at a future date. Results are compared and correlated with the initial test to give a measure of stability.

An example of stability reliability would be the method of maintaining weights used by the U.S. Bureau of Standards. Platinum objects of fixed weight (one kilogram, one pound, etc.) are kept locked away. Once a year they are taken out and weighed, allowing scales to be reset so they are "weighing" accurately. Keeping track of how much the scales are off from year to year establishes a stability reliability for these instruments. In this instance, the platinum weights themselves are assumed to have a perfectly fixed stability reliability.

Internal Consistency

Internal consistency is the extent to which tests or procedures assess the same characteristic, skill or quality. It is a measure of the precision between the observers or of the measuring instruments used in a study. This type of reliability often helps researchers interpret data and predict the value of scores and the limits of the relationship among variables.

For example, a researcher designs a questionnaire to find out about college students' dissatisfaction with a particular textbook. Analyzing the internal consistency of the survey items dealing with dissatisfaction will reveal the extent to which items on the questionnaire focus on the notion of dissatisfaction.

Interrater Reliability

Interrater reliability is the extent to which two or more individuals (coders or raters) agree. Interrater reliability addresses the consistency of the implementation of a rating system.

A test of interrater reliability would be the following scenario: Two or more researchers are observing a high school classroom. The class is discussing a movie that they have just viewed as a group. The researchers have a sliding rating scale (1 being most positive, 5 being most negative) with which they are rating the students' oral responses. Interrater reliability assesses the consistency of how the rating system is implemented. For example, if one researcher gives a "1" to a student response, while another researcher gives a "5," obviously the interrater reliability would be inconsistent. Interrater reliability is dependent upon the ability of two or more individuals to be consistent. Training, education and monitoring skills can enhance interrater reliability.

Related Information: Reliability Example

An example of the importance of reliability is the use of measuring devices in Olympic track and field events. For the vast majority of people, ordinary measuring rulers and their degree of accuracy are reliable enough. However, for an Olympic event, such as the discus throw, the slightest variation in a measuring device -- whether it is a tape, clock, or other device -- could mean the difference between the gold and silver medals. Additionally, it could mean the difference between a new world record and outright failure to qualify for an event. Olympic measuring devices, then, must be reliable from one throw or race to another and from one competition to another. They must also be reliable when used in different parts of the world, as temperature, air pressure, humidity, interpretation, or other variables might affect their readings.

Validity refers to the degree to which a study accurately reflects or assesses the specific concept that the researcher is attempting to measure. While reliability is concerned with the consistency of the actual measuring instrument or procedure, validity is concerned with the study's success at measuring what the researchers set out to measure.

Researchers should be concerned with both external and internal validity. External validity refers to the extent to which the results of a study are generalizable or transferable. (Most discussions of external validity focus solely on generalizability; see Campbell and Stanley, 1966. We include a reference here to transferability because many qualitative research studies are not designed to be generalized.)

Internal validity refers to (1) the rigor with which the study was conducted (e.g., the study's design, the care taken to conduct measurements, and decisions concerning what was and wasn't measured) and (2) the extent to which the designers of a study have taken into account alternative explanations for any causal relationships they explore (Huitt, 1998). In studies that do not explore causal relationships, only the first of these definitions should be considered when assessing internal validity.

Scholars discuss several types of internal validity; brief discussions of some of them follow:

Face Validity

Face validity is concerned with how a measure or procedure appears. Does it seem like a reasonable way to gain the information the researchers are attempting to obtain? Does it seem well designed? Does it seem as though it will work reliably? Unlike content validity, face validity does not depend on established theories for support (Fink, 1995).

Criterion Related Validity

Criterion related validity, also referred to as instrumental validity, is used to demonstrate the accuracy of a measure or procedure by comparing it with another measure or procedure which has been demonstrated to be valid.

For example, imagine a hands-on driving test has been shown to be an accurate test of driving skills. By comparing the scores on the written driving test with the scores from the hands-on driving test, the written test can be validated by using a criterion related strategy in which the hands-on driving test is compared to the written test.

Construct Validity

Construct validity seeks agreement between a theoretical concept and a specific measuring device or procedure. For example, a researcher inventing a new IQ test might spend a great deal of time attempting to "define" intelligence in order to reach an acceptable level of construct validity.

Construct validity can be broken down into two sub-categories: convergent validity and discriminant validity. Convergent validity is the actual general agreement among ratings, gathered independently of one another, where measures should be theoretically related. Discriminant validity is the lack of a relationship among measures which theoretically should not be related.

To understand whether a piece of research has construct validity, three steps should be followed. First, the theoretical relationships must be specified. Second, the empirical relationships between the measures of the concepts must be examined. Third, the empirical evidence must be interpreted in terms of how it clarifies the construct validity of the particular measure being tested (Carmines & Zeller, 1991, p. 23).

Content Validity

Content validity is based on the extent to which a measurement reflects the specific intended domain of content (Carmines & Zeller, 1991, p. 20).

Content validity is illustrated using the following examples: Researchers aim to study mathematical learning and create a survey to test for mathematical skill. If these researchers only tested for multiplication and then drew conclusions from that survey, their study would not show content validity because it excludes other mathematical functions. Although the establishment of content validity for placement-type exams seems relatively straight-forward, the process becomes more complex as it moves into the more abstract domain of socio-cultural studies. For example, a researcher needing to measure an attitude like self-esteem must decide what constitutes a relevant domain of content for that attitude. For socio-cultural studies, content validity forces the researchers to define the very domains they are attempting to study.

Related Information: Validity Example

Many recreational activities of high school students involve driving cars. A researcher, wanting to measure whether recreational activities have a negative effect on grade point average in high school students, might conduct a survey asking how many students drive to school and then attempt to find a correlation between these two factors. Because many students might use their cars for purposes other than or in addition to recreation (e.g., driving to work after school, driving to school rather than walking or taking a bus), this research study might prove invalid. Even if a strong correlation was found between driving and grade point average, driving to school in and of itself would seem to be an invalid measure of recreational activity.

The challenges of achieving reliability and validity are among the most difficult faced by researchers. In this section, we offer commentaries on these challenges.

Difficulties of Achieving Reliability

It is important to understand some of the problems concerning reliability which might arise. It would be ideal to reliably measure, every time, exactly those things which we intend to measure. However, researchers can go to great lengths and make every attempt to ensure accuracy in their studies, and still deal with the inherent difficulties of measuring particular events or behaviors. Sometimes, and particularly in studies of natural settings, the only measuring device available is the researcher's own observations of human interaction or human reaction to varying stimuli. As these methods are ultimately subjective in nature, results may be unreliable and multiple interpretations are possible. Three of these inherent difficulties are quixotic reliability, diachronic reliability and synchronic reliability.

Quixotic reliability refers to the situation where a single manner of observation consistently, yet erroneously, yields the same result. It is often a problem when research appears to be going well. This consistency might seem to suggest that the experiment was demonstrating perfect stability reliability. This, however, would not be the case.

For example, if a measuring device used in an Olympic competition always read 100 meters for every discus throw, this would be an example of an instrument consistently, yet erroneously, yielding the same result. However, quixotic reliability is often more subtle in its occurrences than this. For example, suppose a group of German researchers doing an ethnographic study of American attitudes ask questions and record responses. Parts of their study might produce responses which seem reliable, yet turn out to measure felicitous verbal embellishments required for "correct" social behavior. Asking Americans, "How are you?" for example, would in most cases, elicit the token, "Fine, thanks." However, this response would not accurately represent the mental or physical state of the respondents.

Diachronic reliability refers to the stability of observations over time. It is similar to stability reliability in that it deals with time. While this type of reliability is appropriate to assess features that remain relatively unchanged over time, such as landscape benchmarks or buildings, the same level of reliability is more difficult to achieve with socio-cultural phenomena.

For example, in a follow-up study one year later of reading comprehension in a specific group of school children, diachronic reliability would be hard to achieve. If the test were given to the same subjects a year later, many confounding variables would have impacted the researchers' ability to reproduce the same circumstances present at the first test. The final results would almost assuredly not reflect the degree of stability sought by the researchers.

Synchronic reliability refers to the similarity of observations within the same time frame; it is not about the similarity of things observed. Synchronic reliability, unlike diachronic reliability, rarely involves observations of identical things. Rather, it concerns itself with particularities of interest to the research.

For example, a researcher studies the actions of a duck's wing in flight and the actions of a hummingbird's wing in flight. Despite the fact that the researcher is studying two distinctly different kinds of wings, the action of the wings and the phenomenon produced is the same.

Comments on a Flawed, Yet Influential Study

An example of the dangers of generalizing from research that is inconsistent, invalid, unreliable, and incomplete is found in the Time magazine article, "On A Screen Near You: Cyberporn" (De Witt, 1995). This article relies on a study done at Carnegie Mellon University to determine the extent and implications of online pornography. Inherent to the study are methodological problems of unqualified hypotheses and conclusions, unsupported generalizations and a lack of peer review.

Ignoring the functional problems that manifest themselves later in the study, it seems that there are a number of ethical problems within the article. The article claims to be an exhaustive study of pornography on the Internet (it was anything but exhaustive); it resembles a case study more than anything else. Marty Rimm, author of the undergraduate paper that Time used as a basis for the article, claims the paper was an "exhaustive study" of online pornography when, in fact, the study based most of its conclusions about pornography on the Internet on the "descriptions of slightly more than 4,000 images" (Meeks, 1995, p. 1). Some USENET groups see hundreds of postings in a day.

Considering the thousands of USENET groups, 4,000 images no longer carries the authoritative weight that its author intended. The real problem is that the study (an undergraduate paper similar to a second-semester composition assignment) was based not on pornographic images themselves, but on the descriptions of those images. This kind of reduction detracts significantly from the integrity of the final claims made by the author. In fact, this kind of research is commensurate with doing a study of the content of pornographic movies based on the titles of the movies, then making sociological generalizations based on what those titles indicate. (This is obviously a problem with a number of types of validity, because Rimm is not studying what he thinks he is studying, but instead something quite different.)

The author of the Time article, Philip Elmer De Witt, writes, "The research team at CMU has undertaken the first systematic study of pornography on the Information Superhighway" (Godwin, 1995, p. 1). His statement is problematic in at least three ways. First, the research team actually consisted of a few of Rimm's undergraduate friends with no methodological training whatsoever. Additionally, no mention of the degree of interrater reliability is made. Second, this systematic study is actually merely a "non-randomly selected subset of commercial bulletin-board systems that focus on selling porn" (Godwin, p. 6). As pornography vending is actually just a small part of the whole concerning the use of pornography on the Internet, the entire premise of this study's content validity is firmly called into question. Finally, the use of the term "Information Superhighway" is a false assessment of what in actuality is only a few USENET groups and BBSs (Bulletin Board Systems), which make up only a small fraction of the entire "Information Superhighway" traffic. Essentially, this is yet another violation of content validity.

De Witt is quoted as saying: "In an 18-month study, the team surveyed 917,410 sexually-explicit pictures, descriptions, short-stories and film clips. On those USENET newsgroups where digitized images are stored, 83.5 percent of the pictures were pornographic" (De Witt, p. 40).

Statistically, some interesting contradictions arise. The figure 917,410 was taken from adult-oriented BBSs--none came from actual USENET groups or the Internet itself. This is a glaring discrepancy. Out of the 917,410 files, 212,114 are only descriptions (Hoffman & Novak, 1995, p. 2). The question is, how many actual images did the "researchers" see?

"Between April and July 1994, the research team downloaded all available images (3,254)...the team encountered technical difficulties with 13 percent of these images...This left a total of 2,830 images for analysis" (p. 2). This means that out of 917,410 files discussed in this study, 914,580 of them were not even pictures! As for the 83.5 percent figure, this is actually based on "17 alt.binaries groups that Rimm considered pornographic" (p. 2).

In real terms, 17 USENET groups is a fraction of a percent of all USENET groups available. Worse yet, Time claimed that "...only about 3 percent of all messages on the USENET [represent pornographic material], while the USENET itself represents 11.5 percent of the traffic on the Internet" (De Witt, p. 40).

Time neglected to carry the interpretation of this data out to its logical conclusion, which is that less than half of 1 percent (3 percent of 11 percent) of the images on the Internet are associated with newsgroups that contain pornographic imagery. Furthermore, of this half percent, an unknown but even smaller percentage of the messages in newsgroups that are 'associated with pornographic imagery', actually contained pornographic material (Hoffman & Novak, p. 3).
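
The arithmetic behind that conclusion is simple enough to verify directly (the shares are taken from the figures quoted above):

```python
# Pornographic messages: ~3% of USENET traffic; USENET: ~11.5% of Internet traffic.
porn_share_of_usenet = 0.03
usenet_share_of_internet = 0.115

overall_share = porn_share_of_usenet * usenet_share_of_internet
print(f"{overall_share:.3%} of Internet traffic")  # ~0.345%, under half of 1 percent
```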

Another blunder can be seen in the avoidance of peer review, which suggests that there were political interests being served in having the study become a Time cover story. Marty Rimm contracted the Georgetown Law Review and Time in an agreement to publish his study as long as they kept it under lock and key. During the months before publication, many interested scholars and professionals tried in vain to obtain a copy of the study in order to check it for flaws. De Witt justified not letting such peer review take place, and also justified the reliability and validity of the study, on the grounds that because the Georgetown Law Review had accepted it, it was therefore reliable and valid, and needed no peer review. What he didn't know was that law reviews are not edited by professionals, but by "third year law students" (Godwin, p. 4).

There are many consequences of the failure to subject such a study to the scrutiny of peer review. Suppose it was Rimm's desire to publish an article about on-line pornography in a manner that legitimized his article, yet escaped the kind of critical review the piece would have to undergo if published in a scholarly journal of computer science, engineering, marketing, psychology, or communications. What better venue than a law journal? A law journal article would have the added advantage of being taken seriously by law professors, lawyers, and legally-trained policymakers. By virtue of where it appeared, it would automatically be catapulted into the center of the policy debate surrounding online censorship and freedom of speech (Godwin).

Herein lies the dangerous implication of such a study: Because the questions surrounding pornography are of such immediate political concern, the study was placed in the forefront of the U.S. domestic policy debate over censorship on the Internet, (an integral aspect of current anti-First Amendment legislation) with little regard for its validity or reliability.

On June 26, the day the article came out, Senator Grassley, (co-sponsor of the anti-porn bill, along with Senator Dole) began drafting a speech that was to be delivered that very day in the Senate, using the study as evidence. The same day, at the same time, Mike Godwin posted on WELL (Whole Earth 'Lectronic Link, a forum for professionals on the Internet) what turned out to be the overstatement of the year: "Philip's story is an utter disaster, and it will damage the debate about this issue because we will have to spend lots of time correcting misunderstandings that are directly attributable to the story" (Meeks, p. 7).

As Godwin was writing this, Senator Grassley was speaking to the Senate: "Mr. President, I want to repeat that: 83.5 percent of the 900,000 images reviewed--these are all on the Internet--are pornographic, according to the Carnegie-Mellon study" (p. 7). Several days later, Senator Dole was waving the magazine in front of the Senate like a battle flag.

Donna Hoffman, professor at Vanderbilt University, summed up the dangerous political implications by saying, "The critically important national debate over First Amendment rights and restrictions of information on the Internet and other emerging media requires facts and informed opinion, not hysteria" (p. 1).

In addition to the hysteria, Hoffman sees a plethora of other problems with the study. "Because the content analysis and classification scheme are 'black boxes,'" Hoffman said, "because no reliability and validity results are presented, because no statistical testing of the differences both within and among categories for different types of listings has been performed, and because not a single hypothesis has been tested, formally or otherwise, no conclusions should be drawn until the issues raised in this critique are resolved" (p. 4).

However, the damage has already been done. This questionable research by an undergraduate engineering major has been generalized to such an extent that even the U.S. Senate, and in particular Senators Grassley and Dole, have been duped, albeit through the strength of their own desires to see only what they wanted to see.

Annotated Bibliography

American Psychological Association. (1985). Standards for educational and psychological testing. Washington, DC: Author.

This work focuses on reliability, validity, and the standards that testers need to achieve in order to ensure accuracy.

Babbie, E.R. & Huitt, R.E. (1979). The practice of social research (2nd ed.). Belmont, CA: Wadsworth Publishing.

An overview of social research and its applications.

Beauchamp, T.L., Faden, R.R., Wallace, Jr., R.J. & Walters, L. (1982). Ethical issues in social science research. Baltimore and London: The Johns Hopkins University Press.

A systematic overview of ethical issues in Social Science Research written by researchers with firsthand familiarity with the situations and problems researchers face in their work. This book raises several questions of how reliability and validity can be affected by ethics.

Borman, K.M. et al. (1986). Ethnographic and qualitative research design and why it doesn't work. American Behavioral Scientist, 30, 42-57.

The authors pose questions concerning threats to qualitative research and suggest solutions.

Bowen, K.A. (1996, Oct. 12). The sin of omission – punishable by death to internal validity: An argument for integration of quantitative research methods to strengthen internal validity. Available: http://trochim.human.cornell.edu/gallery/bowen/hss691.htm

An entire Web site that examines the merits of integrating qualitative and quantitative research methodologies through triangulation. The author argues that improving the internal validity of social science will be the result of such a union.

Brinberg, D. & McGrath, J.E. (1985). Validity and the research process. Beverly Hills: Sage Publications.

The authors investigate validity as value and propose the Validity Network Schema, a process by which researchers can infuse validity into their research.

Bussières, J-F. (1996, Oct.12). Reliability and validity of information provided by museum Web sites. Available: http://www.oise.on.ca/~jfbussieres/issue.html

This Web page examines the validity of museum Web sites, which calls into question the validity of Web-based resources in general. It argues that all Web sites should be examined with skepticism about the validity of the information contained within them.

Campbell, D. T. & Stanley, J.C. (1963). Experimental and quasi-experimental designs for research. Boston: Houghton Mifflin.

An overview of experimental research that includes pre-experimental designs, controls for internal validity, and tables listing sources of invalidity in quasi-experimental designs. Reference list and examples.

Carmines, E.G. & Zeller, R.A. (1991). Reliability and validity assessment. Newbury Park: Sage Publications.

An introduction to research methodology that includes classical test theory, validity, and methods of assessing reliability.

Carroll, K.M. (1995). Methodological issues and problems in the assessment of substance use. Psychological Assessment, 7(3), 349-358.

Discusses methodological issues in research involving the assessment of substance abuse. Introduces strategies for avoiding problems with the reliability and validity of methods.

Connelly, F.M. & Clandinin, D.J. (1990). Stories of experience and narrative inquiry. Educational Researcher, 19(5), 2-12.

A survey of narrative inquiry that outlines criteria, methods, and writing forms. It includes a discussion of risks and dangers in narrative studies, as well as a research agenda for curricula and classroom studies.

De Witt, P.E. (1995, July 3). On a screen near you: Cyberporn. Time, 38-45.

Time's cover story reporting on the Carnegie Mellon study of online pornography conducted by Marty Rimm, an electrical engineering student.

Fink, A., ed. (1995). The survey handbook (Vol. 1). Thousand Oaks, CA: Sage.

A guide to surveys; this is the first in a series referred to as the "survey kit". It includes bibliographical references and addresses survey design, analysis, reporting, and how to measure the validity and reliability of surveys.

Fink, A., ed. (1995). How to measure survey reliability and validity (Vol. 7). Thousand Oaks, CA: Sage.

This volume shows how to select and apply reliability criteria and validity criteria. The fundamental principles of scaling and scoring are also considered.

Godwin, M. (1995, July). JournoPorn, dissection of the Time article. Available: http://www.hotwired.com

A detailed critique of Time magazine's Cyberporn, outlining flaws of methodology as well as exploring the underlying assumptions of the article.

Hambleton, R.K. & Zaal, J.N., eds. (1991). Advances in educational and psychological testing. Boston: Kluwer Academic.

Information on the concepts of reliability and validity in psychology and education.

Harnish, D.L. (1992). Human judgment and the logic of evidence: A critical examination of research methods in special education transition literature. In D.L. Harnish et al., eds., Selected readings in transition.

This article investigates threats to validity in special education research.

Haynes, N.M. (1995). How skewed is 'the bell curve'? Book Product Reviews, 1-24.

This paper argues that R.J. Herrnstein and C. Murray's The Bell Curve: Intelligence and Class Structure in American Life lacks scientific merit and that the bell curve is an unreliable measure of intelligence.

Healey, J.F. (1993). Statistics: A tool for social research (3rd ed.). Belmont: Wadsworth Publishing.

Inferential statistics, measures of association, and multivariate techniques in statistical analysis for social scientists are addressed.

Helberg, C. (1996, Oct. 12). Pitfalls of data analysis (or how to avoid lies and damned lies). Available: http://maddog.fammed.wisc.edu/pitfalls/

A discussion of things researchers often overlook in their data analysis and how statistics are often used to skew reliability and validity for the researcher's purposes.

Hoffman, D. L. and Novak, T.P. (1995, July). A detailed critique of the Time article: Cyberporn. Available: http://www.hotwired.com

A methodological critique of the Time article that uncovers some of the fundamental flaws in the statistics and the conclusions made by De Witt.

Huitt, William G. (1998). Internal and external validity. Available: http://www.valdosta.peachnet.edu/~whuitt/psy702/intro/valdgn.html

A Web document addressing key issues of external and internal validity.

Jones, J.E. & Bearley, W.L. (1996, Oct. 12). Reliability and validity of training instruments. Organizational Universe Systems. Available: http://ous.usa.net/relval.htm

The authors discuss the reliability and validity of training design in a business setting. Basic terms are defined and examples provided.

Cultural Anthropology Methods Journal. (1996, Oct. 12). Available: http://www.lawrence.edu/~bradleyc/cam.html

An online journal containing articles on the practical application of research methods when conducting qualitative and quantitative research. Reliability and validity are addressed throughout.

Kirk, J. & Miller, M. M. (1986). Reliability and validity in qualitative research. Beverly Hills: Sage Publications.

This text describes objectivity in qualitative research by focusing on the issues of validity and reliability in terms of their limitations and applicability in the social and natural sciences.

Krakower, J. & Niwa, S. (1985). An assessment of validity and reliability of the institutional performance survey. Boulder, CO: National Center for Higher Education Management Systems.

Addresses educational surveys, higher education research, and organizational effectiveness.

Lauer, J. M. & Asher, J.W. (1988). Composition Research. New York: Oxford University Press.

A discussion of empirical designs in the context of composition research as a whole.

Laurent, J. et al. (1992, Mar.). Review of validity research on the Stanford-Binet Intelligence Scale: Fourth Edition. Psychological Assessment, 102-112.

This paper looks at the results of construct and criterion-related validity studies to determine if the SB:FE is a valid measure of intelligence.

LeCompte, M. D., Millroy, W.L., & Preissle, J. eds. (1992). The handbook of qualitative research in education. San Diego: Academic Press.

A compilation of the range of methodological and theoretical qualitative inquiry in the human sciences and education research. Numerous contributing authors apply their expertise to discussing a wide variety of issues pertaining to educational and humanities research as well as suggestions about how to deal with problems when conducting research.

McDowell, I. & Newell, C. (1987). Measuring health: A guide to rating scales and questionnaires . New York: Oxford University Press.

This gives a variety of examples of health measurement techniques and scales and discusses the validity and reliability of important health measures.

Meeks, B. (1995, July). Muckraker: How Time failed. Available: http://www.hotwired.com

A step-by-step outline of the events that took place during the researching, writing, and negotiating of the 3 July 1995 Time article On a Screen Near You: Cyberporn.

Merriam, S.B. (1995). What can you tell from an N of 1?: Issues of validity and reliability in qualitative research. Journal of Lifelong Learning, 4, 51-60.

Addresses issues of validity and reliability in qualitative research for education. Discusses philosophical assumptions underlying the concepts of internal validity, reliability, and external validity or generalizability. Presents strategies for ensuring rigor and trustworthiness when conducting qualitative research.

Morris, L.L., Fitzgibbon, C.T., & Lindheim, E. (1987). How to measure performance and use tests. In J.L. Herman (Ed.), Program evaluation kit (2nd ed.). Newbury Park, CA: Sage.

Discussion of reliability and validity as it pertains to measuring students' performance.

Murray, S., et al. (1979, April). Technical issues as threats to internal validity of experimental and quasi-experimental designs. San Francisco: University of California. 8-12.

(From Yang et al. bibliography--unavailable as of this writing.)

Russ-Eft, D.F. (1980). Validity and reliability in survey research. American Institutes for Research in the Behavioral Sciences, August, 227 151.

An investigation of validity and reliability in survey research, with an overview of the concepts of reliability and validity. Specific procedures for measuring sources of error are suggested, as well as general suggestions for improving the reliability and validity of survey data. An extensive annotated bibliography is provided.

Ryser, G.R. (1994). Developing reliable and valid authentic assessments for the classroom: Is it possible? Journal of Secondary Gifted Education, Fall, 6(1), 62-66.

Defines the meanings of reliability and validity as they apply to standardized measures of classroom assessment. This article defines reliability as scorability and stability, and validity as students' ability to use knowledge authentically in the field.

Schmidt, W., et al. (1982). Validity as a variable: Can the same certification test be valid for all students? Institute for Research on Teaching, July, ED 227 151.

A technical report that presents specific criteria for judging content, instructional and curricular validity as related to certification tests in education.

Scholfield, P. (1995). Quantifying language. A researcher's and teacher's guide to gathering language data and reducing it to figures. Bristol: Multilingual Matters.

A guide to categorizing, measuring, testing, and assessing aspects of language. A source for language-related practitioners and researchers, to be used in conjunction with other resources on research methods and statistics. Questions of reliability and validity are also explored.

Scriven, M. (1993). Hard-Won Lessons in Program Evaluation. San Francisco: Jossey-Bass Publishers.

A common-sense approach for evaluating the validity of various educational programs, with advice on how to address specific issues facing evaluators.

Shou, P. (1993, Jan.). The Singer-Loomis Inventory of Personality: A review and critique. [Paper presented at the Annual Meeting of the Southwest Educational Research Association.]

Evidence for reliability and validity is reviewed. A summary evaluation suggests that the SLIP (developed by two Jungian analysts to allow examination of personality from the perspective of Jung's typology) appears to be a useful tool for educators and counselors.

Sutton, L.R. (1992). Community college teacher evaluation instrument: A reliability and validity study. Diss. Colorado State University.

Studies of reliability and validity in occupational and educational research.

Thompson, B. & Daniel, L.G. (1996, Oct.). Seminal readings on reliability and validity: A "hit parade" bibliography. Educational and Psychological Measurement, 56, 741-745.

Editorial board members of Educational and Psychological Measurement generated this bibliography of definitive publications in measurement research. Many articles are directly related to reliability and validity.

Thompson, E.Y., et al. (1995). Overview of qualitative research. Diss. Colorado State University.

A discussion of strengths and weaknesses of qualitative research and its evolution and adaptation. Appendices and annotated bibliography.

Traver, C. et al. (1995). Case study. Diss. Colorado State University.

This presentation gives an overview of case study research, providing definitions and a brief history and explanation of how to design research.

Trochim, William M.K. (1996). External validity. Available: http://trochim.human.cornell.edu/kb/EXTERVAL.htm

A comprehensive treatment of external validity found in William Trochim's online text about research methods and issues.

Trochim, William M.K. (1996). Introduction to validity. Available: http://trochim.human.cornell.edu/kb/INTROVAL.htm

An introduction to validity found in William Trochim's online text about research methods and issues.

Trochim, William M.K. (1996). Reliability. Available: http://trochim.human.cornell.edu/kb/reltypes.htm

A comprehensive treatment of reliability found in William Trochim's online text about research methods and issues.

Validity. (1996, Oct. 12). Available: http://vislab-www.nps.navy.mil/~haga/validity.html

A source for definitions of various forms and types of reliability and validity.

Vinsonhaler, J.F., et al. (1983, July). Improving diagnostic reliability in reading through training. Institute for Research on Teaching, ED 237 934.

This technical report investigates the practical application of a program intended to improve the diagnoses of reading deficient students. Here, reliability is assumed and a pragmatic answer to a specific educational problem is suggested as a result.

Wentland, E. J. & Smith, K.W. (1993). Survey responses: An evaluation of their validity . San Diego: Academic Press.

This book looks at the factors affecting response validity (or the accuracy of self-reports in surveys) and provides several examples with varying accuracy levels.

Wiget, A. (1996). Father Juan Greyrobe: Reconstructing tradition histories, and the reliability and validity of uncorroborated oral tradition. Ethnohistory, 43(3), 459-482.

This paper presents a convincing argument for the validity of oral histories in ethnographic research where at least some of the evidence can be corroborated through written records.

Yang, G. H., et al. (1995). Experimental and quasi-experimental educational research . Diss. Colorado State University.

This discussion defines experimentation and considers the rhetorical issues and advantages and disadvantages of experimental research. Annotated bibliography.

Yarroch, W.L. (1991, Sept.). The implications of content versus item validity on science tests. Journal of Research in Science Teaching, 619-629.

The use of content validity as the primary assurance of the measurement accuracy for science assessment examinations is questioned. An alternative accuracy measure, item validity, is proposed to look at qualitative comparisons between different factors.

Yin, R.K. (1989). Case study research: Design and methods. London: Sage Publications.

This book discusses the design process of case study research, including collection of evidence, composing the case study report, and designing single and multiple case studies.

Related Links

Internal Validity Tutorial. An interactive tutorial on internal validity.

http://server.bmod.athabascau.ca/html/Validity/index.shtml

Howell, Jonathan, Paul Miller, Hyun Hee Park, Deborah Sattler, Todd Schack, Eric Spery, Shelley Widhalm, & Mike Palmquist. (2005). Reliability and Validity. Writing@CSU . Colorado State University. https://writing.colostate.edu/guides/guide.cfm?guideid=66

Reliability vs Validity in Research: Types & Examples

busayo.longe

In everyday life, we often use "reliable" and "valid" interchangeably. In research and testing, however, reliability and validity are not the same thing.

When it comes to data analysis, reliability refers to how easily replicable an outcome is. For example, if you measure a cup of rice three times, and you get the same result each time, that result is reliable.

Validity, on the other hand, refers to the measurement's accuracy. This means that if the standard weight for a cup of rice is 5 grams, and you measure a cup of rice, it should be 5 grams.

So, while reliability and validity are intertwined, they are not synonymous. If one of the measurement parameters, such as your scale, is distorted, the results will be consistent but invalid.

Data must be consistent and accurate to be used to draw useful conclusions. In this article, we'll look at how to assess data reliability and validity, as well as how to apply them.


What is Reliability?

When a measurement is consistent, it's reliable. Reliability doesn't mean your outcome will be identical every time; it just means it will fall within the same range.

For example, if you scored 95% on a test the first time and 96% the next, your results are reliable. So, even if there is a minor difference in the outcomes, as long as it is within the error margin, your results are reliable.

Reliability allows you to assess the degree of consistency in your results. So, if you’re getting similar results, reliability provides an answer to the question of how similar your results are.

What is Validity?

A measurement or test is valid when it correlates with the expected result. It examines the accuracy of your result.

Here’s where things get tricky: to establish the validity of a test, the results must be consistent. Looking at most experiments (especially physical measurements), the standard value that establishes the accuracy of a measurement is the outcome of repeating the test to obtain a consistent result.


For example, before I can conclude that all 12-inch rulers are one foot, I must repeat the experiment several times and obtain very similar results, indicating that 12-inch rulers are indeed one foot.

In most scientific experiments, validity and reliability are inextricably linked. For example, if you're measuring distance or depth, valid answers are likely to be reliable.

But in social research, one isn't necessarily an indication of the other. For example, most people believe that people who wear glasses are smart.

Of course, I’ll find examples of people who wear glasses and have high IQs (reliability), but the truth is that most people who wear glasses simply need their vision to be better (validity). 

So reliable answers aren’t always correct but valid answers are always reliable.

How Are Reliability and Validity Assessed?

When assessing reliability, we want to know if the measurement can be replicated. Of course, we’d have to change some variables to ensure that this test holds, the most important of which are time, items, and observers.

If the main factor you change when performing a reliability test is time, you’re performing a test-retest reliability assessment.


However, if you are changing items, you are performing an internal consistency assessment. It means you’re measuring multiple items with a single instrument.

Finally, if you’re measuring the same item with the same instrument but using different observers or judges, you’re performing an inter-rater reliability test.

Assessing Validity

Evaluating validity can be more tedious than reliability. With reliability, you’re attempting to demonstrate that your results are consistent, whereas, with validity, you want to prove the correctness of your outcome.

Although validity is mainly categorized under two sections (internal and external), there are more than fifteen ways to check the validity of a test. In this article, we’ll be covering four.

First, content validity measures whether the test covers all the content it needs to produce the outcome you're expecting.

Suppose I wanted to test the hypothesis that 90% of Generation Z uses social media polls for surveys while 90% of millennials use forms. I’d need a sample size that accounts for how Gen Z and millennials gather information.

Next, criterion validity is when you compare your results to what you're supposed to get based on a chosen criterion. It can be measured in two ways: predictive or concurrent validity.


Following that, we have face validity. It's about whether a test appears, at face value, to measure what it's supposed to. For instance, when answering a customer service survey, I'd expect to be asked how I feel about the service provided.

Lastly, construct-related validity. This is a little more complicated, but it helps to show how the validity of research is based on different findings.

As a result, it provides information that either proves or disproves that certain things are related.

Types of Reliability

We have three main types of reliability assessment and here’s how they work:

1) Test-retest Reliability

This assessment refers to the consistency of outcomes over time. Testing reliability over time does not imply changing the amount of time it takes to conduct an experiment; rather, it means repeating the experiment multiple times within a short period.

For example, if I measure the length of my hair today, and tomorrow, I’ll most likely get the same result each time. 

A short period is relative in terms of reliability; two days for measuring hair length is considered short. But that’s far too long to test how quickly water dries on the sand.

A test-retest correlation is used to compare the consistency of your results. This is typically a scatter plot that shows how similar your values are between the two experiments.

If your answers are reliable, your scatter plots will most likely have a lot of overlapping points, but if they aren’t, the points (values) will be spread across the graph.
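To make this concrete, here is a minimal sketch in Python of a test-retest check. The scores are hypothetical illustration data, not from any real study: compute Pearson's r between the two administrations and inspect the scatter plot.

```python
# Minimal test-retest sketch: correlate two administrations of the same
# test. All scores below are hypothetical illustration data.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

time1 = np.array([95, 88, 76, 91, 83, 70, 89, 94])  # first administration (%)
time2 = np.array([96, 86, 78, 90, 85, 72, 88, 93])  # same people, retested

r, p = pearsonr(time1, time2)
print(f"test-retest r = {r:.2f}")  # values near 1 suggest consistent scores

# Reliable measurements cluster along the diagonal of the scatter plot;
# unreliable ones spread across the graph.
plt.scatter(time1, time2)
plt.xlabel("Score at time 1")
plt.ylabel("Score at time 2")
plt.show()
```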


2) Internal Consistency

It’s also known as internal reliability. It refers to the consistency of results for various items when measured on the same scale.

This is particularly important in social science research, such as surveys, because it helps determine the consistency of people’s responses when asked the same questions.

Most introverts, for example, would say they enjoy spending time alone and having few friends. However, if some introverts claim that they either do not want time alone or prefer to be surrounded by many friends, it doesn’t add up.

Either these people aren't really introverts, or this factor isn't a reliable way of measuring introversion.

Internal reliability helps you prove the consistency of a test by varying factors. It's a little tough to measure quantitatively, but you could use the split-half correlation.

The split-half correlation simply means dividing the factors used to measure the underlying construct into two and plotting them against each other in the form of a scatter plot.

Introverts, for example, are assessed on their need for alone time as well as their desire to have as few friends as possible. If this plot is dispersed, it is likely that one of the traits does not indicate introversion.
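To see the split-half idea in code, here is a minimal sketch in Python on hypothetical responses to a six-item introversion scale (all numbers are made up for illustration): total the odd- and even-numbered items separately, then correlate the two half-scores. The Spearman-Brown step at the end is a common correction for the fact that each half is only half the length of the full scale.

```python
# Minimal split-half sketch on hypothetical questionnaire data:
# rows are respondents, columns are six items scored 1-5.
import numpy as np
from scipy.stats import pearsonr

responses = np.array([
    [5, 4, 5, 5, 4, 5],
    [2, 1, 2, 1, 2, 2],
    [4, 4, 3, 4, 5, 4],
    [1, 2, 1, 2, 1, 1],
    [3, 3, 4, 3, 3, 4],
])

odd_half = responses[:, 0::2].sum(axis=1)   # items 1, 3, 5
even_half = responses[:, 1::2].sum(axis=1)  # items 2, 4, 6

r, _ = pearsonr(odd_half, even_half)
corrected = 2 * r / (1 + r)  # Spearman-Brown step-up for half-length scales
print(f"split-half r = {r:.2f}, corrected = {corrected:.2f}")
```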

3) Inter-Rater Reliability

This method of measuring reliability helps prevent personal bias. Inter-rater reliability assessment helps judge outcomes from the different perspectives of multiple observers.

A good example is if you ordered a meal and found it delicious. You could be biased in your judgment for several reasons: perception of the meal, your mood, and so on.

But it’s highly unlikely that six more people would agree that the meal is delicious if it isn’t. Another factor that could lead to bias is expertise. Professional dancers, for example, would perceive dance moves differently than non-professionals. 


So, if a person dances and records it, and both groups (professional and unprofessional dancers) rate the video, there is a high likelihood of a significant difference in their ratings.

But if they both agree that the person is a great dancer, despite their opposing viewpoints, the person is likely a great dancer.

Types of Validity

Researchers use validity to determine whether a measurement is accurate or not. The accuracy of measurement is usually determined by comparing it to the standard value.

When a measurement is consistent over time and has high internal consistency, it increases the likelihood that it is valid.

1) Content Validity

This refers to determining validity by evaluating what is being measured. So content validity tests if your research is measuring everything it should to produce an accurate result.

For example, if I were to measure what causes hair loss in women, I'd have to consider things like postpartum hair loss, alopecia, hair manipulation, dryness, and so on.

By omitting any of these critical factors, you risk significantly reducing the validity of your research because you won’t be covering everything necessary to make an accurate deduction. 


For example, suppose a certain woman is losing her hair due to postpartum hair loss, excessive manipulation, and dryness, but in my research I only look at postpartum hair loss. My research will show that she has postpartum hair loss, which isn't the whole picture.

Yes, my conclusion is partly correct, but it does not fully account for the reasons why this woman is losing her hair.

2) Criterion Validity

This measures how well your measurement correlates with the variables you want to compare it with to get your result. The two main classes of criterion validity are predictive and concurrent.

3) Predictive validity

It helps predict future outcomes based on the data you have. For example, if a large number of students performed exceptionally well in a test, you can use this to predict that they understood the concept on which the test was based and will perform well in their exams.

4) Concurrent validity

Concurrent validity, on the other hand, involves testing with different variables at the same time. For example, setting up a literature test for your students on two different books and assessing them at the same time.

You’re measuring your students’ literature proficiency with these two books. If your students truly understood the subject, they should be able to correctly answer questions about both books.

5) Face Validity

Quantifying face validity might be a bit difficult because you are measuring perceived validity, not the validity itself. So, face validity is concerned with whether the method used for measurement looks like it will produce accurate results, rather than with the measurement itself.

If the method used for measurement doesn’t appear to test the accuracy of a measurement, its face validity is low.

Here's an example: suppose I want to show that less than 40% of men over the age of 20 in Texas, USA, are at least 6 feet tall. The most logical approach would be to collect height data from men over the age of 20 in Texas, USA.

However, asking men over the age of 20 what their favorite meal is to determine their height is pretty bizarre. That method would be quite questionable because it has no relation to what I want to measure.

6) Construct-Related Validity

Construct-related validity assesses the accuracy of your research by collecting multiple pieces of evidence. It helps determine the validity of your results by comparing them to evidence that supports or refutes your measurement.

7) Convergent validity

If you’re assessing evidence that strongly correlates with the concept, that’s convergent validity . 

8) Discriminant validity

Discriminant validity examines the validity of your research by determining what not to base it on. You remove elements that are not a strong factor, to help validate your research. Being a vegan, for example, does not imply that you are allergic to meat.

How to Ensure Validity and Reliability in Your Research

You need a bulletproof research design to ensure that your research is both valid and reliable. This means that your methods, sample, and even you, the researcher, shouldn’t be biased.

  • Ensuring Reliability

To enhance the reliability of your research, you need to apply your measurement method consistently. The chances of reproducing the same results for a test are higher when you maintain the method you’re using to experiment.

For example, you want to determine the reliability of the weight of a bag of chips using a scale. You have to consistently use this scale to measure the bag of chips each time you experiment.

You must also keep the conditions of your research consistent. For instance, if you’re experimenting to see how quickly water dries on sand, you need to consider all of the weather elements that day.

So, if you experimented on a sunny day, the next experiment should also be conducted on a sunny day to obtain a reliable result.

  • Ensuring Validity

There are several ways to determine the validity of your research, and the majority of them require the use of highly specific and high-quality measurement methods.

Before you begin your test, choose the best method for producing the desired results. This method should be pre-existing and proven.

Also, your sample should be very specific. If you’re collecting data on how dogs respond to fear, your results are more likely to be valid if you base them on a specific breed of dog rather than dogs in general.

Validity and reliability are critical for achieving accurate and consistent results in research. While reliability does not always imply validity, validity establishes that a result is reliable. Validity is heavily dependent on previous results (standards), whereas reliability is dependent on the similarity of your results.



Reliability vs Validity in Research | Differences, Types & Examples

Published on 3 May 2022 by Fiona Middleton. Revised on 10 October 2022.

Reliability and validity are concepts used to evaluate the quality of research. They indicate how well a method , technique, or test measures something. Reliability is about the consistency of a measure, and validity is about the accuracy of a measure.

It’s important to consider reliability and validity when you are creating your research design , planning your methods, and writing up your results, especially in quantitative research .

Table of contents

  • Understanding reliability vs validity
  • How are reliability and validity assessed?
  • How to ensure validity and reliability in your research
  • Where to write about reliability and validity in a thesis

Reliability and validity are closely related, but they mean different things. A measurement can be reliable without being valid. However, if a measurement is valid, it is usually also reliable.

What is reliability?

Reliability refers to how consistently a method measures something. If the same result can be consistently achieved by using the same methods under the same circumstances, the measurement is considered reliable.

What is validity?

Validity refers to how accurately a method measures what it is intended to measure. If research has high validity, that means it produces results that correspond to real properties, characteristics, and variations in the physical or social world.

High reliability is one indicator that a measurement is valid. If a method is not reliable, it probably isn’t valid.

However, reliability on its own is not enough to ensure validity. Even if a test is reliable, it may not accurately reflect the real situation.

Validity is harder to assess than reliability, but it is even more important. To obtain useful results, the methods you use to collect your data must be valid: the research must be measuring what it claims to measure. This ensures that your discussion of the data and the conclusions you draw are also valid.


Reliability can be estimated by comparing different versions of the same measurement. Validity is harder to assess, but it can be estimated by comparing the results to other relevant data or theory. Methods of estimating reliability and validity are usually split up into different types.

Types of reliability

Different types of reliability can be estimated through various statistical methods.

Types of validity

The validity of a measurement can be estimated based on three main types of evidence. Each type can be evaluated through expert judgement or statistical methods.

To assess the validity of a cause-and-effect relationship, you also need to consider internal validity (the design of the experiment ) and external validity (the generalisability of the results).

The reliability and validity of your results depends on creating a strong research design , choosing appropriate methods and samples, and conducting the research carefully and consistently.

Ensuring validity

If you use scores or ratings to measure variations in something (such as psychological traits, levels of ability, or physical properties), it’s important that your results reflect the real variations as accurately as possible. Validity should be considered in the very earliest stages of your research, when you decide how you will collect your data .

  • Choose appropriate methods of measurement

Ensure that your method and measurement technique are of high quality and targeted to measure exactly what you want to know. They should be thoroughly researched and based on existing knowledge.

For example, to collect data on a personality trait, you could use a standardised questionnaire that is considered reliable and valid. If you develop your own questionnaire, it should be based on established theory or the findings of previous studies, and the questions should be carefully and precisely worded.

  • Use appropriate sampling methods to select your subjects

To produce valid generalisable results, clearly define the population you are researching (e.g., people from a specific age range, geographical location, or profession). Ensure that you have enough participants and that they are representative of the population.

Ensuring reliability

Reliability should be considered throughout the data collection process. When you use a tool or technique to collect data, it’s important that the results are precise, stable, and reproducible.

  • Apply your methods consistently

Plan your method carefully to make sure you carry out the same steps in the same way for each measurement. This is especially important if multiple researchers are involved.

For example, if you are conducting interviews or observations, clearly define how specific behaviours or responses will be counted, and make sure questions are phrased the same way each time.

  • Standardise the conditions of your research

When you collect your data, keep the circumstances as consistent as possible to reduce the influence of external factors that might create variation in the results.

For example, in an experimental setup, make sure all participants are given the same information and tested under the same conditions.

It’s appropriate to discuss reliability and validity in various sections of your thesis or dissertation or research paper. Showing that you have taken them into account in planning your research and interpreting the results makes your work more credible and trustworthy.

Middleton, F. (2022, October 10). Reliability vs Validity in Research | Differences, Types & Examples. Scribbr. Retrieved 21 May 2024, from https://www.scribbr.co.uk/research-methods/reliability-or-validity/

Grad Coach

Validity & Reliability In Research

A Plain-Language Explanation (With Examples)

By: Derek Jansen (MBA) | Expert Reviewer: Kerryn Warren (PhD) | September 2023

Validity and reliability are two related but distinctly different concepts within research. Understanding what they are and how to achieve them is critically important to any research project. In this post, we’ll unpack these two concepts as simply as possible.


Overview: Validity & Reliability

  • The big picture
  • Validity 101
  • Reliability 101 
  • Key takeaways

First, The Basics…

First, let’s start with a big-picture view and then we can zoom in to the finer details.

Validity and reliability are two incredibly important concepts in research, especially within the social sciences. Both validity and reliability have to do with the measurement of variables and/or constructs – for example, job satisfaction, intelligence, productivity, etc. When undertaking research, you’ll often want to measure these types of constructs and variables and, at the simplest level, validity and reliability are about ensuring the quality and accuracy of those measurements .

As you can probably imagine, if your measurements aren’t accurate or there are quality issues at play when you’re collecting your data, your entire study will be at risk. Therefore, validity and reliability are very important concepts to understand (and to get right). So, let’s unpack each of them.


What Is Validity?

In simple terms, validity (also called “construct validity”) is all about whether a research instrument accurately measures what it’s supposed to measure .

For example, let’s say you have a set of Likert scales that are supposed to quantify someone’s level of overall job satisfaction. If this set of scales focused purely on only one dimension of job satisfaction, say pay satisfaction, this would not be a valid measurement, as it only captures one aspect of the multidimensional construct. In other words, pay satisfaction alone is only one contributing factor toward overall job satisfaction, and therefore it’s not a valid way to measure someone’s job satisfaction.


Oftentimes in quantitative studies, the way in which the researcher or survey designer interprets a question or statement can differ from how the study participants interpret it . Given that respondents don’t have the opportunity to ask clarifying questions when taking a survey, it’s easy for these sorts of misunderstandings to crop up. Naturally, if the respondents are interpreting the question in the wrong way, the data they provide will be pretty useless . Therefore, ensuring that a study’s measurement instruments are valid – in other words, that they are measuring what they intend to measure – is incredibly important.

There are various types of validity and we’re not going to go down that rabbit hole in this post, but it’s worth quickly highlighting the importance of making sure that your research instrument is tightly aligned with the theoretical construct you’re trying to measure .  In other words, you need to pay careful attention to how the key theories within your study define the thing you’re trying to measure – and then make sure that your survey presents it in the same way.

For example, sticking with the “job satisfaction” construct we looked at earlier, you’d need to clearly define what you mean by job satisfaction within your study (and this definition would of course need to be underpinned by the relevant theory). You’d then need to make sure that your chosen definition is reflected in the types of questions or scales you’re using in your survey . Simply put, you need to make sure that your survey respondents are perceiving your key constructs in the same way you are. Or, even if they’re not, that your measurement instrument is capturing the necessary information that reflects your definition of the construct at hand.



What Is Reliability?

As with validity, reliability is an attribute of a measurement instrument – for example, a survey, a weight scale or even a blood pressure monitor. But while validity is concerned with whether the instrument is measuring the “thing” it’s supposed to be measuring, reliability is concerned with consistency and stability . In other words, reliability reflects the degree to which a measurement instrument produces consistent results when applied repeatedly to the same phenomenon , under the same conditions .

As you can probably imagine, a measurement instrument that achieves a high level of consistency is naturally more dependable (or reliable) than one that doesn’t – in other words, it can be trusted to provide consistent measurements . And that, of course, is what you want when undertaking empirical research. If you think about it within a more domestic context, just imagine if you found that your bathroom scale gave you a different number every time you hopped on and off of it – you wouldn’t feel too confident in its ability to measure the variable that is your body weight 🙂

It’s worth mentioning that reliability also extends to the person using the measurement instrument . For example, if two researchers use the same instrument (let’s say a measuring tape) and they get different measurements, there’s likely an issue in terms of how one (or both) of them are using the measuring tape. So, when you think about reliability, consider both the instrument and the researcher as part of the equation.

As with validity, there are various types of reliability and various tests that can be used to assess the reliability of an instrument. A popular one that you’ll likely come across for survey instruments is Cronbach’s alpha , which is a statistical measure that quantifies the degree to which items within an instrument (for example, a set of Likert scales) measure the same underlying construct . In other words, Cronbach’s alpha indicates how closely related the items are and whether they consistently capture the same concept . 
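As an illustration of the underlying idea (a sketch of the standard formula, not of any particular package's implementation), here is how Cronbach's alpha can be computed from hypothetical Likert-scale data in Python:

```python
# Minimal Cronbach's alpha sketch. Rows are respondents, columns are
# Likert items; all numbers are hypothetical illustration data.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of summed scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

scores = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [1, 2, 1, 1],
    [3, 3, 3, 4],
])
print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")
```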


Recap: Key Takeaways

Alright, let’s quickly recap to cement your understanding of validity and reliability:

  • Validity is concerned with whether an instrument (e.g., a set of Likert scales) is measuring what it’s supposed to measure
  • Reliability is concerned with whether that measurement is consistent and stable when measuring the same phenomenon under the same conditions.

In short, validity and reliability are both essential to ensuring that your data collection efforts deliver high-quality, accurate data that help you answer your research questions . So, be sure to always pay careful attention to the validity and reliability of your measurement instruments when collecting and analysing data. As the adage goes, “rubbish in, rubbish out” – make sure that your data inputs are rock-solid.



Chapter 5: Psychological Measurement

Reliability and Validity of Measurement

Learning Objectives

  • Define reliability, including the different types and how they are assessed.
  • Define validity, including the different types and how they are assessed.
  • Describe the kinds of evidence that would be relevant to assessing the reliability and validity of a particular measure.

Again, measurement involves assigning scores to individuals so that they represent some characteristic of the individuals. But how do researchers know that the scores actually represent the characteristic, especially when it is a construct like intelligence, self-esteem, depression, or working memory capacity? The answer is that they conduct research using the measure to confirm that the scores make sense based on their understanding of the construct being measured. This is an extremely important point. Psychologists do not simply  assume  that their measures work. Instead, they collect data to demonstrate  that they work. If their research does not demonstrate that a measure works, they stop using it.

As an informal example, imagine that you have been dieting for a month. Your clothes seem to be fitting more loosely, and several friends have asked if you have lost weight. If at this point your bathroom scale indicated that you had lost 10 pounds, this would make sense and you would continue to use the scale. But if it indicated that you had gained 10 pounds, you would rightly conclude that it was broken and either fix it or get rid of it. In evaluating a measurement method, psychologists consider two general dimensions: reliability and validity.

Reliability

Reliability  refers to the consistency of a measure. Psychologists consider three types of consistency: over time (test-retest reliability), across items (internal consistency), and across different researchers (inter-rater reliability).

Test-Retest Reliability

When researchers measure a construct that they assume to be consistent across time, then the scores they obtain should also be consistent across time.  Test-retest reliability  is the extent to which this is actually the case. For example, intelligence is generally thought to be consistent across time. A person who is highly intelligent today will be highly intelligent next week. This means that any good measure of intelligence should produce roughly the same scores for this individual next week as it does today. Clearly, a measure that produces highly inconsistent scores over time cannot be a very good measure of a construct that is supposed to be consistent.

Assessing test-retest reliability requires using the measure on a group of people at one time, using it again on the  same  group of people at a later time, and then looking at  test-retest correlation  between the two sets of scores. This is typically done by graphing the data in a scatterplot and computing Pearson’s  r . Figure 5.2 shows the correlation between two sets of scores of several university students on the Rosenberg Self-Esteem Scale, administered two times, a week apart. Pearson’s r for these data is +.95. In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.

[Figure 5.2: Scatterplot of scores at time 1 (x-axis) against scores at time 2 (y-axis), showing fairly consistent scores across administrations.]

Again, high test-retest correlations make sense when the construct being measured is assumed to be consistent over time, which is the case for intelligence, self-esteem, and the Big Five personality dimensions. But other constructs are not assumed to be stable over time. The very nature of mood, for example, is that it changes. So a measure of mood that produced a low test-retest correlation over a period of a month would not be a cause for concern.

Internal Consistency

A second kind of reliability is internal consistency, which is the consistency of people's responses across the items on a multiple-item measure. In general, all the items on such measures are supposed to reflect the same underlying construct, so people's scores on those items should be correlated with each other. On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people's responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct. This is as true for behavioural and physiological measures as for self-report measures. For example, people might make a series of bets in a simulated game of roulette as a measure of their level of risk seeking. This measure would be internally consistent to the extent that individual participants' bets were consistently high or low across trials.

Like test-retest reliability, internal consistency can only be assessed by collecting and analyzing data. One approach is to look at a  split-half correlation . This involves splitting the items into two sets, such as the first and second halves of the items or the even- and odd-numbered items. Then a score is computed for each set of items, and the relationship between the two sets of scores is examined. For example, Figure 5.3 shows the split-half correlation between several university students’ scores on the even-numbered items and their scores on the odd-numbered items of the Rosenberg Self-Esteem Scale. Pearson’s  r  for these data is +.88. A split-half correlation of +.80 or greater is generally considered good internal consistency.

[Figure 5.3: Scatterplot of scores on even-numbered items (x-axis) against scores on odd-numbered items (y-axis), showing fairly consistent scores across the two halves.]

Perhaps the most common measure of internal consistency used by researchers in psychology is a statistic called  Cronbach’s α  (the Greek letter alpha). Conceptually, α is the mean of all possible split-half correlations for a set of items. For example, there are 252 ways to split a set of 10 items into two sets of five. Cronbach’s α would be the mean of the 252 split-half correlations. Note that this is not how α is actually computed, but it is a correct way of interpreting the meaning of this statistic. Again, a value of +.80 or greater is generally taken to indicate good internal consistency.

Inter-rater Reliability

Many behavioural measures involve significant judgment on the part of an observer or a rater.  Inter-rater reliability  is the extent to which different observers are consistent in their judgments. For example, if you were interested in measuring university students’ social skills, you could make video recordings of them as they interacted with another student whom they are meeting for the first time. Then you could have two or more observers watch the videos and rate each student’s level of social skills. To the extent that each participant does in fact have some level of social skills that can be detected by an attentive observer, different observers’ ratings should be highly correlated with each other. Inter-rater reliability would also have been measured in Bandura’s Bobo doll study. In this case, the observers’ ratings of how many acts of aggression a particular child committed while playing with the Bobo doll should have been highly positively correlated. Interrater reliability is often assessed using Cronbach’s α when the judgments are quantitative or an analogous statistic called Cohen’s κ (the Greek letter kappa) when they are categorical.
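As a rough illustration, here is a minimal sketch of Cohen's κ for two raters making categorical judgments; the ratings are hypothetical stand-ins for something like coding each act as aggressive or not. (For real analyses, an existing implementation such as scikit-learn's cohen_kappa_score computes the same statistic.)

```python
# Minimal Cohen's kappa sketch for two raters and categorical codes.
# The ratings are hypothetical illustration data.
from collections import Counter

rater_a = ["agg", "agg", "not", "agg", "not", "not", "agg", "not"]
rater_b = ["agg", "agg", "not", "not", "not", "not", "agg", "not"]

n = len(rater_a)
p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement: probability that both raters independently pick
# the same category, summed over categories.
count_a, count_b = Counter(rater_a), Counter(rater_b)
p_expected = sum(
    (count_a[c] / n) * (count_b[c] / n) for c in set(rater_a) | set(rater_b)
)

kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"observed agreement = {p_observed:.2f}, kappa = {kappa:.2f}")
```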

Validity

Validity is the extent to which the scores from a measure represent the variable they are intended to. But how do researchers make this judgment? We have already considered one factor that they take into account—reliability. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. As an absurd example, imagine someone who believes that people's index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people's index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. The fact that one person's index finger is a centimetre longer than another's would indicate nothing about which one had higher self-esteem.

Discussions of validity usually divide it into several distinct “types.” But a good way to interpret these types is that they are other kinds of evidence—in addition to reliability—that should be taken into account when judging the validity of a measure. Here we consider three basic kinds: face validity, content validity, and criterion validity.

Face Validity

Face validity  is the extent to which a measurement method appears “on its face” to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity. Although face validity can be assessed quantitatively—for example, by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to—it is usually assessed informally.

Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to. One reason is that it is based on people's intuitions about human behaviour, which are frequently wrong. It is also the case that many established measures in psychology work quite well despite lacking face validity. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) measures many personality characteristics and disorders by having people decide whether each of 567 different statements applies to them—where many of the statements do not have any obvious relationship to the construct that they measure. For example, the items "I enjoy detective or mystery stories" and "The sight of blood doesn't frighten me or make me sick" both measure the suppression of aggression. In this case, it is not the participants' literal answers to these questions that are of interest, but rather whether the pattern of the participants' responses to a series of questions matches those of individuals who tend to suppress their aggression.

Content Validity

Content validity  is the extent to which a measure “covers” the construct of interest. For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then his measure of test anxiety should include items about both nervous feelings and negative thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that he or she thinks positive thoughts about exercising, feels good about exercising, and actually exercises. So to have good content validity, a measure of people’s attitudes toward exercise would have to reflect all three of these aspects. Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct.

Criterion Validity

Criterion validity  is the extent to which people’s scores on a measure are correlated with other variables (known as  criteria ) that one would expect them to be correlated with. For example, people’s scores on a new measure of test anxiety should be negatively correlated with their performance on an important school exam. If it were found that people’s scores were in fact negatively correlated with their exam performance, then this would be a piece of evidence that these scores really represent people’s test anxiety. But if it were found that people scored equally well on the exam regardless of their test anxiety scores, then this would cast doubt on the validity of the measure.

A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them. For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking. People’s scores on this measure should be correlated with their participation in “extreme” activities such as snowboarding and rock climbing, the number of speeding tickets they have received, and even the number of broken bones they have had over the years. When the criterion is measured at the same time as the construct, criterion validity is referred to as concurrent validity; however, when the criterion is measured at some point in the future (after the construct has been measured), it is referred to as predictive validity (because scores on the measure have “predicted” a future outcome).

Criteria can also include other measures of the same construct. For example, one would expect new measures of test anxiety or physical risk taking to be positively correlated with existing measures of the same constructs. This is known as convergent validity.

Assessing convergent validity requires collecting data using the measure. Researchers John Cacioppo and Richard Petty did this when they created their self-report Need for Cognition Scale to measure how much people value and engage in thinking (Cacioppo & Petty, 1982)[1]. In a series of studies, they showed that people’s scores were positively correlated with their scores on a standardized academic achievement test, and that their scores were negatively correlated with their scores on a measure of dogmatism (a tendency toward rigid, closed-minded thinking). In the years since it was created, the Need for Cognition Scale has been used in literally hundreds of studies and has been shown to be correlated with a wide variety of other variables, including the effectiveness of an advertisement, interest in politics, and juror decisions (Petty, Briñol, Loersch, & McCaslin, 2009)[2].

Discriminant Validity

Discriminant validity, on the other hand, is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct. For example, self-esteem is a general attitude toward the self that is fairly stable over time. It is not the same as mood, which is how good or bad one happens to be feeling right now. So people’s scores on a new measure of self-esteem should not be very highly correlated with their moods. If the new measure of self-esteem were highly correlated with a measure of mood, it could be argued that the new measure is not really measuring self-esteem; it is measuring mood instead.

When they created the Need for Cognition Scale, Cacioppo and Petty also provided evidence of discriminant validity by showing that people’s scores were not correlated with certain other variables. For example, they found only a weak correlation between people’s need for cognition and a measure of their cognitive style—the extent to which they tend to think analytically by breaking ideas into smaller parts or holistically in terms of “the big picture.” They also found no correlation between people’s need for cognition and measures of their test anxiety and their tendency to respond in socially desirable ways. All these low correlations provide evidence that the measure is reflecting a conceptually distinct construct.

Key Takeaways

  • Psychological researchers do not simply assume that their measures work. Instead, they conduct research to show that they work. If they cannot show that they work, they stop using them.
  • There are two distinct criteria by which researchers evaluate their measures: reliability and validity. Reliability is consistency across time (test-retest reliability), across items (internal consistency), and across researchers (interrater reliability). Validity is the extent to which the scores actually represent the variable they are intended to.
  • Validity is a judgment based on various types of evidence. The relevant evidence includes the measure’s reliability, whether it covers the construct of interest, and whether the scores it produces are correlated with other variables they are expected to be correlated with and not correlated with variables that are conceptually distinct.
  • The reliability and validity of a measure is not established by any single study but by the pattern of results across multiple studies. The assessment of reliability and validity is an ongoing process.
  • Practice: Ask several friends to complete the Rosenberg Self-Esteem Scale. Then assess its internal consistency by making a scatterplot to show the split-half correlation (even- vs. odd-numbered items). Compute Pearson’s r too if you know how (a worked sketch follows this list).
  • Discussion: Think back to the last college exam you took and think of the exam as a psychological measure. What construct do you think it was intended to measure? Comment on its face and content validity. What data could you collect to assess its reliability and criterion validity?
  • Cacioppo, J. T., & Petty, R. E. (1982). The need for cognition. Journal of Personality and Social Psychology, 42, 116–131.
  • Petty, R. E., Briñol, P., Loersch, C., & McCaslin, M. J. (2009). The need for cognition. In M. R. Leary & R. H. Hoyle (Eds.), Handbook of individual differences in social behaviour (pp. 318–329). New York, NY: Guilford Press.
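
For those who want to carry out the practice exercise computationally, here is a minimal Python sketch of the split-half calculation. The response matrix is invented for illustration (responses coded 1–4, with reverse-scored items assumed to be already recoded); a real exercise would use your friends’ actual answers to the 10 items.

```python
# Split-half internal consistency: correlate even- vs odd-numbered item totals.
import numpy as np

# Hypothetical responses: rows = respondents, columns = the 10 scale items.
responses = np.array([
    [3, 4, 3, 3, 4, 3, 4, 3, 3, 4],
    [2, 2, 3, 2, 2, 3, 2, 2, 3, 2],
    [4, 4, 4, 3, 4, 4, 3, 4, 4, 4],
    [3, 2, 2, 3, 2, 2, 3, 3, 2, 2],
    [4, 3, 4, 4, 3, 4, 4, 3, 4, 3],
])

odd_total = responses[:, 0::2].sum(axis=1)    # items 1, 3, 5, 7, 9
even_total = responses[:, 1::2].sum(axis=1)   # items 2, 4, 6, 8, 10

# Pearson's r between the two half-scores is the split-half correlation.
r = np.corrcoef(odd_total, even_total)[0, 1]
print(f"split-half r = {r:.2f}")
```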

Glossary

Reliability: The consistency of a measure.

Test-retest reliability: The consistency of a measure over time, assessed by using the measure on the same group of people at different times.

Internal consistency: Consistency of people’s responses across the items on a multiple-item measure.

Split-half correlation: Method of assessing internal consistency by splitting the items into two sets and examining the relationship between them.

Cronbach’s α: A statistic in which α is the mean of all possible split-half correlations for a set of items.

Interrater reliability: The extent to which different observers are consistent in their judgments.

Validity: The extent to which the scores from a measure represent the variable they are intended to.

Face validity: The extent to which a measurement method appears to measure the construct of interest.

Content validity: The extent to which a measure “covers” the construct of interest.

Criterion validity: The extent to which people’s scores on a measure are correlated with other variables that one would expect them to be correlated with.

Criteria: In reference to criterion validity, variables that one would expect to be correlated with the measure.

Concurrent validity: When the criterion is measured at the same time as the construct.

Predictive validity: When the criterion is measured at some point in the future (after the construct has been measured).

Convergent validity: When new measures positively correlate with existing measures of the same constructs.

Discriminant validity: The extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct.

Research Methods in Psychology - 2nd Canadian Edition Copyright © 2015 by Paul C. Price, Rajiv Jhangiani, & I-Chant A. Chiang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.


Reliability and Validity


Interrater agreement is usually used as an aid in determining the reliability of measurements based on human coding (from text or from observations). Reliability is usually seen as a requirement for validity: do the data represent what they are supposed to represent? This chapter contains background information on what is meant by the two terms reliability and validity.

Popping, R. (2019). Reliability and Validity. In: Introduction to Interrater Agreement for Nominal Data. Springer, Cham. https://doi.org/10.1007/978-3-030-11671-2_2


5.13: The Reliability and Validity of Research


Learning Objectives

  • Define reliability and validity

Interpreting Experimental Findings

Once data is collected from both the experimental and the control groups, a statistical analysis is conducted to find out if there are meaningful differences between the two groups. A statistical analysis determines how likely any difference found is due to chance (and thus not meaningful). In psychology, group differences are considered meaningful, or significant, if the odds that these differences occurred by chance alone are 5 percent or less. Stated another way, if there were truly no difference between the groups, we would expect a difference this large to turn up by chance fewer than 5 times in 100 repetitions of the experiment.
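
As a rough illustration of the statistical analysis described above, the following Python sketch compares two hypothetical groups with an independent-samples t test (assuming SciPy is available). The group scores are invented, and the .05 threshold is the convention mentioned in the text.

```python
import numpy as np
from scipy import stats

# Hypothetical aggression scores for an experimental and a control group.
experimental = np.array([7, 9, 8, 10, 9, 11, 8, 9])
control = np.array([6, 7, 5, 8, 6, 7, 7, 6])

t, p = stats.ttest_ind(experimental, control)
# The difference is conventionally called significant when p < .05.
print(f"t = {t:.2f}, p = {p:.4f}, significant: {p < 0.05}")
```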

The greatest strength of experiments is the ability to assert that any significant differences in the findings are caused by the independent variable. This occurs because random selection, random assignment, and a design that limits the effects of both experimenter bias and participant expectancy should create groups that are similar in composition and treatment. Therefore, any difference between the groups is attributable to the independent variable, and now we can finally make a causal statement. If we find that watching a violent television program results in more violent behavior than watching a nonviolent program, we can safely say that watching violent television programs causes an increase in the display of violent behavior.

Reporting Research

When psychologists complete a research project, they generally want to share their findings with other scientists. The American Psychological Association (APA) publishes a manual detailing how to write a paper for submission to scientific journals. Unlike an article that might be published in a magazine like Psychology Today, which targets a general audience with an interest in psychology, scientific journals generally publish peer-reviewed journal articles aimed at an audience of professionals and scholars who are actively involved in research themselves.

Link to Learning

The Online Writing Lab (OWL) at Purdue University can walk you through the APA writing guidelines.

A peer-reviewed journal article is read by several other scientists (generally anonymously) with expertise in the subject matter. These peer reviewers provide feedback—to both the author and the journal editor—regarding the quality of the draft. Peer reviewers look for a strong rationale for the research being described, a clear description of how the research was conducted, and evidence that the research was conducted in an ethical manner. They also look for flaws in the study’s design, methods, and statistical analyses. They check that the conclusions drawn by the authors seem reasonable given the observations made during the research. Peer reviewers also comment on how valuable the research is in advancing the discipline’s knowledge. This helps prevent unnecessary duplication of research findings in the scientific literature and, to some extent, ensures that each research article provides new information. Ultimately, the journal editor will compile all of the peer reviewer feedback and determine whether the article will be published in its current state (a rare occurrence), published with revisions, or not accepted for publication.

Peer review provides some degree of quality control for psychological research. Poorly conceived or executed studies can be weeded out, and even well-designed research can be improved by the revisions suggested. Peer review also ensures that the research is described clearly enough to allow other scientists to replicate it, meaning they can repeat the experiment using different samples to determine reliability. Sometimes replications involve additional measures that expand on the original finding. In any case, each replication serves to provide more evidence to support the original research findings. Successful replications of published research make scientists more apt to adopt those findings, while repeated failures tend to cast doubt on the legitimacy of the original article and lead scientists to look elsewhere. For example, it would be a major advancement in the medical field if a published study indicated that taking a new drug helped individuals achieve a healthy weight without changing their diet. But if other scientists could not replicate the results, the original study’s claims would be questioned.

Dig Deeper: The Vaccine-Autism Myth and the Retraction of Published Studies

Some scientists have claimed that routine childhood vaccines cause some children to develop autism, and, in fact, several peer-reviewed publications published research making these claims. Since the initial reports, large-scale epidemiological research has suggested that vaccinations are not responsible for causing autism and that it is much safer to have your child vaccinated than not. Furthermore, several of the original studies making this claim have since been retracted.

A published piece of work can be rescinded when data is called into question because of falsification, fabrication, or serious research design problems. Once rescinded, the scientific community is informed that there are serious problems with the original publication. Retractions can be initiated by the researcher who led the study, by research collaborators, by the institution that employed the researcher, or by the editorial board of the journal in which the article was originally published. In the vaccine-autism case, the retraction was made because of a significant conflict of interest in which the leading researcher had a financial interest in establishing a link between childhood vaccines and autism (Offit, 2008). Unfortunately, the initial studies received so much media attention that many parents around the world became hesitant to have their children vaccinated (Figure 1). For more information about how the vaccine/autism story unfolded, as well as the repercussions of this story, take a look at Paul Offit’s book, Autism’s False Prophets: Bad Science, Risky Medicine, and the Search for a Cure.

Figure 1. A child being given an oral vaccine.

Reliability and Validity

Reliability and validity are two important considerations that must be made with any type of data collection. Reliability refers to the ability to consistently produce a given result. In the context of psychological research, this would mean that any instruments or tools used to collect data do so in consistent, reproducible ways. Unfortunately, being consistent in measurement does not necessarily mean that you have measured something correctly. This is where validity comes into play. Validity refers to the extent to which a given instrument or tool accurately measures what it’s supposed to measure. While any valid measure is by necessity reliable, the reverse is not necessarily true. Researchers strive to use instruments that are both highly reliable and valid.


Everyday Connection: How Valid Is the SAT?

Standardized tests like the SAT are supposed to measure an individual’s aptitude for a college education, but how reliable and valid are such tests? Research conducted by the College Board suggests that scores on the SAT have high predictive validity for first-year college students’ GPA (Kobrin, Patterson, Shaw, Mattern, & Barbuti, 2008). In this context, predictive validity refers to the test’s ability to effectively predict the GPA of college freshmen. Given that many institutions of higher education require the SAT for admission, this high degree of predictive validity might be comforting.

However, the emphasis placed on SAT scores in college admissions has generated some controversy on a number of fronts. For one, some researchers assert that the SAT is a biased test that places minority students at a disadvantage and unfairly reduces the likelihood of being admitted into a college (Santelices & Wilson, 2010). Additionally, some research has suggested that the predictive validity of the SAT is grossly exaggerated in how well it is able to predict the GPA of first-year college students. In fact, it has been suggested that the SAT’s predictive validity may be overestimated by as much as 150% (Rothstein, 2004). Many institutions of higher education are beginning to consider de-emphasizing the significance of SAT scores in making admission decisions (Rimer, 2008).

Recent examples of high profile cheating scandals both domestically and abroad have only increased the scrutiny being placed on these types of tests, and as of March 2019, more than 1000 institutions of higher education have either relaxed or eliminated the requirements for SAT or ACT testing for admissions (Strauss, 2019, March 19).


reliability:  consistency and reproducibility of a given result

Licenses and Attributions

CC licensed content, Shared previously

  • Analyzing Findings. Authored by: OpenStax College. Located at: http://cnx.org/contents/[email protected]:mfArybye@7/Analyzing-Findings. License: CC BY: Attribution. License Terms: Download for free at http://cnx.org/contents/[email protected]


Research Reliability

Reliability refers to whether you get the same answer when you use an instrument to measure something more than once. In simple terms, research reliability is the degree to which a research method produces stable and consistent results.

A specific measure is considered reliable if applying it to the same object of measurement a number of times produces the same results.

Research reliability can be divided into three categories:

1. Test-retest reliability is a measure of reliability obtained by administering the same test to the same sample group more than once over a period of time.

Example: Employees of ABC Company may be asked to complete the same questionnaire about employee job satisfaction two times with an interval of one week, so that test results can be compared to assess the stability of scores.
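
A minimal sketch of how such test-retest scores might be compared, assuming satisfaction is scored 1–5 and the two administrations are paired by employee (the data are invented):

```python
import numpy as np

# Satisfaction scores for the same six employees, one week apart.
time1 = np.array([4.2, 3.8, 4.5, 2.9, 3.6, 4.0])
time2 = np.array([4.0, 3.9, 4.4, 3.1, 3.5, 4.1])

# The correlation between the two administrations is the test-retest coefficient.
r = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest r = {r:.2f}")
```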


2. Parallel forms reliability is a measure obtained by assessing the same phenomenon in the same sample group via more than one assessment method.

Example: The levels of employee satisfaction of ABC Company may be assessed with questionnaires, in-depth interviews and focus groups and results can be compared.


3. Inter-rater reliability, as the name indicates, compares sets of results obtained by different assessors using the same methods. Assessing inter-rater reliability is important because assessments involve an element of subjectivity.

Example: Levels of employee motivation at ABC Company can be assessed using the observation method by two different assessors, and inter-rater reliability relates to the extent of difference between the two assessments.


4. Internal consistency reliability assesses the extent to which test items that explore the same construct produce similar results. It can be represented in two main formats (see the sketch below):

a) average inter-item correlation, a specific form of internal consistency obtained by correlating every pair of items that measure the same construct and averaging the correlations;

b) split-half reliability, another type of internal consistency reliability, in which all items of a test are split in half and the two halves are compared.
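
A minimal Python sketch of both formats, using an invented respondents-by-items score matrix:

```python
import numpy as np

# Hypothetical responses: rows = respondents, columns = 4 items on one construct.
items = np.array([
    [5, 4, 5, 4],
    [2, 3, 2, 3],
    [4, 4, 5, 5],
    [3, 2, 3, 3],
    [5, 5, 4, 5],
])

# (a) Average inter-item correlation: mean of the unique pairwise correlations.
corr = np.corrcoef(items, rowvar=False)
avg_r = corr[np.triu_indices_from(corr, k=1)].mean()

# (b) Split-half reliability: correlate the totals of the two halves.
r_half = np.corrcoef(items[:, :2].sum(axis=1), items[:, 2:].sum(axis=1))[0, 1]

print(f"average inter-item r = {avg_r:.2f}, split-half r = {r_half:.2f}")
```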



The Meaning of Reliability in Sociology

Four Procedures for Assessing Reliability


Reliability is the degree to which a measurement instrument gives the same results each time that it is used, assuming that the underlying thing being measured does not change.

Key Takeaways: Reliability

  • If a measurement instrument provides similar results each time it is used (assuming that whatever is being measured stays the same over time), it is said to have high reliability.
  • Good measurement instruments should have both high reliability and high accuracy.
  • Four methods sociologists can use to assess reliability are the test-retest procedure, the alternate forms procedure, the split-halves procedure, and the internal consistency procedure.

Imagine that you’re trying to assess the reliability of a thermometer in your home. If the temperature in a room stays the same, a reliable thermometer will always give the same reading. A thermometer that lacks reliability would change even when the temperature does not. Note, however, that the thermometer does not have to be accurate in order to be reliable. It might always register three degrees too high, for example. Its degree of reliability has to do instead with the predictability of its relationship with whatever is being tested.

Methods to Assess Reliability

In order to assess reliability, the thing being measured must be measured more than once. For example, if you wanted to measure the length of a sofa to make sure it would fit through a door, you might measure it twice. If you get an identical measurement twice, you can be confident you measured reliably.

There are four procedures for assessing the reliability of a test. (Here, the term "test" refers to a group of statements on a questionnaire, an observer's quantitative or qualitative  evaluation, or a combination of the two.)

The Test-Retest Procedure

Here, the same test is given two or more times. For example, you might create a questionnaire with a set of ten statements to assess confidence. These ten statements are then given to a subject twice at two different times. If the respondent gives similar answers both times, you can assume the test measured the subject's confidence reliably.

One advantage of this method is that only one test needs to be developed for this procedure. However, there are a few downsides of the test-retest procedure. Events might occur between testing times that affect the respondents' answers; answers might change over time simply because people change and grow over time; and the subject might adjust to the test the second time around, think more deeply about the questions, and reevaluate their answers. For instance, in the example above, some respondents might have become more confident between the first and second testing session, which would make it more difficult to interpret the results of the test-retest procedure.

The Alternate Forms Procedure

In the alternate forms procedure (also called parallel forms reliability ), two tests are given. For example, you might create two sets of five statements measuring confidence. Subjects would be asked to take each of the five-statement questionnaires. If the person gives similar answers for both tests, you can assume you measured the concept reliably. One advantage is that cueing will be less of a factor because the two tests are different. However, it's important to ensure that both alternate versions of the test are indeed measuring the same thing.

The Split-Halves Procedure

In this procedure, a single test is given once. A grade is assigned to each half separately and grades are compared from each half. For example, you might have one set of ten statements on a questionnaire to assess confidence. Respondents take the test and the questions are then split into two sub-tests of five items each. If the score on the first half mirrors the score on the second half, you can presume that the test measured the concept reliably. On the plus side, history, maturation, and cueing aren't at play. However, scores can vary greatly depending on the way in which the test is divided into halves.

The Internal Consistency Procedure

Here, the same test is administered once, and the score is based upon average similarity of responses. For example, in a ten-statement questionnaire to measure confidence, each response can be seen as a one-statement sub-test. The similarity in responses to each of the ten statements is used to assess reliability. If the respondent doesn't answer all ten statements in a similar way, then one can assume that the test is not reliable. One way that researchers can assess internal consistency is by using statistical software to calculate Cronbach’s alpha .

With the internal consistency procedure, history, maturation, and cueing aren't a consideration. However, the number of statements in the test can affect the assessment of reliability when assessing it internally.
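
As a sketch of what such statistical software computes, Cronbach’s alpha can be calculated directly from a respondents-by-statements score matrix. The formula below is the standard one; the data are simulated purely for illustration.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated responses to a ten-statement confidence questionnaire:
# a shared "confidence" factor plus item-level noise.
rng = np.random.default_rng(0)
confidence = rng.normal(size=(30, 1))
scores = confidence + 0.5 * rng.normal(size=(30, 10))

print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")
```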



Essentials of Statistical Methods for Assessing Reliability and Agreement in Quantitative Imaging

Quantitative imaging is increasing in almost all fields of radiological science. Modern quantitative imaging biomarkers measure complex parameters including metabolism, tissue microenvironment, tissue chemical properties, or physical properties. In this paper, we focus on measurement reliability assessment in quantitative imaging. We review essential concepts related to measurement, such as measurement variability and measurement error. We also discuss reliability study methods for intraobserver and interobserver variability, and the applicable statistical tests, including the intraclass correlation coefficient, the Pearson correlation coefficient, Bland-Altman graphs and limits of agreement, the standard error of measurement, and the coefficient of variation.

INTRODUCTION

Quantitative imaging technologies are increasingly used for the measurement of normal biological processes, pathologic processes, patient risk stratification, treatment response measurement in clinical care, and drug development ( 1 – 3 ). The goal of quantitative imaging is objective, accurate, and precise measurement of quantifiable features obtained from in vivo imaging studies, termed quantitative imaging biomarkers (QIBs). The simplest QIBs comprise measurement of the size of organs, vessels, or lesions. More complex QIBs measure parameters including metabolism, for example, the standardized uptake value in positron emission tomography imaging; tissue microenvironment, for example, diffusion or perfusion; tissue chemical properties, for example, spectroscopy; or physical properties, for example, tissue stiffness ( 4 ). QIBs are continuous variables, of which there are two subtypes: (1) ratio variables, such as shear wave velocity measured in meters per second (m/s) by shear wave sonoelastography methods for liver fibrosis assessment, or (2) interval variables, such as computed tomography (CT) densitometry measured in Hounsfield units for estimating emphysema severity. Ordinal variables are not QIBs. For example, the widely used Prostate Imaging Reporting and Data System (PI-RADS) classification system used for prostate magnetic resonance imaging assessment has five numbered categories: PI-RADS 1 (very low probability) to PI-RADS 5 (very high probability) of prostate cancer ( 5 ). Although these categories are numbered, the numbers denote order, rather than quantity, and therefore PI-RADS is not a QIB.

To be clinically useful, QIBs must be reliably comparable to one another and to known reference measurements ( 6 ). The goal of this paper is to facilitate better understanding of QIB reliability measurement by imaging researchers new to the field, and to assist researchers to incorporate reliability study design principles into their own quantitative imaging studies.

In this review, we define relevant metrologic terminology and concepts including measurement, reliability, reproducibility, and agreement. We discuss common reliability studies, including intraobserver, interobserver, and method comparison studies. We introduce guidelines for reporting ( 7 ), reviewing ( 8 ), and critical appraisal ( 9 ) of reliability studies, and we review statistical measures of reliability for continuous variables, including intraclass correlation coefficient (ICC), Pearson correlation coefficient, and measures of agreement including Bland-Altman graph and limits of agreement, standard error of measurement (SEM), and coefficient of variation (CV).

DEFINITIONS AND STATISTICAL CONCEPTS

Measurements are central to biomedical research and clinical practice and are used to evaluate current disease status and change over time. In population studies, measurements permit useful comparison of health outcomes within or between patients. The Quantitative Imaging Biomarkers Alliance (QIBA) is an initiative by the Radiological Society of North America to promote the use of QIBs in clinical research and practice. QIBA working groups defined QIB concepts based on the Joint Committee for Guides in Metrology in 2012–2013 ( 3 ). For QIBs, measurement is the process of experimentally obtaining one or more quantity values that can reasonably be attributed to a quantity. Quantity is a property of a variable, where the property has a magnitude and can be expressed as a number, which is called quantity value. The measurand is the quantity intended to be measured ( 4 ).

Mu (μ), or mean, is a measure of central tendency of a data set, and is computed as the sum of the data set values divided by the number of values. Sigma (σ), or standard deviation (SD), is a measure used to quantify the dispersion of a set of data values, and is computed as the square root of the sum of the squared differences from the mean (μ) divided by the number of values minus 1. A low SD implies that the data values tend to be close to the mean, whereas a high SD implies the data points are spread out over a wider value range.
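
A one-line check of these two definitions in Python (the measurement values are invented):

```python
import numpy as np

values = np.array([68.0, 70.5, 69.2, 71.1, 70.2])  # hypothetical measurements

mu = values.mean()       # sum of the values divided by the number of values
sd = values.std(ddof=1)  # ddof=1 divides by (n - 1), as in the definition above
print(f"mu = {mu:.2f}, SD = {sd:.2f}")
```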

MEASUREMENT UNCERTAINTIES AND MEASUREMENT VARIABILITY IN QIBS

Uncertainty is defined as the dispersion of the quantity values attributed to a measurand ( 4 ). Uncertainty in measurement (measurement error) generally has two components: (1) systematic error (bias), which describes the difference between the average of measured values and the average of true values in the same patients, and (2) random error, which is the degree of measurement variability in the same patients.

Measurement variability can arise at several levels: (1) biological variability occurs within patients based on diurnal, temporal, or situational changes, for example, postprandial state in liver stiffness, and also variation between patients ; (2) technological variability, for example, different imaging acquisition algorithms within a single imaging modality, image acquisition protocol, or use of different imaging modalities in a method comparison study, for example, ultrasound vs CT in measurement of lesion size; and (3) observer variability, for example, between radiologists with different experience levels. Measurement variability sources act together and simultaneously during the measurement process. Measurement variability is a distinct concept from variance, which is a term that describes the true variability of values within a sample.

Error mitigation should include mitigation of measurement variability and measurement bias. Strategies to minimize measurement variability include (1) restriction – avoiding a specific variation source such as fasting, or acquiring images in a specific position; (2) standardization – defining nonvarying protocols for imaging acquisition or processing; (3) observer training ; and (4) averaging of repeated measurements , which is particularly useful when the measurement has a wide range of random error. Measurement bias can be estimated only if the true value is known and is not mitigated by strategies that minimize random measurement error ( 4 , 10 ). Measurement bias mitigation requires improved measurement system calibration.

Reliability, agreement, precision, repeatability, and reproducibility, are terms often used interchangeably in general conversation, but are distinct concepts that are evaluated differently ( 4 , 11 ).

Reliability is defined as how well patients can be distinguished from each other despite observer measurement error ( 4 ).

Agreement is the degree of closeness between measurements made on the same patients by one or more observers or two methods of measurement, for example, the closeness of common bile duct diameter measurements in the same patients made by two radiologists using the same ultrasound machine.

Precision deals with variability; it is the closeness of agreement between measured quantity values obtained by replicate measurements on the same patients under specified and stable conditions, for example, common bile duct diameter in the same patients is repeatedly measured by a single radiologist using a single ultrasound machine in the same position and on a single occasion.

Repeatability refers to variability measured under a set of conditions that includes the same measurement procedure, the same observer, the same measuring system, the same operating conditions, and replicate measurements on the same patients over a short period of time ( 4 , 10 , 12 ). The coefficient of repeatability is twice the SD of the measurement differences. Typically, the smallest detectable “real” change over time should be more than the coefficient of repeatability to be considered a “real” change in a pathophysiological process ( 12 ).

Reproducibility refers to the variation in measurements made on the same patient under “real world” conditions, which are circumstances analogous to clinical practice, where a variety of external factors cannot all be tightly controlled ( 13 ).

RELIABILITY STUDY DESIGN

In imaging science, we generally study two types of reliability: (1) intraobserver reliability , and (2) interobserver reliability . Intraobserver reliability studies evaluate the same observer using the same measurement instrument from the same set of images on the same patients on different occasions. For intraobserver reliability study, recall bias related to memory of prior measurements is minimized by requiring a reasonable time interval between measurements (usually several weeks) and image set de-identification and randomization. Interobserver reliability studies assess different observers (two or more) using the same measurement instrument to measure QIB on the same image(s) from the same patients. If interobserver reliability is high, it is likely that intraobserver reliability will be at least as high, and typically higher ( 14 ).

Reliability studies can be confounded by practice effect , which is improvement of measurement skill by observers as they progress from novice to expert. This is colloquially termed the “learning curve.” The reliability of a new instrument should only be assessed when observers are no longer improving from measurement to measurement.

Several groups have developed guidelines to standardize concepts related to QIBs and to set reliability study quality criteria. Kessler et al. provided terminology consistent with QIB definitions ( 4 ), and via the Radiological Society of North America’s QIBA, Sullivan et al. introduced metrology standards applicable to QIB ( 3 ), and Raunig et al. reviewed statistical methods for technical performance assessment in detail ( 13 ). Kottner et al. developed the Guidelines for Reporting Reliability and Agreement Studies to assist investigators in designing and reporting reliability studies. These guidelines are widely used and are applicable in the radiological sciences ( 15 ). Lucas et al. developed the Quality Appraisal of Reliability Studies checklist as a methodological tool for systematic review of diagnostic reliability studies including diagnostic imaging. This checklist includes 11 items such as spectrum of observers and patients, the observers’ and patients’ sample representativeness, blinding of observers (to other observers, to their own prior measurements, and to the clinical findings), order effects of measurements, appropriateness of time interval between repeated measurements, assessing appropriateness of statistical analysis, and correctness of measurement application and interpretation ( 8 ). These papers contain additional detail that would be useful to investigators contemplating reliability study design.

STATISTICAL MEASURES OF RELIABILITY FOR CONTINUOUS VARIABLES

Statistical methods for analyzing reliability were first suggested in the late 1930s, and the ICC was introduced in the 1950s ( 16 ). The most commonly used reliability measures for continuous variables are the ICC and Pearson correlation coefficient. Reliability of a diagnostic test refers to how well patients can be distinguished from each other despite observer measurement error ( 10 ). In addition to measurement error, study sample heterogeneity affects reliability; the more heterogeneous the sample, the higher the ICC ( 12 , 17 ).

The observed variability or total variability of a continuous variable in a sample is due to true variance of the continuous variable (σt²) and variance secondary to measurement error (σe²). Mathematically, test reliability is defined as the true variance divided by the sum of the true variance and the measurement error variance:

Reliability = σt² / (σt² + σe²)

Measurement error variability can be further divided into systematic error variability (σse²) and random error variability (σre²), so that σe² = σse² + σre².

Intraclass correlation coefficient (ICC)

ICC is the best-known reliability parameter for repeated measurements of continuous variables. It can estimate intra- and interobserver measurement reliability for continuous variables. The ICC formula is defined as:

ICC = σp² / (σp² + σr² + σresidual²)

where σp² is the true QIB variance, σr² is the variance between observers, and σresidual² is the residual variance comprising the interaction between observers and patients, in addition to random error.

Acceptable ICC

ICC is reported as a number between 0 and +1 with a 95% confidence interval (CI). When used to discriminate between individuals in a group, an ICC of 1 represents perfect reliability, in which discrimination is unaffected by measurement error. An ICC of 0 represents lack of reliability, in which measurement error precludes distinction of individual patients. If measurement error is small relative to variability between patients, the ICC approaches 1. The acceptability of a given ICC is not a statistical issue but rather a clinical decision based on the consequences of test measurement error. In the literature, an ICC greater than 0.70 is often considered acceptable if a measurement instrument is used in groups of patients . However, for applications in individual patients , an ICC greater than 0.90 or 0.95 is typically required ( 15 , 18 ).

Shear wave elastography (SWE) is a modern imaging technique that estimates tissue stiffness numerically as the Young modulus. The estimated Young modulus is a quantitative imaging biomarker measured in kilopascals. Table 1 shows thyroid nodule stiffness data for two observers (observer 1 and observer 2) in 35 patients with benign or malignant lesions. The ICC for these two observers, computed with IBM SPSS for Mac (version 21.0), was 0.87 (95% CI, 0.71–0.92) and 0.86 (95% CI, 0.74–0.92), respectively, and interobserver ICC was 0.87 (95% CI, 0.75–0.93) ( 19 ) ( Fig 1 ).
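
To make the computation concrete, here is a sketch of a two-way random-effects, absolute-agreement, single-measures ICC (the Shrout–Fleiss ICC(2,1) form, one common choice; the paper does not state which ICC model SPSS used). The five-patient data matrix below is invented; the real analysis used the 35 patients of Table 1.

```python
import numpy as np

def icc_2_1(Y: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measures."""
    n, k = Y.shape
    grand = Y.mean()
    ss_rows = k * ((Y.mean(axis=1) - grand) ** 2).sum()  # between-patient SS
    ss_cols = n * ((Y.mean(axis=0) - grand) ** 2).sum()  # between-observer SS
    ss_err = ((Y - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                              # mean squares
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical stiffness values (kPa): rows = patients, columns = observers.
Y = np.array([[22.1, 23.0],
              [35.4, 33.9],
              [18.7, 19.5],
              [41.2, 43.0],
              [27.6, 26.8]])
print(f"ICC(2,1) = {icc_2_1(Y):.2f}")
```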


Figure 1. Intraclass correlation coefficients of elasticity values in 35 patients with thyroid nodules in the transverse plane for intra-rater 1 (a), intra-rater 2 (b), and inter-rater (c) (data of Table 1).

Table 1. Quantitative measurements of tissue stiffness values (kPa) in 35 patients with benign and malignant thyroid nodules.

Pearson correlation coefficient

Pearson correlation coefficient (ρ) was used historically as a reliability measure, but has limitations. It is more appropriate for inferring linear association between two independent continuous variables (eg, effect of age on height in children), and is not suited to analysis of agreement of repeated measurements, which are, by definition, not independent. Statistically significant correlation does not automatically imply good reliability, because it does not account for systematic error (bias), which can result in reliability overestimation ( 20 ). In addition, the correlation coefficient depends on the range of the true quantity values or the sample heterogeneity: if the range is wide, the correlation will be greater than if the range is narrow. Studies where investigators compare two QIBs over a broad range of values may show a high correlation, but the two QIBs can be in poor agreement ( 4 ).

STATISTICAL MEASURES OF AGREEMENT FOR CONTINUOUS VARIABLES

Agreement refers to how well observer(s) produce similar values on repeated measurements in the same subjects. This parameter, therefore, mostly estimates measurement error inherent to the measurement process. Agreement parameters are particularly applicable when estimating changes over time or between different measurement systems ( 9 , 10 , 21 ). Statistical measures of agreement include Bland-Altman graphs (which determine bias and limits of agreement), SEM, and CV.

Agreement measures typically have the same unit as the property being measured, which is advantageous in understanding their clinical meaning. By contrast, reliability parameters are dimensionless, with value between 0 and 1.

Bland-Altman Graph and Analysis

Bland and Altman introduced their graphical method comparison system in 1986. Method comparison studies assess measurement reliability between two methods (eg, ultrasound elastography vs magnetic resonance elastography) that measure the same variable (eg, liver stiffness) on the same patients. Method comparison studies are usually performed prior to adoption of a new measurement method in clinical practice to ensure the new measurement method is “substantially equivalent” to the current method ( 12 , 22 ). Bland-Altman graphs and analysis may also be used to evaluate agreement between two observers (interobserver agreement) ( 23 ). To perform Bland-Altman analysis, two conditions should be satisfied: (1) data should be continuous with the same unit of measurement used by observers or methods and (2) no more than two observers or methods are compared. For each pair of two continuous measurements (x 1 and x 2 ), a point is placed on the Bland-Altman graph, with y-axis coordinate of the point the difference between x 1 and x 2 , and x-axis coordinate the mean of x 1 and x 2 . The Bland-Altman graph provides a useful visualization of measurement spread as a function of measurement value, and also shows bias and limits of agreement ( Fig 2 ). Bias is calculated as the mean of paired measurement differences of all patients by two observers or methods and represents systematic measurement error, which is the tendency for one of the two observers or methods to underestimate or overestimate the measurement relative to the other observer. This is represented on the Bland-Altman graph as a continuous line, parallel to the x-axis; the closer it is to 0, the less bias. The limits of agreement give a range wherein 95% of the measurement differences between the two observers are captured ( Eq. 4 ). This can be interpreted as a measure of the expected random error between two observers. Limits of agreement are depicted as dotted lines on Bland-Altman graphs. Narrower width between dotted lines implies greater agreement between measurements. The acceptability of specific limits of agreement is determined by clinical goals, not statistical factors ( 24 ):

The limits of agreement are computed as bias ± 1.96 × σdiff (Eq. 4), where bias is the mean of the paired differences between observer 1 (x1) and observer 2 (x2) across the n patients, and σdiff is the SD of the differences of measurements between observers. For Example 2, using the Example 1 data set, we have drawn a Bland-Altman graph using MedCalc for Windows, version 12.4.0.0. This illustrates agreement between two observers and depicts a bias of 0.5 and limits of agreement of −9.9 to 11.0 ( 19 ) ( Fig 2 ).
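
The published figure was drawn in MedCalc; purely as an illustration, here is a minimal Python/matplotlib sketch of the same analysis with invented paired measurements (not the Table 1 data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical paired measurements (same patients, two observers).
x1 = np.array([22.1, 35.4, 18.7, 41.2, 27.6, 30.3, 25.8])
x2 = np.array([23.0, 33.9, 19.5, 43.0, 26.8, 29.1, 27.2])

diff = x1 - x2
mean = (x1 + x2) / 2
bias = diff.mean()                                       # systematic error
loa = bias + np.array([-1.96, 1.96]) * diff.std(ddof=1)  # 95% limits of agreement

plt.scatter(mean, diff)
plt.axhline(bias)                       # bias line
for limit in loa:
    plt.axhline(limit, linestyle="--")  # limits of agreement
plt.xlabel("Mean of the two measurements")
plt.ylabel("Difference between measurements")
plt.title("Bland-Altman graph")
plt.show()
print(f"bias = {bias:.2f}, limits of agreement = ({loa[0]:.2f}, {loa[1]:.2f})")
```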


Figure 2. Bland-Altman graph of stiffness values in 35 patients with thyroid nodules in the transverse plane between rater 1 and rater 2. Bias is shown as the continuous dark blue line, limits of agreement are shown as the dotted red lines, and each hollow circle represents a subject measured by the raters (data of Table 1). (Color version of figure is available online.)

Standard Error of Measurement

As there is no perfect measurement instrument, the true value of a measurement is almost always unknown. SEM is an agreement measure, which estimates how repeated measurements on the same patients with the same measurement instrument by different observers tend to be distributed around the patients’ “true” value. The relationship between SEM (a measure of agreement) and ICC (a measure of reliability) is shown by Equation 5, SEM = σ × √(1 − ICC), where σ is the SD of repeated measurements; the larger the SEM, the lower the test reliability ( 10 ).

Coefficient of Variation

CV is a measure of the agreement of a measurement method when numerous repeated measurements are performed on the same patients by multiple different observers, and is computed as CV = σ / μ (Eq. 6), where σ is the SD of repeated measurements and μ is the mean of the repeated measurements.

CV is generally used for calibration of a measurement instrument or for determining the reference range of a continuous variable measurement and expressed as a percentage (%) ( 24 ). The higher the CV, the greater the measurement dispersion. The advantage of CV is that the SD of measurements generally increases or decreases proportionally as the mean of the measurement increases or decreases, so division by the mean renders CV dimensionless ( 25 ). CV can be used with continuous variables that have a real zero and only positive values ( 4 , 24 ).
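
A short sketch applying Eq. 5 and Eq. 6; the σ, μ, and ICC values below are hypothetical:

```python
import numpy as np

sigma = 4.8  # SD of repeated measurements (hypothetical)
mu = 28.3    # mean of repeated measurements (hypothetical)
icc = 0.87   # reliability estimate (hypothetical)

sem = sigma * np.sqrt(1 - icc)  # Eq. 5: SEM = sigma * sqrt(1 - ICC)
cv = 100 * sigma / mu           # Eq. 6: CV (%) = 100 * sigma / mu
print(f"SEM = {sem:.2f}, CV = {cv:.1f}%")
```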

A measurement must be both reliable and valid to be clinically applicable. Reliability and agreement are different but related concepts analyzed by different statistical tests. The most commonly used parameter for the assessment of QIB intra- and interobserver reliability is the ICC. Agreement parameters for continuous data between two observers include limits of agreement defined through Bland-Altman analysis. In some situations, SEM and CV are also appropriate measures of agreement.

SUMMARY STATEMENTS

  • Measurement is never perfect and variability is intrinsic to all measurement processes.
  • In imaging science, measurement variability may arise from use of differing imaging modalities (eg, ultrasound vs CT), different imaging acquisition algorithms within a single modality, protocol, or equipment variation, differences in observer expertise, and biologic variation.
  • Reliability and agreement should NOT be used interchangeably; reliability refers to how well a measurement instrument can distinguish patients in a group from each other, and agreement is how well a measure produces the exact same value on repeated measurements in the same patient by one or more observers.
  • ICCs are the best-known reliability parameters for repeated measurements of continuous values.
  • Bland-Altman graph and analysis is the most common statistical test for expressing agreement for continuous values between two observers (interobserver agreement) or two measurement methods.

Acknowledgments

Funded by: NIH – National Institutes of Health (grant numbers: K23 EB020710; HHSN268201300071 C).

Validity, Accuracy and Reliability Explained with Examples

This is part of the Working Scientifically skills in the NSW HSC science curriculum.

Part 1 – Validity

Part 2 – Accuracy

Part 3 – Reliability

Science experiments are an essential part of high school education, helping students understand key concepts and develop critical thinking skills. However, the value of an experiment lies in its validity, accuracy, and reliability. Let's break down these terms and explore how they can be improved and reduced, using simple experiments as examples.

Target Analogy to Understand Accuracy and Reliability

The target analogy is a classic way to understand the concepts of accuracy and reliability in scientific measurements and experiments. 


Accuracy refers to how close a measurement is to the true or accepted value. In the analogy, it's how close the arrows come to hitting the bullseye (represents the true or accepted value).

Reliability  refers to the consistency of a set of measurements. Reliable data can be reproduced under the same conditions. In the analogy, it's represented by how tightly the arrows are grouped together, regardless of whether they hit the bullseye. Therefore, we can have scientific results that are reliable but inaccurate.

  • Validity  refers to how well an experiment investigates the aim or tests the underlying hypothesis. While validity is not represented in this target analogy, the validity of an experiment can sometimes be assessed by using the accuracy of results as a proxy: experiments that produce accurate results are likely to be valid, as invalid experiments usually do not yield accurate results.

Validity refers to how well an experiment measures what it is supposed to measure and investigates the aim.

Ask yourself the questions:

  • "Is my experimental method and design suitable?"
  • "Is my experiment testing or investigating what it's suppose to?"


For example, if you're investigating the effect of the volume of water (independent variable) on plant growth, your experiment would be valid if you measure growth factors like height or leaf size (these would be your dependent variables).

However, validity entails more than just what's being measured. When assessing validity, you should also examine how well the experimental methodology investigates the aim of the experiment.

Assessing Validity

An experiment’s procedure, the subsequent methods of analysis of the data, the data itself, and the conclusion you draw from the data all have their own associated validities. It is important to understand this division because there are different factors to consider when assessing the validity of any single one of them. The validity of an experiment as a whole depends on the individual validities of these components.

When assessing the validity of the procedure , consider the following:

  • Does the procedure control all necessary variables except for the dependent and independent variables? That is, have you isolated the effect of the independent variable on the dependent variable?
  • Does this effect you have isolated actually address the aim and/or hypothesis?
  • Does your method include enough repetitions for a reliable result? (Read more about reliability below)

When assessing the validity of the method of analysis of the data , consider the following:

  • Does the analysis extrapolate or interpolate the experimental data? Generally, interpolation is valid, but extrapolation is invalid. This is because by extrapolating, you are ‘peering out into the darkness’ – just because your data showed a certain trend over a certain range does not mean that the trend will hold beyond it.
  • Does the analysis use accepted laws and mathematical relationships? That is, do the equations used for analysis have a scientific or mathematical basis? For example, `F = ma` is an accepted law in physics, but if in the analysis you made up a relationship like `F = ma^2` that has no scientific or mathematical backing, the method of analysis is invalid.
  • Is the most appropriate method of analysis used? Consider the differences between using a table and a graph. In a graph, you can use the gradient to minimise the effects of systematic errors and can also reduce the effect of random errors. The visual nature of a graph also allows you to easily identify outliers and potentially exclude them from analysis. This is why graphical analysis is generally more valid than using values from tables.

When assessing the validity of your results , consider the following: 

  • Is your primary data (data you collected from your own experiment) BOTH accurate and reliable? If not, it is invalid.
  • Are the secondary sources you may have used BOTH reliable and accurate?

When assessing the validity of your conclusion , consider the following:

  • Does your conclusion relate directly to the aim or the hypothesis?

How to Improve Validity

Ways of improving validity will differ across experiments. You must first identify which area of the experiment’s validity is lacking (is it the procedure, analysis, results, or conclusion?). Then, you must come up with ways of overcoming that particular weakness.

Below are some examples of this.

Example – Validity in Chemistry Experiment 

Let's say we want to measure the mass of carbon dioxide in a can of soft drink.

Heating a can of soft drink

The following steps are followed:

  • Weigh an unopened can of soft drink on an electronic balance.
  • Open the can.
  • Place the can on a hot plate until it begins to boil.
  • When cool, re-weigh the can to determine the mass loss.

To ensure this experiment is valid, we must establish controlled variables:

  • type of soft drink used
  • temperature at which this experiment is conducted
  • period of time before soft drink is re-weighed

Despite these controlled variables, this experiment is invalid because it actually doesn't help us measure the mass of carbon dioxide in the soft drink. This is because by heating the soft drink until it boils, we are also losing water due to evaporation. As a result, the mass loss measured is not only due to the loss of carbon dioxide, but also water. A simple way to improve the validity of this experiment is to not heat it; by simply opening the can of soft drink, carbon dioxide in the can will escape without loss of water.

Example – Validity in Physics Experiment

Let's say we want to measure the value of gravitational acceleration `g` using a simple pendulum system, and the following equation:

$$T = 2\pi \sqrt{\frac{l}{g}}$$

  • `T` is the period of oscillation
  • `l` is the length of string attached to the mass
  • `g` is the acceleration due to gravity

Pendulum practical

  • Cut a piece of a string or dental floss so that it is 1.0 m long.
  • Attach a 500.0 g mass of high density to the end of the string.
  • Attach the other end of the string to the retort stand using a clamp.
  • Starting at an angle of less than 10º, allow the pendulum to swing and measure the pendulum’s period for 10 oscillations using a stopwatch.
  • Repeat the experiment with 1.2 m, 1.5 m and 1.8 m strings.

The controlled variables we must establish in this experiment include:

  • mass used in the pendulum
  • location at which the experiment is conducted

The validity of this experiment depends on the starting angle of oscillation. The above equation (method of analysis) is only true for small angles (`\theta < 15^{\circ}`), for which `\sin \theta \approx \theta`. We also want to make sure the pendulum system has a small enough surface area to minimise the effect of air resistance on its oscillation.


In this instance, it would be invalid to use a single pair of values (length and period) to calculate the value of gravitational acceleration. A more appropriate method of analysis would be to plot period squared against length to obtain a linear relationship, then use the gradient of the line of best fit to determine the value of `g`.
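Squaring both sides of the period equation gives `T^2 = (4\pi^2/g)\,l`, so the gradient of a `T^2` versus `l` plot equals `4\pi^2/g`. Below is a minimal Python sketch of this analysis; the lengths and periods are illustrative values, not real measurements:

```python
import numpy as np

# Illustrative data: string lengths (m) and measured periods (s)
lengths = np.array([1.0, 1.2, 1.5, 1.8])
periods = np.array([2.01, 2.20, 2.46, 2.69])

# T^2 = (4*pi^2 / g) * l, so fitting T^2 against l gives a line
# whose gradient m satisfies g = 4*pi^2 / m.
gradient, intercept = np.polyfit(lengths, periods**2, 1)

g = 4 * np.pi**2 / gradient
print(f"g ≈ {g:.2f} m/s^2")
```

Using the gradient of the whole line of best fit, rather than a single data pair, also reduces the influence of random errors on the final value of `g`.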

Accuracy refers to how close the experimental measurements are to the true value.

Accuracy depends on

  • the validity of the experiment
  • the degree of error:
  • systematic errors are those that are systemic in your experiment. That is, they affect every single one of your data points consistently, meaning that the cause of the error is always present. For example, a badly calibrated temperature gauge might report every reading 5 °C above the true value.
  • random errors are errors that occur inconsistently. For example, the temperature gauge readings might be affected by random fluctuations in room temperature. Some readings might be above the true value, some might then be below the true value.
  • sensitivity of equipment used.

Assessing Accuracy 

The effect of errors and insensitive equipment can both be captured by calculating the percentage error:

$$\text{\% error} = \frac{|\text{experimental value} - \text{true value}|}{\text{true value}} \times 100\%$$

Generally, measurements are considered accurate when the percentage error is less than 5%. You should always take the context of the experiment into account when assessing accuracy.
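As a quick sketch, the percentage error formula and the 5% rule of thumb can be expressed in a few lines of Python; the sample values are hypothetical:

```python
def percent_error(experimental: float, true_value: float) -> float:
    """Percentage error between an experimental and a true/accepted value."""
    return abs(experimental - true_value) / abs(true_value) * 100

# Example: a measured pendulum period vs the accepted value
error = percent_error(experimental=2.06, true_value=2.01)
print(f"{error:.1f}% error -> {'accurate' if error < 5 else 'inaccurate'}")
```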

While accuracy and validity have different definitions, the two are closely related. Accurate results often suggest that the underlying experiment is valid, as invalid experiments are unlikely to produce accurate results.

In a simple pendulum experiment, if your measurements of the pendulum's period are close to the calculated value, your experiment is accurate. A table showing sample experimental measurements vs accepted values from using the equation above is shown below. 

(Table: sample measured pendulum periods compared with accepted values calculated from the equation above.)

All experimental values in the table above are within 5% of the accepted (theoretical) values, so they are considered accurate.

How to Improve Accuracy

  • Remove systematic errors : for example, if the experiment’s measuring instruments are poorly calibrated, then you should correctly calibrate them before doing the experiment again.
  • Reduce the influence of random errors : this can be done by having more repetitions in the experiment and reporting the average values. If you have enough of these random errors – some above the true value and some below it – averaging them makes them cancel each other out, bringing your average value closer and closer to the true value (see the sketch after this list).
  • Use more sensitive equipment : for example, use a recording to measure time by analysing the motion of an object frame by frame, instead of using a stopwatch. The sensitivity of equipment can be measured by its limit of reading . For example, stopwatches may only measure to the nearest millisecond – that is their limit of reading – but recordings can be analysed frame by frame and, depending on the frame rate of the camera, this could mean measuring to the nearest microsecond.
  • Obtain more measurements over a wider range : in some cases, the relationship between two variables can be more accurately determined by testing over a wider range. For example, in the pendulum experiment, periods can be measured for strings of various lengths. In this instance, repeating the experiment does not relate to reliability because we have changed the value of the independent variable tested.
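The cancelling-out effect of averaging can be demonstrated with a short simulation. The sketch below is illustrative only: it assumes purely random, normally distributed errors around an arbitrary true value of 25:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
true_value = 25.0  # e.g. a true temperature in °C

# Simulate repeated readings affected only by random error
for n_trials in (3, 10, 100, 1000):
    readings = true_value + rng.normal(0, 0.5, size=n_trials)
    print(f"{n_trials:>5} trials: average = {readings.mean():.3f}")
# The average drifts closer to 25.0 as the number of trials grows,
# because random errors above and below the true value cancel out.
```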

Reliability

Reliability involves the consistency of your results over multiple trials.

Assessing Reliability

The reliability of an experiment can be broken down into the reliability of the procedure and the reliability of the final results.

The reliability of the procedure refers to how consistently the steps of your experiment produce similar results. For example, if an experiment produces the same values every time it is repeated, then it is highly reliable. This can be assessed quantitatively by looking at the spread of measurements, using statistical measures such as the greatest deviation from the mean, the standard deviation, or z-scores.
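As an illustration, these spread statistics can be computed in a few lines of Python; the trial values below are hypothetical:

```python
import numpy as np

trials = np.array([1.52, 1.49, 1.55, 1.50, 1.51])  # repeated measurements (s)

mean = trials.mean()
greatest_dev = np.max(np.abs(trials - mean))   # greatest deviation from the mean
sd = trials.std(ddof=1)                        # sample standard deviation
z_scores = (trials - mean) / sd                # z-score of each measurement

print(f"mean = {mean:.3f}, greatest deviation = {greatest_dev:.3f}, SD = {sd:.3f}")
print(f"largest |z-score| = {np.max(np.abs(z_scores)):.2f}")
# A small spread relative to the mean suggests the procedure is reliable.
```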

Ask yourself: "Is my result reproducible?"

The reliability of results cannot be assessed if only one data point or measurement is obtained in the experiment; there must be at least three. When you're repeating the experiment to assess the reliability of its results, you must follow the  same steps  and use the  same value  for the independent variable. Results obtained from methods with different steps cannot be assessed for their reliability.

Obtaining only one measurement in an experiment is not enough because it could be affected by errors and have been produced due to pure chance. Repeating the experiment and obtaining the same or similar results will increase your confidence that the results are reproducible (therefore reliable).

In the soft drink experiment, reliability can be assessed by repeating the steps at least three times:

(Table: mass loss measured in three repeated trials of the soft drink experiment.)

The mass losses measured in all three trials are fairly consistent, suggesting that the reliability of the underlying method is high.

The reliability of the final results refers to how consistently your final data points (e.g. average values of repeated trials) point towards the same trend. That is, how close are they all to the trend line? This can be assessed quantitatively using the `R^2` value. The `R^2` value ranges between 0 and 1: a value of 0 suggests there is no correlation between data points, and a value of 1 suggests a perfect correlation with no variance from the trend line.

In the pendulum experiment, we can calculate the `R^2` value (done in Excel) by using the final average period values measured for each pendulum length.

(Graph: final average period values for each pendulum length with a linear trend line; `R^2` = 0.9758.)

Here, an `R^2` value of 0.9758 suggests the four average values are fairly close to the overall linear trend line (low variance from the trend line). Thus, the results are fairly reliable.
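The same `R^2` calculation can be done outside Excel. Below is a minimal Python sketch using the standard definition `R^2 = 1 - SS_{res}/SS_{tot}`; the data values are illustrative, not the ones behind the graph above:

```python
import numpy as np

lengths = np.array([1.0, 1.2, 1.5, 1.8])            # pendulum lengths (m)
avg_period_sq = np.array([4.05, 4.81, 6.07, 7.26])  # averaged T^2 values (s^2)

# Fit the linear trend line, then compare the data points to it
slope, intercept = np.polyfit(lengths, avg_period_sq, 1)
predicted = slope * lengths + intercept

ss_res = np.sum((avg_period_sq - predicted) ** 2)            # residual variation
ss_tot = np.sum((avg_period_sq - avg_period_sq.mean()) ** 2) # total variation
r_squared = 1 - ss_res / ss_tot

print(f"R^2 = {r_squared:.4f}")  # close to 1 -> results follow the trend
```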

How to Improve Reliability

A common misconception is that increasing the number of trials increases the reliability of the procedure . This is not true. The only way to increase the reliability of the procedure is to revise it. This could mean using instruments that are less susceptible to random errors, which cause measurements to be more variable.

Increasing the number of trials actually increases the reliability of the final results . This is because having more repetitions reduces the influence of random errors and brings the average values closer to the true values. Generally, the closer experimental values are to true values, the closer they are to the true trend. That is, accurate data points are generally reliable and all point towards the same trend.

Reliable but Inaccurate / Invalid

It is important to understand that results from an experiment can be reliable (consistent), but inaccurate (deviate greatly from theoretical values) and/or invalid. In this case, your procedure  is reliable, but your final results likely are not.

Examples of Reliability

Using the soft drink example again, if the mass losses measured for three soft drinks (same brand and type of drink) are consistent, then it's reliable. 

Using the pendulum example again, if you get similar period measurements every time you repeat the experiment, it’s reliable.  

However, in both cases, if the underlying methods are invalid, the consistent results would be invalid and inaccurate (despite being reliable).

Do you have trouble understanding validity, accuracy or reliability in your science experiment or depth study?

Consider getting personalised help from our 1-on-1 mentoring program !


Reliability in Research: Definitions and Types


Reliability in research refers to the consistency of a measure. It demonstrates whether the same results would be obtained if the study were repeated. If a test or tool is reliable, it gives consistent results across different situations or over time. A study with high reliability can be trusted because its outcomes are dependable and can be reproduced. Unreliable research can lead to misleading or incorrect conclusions. That's why you should ensure that your study results can be trusted.

When you’ve collected your data and need to interpret your research results, it’s time to consider the reliability of your methods and tools. Calculation methods often produce errors, particularly when the initial assumptions are wrong. To avoid drawing wrong conclusions, it is worth investing some time into checking whether your methods are reliable. Today we’ll talk about the reliability of research approaches: what it means and how to check it properly. The main verification methods, such as split-half, inter-item and inter-rater, will be examined and explained below. Let’s go and find out how to use them with our PhD dissertation writing services !

What Is Reliability in Research: Definition

First, let’s define reliability. It is highly important to ensure your data analysis methods are reliable, meaning that they are likely to produce stable and consistent results whenever you apply them to different datasets. So, a special parameter named ‘reliability’ has been introduced to evaluate their consistency. High reliability means that the method or tool you are evaluating will repeatedly produce the same or similar results as long as the conditions remain stable. This parameter has the following key components:

  • probability
  • availability
  • dependability.

Follow our thesis writing services to find out what the main types of this parameter are and how they can be used.

Main Types of Reliability

There are four main types of reliability. Each of them shows the consistency of a different approach to data collection and analysis. These types relate to different ways of conducting research; however, all of them are equally valid quality measurements for the tools and methods they describe. We’ll examine each of these four types below, discussing their differences, purposes and areas of usage. Let’s take a closer look!

Test Retest Reliability: Definition

The first type is called ‘test-retest’ reliability. You can use it when you need to analyze methods that are applied to the same group of individuals many times. When running the same tests on the same object over and over again, it is important to know whether they produce reliable results. If the results don’t change significantly over a period of time, we can assume that this parameter shows a high consistency level and that these methods will be helpful for your research.

Test Retest Reliability: Examples

Let’s review an example of test-retest reliability which might provide more clarity about this parameter for a student preparing their own research. Suppose a group of a local mall’s consumers has been monitored by a research team for several years. The shopping habits and preferences of each person in the group were examined, particularly by conducting surveys . If their responses did not change significantly over those years, the current research approach can be considered reliable from the test-retest aspect. Otherwise, some of the methods used to collect this data need to be reviewed and updated to avoid introducing errors into the research.

Parallel Forms Reliability: Definition

Another type is parallel forms reliability. It applies when different versions of an assessment tool are used to examine the same group of respondents. If the results obtained with all these versions correlate with each other, the approach can be considered reliable. However, an analyst needs to ensure that all the versions contain the same elements before assessing their consistency. For example, if two versions examine different qualities of the target group, it wouldn’t make much sense to compare one version to another.

Parallel Forms Reliability: Examples

A parallel forms reliability example using a real-life situation would help illustrate the definition provided above. Let’s take the previous example where a focus group of consumers is examined to analyze dependencies and trends of a local mall’s goods consumption. Let’s suppose the data about their shopping preferences is obtained by conducting a survey among them, one or several times. At the next stage the same data is collected by analyzing the mall’s sales information. In both cases an assessment tool refers to the same characteristics (e.g., preferred shopping hours). If the results are correlated in both cases, it means that the approach is consistent.

Inter Rater Reliability: Definition

The next type is called inter-rater reliability. This measure does not involve different tools but requires the collective effort of several researchers, or raters, who examine the target population independently of each other. Once they are done, their assessment results need to be compared with each other. Strong correlation between all these results means that the methods used are consistent. If some of the observers don’t agree with the others, the assessment approach needs to be reviewed and most probably corrected.

Inter Rater Reliability: Examples

Let’s review an inter rater reliability example – another case to help you visualize this parameter and the ways to use it in your own research. We’ll suppose that the consumer focus group from the previous example is independently tested by three researchers who use the same set of testing types:

  • conducting surveys.
  • interviewing respondents about their preferred items (e.g. bakery or housing supplies) or preferred shopping hours.
  • analyzing sales statistics collected by the mall.

If each of these researchers obtains the same or very similar results, leading to similar conclusions, we can assume that the research approach used in this project is consistent.

What Is Internal Consistency Reliability: Definition

The final type is called internal consistency reliability. This measure can be used to evaluate the degree to which different tools or parts of a test produce similar results when probing the same area or object. The purpose is to calculate or analyze some value in several different ways. If the same results are obtained in each case, we can assume that the measurement method itself is consistent. Depending on how precise the calculations are, small deviations between these results may or may not be allowed.

Internal Consistency Reliability: Examples

To wrap up this review of reliability types, let’s check out an internal consistency reliability example. Let’s take the same situation as described in the previous examples: a focus consumer group whose shopping preferences are analyzed with the help of several different methods. To test the consistency of these methods, a researcher can randomly split the focus group in half and analyze each half independently. If done properly, random splitting should produce two subgroups with nearly identical qualities, so they can be viewed as the same construct. If the analytic measures provide strongly correlated results for both groups, the research approach is consistent.

Reliability Coefficient: What Is It

In order to evaluate how well a test measures a selected object, a special parameter named the reliability coefficient has been introduced. Its definition is fully explained by its name: it shows whether a test is repeatable, i.e., reliable. The coefficient is a number lying between 0 and 1.00, where 0 indicates no reliability and 1.00 indicates perfect reliability. The following proportion is used to calculate the coefficient R:

R = (N / (N − 1)) × ((Total Variance − Sum of Item Variances) / Total Variance),

where N is the number of items (or test parts) whose scores are combined, the item variances are the variances of each individual item, and the total variance is the variance of the summed scores. A real test rarely has perfect reliability. Typically, a coefficient of 0.8 or higher means the test can be considered reliable enough.
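As an illustration, this coefficient can be computed directly from a score table. The sketch below assumes one row per respondent and one column per item; the scores are made up for demonstration:

```python
import numpy as np

def reliability_coefficient(scores: np.ndarray) -> float:
    """R = (N/(N-1)) * (total variance - sum of item variances) / total variance.

    `scores` is a 2-D array: one row per respondent, one column per item/part.
    """
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)       # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (n_items / (n_items - 1)) * (total_variance - item_variances.sum()) / total_variance

# Illustrative data: 5 respondents answering 4 items on a 1-5 scale
scores = np.array([[4, 5, 4, 5],
                   [2, 3, 2, 2],
                   [5, 5, 4, 5],
                   [3, 3, 3, 4],
                   [1, 2, 2, 1]])
print(f"R = {reliability_coefficient(scores):.2f}")  # >= 0.8 is usually 'reliable enough'
```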

Reliability vs Quality: Are They the Same?

It is important to understand the difference between quality and reliability. These concepts are somewhat related; however, they have different practical meanings. We use quality to indicate that an object or a solution performs its proper functions well and allows its users to achieve the intended purpose. Reliability indicates how well this object or solution is able to maintain its quality level as time passes or conditions change. It can be stated that reliability is a subset of quality used to evaluate the consistency of a certain object or solution in a dynamic environment. Because of its nature, reliability is a probabilistic value. We also have a reliability vs validity blog; understanding the difference between the two is crucial for your research.

Reliability: Key Takeaways

In this article we have reviewed the concept of reliability in research. Its main types and their usage in real-life research cases have been examined. Ways of measuring this value, particularly its coefficient, have also been explained.


In case you are having troubles with using this concept in your own work or just need help with writing a high quality paper and earning a high score – feel free to check out our writing services! A team of skilled writers with rich experience in various academic areas is ready to help you upon ‘ write a paper for me ’ request.


Joe Eckel is an expert on dissertation writing. He makes sure that each student gets valuable insights on composing A-grade academic writing.


Reliability: Frequently Asked Questions

1. How do you determine reliability in research?

One can determine reliability in research using a simple correlation between two scores from the same person. It is quite easy to make a rough estimation of a reliability coefficient for these two items using the formula provided above. In order to make a more precise estimation, you’ll need to obtain more scores and use them for calculation. The more test runs you make, the more precise your coefficient is.

2. Why is reliability important in research?

Reliability refers to the consistency of results in research. This makes reliability important for nearly any kind of research: psychological, economic, industrial, social, etc. A project that may affect the lives of many people needs to be conducted carefully, and its results need to be double-checked. If the methods used are unreliable, the results may contain errors and cause negative effects.

3. What is reliability of a test?

The reliability of a test refers to the extent to which the test can be run without errors. The higher the reliability, the more usable your tests are and the lower the probability of errors in your research. Tests might be constructed incorrectly because of wrong assumptions or incorrect information received from a source. Measuring reliability helps to counter that and to find ways to improve the quality of tests.

4. How does reliability affect research?

Levels of reliability affect every project that uses complex analysis methods. It is important to know the degree to which your research method produces stable and consistent results. If the consistency is low, your work might be useless because of incorrect assumptions. If you don’t want your project to fail, you have to assess the consistency of your methods.


Neag School of Education

Educational Research Basics by Del Siegle

Instrument Reliability

Reliability (visit the concept map that shows the various types of reliability)

A test is reliable to the extent that whatever it measures, it measures it consistently. If I were to stand on a scale and the scale read 15 pounds, I might wonder. Suppose I were to step off the scale and stand on it again, and again it read 15 pounds. The scale is producing consistent results. From a research point of view, the scale seems to be reliable because whatever it is measuring, it is measuring it consistently. Whether those consistent results are valid is another question.  However, an instrument cannot be valid if it is not reliable.

There are three major categories of reliability for most instruments: test-retest, equivalent form, and internal consistency. Each measures consistency a bit differently and a given instrument need not meet the requirements of each. Test-retest measures consistency from one time to the next. Equivalent-form measures consistency between two versions of an instrument. Internal-consistency measures consistency within the instrument (consistency among the questions). A fourth category (scorer agreement)  is often used with performance and product assessments. Scorer agreement is consistency of rating a performance or product among different judges who are rating the performance or product. Generally speaking, the longer a test is, the more reliable it tends to be (up to a point). For research purposes, a minimum reliability of .70 is required for attitude instruments. Some researchers feel that it should be higher. A reliability of .70 indicates 70% consistency in the scores that are produced by the instrument. Many tests, such as achievement tests, strive for .90 or higher reliabilities.

Relationship of Test Forms and Testing Sessions Required for Reliability Procedures

Test-Retest Method (stability: measures error because of changes over time) The same instrument is given twice to the same group of people. The reliability is the correlation between the scores on the two instruments. If the results are consistent over time, the scores should be similar. The trick with test-retest reliability is determining how long to wait between the two administrations. One should wait long enough so the subjects don’t remember how they responded the first time they completed the instrument, but not so long that their knowledge of the material being measured has changed. This may be a couple of weeks to a couple of months.

If one were investigating the reliability of a test measuring mathematics skills, it would not be wise to wait two months. The subjects probably would have gained additional mathematics skills during the two months and thus would have scored differently the second time they completed the test. We would not want their knowledge to have changed between the first and second testing.
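As a minimal illustration, the test-retest correlation can be computed from two score lists for the same group; the scores below are hypothetical:

```python
import numpy as np

# Scores from the same group on two administrations of the same instrument
first = np.array([85, 72, 90, 65, 78, 88])
second = np.array([83, 75, 91, 63, 80, 86])

# Test-retest reliability is the correlation between the two score sets
r = np.corrcoef(first, second)[0, 1]
print(f"test-retest reliability r = {r:.2f}")
```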

Equivalent-Form (Parallel or Alternate-Form) Method (measures error because of differences in test forms) Two different versions of the instrument are created. We assume both measure the same thing. The same subjects complete both instruments during the same time period. The scores on the two instruments are correlated to calculate the consistency between the two forms of the instrument.

Internal-Consistency Method (measures error because of idiosyncrasies of the test items) Several internal-consistency methods exist. They have one thing in common. The subjects complete one instrument one time. For this reason, this is the easiest form of reliability to investigate. This method measures consistency within the instrument three different ways.

– Split-Half A total score for the odd-numbered questions is correlated with a total score for the even-numbered questions (although it might be the first half with the second half). This is often used with dichotomous variables that are scored 0 for incorrect and 1 for correct. The Spearman-Brown prophecy formula is applied to the correlation to determine the reliability.

– Kuder-Richardson Formula 20 (K-R 20) and  Kuder-Richardson Formula 21 (K-R 21) These are alternative formulas for calculating how consistent subject responses are among the questions on an instrument. Items on the instrument must be dichotomously scored (0 for incorrect and 1 for correct). All items are compared with each other, rather than half of the items with the other half of the items. It can be shown mathematically that the Kuder-Richardson reliability coefficient is actually the mean of all split-half coefficients (provided the Rulon formula is used) resulting from different splittings of a test. K-R 21 assumes that all of the questions are equally difficult. K-R 20 does not assume that. The formula for K-R 21 can be found on page 179.

– Cronbach’s Alpha When the items on an instrument are not scored right versus wrong, Cronbach’s alpha is often used to measure the internal consistency. This is often the case with attitude instruments that use the Likert scale. A computer program such as SPSS is often used to calculate Cronbach’s alpha. Although Cronbach’s alpha is usually used for scores which fall along a continuum, it will produce the same results as KR-20 with dichotomous data (0 or 1).

I have created an Excel spreadsheet that will calculate Spearman-Brown, KR-20, KR-21, and Cronbach’s alpha. The spreadsheet will handle data for a maximum 1000 subjects with a maximum of 100 responses for each.
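For readers who prefer code to a spreadsheet, here is a minimal Python sketch (not the spreadsheet itself) of the split-half calculation with the Spearman-Brown correction; the 0/1 item scores are made up for demonstration:

```python
import numpy as np

# Dichotomous item scores: one row per subject, one column per question
items = np.array([[1, 1, 0, 1, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [1, 1, 1, 1, 1, 1],
                  [1, 0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 0, 0]])

# Split-half: total score on odd-numbered items vs even-numbered items
odd_totals = items[:, 0::2].sum(axis=1)
even_totals = items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_totals, even_totals)[0, 1]

# The Spearman-Brown prophecy formula corrects the half-test correlation
# up to an estimate for the full-length test.
r_full = 2 * r_half / (1 + r_half)
print(f"split-half r = {r_half:.2f}, Spearman-Brown corrected = {r_full:.2f}")
```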

Scoring Agreement (measures error because of the scorer) Performance and product assessments are often based on scores by individuals who are trained to evaluate the performance or product. The consistency between ratings can be calculated in a variety of ways.

– Interrater Reliability Two judges can evaluate a group of student products and the correlation between their ratings can be calculated (r=.90 is a common cutoff).

– Percentage Agreement Two judges can evaluate a group of products and a percentage for the number of times they agree is calculated (80% is a common cutoff).
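Both scorer-agreement measures can be sketched in a few lines of Python; the judges' ratings below are hypothetical:

```python
import numpy as np

# Two judges rating the same ten products (illustrative 1-5 scores)
judge_a = np.array([4, 3, 5, 2, 4, 5, 3, 4, 2, 5])
judge_b = np.array([4, 3, 4, 2, 4, 5, 3, 5, 2, 5])

# Interrater reliability: correlation between the ratings (r = .90 cutoff)
r = np.corrcoef(judge_a, judge_b)[0, 1]

# Percentage agreement: share of exact matches (80% is a common cutoff)
agreement = (judge_a == judge_b).mean() * 100

print(f"interrater r = {r:.2f}, percentage agreement = {agreement:.0f}%")
```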

———

All scores contain error. The error is what lowers an instrument’s reliability. Obtained Score = True Score + Error Score

———-

There could be a number of reasons why the reliability estimate for a measure is low. Four common sources of inconsistencies of test scores are listed below:

Test Taker — perhaps the subject is having a bad day
Test Itself — the questions on the instrument may be unclear
Testing Conditions — there may be distractions during the testing that detract the subject
Test Scoring — scorers may be applying different standards when evaluating the subjects’ responses

Del Siegle, Ph.D. Neag School of Education – University of Connecticut [email protected] www.delsiegle.info

Created 9/24/2002 Edited 10/17/2013

Research Methodology MCQ Questions Set-3

Also you can read

  • Research Methodology MCQ Questions Set-1
  • Research Methodology MCQ Questions Set-2

1. Successful research requires

(A) Planning

(B) Guidance

(D) All of the above

2. Which of the following is the research purpose?

(A) To study a phenomenon or to achieve a new insight in to it

(B)To determine the frequency with which something occurs or with which it is associated with

(C) To test a hypothesis of a causal relationship, between variables

3. Which is the Design of sampling?

(A) Probability selection

(B) Purposive Methods

(C) Mixed Sample

4. Survey research methods come under

(A) Pre-empirical research methods

(B) Descriptive research methods

(C) Experimental research methods

5. Ethical principle is available in which report

(A) Belmont Report

(B) Finance report

(C) Research Report

(D) None of the above

6. The logic of induction is very much related with

(A) The logic of sampling

(B) The logic of controlled variable

(C) The logic of observation

7. The aims of research

(A) are descriptive in nature

(B) are founded on human values

(C) cause-effect-relatedness

8. The aims of research is/are

(A) Verification

(B) Fact finding

(C) Theoretical development

9. Objective or unbiased observation is most vital in

(A) All walks of life

(B) Performing experiments

(C) Normal behaviour

(D) Research methods

10. The reporting of Research findings should be done

(A) by the scientists themselves

(B) in a scientific and effective way

(C) through internet

(D) through scientific journals

11. Reliability of a research result implies its

(A) Verifiability

(B) Validity

(C) Uniqueness

(D) Usefulness

12. Watson and Mcgrath defined research as

(A) An intellectual exercise

(B) Using exploratory methods

(C) Using scientific methods

 (D) None of the above

13. A research is

(A) A serious and investigative study

(B) Being illuminated

(C) Based on standardized conclusions

14. A person who is repeating the same mistakes again and again without trying to rectify it, is

(A) A foolish person

 (B) An excellent researcher

(C) An excellent forgetter

(D) An insane person

15. In Hindi, the word “Anusandhan” means

(A) Praying to achieve

(B) Attaining an aim

(C) Being goal-directed

(D) Following an aim

16. The word “Research” means

(A) To know

(C) To move

(D) To innovate

17. Social research can be divided into

(A) Two categories

(B) Three categories

(C) Four categories

(D) Five categories

18. Which of the following is/are categories of social research?

(A) Laboratory experiment

(B) Field experiment

(C) Survey research

19. Which of the following is/are types of field studies?

(A) Exploratory testing

(B) Hypothesis testing

(C) Both ‘A’ and ‘B’

20. Survey research studies

(B) Populations

(C) Circumstances

(D) Processes

21. Evaluation research is concerned with

(A) What are we doing?

(B) Why are we doing?

(C)  How well are we doing?

22. Action research is a type of

 (A) Applied research

 (B) Quality research

(C) Working research

(D) Survey research

23. Which of the following is the key factor in determining the success of group research?

(B) Organization

(C) Researcher

(D) Creativity

24. Which of the following have a direct bearing on research tools and techniques?

(A) Concepts

 (B) Knowledge

(C) Aspirations

25. The aim of group research is to achieve integration on

(A) Conceptual level

(B) Technical level

(C) Human level

(D) All of these


Mr. Perfect

Jai Hind... Greetings! It is your love and respect that make Mr. Perfect. Lecturer / Writer / Blogger / Dancer / Rapper / Fitness Lover / Actor. Founder of Income TaxPe, Hindi PiLa, Shayari me Kahani, MCQ Questions and Car Insurance Ok. Social worker; co-founder of Kartabya Foundation (an animal, social and natural welfare organization) and founder of Mission Green Balangir (a natural welfare organization).



NREL-led consortium releases PV reliability forecasting tools

The Durable Module Materials consortium (DuraMAT) announced in its latest annual report the availability of new PV forecasting tools, and new research results towards the goal of more reliable PV modules.


DuraMAT researchers conduct outdoor photoluminescence tests on PV modules, testing a high-throughput method to determine module health without needing to disconnect the PV system.

Image: NREL


From pv magazine Global

The Durable Module Materials ( DuraMAT ) consortium, established by the United States Department of Energy’s Solar Energy Technology Office (SETO), has released its  latest annual report  with news about the availability of new PV forecasting tools and new research about certain module degradation trends.

DuraMAT  reported the results of its focus on reliability forecasting in 2023, driven by the observation that the PV industry is “innovating so quickly that the performance of modules in the field is no longer always a reliable indicator of what will happen in the future.”

“We awarded six projects under our reliability forecasting call this year,” said Teresa Barnes, DuraMAT director and DOE National Renewable Energy Laboratory (NREL) researcher in a press release.

The reliability forecasting projects addressed ultraviolet-induced degradation, glass fracture mechanics, and degradation mechanisms in encapsulants, as well as how to do faster analysis of failure data. As a result, DuraMAT now has a suite of software tools and data sets, some of which rely on quantitative modeling and rapid validation technologies. The tools cover topics such as mechanical models for materials, wind loading, fracture mechanics, moisture diffusion, and irradiance, and are available in the  DuraMAT Data Hub .

“Drawing insights from all these areas should give us the capability to predict the long-term reliability of new module designs,” stated Barnes.

Two degradation mechanisms that received special attention from DuraMAT in 2023 are cell cracking and ultraviolet (UV) degradation. “Cracked cells are a challenge for the solar industry because they can reduce output but often go unnoticed,” said the team. Studies were carried out on quantifying and addressing cell cracking.

“Researchers found that some newer modules with many busbars, half-cut cells, and glass–glass encapsulation are more tolerant of cracked cells and less likely to show power loss,” it said. An outcome of the research is WhatsCracking, a free cell-fracture prediction application to assist in making modules that are less sensitive to cell breakage – for example, by designing modules that rotate half-cells at 90-degree angles to reduce the chance of cracking under load, as reported in  pv magazine . The WhatsCracking app is one of the tools in the DuraMAT Data Hub .

DuraMAT researchers also found that UV-induced degradation is a significant issue in certain high-efficiency products. “These results are important, as the increased degradation related to UV exposure in modern cell types may offset some of the gains predicted from bifacial and other high-efficiency cells,” said the team, adding that DuraMAT will be starting new work to quantify this type of degradation in 2024.

The DuraMAT consortium, which is led by the DOE’s  National Renewable Energy Laboratory  (NREL), with participation by  Sandia National Laboratories  and  Lawrence Berkeley National Laboratory , includes a 22-member board of solar industry professionals.

This content is protected by copyright and may not be reused. If you want to cooperate with us and would like to reuse some of our content, please contact: editors@pv-magazine.com .

Valerie Thompson


