
Validity & Reliability In Research

A Plain-Language Explanation (With Examples)

By: Derek Jansen (MBA) | Expert Reviewer: Kerryn Warren (PhD) | September 2023

Validity and reliability are two related but distinctly different concepts within research. Understanding what they are and how to achieve them is critically important to any research project. In this post, we’ll unpack these two concepts as simply as possible.


Overview: Validity & Reliability

  • The big picture
  • Validity 101
  • Reliability 101 
  • Key takeaways

First, The Basics…

Let’s start with a big-picture view and then zoom in to the finer details.

Validity and reliability are two incredibly important concepts in research, especially within the social sciences. Both validity and reliability have to do with the measurement of variables and/or constructs – for example, job satisfaction, intelligence, productivity, etc. When undertaking research, you’ll often want to measure these types of constructs and variables and, at the simplest level, validity and reliability are about ensuring the quality and accuracy of those measurements.

As you can probably imagine, if your measurements aren’t accurate or there are quality issues at play when you’re collecting your data, your entire study will be at risk. Therefore, validity and reliability are very important concepts to understand (and to get right). So, let’s unpack each of them.


What Is Validity?

In simple terms, validity (also called “construct validity”) is all about whether a research instrument accurately measures what it’s supposed to measure.

For example, let’s say you have a set of Likert scales that are supposed to quantify someone’s level of overall job satisfaction. If this set of scales focused on only one dimension of job satisfaction, say pay satisfaction, this would not be a valid measurement, as it captures only one aspect of a multidimensional construct. In other words, pay satisfaction alone is only one contributing factor toward overall job satisfaction, and therefore it’s not a valid way to measure someone’s job satisfaction.


Oftentimes in quantitative studies, the way in which the researcher or survey designer interprets a question or statement can differ from how the study participants interpret it. Given that respondents don’t have the opportunity to ask clarifying questions when taking a survey, it’s easy for these sorts of misunderstandings to crop up. Naturally, if the respondents are interpreting the question in the wrong way, the data they provide will be pretty useless. Therefore, ensuring that a study’s measurement instruments are valid – in other words, that they are measuring what they intend to measure – is incredibly important.

There are various types of validity and we’re not going to go down that rabbit hole in this post, but it’s worth quickly highlighting the importance of making sure that your research instrument is tightly aligned with the theoretical construct you’re trying to measure. In other words, you need to pay careful attention to how the key theories within your study define the thing you’re trying to measure – and then make sure that your survey presents it in the same way.

For example, sticking with the “job satisfaction” construct we looked at earlier, you’d need to clearly define what you mean by job satisfaction within your study (and this definition would of course need to be underpinned by the relevant theory). You’d then need to make sure that your chosen definition is reflected in the types of questions or scales you’re using in your survey. Simply put, you need to make sure that your survey respondents are perceiving your key constructs in the same way you are. Or, even if they’re not, that your measurement instrument is capturing the necessary information that reflects your definition of the construct at hand.




What Is Reliability?

As with validity, reliability is an attribute of a measurement instrument – for example, a survey, a weight scale or even a blood pressure monitor. But while validity is concerned with whether the instrument is measuring the “thing” it’s supposed to be measuring, reliability is concerned with consistency and stability. In other words, reliability reflects the degree to which a measurement instrument produces consistent results when applied repeatedly to the same phenomenon, under the same conditions.

As you can probably imagine, a measurement instrument that achieves a high level of consistency is naturally more dependable (or reliable) than one that doesn’t – in other words, it can be trusted to provide consistent measurements. And that, of course, is what you want when undertaking empirical research. If you think about it within a more domestic context, just imagine if you found that your bathroom scale gave you a different number every time you hopped on and off of it – you wouldn’t feel too confident in its ability to measure the variable that is your body weight 🙂

It’s worth mentioning that reliability also extends to the person using the measurement instrument. For example, if two researchers use the same instrument (let’s say a measuring tape) and they get different measurements, there’s likely an issue with how one (or both) of them is using the measuring tape. So, when you think about reliability, consider both the instrument and the researcher as part of the equation.

As with validity, there are various types of reliability and various tests that can be used to assess the reliability of an instrument. A popular one that you’ll likely come across for survey instruments is Cronbach’s alpha, which is a statistical measure that quantifies the degree to which items within an instrument (for example, a set of Likert scales) measure the same underlying construct. In other words, Cronbach’s alpha indicates how closely related the items are and whether they consistently capture the same concept.
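
To make this concrete, here’s a minimal sketch of Cronbach’s alpha in Python (using NumPy). The formula is the standard one; the respondent scores below are hypothetical, invented purely for illustration:

```python
import numpy as np

def cronbachs_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) matrix of scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                              # number of items (e.g., Likert scales)
    sum_item_var = items.var(axis=0, ddof=1).sum()  # variance of each item, summed
    total_var = items.sum(axis=1).var(ddof=1)       # variance of respondents' total scores
    return (k / (k - 1)) * (1 - sum_item_var / total_var)

# Five respondents answering three job-satisfaction items on a 1-5 scale
scores = np.array([
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [3, 4, 3],
    [1, 2, 2],
])
print(round(cronbachs_alpha(scores), 2))
```

As a rough rule of thumb, values above about 0.7 are often treated as acceptable, though a very high alpha can also signal that some items are redundant.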


Recap: Key Takeaways

Alright, let’s quickly recap to cement your understanding of validity and reliability:

  • Validity is concerned with whether an instrument (e.g., a set of Likert scales) is measuring what it’s supposed to measure.
  • Reliability is concerned with whether that measurement is consistent and stable when measuring the same phenomenon under the same conditions.

In short, validity and reliability are both essential to ensuring that your data collection efforts deliver high-quality, accurate data that help you answer your research questions. So, be sure to always pay careful attention to the validity and reliability of your measurement instruments when collecting and analysing data. As the adage goes, “rubbish in, rubbish out” – make sure that your data inputs are rock-solid.


Reliability and Validity – Definitions, Types & Examples

Published by Alvin Nicolas on August 16th, 2021 | Revised on October 26, 2023

A researcher must test the collected data before drawing any conclusions. Every research design needs to be concerned with reliability and validity, as these measure the quality of the research.

What is Reliability?

Reliability refers to the consistency of the measurement and shows how trustworthy the scores of a test are. If the collected data show the same results after being tested using various methods and sample groups, the data are reliable. Note, however, that a reliable method does not automatically produce valid results: a measurement can be consistently wrong.

Example: If you weigh yourself on a weighing scale several times throughout the day and get the same reading each time, those results are reliable, as they were obtained through repeated measures.

Example: If a teacher gives students a math test and repeats it the next week with the same questions, and the students obtain similar scores, then the reliability of the test is high.

What is Validity?

Validity refers to the accuracy of the measurement and shows how suitable a specific test is for a particular situation. If the results are accurate according to the researcher’s situation, explanation, and prediction, then the research is valid.

If the method of measuring is accurate, then it’ll produce accurate results. A valid method must also be reliable, but a reliable method is not necessarily valid. If a method is not reliable, however, it cannot be valid.

Example: Your weighing scale shows different results each time you weigh yourself within a day, even though you handle it carefully and weigh yourself under the same conditions. Your weighing machine might be malfunctioning. This means your method has low reliability, and the inconsistent results it produces cannot be valid.

Example: Suppose a questionnaire is distributed among a group of people to check the quality of a skincare product, and the same questionnaire is then repeated with many other groups. If you get consistent responses from the various participants, the questionnaire has high reliability; if those responses also accurately reflect the product’s quality, its validity is high as well.

Most of the time, validity is difficult to establish even when the measurement process is reliable, because it is hard to know the true value of the thing being measured.

Example: If the weighing scale shows the same result, let’s say 70 kg, each time, even though your actual weight is 55 kg, then the weighing scale is malfunctioning. Because it shows consistent results, it is reliable, but it cannot be considered valid: the method has high reliability but low validity.
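
To see the distinction numerically, here’s a minimal sketch in Python. The true weight and the scale’s readings are hypothetical: a miscalibrated scale produces tightly clustered readings (high reliability) that are all far from the true value (low validity):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
true_weight = 55.0  # kg, the person's actual weight

# A miscalibrated scale: readings cluster tightly around the wrong value
readings = rng.normal(loc=70.0, scale=0.1, size=10)

print(f"spread of readings: {readings.std(ddof=1):.2f} kg")       # tiny spread -> reliable
print(f"average error: {readings.mean() - true_weight:+.1f} kg")  # large bias  -> not valid
```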

Internal Vs. External Validity

One of the key features of randomised designs is that they have significantly high internal and external validity.

Internal validity  is the ability to draw a causal link between your treatment and the dependent variable of interest. It means the observed changes should be due to the experiment conducted, and any external factor should not influence the  variables .

Examples of such variables: age, education level, height, and grade.

External validity  is the ability to identify and generalise your study outcomes to the population at large. The relationship between the study’s situation and the situations outside the study is considered external validity.



How to Assess Reliability and Validity?

Reliability can be measured by comparing the consistency of a procedure and its results across repetitions. There are various methods to measure validity and reliability. Reliability can be assessed through various statistical methods depending on the type of reliability, as explained below:

Types of Reliability and Validity

Common types of reliability include test-retest reliability, inter-rater reliability, and internal consistency. As we discussed above, the reliability of a measurement alone cannot determine its validity: validity is difficult to measure even if the method is reliable. The main tests conducted for measuring validity assess content, construct, criterion, and face validity.


How to Increase Reliability?

  • Use an appropriate questionnaire to measure the competency level.
  • Ensure a consistent environment for participants.
  • Make the participants familiar with the assessment criteria.
  • Train the participants appropriately.
  • Analyse the research items regularly to avoid poor performance.

How to Increase Validity?

Ensuring validity is not an easy job either. The following practices help ensure validity:

  • Minimise reactivity (participants changing their behaviour because they know they are being studied).
  • Reduce the Hawthorne effect.
  • Keep the respondents motivated.
  • Keep the intervals between the pre-test and post-test short.
  • Minimise dropout rates.
  • Ensure inter-rater reliability.
  • Match the control and experimental groups with each other.

How to Implement Reliability and Validity in your Thesis?

According to experts, it is helpful to implement the concepts of reliability and validity throughout your research; they are adopted especially often in theses and dissertations.

Frequently Asked Questions

What is reliability and validity in research?

Reliability in research refers to the consistency and stability of measurements or findings. Validity relates to the accuracy and truthfulness of results, measuring what the study intends to. Both are crucial for trustworthy and credible research outcomes.

What is validity?

Validity in research refers to the extent to which a study accurately measures what it intends to measure. It ensures that the results are truly representative of the phenomena under investigation. Without validity, research findings may be irrelevant, misleading, or incorrect, limiting their applicability and credibility.

What is reliability?

Reliability in research refers to the consistency and stability of measurements over time. If a study is reliable, repeating the experiment or test under the same conditions should produce similar results. Without reliability, findings become unpredictable and lack dependability, potentially undermining the study’s credibility and generalisability.

What is reliability in psychology?

In psychology, reliability refers to the consistency of a measurement tool or test. A reliable psychological assessment produces stable and consistent results across different times, situations, or raters. It ensures that an instrument’s scores are not due to random error, making the findings dependable and reproducible in similar conditions.

What is test-retest reliability?

Test-retest reliability assesses the consistency of measurements taken by a test over time. It involves administering the same test to the same participants at two different points in time and comparing the results. A high correlation between the scores indicates that the test produces stable and consistent results over time.

How to improve reliability of an experiment?

  • Standardise procedures and instructions.
  • Use consistent and precise measurement tools.
  • Train observers or raters to reduce subjective judgments.
  • Increase sample size to reduce random errors.
  • Conduct pilot studies to refine methods.
  • Repeat measurements or use multiple methods.
  • Address potential sources of variability.

What is the difference between reliability and validity?

Reliability refers to the consistency and repeatability of measurements, ensuring results are stable over time. Validity indicates how well an instrument measures what it’s intended to measure, ensuring accuracy and relevance. While a test can be reliable without being valid, a valid test must inherently be reliable. Both are essential for credible research.

Are interviews reliable and valid?

Interviews can be both reliable and valid, but they are susceptible to biases. The reliability and validity depend on the design, structure, and execution of the interview. Structured interviews with standardised questions improve reliability. Validity is enhanced when questions accurately capture the intended construct and when interviewer biases are minimised.

Are IQ tests valid and reliable?

IQ tests are generally considered reliable, producing consistent scores over time. Their validity, however, is a subject of debate. While they effectively measure certain cognitive skills, whether they capture the entirety of “intelligence” or predict success in all life areas is contested. Cultural bias and over-reliance on tests are also concerns.

Are questionnaires reliable and valid?

Questionnaires can be both reliable and valid if well-designed. Reliability is achieved when they produce consistent results over time or across similar populations. Validity is ensured when questions accurately measure the intended construct. However, factors like poorly phrased questions, respondent bias, and lack of standardisation can compromise their reliability and validity.



Validity – Types, Examples and Guide


Validity

Definition:

Validity refers to the extent to which a concept, measure, or study accurately represents the meaning or reality it is intended to capture. It is a fundamental concept in research and assessment, concerning the soundness and appropriateness of the conclusions, inferences, or interpretations made on the basis of the data or evidence collected.

Research Validity

Research validity refers to the degree to which a study accurately measures or reflects what it claims to measure. In other words, research validity concerns whether the conclusions drawn from a study are based on accurate, reliable and relevant data.

Validity is a concept used in logic and research methodology to assess the strength of an argument or the quality of a research study. It refers to the extent to which a conclusion or result is supported by evidence and reasoning.

How to Ensure Validity in Research

Ensuring validity in research involves several steps and considerations throughout the research process. Here are some key strategies to help maintain research validity:

Clearly Define Research Objectives and Questions

Start by clearly defining your research objectives and formulating specific research questions. This helps focus your study and ensures that you are addressing relevant and meaningful research topics.

Use appropriate research design

Select a research design that aligns with your research objectives and questions. Different types of studies, such as experimental, observational, qualitative, or quantitative, have specific strengths and limitations. Choose the design that best suits your research goals.

Use reliable and valid measurement instruments

If you are measuring variables or constructs, ensure that the measurement instruments you use are reliable and valid. This involves using established and well-tested tools or developing your own instruments through rigorous validation processes.

Ensure a representative sample

When selecting participants or subjects for your study, aim for a sample that is representative of the population you want to generalize to. Consider factors such as age, gender, socioeconomic status, and other relevant demographics to ensure your findings can be generalized appropriately.

Address potential confounding factors

Identify potential confounding variables or biases that could impact your results. Implement strategies such as randomization, matching, or statistical control to minimize the influence of confounding factors and increase internal validity.

Minimize measurement and response biases

Be aware of measurement biases and response biases that can occur during data collection. Use standardized protocols, clear instructions, and trained data collectors to minimize these biases. Employ techniques like blinding or double-blinding in experimental studies to reduce bias.

Conduct appropriate statistical analyses

Ensure that the statistical analyses you employ are appropriate for your research design and data type. Select statistical tests that are relevant to your research questions and use robust analytical techniques to draw accurate conclusions from your data.

Consider external validity

While it may not always be possible to achieve high external validity, be mindful of the generalizability of your findings. Clearly describe your sample and study context to help readers understand the scope and limitations of your research.

Peer review and replication

Submit your research for peer review by experts in your field. Peer review helps identify potential flaws, biases, or methodological issues that can impact validity. Additionally, encourage replication studies by other researchers to validate your findings and enhance the overall reliability of the research.

Transparent reporting

Clearly and transparently report your research methods, procedures, data collection, and analysis techniques. Provide sufficient details for others to evaluate the validity of your study and replicate your work if needed.

Types of Validity

There are several types of validity that researchers consider when designing and evaluating studies. Here are some common types of validity:

Internal Validity

Internal validity relates to the degree to which a study accurately identifies causal relationships between variables. It addresses whether the observed effects can be attributed to the manipulated independent variable rather than confounding factors. Threats to internal validity include selection bias, history effects, maturation of participants, and instrumentation issues.

External Validity

External validity concerns the generalizability of research findings to the broader population or real-world settings. It assesses the extent to which the results can be applied to other individuals, contexts, or timeframes. Factors that can limit external validity include sample characteristics, research settings, and the specific conditions under which the study was conducted.

Construct Validity

Construct validity examines whether a study adequately measures the intended theoretical constructs or concepts. It focuses on the alignment between the operational definitions used in the study and the underlying theoretical constructs. Construct validity can be threatened by issues such as poor measurement tools, inadequate operational definitions, or a lack of clarity in the conceptual framework.

Content Validity

Content validity refers to the degree to which a measurement instrument or test adequately covers the entire range of the construct being measured. It assesses whether the items or questions included in the measurement tool represent the full scope of the construct. Content validity is often evaluated through expert judgment, reviewing the relevance and representativeness of the items.

Criterion Validity

Criterion validity determines the extent to which a measure or test is related to an external criterion or standard. It assesses whether the results obtained from a measurement instrument align with other established measures or outcomes. Criterion validity can be divided into two subtypes: concurrent validity, which examines the relationship between the measure and the criterion at the same time, and predictive validity, which investigates the measure’s ability to predict future outcomes.
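
As a rough illustration of predictive validity, the sketch below correlates selection-test scores with job-performance ratings collected later. All numbers, and the 0–100 and 0–5 scales, are hypothetical:

```python
import numpy as np

# Hypothetical data for eight hires: selection-test scores (0-100) and
# supervisor performance ratings (0-5) collected six months after hiring
test_scores = np.array([72, 85, 60, 90, 78, 55, 88, 66])
performance = np.array([3.1, 4.2, 2.8, 4.5, 3.6, 2.5, 4.4, 3.0])

# Predictive validity: how strongly does the test predict the later criterion?
r = np.corrcoef(test_scores, performance)[0, 1]
print(f"predictive validity coefficient: r = {r:.2f}")
```

For concurrent validity, the computation is the same; the only difference is that the criterion is measured at the same time as the test rather than afterwards.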

Face Validity

Face validity refers to the degree to which a measurement or test appears, on the surface, to measure what it intends to measure. It is a subjective assessment based on whether the items seem relevant and appropriate to the construct being measured. Face validity is often used as an initial evaluation before conducting more rigorous validity assessments.

Importance of Validity

Validity is crucial in research for several reasons:

  • Accurate Measurement: Validity ensures that the measurements or observations in a study accurately represent the intended constructs or variables. Without validity, researchers cannot be confident that their results truly reflect the phenomena they are studying. Validity allows researchers to draw accurate conclusions and make meaningful inferences based on their findings.
  • Credibility and Trustworthiness: Validity enhances the credibility and trustworthiness of research. When a study demonstrates high validity, it indicates that the researchers have taken appropriate measures to ensure the accuracy and integrity of their work. This strengthens the confidence of other researchers, peers, and the wider scientific community in the study’s results and conclusions.
  • Generalizability: Validity helps determine the extent to which research findings can be generalized beyond the specific sample and context of the study. By addressing external validity, researchers can assess whether their results can be applied to other populations, settings, or situations. This information is valuable for making informed decisions, implementing interventions, or developing policies based on research findings.
  • Sound Decision-Making: Validity supports informed decision-making in various fields, such as medicine, psychology, education, and social sciences. When validity is established, policymakers, practitioners, and professionals can rely on research findings to guide their actions and interventions. Validity ensures that decisions are based on accurate and trustworthy information, which can lead to better outcomes and more effective practices.
  • Avoiding Errors and Bias: Validity helps researchers identify and mitigate potential errors and biases in their studies. By addressing internal validity, researchers can minimize confounding factors and alternative explanations, ensuring that the observed effects are genuinely attributable to the manipulated variables. Validity assessments also highlight measurement errors or shortcomings, enabling researchers to improve their measurement tools and procedures.
  • Progress of Scientific Knowledge: Validity is essential for the advancement of scientific knowledge. Valid research contributes to the accumulation of reliable and valid evidence, which forms the foundation for building theories, developing models, and refining existing knowledge. Validity allows researchers to build upon previous findings, replicate studies, and establish a cumulative body of knowledge in various disciplines. Without validity, the scientific community would struggle to make meaningful progress and establish a solid understanding of the phenomena under investigation.
  • Ethical Considerations: Validity is closely linked to ethical considerations in research. Conducting valid research ensures that participants’ time, effort, and data are not wasted on flawed or invalid studies. It upholds the principle of respect for participants’ autonomy and promotes responsible research practices. Validity is also important when making claims or drawing conclusions that may have real-world implications, as misleading or invalid findings can have adverse effects on individuals, organizations, or society as a whole.

Examples of Validity

Here are some examples of validity in different contexts:

  • Logical validity: “All men are mortal. John is a man. Therefore, John is mortal.” This argument is logically valid because the conclusion follows logically from the premises.
  • Logical validity (an invalid argument): “If it is raining, then the ground is wet. The ground is wet. Therefore, it is raining.” This argument is not logically valid because there could be other reasons for the ground being wet, such as watering the plants.
  • Construct validity: In a study examining the relationship between caffeine consumption and alertness, the researchers use established measures of both variables, ensuring that they are accurately capturing the concepts they intend to measure.
  • Construct validity: A researcher develops a new questionnaire to measure anxiety levels. They administer it to a group of participants and find that it correlates highly with other established anxiety measures, indicating good construct validity for the new questionnaire.
  • External validity: A study on the effects of a particular teaching method is conducted in a controlled laboratory setting. Its findings may lack external validity because the conditions in the lab may not accurately reflect real-world classroom settings.
  • External validity: A research study on the effects of a new medication includes participants from diverse backgrounds and age groups, increasing the external validity of the findings to a broader population.
  • Internal validity: In an experiment, a researcher manipulates the independent variable (e.g., a new drug) and controls for other variables to ensure that any observed effects on the dependent variable (e.g., symptom reduction) are indeed due to the manipulation.
  • Internal validity (a threat): A researcher examines the relationship between exercise and mood by administering questionnaires to participants, but does not control for other potential factors that could influence mood, such as diet or stress levels, so the study lacks internal validity.
  • Face validity: A teacher develops a new test to assess students’ knowledge of a particular subject. The items on the test appear relevant to the topic and align with what one would expect to find on such a test, so the test appears to measure what it intends to measure.
  • Face validity: A company develops a new customer satisfaction survey. The questions seem to address key aspects of the customer experience and capture the relevant information, so the survey seems appropriate for assessing customer satisfaction.
  • Content validity: A team of experts reviews a comprehensive curriculum for a high school biology course, evaluating whether it covers all the essential topics and concepts necessary for a thorough understanding of biology, so that the curriculum is representative of the domain it intends to cover.
  • Content validity: A researcher develops a questionnaire to assess career satisfaction. The questions encompass various dimensions of job satisfaction, such as salary, work-life balance, and career growth, so the questionnaire adequately represents the different aspects of career satisfaction.
  • Criterion validity: A company evaluates the effectiveness of a new employee selection test by administering it to job applicants and later assessing the job performance of those hired. A strong correlation between test scores and subsequent job performance indicates that the test is predictive of job success.
  • Criterion validity: A researcher compares the results of a new medical diagnostic tool with the gold-standard diagnostic method and finds a high level of agreement, demonstrating that the new tool is valid for accurately diagnosing the disease.

Where to Write About Validity in a Thesis

In a thesis, discussions related to validity are typically included in the methodology and results sections. Here are some specific places where you can address validity within your thesis:

Research Design and Methodology

In the methodology section, provide a clear and detailed description of the measures, instruments, or data collection methods used in your study. Discuss the steps taken to establish or assess the validity of these measures. Explain the rationale behind the selection of specific validity types relevant to your study, such as content validity, criterion validity, or construct validity. Discuss any modifications or adaptations made to existing measures and their potential impact on validity.

Measurement Procedures

In the methodology section, elaborate on the procedures implemented to ensure the validity of measurements. Describe how potential biases or confounding factors were addressed, controlled, or accounted for to enhance internal validity. Provide details on how you ensured that the measurement process accurately captures the intended constructs or variables of interest.

Data Collection

In the methodology section, discuss the steps taken to collect data and ensure data validity. Explain any measures implemented to minimize errors or biases during data collection, such as training of data collectors, standardized protocols, or quality control procedures. Address any potential limitations or threats to validity related to the data collection process.

Data Analysis and Results

In the results section, present the analysis and findings related to validity. Report any statistical tests, correlations, or other measures used to assess validity. Provide interpretations and explanations of the results obtained. Discuss the implications of the validity findings for the overall reliability and credibility of your study.

Limitations and Future Directions

In the discussion or conclusion section, reflect on the limitations of your study, including limitations related to validity. Acknowledge any potential threats or weaknesses to validity that you encountered during your research. Discuss how these limitations may have influenced the interpretation of your findings and suggest avenues for future research that could address these validity concerns.

Applications of Validity

Validity is applicable in various areas and contexts where research and measurement play a role. Here are some common applications of validity:

Psychological and Behavioral Research

Validity is crucial in psychology and behavioral research to ensure that measurement instruments accurately capture constructs such as personality traits, intelligence, attitudes, emotions, or psychological disorders. Validity assessments help researchers determine if their measures are truly measuring the intended psychological constructs and if the results can be generalized to broader populations or real-world settings.

Educational Assessment

Validity is essential in educational assessment to determine if tests, exams, or assessments accurately measure students’ knowledge, skills, or abilities. It ensures that the assessment aligns with the educational objectives and provides reliable information about student performance. Validity assessments help identify if the assessment is valid for all students, regardless of their demographic characteristics, language proficiency, or cultural background.

Program Evaluation

Validity plays a crucial role in program evaluation, where researchers assess the effectiveness and impact of interventions, policies, or programs. By establishing validity, evaluators can determine if the observed outcomes are genuinely attributable to the program being evaluated rather than extraneous factors. Validity assessments also help ensure that the evaluation findings are applicable to different populations, contexts, or timeframes.

Medical and Health Research

Validity is essential in medical and health research to ensure the accuracy and reliability of diagnostic tools, measurement instruments, and clinical assessments. Validity assessments help determine if a measurement accurately identifies the presence or absence of a medical condition, measures the effectiveness of a treatment, or predicts patient outcomes. Validity is crucial for establishing evidence-based medicine and informing medical decision-making.

Social Science Research

Validity is relevant in various social science disciplines, including sociology, anthropology, economics, and political science. Researchers use validity to ensure that their measures and methods accurately capture social phenomena, such as social attitudes, behaviors, social structures, or economic indicators. Validity assessments support the reliability and credibility of social science research findings.

Market Research and Surveys

Validity is important in market research and survey studies to ensure that the survey questions effectively measure consumer preferences, buying behaviors, or attitudes towards products or services. Validity assessments help researchers determine if the survey instrument is accurately capturing the desired information and if the results can be generalized to the target population.

Limitations of Validity

Here are some limitations of validity:

  • Construct Validity: Limitations of construct validity include the potential for measurement error, inadequate operational definitions of constructs, or the failure to capture all aspects of a complex construct.
  • Internal Validity: Limitations of internal validity may arise from confounding variables, selection bias, or the presence of extraneous factors that could influence the study outcomes, making it difficult to attribute causality accurately.
  • External Validity: Limitations of external validity can occur when the study sample does not represent the broader population, when the research setting differs significantly from real-world conditions, or when the study lacks ecological validity, i.e., the findings do not reflect real-world complexities.
  • Measurement Validity: Limitations of measurement validity can arise from measurement error, inadequately designed or flawed measurement scales, or limitations inherent in self-report measures, such as social desirability bias or recall bias.
  • Statistical Conclusion Validity: Limitations in statistical conclusion validity can occur due to sampling errors, inadequate sample sizes, or improper statistical analysis techniques, leading to incorrect conclusions or generalizations.
  • Temporal Validity: Limitations of temporal validity arise when the study results become outdated due to changes in the studied phenomena, interventions, or contextual factors.
  • Researcher Bias: Researcher bias can affect the validity of a study. Biases can emerge through the researcher’s subjective interpretation, influence of personal beliefs, or preconceived notions, leading to unintentional distortion of findings or failure to consider alternative explanations.
  • Ethical Validity: Limitations can arise if the study design or methods involve ethical concerns, such as the use of deceptive practices, inadequate informed consent, or potential harm to participants.



Reliability vs. Validity in Research: Types & Examples


When it comes to research, getting things right is crucial. That’s where the concepts of “Reliability vs Validity in Research” come in. 

Imagine it like a balancing act – making sure your measurements are consistent and accurate at the same time. This is where test-retest reliability, having different researchers check things, and keeping things consistent within your research play a big role.

As we dive into this topic, we’ll uncover the differences between reliability and validity, see how they work together, and learn how to use them effectively.

Understanding Reliability vs. Validity in Research

When it comes to collecting data and conducting research, two crucial concepts stand out: reliability and validity. 

These pillars uphold the integrity of research findings, ensuring that the data collected and the conclusions drawn are both meaningful and trustworthy. Let’s dive into the heart of these concepts, reliability and validity, to truly comprehend their significance in the realm of research.

What is reliability?

Reliability refers to the consistency and dependability of the data collection process. It’s like having a steady hand that produces the same result each time it reaches for a task. 

In the research context, reliability is all about ensuring that if you were to repeat the same study using the same reliable measurement technique, you’d end up with the same results. It’s like having multiple researchers independently conduct the same experiment and getting outcomes that align perfectly.

Imagine you’re using a thermometer to measure the temperature of the water. You have a reliable measurement if you dip the thermometer into the water multiple times and get the same reading each time. This tells you that your method and measurement technique consistently produce the same results, whether it’s you or another researcher performing the measurement.

What is validity?

On the other hand, validity refers to the accuracy and meaningfulness of your data. It’s like ensuring that the puzzle pieces you’re putting together actually form the intended picture. When you have validity, you know that your method and measurement technique are consistent and capable of producing results aligned with reality.

Think of it this way; Imagine you’re conducting a test that claims to measure a specific trait, like problem-solving ability. If the test consistently produces results that accurately reflect participants’ problem-solving skills, then the test has high validity. In this case, the test produces accurate results that truly correspond to the trait it aims to measure.

In essence, while reliability assures you that your data collection process is like a well-oiled machine producing the same results, validity steps in to ensure that these results are not only consistent but also relevantly accurate. 

Together, these concepts provide researchers with the tools to conduct research that stands on a solid foundation of dependable methods and meaningful insights.

Types of Reliability

Let’s explore the various types of reliability that researchers consider to ensure their work stands on solid ground.

Test-retest reliability

Test-retest reliability involves assessing the consistency of measurements over time. It’s like taking the same measurement or test twice – once and then again after a certain period. If the results align closely, it indicates that the measurement is reliable over time. Think of it as capturing the essence of stability. 

Inter-rater reliability

When multiple researchers or observers are part of the equation, interrater reliability comes into play. This type of reliability assesses the level of agreement between different observers when evaluating the same phenomenon. It’s like ensuring that different pairs of eyes perceive things in a similar way. 

Internal reliability

Internal consistency dives into the harmony among different items within a measurement tool aiming to assess the same concept. This often comes into play in surveys or questionnaires, where participants respond to various items related to a single construct. If the responses to these items consistently reflect the same underlying concept, the measurement is said to have high internal consistency. 

Types of Validity

Let’s explore the various types of validity that researchers consider to ensure their work stands on solid ground.

Content validity

It delves into whether a measurement truly captures all dimensions of the concept it intends to measure. It’s about making sure your measurement tool covers all relevant aspects comprehensively. 

Imagine designing a test to assess students’ understanding of a history chapter. It exhibits high content validity if the test includes questions about key events, dates, and causes. However, if it focuses solely on dates and omits causation, its content validity might be questionable.

Construct validity

It assesses how well a measurement aligns with established theories and concepts. It’s like ensuring that your measurement is a true representation of the abstract construct you’re trying to capture. 

Criterion validity

Criterion validity examines how well your measurement corresponds to other established measurements of the same concept. It’s about making sure your measurement accurately predicts or correlates with external criteria.

Differences between reliability and validity in research

Let’s delve into the differences between reliability and validity in research.

While both reliability and validity contribute to trustworthy research, they address distinct aspects. Reliability ensures consistent results, while validity ensures accurate and relevant results that reflect the true nature of the measured concept.

Example of Reliability and Validity in Research

In this section, we’ll explore instances that highlight the differences between reliability and validity and how they play a crucial role in ensuring the credibility of research findings.

Example of reliability

Imagine you are studying the reliability of a smartphone’s battery life measurement. To collect data, you fully charge the phone and measure the battery life three times in the same controlled environment—same apps running, same brightness level, and same usage patterns. 

If the measurements consistently show a similar battery life duration each time you repeat the test, it indicates that your measurement method is reliable. The consistent results under the same conditions assure you that the battery life measurement can be trusted to provide dependable information about the phone’s performance.

Example of validity

Researchers collect data from a group of participants in a study aiming to assess the validity of a newly developed stress questionnaire. To ensure validity, they compare the scores obtained from the stress questionnaire with the participants’ actual stress levels measured using physiological indicators such as heart rate variability and cortisol levels. 

If participants’ scores correlate strongly with their physiological stress levels, the questionnaire is valid. This means the questionnaire accurately measures participants’ stress levels, and its results correspond to real variations in their physiological responses to stress. 

Validity, assessed here through the correlation between questionnaire scores and physiological measures, ensures that the questionnaire is effectively measuring what it claims to measure: participants’ stress levels.

In the world of research, differentiating between reliability and validity is crucial. Reliability ensures consistent results, while validity confirms accurate measurements. For instance, measuring self-esteem consistently over time demonstrates reliability, while aligning questions with established theories demonstrates validity.


Reliability vs Validity in Research: Types & Examples

By: busayo.longe

In everyday life, we tend to use “reliable” as if it meant “valid”. However, in research and testing, reliability and validity are not the same things.

When it comes to data analysis, reliability refers to how replicable an outcome is. For example, if you measure a cup of rice three times and get the same result each time, that result is reliable.

Validity, on the other hand, refers to the measurement’s accuracy. This means that if the standard weight for a cup of rice is 5 grams and you measure a cup of rice, it should be 5 grams.

So, while reliability and validity are intertwined, they are not synonymous. If one of the measurement parameters, such as your scale, is distorted, the results will be consistent but invalid.

Data must be consistent and accurate to be used to draw useful conclusions. In this article, we’ll look at how to assess data reliability and validity, as well as how to apply it.


What is Reliability?

When a measurement is consistent, it’s reliable. But of course, reliability doesn’t mean your outcome will always be identical; it just means it will be within the same range.

For example, if you scored 95% on a test the first time and 96% the next time, your results are reliable. So, even if there is a minor difference in the outcomes, as long as it is within the error margin, your results are reliable.

Reliability allows you to assess the degree of consistency in your results. So, if you’re getting similar results, reliability provides an answer to the question of how similar your results are.

What is Validity?

A measurement or test is valid when it correlates with the expected result. It examines the accuracy of your result.

Here’s where things get tricky: to establish the validity of a test, the results must be consistent. In most experiments (especially physical measurements), the standard value that establishes the accuracy of a measurement is obtained by repeating the test until you get a consistent result.


For example, before I can conclude that all 12-inch rulers are one foot, I must repeat the experiment several times and obtain very similar results, indicating that 12-inch rulers are indeed one foot.

Most scientific experiments are inextricably linked in terms of validity and reliability. For example, if you’re measuring distance or depth, valid answers are likely to be reliable.

But for social research, one isn’t an indication of the other. For example, most people believe that people who wear glasses are smart.

Of course, I’ll find examples of people who wear glasses and have high IQs (reliability), but the truth is that most people who wear glasses simply need their vision to be better (validity). 

So reliable answers aren’t always correct but valid answers are always reliable.

How Are Reliability and Validity Assessed?

When assessing reliability, we want to know if the measurement can be replicated. Of course, we’d have to change some variables to ensure that this test holds, the most important of which are time, items, and observers.

If the main factor you change when performing a reliability test is time, you’re performing a test-retest reliability assessment.


However, if you are changing items, you are performing an internal consistency assessment. It means you’re measuring multiple items with a single instrument.

Finally, if you’re measuring the same item with the same instrument but using different observers or judges, you’re performing an inter-rater reliability test.

Assessing Validity

Evaluating validity can be more tedious than reliability. With reliability, you’re attempting to demonstrate that your results are consistent, whereas, with validity, you want to prove the correctness of your outcome.

Although validity is mainly categorized under two sections (internal and external), there are more than fifteen ways to check the validity of a test. In this article, we’ll be covering four.

First, content validity measures whether the test covers all the content it needs to produce the outcome you’re expecting.

Suppose I wanted to test the hypothesis that 90% of Generation Z uses social media polls for surveys while 90% of millennials use forms. I’d need a sample size that accounts for how Gen Z and millennials gather information.

Next, criterion validity is when you compare your results to what you’re supposed to get based on a chosen criterion. There are two ways this could be measured: predictive or concurrent validity.


Following that, we have face validity. It’s about whether a test appears, on the surface, to be what we anticipate. For instance, when answering a customer service survey, I’d expect to be asked about how I feel about the service provided.

Lastly, construct-related validity. This is a little more complicated, but it helps to show how the validity of research is based on different findings.

As a result, it provides information that either proves or disproves that certain things are related.

Types of Reliability

We have three main types of reliability assessment and here’s how they work:

1) Test-retest Reliability

This assessment refers to the consistency of outcomes over time. Testing reliability over time does not imply changing the amount of time it takes to conduct an experiment; rather, it means repeating the experiment multiple times in a short time.

For example, if I measure the length of my hair today, and tomorrow, I’ll most likely get the same result each time. 

A short period is relative in terms of reliability; two days for measuring hair length is considered short. But that’s far too long to test how quickly water dries on the sand.

A test-retest correlation is used to compare the consistency of your results. This is typically a scatter plot that shows how similar your values are between the two experiments.

If your measurements are reliable, the points on the scatter plot will cluster tightly along a line; if they aren’t, the points (values) will be spread across the graph.
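To make this concrete, here is a minimal sketch in Python of how a test-retest correlation could be computed. The score lists and variable names are purely hypothetical, invented for illustration:

```python
# A minimal sketch of a test-retest check, assuming two hypothetical lists
# of scores collected from the same eight respondents a week apart.
from scipy.stats import pearsonr

scores_time1 = [24, 18, 30, 27, 15, 22, 29, 20]  # first administration
scores_time2 = [25, 17, 29, 28, 16, 21, 30, 19]  # second administration

r, p_value = pearsonr(scores_time1, scores_time2)
print(f"Test-retest correlation: r = {r:.2f}")  # values near +1 indicate consistent results
```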


2) Internal Consistency

It’s also known as internal reliability. It refers to the consistency of results for various items when measured on the same scale.

This is particularly important in social science research, such as surveys, because it helps determine the consistency of people’s responses when asked the same questions.

Most introverts, for example, would say they enjoy spending time alone and having few friends. However, if some introverts claim that they either do not want time alone or prefer to be surrounded by many friends, it doesn’t add up.

Either these people aren’t really introverts, or this factor isn’t a reliable way of measuring introversion.

Internal reliability helps you demonstrate a test’s consistency across its items. It’s a little tough to gauge informally, but you can quantify it with the split-half correlation .

The split-half correlation simply means dividing the factors used to measure the underlying construct into two and plotting them against each other in the form of a scatter plot.

Introverts, for example, are assessed both on their need for alone time and on their preference for having few friends. If the plot is widely dispersed, it’s likely that one of the traits does not actually indicate introversion.

3) Inter-Rater Reliability

This method of measuring reliability helps prevent personal bias. Inter-rater reliability assessment helps judge outcomes from the different perspectives of multiple observers.

A good example: you order a meal and find it delicious. You could be biased in your judgment for several reasons: your perception of the meal, your mood, and so on.

But it’s highly unlikely that six more people would agree that the meal is delicious if it isn’t. Another factor that could lead to bias is expertise. Professional dancers, for example, would perceive dance moves differently than non-professionals. 


So, if a person dances and records it, and both groups (professional and non-professional dancers) rate the video, there is a high likelihood of a significant difference in their ratings.

But if they both agree that the person is a great dancer, despite their opposing viewpoints, the person is likely a great dancer.

Types of Validity

Researchers use validity to determine whether a measurement is accurate or not. The accuracy of measurement is usually determined by comparing it to the standard value.

When a measurement is consistent over time and has high internal consistency, it increases the likelihood that it is valid.

1) Content Validity

This refers to determining validity by evaluating what is being measured. So content validity tests if your research is measuring everything it should to produce an accurate result.

For example, if I were to measure what causes hair loss in women, I’d have to consider things like postpartum hair loss, alopecia, hair manipulation, dryness, and so on.

By omitting any of these critical factors, you risk significantly reducing the validity of your research because you won’t be covering everything necessary to make an accurate deduction. 


For example, suppose a certain woman is losing her hair due to postpartum hair loss, excessive manipulation, and dryness, but in my research I only look at postpartum hair loss. My research will show that she has postpartum hair loss, which is an incomplete picture.

Yes, my conclusion is partially correct, but it does not fully account for the reasons why this woman is losing her hair.

2) Criterion Validity

This measures how well your measurement correlates with the other variables (criteria) you’d expect it to relate to. The two main classes of criterion validity are predictive and concurrent.

3) Predictive validity

It helps predict future outcomes based on the data you have. For example, if a large number of students performed exceptionally well in a test, you can use this to predict that they understood the concept on which the test was based and will perform well in their exams.

4) Concurrent validity

Concurrent validity, on the other hand, involves testing against a criterion measured at the same time. For example, setting up a literature test for your students on two different books and assessing them at the same time.

You’re measuring your students’ literature proficiency with these two books. If your students truly understood the subject, they should be able to correctly answer questions about both books.

5) Face Validity

Quantifying face validity can be a bit difficult because you are measuring how valid a method appears to be, not its actual validity. So, face validity is concerned with whether the measurement method looks like it will produce accurate results, rather than with the measurement itself.

If the method used for measurement doesn’t appear to test the accuracy of a measurement, its face validity is low.

Here’s an example: suppose I hypothesize that less than 40% of men over the age of 20 in Texas, USA, are at least 6 feet tall. The most logical approach would be to collect height data from men over the age of 20 in Texas.

However, asking men over the age of 20 what their favorite meal is to determine their height would be pretty bizarre. Such a method has low face validity because it bears no obvious relation to what I want to measure.

6) Construct-Related Validity

Construct-related validity assesses the accuracy of your research by collecting multiple pieces of evidence. It helps determine the validity of your results by comparing them to evidence that supports or refutes your measurement.

7) Convergent validity

If you’re assessing evidence that strongly correlates with the concept, that’s convergent validity . 

8) Discriminant validity

Discriminant validity examines your measurement by establishing what it should not correlate with. You rule out conceptually unrelated elements to help validate your research. Being a vegan, for example, does not imply that you are allergic to meat.

How to Ensure Validity and Reliability in Your Research

You need a bulletproof research design to ensure that your research is both valid and reliable. This means that your methods, sample, and even you, the researcher, shouldn’t be biased.

  • Ensuring Reliability

To enhance the reliability of your research, you need to apply your measurement method consistently. The chances of reproducing the same results for a test are higher when you maintain the method you’re using to experiment.

For example, suppose you want to reliably measure the weight of a bag of chips using a scale. You have to use that same scale consistently each time you repeat the measurement.

You must also keep the conditions of your research consistent. For instance, if you’re experimenting to see how quickly water dries on sand, you need to consider all of the weather elements that day.

So, if you experimented on a sunny day, the next experiment should also be conducted on a sunny day to obtain a reliable result.

  • Ensuring Validity

There are several ways to determine the validity of your research, and the majority of them require the use of highly specific and high-quality measurement methods.

Before you begin your test, choose the best method for producing the desired results. This method should be pre-existing and proven.

Also, your sample should be very specific. If you’re collecting data on how dogs respond to fear, your results are more likely to be valid if you base them on a specific breed of dog rather than dogs in general.

Validity and reliability are critical for achieving accurate and consistent results in research. While reliability does not always imply validity, validity establishes that a result is reliable. Validity is heavily dependent on previous results (standards), whereas reliability is dependent on the similarity of your results.



Chapter 5: Psychological Measurement

Reliability and Validity of Measurement

Learning Objectives

  • Define reliability, including the different types and how they are assessed.
  • Define validity, including the different types and how they are assessed.
  • Describe the kinds of evidence that would be relevant to assessing the reliability and validity of a particular measure.

Again, measurement involves assigning scores to individuals so that they represent some characteristic of the individuals. But how do researchers know that the scores actually represent the characteristic, especially when it is a construct like intelligence, self-esteem, depression, or working memory capacity? The answer is that they conduct research using the measure to confirm that the scores make sense based on their understanding of the construct being measured. This is an extremely important point. Psychologists do not simply  assume  that their measures work. Instead, they collect data to demonstrate  that they work. If their research does not demonstrate that a measure works, they stop using it.

As an informal example, imagine that you have been dieting for a month. Your clothes seem to be fitting more loosely, and several friends have asked if you have lost weight. If at this point your bathroom scale indicated that you had lost 10 pounds, this would make sense and you would continue to use the scale. But if it indicated that you had gained 10 pounds, you would rightly conclude that it was broken and either fix it or get rid of it. In evaluating a measurement method, psychologists consider two general dimensions: reliability and validity.

Reliability

Reliability  refers to the consistency of a measure. Psychologists consider three types of consistency: over time (test-retest reliability), across items (internal consistency), and across different researchers (inter-rater reliability).

Test-Retest Reliability

When researchers measure a construct that they assume to be consistent across time, then the scores they obtain should also be consistent across time.  Test-retest reliability  is the extent to which this is actually the case. For example, intelligence is generally thought to be consistent across time. A person who is highly intelligent today will be highly intelligent next week. This means that any good measure of intelligence should produce roughly the same scores for this individual next week as it does today. Clearly, a measure that produces highly inconsistent scores over time cannot be a very good measure of a construct that is supposed to be consistent.

Assessing test-retest reliability requires using the measure on a group of people at one time, using it again on the same group of people at a later time, and then looking at the test-retest correlation between the two sets of scores. This is typically done by graphing the data in a scatterplot and computing Pearson’s r. Figure 5.2 shows the correlation between two sets of scores of several university students on the Rosenberg Self-Esteem Scale, administered two times, a week apart. Pearson’s r for these data is +.95. In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.

[Figure 5.2: Test-retest scatterplot with score at time 1 on the x-axis and score at time 2 on the y-axis, showing fairly consistent scores.]

Again, high test-retest correlations make sense when the construct being measured is assumed to be consistent over time, which is the case for intelligence, self-esteem, and the Big Five personality dimensions. But other constructs are not assumed to be stable over time. The very nature of mood, for example, is that it changes. So a measure of mood that produced a low test-retest correlation over a period of a month would not be a cause for concern.

Internal Consistency

A second kind of reliability is internal consistency, which is the consistency of people’s responses across the items on a multiple-item measure. In general, all the items on such measures are supposed to reflect the same underlying construct, so people’s scores on those items should be correlated with each other. On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people’s responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct. This is as true for behavioural and physiological measures as for self-report measures. For example, people might make a series of bets in a simulated game of roulette as a measure of their level of risk seeking. This measure would be internally consistent to the extent that individual participants’ bets were consistently high or low across trials.

Like test-retest reliability, internal consistency can only be assessed by collecting and analyzing data. One approach is to look at a  split-half correlation . This involves splitting the items into two sets, such as the first and second halves of the items or the even- and odd-numbered items. Then a score is computed for each set of items, and the relationship between the two sets of scores is examined. For example, Figure 5.3 shows the split-half correlation between several university students’ scores on the even-numbered items and their scores on the odd-numbered items of the Rosenberg Self-Esteem Scale. Pearson’s  r  for these data is +.88. A split-half correlation of +.80 or greater is generally considered good internal consistency.

[Figure 5.3: Split-half scatterplot with score on even-numbered items on the x-axis and score on odd-numbered items on the y-axis, showing fairly consistent scores.]
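As a rough illustration, this kind of even/odd split-half correlation could be computed along the following lines in Python. The response matrix here is randomly generated, purely for illustration:

```python
# A rough sketch of a split-half (even vs. odd items) computation, assuming a
# hypothetical participants-by-items matrix of Likert-style scores.
import numpy as np

rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(50, 10))  # 50 people, 10 items, scores 1-5

odd_scores = responses[:, 0::2].sum(axis=1)   # items 1, 3, 5, ... per person
even_scores = responses[:, 1::2].sum(axis=1)  # items 2, 4, 6, ... per person

r = np.corrcoef(odd_scores, even_scores)[0, 1]
print(f"Split-half correlation: r = {r:.2f}")  # +.80 or greater is usually taken as good
```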

Perhaps the most common measure of internal consistency used by researchers in psychology is a statistic called  Cronbach’s α  (the Greek letter alpha). Conceptually, α is the mean of all possible split-half correlations for a set of items. For example, there are 252 ways to split a set of 10 items into two sets of five. Cronbach’s α would be the mean of the 252 split-half correlations. Note that this is not how α is actually computed, but it is a correct way of interpreting the meaning of this statistic. Again, a value of +.80 or greater is generally taken to indicate good internal consistency.
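Since averaging all possible split-half correlations is impractical, α is computed in practice from the item variances and the variance of the total scores. A small sketch using that standard variance-based formula, again with a hypothetical, randomly generated score matrix:

```python
# A small sketch of Cronbach's alpha using the standard variance-based
# formula; the score matrix is randomly generated, purely for illustration.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: participants x items matrix of scores."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(1)
items = rng.integers(1, 6, size=(50, 10))  # 50 participants, 10 items
print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")  # +.80 or greater is generally good
```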

Interrater Reliability

Many behavioural measures involve significant judgment on the part of an observer or a rater.  Inter-rater reliability  is the extent to which different observers are consistent in their judgments. For example, if you were interested in measuring university students’ social skills, you could make video recordings of them as they interacted with another student whom they are meeting for the first time. Then you could have two or more observers watch the videos and rate each student’s level of social skills. To the extent that each participant does in fact have some level of social skills that can be detected by an attentive observer, different observers’ ratings should be highly correlated with each other. Inter-rater reliability would also have been measured in Bandura’s Bobo doll study. In this case, the observers’ ratings of how many acts of aggression a particular child committed while playing with the Bobo doll should have been highly positively correlated. Interrater reliability is often assessed using Cronbach’s α when the judgments are quantitative or an analogous statistic called Cohen’s κ (the Greek letter kappa) when they are categorical.
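To make Cohen’s κ concrete: it compares the observed agreement between two raters with the agreement expected by chance. Here is a minimal hand-rolled sketch for categorical judgments; the rating lists are invented for illustration:

```python
# A hand-rolled sketch of Cohen's kappa for two raters' categorical
# judgments; the rating lists below are invented for illustration.
from collections import Counter

rater_a = ["aggressive", "calm", "calm", "aggressive", "calm", "calm"]
rater_b = ["aggressive", "calm", "aggressive", "aggressive", "calm", "calm"]

n = len(rater_a)
p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # raw agreement

# Agreement expected by chance, from each rater's category frequencies
counts_a, counts_b = Counter(rater_a), Counter(rater_b)
p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2

kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"Cohen's kappa = {kappa:.2f}")  # 1 = perfect agreement, 0 = chance level
```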

Validity

Validity is the extent to which the scores from a measure represent the variable they are intended to. But how do researchers make this judgment? We have already considered one factor that they take into account—reliability. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. As an absurd example, imagine someone who believes that people’s index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people’s index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. The fact that one person’s index finger is a centimetre longer than another’s would indicate nothing about which one had higher self-esteem.

Discussions of validity usually divide it into several distinct “types.” But a good way to interpret these types is that they are other kinds of evidence—in addition to reliability—that should be taken into account when judging the validity of a measure. Here we consider three basic kinds: face validity, content validity, and criterion validity.

Face Validity

Face validity  is the extent to which a measurement method appears “on its face” to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity. Although face validity can be assessed quantitatively—for example, by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to—it is usually assessed informally.

Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to. One reason is that it is based on people’s intuitions about human behaviour, which are frequently wrong. It is also the case that many established measures in psychology work quite well despite lacking face validity. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) measures many personality characteristics and disorders by having people decide whether each of 567 different statements applies to them—where many of the statements do not have any obvious relationship to the construct that they measure. For example, the items “I enjoy detective or mystery stories” and “The sight of blood doesn’t frighten me or make me sick” both measure the suppression of aggression. In this case, it is not the participants’ literal answers to these questions that are of interest, but rather whether the pattern of the participants’ responses to a series of questions matches those of individuals who tend to suppress their aggression.

Content Validity

Content validity  is the extent to which a measure “covers” the construct of interest. For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then his measure of test anxiety should include items about both nervous feelings and negative thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that he or she thinks positive thoughts about exercising, feels good about exercising, and actually exercises. So to have good content validity, a measure of people’s attitudes toward exercise would have to reflect all three of these aspects. Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct.

Criterion Validity

Criterion validity  is the extent to which people’s scores on a measure are correlated with other variables (known as  criteria ) that one would expect them to be correlated with. For example, people’s scores on a new measure of test anxiety should be negatively correlated with their performance on an important school exam. If it were found that people’s scores were in fact negatively correlated with their exam performance, then this would be a piece of evidence that these scores really represent people’s test anxiety. But if it were found that people scored equally well on the exam regardless of their test anxiety scores, then this would cast doubt on the validity of the measure.

A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them. For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking. People’s scores on this measure should be correlated with their participation in “extreme” activities such as snowboarding and rock climbing, the number of speeding tickets they have received, and even the number of broken bones they have had over the years. When the criterion is measured at the same time as the construct, criterion validity is referred to as concurrent validity ; however, when the criterion is measured at some point in the future (after the construct has been measured), it is referred to as predictive validity (because scores on the measure have “predicted” a future outcome).

Criteria can also include other measures of the same construct. For example, one would expect new measures of test anxiety or physical risk taking to be positively correlated with existing measures of the same constructs. This is known as convergent validity .

Assessing convergent validity requires collecting data using the measure. Researchers John Cacioppo and Richard Petty did this when they created their self-report Need for Cognition Scale to measure how much people value and engage in thinking (Cacioppo & Petty, 1982) [1] . In a series of studies, they showed that people’s scores were positively correlated with their scores on a standardized academic achievement test, and that their scores were negatively correlated with their scores on a measure of dogmatism (which represents a tendency toward obedience). In the years since it was created, the Need for Cognition Scale has been used in literally hundreds of studies and has been shown to be correlated with a wide variety of other variables, including the effectiveness of an advertisement, interest in politics, and juror decisions (Petty, Briñol, Loersch, & McCaslin, 2009) [2] .

Discriminant Validity

Discriminant validity , on the other hand, is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct. For example, self-esteem is a general attitude toward the self that is fairly stable over time. It is not the same as mood, which is how good or bad one happens to be feeling right now. So people’s scores on a new measure of self-esteem should not be very highly correlated with their moods. If the new measure of self-esteem were highly correlated with a measure of mood, it could be argued that the new measure is not really measuring self-esteem; it is measuring mood instead.

When they created the Need for Cognition Scale, Cacioppo and Petty also provided evidence of discriminant validity by showing that people’s scores were not correlated with certain other variables. For example, they found only a weak correlation between people’s need for cognition and a measure of their cognitive style—the extent to which they tend to think analytically by breaking ideas into smaller parts or holistically in terms of “the big picture.” They also found no correlation between people’s need for cognition and measures of their test anxiety and their tendency to respond in socially desirable ways. All these low correlations provide evidence that the measure is reflecting a conceptually distinct construct.

Key Takeaways

  • Psychological researchers do not simply assume that their measures work. Instead, they conduct research to show that they work. If they cannot show that they work, they stop using them.
  • There are two distinct criteria by which researchers evaluate their measures: reliability and validity. Reliability is consistency across time (test-retest reliability), across items (internal consistency), and across researchers (interrater reliability). Validity is the extent to which the scores actually represent the variable they are intended to.
  • Validity is a judgment based on various types of evidence. The relevant evidence includes the measure’s reliability, whether it covers the construct of interest, and whether the scores it produces are correlated with other variables they are expected to be correlated with and not correlated with variables that are conceptually distinct.
  • The reliability and validity of a measure is not established by any single study but by the pattern of results across multiple studies. The assessment of reliability and validity is an ongoing process.
  • Practice: Ask several friends to complete the Rosenberg Self-Esteem Scale. Then assess its internal consistency by making a scatterplot to show the split-half correlation (even- vs. odd-numbered items). Compute Pearson’s  r too if you know how.
  • Discussion: Think back to the last college exam you took and think of the exam as a psychological measure. What construct do you think it was intended to measure? Comment on its face and content validity. What data could you collect to assess its reliability and criterion validity?
  • Cacioppo, J. T., & Petty, R. E. (1982). The need for cognition. Journal of Personality and Social Psychology, 42 , 116–131. ↵
  • Petty, R. E, Briñol, P., Loersch, C., & McCaslin, M. J. (2009). The need for cognition. In M. R. Leary & R. H. Hoyle (Eds.), Handbook of individual differences in social behaviour (pp. 318–329). New York, NY: Guilford Press. ↵

The consistency of a measure.

The consistency of a measure over time.

The consistency of a measure on the same group of people at different times.

Consistency of people’s responses across the items on a multiple-item measure.

Method of assessing internal consistency through splitting the items into two sets and examining the relationship between them.

A statistic in which α is the mean of all possible split-half correlations for a set of items.

The extent to which different observers are consistent in their judgments.

The extent to which the scores from a measure represent the variable they are intended to.

The extent to which a measurement method appears to measure the construct of interest.

The extent to which a measure “covers” the construct of interest.

The extent to which people’s scores on a measure are correlated with other variables that one would expect them to be correlated with.

In reference to criterion validity, variables that one would expect to be correlated with the measure.

When the criterion is measured at the same time as the construct.

When the criterion is measured at some point in the future (after the construct has been measured).

When new measures positively correlate with existing measures of the same constructs.

The extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct.

Research Methods in Psychology - 2nd Canadian Edition Copyright © 2015 by Paul C. Price, Rajiv Jhangiani, & I-Chant A. Chiang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.



The 4 Types of Reliability in Research | Definitions & Examples

Published on 3 May 2022 by Fiona Middleton. Revised on 26 August 2022.

Reliability tells you how consistently a method measures something. When you apply the same method to the same   sample   under the same conditions, you should get the same results. If not, the method of measurement may be unreliable.

There are four main types of reliability. Each can be estimated by comparing different sets of results produced by the same method.

Table of contents

  • Test-retest reliability
  • Interrater reliability
  • Parallel forms reliability
  • Internal consistency
  • Which type of reliability applies to my research?

Test-retest reliability

Test-retest reliability measures the consistency of results when you repeat the same test on the same sample at a different point in time. You use it when you are measuring something that you expect to stay constant in your sample.

Why test-retest reliability is important

Many factors can influence your results at different points in time: for example, respondents might experience different moods, or external conditions might affect their ability to respond accurately.

Test-retest reliability can be used to assess how well a method resists these factors over time. The smaller the difference between the two sets of results, the higher the test-retest reliability.

How to measure test-retest reliability

To measure test-retest reliability, you conduct the same test on the same group of people at two different points in time. Then you calculate the correlation between the two sets of results.

Improving test-retest reliability

  • When designing tests or questionnaires , try to formulate questions, statements, and tasks in a way that won’t be influenced by the mood or concentration of participants.
  • When planning your methods of data collection , try to minimise the influence of external factors, and make sure all samples are tested under the same conditions.
  • Remember that changes can be expected to occur in the participants over time, and take these into account.


Interrater reliability

Inter-rater reliability (also called inter-observer reliability) measures the degree of agreement between different people observing or assessing the same thing. You use it when data is collected by researchers assigning ratings, scores or categories to one or more variables.

Why inter-rater reliability is important

People are subjective, so different observers’ perceptions of situations and phenomena naturally differ. Reliable research aims to minimise subjectivity as much as possible so that a different researcher could replicate the same results.

When designing the scale and criteria for data collection, it’s important to make sure that different people will rate the same variable consistently with minimal bias. This is especially important when there are multiple researchers involved in data collection or analysis.

How to measure inter-rater reliability

To measure inter-rater reliability, different researchers conduct the same measurement or observation on the same sample. Then you calculate the correlation between their different sets of results. If all the researchers give similar ratings, the test has high inter-rater reliability.

Improving inter-rater reliability

  • Clearly define your variables and the methods that will be used to measure them.
  • Develop detailed, objective criteria for how the variables will be rated, counted, or categorised.
  • If multiple researchers are involved, ensure that they all have exactly the same information and training.

Parallel forms reliability

Parallel forms reliability measures the correlation between two equivalent versions of a test. You use it when you have two different assessment tools or sets of questions designed to measure the same thing.

Why parallel forms reliability is important

If you want to use multiple different versions of a test (for example, to avoid respondents repeating the same answers from memory), you first need to make sure that all the sets of questions or measurements give reliable results.

How to measure parallel forms reliability

The most common way to measure parallel forms reliability is to produce a large set of questions to evaluate the same thing, then divide these randomly into two question sets.

The same group of respondents answers both sets, and you calculate the correlation between the results. High correlation between the two indicates high parallel forms reliability.

Improving parallel forms reliability

  • Ensure that all questions or test items are based on the same theory and formulated to measure the same thing.

Internal consistency

Internal consistency assesses the correlation between multiple items in a test that are intended to measure the same construct.

You can calculate internal consistency without repeating the test or involving other researchers, so it’s a good way of assessing reliability when you only have one dataset.

Why internal consistency is important

When you devise a set of questions or ratings that will be combined into an overall score, you have to make sure that all of the items really do reflect the same thing. If responses to different items contradict one another, the test might be unreliable.

How to measure internal consistency

Two common methods are used to measure internal consistency.

  • Average inter-item correlation : For a set of measures designed to assess the same construct, you calculate the correlation between the results of all possible pairs of items and then calculate the average.
  • Split-half reliability : You randomly split a set of measures into two sets. After testing the entire set on the respondents, you calculate the correlation between the two sets of responses.
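As a rough sketch of the first of these methods, the average inter-item correlation can be computed from a respondents-by-items score matrix. The data here are randomly generated, purely for illustration:

```python
# A rough sketch of the average inter-item correlation, assuming a
# hypothetical respondents-by-items score matrix (randomly generated here).
import numpy as np

rng = np.random.default_rng(2)
items = rng.integers(1, 6, size=(40, 6))  # 40 respondents, 6 items

corr_matrix = np.corrcoef(items, rowvar=False)               # 6 x 6 item correlations
pairs = corr_matrix[np.triu_indices_from(corr_matrix, k=1)]  # each item pair once
print(f"Average inter-item correlation = {pairs.mean():.2f}")
```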

Improving internal consistency

  • Take care when devising questions or measures: those intended to reflect the same concept should be based on the same theory and carefully formulated.

Which type of reliability applies to my research?

It’s important to consider reliability when planning your research design, collecting and analysing your data, and writing up your research. The type of reliability you should calculate depends on the type of research and your methodology.

If possible and relevant, you should statistically calculate reliability and state this alongside your results .


Reliability In Psychology Research: Definitions & Examples

By: Saul McLeod, PhD | Associate Editor: Olivia Guy-Evans, MSc

Reliability in psychology research refers to the reproducibility or consistency of measurements. Specifically, it is the degree to which a measurement instrument or procedure yields the same results on repeated trials. A measure is considered reliable if it produces consistent scores across different instances when the underlying thing being measured has not changed.

Reliability ensures that responses are consistent across times and occasions for instruments like questionnaires . Multiple forms of reliability exist, including test-retest, inter-rater, and internal consistency.

For example, if people weigh themselves during the day, they would expect to see a similar reading. Scales that measured weight differently each time would be of little use.

The same analogy could be applied to a tape measure that measures inches differently each time it is used. It would not be considered reliable.

If findings from research are replicated consistently, they are reliable. A correlation coefficient can be used to assess the degree of reliability. If a test is reliable, it should show a high positive correlation.

Of course, it is unlikely the same results will be obtained each time as participants and situations vary. Still, a strong positive correlation between the same test results indicates reliability.

Reliability is important because unreliable measures introduce random error that attenuates correlations and makes it harder to detect real relationships.

Ensuring high reliability for key measures in psychology research helps boost the sensitivity, validity, and replicability of studies. Estimating and reporting reliable evidence is considered an important methodological practice.

There are two types of reliability: internal and external.
  • Internal reliability refers to how consistently different items within a single test measure the same concept or construct. It ensures that a test is stable across its components.
  • External reliability measures how consistently a test produces similar results over repeated administrations or under different conditions. It ensures that a test is stable over time and situations.
Some key aspects of reliability in psychology research include:
  • Test-retest reliability : The consistency of scores for the same person across two or more separate administrations of the same measurement procedure over time. High test-retest reliability suggests the measure provides a stable, reproducible score.
  • Interrater reliability : The level of agreement in scores on a measure between different raters or observers rating the same target. High interrater reliability suggests the ratings are objective and not overly influenced by rater subjectivity or bias.
  • Internal consistency reliability : The degree to which different test items or parts of an instrument that measure the same construct yield similar results. Analyzed statistically using Cronbach’s alpha, a high value suggests the items measure the same underlying concept.

Test-Retest Reliability

The test-retest method assesses the external consistency of a test. Examples of appropriate tests include questionnaires and psychometric tests. It measures the stability of a test over time.

A typical assessment would involve giving participants the same test on two separate occasions. If the same or similar results are obtained, then external reliability is established.

Here’s how it works:

  • A test or measurement is administered to participants at one point in time.
  • After a certain period, the same test is administered again to the same participants without any intervention or treatment in between.
  • The scores from the two administrations are then correlated using a statistical method, often Pearson’s correlation.
  • A high correlation between the scores from the two test administrations indicates good test-retest reliability, suggesting the test yields consistent results over time.

This method is especially useful for tests that measure stable traits or characteristics that aren’t expected to change over short periods.

The disadvantage of the test-retest method is that it takes a long time for results to be obtained. The reliability can be influenced by the time interval between tests and any events that might affect participants’ responses during this interval.

Beck et al. (1996) studied the responses of 26 outpatients across two separate therapy sessions one week apart and found a correlation of .93, demonstrating high test-retest reliability of the depression inventory.

This is an example of why reliability matters in psychological research: if such tests were not reliable, some individuals might not be correctly diagnosed with disorders such as depression and consequently would not receive appropriate therapy.

The timing of the test is important; if the duration is too brief, then participants may recall information from the first test, which could bias the results.

Alternatively, if the duration is too long, it is feasible that the participants could have changed in some important way which could also bias the results.

Another aspect of reliability is the degree to which different raters give consistent estimates of the same behavior. This is known as inter-rater reliability, and it can be used for interviews and observations.

Inter-Rater Reliability

Inter-rater reliability, often termed inter-observer reliability, refers to the extent to which different raters or evaluators agree in assessing a particular phenomenon, behavior, or characteristic. It’s a measure of consistency and agreement between individuals scoring or evaluating the same items or behaviors.

High inter-rater reliability indicates that the findings or measurements are consistent across different raters, suggesting the results are not due to random chance or subjective biases of individual raters.

Statistical measures, such as Cohen’s Kappa or the Intraclass Correlation Coefficient (ICC), are often employed to quantify the level of agreement between raters, helping to ensure that findings are objective and reproducible.
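In practice, such statistics are usually obtained from a library rather than computed by hand. A minimal sketch using scikit-learn’s built-in Cohen’s kappa, with made-up categorical ratings:

```python
# A minimal sketch using scikit-learn's built-in Cohen's kappa;
# the two raters' categorical judgments below are made up.
from sklearn.metrics import cohen_kappa_score

rater_1 = ["push", "no push", "push", "push", "no push"]
rater_2 = ["push", "no push", "no push", "push", "no push"]

print(cohen_kappa_score(rater_1, rater_2))  # 1.0 would mean perfect agreement
```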

Ensuring high inter-rater reliability is essential, especially in studies involving subjective judgment or observations, as it provides confidence that the findings are replicable and not heavily influenced by individual rater biases.

Note it can also be called inter-observer reliability when referring to observational research. Here, researchers observe the same behavior independently (to avoid bias) and compare their data. If the data is similar, then it is reliable.

Where observer scores do not significantly correlate, reliability can be improved by:

  • Training observers in the observation techniques and ensuring everyone agrees with them.
  • Ensuring behavior categories have been operationalized, meaning they have been objectively defined.
For example, if two researchers are observing ‘aggressive behavior’ of children at nursery they would both have their own subjective opinion regarding what aggression comprises.

In this scenario, they would be unlikely to record aggressive behavior the same, and the data would be unreliable.

However, if they were to operationalize the behavior category of aggression, this would be more objective and make it easier to identify when a specific behavior occurs.

For example, while “aggressive behavior” is subjective and not operationalized, “pushing” is objective and operationalized. Thus, researchers could count how many times children push each other over a certain duration of time.

Internal Consistency Reliability

Internal consistency reliability refers to how well different items on a test or survey that are intended to measure the same construct produce similar scores.

For example, a questionnaire measuring depression may have multiple questions tapping issues like sadness, changes in sleep and appetite, fatigue, and loss of interest. The assumption is that people’s responses across these different symptom items should be fairly consistent.

Cronbach’s alpha is a common statistic used to quantify internal consistency reliability. It calculates the average inter-item correlations among the test items. Values range from 0 to 1, with higher values indicating greater internal consistency. A good rule of thumb is that alpha should generally be above .70 to suggest adequate reliability.

An alpha of .90 for a depression questionnaire, for example, means there is a high average correlation between respondents’ scores on the different symptom items.

This suggests all the items are measuring the same underlying construct (depression) in a consistent manner. It taps the unidimensionality of the scale – evidence it is measuring one thing.

If some items were unrelated to others, the average inter-item correlations would be lower, resulting in a lower alpha. This would indicate the presence of multiple dimensions in the scale, rather than a unified single concept.

So, in summary, high internal consistency reliability evidenced through high Cronbach’s alpha provides support for the fact that various test items successfully tap into the same latent variable the researcher intends to measure. It suggests the items meaningfully cohere together to reliably measure that construct.

Split-Half Method

The split-half method assesses the internal consistency of a test, such as psychometric tests and questionnaires.

It measures the extent to which all parts of the test contribute equally to what is being measured.

The split-half approach provides another method of quantifying internal consistency by taking advantage of the natural variation when a single test is divided in half.

It’s somewhat cumbersome to implement but avoids limitations associated with Cronbach’s alpha. However, alpha remains much more widely used in practice due to its relative ease of calculation.

  • A test or questionnaire is split into two halves, typically by separating even-numbered items from odd-numbered items, or first-half items vs. second-half.
  • Each half is scored separately, and the scores are correlated using a statistical method, often Pearson’s correlation.
  • The correlation between the two halves gives an indication of the test’s reliability. A higher correlation suggests better reliability.
  • To adjust for the test’s shortened length (because we’ve split it in half), the Spearman-Brown prophecy formula is often applied to estimate the reliability of the full test based on the split-half reliability.
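Putting the steps above together, here is a minimal sketch of a split-half analysis with the Spearman-Brown adjustment. The response matrix is randomly generated, purely for illustration:

```python
# A minimal sketch of the split-half method with the Spearman-Brown
# adjustment; the response matrix is randomly generated for illustration.
import numpy as np

rng = np.random.default_rng(3)
responses = rng.integers(1, 6, size=(60, 12))  # 60 respondents, 12 items

odd_half = responses[:, 0::2].sum(axis=1)
even_half = responses[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown prophecy formula: estimated reliability of the full-length test
r_full = (2 * r_half) / (1 + r_half)
print(f"Split-half r = {r_half:.2f}, Spearman-Brown adjusted = {r_full:.2f}")
```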

The reliability of a test could be improved by using this method. For example, any items on separate halves of a test with a low correlation (e.g., r = .25) should either be removed or rewritten.

The split-half method is a quick and easy way to establish reliability. However, it can only be effective with large questionnaires in which all questions measure the same construct. This means it would not be appropriate for tests that measure different constructs.

For example, the Minnesota Multiphasic Personality Inventory has subscales measuring different constructs such as depression, schizophrenia, and social introversion. Therefore, the split-half method would not be an appropriate way to assess the reliability of this personality test.

Validity vs. Reliability In Psychology

In psychology, validity and reliability are fundamental concepts that assess the quality of measurements.

  • Validity refers to the degree to which a measure accurately assesses the specific concept, trait, or construct that it claims to be assessing. It refers to the truthfulness of the measure.
  • Reliability refers to the overall consistency, stability, and repeatability of a measurement. It is concerned with how much random error might be distorting scores or introducing unwanted “noise” into the data.

A key difference is that validity refers to what’s being measured, while reliability refers to how consistently it’s being measured.

An unreliable measure cannot be truly valid because if a measure gives inconsistent, unpredictable scores, it clearly isn’t measuring the trait or quality it aims to measure in a truthful, systematic manner. Establishing reliability provides the foundation for determining the measure’s validity.

A pivotal understanding is that reliability is a necessary but not sufficient condition for validity.

It means a test can be reliable, consistently producing the same results, without being valid, or accurately measuring the intended attribute.

However, a valid test, one that truly measures what it purports to, must be reliable. In the pursuit of rigorous psychological research, both validity and reliability are indispensable.

Ideally, researchers strive for high scores on both: validity, to make sure you’re measuring the correct construct, and reliability, to make sure you’re measuring it accurately and precisely. The two qualities are independent, but both are crucial elements of strong measurement procedures.


Beck, A. T., Steer, R. A., & Brown, G. K. (1996). Manual for the Beck Depression Inventory. San Antonio, TX: The Psychological Corporation.

Clifton, J. D. W. (2020). Managing validity versus reliability trade-offs in scale-building decisions. Psychological Methods, 25(3), 259–270. https://doi.org/10.1037/met0000236

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4), 255–282. https://doi.org/10.1007/BF02288892

Hathaway, S. R., & McKinley, J. C. (1943). Manual for the Minnesota Multiphasic Personality Inventory. New York: Psychological Corporation.

Jannarone, R. J., Macera, C. A., & Garrison, C. Z. (1987). Evaluating interrater agreement through “case-control” sampling. Biometrics, 43(2), 433–437. https://doi.org/10.2307/2531825

LeBreton, J. M., & Senter, J. L. (2008). Answers to 20 questions about interrater reliability and interrater agreement. Organizational Research Methods, 11(4), 815–852. https://doi.org/10.1177/1094428106296642

Watkins, M. W., & Pacheco, M. (2000). Interobserver agreement in behavioral research: Importance and calculation. Journal of Behavioral Education, 10, 205–212.


J Family Med Prim Care, 4(3), Jul-Sep 2015

Validity, reliability, and generalizability in qualitative research

Lawrence Leung

1 Department of Family Medicine, Queen's University, Kingston, Ontario, Canada

2 Centre of Studies in Primary Care, Queen's University, Kingston, Ontario, Canada

In general practice, qualitative research contributes as significantly as quantitative research, in particular regarding psycho-social aspects of patient care, health services provision, policy setting, and health administration. In contrast to quantitative research, qualitative research as a whole has been constantly critiqued, if not disparaged, for the lack of consensus on assessing its quality and robustness. This article illustrates with five published studies how qualitative research can impact and reshape the discipline of primary care, spiraling out from clinic-based health screening to community-based disease monitoring, evaluation of out-of-hours triage services to a provincial psychiatric care pathways model and finally, national legislation of core measures for children’s healthcare insurance. Fundamental concepts of validity, reliability, and generalizability as applicable to qualitative research are then addressed with an update on the current views and controversies.

Nature of Qualitative Research versus Quantitative Research

The essence of qualitative research is to make sense of and recognize patterns among words in order to build up a meaningful picture without compromising its richness and dimensionality. Like quantitative research, qualitative research aims to seek answers to questions of “how, where, when, who and why” with a perspective to build a theory or refute an existing theory. Unlike quantitative research, which deals primarily with numerical data and their statistical interpretations under a reductionist, logical and strictly objective paradigm, qualitative research handles nonnumerical information and its phenomenological interpretation, which inextricably tie in with human senses and subjectivity. While human emotions and perspectives from both subjects and researchers are considered undesirable biases confounding results in quantitative research, the same elements are considered essential and inevitable, if not treasurable, in qualitative research as they invariably add extra dimensions and colors to enrich the corpus of findings. However, the issue of subjectivity and contextual ramifications has fueled incessant controversies regarding yardsticks for quality and trustworthiness of qualitative research results for healthcare.

Impact of Qualitative Research upon Primary Care

In many ways, qualitative research contributes significantly, if not more so than quantitative research, to the field of primary care at various levels. Five qualitative studies are chosen to illustrate how various methodologies of qualitative research helped in advancing primary healthcare, from novel monitoring of chronic obstructive pulmonary disease (COPD) via mobile-health technology,[ 1 ] informed decision-making for colorectal cancer screening,[ 2 ] triaging out-of-hours GP services,[ 3 ] evaluating care pathways for community psychiatry[ 4 ] and finally prioritization of healthcare initiatives for legislation purposes at national levels.[ 5 ]

With the recent advances of information technology and mobile connecting devices, self-monitoring and management of chronic diseases via tele-health technology may seem beneficial to both the patient and healthcare provider. Recruiting COPD patients who were given tele-health devices that monitored lung functions, Williams et al. [ 1 ] conducted phone interviews and analyzed the transcripts via a grounded theory approach; the themes they identified enabled them to conclude that such a mobile-health setup helped to engage patients, with better adherence to treatment and overall improvement in mood. Such positive findings were in contrast to previous studies, which opined that elderly patients were often challenged by operating computer tablets[ 6 ] or conversing with the tele-health software.[ 7 ]

To explore the content of recommendations for colorectal cancer screening given out by family physicians, Wackerbarth, et al. [ 2 ] conducted semi-structured interviews with subsequent content analysis and found that most physicians delivered information to enrich patient knowledge with little regard to patients’ true understanding, ideas, and preferences in the matter. These findings suggested room for improvement for family physicians to better engage their patients in recommending preventative care.

Faced with various models of out-of-hours triage services for GP consultations, Egbunike et al. [ 3 ] conducted thematic analysis on semi-structured telephone interviews with patients and doctors in various urban, rural and mixed settings. They found that the efficiency of triage services remained a prime concern for both users and providers, among issues of access to doctors and unfulfilled or mismatched expectations from users, which could arouse dissatisfaction and have legal implications.

In the UK, a care pathways model for community psychiatry had been introduced, but its benefits were unclear. Khandaker et al. [ 4 ] hence conducted a qualitative study using semi-structured interviews with medical staff and other stakeholders; adopting a grounded-theory approach, major themes emerged which included improved equality of access, more focused logistics, increased work throughput and better accountability for community psychiatry provided under the care pathway model.

Finally, at the US national level, Mangione-Smith et al. [ 5 ] employed a modified Delphi method to gather consensus from a panel of nominators who were recognized experts and stakeholders in their disciplines, and identified a core set of quality measures for children’s healthcare under the Medicaid and Children’s Health Insurance Program. These core measures were made transparent for public opinion and later passed on for full legislation, illustrating the impact of qualitative research upon social welfare and policy improvement.

Overall Criteria for Quality in Qualitative Research

Given the diverse genera and forms of qualitative research, there is no single consensus for assessing any piece of qualitative research work. Various approaches have been suggested, the two leading schools of thought being that of Dixon-Woods et al. [ 8 ], which emphasizes methodology, and that of Lincoln et al. [ 9 ], which stresses rigor in the interpretation of results. By identifying commonalities across qualitative research, Dixon-Woods produced a checklist of questions for assessing the clarity and appropriateness of the research question; the description of, and justification for, sampling, data collection and data analysis; the level of support and evidence for claims; coherence between data, interpretation and conclusions; and, finally, the level of contribution of the paper. These criteria informed the 10 questions of the Critical Appraisal Skills Programme checklist for qualitative studies.[ 10 ] However, such methodology-weighted criteria may not do justice to qualitative studies that differ in epistemological and philosophical paradigms,[ 11 , 12 ] one classic example being positivist versus interpretivist.[ 13 ] Equally, without a robust methodological layout, the rigorous interpretation of results advocated by Lincoln et al. [ 9 ] cannot stand on its own. Meyrick[ 14 ] argued from a different angle and proposed fulfillment of the dual core criteria of “transparency” and “systematicity” for good-quality qualitative research. In brief, every step of the research logistics (from theory formation, study design, sampling, data acquisition and analysis through to results and conclusions) has to be checked for whether it is sufficiently transparent and systematic. In this manner, both the research process and its results can be assured of rigor and robustness.[ 14 ] Finally, Kitto et al. [ 15 ] distilled six criteria for assessing the overall quality of qualitative research: (i) clarification and justification, (ii) procedural rigor, (iii) sample representativeness, (iv) interpretative rigor, (v) reflexive and evaluative rigor and (vi) transferability/generalizability, which also double as evaluative landmarks for manuscript review at the Medical Journal of Australia. As with quantitative research, the quality of qualitative research can be assessed in terms of validity, reliability, and generalizability.

Validity in qualitative research means the “appropriateness” of the tools, processes, and data: whether the research question is valid for the desired outcome, the choice of methodology is appropriate for answering the research question, the design is valid for the methodology, the sampling and data analysis are appropriate, and, finally, the results and conclusions are valid for the sample and context. In assessing the validity of qualitative research, the challenge can start from the ontology and epistemology of the issue being studied; for example, the concept of the “individual” is seen differently by humanistic and positive psychologists due to differing philosophical perspectives:[ 16 ] where humanistic psychologists believe the “individual” is a product of existential awareness and social interaction, positive psychologists think the “individual” exists side-by-side with the formation of any human being. Setting off along different pathways, qualitative research on the individual's wellbeing will therefore reach conclusions of varying validity. The choice of methodology must enable detection of the findings or phenomena in the appropriate context for it to be valid, with due regard to cultural and contextual variables. For sampling, procedures and methods must be appropriate for the research paradigm and distinguish between systematic,[ 17 ] purposeful[ 18 ] and theoretical (adaptive) sampling,[ 19 , 20 ] where systematic sampling has no a priori theory, purposeful sampling often has a certain aim or framework, and theoretical sampling is molded by the ongoing process of data collection and the evolving theory. For data extraction and analysis, several methods can be adopted to enhance validity, including first-tier triangulation (of researchers) and second-tier triangulation (of resources and theories),[ 17 , 21 ] a well-documented audit trail of materials and processes,[ 22 , 23 , 24 ] concept- or case-oriented multidimensional analysis[ 25 , 26 ] and respondent verification.[ 21 , 27 ]

Reliability

In quantitative research, reliability refers to the exact replicability of processes and results. In qualitative research, with its diverse paradigms, such a definition of reliability is challenging and epistemologically counter-intuitive. Hence, the essence of reliability for qualitative research lies in consistency.[ 24 , 28 ] A margin of variability in results is tolerated in qualitative research provided the methodology and epistemological logistics consistently yield data that are ontologically similar but may differ in richness and ambience within similar dimensions. Silverman[ 29 ] proposed five approaches for enhancing the reliability of process and results: refutational analysis, constant data comparison, comprehensive data use, inclusion of deviant cases, and use of tables. As data are extracted from the original sources, researchers must verify their accuracy in terms of form and context with constant comparison,[ 27 ] either alone or with peers (a form of triangulation).[ 30 ] The scope and analysis of the data included should be as comprehensive and inclusive as possible, with reference to quantitative aspects where feasible.[ 30 ] Adopting the Popperian dictum of falsifiability as the essence of truth and science, attempts to refute the qualitative data and analyses should also be made to assess reliability.[ 31 ]

Generalizability

Most qualitative research studies, if not all, are meant to study a specific issue or phenomenon in a certain population or ethnic group, in a focused locality and particular context; hence, generalizability of qualitative research findings is usually not an expected attribute. However, with the rising trend of knowledge synthesis from qualitative research via meta-synthesis, meta-narrative or meta-ethnography, the evaluation of generalizability becomes pertinent. A pragmatic approach to assessing the generalizability of qualitative studies is to adopt the same criteria as for validity: that is, use of systematic sampling, triangulation and constant comparison, proper audit and documentation, and multi-dimensional theory.[ 17 ] However, some researchers espouse the approach of analytical generalization,[ 32 ] where one judges the extent to which the findings of one study can be generalized to another under a similar theoretical framework, or the proximal similarity model, where the generalizability of one study to another is judged by similarities in time, place, people and other social contexts.[ 33 ] That said, Zimmer[ 34 ] questioned the suitability of meta-synthesis in view of the basic tenets of grounded theory,[ 35 ] phenomenology[ 36 ] and ethnography.[ 37 ] He concluded that any valid meta-synthesis must retain the two further goals of theory development and higher-level abstraction while in search of generalizability, and must be executed as a third-level interpretation using Gadamer's concepts of the hermeneutic circle,[ 38 , 39 ] dialogic process[ 38 ] and fusion of horizons.[ 39 ] Finally, Toye et al. [ 40 ] reported the practicality of using “conceptual clarity” and “interpretative rigor” as intuitive criteria for assessing quality in meta-ethnography, which somewhat echoes Rolfe's controversial aesthetic theory of research reports.[ 41 ]

Food for Thought

Despite various measures to enhance or ensure the quality of qualitative studies, some researchers have opined, from a purist ontological and epistemological angle, that qualitative research is not a unified field but an ipso facto diverse one,[ 8 ] and hence that any attempt to synthesize or appraise different studies under one system is impossible and conceptually wrong. Barbour argued from a philosophical angle that these special measures or “technical fixes” (like purposive sampling, multiple coding, triangulation, and respondent validation) can never confer the rigor they are conceived to provide.[ 11 ] In extremis, Rolfe et al., writing from the field of nursing research, opined that any set of formal criteria used to judge the quality of qualitative research is futile and without validity, and suggested that a qualitative report should be judged by the form in which it is written (aesthetic) and not by its contents (epistemic).[ 41 ] Rolfe's novel view was rebutted by Porter,[ 42 ] who argued via logical premises that two of Rolfe's fundamental statements were flawed: (i) that “the content of research reports is determined by their form” may not be a fact, and (ii) that research appraisal being “subject to individual judgment based on insight and experience” would mean that those without sufficient experience of performing research would be unable to judge adequately, which amounts to an elitist principle. From a realist standpoint, Porter then proposed multiple and open approaches to validity in qualitative research that incorporate parallel perspectives[ 43 , 44 ] and a diversification of meanings.[ 44 ] Reading any work of qualitative research is always a two-way interactive process, such that validity and quality have to be judged at the receiving end too, and not by the researcher alone.

In summary, the three gold criteria of validity, reliability and generalizability apply in principle to assessing the quality of both quantitative and qualitative research; what differs is the nature and type of processes that ontologically and epistemologically distinguish the two.

Source of Support: Nil.

Conflict of Interest: None declared.

Reliability and validity: Importance in Medical Research

Affiliations: Al-Nafees Medical College, Isra University, Islamabad, Pakistan; Fauji Foundation Hospital, Foundation University Medical College, Islamabad, Pakistan.

PMID: 34974579 | DOI: 10.47391/JPMA.06-861

Reliability and validity are among the most important and fundamental domains in the assessment of any measuring methodology for data collection in good research. Validity is about what an instrument measures and how well it does so, whereas reliability concerns the truthfulness of the data obtained and the degree to which any measuring tool controls random error. The current narrative review was planned to discuss the importance of the reliability and validity of data-collection or measurement techniques used in research. It comprehensively describes and explores the reliability and validity of research instruments and discusses different forms of reliability and validity with concise examples. An attempt has been made to give a brief literature review of the significance of reliability and validity in the medical sciences.

Keywords: Validity, Reliability, Medical research, Methodology, Assessment, Research tools.


Psychologenie

The Concepts of Reliability and Validity Explained With Examples

All research is conducted via the use of scientific tests and measures, which yield certain observations and data. But for these data to be of any use, the tests must possess certain properties, like reliability and validity, that ensure unbiased, accurate, and authentic results. This PsycholoGenie post explores these properties and explains them with the help of examples.


Reliability and validity are key concepts in the field of psychometrics , which is the study of theories and techniques involved in psychological measurement or assessment.

The science of psychometrics forms the basis of psychological testing and assessment, which involves obtaining an objective and standardized measure of the behavior and personality of the individual test taker. It is an exhaustive process that examines and measures all aspects of an individual’s identity. The data obtained via this process are then interlinked and integrated to form a rounded profile of the individual. Such profiles are often created in day-to-day life by various professionals; e.g., doctors create medical and lifestyle profiles of patients in order to diagnose and treat health disorders, if any. Career counselors employ a similar approach to identify the field most suited to an individual. Such profiles are also constructed in courts to lend context and justification to legal cases, so that they can be resolved quickly, judiciously, and efficiently. However, to be able to formulate accurate profiles, the method of assessment employed must be accurate, unbiased, and relatively error-free. To ensure these qualities, each method or technique must possess certain essential properties.

◉ Standardization – All testing must be conducted under consistent and uniform parameters to avoid introducing any erroneous variation into test results.

◉ Objectivity – The evaluation of the test must be carried out in an objective manner, such that no bias, either of the examiner or of the examinee, is introduced or reflected in the obtained data.

◉ Test Norms – Each test must be designed in such a way that the results can be interpreted in a relative manner, i.e., it must establish a frame of reference or a point of comparison to compare the attributes of two or more individuals in a common setting.

◉ Reliability – The test must yield the same result each time it is administered to a particular entity or individual, i.e., the test results must be consistent.

◉ Validity – The test being conducted should produce the data it intends to measure, i.e., the results must satisfy and be in accordance with the objectives of the test.

Concept of Reliability

It refers to the consistency and reproducibility of data produced by a given method, technique, or experiment. A form of assessment is said to be reliable if it repeatedly produces stable and similar results under consistent conditions. Consistency is partly ensured if the attribute being measured is stable and does not change suddenly. However, errors may be introduced by factors such as the physical and mental state of the examinee, inadequate attention, distractedness, responses to visual and sensory stimuli in the environment, etc. When estimating the reliability of a measure, the examiner must be able to demarcate and differentiate between the errors produced as a result of inefficient measurement and the actual variability of the true score. A true score is that subset of measured data that would recur consistently across various instances of testing in the absence of errors. Hence, the general score produced by a test is a composite of the true score and the errors of measurement.
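
To make this true-score-plus-error decomposition concrete, here is a minimal Python simulation (all distributions are invented for illustration; the “true” attribute is given SD 15 and the noise SD 5). Reliability then falls out as the share of observed-score variance contributed by the true scores:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000
true_scores = rng.normal(loc=100, scale=15, size=n)  # the stable attribute
errors = rng.normal(loc=0, scale=5, size=n)          # random measurement error
observed = true_scores + errors                      # observed = true + error

# Reliability: proportion of observed variance due to true-score variance.
reliability = true_scores.var() / observed.var()
print(f"reliability ~ {reliability:.2f}")  # ~ 15^2 / (15^2 + 5^2) = 0.90
```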

Types of Reliability

Test-retest reliability.

It is a measure of the consistency of test results when the test is administered to the same individual twice, where both instances are separated by a specific period of time, using the same testing instruments and conditions. The two scores are then evaluated to determine the true score and the stability of the test.

This type is used in the case of attributes that are not expected to change within the given time period. This works for measuring physical entities, but in the case of psychological constructs it exhibits a few drawbacks that may introduce errors into the score. Firstly, the quality being studied may have undergone a change between the two instances of testing. Secondly, the experience of taking the test once could alter the way the examinee performs the second time. And lastly, if the time interval between the two tests is not sufficient, the individual might give different answers based on the memory of their previous attempt.
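
In practice, test-retest reliability is usually estimated as the correlation between the two administrations. A minimal sketch, using invented scores for the same eight people tested twice:

```python
import numpy as np

# Hypothetical scores for the same eight people, tested two weeks apart.
time1 = np.array([12, 18, 25, 31, 22, 15, 28, 19])
time2 = np.array([14, 17, 27, 30, 21, 16, 29, 18])

# The test-retest coefficient is the correlation between administrations.
r = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest r = {r:.2f}")  # values near 1.0 suggest stable scores
```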

Medical monitoring

Medical monitoring of “critical” patients works on this principle, since the patient’s vital statistics are compared and correlated over specific time intervals in order to determine whether the patient’s health is improving or deteriorating; depending on the outcome, the medication and treatment of the patient are adjusted.

Parallel-forms Reliability

It measures reliability by either administering two similar forms of the same test, or conducting the same test in two similar settings. Despite the variability, both versions must focus on the same aspect of skill, personality, or intelligence of the individual. The two scores obtained are compared and correlated to determine if the results show consistency despite the introduction of alternate versions of environment or test. However, this leads to the question of whether the two similar but alternate forms are actually equivalent or not.

Test

If the problem-solving skills of an individual are being tested, one could generate a large set of suitable questions that can then be separated into two groups with the same level of difficulty, and then administered as two different tests. The comparison of the scores from both tests would help in eliminating errors, if any.

Inter-rater Reliability

It measures the consistency of the scoring conducted by the evaluators of a test. It is important because not all individuals will perceive and interpret answers in the same way; the deemed accuracy of the answers will vary according to the person evaluating them. Checking inter-rater reliability helps in refining and eliminating errors that may be introduced by the subjectivity of the evaluator. If a majority of the evaluators are in agreement with regard to the answers, the test is accepted as reliable; if there is no consensus between the judges, the test is not reliable and has failed to actually test the desired quality. However, the judging of the test should be carried out without the influence of any personal bias; in other words, the judges should not agree or disagree with one another based on their personal perceptions of each other.

Job interview

This is often put into practice in the form of a panel of accomplished professionals, and can be witnessed in various contexts, such as the judging of a beauty pageant, a job interview, or a scientific symposium.
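
One common statistic for quantifying agreement between two raters is Cohen’s kappa, which corrects raw agreement for the agreement expected by chance. The sketch below hand-rolls it on invented pass/fail judgments from a hypothetical two-person interview panel:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical judgments from two interview panelists on six candidates.
panelist_1 = ["pass", "pass", "fail", "pass", "fail", "pass"]
panelist_2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(f"kappa = {cohens_kappa(panelist_1, panelist_2):.2f}")  # ~ 0.67
```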

Internal Consistency Reliability

It refers to the ability of different parts of a test to probe the same aspect or construct of an individual. If two similar questions are posed to the examinee, the generation of similar answers implies that the test shows internal consistency; if the answers are dissimilar, the test is not consistent and needs to be refined. It is a statistical approach to determining reliability, and it is of two types, both of which are computed in the sketch after the two descriptions below.

► Average Inter-item Correlation

It considers all the questions that probe the same construct, segregates them into individual pairs, and then calculates the correlation coefficient of each pair of questions. Finally, the average of all the correlation coefficients is calculated to yield the final value of the average inter-item correlation. In other words, it ascertains the correlation between every pair of questions in the test.

► Split-half Reliability

It splits the questions that probe the same construct into two sets of equal size, and the data obtained from both sets are compared and matched in order to determine the correlation, if any, between the two sets of data.
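
The sketch below computes both statistics on a small invented response matrix (rows are respondents, columns are items). Note that the split-half correlation is commonly stepped up with the Spearman-Brown correction to estimate full-length reliability, an adjustment not spelled out above:

```python
import numpy as np

# Hypothetical responses of six people to four items probing one construct,
# each item scored 1-5.
items = np.array([
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
    [4, 4, 4, 5],
])

# Average inter-item correlation: mean of the off-diagonal correlations.
r_matrix = np.corrcoef(items, rowvar=False)
pairwise = r_matrix[np.triu_indices_from(r_matrix, k=1)]
print(f"average inter-item r = {pairwise.mean():.2f}")

# Split-half: correlate the two half-test totals (odd vs even items),
# then apply the Spearman-Brown correction for the full test length.
half1, half2 = items[:, ::2].sum(axis=1), items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(half1, half2)[0, 1]
print(f"split-half reliability = {2 * r_half / (1 + r_half):.2f}")
```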

Concept of Validity

It refers to the ability of a test to measure data that satisfy and support the objectives of the test, and to the extent to which the concept applies in the real world rather than just in an experimental setup. With respect to psychometrics, it is known as test validity and can be described as the degree to which the evidence supports a given theory. It is important since it helps researchers determine which test to implement in order to develop a measure that is ethical, efficient, cost-effective, and one that truly probes and measures the construct in question. Other non-psychological forms of validity include experimental validity and diagnostic validity. Experimental validity refers to whether a test is supported by statistical evidence and whether the test or theory has any real-life application. Diagnostic validity, on the other hand, belongs to the context of clinical medicine, and refers to the validity of diagnostic and screening tests.

Types of Validity

Construct validity.


It refers to the ability of a test to measure the construct or quality that it claims to measure, i.e., if a test claims to test intelligence, it is valid if it truly tests the intelligence of the individual. It involves conducting a statistical analysis of the internal structure of the test and its examination by a panel of experts to determine the suitability of each question. It also studies the relationship between responses to the test questions and the ability of the individual to comprehend the questions and provide apt answers. For example, suppose a test is prepared with the intention of testing a student’s subject knowledge of science, but the language used to present the problems is highly sophisticated and difficult to comprehend. In such a case, the test, instead of gauging the knowledge, ends up testing language proficiency, and hence is not a valid construct for measuring the student’s subject knowledge.

► Convergent Validity

This type of construct validity measures the degree to which two hypothetically related concepts are actually related in real life. For example, if a test designed to show that the emotions of joy, bliss, and happiness are correlated demonstrates that correlation with supporting data, then the test is said to possess convergent validity.

► Discriminant Validity

It is a measure of the degree to which two hypothetically unrelated concepts are actually unrelated in real life (evidenced by observed data). For example , if a certain test is designed to prove that happiness and despair are unrelated, and this is proved by the data obtained by conducting the test, then the test is said to have discriminant validity.

Content Validity


It is a non-statistical form of validity that involves examining the content of the test to determine whether it equally probes and measures all aspects of the given domain, i.e., if a specific domain has 4 subtypes, then an equal number of test questions must probe each of those 4 subtypes with equal intensity. This type of validity has to be taken into account while formulating the test itself, after conducting a thorough study of the construct to be measured. For example, if a test is designed to assess learning in the biology department, then that test must cover all aspects of the discipline, including its various branches like zoology, botany, microbiology, biotechnology, genetics, ecology, etc.

► Representation Validity

It is also known as translation validity, and refers to the degree to which an abstract theoretical concept can be translated and implemented as a practical testable construct. For example , if one were to design a test to determine if comatose patients could communicate via some form of signals and if the test worked and produced appropriate supportive results, then the test would have representation validity.

► Face Validity

It is an estimate of whether a particular test appears to measure a construct. It in no way implies that the test actually measures the construct; it merely appears to do so. For example, if a test appears to be measuring what it is supposed to, it has high face validity; if it doesn’t, it has low face validity. It is the least sophisticated form of validity and is also known as surface validity. Hence, if an intelligence test appears, to an observing evaluator, to be testing the intelligence of individuals, the test possesses face validity.

Criterion-related Validity


It measures the correlation between the outcomes of a test for a construct and the outcomes of pre-established tests that examine the individual criteria forming the overall construct. In other words, if a given construct has 3 criteria, the outcomes of the test are correlated with the outcomes of tests for each individual criterion that are already established as valid. For example, if a company conducts an IQ test on a job applicant and matches it with his or her past academic record, any correlation observed will be an example of criterion-related validity. Depending on the timing of the correlation, this validity is of two types.

► Concurrent Validity

It refers to the degree to which the results of a test correlate well with the results obtained from a related test that has already been validated. The two tests are taken at the same time, so they provide a correlation between events on the same temporal plane (the present). For example, if a batch of students is given an evaluative test, and on the same day their teachers are asked to rate each of those students, any correlation observed between the two sets of results demonstrates concurrent validity.

► Predictive Validity


It refers to the degree to which the results of a test correlate with the results of a related test administered sometime in the future. The time gap between the administration of the two tests allows the correlation to possess a predictive quality. For example, if an evaluative test that claims to test the intelligence of students is administered, and the students with high scores go on to achieve academic success while the ones with low scores do not do well academically, the test is said to possess predictive validity.

Although both concepts are essential for accurate psychological assessment, they are not interdependent: a test may be reliable without being valid, and vice versa. This is explained by considering the example of a weighing machine. If one puts a weight of 500 g on the machine and it shows any value other than 500 g, it is not a valid measure. However, it may still be considered reliable if, each time the weight is put on, the machine shows the same reading of, say, 250 g. Hence, in terms of measurement, validity describes accuracy, whereas reliability describes precision.
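
The weighing-machine example is easy to simulate. In this sketch (all numbers invented), a consistently miscalibrated scale produces tightly clustered readings, so it is reliable (precise) yet invalid (inaccurate):

```python
import numpy as np

rng = np.random.default_rng(1)
true_weight = 500.0  # grams actually placed on the scale

# A miscalibrated but consistent scale: readings cluster tightly around 250 g.
readings = rng.normal(loc=250.0, scale=0.5, size=10)

print(f"mean reading = {readings.mean():.1f} g (validity: should be {true_weight:.0f})")
print(f"spread (SD)  = {readings.std():.1f} g (reliability: small spread = consistent)")
```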

In the case of a test that is valid but unreliable, implementation of classical test theory provides the examiner or researcher with options and ways to improve the reliability of that test.


Open access | Published: 11 April 2024

The Fremantle Back Awareness Questionnaire: cross-cultural adaptation, reliability, and validity of the Italian version in people with chronic low back pain

Marco Monticone, Carolina Maurandi, Elisa Porcu, Federico Arippa, Benedict M. Wand & Giorgio Corona

BMC Musculoskeletal Disorders, volume 25, article number 279 (2024)


Background and aim

There is evidence to suggest that assessing back-specific altered self-perception may be useful when seeking to understand and manage low back pain (LBP). The Fremantle Back Awareness Questionnaire (FreBAQ) is a patient-reported measure of back-specific body perception that has never been adapted and psychometrically analysed in Italian. Hence, the objectives of this research were to cross-culturally adapt and validate the Italian version of this outcome measure (namely, the FreBAQ-I), to make it available for use with Italians suffering from chronic LBP.

The FreBAQ-I was developed by forward and backward translation, review by a committee skilled in patient-reported measures and test of the pre-final version to assess its clarity, acceptability, and relevance. The statistical analyses examined: structural validity based on Rasch analysis; hypotheses testing by investigating correlations of the FreBAQ-I with the Roland Morris Disability Questionnaire (RMDQ), a pain intensity numerical rating scale (PI-NRS), the Pain Catastrophising Scale (PCS), and the Tampa Scale of Kinesiophobia (TSK) (Pearson’s correlations); reliability by internal consistency (Cronbach’s alpha) and test–retest repeatability (intraclass correlation coefficient, ICC (2,1)); and measurement error by determining the minimum detectable change (MDC). After the development of a consensus-based translation of the FreBAQ-I, the new outcome measure was delivered to 100 people with chronic LBP.

Rasch analysis confirmed the substantial unidimensionality and the structural validity of the FreBAQ-I. Hypothesis testing was considered good as at least 75% of the hypotheses were confirmed; correlations: RMDQ ( r  = 0.35), PI-NRS ( r  = 0.25), PCS ( r  = 0.41) and TSK ( r  = 0.38). Internal consistency was acceptable (alpha = 0.82) and test–retest repeatability was excellent (ICC (2,1) = 0.88, 95% CI: 0.83, 0.92). The MDC 95 corresponded to 6.7 scale points.

The FreBAQ-I was found to be a unidimensional, valid, and reliable outcome measure in Italians with chronic LBP. Its application is advised for clinical and research use within the Italian speaking community.


Low back pain (LBP) is a very common condition with a highly variable course[ 1 ]. Most episodes improve considerably within 6 weeks [ 2 ]; however, about two-thirds of persons still report some pain at 3 and 12 months, leading to chronic ill health[ 3 , 4 ]. Chronic LBP has a wide range of deleterious effects on the individual, limiting functional capacity, work participation, and social engagement, as well as negatively impacting personal relationships, and mental and physical health [ 2 ].

An extensive research effort over many years has suggested multiple factors that might impact the chronic LBP experience, including changes in the way the back is perceived or experienced by the individual. Previous studies have reported that people with chronic LBP represent the back differently when asked to draw how it feels to them [ 5 ], have reduced tactile acuity [ 6 ], deficits in proprioception [ 7 ], reduced motor-imagery ability [ 8 ], and changes in tactile processing similar to the spatial neglect seen following cerebrovascular accidents [ 9 ].

Based on these premises, the Fremantle Back Awareness Questionnaire (FreBAQ) –a self-report questionnaire designed to assess back-specific body perception– was specifically developed for persons with chronic LBP [ 10 ]. The questionnaire was shown to be feasible, reliable, and valid, by means of associations with measures of pain duration, pain intensity, disability, and pain catastrophising [ 10 ]. Further, a later study on chronic LBP demonstrated the FreBAQ’s unidimensionality and acceptable internal consistency, as well as offering further support for the relationship between FreBAQ scores and clinical status, including measures of fear-avoidance and psychological distress [ 11 ].

The FreBAQ represents a helpful tool for assessing warning signs of compromised self-perception of the lower back in people with chronic LBP [ 10 , 11 ] and in women with lumbopelvic pain during pregnancy and postpartum [ 12 ]. This questionnaire seems a promising instrument for identifying additional factors involved in the persistence of back problems, and could serve to guide targeted treatment strategies [ 10 , 11 , 13 ]. Indeed, preliminary studies suggest that treatment programs aimed at improving disturbed body perception (through sensorimotor retraining) may have positive effects on pain and function in individuals with non-specific LBP [ 14 , 15 ].

However, the quality of a patient-reported outcome measure (PROM) may differ noticeably when it is cross-culturally adapted and used in a country different from the one where it was initially developed [ 16 ]. Well-established methodological criteria are recommended when validation studies are performed [ 17 ]. Application of these criteria is designed to ensure the quality of the new measurement tool and to permit more confident comparison of findings across populations.

The FreBAQ has previously been adapted and psychometrically examined in Japanese, Dutch, German, Turkish, Chinese, Indian, Spanish and Persian populations [ 13 , 18 , 19 , 20 , 21 , 22 , 23 , 24 ], but an Italian version (FreBAQ-I) has not yet been cross-culturally adapted and psychometrically analysed in a similar population. Therefore, the purpose of this research was to develop a FreBAQ-I and examine its psychometric properties in Italians suffering from chronic LBP.

The COSMIN (COnsensus-based Standards for the selection of health status Measurement Instruments) guidelines were adopted [ 16 ].

The Institutional Review Board endorsed this cross-sectional study (no. 7/16, April 5th 2016). The research was carried out in accordance with the ethical and humane principles of research specified in the Declaration of Helsinki.

Participants

This research engaged people attending an outpatient hospital rehabilitation unit who met the following inclusion criteria: a diagnosis of chronic non-specific LBP (i.e. pain localised below the costal margin and above the inferior gluteal folds, lasting for more than three months, without a distinguishable, specific, patho-anatomical cause or disease [ 25 ]); speaking Italian as a first language (or having an adequate knowledge of Italian); and being aged over 18. Exclusion criteria were: acute (lasting up to one month) and subacute non-specific LBP (lasting up to three months); specific LBP (i.e. fracture, spinal deformity, disc herniation, canal stenosis, spondylolisthesis, or infections); peripheral or central neurological disorders assessed by means of imaging (e.g. radiographs, CT scans, or MRI) and/or anamnesis; systemic illness (including rheumatologic diseases); cognitive disorders (Mini Mental State Examination < 24); recent myocardial infarction; any past cerebrovascular accident; and inability or unwillingness to give informed consent.

Participants were assessed by two physical and rehabilitation medicine physicians, who were under the supervision of the principal investigator (MM). Both physicians had at least fifteen years’ experience and were involved in assessing the participants during the research process but not in the treatment procedure. Those who satisfied the inclusion criteria were provided with information about the research aims and procedures and invited to sign a written informed consent form. After that, demographic and clinical characteristics were collected, and all participants completed the outcome measures listed below. Participants were invited to fill in the FreBAQ-I a second time, 7–10 days after their initial assessment, an interval chosen to limit both genuine variation in symptoms and possible memory effects [ 26 ].

Cross-cultural adaptation

The Italian translation and adaptation of the FreBAQ was performed following the protocol recommended by the American Association of Orthopaedic Surgeons’ Outcomes Committee and according to the standards for good practice in the translation and cultural adaptation of PROMs [ 27 , 28 ].

Step 1: Italian translation

The original FreBAQ [ 10 ] was independently translated into Italian by two bilingual professionals with distinct backgrounds and some experience in the PROM field. They strove to select terms capturing the connotative meanings of the source text and –at the same time– reflecting everyday-spoken language. Conflicts between translations were examined by the principal investigator and then settled by consensus, in order to consolidate a preliminary Italian version.

Step 2: Back-translation Into English

Two native English-speaking bilingual professional translators separately back-translated the preliminary adaptation. Then, the principal investigator and the translators checked and clarified potential inconsistencies between the different versions and agreed on an advanced Italian version of the questionnaire.

Step 3: Expert Committee

This advanced version was submitted to a team of six bilingual clinicians, methodologists, and translators. They investigated the idiomatic, semantic, and theoretical similarity of items and response categories. This phase finished when a consensus was reached on a prefinal version.

Step 4: Test of the prefinal version

A pilot test was then performed to explore the intelligibility, appropriateness, cultural relevance, and potential ambiguity of the prefinal version. Cognitive interviews were performed by a qualified psychologist after administering the tool to 10 persons with chronic LBP, representative of the target population. The team of experts examined the findings of this test in order to detect any useful refinements and then agreed on the final version of the FreBAQ-I. These deliberations are available from the corresponding author on request.

Acceptability and feasibility

Participants were interviewed about the comprehensibility of each part of the questionnaire, and the data were checked for missing or multiple answers. The time taken to complete each questionnaire was recorded (Fig.  1 ).

Figure 1. The study procedure.

Psychometric properties

Construct validity.

It represents the degree to which the scores of a measurement instrument are consistent with hypotheses regarding internal relationships, relationships with scores of other instruments, or differences between relevant groups [ 29 ]; it was assessed through structural validity and hypothesis testing.

1. Structural validity (i.e. the degree to which the scores of a measurement instrument are an adequate reflection of the dimensionality of the construct to be measured [ 29 ]) . Rasch analysis (Winsteps software v. 4.8.0) examined the FreBAQ-I using the rating scale model (because all items shared the same rating scale structure). Our detailed iterative procedure has been reported in previous studies [ 30 , 31 ]. In short, the following psychometric issues were investigated:

a) diagnostic assessment of the rating categories, by investigating whether each response category was being used consistently and effectively; for that, the transition thresholds between categories (i.e. the points where two adjacent categories have an equal probability to be endorsed) and average category measures should be ordered from less to more on the underlying latent continuum [ 32 ];

b) internal construct validity, assessed checking how well the observed responses to the items align with the responses predicted by the Rasch model, using chi-square fit statistics (infit and outfit mean-square statistics, MnSq). Based on the sample size, values from 0.75 to 1.30 [ 33 ] were considered as indicating an acceptable fit;

c) reliability, in terms of both person reliability index and item reliability index [ 32 ], providing an estimate of the degree of replicability (across different samples) of person and item placements along the trait continuum (range 0–1; coefficients > 0.80 are considered as good, > 0.90 as excellent). High reliability levels (of persons or items) mean that there is a high probability that persons (or items) estimated with high Rasch measures actually do have higher measures than persons (or items) estimated with low measures [ 33 ].

d) unidimensionality of the scale, examining the unexplained variance after the Rasch dimension is extracted, as obtained by a Principal Component Analysis of the residuals (PCAr). Additional factors are not likely to be present in the residuals if the eigenvalue of the first residual component is < 2 [ 33 ];

e) local item dependence. For any pair of items, no residual correlation > 0.20 (above the average observed residual correlation) should be detected once the variable under measurement (Rasch factor) has been filtered out [ 34 ].

2. Hypothesis testing , which takes place when hypotheses are formulated a priori about the relationships of scores on the instrument under investigation with scores from other measures evaluating related or dissimilar constructs, including the expected direction (i.e. positive or negative) and magnitude (i.e. low, moderate, large) of the relationships [ 29 ]. Based on a previous study on the same matter [ 11 ], it was hypothesized a priori that the FreBAQ-I would achieve positive moderate correlations (from 0.30 to 0.60) with measures of disability and catastrophizing, and low correlations (< 0.30) with measures of pain intensity and kinesiophobia. Pearson’s correlation coefficients were calculated, and construct validity was considered satisfactory if at least 75% of the hypotheses were confirmed [ 26 ].
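
As a rough illustration of this a priori procedure (not the study’s actual data or code), the sketch below simulates questionnaire and comparator scores with invented coupling strengths and checks each Pearson correlation against its hypothesized range:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n = 500
frebaq = rng.normal(10, 7, n)  # stand-in questionnaire totals

# Comparator scores loosely coupled to the FreBAQ scores (coefficients invented).
comparators = {
    "RMDQ":   0.40 * frebaq + rng.normal(0, 7, n),
    "PCS":    0.45 * frebaq + rng.normal(0, 7, n),
    "PI-NRS": 0.20 * frebaq + rng.normal(0, 7, n),
    "TSK":    0.40 * frebaq + rng.normal(0, 7, n),
}
# A priori hypotheses: moderate (0.30-0.60) for RMDQ and PCS,
# low (< 0.30) for PI-NRS and TSK.
ranges = {"RMDQ": (0.30, 0.60), "PCS": (0.30, 0.60),
          "PI-NRS": (0.00, 0.30), "TSK": (0.00, 0.30)}

confirmed = 0
for name, scores in comparators.items():
    r, _ = pearsonr(frebaq, scores)
    lo, hi = ranges[name]
    ok = lo <= r < hi
    confirmed += ok
    print(f"{name}: r = {r:.2f} ({'confirmed' if ok else 'not confirmed'})")

print(f"satisfactory: {confirmed / len(ranges) >= 0.75}")
```

With these synthetic coupling strengths, roughly three of the four hypotheses come out confirmed, which is the kind of 75%-threshold decision the paragraph above describes.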

Reliability

It represents the degree to which the measurement is free from measurement error [ 29 ] and was calculated as detailed below.

1. Internal consistency (i.e. the degree of interrelatedness among the items [ 29 ]) was evaluated by calculating Cronbach’s alpha (values > 0.70 being considered acceptable); a hand-rolled computation of alpha, together with the ICC described next, is sketched after this list.

2. Test–retest repeatability (i.e. the degree to which the measurement is free from measurement error over time [ 29 ]) was examined 7–10 days later, without any intervening treatment, using the intraclass correlation coefficient, ICC (2,1) (values of 0.70–0.85 were considered good and > 0.85 excellent) [ 26 ].

3. The standard error of measurement (SEM) (i.e. the difference between an amount that can be measured and its true value [ 29 ]) was assessed using the formula:

SEM = SD × √(1 − ICC)

where SD represents the standard deviation of the measurements at baseline and ICC is the test–retest intraclass correlation coefficient.
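
To make the two reliability statistics above concrete, here is a hand-rolled sketch on simulated data. The alpha formula is the standard one, and the ICC(2,1) follows the usual Shrout-Fleiss two-way ANOVA decomposition; the simulated scores are invented and do not reproduce the study’s computations:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a respondents-by-items score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.
    scores: subjects-by-occasions matrix (Shrout-Fleiss formulation)."""
    n, k = scores.shape
    grand = scores.mean()
    ss_rows = k * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((scores - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)             # between-subjects mean square
    msc = ss_cols / (k - 1)             # between-occasions mean square
    mse = ss_err / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Simulated data: 30 subjects, 9 items, 2 test occasions (all invented).
rng = np.random.default_rng(7)
true = rng.normal(10, 6, 30)
items = true[:, None] / 9 + rng.normal(0, 0.8, (30, 9))
retest = np.column_stack([true + rng.normal(0, 2, 30) for _ in range(2)])

print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")
print(f"ICC(2,1) = {icc_2_1(retest):.2f}")
```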

Interpretability

It represents the degree to which one can assign qualitative meaning to an instrument’s quantitative scores or change in scores [ 29 ] and was calculated via the minimum detectable change (MDC) (i.e. the change beyond measurement error [ 29 ]) using the following equation:

MDC = z × SEM × √2

A z value of 1.96 was used to derive the MDC at the 95% confidence level (MDC 95 ).
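
Plugging the values this paper reports later (baseline SD 6.98 and ICC 0.88, from the Results) into the two formulas reproduces the reported SEM and MDC 95 up to rounding:

```python
import math

sd_baseline = 6.98  # SD of FreBAQ-I scores at baseline (reported below)
icc = 0.88          # test-retest ICC(2,1) (reported below)

sem = sd_baseline * math.sqrt(1 - icc)  # standard error of measurement
mdc95 = 1.96 * sem * math.sqrt(2)       # minimum detectable change, 95% level

print(f"SEM   = {sem:.2f}")   # ~ 2.42, close to the reported 2.44
print(f"MDC95 = {mdc95:.1f}") # ~ 6.7 points, matching the reported value
```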

Descriptive statistics were also calculated to identify floor/ceiling effects, which were considered present if > 15% of the scores achieved the lowest or highest possible value, respectively [ 26 ].
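
A floor/ceiling check of this kind is simple to script. The sketch below applies the > 15% rule to invented FreBAQ-I totals (scale range 0–36); the function name and sample data are hypothetical:

```python
import numpy as np

def floor_ceiling(scores, minimum, maximum, threshold=0.15):
    """Flag floor/ceiling effects when > 15% of scores sit at a scale limit."""
    scores = np.asarray(scores)
    return {
        "floor": (scores == minimum).mean() > threshold,
        "ceiling": (scores == maximum).mean() > threshold,
    }

# Invented FreBAQ-I totals for ten respondents.
sample = [0, 4, 9, 12, 7, 15, 3, 22, 10, 6]
print(floor_ceiling(sample, minimum=0, maximum=36))
# -> {'floor': False, 'ceiling': False}
```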

Questionnaires

Fremantle Back Awareness Questionnaire (FreBAQ)

This questionnaire quantifies distorted perception of the back. It is self-administered and includes 9 items. Each item is scored on a five-point response scale (from 0 = ‘never’ to 4 = ‘always’); the final score is obtained by summing the responses to the items and ranges from 0 to 36, with higher scores corresponding to greater levels of back-perception distortion [ 10 ].
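
As a small illustration of this scoring rule, here is a minimal sketch (the function and input format are our own, not part of the questionnaire):

```python
def score_frebaq(responses):
    """Sum the nine item scores (each 0 = 'never' ... 4 = 'always').
    The total ranges 0-36; higher = greater back-perception distortion."""
    if len(responses) != 9 or any(r not in range(5) for r in responses):
        raise ValueError("expected nine item scores between 0 and 4")
    return sum(responses)

print(score_frebaq([2, 1, 0, 3, 1, 2, 0, 1, 2]))  # -> 12
```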

Roland Morris Disability Questionnaire

This questionnaire assesses LBP-related disability. It comprises 24 items, with a total score ranging from 0 (no disability) to 24 (highest level of disability) [ 35 ].

Pain Numerical Rating Scale (PI-NRS)

An 11-point pain numerical rating scale ranging from 0 (no pain at all) to 10 (the worst imaginable pain) was used [ 36 ], asking participants to rate their current pain intensity.

Pain Catastrophising Scale (PCS)

This is a 13-item self-administered questionnaire. People are asked to classify the frequency with which they experience the thoughts listed in the tool, based on a five-point scale, which ranges from 0 (never) to 4 (always). The total score is obtained summing up the scores of the individual items and can vary from 0 to 52 [ 37 ]. Higher scores denote greater levels of pain catastrophising.

Tampa Scale of Kinesiophobia (TSK)

This questionnaire is self-administered and composed of 13 items [ 38 ]. Each item is scored using a four-point Likert scale ranging from 1 (strongly disagree) to 4 (strongly agree), and the total score is calculated by adding the scores of the individual items (range 13–52). Higher values correspond to greater fear of movement [ 38 ].

All outcome measures were administered in their validated Italian versions [ 35 , 37 , 38 , 39 , 40 , 41 ]. During the first assessment, the FreBAQ-I was systematically distributed first, followed by the RMDQ, the PI-NRS, the TSK and the PCS, in that order; only the FreBAQ-I was delivered during the second assessment. Statistical analyses were performed with STATA 13.1 software (StataCorp LP, College Station, TX, USA).

The sample size of 100 was determined to provide adequate statistical power for test–retest reliability, expecting to obtain, with 90% probability, an ICC of about 0.85, with the lower limit of the 95% CI not less than 0.75 [ 42 ]. Moreover, a sample size of 100 participants is able to ensure stability in Rasch item calibrations within ± 0.5 logits with 95% confidence [ 43 ].

One hundred and thirty-five persons with chronic non-specific LBP were consecutively assessed, of whom 25 were excluded. The reasons for exclusion were: cognitive impairment ( n  = 5); systemic illness ( n  = 4); recent cerebrovascular event ( n  = 2); recent myocardial infarction ( n  = 7); and reluctance to take part ( n  = 7). Of the remaining people, 10 dropped out before starting the study because of logistic issues ( n  = 4), economic constraints ( n  = 3), or personal problems ( n  = 3). Hence, our final sample comprised 100 subjects. All participants returned to the hospital for a second assessment within a period of 7 to 10 days, facilitated by a telephone follow-up conducted by a research assistant. Average pain duration was 49 months (SD 80). The socio-demographic and clinical characteristics of the participants are described in Tables 1 and 2 (Fig.  2 ).

Figure 2. Flow chart of participant inclusion.

Translation and cross-cultural adaptation

The adaptation process took four weeks to settle upon a culturally appropriate version. All the terms were easily forward- and back-translated and there was no need for any major local adjustment. The term “occasionally” in one rating category was translated into Italian as “qualche volta” because this was judged more suitable for a middle response category, whereas the translation “occasionalmente” would have been hard to distinguish in Italian from the adjacent category “rarely/raramente”.

The appropriateness of the whole procedure and related results was endorsed by the team of experts, who also reviewed the findings from the cognitive interviews and made only minor changes based on concerns raised by some participants, to enhance the questionnaire’s comprehensibility. After that, the principal investigator and experts confirmed the final version of the FreBAQ-I, in agreement with the developer (BMW).

Acceptability

The questionnaire took 1.97 ± 1.13 min to complete. The questions were well received. No missing responses or multiple responses were observed, nor were any comprehension difficulties raised during the instrument completion.

Scale psychometric properties

Structural validity.

The 5-level rating scale of the FreBAQ-I showed a monotonic advance of both the transition thresholds between categories and the average category measures. According to the mean-square goodness-of-fit statistics, all items fitted the Rasch model (Table  3 ).

The mean person ability was -0.97 logits (ability range from -3.06 to 3.34). The item reliability was 0.96, while person reliability was 0.75. The PCAr showed that the FreBAQ-I was essentially unidimensional: the measured variable explained 50.3% of the variance in the data, while the secondary component explained only 9.9% of the variance (corresponding to an eigenvalue of 1.8). No local item dependence was detected (i.e., no residual correlation between items stronger than 0.20 was found).

Hypothesis testing

It was considered good as 3 out of 4 a priori hypotheses were confirmed (i.e., ≥ 75%). Related results are shown in Table  4 .

Cronbach’s α was 0.82. Test–retest reliability was found to be high: ICC (2,1) = 0.88 (95% CI: 0.83–0.92). The SEM was 2.44.

The MDC 95 was 6.7 points.

The mean score for the FreBAQ-I was 9.68 points (SD 6.98). No ceiling or floor effects were detected in any of the used scales, including the FreBAQ-I.

This research describes the cross-cultural adaptation of the FreBAQ and the assessment of its validity, reliability, and measurement error, in Italian-speaking people with chronic non-specific LBP. The FreBAQ-I demonstrated unidimensionality, good validity, and adequate reliability. International recommendations were followed in the current study and all the steps suggested that the process of translation and cross-cultural adaptation was accurate and efficient. Our methodological approach comprised forward and backward translation, minor amendments by a team of experts, cognitive debriefing, and discussion and resolution through consensus among the committee members. The procedure established the initial conceptual, semantic, and content equivalence between source and target language. The final version was well accepted, and easily understood and self-administered. The respondent burden was minimal as the questionnaire needs only a few minutes to complete. Overall, the FreBAQ-I appears to be appropriate for everyday clinical practice.

Our results corroborate the findings of previous structural validations of this outcome measure [ 11 , 13 , 44 ]. The tool proved to be unidimensional (a key measurement requirement), with items acceptably fitting the mathematical model and a rating scale functioning as expected. No local item dependence was found. In our sample, the average difficulty of the items (endorsability) was about 1 logit higher than the mean sample ability (agreeability): under that condition the scale better assesses persons with moderate to high levels of disturbed body perception. These findings indicate that the item selection was appropriate and able to correctly measure the variable of interest. A minor deviation from unidimensionality was found in a sample of Indian people with chronic LBP, but no issue was expected regarding the clinical application of the FreBAQ in that population [ 22 ].

The correlation with disability (RMDQ) was as expected (Table  4 ), indicating that higher levels of back pain-related disability are associated with higher levels of body-perception disruption relating specifically to the back. This is in line with the findings of the developers, who observed a moderate correlation with disability (0.32) [ 11 ]. As predicted, we also noted a positive correlation between pain intensity and disrupted perception of the back, though this relationship was weaker than that noted for disability, also consistent with previous research [ 11 ]. Taken together, these results suggest that disrupted body perception is more strongly related to disability than to pain in people with chronic LBP [ 11 ]. This pattern was also seen in most other adaptation studies of the FreBAQ (i.e. Japanese, Dutch, German, Turkish and Persian) [ 13 , 18 , 19 , 20 , 24 ], probably indicating that disrupted body perception impacts more on the functional consequences of pain than on the experience of pain itself. A higher correlation with disability than with pain was likewise found in the Spanish sample (0.48 vs 0.38) [ 23 ]. Very low correlations with disability and pain were found in the Indian study [ 22 ].

With respect to catastrophizing, we noted a correlation similar to that reported for the original English version (0.36) [ 11 ]. Our results are also consistent with those obtained from the Japanese (0.38) [ 13 ], Turkish (0.41) [ 20 ], Spanish (0.46) [ 23 ] and Persian (0.60) [ 24 ] cross-cultural adaptation studies. These findings support the idea of a relationship between high levels of pain catastrophizing and disrupted body perception. For kinesiophobia, our estimate is slightly higher than that reported by the original developers (0.26) [ 11 ]; however, that study utilized a different measure of kinesiophobia (the Fear Avoidance Beliefs Questionnaire, physical activity subscale), which makes direct comparison difficult. Our results are somewhat in line with the Turkish and Spanish estimates (0.60 and 0.37) [ 20 , 23 ], while divergent from those found in the Japanese, Dutch and Persian samples (0.22, 0.10, and 0.17 respectively) [ 13 , 18 , 24 ]; therefore, more analyses are recommended before firm conclusions can be drawn about the relationship between kinesiophobia and altered body perception.

The internal consistency of the FreBAQ-I was good (0.82) and quite similar to that reported by the developers (0.80) [ 11 ]. Similar results were found for the other versions of the questionnaire: Japanese 0.80 [ 13 ]; Dutch 0.82 [ 18 ]; German 0.91 [ 19 ]; Turkish 0.87 [ 20 ]; Chinese 0.83 [ 21 ]; Indian 0.91 [ 22 ]; Spanish 0.82 [ 23 ]; and Persian 0.74 [ 24 ].

Test–retest repeatability demonstrated an excellent level of agreement between the results on days 1 and 8 (ICC (2,1) = 0.88), a value higher than that reported in the original study (0.65) [ 10 ], while in the other validations values were 0.69 [ 18 ], 0.78 [ 23 ] and 0.90 [ 21 ]. The better reliability noted in this investigation may reflect the fact that no treatment was provided to participants between the two testing occasions, a control not enacted in the original study [ 10 ]. The measurement error of the FreBAQ-I was acceptable; owing to the high test–retest repeatability, the SEM and MDC were rather low.

The MDC 95 indicated that a change of more than 7 points after a given intervention (~ 19% of the FreBAQ score range of 36 points) would exceed measurement error. Slightly different findings were reported by the Dutch study (where the MDC was estimated at 10.8 points) [ 18 ] and the Persian study (2.52 points) [ 24 ], while the validations of the Chinese and Spanish versions of the FreBAQ reported an MDC 95 of 5.99 and 5.12 points, respectively [ 21 , 23 ].

This study has some limitations. First, the study design is cross-sectional, and thus responsiveness and minimal important change could not be assessed. Second, no external anchor such as a global rating of change was used to assess clinical stability during the assessment of reliability, and participants may have improved or worsened between the first and second administrations of the FreBAQ-I. Third, the association between back-related perceptual dysfunction and physical performance measures was not investigated, as only questionnaires were employed. Fourth, relationships with other psychological characteristics (e.g. the Pain Self-Efficacy Questionnaire or the Coping Strategies Questionnaire-27 revised) [ 45 , 46 , 47 ], or quality of life (e.g. the Short-Form Health Survey 36-items) [ 48 ], as well as with clinical tests that might be able to detect alterations in the sensorimotor system [ 49 ], were not examined. Fifth, our research was limited to people with chronic non-specific LBP, and it is doubtful whether these results can be extended to individuals with other causes of lumbar pain (e.g. canal stenosis, fracture, or disc herniation) or pain of a different duration. Therefore, studies in these populations are advised.

Conclusions

The FreBAQ-I displays a one-factor structure, is valid and reliable, and has an acceptable measurement error. This Italian version can be recommended for use in clinical and research settings for the assessment of Italian-speaking people with chronic LBP.

Availability of data and materials

The datasets used and/or analysed during the current study are available from the first author upon reasonable request.

Abbreviations

CI: Confidence Interval
CT: Computed Tomography
FreBAQ: Fremantle Back Awareness Questionnaire
FreBAQ-I: Fremantle Back Awareness Questionnaire-Italian
ICC: Intraclass correlation coefficient
ISPOR: International Society for Pharmacoeconomics and Outcomes Research
LBP: Low back pain
MDC: Minimum detectable change
MnSq: Mean-square statistics
MRI: Magnetic Resonance Imaging
PCAr: Principal Component Analysis of the residuals
PCS: Pain Catastrophising Scale
PI-NRS: Pain intensity numerical rating scale
PROM: Patient-reported outcome measure
SE: Standard error
SEM: Standard error of measurement
TSK: Tampa Scale of Kinesiophobia

Dunn KM, Hestbaek L, Cassidy JD. Low back pain across the life course. Best Prac Res Clin Rheumatol. 2013;27:591–600.

Article   Google Scholar  

Hartvigsen J, Hancock MJ, Kongsted A, Louw Q, Ferreira ML, Genevay S, et al. What low back pain is and why we need to pay attention. The Lancet. 2018;391:2356–67.

Menezes Costa LDC, Maher CG, Hancock MJ, McAuley JH, Herbert RD, Costa LOP. The prognosis of acute and persistent low-back pain: a meta-analysis. CMAJ. 2012;184:E613–24.

Article   PubMed Central   Google Scholar  

Itz CJ, Geurts JW, Van Kleef M, Nelemans P. Clinical course of non-specific low back pain: A systematic review of prospective cohort studies set in primary care: Clinical course of non-specific low back pain. EJP. 2013;17:5–15.


Moseley LG. I can’t find it! Distorted body image and tactile dysfunction in patients with chronic back pain. Pain. 2008;140:239–43.


Wand BM, Di Pietro F, George P, O’Connell NE. Tactile thresholds are preserved yet complex sensory function is impaired over the lumbar spine of chronic non-specific low back pain patients: a preliminary investigation. Physiotherapy. 2010;96:317–23.

Sheeran L, Sparkes V, Caterson B, Busse-Morris M, Van Deursen R. Spinal Position Sense and Trunk Muscle Activity During Sitting and Standing in Nonspecific Chronic Low Back Pain: Classification Analysis. Spine. 2012;37:E486–95.

Bray H, Moseley GL. Disrupted working body schema of the trunk in people with back pain. Br J Sports Med. 2011;45:168–73.

Moseley GL, Gallagher L, Gallace A. Neglect-like tactile dysfunction in chronic back pain. Neurology. 2012;79:327–32.

Wand BM, James M, Abbaszadeh S, George PJ, Formby PM, Smith AJ, et al. Assessing self-perception in patients with chronic low back pain: Development of a back-specific body-perception questionnaire. BMR. 2014;27:463–73.

Wand BM, Catley MJ, Rabey MI, O’Sullivan PB, O’Connell NE, Smith AJ. Disrupted Self-Perception in People With Chronic Low Back Pain. Further Evaluation of the Fremantle Back Awareness Questionnaire. J Pain. 2016;17:1001–12.

Goossens N, Geraerts I, Vandenplas L, Van Veldhoven Z, Asnong A, Janssens L. Body perception disturbances in women with pregnancy-related lumbopelvic pain and their role in the persistence of pain postpartum. BMC Pregnancy Childbirth. 2021;21:219.


Nishigami T, Mibu A, Tanaka K, Yamashita Y, Shimizu ME, Wand BM, et al. Validation of the Japanese Version of the Fremantle Back Awareness Questionnaire in Patients with Low Back Pain. Pain Pract. 2018;18:170–9.

Wälti P, Kool J, Luomajoki H. Short-term effect on pain and function of neurophysiological education and sensorimotor retraining compared to usual physiotherapy in patients with chronic or recurrent non-specific low back pain, a pilot randomized controlled trial. BMC Musculoskelet Disord. 2015;16:83.

Wand BM, O’Connell NE, Di Pietro F, Bulsara M. Managing Chronic Nonspecific Low Back Pain With a Sensorimotor Retraining Approach: Exploratory Multiple-Baseline Study of 3 Participants. Phys Ther. 2011;91:535–46.

Prinsen CAC, Mokkink LB, Bouter LM, Alonso J, Patrick DL, De Vet HCW, et al. COSMIN guideline for systematic reviews of patient-reported outcome measures. Qual Life Res. 2018;27:1147–57.


Mokkink LB, De Vet HCW, Prinsen CAC, Patrick DL, Alonso J, Bouter LM, et al. COSMIN Risk of Bias checklist for systematic reviews of Patient-Reported Outcome Measures. Qual Life Res. 2018;27:1171–9.

Janssens L, Goossens N, Wand BM, Pijnenburg M, Thys T, Brumagne S. The development of the Dutch version of the Fremantle Back Awareness Questionnaire. Musculoskeletal Science and Practice. 2017;32:84–91.

Ehrenbrusthoff K, Ryan CG, Grüneberg C, Wand BM, Martin DJ. The translation, validity and reliability of the German version of the Fremantle Back Awareness Questionnaire. PLoS ONE. 2018;13:e0205244.

Erol E, Yildiz A, Yildiz R, Apaydin U, Gokmen D, Elbasan B. Reliability and Validity of the Turkish Version of the Fremantle Back Awareness Questionnaire. Spine. 2019;44:E549–54.

Hu F, Liu C, Cao S, Wang X, Liu W, Li T, et al. Cross-cultural adaptation and validation of the simplified chinese version of the fremantle back awareness questionnaire in patients with low back Pain. Eur Spine J. 2022;31:935–42.

Rao P, Jain M, Barman A, Bansal S, Sahu R, Singh N. Fremantle Back awareness questionnaire in chronic low Back pain (Frebaq-I): translation and validation in the Indian population. Asian J Neurosurg. 2021;16:113–8.

García-Dopico N, La Torre-Luque D, Wand BM, Velasco-Roldán O, Sitges C. The cross-cultural adaptation, validity, and reliability of the Spanish version of the Fremantle Back Awareness Questionnaire. Front Psychol. 2023;14:1070411.

Mahmoudzadeh A, Abbaszadeh S, Baharlouei H, Karimi A. Translation and Cross-cultural Adaptation of the Fremantle Back Awareness Questionnaire into Persian language and the assessment of reliability and validity in patients with chronic low back pain. J Res Med Sci. 2020;25:74.

Negrini S, Bonaiuti D, Monticone M, Trevisan C. Medical causes of low back pain. In: Interventional Spine. Elsevier; 2008. p. 803–11.


Monticone M, Galeoto G, Berardi A, Tofani M. Psychometric properties of assessment tools. Measuring Spinal Cord Injury: A Practical Guide of Outcome Measures; 2021. p. 7–15.


Beaton DE, Bombardier C, Guillemin F, Ferraz MB. Guidelines for the process of cross-cultural adaptation of self-report measures. Spine. 2000;25:3186–91.


Wild D, Grove A, Martin M, Eremenco S, McElroy S, Verjee-Lorenz A, et al. Principles of Good Practice for the Translation and Cultural Adaptation Process for Patient-Reported Outcomes (PRO) Measures: Report of the ISPOR Task Force for Translation and Cultural Adaptation. Value in Health. 2005;8:94–104.

De Vet HC, Terwee CB, Mokkink LB, Knol DL. Measurement in medicine: a practical guide. Cambridge university press; 2011.


Franchignoni F, Monticone M, Giordano A, Rocca B. Rasch validation of the Prosthetic Mobility Questionnaire: A new outcome measure for assessing mobility in people with lower limb amputation. J Rehabil Med. 2015;47:460–5.

Franchignoni F, Giordano A, Vercelli S, Bravini E, Stissi V, Ferriero G. Rasch analysis of the Patient and Observer Scar Assessment Scale in linear scars: suggestions for a Patient and Observer Scar Assessment Scale v2.1. Plast Reconstr Surg. 2019;144:1073e–9e.


Bond T, Yan Z, Heene M. Applying the Rasch model: Fundamental measurement in the human sciences. Routledge; 2020.

Linacre JM, Wright BD. Winsteps. https://www.winsteps.com/index.htm . 2000. Accessed 27 June 2013.

Christensen KB, Makransky G, Horton M. Critical values for Yen’s Q 3: Identification of local dependence in the Rasch model using residual correlations. Appl Psychol Meas. 2017;41:178–94.

Padua R, Padua L, Ceccarelli E, Romanini E, Zanoli G, Bondì R, et al. Italian version of the Roland Disability Questionnaire, specific for low back pain: cross-cultural adaptation and validation. Eur Spine J. 2002;11:126–9.

Farrar JT, Young JP Jr, LaMoreaux L, Werth JL, Poole RM. Clinical importance of changes in chronic pain intensity measured on an 11-point numerical pain rating scale. Pain. 2001;94:149–58.

Monticone M, Baiardi P, Ferrari S, Foti C, Mugnai R, Pillastrini P, et al. Development of the Italian version of the Pain Catastrophising Scale (PCS-I): cross-cultural adaptation, factor analysis, reliability, validity and sensitivity to change. Qual Life Res. 2012;21:1045–50.

Monticone M, Giorgi I, Baiardi P, Barbieri M, Rocca B, Bonezzi C. Development of the Italian version of the Tampa Scale of Kinesiophobia (TSK-I): cross-cultural adaptation, factor analysis, reliability, and validity. Spine. 2010;35:1241–6.


Monticone M, Baiardi P, Vanti C, Ferrari S, Pillastrini P, Mugnai R, et al. Responsiveness of the Oswestry Disability Index and the Roland Morris Disability Questionnaire in Italian subjects with sub-acute and chronic low back pain. Eur Spine J. 2012;21:122–9.

Monticone M, Ambrosini E, Rocca B, Foti C, Ferrante S. Responsiveness of the Tampa Scale of Kinesiophobia in Italian subjects with chronic low back pain undergoing motor and cognitive rehabilitation. European Spine J. 2016;25:2882–8.

Monticone M, Portoghese I, Rocca B, Giordano A, Campagna M, Franchignoni F. Responsiveness and minimal important change of the Pain Catastrophizing Scale in people with chronic low back pain undergoing multidisciplinary rehabilitation. Eur J Phys Rehab Med. 2022;58:68.

Zou GY. Sample size formulas for estimating intraclass correlation coefficients with precision and assurance. Stat Med. 2012;31:3972–81.

Linacre JM. Sample size and item calibration stability. Rasch measurement transactions. 1994;7:328.

Schäfer A, Wand BM, Lüdtke K, Ehrenbrusthoff K, Schöttker-Königer T. Validation and investigation of cross cultural equivalence of the Fremantle back awareness questionnaire-German version (FreBAQ-G). BMC Musculoskelet Disord. 2021;22:1–14.

Chiarotto A, Vanti C, Ostelo RW, Ferrari S, Tedesco G, Rocca B, et al. The pain self-efficacy questionnaire: cross-cultural adaptation into Italian and assessment of its measurement properties. Pain Pract. 2015;15:738–47.

Monticone M, Giordano A, Franchignoni F. Scale shortening and decrease in measurement precision: Analysis of the Pain Self-Efficacy Questionnaire and its short forms in an Italian-speaking population with neck pain disorders. Physical Therapy. 2021;101:pzab039.

Monticone M, Ferrante S, Giorgi I, Galandra C, Rocca B, Foti C. The 27-item coping strategies questionnaire—revised: confirmatory factor analysis, reliability and validity in italian-speaking subjects with chronic pain. Pain Res Manage. 2014;19:153–8.

Apolone G, Mosconi P. The Italian SF-36 Health Survey: translation, validation and norming. J Clin Epidemiol. 1998;51:1025–36.

Meier R, Emch C, Gross-Wolf C, Pfeiffer F, Meichtry A, Schmid A, et al. Sensorimotor and body perception assessments of nonspecific chronic low back pain: a cross-sectional study. BMC Musculoskelet Disord. 2021;22:1–10.


Acknowledgements

The authors would like to thank all the people who took part in the study, along with Dr Franco Franchignoni and Dr Andrea Giordano, who gave important suggestions and support for the data analysis.

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

Department of Surgical Sciences, University of Cagliari, Cagliari, Italy

Marco Monticone

Ambulatorio di quartiere, Cagliari, Italy

Carolina Maurandi

Rehabilitation Medicine and Neurorehabilitation, P.O. San Martino, Oristano, Italy

Elisa Porcu

Department of Mechanical, Chemical, and Materials Engineering, University of Cagliari, Cagliari, Italy

Federico Arippa

The Faculty of Medicine, Nursing & Midwifery and Health Sciences, The University of Notre Dame Australia, Fremantle, WA, Australia

Benedict M. Wand

Studio Fisioterapico Corona, Cagliari, Italy

Giorgio Corona


Contributions

MM initiated study conception and design. CM, GC and EP performed the data collection and acquisition of data. FA performed the data analysis. MM interpreted the data. MM wrote the manuscript. MM, EP and FA edited the manuscript. BMW had a role in critical revision of different parts of the work. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Federico Arippa .

Ethics declarations

Ethics approval and consent to participate

This cross-sectional study was approved by our Local Ethical Committee (Salvatore Maugeri Foundation, Scientific Institute of Lissone, Monza Brianza, Italy) on 05/04/2016 (committee reference number 7/16) and was conducted in accordance with the ethical and humane research principles of the Declaration of Helsinki. Written and verbal informed consent was obtained from all participants. Before inclusion, all participants signed a written informed consent form that detailed the study's aims and attendance modalities and asked for permission to use their clinical and socio-demographic data for the purposes of the present research.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Monticone, M., Maurandi, C., Porcu, E. et al. The Fremantle Back Awareness Questionnaire: cross-cultural adaptation, reliability, and validity of the Italian version in people with chronic low back pain. BMC Musculoskelet Disord 25 , 279 (2024). https://doi.org/10.1186/s12891-024-07420-2

Download citation

Received : 07 November 2023

Accepted : 05 April 2024

Published : 11 April 2024

DOI : https://doi.org/10.1186/s12891-024-07420-2


  • Rehabilitation
  • Rasch analysis
  • Measurement error



  • Open access
  • Published: 19 April 2024

A first look at the reliability, validity and responsiveness of L-PF-35 dyspnea domain scores in fibrotic hypersensitivity pneumonitis

  • Jeffrey J. Swigris   ORCID: orcid.org/0000-0002-2643-8110 1 ,
  • Kerri Aronson 2 &
  • Evans R. Fernández Pérez 1  

BMC Pulmonary Medicine volume  24 , Article number:  188 ( 2024 ) Cite this article


Dyspnea impairs quality of life (QOL) in patients with fibrotic hypersensitivity pneumonitis (FHP). The Living with Pulmonary Fibrosis questionnaire (L-PF) assesses symptoms, their impacts and PF-related QOL in patients with any form of PF. Its scores have not undergone validation analyses in an FHP cohort.

We used data from the Pirfenidone in FHP trial to examine reliability, validity and responsiveness of the L-PF-35 Dyspnea domain score (Dyspnea) and to estimate its meaningful within-patient change (MWPC) threshold for worsening. Lack of suitable anchors precluded conducting analyses for other L-PF-35 scores.

At baseline, Dyspnea’s internal consistency (Cronbach’s coefficient alpha) was 0.85; there were significant correlations with all four anchors (University of California San Diego Shortness of Breath Questionnaire scores r  = 0.81, St. George’s Activity domain score r  = 0.82, percent predicted forced vital capacity r  = 0.37, and percent predicted diffusing capacity of the lung for carbon monoxide r  = 0.37). Dyspnea was significantly different between anchor subgroups (e.g., lowest percent predicted forced vital capacity (FVC%) vs. highest, 33.5 ± 18.5 vs. 11.1 ± 9.8, p  = 0.01). There were significant correlations between changes in Dyspnea and changes in anchor scores at all trial time points. Longitudinal models further confirmed responsiveness. The MWPC threshold estimate for worsening was 6.6 points (range 5–8).

The L-PF-35 Dyspnea domain appears to possess acceptable psychometric properties for assessing dyspnea in patients with FHP. Because instrument validation is never accomplished with one study, additional research is needed to build on the foundation these analyses provide.

Trial registration

The data for the analyses presented in this manuscript were generated in a trial registered on ClinicalTrials.gov; the identifier was NCT02958917.


Introduction

Fibrotic hypersensitivity pneumonitis (FHP) is a form of fibrosing interstitial lung disease (fILD) that, like other fILDs, is incurable, induces burdensome symptoms, confers a risk of shortened survival [ 1 , 2 ], and robs patients of their quality of life (QOL) [ 3 , 4 ]. Although FHP has not seen as much research into the patient experience as idiopathic pulmonary fibrosis (IPF), available data reveal that FHP-induced dyspnea, fatigue and cough affect how patients feel and function in their daily lives [ 3 , 4 ].

Given the potential for FHP to progress and respond poorly to immunosuppression and antigen avoidance (if one can be identified), Fernández Pérez and colleagues conducted a placebo-controlled trial of the antifibrotic, pirfenidone, in patients with FHP [ 5 ]. In that trial (Pirfenidone in FHP), among other patient-reported outcome measures (PROMs), the Living with Pulmonary Fibrosis (L-PF) questionnaire was used to examine the effects of pirfenidone on FHP-related QOL, symptoms and their impacts.

Here, we present findings from a hypothesis-based analysis of the reliability, validity and responsiveness of the Dyspnea domain from the 35-item L-PF (or L-PF-35; these 35 items are the same 35 that compose the Living with Idiopathic Pulmonary Fibrosis questionnaire (L-IPF) [ 6 ]).

The design and primary results of the single-center, double-blinded Pirfenidone in FHP trial (ClinicalTrials.gov identifier NCT02958917), from which the data for our analyses were generated, have been published [ 5 ]. Briefly, 40 subjects with FHP were randomized 2:1 to receive pirfenidone or a matching placebo for 52 weeks. Study visits occurred at baseline and at 13, 26, 39 and 52 weeks. At each visit, subjects completed three patient-reported outcome measures (PROMs) and performed spirometry to capture forced vital capacity (FVC). Diffusing capacity (DLCO) was assessed at baseline, 26 and 52 weeks only. This analysis was performed under a research protocol approved by the National Jewish Health central Institutional Review Board (HS# 3034).

PROMs used in the Pirfenidone in FHP trial

The L-PF-35 (Living with Pulmonary Fibrosis 35-item questionnaire)

The L-PF-35 is designed to assess PF-related QOL, symptoms and their impacts. It is equivalent to the Living with Idiopathic Pulmonary Fibrosis questionnaire (L-IPF), but with the word “idiopathic” removed from the title and a single item removed from the Impacts module. The L-IPF began as a 44-item questionnaire, but in a previously published validation study that included 125 patients with IPF, psychometric analyses supported reducing it from 44 to 35 items [ 6 ]. The developer's intent is for the L-PF to be a single, 35-item questionnaire for all forms of PF (IPF and non-IPF, including FHP). Thus, although the 44-item version (again, with the word “idiopathic” removed) was administered in the Pirfenidone in FHP trial, our analyses here were conducted on the Dyspnea domain from the 35-item version resulting from the IPF analysis. From here on, we refer to this instrument as the L-PF-35.

Each of the Dyspnea, Cough and Energy/Fatigue domain scores and the Impacts module score is calculated as a percentage of total possible points. The Symptoms module score is the average of the Dyspnea, Cough and Energy/Fatigue domain scores, and the total score is the average of the Symptoms and Impacts module scores. The Symptoms module contains 15 items (Dyspnea domain 7 items, Cough domain 5 items, Energy/Fatigue domain 3 items), each with a 24-hour recall period. The Impacts module contains 20 items, each with a 7-day recall period. Each of the six scores ranges from 0 to 100, and higher scores connote greater impairment.
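
As a concrete illustration of this scoring scheme, the sketch below converts raw item responses into the six L-PF-35 scores. The item counts and averaging rules come from the description above; the assumed 0–4 response range and the function names are illustrative assumptions, not the official scoring manual.

```python
def percent_of_total(items, max_per_item=4):
    """Score a group of items as a percentage of total possible points (0-100).
    max_per_item is an illustrative assumption; the scoring manual defines it."""
    return 100 * sum(items) / (len(items) * max_per_item)

def score_lpf35(dyspnea, cough, energy, impacts):
    """dyspnea: 7 items, cough: 5 items, energy/fatigue: 3 items, impacts: 20 items."""
    d = percent_of_total(dyspnea)
    c = percent_of_total(cough)
    e = percent_of_total(energy)
    imp = percent_of_total(impacts)
    symptoms = (d + c + e) / 3    # Symptoms module: mean of the three domains
    total = (symptoms + imp) / 2  # Total: mean of Symptoms and Impacts modules
    return {"Dyspnea": d, "Cough": c, "Energy/Fatigue": e,
            "Impacts": imp, "Symptoms": symptoms, "Total": total}

# Example: a respondent with moderate dyspnea and milder cough/impacts
print(score_lpf35([2] * 7, [1] * 5, [2] * 3, [1] * 20))
```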

The SGRQ (St. George’s Respiratory Questionnaire)

The SGRQ is a 50-item questionnaire that yields four scores (total, Symptoms, Activity, Impacts). For the version used in the trial, the recall period for some items is three months and for others, it is “these days”. The range for each score is 0-100, and higher scores indicate worse respiratory health status [ 7 , 8 ].

The UCSD (University of California San Diego Shortness of Breath Questionnaire)

The UCSD is a 24-item questionnaire that assesses dyspnea severity while performing each of 21 activities, and it includes another 3 items that ask about limitations induced by shortness of breath [ 9 ]. Each item is scored on a 0–5 scale. There is no stated recall period. Scores range from 0 to 120, and higher scores indicate greater dyspnea severity.

Statistical analyses

Baseline data were tabulated and summarized using counts, percentages and measures of central tendency. We formulated hypotheses (included in the Supplementary material) for the L-PF-35 Dyspnea domain and conducted analyses in accordance with COSMIN recommendations for studies on the measurement properties of PROMs [ 10 , 11 ]. We used SGRQ Activity domain change scores, UCSD change scores, percent predicted FVC (FVC%) change, and percent predicted DLCO (DLCO%) change as anchors. Analyses included the following: (1) internal consistency and test-retest reliability, (2) convergent and known-groups analyses to assess construct validity, (3) responsiveness, and (4) an estimation of the meaningful within-patient change (MWPC) threshold for worsening.

For applicable analyses, we defined worsening for the anchors in the following way: 1) ≥ 5 point increase for SGRQ Activity domain [ 12 , 13 ]; 2) ≥ 5 point increase in UCSD score [ 14 ]; 3) > 2% drop in FVC% (e.g., 70% to less than 68%) [ 15 ]; and 4) ≥ 5% drop in DLCO% (e.g., 70–65% or lower). Analyses were conducted in SAS, version 9.4 (SAS Institute Inc.; Cary, NC).
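
These anchor definitions are simple threshold rules, and a literal translation into code makes them easy to audit. In the sketch below, the function and anchor labels are ours, not the trial's analysis code.

```python
def worsened(anchor, baseline, followup):
    """Return True if the anchor indicates worsening, per the trial definitions."""
    change = followup - baseline
    if anchor == "SGRQ_Activity":
        return change >= 5   # >= 5-point increase
    if anchor == "UCSD":
        return change >= 5   # >= 5-point increase
    if anchor == "FVC_pct":
        return change < -2   # more than a 2-point absolute drop in FVC%
    if anchor == "DLCO_pct":
        return change <= -5  # at least a 5-point absolute drop in DLCO%
    raise ValueError(f"unknown anchor: {anchor}")

print(worsened("FVC_pct", 70, 67.5))  # True: FVC% dropped by more than 2 points
```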

Internal consistency

We used Cronbach’s raw coefficient alpha as the measure of internal consistency (IC). Values > 0.7 are considered acceptable.
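Cronbach's alpha can be computed directly from item-level data with the standard formula; a minimal NumPy sketch (toy data, respondents in rows and items in columns) follows.

```python
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = respondents, columns = items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Toy example: 5 respondents answering 7 correlated dyspnea items
rng = np.random.default_rng(0)
base = rng.integers(0, 5, size=(5, 1))                       # shared severity level
data = np.clip(base + rng.integers(-1, 2, size=(5, 7)), 0, 4)
print(f"alpha = {cronbach_alpha(data):.2f}")
```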

Test-retest reliability

We used a two-way mixed effects model for absolute agreement to generate the intraclass correlation coefficient (ICC(2,1)) as a measure of test-retest reliability of L-PF-35 Dyspnea domain scores (from baseline to week 26) among subjects considered stable according to anchor change scores over the same interval. Values > 0.7 are considered acceptable.
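
For readers replicating this outside SAS, the pingouin library in Python reports a family of ICCs from long-format data; under the assumption of one baseline and one week-26 score per stable subject, a sketch might look like the following (pingouin's ICC2 row corresponds to the single-measure, absolute-agreement coefficient usually written ICC(2,1)).

```python
import pandas as pd
import pingouin as pg

# Long-format toy data: each stable subject scored on two occasions
df = pd.DataFrame({
    "subject": [1, 1, 2, 2, 3, 3, 4, 4],
    "occasion": ["t1", "t2"] * 4,
    "score": [30, 32, 12, 11, 45, 44, 20, 23],
})

icc = pg.intraclass_corr(data=df, targets="subject",
                         raters="occasion", ratings="score")
print(icc.set_index("Type").loc["ICC2", ["ICC", "CI95%"]])
```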

Convergent and known-groups validity

Convergent validity was examined using pairwise Spearman correlations between L-PF-35 Dyspnea domain scores and anchors at baseline. We used analysis of variance with secondary, p-value corrected (Tukey) pairwise comparisons to look for statistically significant differences in L-PF-35 Dyspnea domain scores between most and least severe anchor subgroup strata (with anchors di- or trichotomized based on clinically relevant cut-points; e.g., FVC: ≤55, 55 < FVC < 70, or ≥ 70).
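
Both steps map onto standard statistical tooling. The sketch below, with invented toy data, pairs SciPy's Spearman correlation with statsmodels' Tukey-corrected pairwise comparisons; the variable names and values are ours.

```python
import numpy as np
from scipy.stats import spearmanr
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Convergent validity: Spearman correlation between Dyspnea and an anchor at baseline
dyspnea = np.array([10, 25, 40, 15, 55, 30])
ucsd = np.array([20, 35, 60, 22, 80, 41])
rho, p = spearmanr(dyspnea, ucsd)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")

# Known-groups validity: Tukey-corrected pairwise comparisons across FVC% strata
fvc_group = ["<=55", "55-70", ">=70", "<=55", "55-70", ">=70"]
print(pairwise_tukeyhsd(endog=dyspnea, groups=fvc_group, alpha=0.05))
```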

Responsiveness

We used pairwise correlation, longitudinal models and empirical cumulative distribution function (eCDF) plots to assess the responsiveness of L-PF-35 Dyspnea domain scores among subjects whose dyspnea changed as defined by the applicable anchor. In the correlational analyses, for 13-, 26-, 39- and 52-week timepoints, we examined pairwise Spearman correlations between L-PF-35 Dyspnea domain change scores and anchor change. In the modeling analyses, for each anchor, we built a repeated-measures, longitudinal model with L-PF-35 Dyspnea domain change score (from baseline to each subsequent time point) as the outcome variable and anchor change (from baseline to each subsequent time point) as the lone predictor variable. Visit (week 13, 26, 39, 52) was included in each model as a class variable, and an unstructured covariance structure was used (i.e., type = un in SAS). For the eCDF, we graphed the cumulative distribution of L-PF-35 Dyspnea domain change scores from baseline to week 26 for each of two dichotomized anchor change strata (worse vs. not at week 26 as defined above).
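
The eCDF itself needs no special machinery: sort the change scores within each anchor-defined subgroup and plot cumulative proportions. A minimal sketch with hypothetical change scores:

```python
import numpy as np
import matplotlib.pyplot as plt

def ecdf(values):
    """Return sorted values and their cumulative proportions."""
    xs = np.sort(values)
    ys = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ys

# Hypothetical 26-week Dyspnea change scores, split by a dichotomized anchor
worse = np.array([8, 12, 5, 15, 7, 10])
stable = np.array([-2, 1, 0, 3, -4, 2])

for label, vals in [("worsened", worse), ("not worsened", stable)]:
    xs, ys = ecdf(vals)
    plt.step(xs, ys, where="post", label=label)
plt.xlabel("L-PF-35 Dyspnea change (baseline to week 26)")
plt.ylabel("Cumulative proportion")
plt.legend()
plt.show()
```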

Meaningful within patient change (MWPC) threshold

We used predictive modeling (anchor as the outcome and L-PF-35 Dyspnea domain as the lone predictor) and adjustment for the correlation between L-PF-35 Dyspnea domain score change and anchor score change [ 16 ] to generate MWPC threshold estimates for worsening at 26 weeks. We used the method of Trigg and Griffiths [ 17 ] to generate a correlation-weighted point estimate.
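
One simple way to combine anchor-specific estimates is a correlation-weighted average, sketched below using the values reported in the Results. This illustrates the general idea only; the Trigg and Griffiths procedure may weight the estimates differently, so the sketch need not reproduce the published 6.6-point figure exactly.

```python
import numpy as np

# Anchor-specific MWPC estimates and point-biserial correlations (from the Results)
estimates = np.array([6.3, 4.8, 8.0, 6.9])      # UCSD, SGRQ Activity, FVC%, DLCO%
correlations = np.array([0.30, 0.49, 0.47, 0.65])

# Correlation-weighted mean: anchors more strongly related to the PROM count more
weighted = np.average(estimates, weights=correlations)
print(f"Correlation-weighted MWPC estimate: {weighted:.1f} points")
```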

Baseline characteristics and PROM scores of the trial population are presented in Table  1 . Most subjects were non-Hispanic white, used supplemental oxygen, and had moderate pulmonary physiological impairment.

Internal consistency and test-retest reliability

IC for the L-PF-35 Dyspnea domain was at least 0.85 at all time points. Test-retest reliability (TRR) coefficients for L-PF-35 Dyspnea were 0.81 or greater for each anchor. Table S1 contains IC and TRR values.

Pairwise correlations at baseline are presented in Table  2 . Correlations between L-PF-35 Dyspnea domain scores and UCSD or SGRQ Activity scores are very strong, statistically significant and in the expected directions. Correlations between L-PF-35 Dyspnea and FVC% or DLCO% are low to moderately strong, statistically significant and in the expected directions.

Table  3 shows results for known-groups validity analyses. For each of the four anchors, compared to the least impaired anchor subgroup, L-PF-35 Dyspnea scores were significantly worse (i.e., higher and of large effect; e.g., worse by > 1 standard deviation) for the more impaired anchor subgroup.

Across study timepoints, 12 of 14 correlations between L-PF-35 Dyspnea domain score change and anchor change values were statistically significant and at least moderately strong (Table S2).

Longitudinal modeling showed significant ( p  < 0.0001 for all) associations between L-PF-35 Dyspnea domain score change and anchor change values over the course of the trial (Fig.  1 ). Table S3 shows results for all longitudinal models.

eCDF plots of L-PF-35 Dyspnea domain 26-week change scores are displayed in Fig.  2 . They show separation between subgroups that worsened vs. not at 26 weeks according to each of the four anchors. Table  4 provides values of L-PF-35 Dyspnea domain 26-week change scores for the cohort using percentile cut-points.

MWPC threshold

Predictive modeling yielded estimates for MWPC for worsening in L-PF-35 Dyspnea domain scores of 6.3, 4.8, 8.0 and 6.9 for the four anchors: UCSD, SGRQ Activity, FVC%, and DLCO% respectively. The corresponding point-biserial correlations between L-PF-35 Dyspnea domain score change and the dichotomized UCSD, SGRQ Activity, FVC%, and DLCO% anchors (worse vs. not) were the following: 0.30, 0.49, 0.47, and 0.65. Thus, the weighted MWPC threshold estimate for worsening of L-PF-35 Dyspnea domain scores was 6.6 points (range 5–8).

In this study, we conducted analyses whose results offer a first glance at the psychometric properties of the L-PF-35 Dyspnea domain and support the reliability, validity and responsiveness of its score as a measure of dyspnea in patients with FHP. Measurement experts and regulatory bodies have compiled criteria that, when met, deem clinical outcome assessments (COAs), like PROMs, fit for the purpose of measuring outcomes in a target population [ 10 , 18 ]. The internal structure of the PROM must be sound, with sufficiently strong correlations among grouped items (internal consistency); PROM scores from respondents who are stable on the construct being measured should be similarly stable (test-retest reliability); PROM scores should differ between subgroups of respondents known, or hypothesized, to differ on the construct being measured (known-groups validity); and PROM scores should change for respondents who change on the underlying construct (responsiveness).

Because there are no gold standards for any of the constructs assessed by L-PF-35 scores (including dyspnea), anchors are employed as surrogates for gold standards, and hypotheses are formulated around the surrogates while incorporating the fit-for-purpose criteria outlined above. Anchors, themselves, must be suitable and ideally have undergone validity assessments of their own. Reassuringly, in their studies of patients with PF, other investigators have employed the anchors we used in our analyses [ 19 ]. Additionally, self-report anchors (like the UCSD and SGRQ Activity domain) generally surpass expert-endorsed suitability criteria [ 20 ], and the FVC and DLCO are universally accepted metrics of PF severity.

As hypothesized, the L-PF-35 Dyspnea domain surpassed the acceptability criteria (0.7) for internal consistency and test-retest reliability. Likewise, L-PF-35 Dyspnea domain scores distinguished respondents hypothesized to have the greatest dyspnea severity (e.g., those with the highest (worst) UCSD scores, highest (worst) SGRQ Activity scores, lowest FVC% or lowest DLCO%) from those with the least dyspnea severity. L-PF-35 Dyspnea domain change scores correlated with anchor change scores, and longitudinal modeling and eCDF plots further supported the L-PF-35 Dyspnea domain score as responsive to changes in dyspnea severity over time.

When the recall period for a PROM is 24 h, variability can be accommodated by averaging scores over a given time frame (e.g., a week). That was not done in the Pirfenidone in FHP trial. Reassuringly, however, despite the differences in recall periods (L-PF-35 Dyspnea domain 24 h, UCSD no stated timeframe, SGRQ Activity domain three months), correlations between anchor change scores were generally moderately strong, statistically significant and always in the hypothesized directions. These results, and previously published data showing day-to-day variability of less than 1 point in L-IPF Dyspnea domain scores over a 14-day period in 125 patients with IPF [ 6 ], provide indirect evidence that a single administration of the L-PF-35 at each data collection timepoint/visit will likely suffice; administration on consecutive days with averaging of scores is unlikely to yield significant differences from a single administration.

In a previously published study, using methodology different from ours, the MWPC threshold for deterioration in the L-PF-44 Dyspnea domain was estimated at 6–7 points in the INBUILD trial population (which included patients with all forms of PF, including FHP, who had progressed within 24 months of enrollment) [ 21 ]. The population in the Pirfenidone in FHP trial was similar to the INBUILD population; in both trials, subjects had to have fibrosis and meet the same progression criteria. In our MWPC analysis, we employed predictive modeling, which is argued to yield the most precise MWPC estimates [ 16 ]. We did not include distribution-based estimates, because they fail to capture patients' perspectives, ignore the concept of “minimal”, and, arguably, should not be included at all in MWPC estimates [ 22 , 23 ]. We used a weighting approach that incorporated the correlation between the L-PF-35 Dyspnea domain score change and anchor change; doing so yields a less biased estimate than taking the mean or median of all estimates [ 17 ]. Regardless, it is reassuring that our point estimate aligns closely with the estimate generated from the INBUILD data.

Limitations

Suitable anchors were not available for analyses of the other L-PF-35 scores, so those must be left for future studies (e.g., no cough or fatigue questionnaires were included in the trial; the SGRQ total and L-PF-35 total scores are similar in name but not necessarily in the constructs they capture, and the same is true of the L-PF-35 Symptoms module and the SGRQ Symptoms domain). Moving forward, investigators would greatly help advance the science of measurement in the ILD field by including patient global impression (PGI) items for all the constructs being evaluated (e.g., here, these could have included PGI Dyspnea Severity, PGI Cough Severity/Frequency, PGI Fatigue Severity, PGI pulmonary fibrosis-related QOL or PGI general QOL). Additional limitations include the low number of subjects (of predominantly the same ethnic/racial background) and the single-center design of the trial that generated the data, both of which potentially limit the generalizability of the results to the broader FHP population. Because “validation” is not a threshold phenomenon and cannot be achieved in a single study, our results should be viewed as only a first, but important, step in the process of confirming L-PF-35 Dyspnea domain scores as fit for purpose in this population. Additional research, including validation work, concept elicitation, and cognitive debriefing studies in patients with FHP and other non-IPF populations, is encouraged.

Conclusions

L-PF-35 Dyspnea domain scores appear to possess acceptable reliability, validity and responsiveness for assessing dyspnea severity in patients with FHP. Additional studies are needed to further support its validity and to assess the psychometric properties of the other five L-PF-35 scores for assessing their respective constructs. For now, it is reasonable to use 5–8 points as the estimated range for the MWPC threshold for worsening for the L-PF-35 Dyspnea domain in patients with FHP.

Figure 1

Results for mixed-effects longitudinal models showing the relationship between baseline-to-weeks 13/26/39/52 changes in L-PF-35 Dyspnea domain scores and baseline-to-weeks 13/26/39/52 changes in anchor values (Panel A: UCSD anchor, Panel B: SGRQ Activity Domain anchor, Panel C: FVC% anchor, Panel D: DLCO% anchor). Footnote: UCSD = University of California San Diego Shortness of Breath Questionnaire; SGRQ = St. George’s Respiratory Questionnaire; FVC% = percentage of the predicted forced vital capacity; DLCO% = percentage of the predicted diffusing capacity of the lung for carbon monoxide; L-PF = 35-item Living with Pulmonary Fibrosis Questionnaire

Figure 2

CDF (Cumulative Distribution Function) plots showing baseline-to-week 26 changes in L-PF-35 Dyspnea domain scores for subgroups defined by anchor change, worse or not from baseline to week 26 (Panel A: UCSD anchor, Panel B: SGRQ Activity Domain anchor, Panel C: FVC% anchor, Panel D: DLCO% anchor) values. Footnote: Red = worsened according to anchor; Blue = not worsened (stable/improved) according to anchor; UCSD = University of California San Diego Shortness of Breath Questionnaire; SGRQ = St. George’s Respiratory Questionnaire; FVC% = percentage of the predicted forced vital capacity; DLCO% = percentage of the predicted diffusing capacity of the lung for carbon monoxide; L-PF = 35-item Living with Pulmonary Fibrosis Questionnaire. Definitions for anchors worsened: 1) ≥ 5 point increase for SGRQ Activity domain; 2) ≥ 5 point increase in UCSD score; 3) > 2% drop in FVC% (e.g., 70% to less than 68%); and 4) ≥ 5% drop in DLCO% (e.g., 70–65% or lower)

Data availability

Data are not publicly available. Parties interested in accessing the data used in this study are encouraged to contact Dr. Fernandez Perez ([email protected]).

Fernandez Perez ER, Swigris JJ, Forssen AV, Tourin O, Solomon JJ, Huie TJ, Olson AL, Brown KK. Identifying an inciting antigen is associated with improved survival in patients with chronic hypersensitivity pneumonitis. Chest. 2013;144:1644–51.


Hanak V, Golbin JM, Ryu JH. Causes and presenting features in 85 consecutive patients with hypersensitivity pneumonitis. Mayo Clin Proc. 2007;82:812–6.


Aronson KI, Hayward BJ, Robbins L, Kaner RJ, Martinez FJ, Safford MM. It’s difficult, it’s life changing what happens to you’ patient perspective on life with chronic hypersensitivity pneumonitis: a qualitative study. BMJ Open Resp Res. 2019;6:e000522.


Lubin M, Chen H, Elicker B, Jones KD, Collard HR, Lee JS. A Comparison of Health-Related Quality of Life in Idiopathic Pulmonary Fibrosis and Chronic Hypersensitivity Pneumonitis. Chest. 2014.

Fernandez Perez ER, Crooks JL, Lynch DA, Humphries SM, Koelsch TL, Swigris JJ, Solomon JJ, Mohning MP, Groshong SD, Fier K. Pirfenidone in fibrotic hypersensitivity pneumonitis: a double-blind, randomised clinical trial of efficacy and safety. Thorax. 2023.

Swigris JJ, Andrae DA, Churney T, Johnson N, Scholand MB, White ES, Matsui A, Raimundo K, Evans CJ. Development and initial validation analyses of the living with idiopathic pulmonary fibrosis questionnaire. Am J Respir Crit Care Med. 2020;202:1689–97.

Jones PW, Quirk FH, Baveystock CM. The St George's Respiratory Questionnaire. Respir Med. 1991;85(Suppl B):25–31; discussion 33–7.

Jones PW, Quirk FH, Baveystock CM, Littlejohns P. A self-complete measure of health status for chronic airflow limitation. The St. George’s respiratory questionnaire. Am Rev Respir Dis. 1992;145:1321–7.


Eakin EG, Resnikoff PM, Prewitt LM, Ries AL, Kaplan RM. Validation of a new dyspnea measure: the UCSD Shortness of Breath Questionnaire. University of California, San Diego. Chest. 1998; 113: 619–624.

Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, Bouter LM, de Vet HC. The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Qual Life Res. 2010;19:539–49.


Terwee CB, Prinsen CAC, Chiarotto A, Westerman MJ, Patrick DL, Alonso J, Bouter LM, de Vet HCW, Mokkink LB. COSMIN methodology for evaluating the content validity of patient-reported outcome measures: a Delphi study. Qual Life Res. 2018;27:1159–70.


Swigris JJ, Brown KK, Behr J, du Bois RM, King TE, Raghu G, Wamboldt FS. The SF-36 and SGRQ: validity and first look at minimum important differences in IPF. Respir Med. 2010;104:296–304.

Swigris JJ, Wilson H, Esser D, Conoscenti CS, Stansen W, Kline Leidy N, Brown KK. Psychometric properties of the St George's respiratory questionnaire in patients with idiopathic pulmonary fibrosis: insights from the INPULSIS trials. BMJ Open Respir Res. 2018;5:e000278.

Chen T, Tsai APY, Hur SA, Wong AW, Sadatsafavi M, Fisher JH, Johannson KA, Assayag D, Morisset J, Shapera S, Khalil N, Fell CD, Manganas H, Cox G, To T, Gershon AS, Hambly N, Halayko AJ, Wilcox PG, Kolb M, Ryerson CJ. Validation and minimum important difference of the UCSD Shortness of Breath Questionnaire in fibrotic interstitial lung disease. Respir Res. 2021;22:202.


du Bois RM, Weycker D, Albera C, Bradford WZ, Costabel U, Kartashov A, King TE, Lancaster L, Noble PW, Sahn SA, Thomeer M, Valeyre D, Wells AU. Forced vital capacity in patients with idiopathic pulmonary fibrosis: test properties and minimal clinically important difference. Am J Respir Crit Care Med. 2011.

Terluin B, Eekhout I, Terwee C, De Vet H. Minimal important change (MIC) based on a predictive modeling approach was more precise than MIC based on ROC analysis. J Clin Epidemiol 2015; 68.

Trigg A, Griffiths P. Triangulation of multiple meaningful change thresholds for patient-reported outcome scores. Qual Life Res. 2021;30:2755–64.

US Department of Health and Human Services and the Food and Drug Administration (CDER). Guidance for Industry, Food and Drug Administration Staff, and other stakeholders: patient-focused Drug Development. Incorporating Clinical Outcome Assessments Into Endpoints for Regulatory Decision-Making Silver Spring, MD; 2023.

Swigris JJ, Esser D, Conoscenti CS, Brown KK. The psychometric properties of the St George’s respiratory questionnaire (SGRQ) in patients with idiopathic pulmonary fibrosis: a literature review. Health Qual Life Outcomes. 2014;12:124.

Devji T, Carrasco-Labra A, Qasim A, Phillips M, Johnston BC, Devasenapathy N, Zeraatkar D, Bhatt M, Jin X, Brignardello-Petersen R, Urquhart O, Foroutan F, Schandelmaier S, Pardo-Hernandez H, Vernooij RW, Huang H, Rizwan Y, Siemieniuk R, Lytvyn L, Patrick DL, Ebrahim S, Furukawa T, Nesrallah G, Schunemann HJ, Bhandari M, Thabane L, Guyatt GH. Evaluating the credibility of anchor based estimates of minimal important differences for patient reported outcomes: instrument development and reliability study. BMJ. 2020;369:m1714.

Swigris JJ, Bushnell DM, Rohr K, Mueller H, Baldwin M, Inoue Y. Responsiveness and meaningful change thresholds of the Living with Pulmonary Fibrosis (L-PF) questionnaire Dyspnoea and Cough scores in patients with progressive fibrosing interstitial lung diseases. BMJ Open Respir Res. 2022;9.

Swigris J, Foster B, Johnson N. Determining and reporting minimal important change for patient-reported outcome instruments in pulmonary medicine. Eur Respir J 2022; 60.

Terwee CB, Peipert JD, Chapman R, Lai JS, Terluin B, Cella D, Griffith P, Mokkink LB. Minimal important change (MIC): a conceptual clarification and systematic review of MIC estimates of PROMIS measures. Qual Life Res. 2021;30:2729–54.


Acknowledgements


There was no funding for this study. Genentech/Roche was the sponsor of the Pirfenidone in Chronic HP trial.

Author information

Authors and Affiliations

Center for Interstitial Lung Disease, National Jewish Health, 1400 Jackson Street, G07, Denver, CO 80206, USA

Jeffrey J. Swigris & Evans R. Fernández Pérez

Division of Pulmonary and Critical Care Medicine, Weill Cornell College of Medicine, New York, NY, USA

Kerri Aronson


Contributions

Study conceptualization: JJS, KA, ERFP. Data acquisition: ERFP. Data analysis: JJS. Interpretation of results: JJS, KA, ERFP. Manuscript preparation and approval of submitted version: JJS, KA, ERFP.

Corresponding author

Correspondence to Jeffrey J. Swigris .

Ethics declarations

Ethics approval and consent to participate

This analysis was performed under a research protocol approved by the National Jewish Health central Institutional Review Board (HS# 3034). All methods were carried out in accordance with relevant guidelines and regulations. Informed consent was obtained from all subjects enrolled in the trial.

Consent for publication

Not applicable.

Competing interests

JJS is the developer of L-PF-44, L-PF-35 and other questionnaires designed to assess outcomes in patients with various forms of interstitial lung disease. KA and ERFP report no conflict related to this study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Take-home message: Our analyses begin to build the foundation supporting scores from the 35-item Living with Pulmonary Fibrosis questionnaire's Dyspnea domain as possessing psychometric characteristics that make it a suitable measure of dyspnea severity in patients with fibrotic hypersensitivity pneumonitis. The estimated meaningful within-patient change threshold for deterioration in this patient population is 6.6 points, with a range of 5–8.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1


About this article

Cite this article.

Swigris, J.J., Aronson, K. & Fernández Pérez, E.R. A first look at the reliability, validity and responsiveness of L-PF-35 dyspnea domain scores in fibrotic hypersensitivity pneumonitis. BMC Pulm Med 24 , 188 (2024). https://doi.org/10.1186/s12890-024-02991-1

Download citation

Received : 25 July 2023

Accepted : 02 April 2024

Published : 19 April 2024

DOI : https://doi.org/10.1186/s12890-024-02991-1


  • Hypersensitivity pneumonitis
  • Patient-reported outcome



ORIGINAL RESEARCH article

Translation and psychometric validation of the Chinese version of the Metacognitive Awareness Scale among nursing students

Shasha Li

  • 1 Department of Nursing, College of Medical Science, Huzhou University, Huzhou, China
  • 2 Department of Nursing, School of Medicine, Weifang University of Science and Technology, Shouguang, China
  • 3 Hebei University of Chinese Medicine, Shijiazhuang, Hebei Province, China

The final, formatted version of the article will be published soon.


This study endeavors to translate and psychometrically validate the Metacognitive Awareness Scale (MAS) for nursing students in China.

Method: A total of 592 nursing students were enlisted from four universities situated in the eastern, southern, western, and northern regions of China. Content validity and reliability were evaluated using the content validity index, item-total correlation coefficients, and Cronbach's alpha coefficients, respectively. Convergent validity examined the goodness of fit among sub-scales through the average variance extracted and composite reliability.

Results: Exploratory factor analysis confirmed the first-order and second-order factor models, contributing to cumulative variances of 89.4% and 59.5%, respectively. The Cronbach's alpha values were .963 and .801, respectively. Confirmatory factor analysis outcomes indicated an excellent overall fit index for the model, satisfying the convergent validity criteria and achieving a target coefficient of 96.0%, which is consistent with the original scale structure.

Conclusion: The Chinese version of the MAS (C-MAS) is a reliable and valid instrument for assessing metacognitive awareness among Chinese nursing students. Further research should consider a broader sample of nursing students across China to reinforce the scale's applicability.
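
For context, both convergent-validity quantities named here are computed from standardized factor loadings: the average variance extracted (AVE) is the mean squared loading, and composite reliability (CR) contrasts the squared sum of loadings with residual variance. A minimal sketch with hypothetical loadings:

```python
import numpy as np

def ave_and_cr(loadings):
    """Average variance extracted and composite reliability from standardized loadings."""
    lam = np.asarray(loadings, dtype=float)
    errors = 1 - lam ** 2  # residual variances under standardization
    ave = np.mean(lam ** 2)
    cr = lam.sum() ** 2 / (lam.sum() ** 2 + errors.sum())
    return ave, cr

ave, cr = ave_and_cr([0.78, 0.82, 0.75, 0.80])  # hypothetical sub-scale loadings
print(f"AVE = {ave:.2f}, CR = {cr:.2f}")  # conventional cut-offs: AVE > 0.5, CR > 0.7
```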

Keywords: metacognitive awareness, nursing students, transcultural, reliability, validity

Received: 15 Jan 2024; Accepted: 18 Apr 2024.

Copyright: © 2024 Li, Zhao, Jia, Liu and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Shasha Li, Department of Nursing, College of Medical Science, Huzhou University, Huzhou, China

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

  • Open access
  • Published: 11 April 2024

Validity and reliability of the Turkish version of the Transition of Primiparas Becoming Mothers Scale

  • Zila Özlem Kirbaş   ORCID: orcid.org/0000-0003-4030-5442 1 ,
  • Elif Odabaşi Aktaş   ORCID: orcid.org/0000-0002-3435-7118 2 &
  • Hava Özkan   ORCID: orcid.org/0000-0001-7314-0934 3  

BMC Pregnancy and Childbirth volume  24 , Article number:  259 ( 2024 ) Cite this article



The transition to motherhood is an important event in a woman's life and represents a significant developmental process that brings physical, psychological and social changes as she takes on a new role. However, research on the transition to motherhood in Turkish society is scarce, and a comprehensive, practical and reliable tool is needed to evaluate the transition to motherhood in primiparous mothers. This study evaluated the reliability and validity of the Turkish version of the Transition of Primiparas Becoming Mothers Scale (TPM-S) for evaluating primiparous mothers' transition to motherhood.

This methodological research was carried out in the obstetrics and gynecology outpatient clinics, pediatric outpatient clinics, and family health centers of a hospital in Türkiye. The sample consisted of primiparous mothers of 0- to 6-month-old babies who visited the clinics and family health centers for routine postnatal examinations (n = 305). After evaluating the language equivalence and content validity of the scale, test-retest reliability, internal consistency, and construct validity were examined. Factor analysis, Pearson's correlation, test-retest reliability, and Cronbach's alpha were employed to evaluate structural validity and reliability.

The final TPM-S comprised 25 items in two dimensions. Exploratory factor analysis revealed a two-factor solution accounting for 59.276% of the variance. Confirmatory factor analysis showed that the two-factor model reached a satisfactory fit after modification: the comparative fit index was 0.894, the Tucker–Lewis index was 0.882, and the root mean square error of approximation was 0.079. The content validity index of the scale ranged from 0.56 to 0.77. Cronbach's alpha was 0.93 for the total scale, and the test–retest reliability was 0.96.
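
For readers less familiar with these fit indices, all three are functions of the model and null-model chi-square statistics. The sketch below shows the standard formulas; the chi-square inputs are hypothetical values chosen to be roughly consistent with the indices reported above, not the study's actual statistics.

```python
import math

def fit_indices(chi2, df, chi2_null, df_null, n):
    """Standard CFA fit indices from chi-square statistics."""
    # RMSEA: badness of fit per degree of freedom, adjusted for sample size
    rmsea = math.sqrt(max(chi2 - df, 0) / (df * (n - 1)))
    # CFI: improvement over the null (independence) model, bounded to [0, 1]
    cfi = 1 - max(chi2 - df, 0) / max(chi2_null - df_null, chi2 - df, 0)
    # TLI (Tucker-Lewis index): like CFI but penalizes model complexity
    tli = ((chi2_null / df_null) - (chi2 / df)) / ((chi2_null / df_null) - 1)
    return rmsea, cfi, tli

# Hypothetical chi-square values for a 25-item, two-factor model, n = 305
rmsea, cfi, tli = fit_indices(chi2=820.0, df=274, chi2_null=5400.0, df_null=300, n=305)
print(f"RMSEA = {rmsea:.3f}, CFI = {cfi:.3f}, TLI = {tli:.3f}")
```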

Conclusions

The TPM-S is a valid and reliable tool for evaluating the transition to motherhood among primiparous mothers of 0- to 6-month-old babies in Türkiye. Turkish researchers and healthcare professionals can routinely administer it to primiparous mothers in the first six months after birth to evaluate their transition to motherhood.


Introduction

The transition to motherhood is one of the most significant developmental life events for women. Becoming a mother means transitioning from a known situation to an unknown status and a new role, and it requires reinterpreting goals, behaviors, and responsibilities to derive new meanings [ 1 , 2 , 3 ]. The transition to motherhood begins during pregnancy and continues into the postnatal period [ 4 ]. The postpartum period is when adaptation to parenthood and secure attachment with the newborn can develop [ 5 ]. Moving toward a new normal, the woman begins to structure her motherhood to suit herself and her family according to her future goals. She adapts to changing relationships with her spouse, family, and friends, and considerable cognitive restructuring occurs as she learns her baby's cues, discovers what is best for her baby, and adjusts to her new reality [ 6 ].

Gradually, women acquiring the maternal role learn the behaviors expected of it. A new mother imitates the maternal performances she observes, following the rules of motherhood and the guidance of other mothers. Thus, she develops a unique behavioral pattern and gains self-confidence and competence in her maternal role [ 7 ]. Both the mother's health and well-being are at risk during this period, as are her baby's well-being and the stability of her family. Some special conditions, such as a complex and lengthy postpartum recovery or newborn admission to the intensive care unit, may disrupt the transition to motherhood. Nurses and midwives need a basic understanding of the motherhood transition to facilitate the process for mothers and babies at risk [ 4 ].

The stage after birth involves becoming familiar with the maternal role and learning to care for a child, as childcare becomes part of daily life [ 4 ]. Slade et al. [ 8 ] stated that the process of becoming a mother causes many profound and permanent changes in pregnant women's lives; therefore, pregnant women may have difficulty accepting the role of motherhood. Study results have shown that primiparous women and those who are not ready for motherhood experience intense stress during this period [ 9 , 10 ]. Women with multiple pregnancies may also experience problems adapting to motherhood because they experience more anxiety [ 9 ]. The maternal perceptions of pregnant women who are hospitalized due to any risk are also negatively affected [ 11 ]. Shorey et al. [ 12 ] stated that primiparous women's perception of parental competence is lower than that of multiparous women.

The transition to motherhood and the maternal role is affected by the social and cultural values of the individual and the society in which she lives. In particular, the value society places on the status of motherhood significantly affects the transition [ 13 ]. In Turkish society, women perceive motherhood as their most essential duty, and this task constitutes a large part of women's daily lives. However, women may sometimes feel inadequate in their motherhood roles because of their individual, physical, and psychological characteristics and their social, cultural, and economic situations [ 14 ]. This situation can affect a child's health, growth, and development. According to motherhood theory, nurses and midwives are health professionals with essential roles in women's transition to motherhood and their adaptation to maternal roles [ 15 ]. Therefore, nurses and midwives must be able to identify and measure progress in the transition to motherhood to provide adequate physical, psychosocial, and self-care support to mothers. For this purpose, Katou et al. [ 4 ] developed a measurement tool to determine the progress of Japanese primiparas in the transition to motherhood. In Türkiye, however, there is currently no measurement tool for determining how well women are adjusting to life in the role of mother.

Measurement tools regarding the acquisition of the postpartum maternal role are being developed in Türkiye and in various other countries. However, since the conditions related to motherhood differ from country to country, a measurement tool initially developed elsewhere cannot simply be applied to the Turkish population without careful adaptation. In addition, existing tools measure confidence, capacity, compassion, satisfaction and sense of self-efficacy as indicators of acquisition of the maternal role [ 16 , 17 , 18 , 19 , 20 ], and instruments have been developed to determine whether the maternal role has been achieved or to assess poor postpartum physical and mental states [ 20 , 21 , 22 , 23 , 24 ]. There is currently no tool for determining the actual situation of adjusting to life in the role of motherhood in Türkiye. No study of the adaptation, validity, or reliability of this scale, which was developed in Japan, has previously been conducted in another culture; this is the first cross-cultural adaptation, validity, and reliability study of the scale. Our study was carried out via face-to-face interviews in the gynecology and obstetrics outpatient clinics, pediatric outpatient clinics and family health centers of a hospital in Türkiye between November 2022 and May 2023.

To support mothers, it is imperative to understand the process of adjusting to life in the mother’s role during the transition to becoming a mother. Therefore, in this study, we aimed to develop a Turkish version of the measurement tool developed to determine the progress of Japanese primiparas in the transition to motherhood and to examine its psychometric properties and factor structure, thereby shedding light on the transition of Turkish primiparas in the process of becoming mothers.

Aim and design

This study is a methodological study testing the validity and reliability of the Transition of Primiparas Becoming Mothers Scale (TPM-S) in the Turkish context. The scale aims to determine the progress of primiparous mothers in the transition to motherhood. The study findings were reported in line with the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) guidelines.

Participants and procedures

This methodological research was carried out in the obstetrics and gynecology outpatient clinics, pediatric outpatient clinics, and family health centers of a hospital in Türkiye between November 2022 and May 2023. The sample consisted of primiparous mothers of 0- to 6-month-old babies who visited the clinics and family health centers for routine postnatal examinations. Since routine checks in Türkiye are performed by hospitals and family health centers (FHCs), data were collected in these centers for ease of access to mothers. When adapting a measurement tool to another culture, it is recommended to include at least 5–10 times as many participants as the number of items in the measurement tool [ 25 ]. However, Hogarty et al. [ 26 ] suggested that this ratio should be 20:1. In scale psychometric studies, it is recommended that the sample be 5–20 times the number of items forming the scale, that the factor structure be stable, and that the sample size be at least 300 participants so that the results can be generalized [ 27 ]. Accordingly, this study’s target sample size was 300 (30 × 10 = 300), with at least ten primiparous mothers per item. Primiparous mothers of 0- to 6-month-old babies who came to the clinics and FHCs between the data collection dates were included in the study by a random sampling method; in total, 305 primiparous mothers participated. Power analysis was performed to determine whether the sample was sufficient to detect significant differences; it revealed a power of 100% with an effect size of 0.7 (α = 0.05). The results showed that the sample size was sufficient.
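To make these rules of thumb concrete, the short Python sketch below (illustrative only, not part of the study’s analysis code) computes the minimum sample implied by each ratio for a 30-item instrument:

```python
# Illustrative sketch: sample-size rules of thumb for scale adaptation,
# applied to the 30-item TPM-S. Not taken from the authors' analysis.
n_items = 30

rules = [
    (5, "lower bound of the 5-10x recommendation [25]"),
    (10, "ratio targeted in this study"),
    (20, "ratio suggested by Hogarty et al. [26]"),
]
for ratio, note in rules:
    print(f"{ratio} participants per item ({note}): n >= {n_items * ratio}")

# The study targeted 30 x 10 = 300 mothers and enrolled 305,
# which also satisfies the minimum of 300 suggested in [27].
```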

The inclusion criteria were as follows: being a primiparous mother between 1 and 6 months postpartum who had a live birth at the age of 18 or over (excluding mothers of a baby with a congenital anomaly, a premature baby, a low-birth-weight baby, or a multiple birth), verbally declaring that she had not been diagnosed with any mental disorder, and being open to communication. Mothers who did not want to participate were excluded. When mothers brought their babies for routine examination or vaccination at the hospitals and centers where the research was conducted, they were informed about the purpose and procedure of the research, and informed consent was obtained from those who agreed to participate. After consenting, the mothers completed the data collection forms individually to reduce the influence of the researchers on their responses. Explanations were given when participants had questions about the study.

Data were collected with a sociodemographic questionnaire and the Transition of Primiparas Becoming Mothers Scale (TPM-S).

Sociodemographic questionnaire

The researchers created the survey by reviewing the relevant literature. It consists of sociodemographic information such as maternal age, educational background, working status, household income, place of residence, family type, and postnatal month and type of birth. The mothers completed the questionnaires individually.

The Transition of Primiparas Becoming Mothers Scale

Mercer [ 28 ] suggested that becoming a mother should be understood as an ongoing process rather than a one-time attainment of the maternal role, as the maternal role involves coping with the child’s growth and changes in the environment while mothers constantly experience transitions throughout their lives. The scale is based on Mercer’s concept of the maternal role and was developed to measure the transition of primiparous mothers to the maternal role. It consists of 30 items scored on a 5-point Likert-type scale (1 = “Not applicable” to 5 = “Applicable”), so the total score varies between 30 and 150. Items 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 26, and 29 are reverse scored. High scores indicate greater adaptation during the transition process of primiparas. The contents of the five dimensions are as follows: Factor I: Feeling of inadequacy in the maternal role; Factor II: What does child care mean to me?; Factor III: Feeling of mastery in fulfilling the role of mother; Factor IV: Relationship with one’s partner in child care; and Factor V: One’s own improving parenting perspective. The Cronbach’s α coefficients of the five subdimensions were 0.871 for Factor I, 0.870 for Factor II, 0.751 for Factor III, 0.767 for Factor IV, and 0.648 for Factor V [ 4 ].

Translation of the Transition of Primiparas Becoming Mothers Scale

The back-translation method was used to test linguistic validity, that is, the semantic equivalence of the scale’s items in the target language [ 29 ]; it is the most common method used to assess linguistic validity [ 30 ]. In this study, three translators translated the TPM-S from English to Turkish. An academic fluent in both languages checked the translated version, and the researchers revised the scale based on this feedback. Three translators fluent in both languages then translated the Turkish version back into English. The researchers rechecked all translated versions, selected the items that best represented the dimensions in both languages, and made some changes. Ten experts in women’s health were then consulted about the TPM-S.

Data analysis

The data obtained in the study were analyzed using SPSS (Statistical Package for the Social Sciences) for Windows 26.0 and AMOS (Analysis of Moment Structures) 23.0. The sociodemographic characteristics of the mothers were analyzed using descriptive statistics: percentages for categorical data and means and standard deviations for continuous data. The Davis technique was used to measure the level of expert agreement on the TPM-S items. The Kaiser‒Meyer‒Olkin (KMO) test was used to assess sampling adequacy, and Bartlett’s test of sphericity was used to determine whether the correlation matrix was suitable for factor analysis. The root mean square error of approximation (RMSEA), comparative fit index (CFI), goodness-of-fit index (GFI), and Tucker‒Lewis index (TLI) were used for confirmatory factor analysis. Test-retest reliability was used to assess the stability of the TPM-S, and Cronbach’s alpha was used for internal consistency. Statistical significance was set at p < 0.05.

Sample characteristics

Three hundred and five primiparous mothers were included in the study. The average age of the mothers was 26.87 ± 4.62 years; 42.6% were high school graduates ( n  = 130), and 71.5% were not employed ( n  = 218). A total of 50.8% of the participants lived in a district ( n  = 155), 70.8% had a nuclear family ( n  = 216), and 52.8% were at the middle-income level ( n  = 161). More than half of the mothers had a vaginal birth ( n  = 160), and 25.6% were in the third postpartum month ( n  = 78) (Table  1 ).

Content validity

The Davis technique [ 31 ] was used to assess the scale’s content validity. To ensure the content and face validity of the scale, the opinions of academics who are experts in nursing/midwifery in Türkiye were consulted. The experts evaluated each item on the scale with one of the following options:

(a) “Appropriate.”

(b) “The item should be slightly revised.”

(c) “The item should be seriously revised.”

(d) “The item is not appropriate.”

The content validity index (CVI) was calculated by dividing the number of experts who marked (a) or (b) for each item by the total number of experts who expressed an opinion. The results showed that the experts rated the interpretability/clarity and cultural appropriateness of the TPM-S items highly (CVI = 0.97) and agreed that no item needed to be changed. To check the clarity of the scale, 20 primiparous mothers were asked to complete it; none reported any problems with the scale items. These 20 mothers, who formed the pilot study, were not included in the main study.
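As an illustration of the Davis technique described above, the sketch below computes item-level and scale-level content validity indices from a hypothetical rating matrix; the real panel ratings are not public, so the data here are simulated:

```python
# Hypothetical Davis-technique CVI computation. Rows are the 10 experts,
# columns are the 30 items; values are the options (a)-(d) above.
import numpy as np

rng = np.random.default_rng(42)
ratings = rng.choice(np.array(["a", "b", "c", "d"]),
                     size=(10, 30), p=[0.80, 0.17, 0.02, 0.01])

# Item CVI: proportion of experts choosing (a) or (b) for that item.
item_cvi = np.isin(ratings, ["a", "b"]).mean(axis=0)

# Scale CVI as the mean of the item CVIs (the study reports 0.97).
print("scale CVI:", round(item_cvi.mean(), 2))
```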

Construct validity

Before exploratory factor analysis was applied, the Kaiser–Meyer–Olkin (KMO) test was used to check whether the sample was suitable for factor analysis. The analysis revealed a KMO value of 0.933; KMO values between 0.5 and 1.0 are considered acceptable, while values below 0.5 indicate that factor analysis is unsuitable for the dataset [ 32 ]. In line with this result, the sample adequacy was judged sufficient to conduct factor analysis. Additionally, Bartlett’s test of sphericity yielded an acceptable chi-square value (χ²(300) = 5286.273, p  < 0.05; Table  2 ).
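The same two checks can be run outside SPSS. The sketch below uses the open-source factor_analyzer package on a DataFrame `df` of the 30 item responses; the file name is a placeholder, not a real artifact of the study:

```python
# KMO and Bartlett checks with factor_analyzer (assumed setup: one row
# per mother, one column per TPM-S item).
import pandas as pd
from factor_analyzer.factor_analyzer import (
    calculate_bartlett_sphericity,
    calculate_kmo,
)

df = pd.read_csv("tpms_responses.csv")  # hypothetical data file

chi_square, p_value = calculate_bartlett_sphericity(df)
kmo_per_item, kmo_overall = calculate_kmo(df)

print(f"Bartlett: chi2 = {chi_square:.3f}, p = {p_value:.4g}")
print(f"overall KMO = {kmo_overall:.3f}")  # the study reports 0.933

# Rule of thumb from the text: KMO between 0.5 and 1.0 is acceptable [32].
assert kmo_overall >= 0.5, "data unsuitable for factor analysis"
```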

Exploratory factor analysis and factor naming

Factor loadings describe the relationships between items and factors. No limit was placed on the number of dimensions when performing the EFA. Items with low factor loadings, items that did not load on a theoretically meaningful dimension, items that loaded on more than one factor, and items whose loadings on different factors differed by less than 0.01 were excluded one by one, and the analysis was repeated after each exclusion, revealing the final structure of the scale (Table  2 ). The criterion for an item to remain on the scale was a factor loading above 0.40 [ 33 ]; Başol [ 34 ] likewise stated that the discrimination power of an item, expressed as the coefficient of determination or validity of the measured feature, should be 0.40 or above. In the original scale, the 30 items were grouped into five subscales. In the exploratory factor analysis conducted to reveal the factor pattern of the scale, and based on the scree plot, five items with low factor loadings were removed (items 9, 10, 14, 26, and 29). The factor loadings of the remaining items ranged from 0.58 to 0.81, and the remaining 25 items loaded on two subdimensions (Fig.  1 ). Factor I was named ‘sense of mastery in fulfilling the role of mother’ and consisted of items related to understanding the demands of the child and starting to act like a mother. Factor II was named ‘feeling of inadequacy in the maternal role’ and consisted of items related to inadequacy and lack of self-confidence in one’s parenting and the inability to fully fulfill the maternal role. Together, these factors explained 59.276% of the total variance (Table  2 ); in multifactor designs, explained variance greater than 50% is considered sufficient [ 35 ]. When the correlations between variables were examined, the factor loadings of the items were above 0.40, and all correlations were significant (Table  2 ). The factors were rotated with the varimax rotation procedure [ 36 ].

Figure 1. Scree plot for the factor components of the TPM-S.
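Readers wishing to replicate this kind of EFA workflow can do so with the same package; the sketch below (hypothetical data again) applies varimax rotation, flags items below the 0.40 loading criterion, and reports the cumulative variance explained:

```python
# Illustrative EFA mirroring the procedure in the text: varimax rotation,
# 0.40 loading criterion, variance explained.
import pandas as pd
from factor_analyzer import FactorAnalyzer

df = pd.read_csv("tpms_responses.csv")  # hypothetical data file

fa = FactorAnalyzer(n_factors=2, rotation="varimax")  # final 2-factor structure
fa.fit(df)

loadings = pd.DataFrame(fa.loadings_, index=df.columns,
                        columns=["Factor_I", "Factor_II"])

# Flag items whose strongest loading falls below the 0.40 criterion [33, 34].
weak = loadings.abs().max(axis=1) < 0.40
print("candidates for removal:", list(loadings.index[weak]))

# Cumulative proportion of variance explained (the study reports 59.276%).
_, _, cumulative = fa.get_factor_variance()
print("cumulative variance explained:", round(float(cumulative[-1]), 3))
```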

Confirmatory factor analysis

The standardized values (factor loadings), model goodness of fit, validity, and composite reliability values of the models tested with first-level CFA are given in Table  3 , together with the CR and AVE values of the factors. All CR values are above 0.70, indicating that the factors are reliable, and all AVE values are above 0.50, indicating that convergent validity was achieved.

The confirmatory factor analysis confirmed that the 25 items making up the scale were related to the two-dimensional scale structure (Table  4 ). Improvements were then made to the model: variables that reduced fit were identified, and covariances were added between residuals that covaried highly. The fit indices recalculated after these modifications, along with their accepted values, are shown in Table  4 . We found that CMIN/DF = 3.022. The RMSEA is an index that evaluates fit as a function of degrees of freedom; higher values indicate poorer fit, and a value below 0.08 indicates an acceptable fit [ 37 ]. We found an RMSEA of 0.079 (Table  4 ). In this study, the CFI was 0.894, the GFI was 0.817, the NFI was 0.850, and the TLI was 0.882 (Fig.  2 ).
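How these indices relate to the underlying chi-square statistics can be made explicit with the standard formulas. In the sketch below, the model and baseline chi-square values and degrees of freedom are hypothetical, chosen only so that CMIN/DF ≈ 3.02 and the derived indices land near the reported values; the thresholds are those cited in the text:

```python
# Standard fit-index formulas; all inputs here are hypothetical.
import math

def fit_indices(chi2, df, chi2_base, df_base, n):
    cmin_df = chi2 / df                                      # acceptable if < 5 [47]
    rmsea = math.sqrt(max(chi2 - df, 0) / (df * (n - 1)))    # acceptable if < 0.08 [37]
    nfi = (chi2_base - chi2) / chi2_base                     # >= 0.90, or > 0.80 [53, 54]
    tli = ((chi2_base / df_base) - (chi2 / df)) / ((chi2_base / df_base) - 1)
    cfi = 1 - max(chi2 - df, 0) / max(chi2_base - df_base, chi2 - df, 0)
    return dict(cmin_df=cmin_df, rmsea=rmsea, nfi=nfi, tli=tli, cfi=cfi)

# Hypothetical inputs yielding CMIN/DF = 3.022 for n = 305.
print(fit_indices(chi2=815.9, df=270, chi2_base=5600.0, df_base=300, n=305))
```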


Internal consistency reliability

Test-retest reliability was used to assess the consistency of the TPM-S over time. The scale was first administered to 40 primiparous mothers who returned for a routine postpartum outpatient clinic checkup (first measurement). The retest was administered to the same 40 mothers when they came to the FHCs to vaccinate their babies on the 30th day (second measurement). The Pearson product-moment correlation coefficient was used to determine the correlation between the test and retest scores, and there was a significant correlation between them (r = 0.96, p  < 0.01; Table  4 ) [ 27 , 38 ]. In addition, item analysis based on lower and upper groups was conducted to test internal consistency reliability; the findings are summarized in Table  3 . The comparison showed a significant difference between the mean item scores of the lower and upper groups at the p  < 0.05 level for all items in each subdimension. On this basis, the scale’s subdimensions are discriminative in measuring the intended quality.
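A minimal sketch of the test-retest computation, with simulated score vectors standing in for the 40 mothers’ scale totals:

```python
# Test-retest correlation with scipy; the two score vectors are
# hypothetical stand-ins for the first and second administrations.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
test = rng.uniform(30, 150, size=40)            # first measurement
retest = test + rng.normal(0, 5, size=40)       # 30 days later

r, p = pearsonr(test, retest)
print(f"test-retest r = {r:.2f}, p = {p:.2g}")  # the study reports r = 0.96
```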

Reliability results

The Cronbach’s alpha value of the overall scale was 0.93, which shows that the scale is highly reliable. Among the subscales, the Cronbach’s alpha of Factor I was 0.95, and that of Factor II was 0.91 (Table  2 ), showing that the subdimensions are also highly reliable. There was a significant correlation between the total scale score and the item scores ( p  < 0.001).
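Cronbach’s alpha itself is straightforward to compute from an item-score matrix using the standard formula α = (k / (k − 1)) · (1 − Σs²ᵢ / s²ₜ); the sketch below applies it to simulated data:

```python
# Minimal Cronbach's alpha implementation on simulated item scores.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# 305 respondents x 25 items sharing a common latent construct
# (hypothetical data; the study reports alpha = 0.93 for the full scale).
rng = np.random.default_rng(1)
latent = rng.normal(size=(305, 1))
items = latent + rng.normal(scale=0.8, size=(305, 25))
print(round(cronbach_alpha(items), 2))
```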

We examined the psychometric properties of the version of the TPM-S adapted to Turkish culture. In Türkiye, measurement tools that assess the maternal role of primiparous mothers are scarce, and no existing tool measures the transition to motherhood. This study is the first to psychometrically test the TPM-S among primiparous mothers in Türkiye. The psychometric results of the adapted scale were consistent with those of the original scale [ 4 ]. Although the adapted scale remained close to the original despite the loss of some subscales and items, it could not be compared with other studies because the scale has not yet been adapted to other cultures.

The results show that the TPM-S is a valid and reliable measurement tool for primiparous mothers in Türkiye, and one that can be applied internationally and translated into other languages. The Cronbach’s alpha value of the TPM-S was 0.93, exceeding the recommended value [ 39 , 40 ]. The Cronbach’s alpha of Factor I was 0.95 and that of Factor II was 0.91, similar to the values of the original TPM-S, in which the Cronbach’s alpha values corresponding to these factors were 0.75 and 0.87 [ 4 ]. Since this is the first study to determine the validity and reliability of the TPM-S in primiparous mothers from a different culture, further multicenter validation studies should be conducted with larger samples from various cultures to confirm our results.

The test-retest method was used to assess the reliability and internal consistency of the TPM-S, and there was a significant correlation between the two tests performed within the specified period ( r  = 0.96, p  < 0.001). A parallel form, rather than a test-retest, was used to check the reliability of the original scale. Since there is no other research on the TPM-S in primiparous mothers, further validity and reliability studies are needed to test our results. In the item analysis based on lower and upper groups conducted to test internal consistency, the adapted scale proved discriminative in measuring the intended quality, as there was a significant difference between the lower- and upper-group item scores for each subdimension.

Exploratory and confirmatory factor analyses were conducted to determine the scale’s construct validity. The KMO value was 0.93, and Bartlett’s test of sphericity was significant ( p  < 0.001), indicating that the sample was sufficient [ 41 , 42 ] and that the correlations between the variables were adequate for factor analysis. Exploratory factor analysis (EFA) was used to examine the factor loadings of the TPM-S items; factor loadings describe the relationships between items and factors. Başol [ 34 ] stated that the discrimination power of an item, expressed as the coefficient of determination or validity of the measured feature, should be 0.40 or above. Five items with factor loadings below 0.40 were removed from the scale; the five subdimensions of the original scale were thus reduced to two, and the 30 items were reduced to 25 [ 4 ]. Factor loadings of 0.32–0.44 are considered poor, 0.45–0.49 fair, 0.50–0.62 good, 0.63–0.70 very good, and ≥ 0.71 excellent [ 43 , 44 ]. The factor loadings of the remaining TPM-S items were between 0.58 and 0.81. In the original scale, the common factor loadings of each item were > 0.40, suggesting that each item was highly related to its identified factor [ 4 ].
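These interpretive bands are easy to encode when screening a loading matrix; the helper below simply restates the cutoffs cited above [ 43 , 44 ]:

```python
# Loading-quality bands as cited in the text [43, 44].
def loading_quality(loading: float) -> str:
    l = abs(loading)
    if l >= 0.71:
        return "excellent"
    if l >= 0.63:
        return "very good"
    if l >= 0.50:
        return "good"
    if l >= 0.45:
        return "fair"
    if l >= 0.32:
        return "poor"
    return "below interpretive threshold"

# The retained TPM-S loadings (0.58-0.81) span "good" to "excellent".
for x in (0.58, 0.63, 0.81):
    print(x, "->", loading_quality(x))
```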

The CR values of the model tested with first-level CFA showed that the scale has a sufficient level of reliability, and the AVE values indicated that convergent validity was achieved. CR values must be greater than 0.70 [ 45 ], and AVE values must be 0.50 or greater [ 46 ]; both were at acceptable levels here.
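For reference, CR and AVE can be computed directly from standardized loadings with the usual formulas CR = (Σλ)² / ((Σλ)² + Σ(1 − λ²)) and AVE = Σλ² / k; the loading vector in the sketch below is hypothetical:

```python
# Composite reliability (CR) and average variance extracted (AVE)
# from standardized CFA loadings; the loadings are hypothetical.
import numpy as np

def cr_and_ave(loadings: np.ndarray) -> tuple[float, float]:
    lam_sq = loadings ** 2
    error = 1 - lam_sq                 # item error variances
    cr = loadings.sum() ** 2 / (loadings.sum() ** 2 + error.sum())
    ave = lam_sq.mean()
    return cr, ave

cr, ave = cr_and_ave(np.array([0.58, 0.66, 0.72, 0.75, 0.81]))
print(f"CR = {cr:.3f} (> 0.70 required [45]), AVE = {ave:.3f} (>= 0.50 required [46])")
```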

CFA was performed to assess the construct validity of the original scale when adapting it to another culture; the aim is to determine the similarities and differences between the adapted and original scales. The chi-square value is the most basic measure used to test the general suitability of a model, and degrees of freedom are an essential element of the chi-square test. When judging how well the model fits the data, multiple fit indices should be examined. The chi-square/degrees-of-freedom ratio is used as one criterion of appropriateness, and a ratio of less than 5 indicates an acceptable value for the scale [ 47 ].

For the RMSEA, a measure of the discrepancy between the population and the observed covariance per degree of freedom, a value below 0.08 indicates an acceptable fit [ 37 , 48 ]. The study found the RMSEA to be at an acceptable level.

The CFI compares the fitted model with the null model based on the sample covariance matrix and takes values between 0 and 1; fit improves as the value approaches one, making it a measure of model fit relative to an independence model assumed to fit the data poorly [ 49 ]. In this study, the CFI value indicates that the model is suitable. The GFI is essentially the ratio of the model-implied covariances and variances to the measured variances and covariances; in short, it is a proportional comparison of the real and modeled data [ 50 ]. The GFI statistic takes values between 0 and 1 and varies inversely with the degrees of freedom; consequently, it tends to increase as the sample size increases [ 51 ]. Traditionally, a threshold value of 0.90 is recommended, but when sample sizes are small and factor loadings are low, a threshold of up to 0.95 can be applied [ 51 ].

The GFI in this study was 0.817. A GFI value below 0.90 can reflect a relatively small number of observed variables [ 52 ]; therefore, the number of items may have affected this result. However, increasing the number of items would lower the response rate and increase the number of unfinished surveys and inappropriate responses. Finally, the normalized fit index (NFI) ranges from 0 to 1, with values of 0.90 or above indicating an acceptable fit [ 53 ]; when warranted, as the numbers of items and participants increase, a value above 0.80 may be considered acceptable [ 54 ]. The NFI value in this study was therefore at an acceptable level. Studies with larger samples are needed to better verify the model fit of the scale, since the NFI may underestimate fit in models estimated with small samples (below 200) [ 54 ].

The TLI is used as an alternative to the NFI to mitigate this problem. There are many different opinions in the literature regarding the TLI threshold value; a TLI above 0.80 is acceptable [ 49 ]. The TLI in this study was 0.88, so the model fit well in this sample of 305 mothers.

Scale practicality

The current study adapted the original scale to Turkish society based on data from Turkish primiparas up to 6 months postpartum who were adjusting to life as mothers. Unlike the original scale, the adapted scale was introduced to Turkish society and the literature with 25 items and two subdimensions. Scored by the overall scale and its subdimensions, it can be used to reveal which aspects of life as a mother the participants had become accustomed to and to what extent.

Beyond yielding high or low scores on specific parameters, the scale is valuable for raising awareness of where an individual stands in the motherhood role; with this scale, mothers can determine what stage of parenthood they have reached. In line with the original scale’s recommendations, the scores obtained from the scale and its subdimensions were evaluated in this adaptation study for Turkish society: as the subdimension totals and the overall scale total increased, the adaptation of primiparous mothers in the transition to motherhood increased.

Strengths and limitations

The first strength of this study is that a measurement tool for primiparous mothers has been adapted to Turkish society and culture, giving researchers and health professionals in Türkiye a valid and reliable instrument. Second, the adapted scale is the first of its kind, since no measurement tool previously existed in Türkiye to determine the transition process of primiparous mothers to motherhood. The most important limitation was that, since no equivalent scale exists for primiparous mothers, parallel-forms reliability analysis could not be performed, and the test-retest method was used instead.

In this study, the original English version of the TPM-S was translated into Turkish and administered to 305 primiparous mothers. The translated version had acceptable goodness-of-fit values and a high reliability coefficient. The TPM-S is a valid and reliable measurement tool for determining primiparous mothers’ experiences of the transition to motherhood. Health professionals and researchers can use this scale to identify primiparous mothers who need parenting support and thus effectively support those who perceive themselves as inadequate in childcare and the maternal role. Health professionals can routinely administer the scale to primiparous mothers in the first six months after birth to assess their transition to motherhood, and they can incorporate education and support initiatives into the care practices they plan for first-time mothers.

Data availability

The corresponding author, upon reasonable request, will provide data supporting the findings of this study.

References

Arnold-Baker C. The process of becoming: maternal identity in the transition to motherhood. Existential Analysis. 2019;30(2).

Barimani M, Vikström A, Rosander M, Forslund Frykedal K, Berlin A. Facilitating and inhibiting factors in transition to parenthood - ways in which health professionals can support parents. Scand J Caring Sci. 2017;31(3):537–46.


Meleis AI. Facilitating and managing transitions: an imperative for quality care. Investigación en Enfermería: Imagen y Desarrollo. 2019;21(1).

Katou Y, Okamura M, Ohira M. Development of an assessment tool for the transition of Japanese primiparas becoming mothers: reliability and validity. Midwifery. 2022;103485.

Finlayson K, Crossland N, Bonet M, Downe S. What matters to women in the postnatal period: a meta-synthesis of qualitative studies. PLoS ONE. 2020;15(4):e0231415.


Mercer RT. Becoming a mother versus maternal role attainment. J Nurs Scholarsh. 2004;36(3):226–32.


Çelik AA, Akdeniz G. A theoretical look at the change in consumption habits with women’s transition to motherhood. J Consumer Consum Res. 2020;12(2):367–402.


Slade A, Cohen LJ, Sadler LS, Miller M. The psychology and psychopathology of pregnancy. In: Zeanah CH, editor. Handbook of Infant Mental Health. New York, NY: Guilford Press; 2009. pp. 22–39.

Khandan S, Riazi H, Amir Ali Akbari S, Nasiri M, Montazeri A. Adaptation to maternal role and infant development: a cross-sectional study. J Reprod Infant Psychol. 2018;36(3):289–301.

Harding JF, Zief S, Farb A, Margolis A. Supporting expectant and parenting teens: new evidence to inform future programming and research. Matern Child Health J. 2020;24(2):67–75.


Ha JY, Kim YJ. Factors influencing self-confidence in the maternal role among early postpartum mothers. Korean Soc Women Health Nurs. 2013;19:48–55.

Shorey S, Chan W, Chong Y, He H. Maternal parental self-efficacy in newborn care and social support needs in Singapore: a correlational study. J Clin Nurs. 2014;23(15–16):2272–83.

Barkin JL, Wisner KL. The role of maternal self-care in new motherhood. Midwifery. 2013;29(9):1050–5.

Mercer RT, Walker LO. A review of nursing interventions to foster becoming a mother. J Obstetric Gynecologic Neonatal Nurs. 2006;35(5):568–82.

Pontoppidan M, Andrade SB, Kristensen IH, Mortensen EL. Maternal confidence after birth in at-risk and not-at-risk mothers: internal and external validity of the Danish version of the Karitane parenting confidence scale (KPCS). J Patient Rep Outcomes. 2019;3:33.

Shrestha S, Adachi K, Shrestha S. Translation and validation of the Karitane parenting confidence scale in Nepali language. Midwifery. 2016;36:86–91.

Bilgin Z, Ecevit Alpar Ş. Scale for maternity role perceptions. Health Care Women Int. 2021;42:485–502.

Doster A, Wallwiener S, Müller M, Matthies LM, Plewniok K, Feller S, et al. Reliability and validity of the German version of the maternal–fetal attachment scale. Arch Gynecol Obstet. 2018;297:1157–67.

Wardani DA, Rachmawati IN, Gayatri D. Maternal self-efficacy of pregnant Indonesian teens: development and validation of an Indonesian version of the young adult maternal confidence scale and measurement of its validity and reliability. Comprehens Child Adolesc Nurs. 2017;40:145–51.

Sánchez-Rodríguez R, Callahan S, Séjourné N. Development and preliminary validation of the maternal burnout scale (MBS) in a French sample of mothers: bifactorial structure, reliability, and validity. Arch Women’s Ment Health. 2020;23:573–83.

Yiğit F, Arpacı Kızıldağ İ. The effect of the education given to primiparous pregnant women with two different methods (Face-to-Face/Web-Based) in the last trimester on their acquisition of the motherhood role (Thesis No. 785426) [Doctoral thesis, Hasan Kalyoncu University]. 2023.

Ganjekar S, Prakash A, Thippeswamy H, Desai G, Chandra PS. The NIMHANS (National Institute of Mental Health and Neuro Sciences) maternal behavior scale (NIMBUS): development and validation of a scale for assessment of maternal behavior among mothers with postpartum severe mental illness in low resource settings. Asian J Psychiatr. 2020;47.

Fallon V, Halford JCG, Bennett KM, Harrold JA. The postpartum specific anxiety scale: development and preliminary validation. Arch Women’s Ment Health. 2016;19.

Özdemir K, Menekse D, Çınar N. Development of obsessive and compulsive behaviors scale of mothers in postpartum period regarding baby care: validity and reliability. Perspect Psychiatr Care. 2020;56:379–85.

Kline P. An Easy Guide to factor analysis. 1st ed. Routledge; 1994.

Hogarty KY, Hines CV, Kromrey JD, Ferron JM, Mumford KR. The quality of factor solutions in exploratory factor analysis: the influence of sample size, communality, and overdetermination. Educ Psychol Meas. 2005;65(2):202–26.

DeVellis RF, Thorpe CT. Scale development: theory and applications. Sage; 2021.

Mercer RT. Becoming a mother. New York: Springer; 1995.

Beaton D, Bombardier C, Guillemin F, Ferraz MB. Recommendations for the cross-cultural adaptation of the DASH & Quick DASH Outcome measures. Institute for Work & Health; 2007. pp. 1–45.

Brislin RW. Back-translation for cross-cultural research. J Cross-Cult Psychol. 1970;1:185–216.

Davis LL. Instrument review: getting the most from a panel of experts. Appl Nurs Res. 1992;5:194–7.

Altunışık R, Coşkun R, Yıldırım E. Sosyal Bilimlerde Araştırma Yöntemleri SPSS Uygulamalı. Sakarya: Sakarya Yayıncılık. 2010.

Şencan H. Sosyal ve davranışsal ölçümlerde güvenilirlik ve geçerlilik. Ankara: Seçkin Yayıncılık. 2005.

Başol G. Eğitimde ölçme ve değerlendirme (6. Baskı). Pegem Akademi. 2019.

Büyüköztürk Ş, Kılıç-Çakmak E, Akgün Ö, Karadeniz Ş, Demirel F. Bilimsel araştırma yöntemleri. 2008.

Mohajan HK. Two criteria for good measurements in research: validity and reliability. ASHU-ES. 2017;17(4):59–82.

Rigdon EE. CFI versus RMSEA: a comparison of two fit indices for structural equation modeling. Struct Equation Modeling: Multidisciplinary J. 1996;3(4):369–79.

Streiner DL, Norman GR, Cairney J. Health Measurement scales: a practical guide to their development and use. 5th ed. USA: Oxford University Press; 2015.


Esin MN. Veri toplama yöntem ve araçları & veri toplama araçlarının güvenirlik ve geçerliği. In: Erdoğan S, Nahcivan N, Esin MN, editors. Hemşirelikte Araştırma: Süreç, Uygulama ve Kritik. 1. Baskı. İstanbul: Nobel Tıp Kitapevleri; 2014. pp. 193–231.

Tavşancıl E. Tutumların Ölçülmesi ve SPSS Ile Veri Analizi. 6. Baskı. Ankara: Nobel Akademik Yayıncılık; 2019.

Field A. Research methods II: reliability analysis. London: Sage; 2006.

Karasar N. Scientific Research Method, Research Education Consultancy. Ankara: Nobel Academic Publishing; 2019.

Bursal M. SPSS ile Temel Veri Analizleri. 2. Baskı. Ankara: Anı Yayıncılık. 2019.

Hajjar ST. Statistical analysis: internal-consistency reliability and construct validity. Int J Quant Qual Res Methods. 2018;6(1):46–57.

Hair JF, Ringle CM, Sarstedt M. PLS-SEM: indeed a silver bullet. J Mark Theory Pract. 2011;19(2):139–52.

Hair JF, Risher JJ, Sarstedt M, Ringle CM. When to use and how to report the results of PLS-SEM. Eur Bus Rev. 2019;31(1):2–24.

Wheaton B, Muthen B, Alwin DF, Summers GF. Assessing reliability and stability in panel models. Sociol Methodol. 1977;8:84–136.

Vieira AL. Preparation of the analysis. In: Interactive LISREL in practice. London: Springer; 2011.

Byrne BM. Structural equation modeling with AMOS: basic concepts, applications, and programming (Multivariate Applications Series). New York: Routledge; 2011.

Maiti SS, Mukherjee BN. Two new goodness-of-fit indices for covariance matrices with linear structures. Br J Math Stat Psychol. 1991;44(1):153–80.

Shevlin M, Miles JN. Effects of sample size, model specification and factor loadings on the GFI in confirmatory factor analysis. Pers Indiv Differ. 1998; 25(1).

Ishii H. Tokeibunseki No Kokoga Shiritai. Bunkodo. 2005.

Schermelleh-Engel K, Moosbrugger H, Müller H. Evaluating the fit of structural equation models: tests of significance and descriptive goodness-of-fit measures. Methods Psychol Res Online. 2003;8:23–74.

Çokluk Ö, Şekercioğlu G, Büyüköztürk Ş. Sosyal bilimler için çok değişkenli istatistik: SPSS ve LISREL uygulamaları. Pegem Akademi; 2018.


Acknowledgements

The authors thank all mothers for their contributions to the study. This study was presented as a summary at the 9th International 13th National Midwifery Students Congress with ID number 8670261.

No funding was received for this article.

Author information

Authors and affiliations

Department of Nursing, Faculty of Health Sciences, Bayburt University, Bayburt, Türkiye

Zila Özlem Kirbaş

Department of Midwifery, Faculty of Health Sciences, Bayburt University, Bayburt, Türkiye

Elif Odabaşi Aktaş

Department of Midwifery, Faculty of Health Sciences, Atatürk University, Erzurum, Türkiye

Hava Özkan


Contributions

ZOK: Conceptualization; Investigation; Writing - original draft; Funding acquisition; Methodology; Validation; Writing - review & editing; Visualization; Software; Formal analysis; Project administration; Data curation; Supervision; Resources. EOA: Supervision; Resources; Writing - review & editing; Writing - original draft; Data curation. HO: Writing - original draft; Writing - review & editing.

Corresponding author

Correspondence to Elif Odabaşi Aktaş .

Ethics declarations

Ethical approval

Ethics committee approval (25.11.2022/Decision no: 230/12) from Bayburt University and institutional permission from the Bayburt Provincial Health Directorate were obtained before the study. In accordance with the Declaration of Helsinki, participants were informed about the investigation, and their consent was obtained with an informed consent form. Only volunteer participants were included in the study. Informed consent was obtained from the legal representatives of illiterate and underage participants.

Consent for publication

Not applicable.

Conflicts of interest

No potential conflicts of interest were reported by the author(s).

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Kirbaş, Z.Ö., Odabaşi Aktaş, E. & Özkan, H. Validity and reliability of the Turkish version of the transition of primiparas becoming mothers scale. BMC Pregnancy Childbirth 24 , 259 (2024). https://doi.org/10.1186/s12884-024-06438-7


Received : 16 October 2023

Accepted : 22 March 2024

Published : 11 April 2024

DOI : https://doi.org/10.1186/s12884-024-06438-7


Keywords: Translation, Psychometrics
