Research-Methodology

Regression Analysis

Regression analysis is a quantitative research method used when a study involves modelling and analysing several variables, where the relationship of interest is between a dependent variable and one or more independent variables. In simple terms, regression analysis is a quantitative method used to test the nature of the relationship between a dependent variable and one or more independent variables.

The basic form of regression models includes unknown parameters (β), independent variables (X), and the dependent variable (Y).

A regression model specifies the relation of the dependent variable (Y) to a function of the independent variables (X) and the unknown parameters (β):

                                    Y  ≈  f (X, β)   

The regression equation can be used to predict the value of ‘y’ when the value of ‘x’ is given, where ‘y’ and ‘x’ are two sets of measures from a sample of size ‘n’. The formulae for the regression equation are:

                                    y = a + bx

                                    b = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)

                                    a = (Σy − b Σx) / n

Do not be intimidated by the visual complexity of the correlation and regression formulae above. You do not have to apply the formulae manually; correlation and regression analyses can be run with popular analytical software such as Microsoft Excel, SPSS, and others.
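
For illustration, here is a minimal Python sketch (using NumPy, with made-up x and y values) that applies the least-squares formulae above by hand and cross-checks the result against a library routine:

    import numpy as np

    # Hypothetical sample of n paired measurements (x, y)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    n = len(x)

    # Slope and intercept from the least-squares formulae above
    b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
    a = (np.sum(y) - b * np.sum(x)) / n
    print(f"y = {a:.3f} + {b:.3f}x")

    # Cross-check with a library routine (returns slope first, then intercept)
    b_check, a_check = np.polyfit(x, y, deg=1)
    print(a_check, b_check)  # should match a and b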

Linear regression analysis is based on the following set of assumptions:

1. Assumption of linearity. There is a linear relationship between the dependent and independent variables.

2. Assumption of homoscedasticity. The residuals (error terms) have a constant variance across all levels of the independent variables.

3. Assumption of absence of collinearity or multicollinearity. There is no strong correlation between two or more independent variables.

4. Assumption of normal distribution. The residuals (error terms) of the model are normally distributed.
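
As a rough illustration of how these assumptions can be checked in practice, here is a minimal Python sketch using statsmodels and SciPy on synthetic data; the data, thresholds, and variable names are invented for the example:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.stats.diagnostic import het_breuschpagan
    from scipy import stats

    # Synthetic example: 100 observations, 2 independent variables
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.8, size=100)

    X_const = sm.add_constant(X)              # add the intercept column
    model = sm.OLS(y, X_const).fit()
    residuals = model.resid

    # Multicollinearity: variance inflation factors (a VIF above roughly 5-10 is usually a concern)
    vifs = [variance_inflation_factor(X_const, i) for i in range(1, X_const.shape[1])]
    print("VIFs:", vifs)

    # Homoscedasticity: Breusch-Pagan test (a small p-value suggests non-constant error variance)
    _, bp_pvalue, _, _ = het_breuschpagan(residuals, X_const)
    print("Breusch-Pagan p-value:", bp_pvalue)

    # Normality of residuals: Shapiro-Wilk test (a small p-value suggests non-normal residuals)
    _, sw_pvalue = stats.shapiro(residuals)
    print("Shapiro-Wilk p-value:", sw_pvalue)

    # Linearity: in practice, also plot residuals against fitted values and look for patterns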



The complete guide to regression analysis.

What is regression analysis and why is it useful? While most of us have heard the term, understanding regression analysis in detail may be something you need to brush up on. Here’s what you need to know about this popular method of analysis.

When you rely on data to drive and guide business decisions, as well as predict market trends, just gathering and analyzing what you find isn’t enough — you need to ensure it’s relevant and valuable.

The challenge, however, is that so many variables can influence business data: market conditions, economic disruption, even the weather! As such, it’s essential you know which variables are affecting your data and forecasts, and what data you can discard.

And one of the most effective ways to determine data value and monitor trends (and the relationships between them) is to use regression analysis, a set of statistical methods used for the estimation of relationships between independent and dependent variables.

In this guide, we’ll cover the fundamentals of regression analysis, from what it is and how it works to its benefits and practical applications.


What is regression analysis?

Regression analysis is a statistical method. It’s used for analyzing different factors that might influence an objective – such as the success of a product launch, business growth, a new marketing campaign – and determining which factors are important and which ones can be ignored.

Regression analysis can also help leaders understand how different variables impact each other and what the outcomes are. For example, when forecasting financial performance, regression analysis can help leaders determine how changes in the business can influence revenue or expenses in the future.

Running an analysis of this kind, you might find that there’s a high correlation between the number of marketers employed by the company, the leads generated, and the opportunities closed.

This seems to suggest that a high number of marketers and a high number of leads generated influence sales success. But do you need both factors to close those sales? By analyzing the effects of these variables on your outcome, you might learn that when leads increase but the number of marketers employed stays constant, there is no impact on the number of opportunities closed, but if the number of marketers increases, leads and closed opportunities both rise.

Regression analysis can help you tease out these complex relationships so you can determine which areas you need to focus on in order to get your desired results, and avoid wasting time with those that have little or no impact. In this example, that might mean hiring more marketers rather than trying to increase leads generated.

How does regression analysis work?

Regression analysis starts with variables that are categorized into two types: dependent and independent variables. The variables you select depend on the outcomes you’re analyzing.

Understanding variables:

1. Dependent variable

This is the main variable that you want to analyze and predict. For example, operational (O) data such as your quarterly or annual sales, or experience (X) data such as your net promoter score (NPS) or customer satisfaction score (CSAT).

These variables are also called response variables, outcome variables, or left-hand-side variables (because they appear on the left-hand side of a regression equation).

There are three easy ways to identify them:

  • Is the variable measured as an outcome of the study?
  • Does the variable depend on another in the study?
  • Do you measure the variable only after other variables are altered?

2. Independent variable

Independent variables are the factors that could affect your dependent variables. For example, a price rise in the second quarter could make an impact on your sales figures.

You can identify independent variables with the following list of questions:

  • Is the variable manipulated, controlled, or used as a subject grouping method by the researcher?
  • Does this variable come before the other variable in time?
  • Are you trying to understand whether or how this variable affects another?

Independent variables are often referred to differently in regression depending on the purpose of the analysis. You might hear them called:

Explanatory variables

Explanatory variables are those which explain an event or an outcome in your study. For example, explaining why your sales dropped or increased.

Predictor variables

Predictor variables are used to predict the value of the dependent variable. For example, predicting how much sales will increase when new product features are rolled out.

Experimental variables

These are variables that can be manipulated or changed directly by researchers to assess the impact. For example, assessing how different product pricing ($10 vs $15 vs $20) will impact the likelihood to purchase.

Subject variables (also called fixed effects)

Subject variables can’t be changed directly, but vary across the sample. For example, age, gender, or income of consumers.

Unlike experimental variables, you can’t randomly assign or change subject variables, but you can design your regression analysis to determine the different outcomes of groups of participants with the same characteristics. For example, ‘how do price rises impact sales based on income?’

Carrying out regression analysis


So regression is about the relationships between dependent and independent variables. But how exactly do you do it?

Assuming you have your data collection done already, the first and foremost thing you need to do is plot your results on a graph. Doing this makes interpreting regression analysis results much easier as you can clearly see the correlations between dependent and independent variables.

Let’s say you want to carry out a regression analysis to understand the relationship between the number of ads placed and revenue generated.

On the Y-axis, you place the revenue generated. On the X-axis, the number of digital ads. By plotting the information on the graph, and drawing a line (called the regression line) through the middle of the data, you can see the relationship between the number of digital ads placed and revenue generated.


This regression line is the line that provides the best description of the relationship between your independent variables and your dependent variable. In this example, we’ve used a simple linear regression model.


Statistical analysis software can draw and precisely calculate this regression line for you. The software then provides a formula for the slope of the line, adding further context to the relationship between your dependent and independent variables.
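
As an illustration, here is a minimal Python sketch (using NumPy and Matplotlib, with hypothetical ad counts and revenue figures) that plots the data and draws the fitted regression line:

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical data: number of digital ads placed (X) and revenue generated (Y)
    ads = np.array([10, 20, 30, 40, 50, 60, 70, 80])
    revenue = np.array([1.2, 1.9, 3.1, 3.8, 5.2, 5.9, 7.1, 7.8])  # in $1,000s

    # Fit a simple linear regression line: revenue = intercept + slope * ads
    slope, intercept = np.polyfit(ads, revenue, deg=1)
    print(f"revenue ≈ {intercept:.2f} + {slope:.3f} * ads")

    # Plot the observations and the regression line
    plt.scatter(ads, revenue, label="observations")
    plt.plot(ads, intercept + slope * ads, color="red", label="regression line")
    plt.xlabel("Number of digital ads placed")
    plt.ylabel("Revenue generated ($1,000s)")
    plt.legend()
    plt.show()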

Simple linear regression analysis

A simple linear model uses a single straight line to determine the relationship between a single independent variable and a dependent variable.

This regression model is mostly used when you want to determine the relationship between two variables (like price increases and sales) or the value of the dependent variable at certain points of the independent variable (for example the sales levels at a certain price rise).

While linear regression is useful, it does require you to make some assumptions.

For example, it requires you to assume that:

  • the data was collected using a statistically valid sample collection method that is representative of the target population
  • the observed relationship between the variables can’t be explained by a ‘hidden’ third variable – in other words, there are no spurious correlations
  • the relationship between the independent variable and dependent variable is linear – meaning that the best fit along the data points is a straight line and not a curved one

Multiple regression analysis

As the name suggests, multiple regression analysis is a type of regression that uses multiple variables. It uses multiple independent variables to predict the outcome of a single dependent variable. Of the various kinds of multiple regression, multiple linear regression is one of the best-known.

Multiple linear regression is a close relative of the simple linear regression model in that it looks at the impact of several independent variables on one dependent variable. However, like simple linear regression, multiple regression analysis also requires you to make some basic assumptions.

For example, you will be assuming that:

  • there is a linear relationship between the dependent and independent variables (it creates a straight line and not a curve through the data points)
  • the independent variables aren’t highly correlated in their own right

An example of multiple linear regression would be an analysis of how marketing spend, revenue growth, and general market sentiment affect the share price of a company.

With multiple linear regression models you can estimate how these variables will influence the share price, and to what extent.
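
A rough sketch of such a model in Python, using statsmodels with an invented quarterly dataset (the variable names and figures are hypothetical), might look like this:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical quarterly data for one company
    df = pd.DataFrame({
        "share_price":      [42.1, 44.3, 43.8, 47.2, 49.0, 51.4, 50.2, 53.8],
        "marketing_spend":  [1.0, 1.2, 1.1, 1.5, 1.6, 1.9, 1.8, 2.1],   # $m
        "revenue_growth":   [2.0, 2.5, 2.2, 3.1, 3.4, 3.9, 3.6, 4.2],   # %
        "market_sentiment": [0.1, 0.2, 0.0, 0.4, 0.5, 0.6, 0.4, 0.7],   # index
    })

    # Fit share_price ~ marketing_spend + revenue_growth + market_sentiment
    model = smf.ols(
        "share_price ~ marketing_spend + revenue_growth + market_sentiment",
        data=df,
    ).fit()

    # Estimated coefficients show how much each variable is associated with a change
    # in share price, holding the other variables constant
    print(model.params)
    print(model.rsquared)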

Multivariate linear regression

Multivariate linear regression involves more than one dependent variable as well as multiple independent variables, making it more complicated than linear or multiple linear regressions. However, this also makes it much more powerful and capable of making predictions about complex real-world situations.

For example, if an organization wants to establish or estimate how the COVID-19 pandemic has affected employees in its different markets, it can use multivariate linear regression, with the different geographical regions as dependent variables and the different facets of the pandemic as independent variables (such as mental health self-rating scores, proportion of employees working at home, lockdown durations and employee sick days).

Through multivariate linear regression, you can look at relationships between variables in a holistic way and quantify the relationships between them. As you can clearly visualize those relationships, you can make adjustments to dependent and independent variables to see which conditions influence them. Overall, multivariate linear regression provides a more realistic picture than looking at a single variable.

However, because multivariate techniques are complex, they involve high-level mathematics that require a statistical program to analyze the data.
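
One simple way to sketch this in code is with scikit-learn's LinearRegression, which accepts a matrix of dependent variables and fits one equation per outcome; the variables below are invented stand-ins for the example above:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    n = 200  # hypothetical employee-survey records

    # Independent variables: mental health self-rating, % time working at home,
    # lockdown duration (weeks), sick days taken
    X = np.column_stack([
        rng.normal(3.0, 1.0, n),
        rng.uniform(0, 100, n),
        rng.integers(0, 20, n),
        rng.poisson(3, n),
    ])

    # Two hypothetical dependent variables, e.g. engagement and productivity scores
    Y = np.column_stack([
        5 + 0.5 * X[:, 0] - 0.01 * X[:, 1] + rng.normal(0, 1, n),
        7 - 0.05 * X[:, 2] - 0.2 * X[:, 3] + rng.normal(0, 1, n),
    ])

    model = LinearRegression().fit(X, Y)
    print(model.coef_)       # one row of coefficients per dependent variable
    print(model.intercept_)  # one intercept per dependent variable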

Logistic regression

Logistic regression models the probability of a binary outcome based on independent variables.

So, what is a binary outcome? It’s when there are only two possible scenarios: either the event happens (1) or it doesn’t (0), e.g. yes/no outcomes, pass/fail outcomes, and so on. In other words, the outcome can be described as falling into one of two categories.

Logistic regression makes predictions based on independent variables that are assumed or known to have an influence on the outcome. For example, the probability of a sports team winning their game might be affected by independent variables like weather, day of the week, whether they are playing at home or away and how they fared in previous matches.
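
For example, a minimal Python sketch (using scikit-learn, with invented match records) of a logistic regression predicting a win (1) or a loss (0) from home advantage and previous results might look like this:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical match records: [played_at_home (1/0), won_previous_match (1/0)]
    X = np.array([
        [1, 1], [1, 0], [0, 1], [0, 0],
        [1, 1], [0, 0], [1, 0], [0, 1],
        [1, 1], [0, 0], [1, 1], [0, 1],
    ])
    # Binary outcome: 1 = won the game, 0 = lost
    y = np.array([1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0])

    model = LogisticRegression().fit(X, y)

    # Predicted probability of winning the next home game after a previous win
    prob_win = model.predict_proba([[1, 1]])[0, 1]
    print(f"P(win | home, previous win) = {prob_win:.2f}")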

What are some common mistakes with regression analysis?

Across the globe, businesses are increasingly relying on quality data and insights to drive decision-making — but to make accurate decisions, it’s important that the data collected and statistical methods used to analyze it are reliable and accurate.

Using the wrong data or the wrong assumptions can result in poor decision-making, lead to missed opportunities to improve efficiency and savings, and — ultimately — damage your business long term.

  • Assumptions

When running regression analysis, be it a simple linear or multiple regression, it’s really important to check that the assumptions your chosen method requires have been met. If your data points don’t conform to a straight line of best fit, for example, you need to apply additional statistical transformations to accommodate the non-linear data. For example, if you are looking at income data, which typically follows a skewed, roughly log-normal distribution, you might take the natural log of income as your variable and then transform the predictions back after the model is created.

  • Correlation vs. causation

It’s a well-worn phrase that bears repeating – correlation does not equal causation. While variables that are linked by causality will always show correlation, the reverse is not always true. Moreover, there is no statistic that can determine causality (although the design of your study overall can).

If you observe a correlation in your results, such as in the first example we gave in this article where there was a correlation between leads and sales, you can’t assume that one thing has influenced the other. Instead, you should use it as a starting point for investigating the relationship between the variables in more depth.

  • Choosing the wrong variables to analyze

Before you use any kind of statistical method, it’s important to understand the subject you’re researching in detail. Doing so means you’re making informed choices of variables and you’re not overlooking something important that might have a significant bearing on your dependent variable.

  • Model building

The variables you include in your analysis are just as important as the variables you choose to exclude. That’s because the strength of each independent variable is influenced by the other variables in the model. Other techniques, such as Key Drivers Analysis, are able to account for these variable interdependencies.

Benefits of using regression analysis

There are several benefits to using regression analysis to judge how changing variables will affect your business and to ensure you focus on the right things when forecasting.

Here are just a few of those benefits:

Make accurate predictions

Regression analysis is commonly used when forecasting and forward planning for a business. For example, when predicting sales for the year ahead, a number of different variables will come into play to determine the eventual result.

Regression analysis can help you determine which of these variables are likely to have the biggest impact based on previous events and help you make more accurate forecasts and predictions.

Identify inefficiencies

Using a regression equation, a business can identify areas for improvement when it comes to efficiency, whether in terms of people, processes, or equipment.

For example, regression analysis can help a car manufacturer determine order numbers based on external factors like the economy or environment.

They can then use the initial regression equation to determine how many members of staff and how much equipment they need to meet orders.

Drive better decisions

Improving processes or business outcomes is always on the minds of owners and business leaders, but without actionable data, they’re simply relying on instinct, and this doesn’t always work out.

This is particularly true when it comes to issues of price. For example, to what extent will raising the price (and to what level) affect next quarter’s sales?

There’s no way to know this without data analysis. Regression analysis can help provide insights into the correlation between price rises and sales based on historical data.

How do businesses use regression? A real-life example

Marketing and advertising spending are common topics for regression analysis. Companies use regression when trying to assess the value of ad spend and marketing spend on revenue.

A typical example is using a regression equation to assess the correlation between ad costs and conversions of new customers. In this instance,

  • our dependent variable (the factor we’re trying to assess the outcomes of) will be our conversions
  • the independent variable (the factor we’ll change to assess how it changes the outcome) will be the daily ad spend
  • the regression equation will try to determine whether an increase in ad spend has a direct correlation with the number of conversions we have

The analysis is relatively straightforward — using historical data from an ad account, we can use daily data to judge ad spend vs conversions and how changes to the spend alter the conversions.

By assessing this data over time, we can make predictions not only on whether increasing ad spend will lead to increased conversions but also what level of spending will lead to what increase in conversions. This can help to optimize campaign spend and ensure marketing delivers good ROI.

This is an example of a simple linear model. If we wanted to carry out a more complex regression analysis, we could also factor in other independent variables such as seasonality, GDP, and the current reach of our chosen advertising networks.

By increasing the number of independent variables, we can get a better understanding of whether ad spend is resulting in an increase in conversions, whether it’s exerting an influence in combination with another set of variables, or if we’re dealing with a correlation with no causal impact – which might be useful for predictions anyway, but isn’t a lever we can use to increase sales.

Using the estimated effect of each independent variable, we can more accurately predict how spend will change the conversion rate of advertising.
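
A minimal Python sketch of this workflow, using statsmodels with an invented ad-account export (the column names and figures are hypothetical), might look like the following:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical daily export from an ad account
    ads = pd.DataFrame({
        "daily_ad_spend": [100, 150, 200, 250, 300, 350, 400, 450, 500, 550],
        "conversions":    [12,  17,  21,  27,  30,  36,  39,  45,  48,  54],
    })

    # Simple linear model: conversions ~ daily ad spend
    model = smf.ols("conversions ~ daily_ad_spend", data=ads).fit()
    print(model.params)  # intercept, plus estimated conversions gained per extra unit of spend

    # Predict conversions at candidate spend levels to plan the campaign budget
    candidates = pd.DataFrame({"daily_ad_spend": [600, 700, 800]})
    print(model.predict(candidates))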

Regression analysis tools

Regression analysis is an important tool when it comes to better decision-making and improved business outcomes. To get the best out of it, you need to invest in the right kind of statistical analysis software.

The best option is likely to be one that sits at the intersection of powerful statistical analysis and intuitive ease of use, as this will empower everyone from beginners to expert analysts to uncover meaning from data, identify hidden trends and produce predictive models without statistical training being required.


To help prevent costly errors, choose a tool that automatically runs the right statistical tests and visualizations and then translates the results into simple language that anyone can put into action.

With software that’s both powerful and user-friendly, you can isolate key experience drivers, understand what influences the business, apply the most appropriate regression methods, identify data issues, and much more.


With Qualtrics’ Stats iQ™, you don’t have to worry about the regression equation because our statistical software will run the appropriate equation for you automatically based on the variable type you want to monitor. You can also use several equations, including linear regression and logistic regression, to gain deeper insights into business outcomes and make more accurate, data-driven decisions.



Research Design | Step-by-Step Guide with Examples

Published on 5 May 2022 by Shona McCombes. Revised on 20 March 2023.

A research design is a strategy for answering your research question  using empirical data. Creating a research design means making decisions about:

  • Your overall aims and approach
  • The type of research design you’ll use
  • Your sampling methods or criteria for selecting subjects
  • Your data collection methods
  • The procedures you’ll follow to collect data
  • Your data analysis methods

A well-planned research design helps ensure that your methods match your research aims and that you use the right kind of analysis for your data.

Table of contents

  • Step 1: Consider your aims and approach
  • Step 2: Choose a type of research design
  • Step 3: Identify your population and sampling method
  • Step 4: Choose your data collection methods
  • Step 5: Plan your data collection procedures
  • Step 6: Decide on your data analysis strategies
  • Frequently asked questions


Before you can start designing your research, you should already have a clear idea of the research question you want to investigate.

There are many different ways you could go about answering this question. Your research design choices should be driven by your aims and priorities – start by thinking carefully about what you want to achieve.

The first choice you need to make is whether you’ll take a qualitative or quantitative approach.

Qualitative research designs tend to be more flexible and inductive, allowing you to adjust your approach based on what you find throughout the research process.

Quantitative research designs tend to be more fixed and deductive, with variables and hypotheses clearly defined in advance of data collection.

It’s also possible to use a mixed methods design that integrates aspects of both approaches. By combining qualitative and quantitative insights, you can gain a more complete picture of the problem you’re studying and strengthen the credibility of your conclusions.

Practical and ethical considerations when designing research

As well as scientific considerations, you need to think practically when designing your research. If your research involves people or animals, you also need to consider research ethics.

  • How much time do you have to collect data and write up the research?
  • Will you be able to gain access to the data you need (e.g., by travelling to a specific location or contacting specific people)?
  • Do you have the necessary research skills (e.g., statistical analysis or interview techniques)?
  • Will you need ethical approval?

At each stage of the research design process, make sure that your choices are practically feasible.


Within both qualitative and quantitative approaches, there are several types of research design to choose from. Each type provides a framework for the overall shape of your research.

Types of quantitative research designs

Quantitative designs can be split into four main types. Experimental and quasi-experimental designs allow you to test cause-and-effect relationships, while descriptive and correlational designs allow you to measure variables and describe relationships between them.

With descriptive and correlational designs, you can get a clear picture of characteristics, trends, and relationships as they exist in the real world. However, you can’t draw conclusions about cause and effect (because correlation doesn’t imply causation).

Experiments are the strongest way to test cause-and-effect relationships without the risk of other variables influencing the results. However, their controlled conditions may not always reflect how things work in the real world. They’re often also more difficult and expensive to implement.

Types of qualitative research designs

Qualitative designs are less strictly defined. This approach is about gaining a rich, detailed understanding of a specific context or phenomenon, and you can often be more creative and flexible in designing your research.

Common types of qualitative design include case studies, ethnography, grounded theory, and phenomenological research. They often have similar approaches in terms of data collection, but focus on different aspects when analysing the data.

Your research design should clearly define who or what your research will focus on, and how you’ll go about choosing your participants or subjects.

In research, a population is the entire group that you want to draw conclusions about, while a sample is the smaller group of individuals you’ll actually collect data from.

Defining the population

A population can be made up of anything you want to study – plants, animals, organisations, texts, countries, etc. In the social sciences, it most often refers to a group of people.

For example, will you focus on people from a specific demographic, region, or background? Are you interested in people with a certain job or medical condition, or users of a particular product?

The more precisely you define your population, the easier it will be to gather a representative sample.

Sampling methods

Even with a narrowly defined population, it’s rarely possible to collect data from every individual. Instead, you’ll collect data from a sample.

To select a sample, there are two main approaches: probability sampling and non-probability sampling. The sampling method you use affects how confidently you can generalise your results to the population as a whole.

Probability sampling is the most statistically valid option, but it’s often difficult to achieve unless you’re dealing with a very small and accessible population.

For practical reasons, many studies use non-probability sampling, but it’s important to be aware of the limitations and carefully consider potential biases. You should always make an effort to gather a sample that’s as representative as possible of the population.

Case selection in qualitative research

In some types of qualitative designs, sampling may not be relevant.

For example, in an ethnography or a case study, your aim is to deeply understand a specific context, not to generalise to a population. Instead of sampling, you may simply aim to collect as much data as possible about the context you are studying.

In these types of design, you still have to carefully consider your choice of case or community. You should have a clear rationale for why this particular case is suitable for answering your research question.

For example, you might choose a case study that reveals an unusual or neglected aspect of your research problem, or you might choose several very similar or very different cases in order to compare them.

Data collection methods are ways of directly measuring variables and gathering information. They allow you to gain first-hand knowledge and original insights into your research problem.

You can choose just one data collection method, or use several methods in the same study.

Survey methods

Surveys allow you to collect data about opinions, behaviours, experiences, and characteristics by asking people directly. There are two main survey methods to choose from: questionnaires and interviews.

Observation methods

Observations allow you to collect data unobtrusively, observing characteristics, behaviours, or social interactions without relying on self-reporting.

Observations may be conducted in real time, taking notes as you observe, or you might make audiovisual recordings for later analysis. They can be qualitative or quantitative.

Other methods of data collection

There are many other ways you might collect data depending on your field and topic.

If you’re not sure which methods will work best for your research design, try reading some papers in your field to see what data collection methods they used.

Secondary data

If you don’t have the time or resources to collect data from the population you’re interested in, you can also choose to use secondary data that other researchers already collected – for example, datasets from government surveys or previous studies on your topic.

With this raw data, you can do your own analysis to answer new research questions that weren’t addressed by the original study.

Using secondary data can expand the scope of your research, as you may be able to access much larger and more varied samples than you could collect yourself.

However, it also means you don’t have any control over which variables to measure or how to measure them, so the conclusions you can draw may be limited.

As well as deciding on your methods, you need to plan exactly how you’ll use these methods to collect data that’s consistent, accurate, and unbiased.

Planning systematic procedures is especially important in quantitative research, where you need to precisely define your variables and ensure your measurements are reliable and valid.

Operationalisation

Some variables, like height or age, are easily measured. But often you’ll be dealing with more abstract concepts, like satisfaction, anxiety, or competence. Operationalisation means turning these fuzzy ideas into measurable indicators.

If you’re using observations , which events or actions will you count?

If you’re using surveys , which questions will you ask and what range of responses will be offered?

You may also choose to use or adapt existing materials designed to measure the concept you’re interested in – for example, questionnaires or inventories whose reliability and validity has already been established.

Reliability and validity

Reliability means your results can be consistently reproduced, while validity means that you’re actually measuring the concept you’re interested in.

For valid and reliable results, your measurement materials should be thoroughly researched and carefully designed. Plan your procedures to make sure you carry out the same steps in the same way for each participant.

If you’re developing a new questionnaire or other instrument to measure a specific concept, running a pilot study allows you to check its validity and reliability in advance.

Sampling procedures

As well as choosing an appropriate sampling method, you need a concrete plan for how you’ll actually contact and recruit your selected sample.

That means making decisions about things like:

  • How many participants do you need for an adequate sample size?
  • What inclusion and exclusion criteria will you use to identify eligible participants?
  • How will you contact your sample – by mail, online, by phone, or in person?

If you’re using a probability sampling method, it’s important that everyone who is randomly selected actually participates in the study. How will you ensure a high response rate?

If you’re using a non-probability method, how will you avoid bias and ensure a representative sample?

Data management

It’s also important to create a data management plan for organising and storing your data.

Will you need to transcribe interviews or perform data entry for observations? You should anonymise and safeguard any sensitive data, and make sure it’s backed up regularly.

Keeping your data well organised will save time when it comes to analysing them. It can also help other researchers validate and add to your findings.

On their own, raw data can’t answer your research question. The last step of designing your research is planning how you’ll analyse the data.

Quantitative data analysis

In quantitative research, you’ll most likely use some form of statistical analysis. With statistics, you can summarise your sample data, make estimates, and test hypotheses.

Using descriptive statistics, you can summarise your sample data in terms of:

  • The distribution of the data (e.g., the frequency of each score on a test)
  • The central tendency of the data (e.g., the mean to describe the average score)
  • The variability of the data (e.g., the standard deviation to describe how spread out the scores are)

The specific calculations you can do depend on the level of measurement of your variables.

Using inferential statistics, you can:

  • Make estimates about the population based on your sample data.
  • Test hypotheses about a relationship between variables.

Regression and correlation tests look for associations between two or more variables, while comparison tests (such as t tests and ANOVAs) look for differences in the outcomes of different groups.
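
As a small illustration in Python (using pandas and SciPy, with made-up scores for two groups of participants), descriptive statistics, a comparison test, and a correlation test might look like this:

    import pandas as pd
    from scipy import stats

    # Hypothetical test scores for two groups of participants
    group_a = pd.Series([72, 85, 78, 90, 66, 81, 75, 88])
    group_b = pd.Series([65, 70, 74, 68, 79, 62, 71, 73])

    # Descriptive statistics: central tendency and variability
    print(group_a.describe())   # count, mean, std, min, quartiles, max

    # Inferential statistics: independent-samples t test for a difference in means
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

    # Correlation test between two variables measured on the same participants
    hours_studied = pd.Series([2, 5, 3, 6, 1, 4, 3, 5])
    r, p = stats.pearsonr(hours_studied, group_a)
    print(f"r = {r:.2f}, p = {p:.3f}")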

Your choice of statistical test depends on various aspects of your research design, including the types of variables you’re dealing with and the distribution of your data.

Qualitative data analysis

In qualitative research, your data will usually be very dense with information and ideas. Instead of summing it up in numbers, you’ll need to comb through the data in detail, interpret its meanings, identify patterns, and extract the parts that are most relevant to your research question.

Two of the most common approaches to doing this are thematic analysis and discourse analysis.

There are many other ways of analysing qualitative data depending on the aims of your research. To get a sense of potential approaches, try reading some qualitative research papers in your field.

A sample is a subset of individuals from a larger population. Sampling means selecting the group that you will actually collect data from in your research.

For example, if you are researching the opinions of students in your university, you could survey a sample of 100 students.

Statistical sampling allows you to test a hypothesis about the characteristics of a population. There are various sampling methods you can use to ensure that your sample is representative of the population as a whole.

Operationalisation means turning abstract conceptual ideas into measurable observations.

For example, the concept of social anxiety isn’t directly observable, but it can be operationally defined in terms of self-rating scores, behavioural avoidance of crowded places, or physical anxiety symptoms in social situations.

Before collecting data, it’s important to consider how you will operationalise the variables that you want to measure.

The research methods you use depend on the type of data you need to answer your research question.

  • If you want to measure something or test a hypothesis, use quantitative methods. If you want to explore ideas, thoughts, and meanings, use qualitative methods.
  • If you want to analyse a large amount of readily available data, use secondary data. If you want data specific to your purposes with control over how they are generated, collect primary data.
  • If you want to establish cause-and-effect relationships between variables, use experimental methods. If you want to understand the characteristics of a research subject, use descriptive methods.

Cite this Scribbr article

McCombes, S. (2023, March 20). Research Design | Step-by-Step Guide with Examples. Scribbr. Retrieved 22 April 2024, from https://www.scribbr.co.uk/research-methods/research-design/


Regression Analysis – Methods, Types and Examples


Regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or ‘predictors’).

Regression Analysis Methodology

Here is a general methodology for performing regression analysis:

  • Define the research question: Clearly state the research question or hypothesis you want to investigate. Identify the dependent variable (also called the response variable or outcome variable) and the independent variables (also called predictor variables or explanatory variables) that you believe are related to the dependent variable.
  • Collect data: Gather the data for the dependent variable and independent variables. Ensure that the data is relevant, accurate, and representative of the population or phenomenon you are studying.
  • Explore the data: Perform exploratory data analysis to understand the characteristics of the data, identify any missing values or outliers, and assess the relationships between variables through scatter plots, histograms, or summary statistics.
  • Choose the regression model: Select an appropriate regression model based on the nature of the variables and the research question. Common regression models include linear regression, multiple regression, logistic regression, polynomial regression, and time series regression, among others.
  • Assess assumptions: Check the assumptions of the regression model. Some common assumptions include linearity (the relationship between variables is linear), independence of errors, homoscedasticity (constant variance of errors), and normality of errors. Violation of these assumptions may require additional steps or alternative models.
  • Estimate the model: Use a suitable method to estimate the parameters of the regression model. The most common method is ordinary least squares (OLS), which minimizes the sum of squared differences between the observed and predicted values of the dependent variable.
  • Interpret the results: Analyze the estimated coefficients, p-values, confidence intervals, and goodness-of-fit measures (e.g., R-squared) to interpret the results. Determine the significance and direction of the relationships between the independent variables and the dependent variable.
  • Evaluate model performance: Assess the overall performance of the regression model using appropriate measures, such as R-squared, adjusted R-squared, and root mean squared error (RMSE). These measures indicate how well the model fits the data and how much of the variation in the dependent variable is explained by the independent variables.
  • Test assumptions and diagnose problems: Check the residuals (the differences between observed and predicted values) for any patterns or deviations from assumptions. Conduct diagnostic tests, such as examining residual plots, testing for multicollinearity among independent variables, and assessing heteroscedasticity or autocorrelation, if applicable.
  • Make predictions and draw conclusions: Once you have a satisfactory model, use it to make predictions on new or unseen data. Draw conclusions based on the results of the analysis, considering the limitations and potential implications of the findings.
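
The core of this workflow can be sketched in a few lines of Python; the snippet below (using statsmodels and scikit-learn on synthetic data) covers estimation by ordinary least squares, inspection of the results, and evaluation on held-out data. All variable names and values are invented for illustration:

    import numpy as np
    import statsmodels.api as sm
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    # Synthetic dataset: two independent variables and a dependent variable
    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 2))
    y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=200)

    # Hold out a test set for evaluating the fitted model
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # Estimate the model by ordinary least squares (OLS)
    model = sm.OLS(y_train, sm.add_constant(X_train)).fit()
    print(model.summary())   # coefficients, p-values, confidence intervals, R-squared

    # Evaluate performance on unseen data (RMSE) and make predictions
    y_pred = model.predict(sm.add_constant(X_test))
    rmse = mean_squared_error(y_test, y_pred) ** 0.5
    print(f"Test RMSE: {rmse:.2f}")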

Types of Regression Analysis

Types of Regression Analysis are as follows:

Linear Regression

Linear regression is the most basic and widely used form of regression analysis. It models the linear relationship between a dependent variable and one or more independent variables. The goal is to find the best-fitting line that minimizes the sum of squared differences between observed and predicted values.

Multiple Regression

Multiple regression extends linear regression by incorporating two or more independent variables to predict the dependent variable. It allows for examining the simultaneous effects of multiple predictors on the outcome variable.

Polynomial Regression

Polynomial regression models non-linear relationships between variables by adding polynomial terms (e.g., squared or cubic terms) to the regression equation. It can capture curved or nonlinear patterns in the data.
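
For instance, a minimal Python sketch (using NumPy on made-up data with a curved pattern) fits a quadratic regression by adding a squared term:

    import numpy as np

    # Hypothetical data following a curved (quadratic) pattern plus noise
    rng = np.random.default_rng(3)
    x = np.linspace(0, 10, 50)
    y = 2 + 0.5 * x + 0.3 * x**2 + rng.normal(scale=2.0, size=x.size)

    # Fit y = b0 + b1*x + b2*x^2 (degree-2 polynomial regression)
    b2, b1, b0 = np.polyfit(x, y, deg=2)
    print(f"y ≈ {b0:.2f} + {b1:.2f}x + {b2:.2f}x^2")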

Logistic Regression

Logistic regression is used when the dependent variable is binary or categorical. It models the probability of the occurrence of a certain event or outcome based on the independent variables. Logistic regression estimates the coefficients using the logistic function, which transforms the linear combination of predictors into a probability.

Ridge Regression and Lasso Regression

Ridge regression and Lasso regression are techniques used for addressing multicollinearity (high correlation between independent variables) and variable selection. Both methods introduce a penalty term to the regression equation to shrink or eliminate less important variables. Ridge regression uses L2 regularization, while Lasso regression uses L1 regularization.
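
A minimal scikit-learn sketch on synthetic data with two highly correlated predictors shows how the L2 and L1 penalties shrink the coefficients; the alpha values here are arbitrary:

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge, Lasso

    # Synthetic data with two highly correlated independent variables
    rng = np.random.default_rng(7)
    x1 = rng.normal(size=100)
    x2 = x1 * 0.95 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
    X = np.column_stack([x1, x2])
    y = 2.0 * x1 + 0.5 * x2 + rng.normal(size=100)

    for name, est in [("OLS", LinearRegression()),
                      ("Ridge (L2)", Ridge(alpha=1.0)),
                      ("Lasso (L1)", Lasso(alpha=0.1))]:
        est.fit(X, y)
        print(name, est.coef_)   # Lasso may shrink one coefficient all the way to zero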

Time Series Regression

Time series regression analyzes the relationship between a dependent variable and independent variables when the data is collected over time. It accounts for autocorrelation and trends in the data and is used in forecasting and studying temporal relationships.

Nonlinear Regression

Nonlinear regression models are used when the relationship between the dependent variable and independent variables is not linear. These models can take various functional forms and require estimation techniques different from those used in linear regression.

Poisson Regression

Poisson regression is employed when the dependent variable represents count data. It models the relationship between the independent variables and the expected count, assuming a Poisson distribution for the dependent variable.

Generalized Linear Models (GLM)

GLMs are a flexible class of regression models that extend the linear regression framework to handle different types of dependent variables, including binary, count, and continuous variables. GLMs incorporate various probability distributions and link functions.
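
As an illustration of both ideas, the statsmodels sketch below fits a Poisson regression as a GLM with a log link on invented count data:

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical count data: number of support tickets per day
    # as a function of the number of active users (in thousands)
    rng = np.random.default_rng(11)
    active_users = rng.uniform(1, 10, size=150)
    tickets = rng.poisson(lam=np.exp(0.2 + 0.15 * active_users))

    # Poisson regression as a generalized linear model with a log link
    X = sm.add_constant(active_users)
    model = sm.GLM(tickets, X, family=sm.families.Poisson()).fit()
    print(model.params)              # coefficients on the log scale
    print(np.exp(model.params[1]))   # multiplicative effect of 1,000 extra users on expected tickets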

Regression Analysis Formulas

Regression analysis involves estimating the parameters of a regression model to describe the relationship between the dependent variable (Y) and one or more independent variables (X). Here are the basic formulas for linear regression, multiple regression, and logistic regression:

Linear Regression:

Simple Linear Regression Model: Y = β0 + β1X + ε

Multiple Linear Regression Model: Y = β0 + β1X1 + β2X2 + … + βnXn + ε

In both formulas:

  • Y represents the dependent variable (response variable).
  • X represents the independent variable(s) (predictor variable(s)).
  • β0, β1, β2, …, βn are the regression coefficients or parameters that need to be estimated.
  • ε represents the error term or residual (the difference between the observed and predicted values).
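
As a concrete illustration, the short Python sketch below simulates data from the simple linear model Y = β0 + β1X + ε with chosen parameter values and then recovers estimates of β0 and β1; all numbers are arbitrary:

    import numpy as np

    rng = np.random.default_rng(5)
    beta0, beta1 = 4.0, 2.5                      # chosen "true" parameters
    X = rng.uniform(0, 10, size=500)
    epsilon = rng.normal(scale=1.5, size=500)    # error term
    Y = beta0 + beta1 * X + epsilon              # simple linear regression model

    # Estimating the parameters from the simulated data recovers values close to 4.0 and 2.5
    b1_hat, b0_hat = np.polyfit(X, Y, deg=1)
    print(b0_hat, b1_hat)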

Multiple Regression:

Multiple regression extends the concept of simple linear regression by including multiple independent variables.

Multiple Regression Model: Y = β0 + β1X1 + β2X2 + … + βnXn + ε

The formulas are similar to those in linear regression, with the addition of more independent variables.

Logistic Regression:

Logistic regression is used when the dependent variable is binary or categorical. The logistic regression model applies a logistic or sigmoid function to the linear combination of the independent variables.

Logistic Regression Model: p = 1 / (1 + e^-(β0 + β1X1 + β2X2 + … + βnXn))

In the formula:

  • p represents the probability of the event occurring (e.g., the probability of success or belonging to a certain category).
  • X1, X2, …, Xn represent the independent variables.
  • e is the base of the natural logarithm.

The logistic function ensures that the predicted probabilities lie between 0 and 1, allowing for binary classification.
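
The formula can also be written directly as a small Python function; the coefficient values below are arbitrary placeholders:

    import numpy as np

    def logistic_probability(x, betas):
        """Apply the logistic (sigmoid) function to the linear combination β0 + β1X1 + ... + βnXn."""
        x = np.asarray(x)
        linear_part = betas[0] + np.dot(betas[1:], x)
        return 1.0 / (1.0 + np.exp(-linear_part))

    # Example: two independent variables with arbitrary coefficients β0, β1, β2
    betas = np.array([-1.0, 0.8, 0.3])
    print(logistic_probability([2.0, 1.5], betas))  # a probability between 0 and 1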

Regression Analysis Examples

Regression Analysis Examples are as follows:

  • Stock Market Prediction: Regression analysis can be used to predict stock prices based on various factors such as historical prices, trading volume, news sentiment, and economic indicators. Traders and investors can use this analysis to make informed decisions about buying or selling stocks.
  • Demand Forecasting: In retail and e-commerce, regression analysis can help forecast demand for products. By analyzing historical sales data along with real-time data such as website traffic, promotional activities, and market trends, businesses can adjust their inventory levels and production schedules to meet customer demand more effectively.
  • Energy Load Forecasting: Utility companies often use real-time regression analysis to forecast electricity demand. By analyzing historical energy consumption data, weather conditions, and other relevant factors, they can predict future energy loads. This information helps them optimize power generation and distribution, ensuring a stable and efficient energy supply.
  • Online Advertising Performance: Regression analysis can be used to assess the performance of online advertising campaigns. By analyzing real-time data on ad impressions, click-through rates, conversion rates, and other metrics, advertisers can adjust their targeting, messaging, and ad placement strategies to maximize their return on investment.
  • Predictive Maintenance: Regression analysis can be applied to predict equipment failures or maintenance needs. By continuously monitoring sensor data from machines or vehicles, regression models can identify patterns or anomalies that indicate potential failures. This enables proactive maintenance, reducing downtime and optimizing maintenance schedules.
  • Financial Risk Assessment: Real-time regression analysis can help financial institutions assess the risk associated with lending or investment decisions. By analyzing real-time data on factors such as borrower financials, market conditions, and macroeconomic indicators, regression models can estimate the likelihood of default or assess the risk-return tradeoff for investment portfolios.

Importance of Regression Analysis

Importance of Regression Analysis is as follows:

  • Relationship Identification: Regression analysis helps in identifying and quantifying the relationship between a dependent variable and one or more independent variables. It allows us to determine how changes in independent variables impact the dependent variable. This information is crucial for decision-making, planning, and forecasting.
  • Prediction and Forecasting: Regression analysis enables us to make predictions and forecasts based on the relationships identified. By estimating the values of the dependent variable using known values of independent variables, regression models can provide valuable insights into future outcomes. This is particularly useful in business, economics, finance, and other fields where forecasting is vital for planning and strategy development.
  • Causality Assessment: While correlation does not imply causation, regression analysis provides a framework for assessing causality by considering the direction and strength of the relationship between variables. It allows researchers to control for other factors and assess the impact of a specific independent variable on the dependent variable. This helps in determining the causal effect and identifying significant factors that influence outcomes.
  • Model Building and Variable Selection: Regression analysis aids in model building by determining the most appropriate functional form of the relationship between variables. It helps researchers select relevant independent variables and eliminate irrelevant ones, reducing complexity and improving model accuracy. This process is crucial for creating robust and interpretable models.
  • Hypothesis Testing: Regression analysis provides a statistical framework for hypothesis testing. Researchers can test the significance of individual coefficients, assess the overall model fit, and determine if the relationship between variables is statistically significant. This allows for rigorous analysis and validation of research hypotheses.
  • Policy Evaluation and Decision-Making: Regression analysis plays a vital role in policy evaluation and decision-making processes. By analyzing historical data, researchers can evaluate the effectiveness of policy interventions and identify the key factors contributing to certain outcomes. This information helps policymakers make informed decisions, allocate resources effectively, and optimize policy implementation.
  • Risk Assessment and Control: Regression analysis can be used for risk assessment and control purposes. By analyzing historical data, organizations can identify risk factors and develop models that predict the likelihood of certain outcomes, such as defaults, accidents, or failures. This enables proactive risk management, allowing organizations to take preventive measures and mitigate potential risks.

When to Use Regression Analysis

  • Prediction : Regression analysis is often employed to predict the value of the dependent variable based on the values of independent variables. For example, you might use regression to predict sales based on advertising expenditure, or to predict a student’s academic performance based on variables like study time, attendance, and previous grades.
  • Relationship analysis: Regression can help determine the strength and direction of the relationship between variables. It can be used to examine whether there is a linear association between variables, identify which independent variables have a significant impact on the dependent variable, and quantify the magnitude of those effects.
  • Causal inference: Regression analysis can be used to explore cause-and-effect relationships by controlling for other variables. For example, in a medical study, you might use regression to determine the impact of a specific treatment while accounting for other factors like age, gender, and lifestyle.
  • Forecasting : Regression models can be utilized to forecast future trends or outcomes. By fitting a regression model to historical data, you can make predictions about future values of the dependent variable based on changes in the independent variables.
  • Model evaluation: Regression analysis can be used to evaluate the performance of a model or test the significance of variables. You can assess how well the model fits the data, determine if additional variables improve the model’s predictive power, or test the statistical significance of coefficients.
  • Data exploration : Regression analysis can help uncover patterns and insights in the data. By examining the relationships between variables, you can gain a deeper understanding of the data set and identify potential patterns, outliers, or influential observations.

Applications of Regression Analysis

Here are some common applications of regression analysis:

  • Economic Forecasting: Regression analysis is frequently employed in economics to forecast variables such as GDP growth, inflation rates, or stock market performance. By analyzing historical data and identifying the underlying relationships, economists can make predictions about future economic conditions.
  • Financial Analysis: Regression analysis plays a crucial role in financial analysis, such as predicting stock prices or evaluating the impact of financial factors on company performance. It helps analysts understand how variables like interest rates, company earnings, or market indices influence financial outcomes.
  • Marketing Research: Regression analysis helps marketers understand consumer behavior and make data-driven decisions. It can be used to predict sales based on advertising expenditures, pricing strategies, or demographic variables. Regression models provide insights into which marketing efforts are most effective and help optimize marketing campaigns.
  • Health Sciences: Regression analysis is extensively used in medical research and public health studies. It helps examine the relationship between risk factors and health outcomes, such as the impact of smoking on lung cancer or the relationship between diet and heart disease. Regression analysis also helps in predicting health outcomes based on various factors like age, genetic markers, or lifestyle choices.
  • Social Sciences: Regression analysis is widely used in social sciences like sociology, psychology, and education research. Researchers can investigate the impact of variables like income, education level, or social factors on various outcomes such as crime rates, academic performance, or job satisfaction.
  • Operations Research: Regression analysis is applied in operations research to optimize processes and improve efficiency. For example, it can be used to predict demand based on historical sales data, determine the factors influencing production output, or optimize supply chain logistics.
  • Environmental Studies: Regression analysis helps in understanding and predicting environmental phenomena. It can be used to analyze the impact of factors like temperature, pollution levels, or land use patterns on phenomena such as species diversity, water quality, or climate change.
  • Sports Analytics: Regression analysis is increasingly used in sports analytics to gain insights into player performance, team strategies, and game outcomes. It helps analyze the relationship between various factors like player statistics, coaching strategies, or environmental conditions and their impact on game outcomes.



A Beginner’s Guide to Regression Analysis

Published by Owen Ingram on September 1, 2021; revised on July 5, 2022

Are you comfortable making data-driven decisions at work? If not, what is stopping you? For many people the answer is simply "too much data getting in the way." Do not worry; there is a solution for parsing through tons of data.

Yes, you heard it right! With this solution, you will not have to struggle with the number crunching yourself. So what is the solution?

Without further ado, let us introduce you to "regression", a technique that, in effect, lets you peek into the future.

What is Regression Analysis?

Here is a scenario to help you understand what regression is and how it helps you make better strategic decisions in research.

Let’s say you are the CEO of a company and are trying to predict the profit margin for the next month. Many factors could affect that number: the number of sales you make during the month, the number of employees who do not take leave, the number of hours each worker puts in daily, and so on. But what if things do not go as planned? The list of "what ifs" can go on forever. All of these impacting factors are variables, and regression analysis is the process of mathematically working out which of these variables actually affect the outcome and which do not.

So, we can say that regression analysis helps you find the relationship between a set of dependent and independent variables. There are different ways to find this relationship between variables, which in statistics is named “ regression models .”

We will learn about each in the next heading.

Types of Regression Models

If you are not sure which type of regression model you should use for a particular study, this section might help you.

Though there are numerous types of regression models depending on the type of variables , these are the most common ones.

Linear Regression

Linear regression is the real workhorse of the industry and is probably the first type that comes to mind. It is also known as Linear Least Squares or Ordinary Least Squares. The model relates a dependent variable to a single predictor variable through a straight line, hence the name linear regression. If the data you are dealing with contain more than one independent variable, the model becomes Multi-Linear (multiple linear) Regression.

Logistic Regression

Logistic Regression comes into play when the dependent variable is discrete, meaning the target can take only one of two values, for instance true or false, yes or no, 0 or 1. In this case, a sigmoid curve describes the relationship between the independent and dependent variables.

When using this regression model for the data analysis process, two things should strictly be taken into consideration:

  • Make sure there is no multicollinearity, i.e. no strong correlation between the independent variables in the dataset
  • Also, ensure that the dataset is large and that the values of the target variable are roughly equally represented

Ridge Regression

Ridge Regression is used when there is high correlation among the independent variables. With such multicollinear data, least-squares estimates remain unbiased, but their variances are large, so the estimates can land far from the true values.

To counter this, ridge regression introduces a small amount of bias into the estimates through a penalty term. This powerful type of regression is therefore less vulnerable to overfitting. Are you familiar with the word ‘overfitting’?

Overfitting in statistics is a modeling error that occurs when a function is fitted too closely to a limited set of data points. A model compromised by this error can lose its predictive value entirely.

Lasso Regression

Lasso Regression is best suited for performing regularization alongside feature selection. This type of regression penalizes the absolute size of the regression coefficients. As a result, coefficient values are shrunk towards zero, and some reach exactly zero, which is the opposite of what happens in Ridge Regression.

This is why feature selection makes use of this regression model: only the required features keep non-zero coefficients, while all the other features are set to zero. Researchers get rid of overfitting in the model by doing this. But what if the independent variables are highly collinear?

In that case, this model tends to pick one of the collinear variables and shrink the others to zero. We can say that it is somewhat like Ridge Regression, but with built-in variable selection.

Polynomial Regression

Polynomial Regression is similar to Multi-Linear Regression, but with one key change: the relationship between the dependent and independent variables is modelled with an nth-degree polynomial. While the best-fit line in a Multi-Linear Regression Model is straight, in Polynomial Regression it is a curve passing through the data points, whose shape depends on the degree n and the values of X.

This model is also prone to overfitting. It is best to assess the curve towards the ends of the range, as higher-degree polynomials can give strange and unexpected results on extrapolation.

Bayesian Linear Regression

The last type of regression model we are going to discuss is Bayesian Linear Regression. Have you heard of Bayes’ theorem? This regression type uses it to work out the values of the regression coefficients.

It is a lot like both Ridge Regression and Linear Regression, but its estimates tend to be more stable. In this model, we work with the posterior distribution of the coefficients instead of relying on least squares alone.
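To make this concrete, here is a minimal Python sketch using scikit-learn's BayesianRidge estimator; the three predictors and the data are synthetic and purely illustrative, and this is only one of several ways to fit a Bayesian linear model.

```python
# A minimal sketch of Bayesian linear regression with scikit-learn's
# BayesianRidge. All data below are synthetic and for illustration only.
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # three illustrative predictors
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = BayesianRidge()
model.fit(X, y)

print("Posterior mean of coefficients:", model.coef_)
print("Intercept:", model.intercept_)

# Predictions come with an estimated standard deviation, reflecting the
# posterior distribution rather than a single least-squares fit.
y_pred, y_std = model.predict(X[:5], return_std=True)
print(y_pred, y_std)
```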

FAQs About Regression Analysis

What is regression?

It is a technique for finding the relationship between dependent and independent variables.

What is a linear regression model?

A linear regression model determines the relationship between continuous variables by fitting a linear equation to the observed data.

What is the difference between multi-linear regression and polynomial regression?

The only difference between Multi-Linear Regression and Polynomial Regression is that in the latter the relationship between ‘x’ and ‘y’ is modelled with an nth-degree polynomial, so the fitted line is a curve, while in Multi-Linear Regression the line is straight.

What is overfitting in statistics?

When a function in statistics corresponds too closely to a particular set of data, some modeling error is possible. This modeling error is called overfitting.

What is ridge regression?

It is a method of finding the coefficients of multiple regression models in which the independent variables are highly correlated. In other words, it is a method for developing a parsimonious model when the number of predictor variables is higher than the number of observations in a set.


Regression Analysis: Definition, Types, Usage & Advantages


Regression analysis is perhaps one of the most widely used statistical methods for investigating or estimating the relationship between a set of independent and dependent variables. In statistical analysis , distinguishing between categorical data and numerical data is essential, as categorical data involves distinct categories or labels, while numerical data consists of measurable quantities.

It is also used as a blanket term for a variety of data analysis techniques utilized in quantitative research for modeling and analyzing numerous variables. In the regression method, the independent variable is a predictor or an explanatory element, and the dependent variable is the outcome or response to a specific query.


Definition of Regression Analysis

Regression analysis is often used to model or analyze data. Most survey analysts use it to understand the relationship between the variables, which can be further utilized to predict the precise outcome.

For Example – Suppose a soft drink company wants to expand its manufacturing unit to a newer location. Before moving forward, the company wants to analyze its revenue generation model and the various factors that might impact it. Hence, the company conducts an online survey with a specific questionnaire.

After using regression analysis, it becomes easier for the company to analyze the survey results and understand the relationship between different variables like electricity and revenue – here, revenue is the dependent variable.


In addition, understanding the relationship between different independent variables like pricing, number of workers, and logistics with the revenue helps the company estimate the impact of varied factors on sales and profits.

Survey researchers often use this technique to examine and find a correlation between different variables of interest. It provides an opportunity to gauge the influence of different independent variables on a dependent variable.

Overall, regression analysis saves the survey researchers’ additional efforts in arranging several independent variables in tables and testing or calculating their effect on a dependent variable. Different types of analytical research methods are widely used to evaluate new business ideas and make informed decisions.


Types of Regression Analysis

Researchers usually start by learning linear and logistic regression first. Because these two methods are so widely known and easy to apply, many analysts assume they are the only types of regression model. In reality, each model has its own specialty and performs best when specific conditions are met.

This blog explains seven commonly used types of regression analysis that can be used to interpret data in various formats.

01. Linear Regression Analysis

It is one of the most widely known modeling techniques, as it is among the first regression analysis methods people pick up when learning predictive modeling. Here, the dependent variable is continuous, and the independent variables are continuous or discrete, with a linear regression line fitted between them.

Please note that multiple linear regression has more than one independent variable, whereas simple linear regression has only one. Linear regression is best used only when there is a linear relationship between the independent variables and the dependent variable.

A business can use linear regression to measure the effectiveness of the marketing campaigns, pricing, and promotions on sales of a product. Suppose a company selling sports equipment wants to understand if the funds they have invested in the marketing and branding of their products have given them substantial returns or not.

Linear regression is a suitable statistical method to interpret the results here. A useful property of linear regression is that it helps isolate the impact of each marketing and branding activity while controlling for the other factors that influence sales.

If the company is running two or more advertising campaigns simultaneously, for example one on television and one on radio, linear regression can analyze both the independent and the combined influence of running these advertisements.
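As an illustration of the advertising example, here is a minimal Python sketch using scikit-learn's LinearRegression; the TV and radio spend figures are synthetic stand-ins, not real campaign data.

```python
# A minimal multiple linear regression sketch with scikit-learn.
# The advertising data below are invented solely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
tv_spend = rng.uniform(10, 100, size=60)      # hypothetical TV budget (thousands)
radio_spend = rng.uniform(5, 50, size=60)     # hypothetical radio budget (thousands)
sales = 3.0 * tv_spend + 1.5 * radio_spend + rng.normal(scale=20, size=60)

X = np.column_stack([tv_spend, radio_spend])
model = LinearRegression().fit(X, sales)

print("Coefficients (TV, radio):", model.coef_)   # estimated effect of each channel
print("Intercept:", model.intercept_)
print("R-squared:", model.score(X, sales))        # proportion of variance explained

# Predicted sales for a hypothetical campaign mix
print(model.predict([[80, 20]]))
```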


02. Logistic Regression Analysis

Logistic regression is commonly used to determine the probability of an event's success or failure. It is used whenever the dependent variable is binary, such as 0/1, True/False, or Yes/No. It is therefore well suited to analyzing close-ended survey questions with two possible answers.

Please note that, unlike linear regression, logistic regression does not require a linear relationship between the dependent and independent variables. It applies a non-linear log transformation to predict the odds ratio, and therefore easily handles various types of relationships between the dependent and independent variables.

Logistic regression is widely used to analyze categorical data, particularly binary response data, in business data modeling. It is most often used when the dependent variable is categorical, for example to predict whether a health claim made by a person is real (1) or fraudulent (0), or whether a tumor is malignant (1) or benign (0).

Businesses use logistic regression to predict whether the consumers in a particular demographic will purchase their product or will buy from the competitors based on age, income, gender, race, state of residence, previous purchase, etc.
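A minimal sketch of this idea in Python, assuming hypothetical age and income predictors and a synthetic purchase outcome:

```python
# A minimal logistic regression sketch: predicting a binary purchase decision
# from age and income. The data are synthetic and illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
age = rng.uniform(18, 70, size=200)
income = rng.uniform(20, 120, size=200)           # in thousands
# Hypothetical rule plus noise: higher income -> more likely to purchase
logit = -6 + 0.03 * age + 0.06 * income
prob = 1 / (1 + np.exp(-logit))
purchased = rng.binomial(1, prob)

X = np.column_stack([age, income])
model = LogisticRegression(max_iter=1000).fit(X, purchased)

print("Coefficients (age, income):", model.coef_)
print("P(purchase) for a 35-year-old earning 80k:",
      model.predict_proba([[35, 80]])[0, 1])
```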

03. Polynomial Regression Analysis

Polynomial regression is commonly used to analyze curvilinear data, that is, when an independent variable enters the model with a power greater than 1. In this regression analysis method, the best-fit line is never a straight line but a curved line fitted through the data points.

Please note that polynomial regression is most appropriate when some variables enter the model with exponents while others remain linear.

Additionally, it can model non-linearly separable data offering the liberty to choose the exact exponent for each variable, and that too with full control over the modeling features available.

When combined with response surface analysis, polynomial regression is considered one of the sophisticated statistical methods commonly used in multisource feedback research. Polynomial regression is used mostly in finance and insurance-related industries where the relationship between dependent and independent variables is curvilinear.

Suppose a person wants to plan a budget by determining how long it would take to earn a definitive sum. Polynomial regression, by taking his or her income into account and predicting expenses, can estimate the time he or she needs to work to earn that specific amount.
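A minimal polynomial-regression sketch in Python, fitting a quadratic curve to synthetic data with numpy.polyfit (the income-and-expenses scenario above would be set up analogously):

```python
# A minimal polynomial regression sketch using numpy.polyfit.
# x and y below are synthetic; a quadratic (degree 2) curve is fitted.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
y = 1.5 * x**2 - 4 * x + 3 + rng.normal(scale=5, size=x.size)

coeffs = np.polyfit(x, y, deg=2)        # highest-degree coefficient first
poly = np.poly1d(coeffs)

print("Fitted coefficients:", coeffs)
print("Prediction at x = 7:", poly(7))
```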

04. Stepwise Regression Analysis

This is a semi-automated process in which a statistical model is built by adding or removing independent variables based on the t-statistics of their estimated coefficients.

If used properly, stepwise regression can give you more insight per unit of effort than many other methods. It works well when you are dealing with a large number of independent variables, fine-tuning the model by adding or dropping variables according to statistical criteria rather than at random.

Stepwise regression analysis is recommended to be used when there are multiple independent variables, wherein the selection of independent variables is done automatically without human intervention.

Please note that in stepwise regression modeling, variables are added to or removed from the set of explanatory variables, with the choice of variables depending on the test statistics of the estimated coefficients.

Suppose you have a set of independent variables like age, weight, body surface area, duration of hypertension, basal pulse, and stress index based on which you want to analyze its impact on the blood pressure.

In stepwise regression, the best subset of the independent variable is automatically chosen; it either starts by choosing no variable to proceed further (as it adds one variable at a time) or starts with all variables in the model and proceeds backward (removes one variable at a time).

Thus, using regression analysis, you can calculate the impact of each or a group of variables on blood pressure.
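Classical stepwise selection based on t-statistics is available in many statistical packages; as a rough Python analogue, the sketch below uses scikit-learn's SequentialFeatureSelector, which adds predictors one at a time based on cross-validated model score rather than t-tests. The blood-pressure-style variables are synthetic.

```python
# A sketch of forward selection with scikit-learn's SequentialFeatureSelector.
# Note: variables are chosen by cross-validated score, not by t-statistics
# as in classical stepwise regression. All data below are synthetic.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n = 150
X = rng.normal(size=(n, 6))   # age, weight, BSA, duration, pulse, stress (illustrative)
blood_pressure = 1.2 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=1.0, size=n)

selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward"
)
selector.fit(X, blood_pressure)

print("Selected feature indices:", np.where(selector.get_support())[0])
```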

05. Ridge Regression Analysis

Ridge regression builds on the ordinary least squares method and is used to analyze multicollinear data (data in which the independent variables are highly correlated). Collinearity can be described as a near-linear relationship between variables.

When there is multicollinearity, the least-squares estimates remain unbiased, but their variances are large, so they can be far from the true values. Ridge regression reduces these standard errors by adding some degree of bias to the regression estimates, with the aim of providing more reliable estimates.


Please note that the assumptions of ridge regression are the same as those of least-squares regression, except that normality is not assumed. Although the coefficient values are constricted (shrunk) in ridge regression, they never reach exactly zero, which means ridge regression cannot perform variable selection.

Suppose you are crazy about two guitarists performing live at an event near you, and you go to watch their performance hoping to find out who the better guitarist is. But when the performance starts, you notice that both are playing at the same time, loud and fast.

Is it possible to tell which guitarist has the bigger impact on the sound when both are playing simultaneously? Because their contributions overlap so heavily, it is substantially difficult to separate them, making this a textbook case of multicollinearity, which tends to increase the standard errors of the coefficients.

Ridge regression addresses multicollinearity in cases like these and includes bias or a shrinkage estimation to derive results.
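The guitarist scenario can be mimicked with two nearly collinear predictors; the sketch below (synthetic data, scikit-learn's Ridge) shows how a small penalty stabilises the coefficients compared with ordinary least squares.

```python
# A minimal ridge regression sketch: two highly correlated predictors
# (mimicking the "two guitarists" multicollinearity example), with a small
# amount of bias (alpha) added to stabilise the coefficient estimates.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(5)
guitar1 = rng.normal(size=100)
guitar2 = guitar1 + rng.normal(scale=0.05, size=100)   # nearly collinear
sound = 2.0 * guitar1 + 2.0 * guitar2 + rng.normal(scale=0.5, size=100)

X = np.column_stack([guitar1, guitar2])

ols = LinearRegression().fit(X, sound)
ridge = Ridge(alpha=1.0).fit(X, sound)

print("OLS coefficients (unstable):", ols.coef_)
print("Ridge coefficients (shrunk, more stable):", ridge.coef_)
```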

06. Lasso Regression Analysis

Lasso (Least Absolute Shrinkage and Selection Operator) is similar to ridge regression; however, it penalizes the absolute values of the coefficients (an L1 penalty) instead of their squares (the L2 penalty used in ridge regression).

It was developed as an alternative to the traditional least-squares estimate, with the intention of reducing the problems of overfitting that arise when the data have a large number of independent variables.

Lasso can perform both variable selection and regularization via soft thresholding. Applying lasso regression makes it easier to derive a subset of predictors that minimizes prediction error when analyzing a quantitative response.

Please note that regression coefficients shrunk to exactly zero are excluded from the lasso model. Coefficients that remain non-zero are strongly associated with the response variable; the explanatory variables themselves can be quantitative, categorical, or both.

Suppose an automobile company wants to perform a research analysis on average fuel consumption by cars in the US. For samples, they chose 32 models of car and 10 features of automobile design – Number of cylinders, Displacement, Gross horsepower, Rear axle ratio, Weight, ¼ mile time, v/s engine, transmission, number of gears, and number of carburetors.

The response variable, mpg (miles per gallon), turns out to be highly correlated with several of these variables, such as weight, displacement, number of cylinders, and horsepower. The problem can be analyzed using the glmnet package in R, applying lasso regression for feature selection.
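The text above refers to R's glmnet; as a rough Python analogue, the sketch below uses scikit-learn's LassoCV on simulated stand-in data (not the real car dataset) to show how lasso drops uninformative features by shrinking their coefficients to zero.

```python
# A sketch of lasso-based feature selection in Python (an analogue of the
# R glmnet workflow mentioned above). The car data here are simulated stand-ins
# for the mpg example, not the real dataset.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
n, p = 32, 10                      # 32 car models, 10 design features
X = rng.normal(size=(n, p))
mpg = -3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=1.0, size=n)

X_scaled = StandardScaler().fit_transform(X)   # lasso is sensitive to scale
lasso = LassoCV(cv=5).fit(X_scaled, mpg)

print("Chosen penalty (alpha):", lasso.alpha_)
print("Coefficients (zeros are dropped features):", lasso.coef_)
```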

07. Elastic Net Regression Analysis

It is a mixture of ridge and lasso regression models trained with L1 and L2 norms. The elastic net brings about a grouping effect wherein strongly correlated predictors tend to be in/out of the model together. Using the elastic net regression model is recommended when the number of predictors is far greater than the number of observations.

Please note that the elastic net regression model came into existence as an alternative to the lasso regression model, because lasso's variable selection can depend too heavily on the data, making it unstable. By combining the penalties of ridge and lasso regression, elastic net aims to get the best out of both models.

A clinical research team with access to a microarray data set on leukemia (LEU) was interested in constructing a diagnostic rule, based on the expression levels of the sampled genes, for predicting the type of leukemia. The data set consisted of a large number of genes and only a few samples.

Apart from that, they were given a specific set of samples to be used as training samples, out of which some were infected with type 1 leukemia (acute lymphoblastic leukemia) and some with type 2 leukemia (acute myeloid leukemia).

Model fitting and tuning parameter selection by tenfold CV were carried out on the training data. Then they compared the performance of those methods by computing their prediction mean-squared error on the test data to get the necessary results.
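A minimal sketch of the many-predictors, few-samples situation in Python, using scikit-learn's ElasticNetCV on entirely synthetic data (not the leukemia data described above):

```python
# A sketch of elastic net regression with scikit-learn. ElasticNetCV searches
# over both the penalty strength and the L1/L2 mix (l1_ratio). The
# high-dimensional data below stand in for a genes-vs-samples setting and are
# entirely synthetic.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(7)
n_samples, n_genes = 40, 500              # far more predictors than observations
X = rng.normal(size=(n_samples, n_genes))
y = X[:, :5] @ np.array([2.0, -1.5, 1.0, 0.5, -0.5]) + rng.normal(scale=0.5, size=n_samples)

model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5, max_iter=5000).fit(X, y)

print("Selected l1_ratio:", model.l1_ratio_)
print("Non-zero coefficients:", np.sum(model.coef_ != 0))
```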

Regression Analysis Usage in Market Research

A market research survey focuses on three major metrics: Customer Satisfaction, Customer Loyalty, and Customer Advocacy. Remember, although these metrics tell us about customer health and intentions, they fail to tell us how to improve the position. Therefore, an in-depth survey questionnaire that asks consumers the reasons behind their dissatisfaction is definitely a way to gain practical insights.

However, it has been found that people often struggle to put their motivation or demotivation into words, or to describe their satisfaction or dissatisfaction. In addition, people tend to give undue importance to rational factors such as price, packaging, and so on. Regression analysis helps cut through this, acting as a predictive analytics and forecasting tool in market research.

When used as a forecasting tool, regression analysis can determine an organization’s sales figures by taking into account external market data. A multinational company conducts a market research survey to understand the impact of various factors such as GDP (Gross Domestic Product), CPI (Consumer Price Index), and other similar factors on its revenue generation model.

Regression analysis, applied to these forecasted marketing indicators, was used to predict the tentative revenue that would be generated in future quarters and even future years. However, the further into the future you forecast, the less reliable the data becomes, leaving a wider margin of error.

Case study of using regression analysis

A water purifier company wanted to understand the factors leading to brand favorability. A survey was the best medium for reaching out to existing and prospective customers, so a large-scale consumer survey was planned and a carefully designed questionnaire was prepared using the best survey tool.

A number of questions related to the brand, favorability, satisfaction, and probable dissatisfaction were effectively asked in the survey. After getting optimum responses to the survey, regression analysis was used to narrow down the top ten factors responsible for driving brand favorability.

All ten derived attributes highlighted, in one way or another, their importance in impacting the favorability of that specific water purifier brand.

How Regression Analysis Derives Insights from Surveys

It is easy to run a regression analysis using Excel or SPSS, but while doing so, it is important to understand four numbers when interpreting the output.

The first two numbers out of the four numbers directly relate to the regression model itself.

  • Significance F (the P-value of the F-test): it measures the overall statistical significance of the survey model. A value well below 0.05 is considered meaningful, as it indicates that the survey analysis output is unlikely to have arisen by chance.
  • R-Squared: this is the proportion of the movement in the dependent variable that the independent variables are able to explain. If the R-Squared value is 0.7, the tested independent variables explain 70% of the dependent variable’s movement, meaning the survey analysis output is highly predictive and can be considered reliable.

The other two numbers relate to each of the independent variables while interpreting regression analysis.

  • P-Value: like the significance of the overall model, the P-value for each independent variable indicates how relevant and statistically significant that variable’s effect is. Once again, we are looking for a value of less than 0.05.
  • Coefficient: the fourth number is the coefficient estimated for each independent variable. It tells us by how much the dependent variable is expected to change when that independent variable increases by one unit, while all other independent variables are held constant.

In a few cases, the simple coefficient is replaced by a standardized coefficient demonstrating the contribution from each independent variable to move or bring about a change in the dependent variable.
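As a sketch of where these four numbers appear in practice, the Python example below fits an ordinary least squares model with statsmodels (an alternative to Excel or SPSS) on synthetic data and prints the overall significance, R-squared, coefficients, and per-variable P-values.

```python
# A sketch of reading the "four numbers" from a regression run in Python with
# statsmodels. All data are synthetic and purely illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 100
price = rng.uniform(1, 10, size=n)
ad_spend = rng.uniform(0, 50, size=n)
revenue = 5.0 * ad_spend - 3.0 * price + rng.normal(scale=10, size=n)

X = sm.add_constant(np.column_stack([price, ad_spend]))
model = sm.OLS(revenue, X).fit()

print("Significance of the model (P-value of the F-test):", model.f_pvalue)
print("R-squared:", model.rsquared)
print("Coefficients:", model.params)      # change in revenue per unit increase
print("P-values per variable:", model.pvalues)
```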

Advantages of Using Regression Analysis in an Online Survey

01. Get access to predictive analytics

Do you know utilizing regression analysis to understand the outcome of a business survey is like having the power to unveil future opportunities and risks?

For example, we can predict the expected audience for a particular television advertisement slot, and businesses can use that estimate to set a maximum bid for the slot. The finance and insurance industry as a whole depends heavily on regression analysis of survey data to identify trends and opportunities for more accurate planning and decision-making.

02. Enhance operational efficiency

Do you know businesses use regression analysis to optimize their business processes?

For example, before launching a new product line, businesses conduct consumer surveys to better understand the impact of various factors on the product’s production, packaging, distribution, and consumption.

A data-driven foresight helps eliminate the guesswork, hypothesis, and internal politics from decision-making. A deeper understanding of the areas impacting operational efficiencies and revenues leads to better business optimization.

03. Quantitative support for decision-making

Business surveys today generate a lot of data related to finance, revenue, operation, purchases, etc., and business owners are heavily dependent on various data analysis models to make informed business decisions.

For example, regression analysis helps enterprises to make informed strategic workforce decisions. Conducting and interpreting the outcome of employee surveys like Employee Engagement Surveys, Employee Satisfaction Surveys, Employer Improvement Surveys, Employee Exit Surveys, etc., boosts the understanding of the relationship between employees and the enterprise.

It also helps get a fair idea of certain issues impacting the organization’s working culture, working environment, and productivity. Furthermore, intelligent business-oriented interpretations reduce the huge pile of raw data into actionable information to make a more informed decision.

04. Prevent mistakes from happening due to intuitions

By knowing how to use regression analysis to interpret survey results, one can easily provide factual support to management for making informed decisions. But did you know that it also helps keep faulty judgment in check?

For example, a mall manager may believe that extending the mall’s closing time will result in more sales. Regression analysis can contradict this belief by showing that the predicted increase in revenue from additional sales would not cover the increased operating expenses arising from longer working hours.

Regression analysis is a useful statistical method for modeling and comprehending the relationships between variables. It provides numerous advantages to various data types and interactions. Researchers and analysts may gain useful insights into the factors influencing a dependent variable and use the results to make informed decisions. 

With QuestionPro Research, you can improve the efficiency and accuracy of regression analysis by streamlining the data gathering, analysis, and reporting processes. The platform’s user-friendly interface and wide range of features make it a valuable tool for researchers and analysts conducting regression analysis as part of their research projects.



Quantitative Research Methods: Regression and Correlation

Correlation is the relationship or association between two variables. There are multiple ways to measure correlation, but the most common is Pearson's correlation coefficient (r), which tells you the strength of the linear relationship between two variables. The value of r has a range of -1 to 1 (0 indicates no relationship). Values of r closer to -1 or 1 indicate a stronger relationship and values closer to 0 indicate a weaker relationship.  Because Pearson's coefficient only picks up on linear relationships, and there are many other ways for variables to be associated, it's always best to plot your variables on a scatter plot, so that you can visually inspect them for other types of correlation.

  • Correlation Penn State University tutorial
  • Correlation and Causation Australian Bureau of Statistics Article

Spurious Relationships

It's important to remember that correlation does not always indicate causation. Two variables can be correlated without either variable causing the other. For instance, ice cream sales and drownings might be correlated, but that doesn't mean that ice cream causes drownings—instead, both ice cream sales and drownings increase when the weather is hot. Relationships like this are called spurious correlations.

  • Spuriousness Harvard Business Review article.
  • New Evidence for Theory of The Stork A satirical article demonstrating the dangers of confusing correlation with causation.


Regression is a statistical method for estimating the relationship between two or more variables. In theory, regression can be used to predict the value of one variable (the dependent variable) from the value of one or more other variables (the independent variable/s or predictor/s). There are many different types of regression, depending on the number of variables and the properties of the data that one is working with, and each makes assumptions about the relationship between the variables. (For instance, most types of regression assume that the variables have a linear relationship.) Therefore, it is important to understand the assumptions underlying the type of regression that you use and how to properly interpret its results. Because regression will always output a relationship, whether or not the variables are truly causally associated, it is also important to carefully select your predictor variables.

  • A Refresher on Regression Analysis Harvard Business Review article.
  • Introductory Business Statistics - Regression

Simple Linear Regression

Simple linear regression estimates a linear relationship between one dependent variable and one independent variable.

  • Simple Linear Regression Tutorial Penn State University Tutorial
  • Statistics 101: Linear Regression, The Very Basics YouTube video from Brandon Foltz.

Multiple Linear Regression

Multiple linear regression estimates a linear relationship between one dependent variable and two or more independent variables.

  • Multiple Linear Regression Tutorial Penn State University Tutorial
  • Multiple Regression Basics NYU course materials.
  • Statistics 101: Multiple Linear Regression, The Very Basics YouTube video from Brandon Foltz.

If you do a subject search for Regression Analysis, you'll see that the library has over 200 books about regression. Also, note that econometrics texts will often include regression analysis and other related methods.


Search for ebooks using Quicksearch .  Use keywords to search for e-books about Regression .  


A Refresher on Regression Analysis


Understanding one of the most important types of data analysis.

You probably know by now that whenever possible you should be making data-driven decisions at work . But do you know how to parse through all the data available to you? The good news is that you probably don’t need to do the number crunching yourself (hallelujah!) but you do need to correctly understand and interpret the analysis created by your colleagues. One of the most important types of data analysis is called regression analysis.

Amy Gallo is a contributing editor at Harvard Business Review, cohost of the Women at Work podcast, and the author of Getting Along: How to Work with Anyone (Even Difficult People) and the HBR Guide to Dealing with Conflict.

Clin Kidney J, 14(11), November 2021

Conducting correlation analysis: important limitations and pitfalls

Roemer J Janse, Tiny Hoekstra, Kitty J Jager, Carmine Zoccali, Giovanni Tripepi, Friedo W Dekker and Merel van Diepen
The correlation coefficient is a statistical measure often used in studies to show an association between variables or to look at the agreement between two methods. In this paper, we will discuss not only the basics of the correlation coefficient, such as its assumptions and how it is interpreted, but also important limitations when using the correlation coefficient, such as its assumption of a linear association and its sensitivity to the range of observations. We will also discuss why the coefficient is invalid when used to assess agreement of two methods aiming to measure a certain value, and discuss better alternatives, such as the intraclass coefficient and Bland–Altman’s limits of agreement. The concepts discussed in this paper are supported with examples from literature in the field of nephrology.

‘Correlation is not causation’: a saying not rarely uttered when a person infers causality from two variables occurring together, without them truly affecting each other. Yet, though causation may not always be understood correctly, correlation too is a concept in which mistakes are easily made. Nonetheless, the correlation coefficient has often been reported within the medical literature. It estimates the association between two variables (e.g. blood pressure and kidney function), or is used for the estimation of agreement between two methods of measurement that aim to measure the same variable (e.g. the Modification of Diet in Renal Disease (MDRD) formula and the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) formula for estimating the glomerular filtration rate (eGFR)]. Despite the wide use of the correlation coefficient, limitations and pitfalls for both situations exist, of which one should be aware when drawing conclusions from correlation coefficients. In this paper, we aim to describe the correlation coefficient and its limitations, together with methods that can be applied to avoid these limitations.

The basics: the correlation coefficient

Fundamentals

The correlation coefficient was described over a hundred years ago by Karl Pearson [ 1 ], taking inspiration from a similar idea of correlation from Sir Francis Galton, who developed linear regression and was the not-so-well-known half-cousin of Charles Darwin [ 2 ]. In short, the correlation coefficient, denoted with the Greek character rho ( ρ ) for the true (theoretical) population and r for a sample of the true population, aims to estimate the strength of the linear association between two variables. If we have variables X and Y that are plotted against each other in a scatter plot, the correlation coefficient indicates how well a straight line fits these data. The coefficient ranges from −1 to 1 and is dimensionless (i.e., it has no unit). Two correlations with r = −1 and r  = 1 are shown in Figure 1A and B , respectively. The values of −1 and 1 indicate that all observations can be described perfectly using a straight line, which in turn means that if X is known, Y can be determined deterministically and vice versa. Here, the minus sign indicates an inverse association: if X increases, Y decreases. Nonetheless, real-world data are often not perfectly summarized using a straight line. In a scatterplot as shown in Figure 1C , the correlation coefficient represents how well a linear association fits the data.

Figure 1: Different shapes of data and their correlation coefficients. (A) A linear association with r = −1. (B) A linear association with r = 1. (C) A scatterplot through which a straight line could plausibly be drawn, with r = 0.50. (D) A sinusoidal association with r = 0. (E) A quadratic association with r = 0. (F) An exponential association with r = 0.50.

It is also possible to test the hypothesis of whether X and Y are correlated, which yields a P-value indicating the chance of finding the correlation coefficient’s observed value or any value indicating a higher degree of correlation, given that the two variables are not actually correlated. Though the correlation coefficient will not vary depending on sample size, the P-value yielded with the t -test will.

The value of the correlation coefficient is also not influenced by the units of measurement, but it is influenced by measurement error. If more error (also known as noise) is present in the variables X and Y , variability in X will be partially due to the error in X , and thus not solely explainable by Y . Moreover, the correlation coefficient is also sensitive to the range of observations, which we will discuss later in this paper.

An assumption of the Pearson correlation coefficient is that the joint distribution of the variables is normal. However, it has been shown that the correlation coefficient is quite robust with regard to this assumption, meaning that Pearson’s correlation coefficient may still be validly estimated in skewed distributions [ 3 ]. If desired, a non-parametric method is also available to estimate correlation; namely, the Spearman’s rank correlation coefficient. Instead of the actual values of observations, the Spearman’s correlation coefficient uses the rank of the observations when ordering observations from small to large, hence the ‘rank’ in its name [ 4 ]. This usage of the rank makes it robust against outliers [ 4 ].
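For readers working in Python rather than a point-and-click package, a minimal sketch of estimating Pearson's r and Spearman's rank correlation with SciPy (on synthetic data) looks like this:

```python
# A minimal sketch of estimating Pearson's and Spearman's correlation
# coefficients in Python with SciPy. The data are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
x = rng.normal(size=50)
y = 0.7 * x + rng.normal(scale=0.7, size=50)

r, p_value = stats.pearsonr(x, y)          # linear association
rho, p_spear = stats.spearmanr(x, y)       # rank-based, robust to outliers

print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {p_spear:.3f})")
```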

Explained variance and interpretation

One may also translate the correlation coefficient into a measure of the explained variance (also known as R²), by taking its square. The result can be interpreted as the proportion of statistical variability (i.e. variance) in one variable that can be explained by the other variable. In other words, to what degree can variable X be explained by Y and vice versa. For instance, as mentioned above, a correlation of −1 or +1 would both allow us to determine X from Y and vice versa without error, which is also shown in the coefficient of determination, which would be (−1)² or 1² = 1, indicating that 100% of variability in one variable can be explained by the other variable.

In some cases, the interpretation of the strength of correlation coefficient is based on rules of thumb, as is often the case with P-values (P-value <0.05 is statistically significant, P-value >0.05 is not statistically significant). However, such rules of thumb should not be used for correlations. Instead, the interpretation should always depend on context and purposes [ 5 ]. For instance, when studying the association of renin–angiotensin–system inhibitors (RASi) with blood pressure, patients with increased blood pressure may receive the perfect dosage of RASi until their blood pressure is exactly normal. Those with an already exactly normal blood pressure will not receive RASi. However, as the perfect dosage of RASi makes the blood pressure of the RASi users exactly normal, and thus equal to the blood pressure of the RASi non-users, no variation is left between users and non-users. Because of this, the correlation will be 0.

The linearity of correlation

An important limitation of the correlation coefficient is that it assumes a linear association. This also means that any linear transformation and any scale transformation of either variable X or Y , or both, will not affect the correlation coefficient. However, variables X and Y may also have a non-linear association, which could still yield a low correlation coefficient, as seen in Figure 1D and E , even though variables X and Y are clearly related. Nonetheless, the correlation coefficient will not always return 0 in case of a non-linear association, as portrayed in Figure 1F with an exponential correlation with r  = 0.5. In short, a correlation coefficient is not a measure of the best-fitted line through the observations, but only the degree to which the observations lie on one straight line.

In general, before calculating a correlation coefficient, it is advised to inspect a scatterplot of the observations in order to assess whether the data could possibly be described with a linear association and whether calculating a correlation coefficient makes sense. For instance, the scatterplot in Figure 1C could plausibly fit a straight line, and a correlation coefficient would therefore be suitable to describe the association in the data.

The range of observations for correlation

An important pitfall of the correlation coefficient is that it is influenced by the range of observations. In Figure 2A , we illustrate hypothetical data with 50 observations, with r  = 0.87. Included in the figure is an ellipse that shows the variance of the full observed data, and an ellipse that shows the variance of only the 25 lowest observations. If we subsequently analyse these 25 observations independently as shown in Figure 2B , we will see that the ellipse has shortened. If we determine the correlation coefficient for Figure 2B , we will also find a substantially lower correlation: r  = 0.57.

Figure 2: The effect of the range of observations on the correlation coefficient, as shown with ellipses. (A) Set of 50 observations from hypothetical dataset X with r = 0.87, with an illustrative ellipse showing the length and width of the whole dataset, and an ellipse showing only the first 25 observations. (B) Set of only the 25 lowest observations from hypothetical dataset X with r = 0.57, with an illustrative ellipse showing length and width.
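The effect can be reproduced with a small simulation; the sketch below uses its own synthetic data (so the exact r values will differ from the figure) and computes r on the full sample and on only the 25 lowest observations.

```python
# A small simulation illustrating how restricting the range of observations
# lowers the correlation coefficient. The data are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
x = np.sort(rng.normal(size=50))
y = x + rng.normal(scale=0.6, size=50)

r_full, _ = stats.pearsonr(x, y)
r_low, _ = stats.pearsonr(x[:25], y[:25])   # only the 25 lowest x-values

print(f"Full range:       r = {r_full:.2f}")
print(f"Restricted range: r = {r_low:.2f}")
```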

The importance of the range of observations can further be illustrated using an example from a paper by Pierrat et al. [6] in which the correlation between the eGFR calculated using inulin clearance and the eGFR calculated using the Cockcroft–Gault formula was studied in both adults and children. Children had a higher correlation coefficient than adults (r = 0.81 versus r = 0.67), after which the authors mentioned: ‘The coefficients of correlation were even better […] in children than in adults.’ However, the range of observations in children was larger than the range of observations in adults, which in itself could explain the higher correlation coefficient observed in children. One can thus not simply conclude that the Cockcroft–Gault formula for eGFR correlates better with inulin clearance in children than in adults. Because the range of observations influences the correlation coefficient, it is important to realize that correlation coefficients cannot be readily compared between groups or studies. Another consequence of this is that researchers could artificially inflate the correlation coefficient by including additional low and high eGFR values.

The non-causality of correlation

Another important pitfall of the correlation coefficient is that it cannot be interpreted as causal. It is of course possible that there is a causal effect of one variable on the other, but there may also be other possible explanations that the correlation coefficient does not take into account. Take for example the phenomenon of confounding. We can study the association of prescribing angiotensin-converting enzyme (ACE)-inhibitors with a decline in kidney function. These two variables would be highly correlated, which may be due to the underlying factor albuminuria. A patient with albuminuria is more likely to receive ACE-inhibitors, but is also more likely to have a decline in kidney function. So ACE-inhibitors and a decline in kidney function are correlated not because of ACE-inhibitors causing a decline in kidney function, but because they have a shared underlying cause (also known as common cause) [ 7 ]. More reasons why associations may be biased exist, which are explained elsewhere [ 8 , 9 ].

It is however possible to adjust for such confounding effects, for example by using multivariable regression. Whereas a univariable (or ‘crude’) linear regression analysis is no different than calculating the correlation coefficient, a multivariable regression analysis allows one to adjust for possible confounder variables. Other factors need to be taken into account to estimate causal effects, but these are beyond the scope of this paper.

Agreement between methods

We have discussed the correlation coefficient and its limitations when studying the association between two variables. However, the correlation coefficient is also often incorrectly used to study the agreement between two methods that aim to estimate the same variable. Again, also here, the correlation coefficient is an invalid measure.

The correlation coefficient aims to represent to what degree a straight line fits the data. This is not the same as agreement between methods (i.e. whether X  =  Y ). If methods completely agree, all observations would fall on the line of equality (i.e. the line on which the observations would be situated if X and Y had equal values). Yet the correlation coefficient looks at the best-fitted straight line through the data, which is not per se the line of equality. As a result, any method that would consistently measure a twice as large value as the other method would still correlate perfectly with the other method. This is shown in Figure 3 , where the dashed line shows the line of equality, and the other lines portray different linear associations, all with perfect correlation, but no agreement between X and Y . These linear associations may portray a systematic difference, better known as bias, in one of the methods.

Figure 3: A set of linear associations, with the dashed line (- - -) showing the line of equality where X = Y. The equations and correlations for the other lines are shown as well, which shows that only a linear association is needed for r = 1, and not specifically agreement.

This limitation applies to all comparisons of methods, where it is studied whether methods can be used interchangeably, and it also applies to situations where two individuals measure a value and where the results are then compared (inter-observer variation or agreement; here the individuals can be seen as the ‘methods’), and to situations where it is studied whether one method measures consistently at two different time points (also known as repeatability). Fortunately, other methods exist to compare methods [ 10 , 11 ], of which one was proposed by Bland and Altman themselves [ 12 ].

Intraclass coefficient

One valid method to assess interchangeability is the intraclass coefficient (ICC), which is a generalization of Cohen’s κ , a measure for the assessment of intra- and interobserver agreement. The ICC shows the proportion of the variability in the new method that is due to the normal variability between individuals. The measure takes into account both the correlation and the systematic difference (i.e. bias), which makes it a measure of both the consistency and agreement of two methods. Nonetheless, like the correlation coefficient, it is influenced by the range of observations. However, an important advantage of the ICC is that it allows comparison between multiple variables or observers. Similar to the ICC is the concordance correlation coefficient (CCC), though it has been stated that the CCC yields values similar to the ICC [ 13 ]. Nonetheless, the CCC may also be found in the literature [ 14 ].

The 95% limits of agreement and the Bland–Altman plot

When they published their critique on the use of the correlation coefficient for the measurement of agreement, Bland and Altman also published an alternative method to measure agreement, which they called the limits of agreement (also referred to as a Bland–Altman plot) [12]. To illustrate the method of the limits of agreement, an artificial dataset was created using the MASS package (version 7.3-53) for R version 4.0.4 (R Foundation for Statistical Computing, Vienna, Austria). Two sets of observations (two observations per person) were derived from a normal distribution with a mean (µ) of 120 and a randomly chosen standard deviation (σ) between 5 and 15. The mean of 120 was chosen so that the values resemble measurements of high eGFR, where the first set of observed eGFRs was hypothetically acquired using the MDRD formula, and the second set of observed eGFRs was hypothetically acquired using the CKD-EPI formula. The observations can be found in Table 1.

Table 1: Artificial data portraying hypothetically observed MDRD measurements and CKD-EPI measurements.

The 95% limits of agreement can be easily calculated using the mean of the differences (d̄) and the standard deviation (SD) of the differences. The upper limit (UL) of the limits of agreement is UL = d̄ + 1.96 × SD and the lower limit (LL) is LL = d̄ − 1.96 × SD. If we apply this to the data from Table 1, we find d̄ = 0.32 and SD = 4.09. Subsequently, UL = 0.32 + 1.96 × 4.09 = 8.34 and LL = 0.32 − 1.96 × 4.09 = −7.70. Our limits of agreement are thus −7.70 to 8.34. We can now decide whether these limits of agreement are too broad. Imagine we decide that, to replace the MDRD formula with the CKD-EPI formula, the difference may not be larger than 7 mL/min/1.73 m². Thus, on the basis of these (hypothetical) data, the MDRD and CKD-EPI formulas cannot be used interchangeably in our case. It should also be noted that, as the limits of agreement are statistical parameters, they are also subject to uncertainty. The uncertainty can be determined by calculating 95% confidence intervals for the limits of agreement, on which Bland and Altman elaborate in their paper [12].
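For illustration, the limits of agreement are straightforward to compute in code; the Python sketch below uses invented paired measurements rather than the values in Table 1.

```python
# A sketch of computing Bland-Altman 95% limits of agreement in Python.
# The two arrays stand in for paired MDRD and CKD-EPI eGFR measurements;
# the values are illustrative, not the ones from Table 1.
import numpy as np

rng = np.random.default_rng(11)
mdrd = rng.normal(loc=120, scale=10, size=30)
ckd_epi = mdrd + rng.normal(loc=0.3, scale=4, size=30)   # second method

diff = mdrd - ckd_epi
mean_diff = diff.mean()
sd_diff = diff.std(ddof=1)

upper = mean_diff + 1.96 * sd_diff
lower = mean_diff - 1.96 * sd_diff
print(f"Limits of agreement: {lower:.2f} to {upper:.2f}")
```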

The limits of agreement are also subject to two assumptions: (i) the mean and SD of the differences should be constant over the range of observations and (ii) the differences are approximately normally distributed. To check these assumptions, two plots were proposed: the Bland–Altman plot, which is the differences plotted against the means of their measurements, and a histogram of the differences. If in the Bland–Altman plot the means and SDs of the differences appear to be equal along the x -axis, the first assumption is met. The histogram of the differences should follow the pattern of a normal distribution. We checked these assumptions by creating a Bland–Altman plot in Figure 4A and a histogram of the differences in Figure 4B . As often done, we also added the limits of agreement to the Bland–Altman plot, between which approximately 95% of datapoints are expected to be. In Figure 4A , we see that the mean of the differences appears to be equal along the x -axis; i.e., these datapoints could plausibly fit the horizontal line of the total mean across the whole x -axis. Nonetheless, the SD does not appear to be distributed equally: the means of the differences at the lower values of the x -axis are closer to the total mean (thus a lower SD) than the means of the differences at the middle values of the x -axis (thus a higher SD). Therefore, the first assumption is not met. Nonetheless, the second assumption is met, because our differences follow a normal distribution, as shown in Figure 4B . Our failure to meet the first assumption can be due to a number of reasons, for which Bland and Altman also proposed solutions [ 15 ]. For example, data may be skewed. However, in that case, log-transforming variables may be a solution [ 16 ].


Plots to check the assumptions for the limits of agreement. ( A ) The Bland–Altman plot for the assumption that the mean and SD of the differences are constant over the range of observations. In our case, the mean of the differences appears to be constant along the x-axis; i.e., these data points could plausibly fit the horizontal line of the overall mean across the whole x-axis. However, the SD does not appear to be constant: the differences at the lower values of the x-axis lie closer to the overall mean (thus a lower SD) than the differences at the middle values of the x-axis (thus a higher SD). Therefore, the first assumption is not met. The limits of agreement and the mean are added as dashed (- - -) lines. ( B ) A histogram of the distribution of the differences to check the assumption that the differences are normally distributed. In our case, the differences follow a normal distribution and thus the assumption is met.

It is often mistakenly thought that the Bland–Altman plot alone constitutes the analysis of agreement between methods, but the authors themselves spoke strongly against this [15]. We suggest that authors both report the limits of agreement and show the Bland–Altman plot, to allow readers to assess for themselves whether they consider the agreement acceptable.

The correlation coefficient is easy to calculate and provides a measure of the strength of linear association in the data. However, it also has important limitations and pitfalls, both when studying the association between two variables and when studying agreement between methods. These limitations and pitfalls should be taken into account when using and interpreting it. If necessary, researchers should look into alternatives to the correlation coefficient, such as regression analysis for causal research, and the ICC and the limits of agreement combined with a Bland–Altman plot when comparing methods.

CONFLICT OF INTEREST STATEMENT

None declared.

Contributor Information

Roemer J Janse, Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, The Netherlands.

Tiny Hoekstra, Department of Nephrology, Amsterdam Cardiovascular Sciences, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands.

Kitty J Jager, ERA-EDTA Registry, Department of Medical Informatics, Amsterdam Public Health Research Institute, Amsterdam UMC, University of Amsterdam, Amsterdam, The Netherlands.

Carmine Zoccali, CNR-IFC, Center of Clinical Physiology, Clinical Epidemiology of Renal Diseases and Hypertension, Reggio Calabria, Italy.

Giovanni Tripepi, CNR-IFC, Center of Clinical Physiology, Clinical Epidemiology of Renal Diseases and Hypertension, Reggio Calabria, Italy.

Friedo W Dekker, Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, The Netherlands.

Merel van Diepen, Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, The Netherlands.

How to construct regression models for observational studies (and how NOT to do it!)

Affiliations.

  • 1 Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada. [email protected].
  • 2 Applied Health Research Centre (AHRC), Li Ka Shing Knowledge Institute of St. Michael's Hospital, Toronto, ON, Canada. [email protected].
  • PMID: 28236060
  • DOI: 10.1007/s12630-017-0833-0


REVIEW article

Regression discontinuity design for the study of health effects of exposures acting early in life.

Maja Popovic

  • 1 Cancer Epidemiology Unit, Department of Medical Sciences, University of Turin and CPO-Piemonte, Turin, Italy
  • 2 MRC Integrative Epidemiology Unit at the University of Bristol, University of Bristol, Bristol, United Kingdom
  • 3 Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, United Kingdom

Regression discontinuity design (RDD) is a quasi-experimental approach to studying the causal effect of an exposure on later outcomes by exploiting the discontinuity in the exposure probability at an assignment variable cut-off. With the intent of facilitating the use of RDD in Developmental Origins of Health and Disease (DOHaD) research, we describe the main aspects of the study design and review the studies, assignment variables and exposures that have been investigated to identify short- and long-term health effects of early life exposures. We also provide a brief overview of some of the methodological considerations for RDD identification using an example of a DOHaD study. An increasing number of studies investigating the effects of early life environmental stressors on health outcomes use RDD, mostly in the context of education, social and welfare policies, healthcare organization and insurance, and clinical management. Age and calendar time are the most commonly used assignment variables to study the effects of various early life policies and programs, shock events and guidelines. Maternal and newborn characteristics, such as age, birth weight and gestational age, are frequently used assignment variables to study the effects of the type of neonatal care, health insurance, and newborn benefits, while socioeconomic measures have been used to study the effects of social and welfare programs. RDD has advantages, including intuitive interpretation and transparent and simple graphical representation. It provides valid causal estimates if the assumptions, which are relatively weak compared with other non-experimental study designs, are met. Its use to study health effects of exposures acting early in life has so far been limited to studies based on registries and administrative databases, while birth cohort data have not yet been exploited using this design. The local nature of the causal effect around the cut-off, the difficulty of reaching high statistical power compared with other study designs, and the rarity of settings outside of policy and program evaluations hamper the widespread use of RDD in DOHaD research. Still, the assignment variable cut-offs for exposures applied in previous studies can be used, if appropriate, in other settings and with additional outcomes to address different research questions.

1 Introduction

The Developmental Origins of Health and Disease (DOHaD) has been consolidated as a concept asserting the causal effects of early life environmental stressors on health outcomes and identifying critical windows for prevention of later diseases ( 1 ). While early DOHaD research was mostly based on administrative data and registries, an increasing number of birth cohort studies have provided extensive datasets, covering information across multiple life stages, for current and future studies. Even with a growing number of data sources and the widespread availability of data, the main methodological challenges in DOHaD research and life course epidemiology remain the selection of appropriate study design for complex research questions, managing multiple relationships of biological and contextual variables, dealing with repeated measures over time, and, especially, controlling for confounders to mitigate residual confounding ( 2 ).

The issue of uncontrolled confounding, i.e., the violation of the exchangeability assumption, is probably the main obstacle to causal inference within the context of non-experimental studies, and several analytical and design approaches have been developed to control for confounding and obtain a potentially unbiased estimate of the exposure effect. Here, we focus on regression discontinuity design (RDD), a quasi-experimental approach that has been widely applied in the context of natural experiments ( 3 , 4 ) and is becoming more common also in the DOHaD and life course epidemiology literature ( 4 – 6 ). With the intent of facilitating the use of RDD in these contexts, we describe the main aspects of the study design and review the studies, assignment variables and exposures that have been used to identify health effects of early life exposures. Finally, we provide a brief overview of some of the methodological considerations for RDD identification and design validity checks, using an example of a study on the health effects of an early life exposure.

2 Regression discontinuity design – basic concepts and frameworks

RDD, introduced in the 1960s by Thistlethwaite and Campbell ( 7 ), is a quasi-experimental design that shares similarities with randomized controlled trials but lacks the completely random assignment to the intervention (intervention, treatment, or exposure, hereafter referred to as exposure in general). It typically implies that whoever imposes a certain policy, program, or clinical decision controls the assignment to the exposure using an a priori decided criterion (e.g., an eligibility rule or a clinical decision-making guideline). The exposure assignment in RDD studies is thus based on the cut-off value of an assignment variable (also referred to in the literature as the “forcing,” “rating” or the “running” variable) that creates a discontinuity in the probability of the exposure at the cut-off point. The assignment variable can be any continuous or discrete variable that individuals cannot manipulate to systematically place themselves above or below the cut-off. The less individuals are able to control their own value of the assignment variable, the more valid the design is for identifying causal effects. The exogeneity of the cut-off value in the assignment variable implies that individuals just below the cut-off are on average similar in all observed and unobserved baseline characteristics to those just above the cut-off except for the exposure of interest, i.e., they are exchangeable. The exposure groups are only exchangeable very close to the cut-off, rendering the validity of the design plausible only for relatively narrow windows around the assignment variable cut-off. If the assumption of exchangeability holds, any difference in the outcome (or in its probability function in the case of binary outcomes) on the two sides of the assignment variable cut-off will be caused by the exposure. The magnitude of the discontinuity in the outcome at the cut-off thus represents the average effect of the assignment rule around the cut-off point.

In summary, the RDD draws on a continuous or discrete pre-exposure variable with a clearly defined cut-off value for the exposure assignment that cannot be manipulated by the individuals. The cut-off refers to a specific exposure that can be studied in relationship with multiple outcomes. This is appealing for DOHaD research as many exposures acting at critical time windows early in life often have multiple short- and long-term health effects, which offers the opportunity to study different research questions using the same RDD setting.

There are two main conceptual and inference frameworks in RDD: the continuity-based approach and the local randomization approach ( 8 – 11 ). The continuity-based framework assumes the continuity of average potential outcomes near the cut-off, and it typically uses polynomial methods to approximate the regression functions on the two sides of the cut-off (a polynomial of the observed outcome on the assignment variable) ( 9 , 10 , 12 – 14 ). In other words, it aims at estimating the difference between the two average potential outcomes at the cut-off; since this difference cannot be observed, it assumes that the average potential outcomes change continuously and in parallel around the cut-off. According to the local randomization framework, instead, RDD is seen as a randomized experiment near the cut-off, which assumes random assignment in a narrow window around the cut-off with the assignment variable being unrelated to the average potential outcomes ( 11 , 15 – 17 ). Perfect exchangeability is thus assumed within the narrow window, with the aim of estimating the difference between the two average potential outcomes in that window. The difference between the two frameworks is well explained, for example, in a recent tutorial by Cattaneo et al. ( 18 ), whose Figure 1 provides a clear graphical representation of the assumed behavior of the average potential outcomes in the two frameworks. Local randomization imposes stronger assumptions than the continuity framework and is generally used when the sample size around the cut-off is small or when the continuity framework cannot be applied because the assignment variable is discrete.
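To make the continuity-based logic concrete, the sketch below (not taken from the cited tutorial) simulates a sharp design with a known jump of 2 at a cut-off of 0 and estimates the discontinuity by fitting separate linear trends on either side of the cut-off within a hypothetical bandwidth.

```r
# Illustrative continuity-based sharp RDD on simulated data (all values hypothetical).
set.seed(2)
n <- 1000
x <- runif(n, -1, 1)                        # assignment variable, cut-off at 0
d <- as.numeric(x >= 0)                     # sharp exposure assignment
y <- 1 + 0.5 * x + 2 * d + rnorm(n)         # outcome with a true jump of 2 at the cut-off

h   <- 0.25                                 # hypothetical bandwidth around the cut-off
fit <- lm(y ~ d * x, subset = abs(x) <= h)  # separate intercepts and slopes on each side
coef(fit)["d"]                              # estimated discontinuity (should be close to 2)
```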


Figure 2 . Summary of the settings (A) and assignment variables (B) used in RDD studies on health effects of early life exposures ( N  = 125). The Sankey diagram in (C) depicts the connections between assignment variables and the study settings in which they were used.

3 RDD applications for the study of health effects of exposures acting early in life

3.1 Studies

Three review articles have evaluated the application of RDD in health research ( 4 – 6 ), and one recent tutorial provided guidance on RDD analysis with empirical examples from medical research ( 18 ). The most recent and the only systematic review ( 4 ), which searched several economic, social, and medical databases for articles published up to 2019, identified 325 studies using RDD in the context of health research. The authors showed an increasing popularity of this design, with most studies applied in the context of specific policies, social programs, health insurance, and education. The review summarizes the most commonly used assignment variables (e.g., age, date, socio-economic, clinical, and environmental measures) and the cut-off rules (program eligibility, legislation cut-offs, dates of sudden events, and clinical decision-making rules) ( 4 ).

From the systematic review ( 4 ) and by updating the search until January 1, 2024, we identified RDD studies on health effects of exposures acting in fetal life, infancy, childhood, or adolescence, with the aim of assessing the potential for promoting the use of RDD in DOHaD research. The identification of studies, the search strategy, and the selected studies are detailed in Supplementary material S1 (Supplementary Methods, Identification of studies focused on health effects of exposures acting early in life, Supplementary Tables S1–S3 ). Overall, we identified 125 RDD studies on health effects of early life exposures ( Figure 2 ).


Figure 1 . Flow diagram for the selection of RDD studies on the health effects of exposures acting early in life.

Figure 2 shows the distribution of settings and assignment variables used in the identified RDD studies. Sixty percent of the identified studies (75/125) were conducted in the context of education (mostly based on educational reforms on schooling initiation and duration), social and welfare programs and policies (e.g., conditional cash transfers, child supplements, and parental leaves), and healthcare organization and insurance schemes. About 15% of the studies evaluated research questions related to clinical settings and patient care ( N  = 19). Interestingly, almost one third of the RDD studies on health effects of early life exposures published since 2019 focused on this setting, which was quite neglected in the past publications (6.8% of the studies published until 2019 and 27.5% of the studies published thereafter, Supplementary Figure S1 ). The number of RDD studies based on shock events, social and environmental factors has also increased substantially since 2019 ( Supplementary Figure S1 ), probably also due to the increasing number of studies focused on the recent COVID-19 pandemic ( 19 – 21 ).

Almost 60% of the RDD studies on health effects of early life exposures used time and/or age as an assignment variable ( Figure 2B ). These assignment variables were also applied in almost all the studies conducted within the educational setting and settings based on shock events, social and environmental factors ( Figure 2C ). Frequently used assignment variables in previous studies also include welfare- and income-based individual and population measures and perinatal characteristics, mostly birthweight and gestational age. The latter were largely used within the clinical obstetrician/neonatal management and healthcare insurance settings.

As PubMed is the most extensively used database and search engine in DOHaD research, we examined, among the studies identified in the previous systematic review ( 4 ), the proportion that is indexed in PubMed. More than half of the studies in each of the settings, except the clinical setting, are not available in PubMed ( Supplementary Figure S1 ). Similarly, some of the assignment variables, such as geographical position or distance, environmental, population structure, and school-based measures, were used exclusively in studies not retrievable through a PubMed search ( Supplementary Figure S2 ). Many of these articles, although focused on health effects of early life exposures, may therefore escape the attention of DOHaD researchers.

Most of the identified RDD studies were setting-specific, evaluating the effects of particular programs, policies, and sudden events, and are thus difficult to implement or replicate in different contexts and populations. However, some previous applications used data that are typically collected in population registries and birth cohorts and may serve as motivating examples for future studies. Table 1 summarizes some of the assignment variables used for the identification of discontinuities in studies on health effects of early life exposures that could be replicated or extended in future DOHaD research.


Table 1 . Assignment variables and exposures as possible RDD models for DOHaD research.

3.2 Methodological considerations

3.2.1 Assignment rule condition

In RDD, the exposure is determined by the assignment rule either completely (deterministically) or partially (probabilistically). When the assignment rule perfectly determines the exposure (a jump from 0 to 1 at the cut-off), the regression discontinuity takes a sharp design. This means that all individuals above the cut-off are assigned to the exposure and are exposed, while all those below the cut-off are assigned to the unexposed group, with no crossovers. If the assignment rule affects the probability of exposure, creating a discontinuous change at the threshold without an extreme 0-to-1 jump, the regression discontinuity takes a fuzzy design. In this setting, there are exposed and unexposed individuals both above and below the cut-off, but the probability of being exposed jumps discontinuously at the cut-off ( 3 , 8 ). Most applications of RDD to the health effects of exposures acting early in life have used a fuzzy design.

An example of sharp and fuzzy RDD is illustrated in Figure 3 using simulated data motivated by the study of Daysal et al. ( 22 ). The study investigated the effect of obstetrician supervision of deliveries on short-term infant health outcomes using a national rule of 37 gestational weeks (259 days) at delivery for obstetrician instead of midwife delivery supervision. The data simulation is detailed in Supplementary material S1 (Methods). As shown in Figure 3A , a sharp design would imply that all deliveries before 37 gestational weeks were supervised by obstetricians and all those of at least 37 weeks were supervised by midwives only. A fuzzy design, instead, would look like Figure 3B , in which the cut-off point decreases the probability of obstetrician supervision but does not completely determine it. This happens, for example, when some deliveries after 37 gestational weeks are under the care of an obstetrician for reasons other than prematurity, such as complications during delivery or slow delivery progression.


Figure 3 . Hypothetical sharp and fuzzy regression discontinuity design. (A) Sharp (deterministic) regression discontinuity design. (B) Fuzzy (probabilistic) regression discontinuity design. Simulated data.
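A rough flavour of the difference between the two designs can be conveyed with simulated data loosely motivated by this example; the probabilities and sample size below are invented for illustration and are not those used for Figure 3.

```r
# Hypothetical sharp vs. fuzzy assignment around a 259-day gestational age cut-off.
set.seed(3)
gest_age <- sample(230:290, 2000, replace = TRUE)     # gestational age at delivery (days)

# Sharp design: supervision fully determined by the cut-off
p_sharp <- ifelse(gest_age < 259, 1, 0)

# Fuzzy design: the cut-off shifts the probability without fully determining it
p_fuzzy <- ifelse(gest_age < 259, 0.9, 0.3)           # invented probabilities
obstetrician <- rbinom(length(gest_age), 1, p_fuzzy)  # observed exposure (fuzzy case)

# Observed probability of obstetrician supervision on either side of the cut-off
tapply(obstetrician, gest_age >= 259, mean)
```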

Several modifications of the two general RDD settings have been proposed in the literature, for example the kink design, where the assignment variable cut-off determines a change in the first derivative of the exposure probability ( 23 ), assignment variables with multiple cut-offs ( 24 ), RDD with multiple assignment variables ( 25 , 26 ), and designs that exploit calendar time as the assignment variable (RDD in time) ( 27 ). The latter is an increasingly popular application of RDD that uses time as the assignment variable, with an exposure date as the cut-off ( Figure 1 ). RDD in time is closely related to other time-series designs, such as interrupted time series or pre–post analyses. The choice between RDD in time and other time-series methods depends mainly on the context, on whether the exposure evolves over time, on the type of data collected, and on the number of observations near the cut-off. If there are enough observations near the cut-off, the exposure does not change over time, and individual-level data are collected, the RDD assumptions may be highly plausible ( 27 ).

Whether the RDD has a sharp or a fuzzy design has implications for the assumptions summarized in Table 2 . As the validity of RDD relies on these key underlying assumptions, Table 2 also summarizes their potential to be verified empirically. Using the simulated example described above, we briefly describe these assumptions and the sensitivity analyses and falsification tests that can provide empirical evidence when some of them may be violated.


Table 2 . The main assumptions of RDD.

3.2.2 The main assumptions of RDD

3.2.2.1 Relevance assumption

The assignment rule can be assessed empirically by plotting the relationship between the exposure and the assignment variable. Returning to the example illustrated in Figure 3 , a discontinuous change in the probability of obstetrician vs. midwife delivery supervision at the gestational age cut-off of 259 days suggests the possibility of using RDD, as the assignment variable at the cut-off should cause the exposure. For example, in all the identified RDD studies that used gestational age as the assignment variable for obstetrician vs. midwife delivery supervision (37 weeks cut-off) ( 22 , 28 ), antenatal corticosteroid administration ( 29 , 30 ), or probiotic supplementation (34 weeks cut-off) ( 31 ), the relevance assumption was assessed by plotting and/or estimating the relationship between the gestational age cut-off and the exposure of interest. The relevance assumption in RDD is analogous to the same assumption in an instrumental variable (IV) setting. As with an IV, the stronger the relationship between the instrument and the exposure variable, that is, the larger the discontinuity at the cut-off, the more efficient and less prone to weak-instrument bias the RDD estimates are. While in a sharp RDD the graphical representation is usually enough, a fuzzy RDD with small discontinuities may require formal tests, such as F-statistics. As a rule of thumb, the instrument is considered weak if the F-statistic is less than 10 ( 32 ).
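As a simplified sketch (continuing the fuzzy simulation above and ignoring the trend in the assignment variable), the first-stage relationship and its F-statistic can be inspected as follows.

```r
# First-stage relevance check: regress the exposure on an indicator for being
# below the 259-day cut-off and inspect the F-statistic (rule of thumb: F > 10).
below <- as.numeric(gest_age < 259)
first_stage <- lm(obstetrician ~ below)
summary(first_stage)$fstatistic
```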

In addition to the exposure discontinuity at the cut-off, a graphical presentation of the relationship between the exposure and the assignment variable allows the examination of discontinuities in the exposure at locations other than the cut-off. While the absence of any such additional discontinuities is not a necessary condition to validate the RDD, their existence might indicate other exposures that could confound the estimate of the causal effect of the main exposure.

3.2.2.2 Exogeneity assumption and the lack of manipulation in the assignment variable

The assignment variable should not only be a strong determinant of the exposure; the cut-off value should also be exogenous and there should be no manipulation of the assignment variable by individuals. The exogeneity assumption implies that, conditional on the assignment variable, there are no other systematic differences between those below and above the assignment variable cut-off. In other words, any differences in outcomes between the exposure and control groups can be attributed solely to the exposure itself, rather than to any other confounding variables that might be correlated with both the assignment variable and the outcome. This assumption can be partly verified by checking whether the observed baseline characteristics have a similar distribution above and below the cut-off (see below). The exogeneity assumption also implies that individuals cannot influence whether they are placed above or below the assignment variable cut-off. In practice, when there is a benefit in receiving an exposure, manipulation of the assignment variable occurs when the exposure assignment rule is public knowledge and individuals who would just barely fail to qualify for a desired exposure manage to cross the cut-off, leaving few individuals just on the other side of the cut-off. For instance, in an RDD studying the impact of a program that offers financial aid to students who score above a certain grade threshold on an exam, students might, by studying harder, attempt to manipulate their scores around the cut-off to ensure they qualify for the aid. Thus, it is crucial to have a deep understanding of the data generation process underlying the assignment rule. There is empirical evidence of manipulation when the distribution of the assignment variable shows a discontinuity at the cut-off. This can be checked visually using a density plot of the assignment variable, as shown in Supplementary Figure S3 for our simulated example and as applied in four out of five identified studies that used gestational age as the assignment variable ( 22 , 28 , 30 , 31 ). Although previous studies found no evidence of manipulation near the 37 gestational week cut-off ( 22 , 28 ), induction of labor for medical reasons at 37 weeks is not an uncommon practice. Manipulation of the assignment variable at the cut-off can be formally tested using the McCrary density test ( 33 ), which tests the null hypothesis that the marginal density of the assignment variable is continuous around the cut-off.
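A simple visual version of this check in the simulated example is a histogram (or density plot) of the assignment variable: no jump in the height of the bars should be visible at 259 days. The formal McCrary-type test is implemented in dedicated RDD packages and is not shown here.

```r
# Visual manipulation check: the density of the assignment variable should not
# jump at the cut-off (here 259 days); simulated data from the sketch above.
hist(gest_age, breaks = seq(229.5, 290.5, by = 1),
     xlab = "Gestational age (days)", main = "Distribution of the assignment variable")
abline(v = 259, lty = 2)   # cut-off
```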

3.2.2.3 Exclusion restriction assumption

The exclusion restriction assumption is characteristic of the IV design, and it requires that the assignment rule affects the outcome exclusively through its effect on the exposure. The underlying assignment process in RDD must be known a priori , and alternative hypotheses must be excluded by providing evidence (often only theoretical) that the same assignment variable cut-off value is not used to assign individuals to other exposures that could affect the outcome. In the context of studies assessing health effects of early life exposures, this assumption may be violated if the same assignment variable cut-off, such as the widely used 2,500 g birth weight cut-off for low birth weight babies or the 37-week gestational age cut-off for preterm birth, is used to determine several clinical decisions, such as extra neonatal care, admission to a neonatal intensive care unit, specific treatments or welfare benefits. If more than one of these exposures is likely to influence the outcome of interest, it will not be possible to attribute the causal effect estimated using the RDD approach to a single exposure. Since the exclusion restriction assumption is untestable, it is important to provide theoretical evidence that the assignment variable cut-off is used uniquely to determine the exposure of interest. It can also be checked whether other exposures in question show a discontinuity at the cut-off. For example, Bommer et al. ( 31 ) used 34 completed weeks of gestation as the cut-off for routine probiotic supplementation for neonates, and checked for discontinuities in alternative treatments, such as antibiotics, analgesics, and several other treatments. Similarly, Daysal et al. ( 28 ), in the study comparing obstetrician vs. midwife delivery supervision at the cut-off of 37 gestational weeks, checked for additional discontinuities in the use of vacuum/forceps during delivery, NICU admission, and hospital vs. home delivery. As shown in Table 1 , in many DOHaD research contexts justifying this assumption may be challenging.

3.2.2.4 Exchangeability around the assignment variable cut-off

The exchangeability assumption, in the RDD setting also called the continuity assumption, implies that individuals just above and below the cut-off are similar with respect to the distribution of observed and unobserved factors, except for the exposure, and thus have the same potential outcomes for either exposure level ( 9 ). Although exchangeability cannot be tested, it is possible to check whether the observed baseline characteristics have a similar distribution above and below the cut-off. A graphical inspection involves a series of simple plots of the relationship between the observed baseline covariates not affected by the exposure and the assignment variable. For example, in our previously described simulated example of the effect of delivery supervision on infant health outcomes, we can examine the distribution of the observed maternal baseline characteristics around 259 days of gestation ( Figure 4 ). In this hypothetical example, the observed discontinuity in maternal age and in the probability of gestational hypertension at the cut-off indicates an imbalance of predetermined covariates that may threaten the validity of the design. All five previous studies that used gestational age as the assignment variable performed similar checks on the pre-exposure maternal and pregnancy characteristics ( 22 , 28 – 31 ). It is also advisable to perform formal tests, for example using nonparametric local polynomial techniques within the continuity framework ( 9 , 10 , 12 – 14 ). Both the graphical inspections and the formal tests should be interpreted with caution when there are several relevant observed covariates to assess, as some discontinuities may be observed by chance alone.


Figure 4 . Graphical representation of RDD for predetermined covariates for a hypothetical example of gestational age as an assignment variable for obstetrician’s instead of midwife’s supervision of delivery. Simulated data. (A) Maternal body mass index (BMI), (B) Maternal age at delivery, (C) Gestational hypertension, (D) Low maternal educational level.
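As an illustration of such a balance check (continuing the simulated example, with a hypothetical maternal age generated independently of the cut-off, unlike the imbalanced covariates in Figure 4), the local discontinuity in a predetermined covariate can be estimated in the same way as for the outcome; its estimate should be close to zero.

```r
# Covariate balance check: a predetermined covariate should show no jump at the cut-off.
maternal_age <- rnorm(length(gest_age), mean = 30, sd = 5)   # hypothetical covariate
below <- as.numeric(gest_age < 259)
near  <- abs(gest_age - 259) <= 14                           # +/- 2-week window

bal <- lm(maternal_age ~ below * I(gest_age - 259), subset = near)
summary(bal)$coefficients["below", ]                         # jump estimate, SE, t, p
```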

In situations with imbalances at the cut-off in the distribution of the baseline covariates, which are likely to be important determinants of the outcome, RDD fails to provide a valid estimate of the true effect of an exposure of interest. Although covariate adjustment can be incorporated in the RDD estimations, it cannot be used to improve the validity of the design but only to enhance the efficiency of the local polynomial RD estimator ( 34 , 35 ).

3.2.2.5 Local monotonicity assumption

The fuzzy RDD estimates the local average treatment effect (LATE) ( 36 ), which is the average treatment effect for the compliers. Its identification requires that an additional monotonicity, or “no defiers,” assumption ( 37 ) is met. This assumption, which is characteristic of the IV design, implies a monotonic relationship between the variable indicating the assignment and the exposure. Because it is untestable, the plausibility of this assumption should be assessed using knowledge of the context and the observed data patterns ( 37 ). However, this was rarely done in previous RDD studies on health effects of early life exposures.

3.2.3 Sensitivity analyses and diagnostic checks

The sensitivity of the results to small variations in data and estimation procedure can be verified with several additional, strongly advised, sensitivity analyses and checks that are briefly summarized below.

3.2.3.1 Discontinuities in average outcomes at values other than the assignment variable cut-off

The RDD analysis consists of visually depicting and estimating a discontinuity in the outcomes of interest on the two sides of the assignment variable cut-off ( Supplementary Figure S4 ). One of the RDD robustness checks is the comparison of the effects estimated at the true cut-off and at artificial (placebo) cut-offs in the assignment variable. Any discontinuity at an artificially imposed cut-off is an indication of a potentially invalid RDD. This can be verified empirically by replacing the true cut-off value with different values of the assignment variable at which the exposure should not change, and by repeating both the graphical and the estimation analysis, as presented in two articles by Daysal et al. ( 22 , 28 ).
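Continuing the sharp simulation from Section 2, a placebo check might look like the sketch below: at artificial cut-offs away from the true one, the estimated jump should be close to zero (all bandwidths and cut-off values here are arbitrary).

```r
# Placebo cut-off check on the simulated sharp example (true cut-off at 0, true jump 2).
placebo_cutoffs <- c(-0.5, -0.25, 0.25, 0.5)
placebo_jumps <- sapply(placebo_cutoffs, function(c0) {
  d0  <- as.numeric(x >= c0)
  fit <- lm(y ~ d0 * I(x - c0), subset = abs(x - c0) <= 0.2)  # window stays on one side of 0
  unname(coef(fit)["d0"])
})
setNames(round(placebo_jumps, 2), placebo_cutoffs)
```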

3.2.3.2 Sensitivity to the selection of the window around the assignment variable cut-off

The most frequently used RDD estimation methods are non-parametric or local methods that consider only observations in a selected window around the cut-off. The optimal bandwidth size can be selected either a priori or by data-driven algorithms ( 38 ). In practice, the bandwidth size depends on data availability around the cut-off. Ideally, one would like to use a very narrow window around the cut-off, but this comes at the cost of less precise estimates ( 39 ). Sensitivity analysis with alternative specifications of the bandwidth size to check the robustness of the estimated effects is standard practice in RDD ( 22 , 28 – 31 ).
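In the same simulated example, bandwidth sensitivity can be examined by re-estimating the discontinuity at the true cut-off for several (arbitrary) bandwidths; a stable estimate across bandwidths is reassuring.

```r
# Bandwidth sensitivity on the simulated sharp example (true jump of 2 at x = 0).
bandwidths <- c(0.1, 0.25, 0.5, 1)
jumps <- sapply(bandwidths, function(h) {
  fit <- lm(y ~ d * x, subset = abs(x) <= h)
  unname(coef(fit)["d"])
})
setNames(round(jumps, 2), bandwidths)
```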

3.2.3.3 Sensitivity to observations near the cut-off

Even if there is no evidence of manipulation of the assignment variable, the observations very near the cut-off are likely to be the most influential when fitting local polynomials. The “donut hole” sensitivity method consists of repeating the analysis on different subsamples in which observations within a symmetric distance of the cut-off are removed, starting with the closest observations and then increasing the distance, in order to understand the sensitivity of the results to those observations ( 40 , 41 ). A donut-hole sensitivity analysis was presented in two studies comparing obstetrician vs. midwife delivery supervision at the cut-off of 37 gestational weeks ( 22 , 28 ).
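A donut-hole check on the same simulated data simply drops observations within an increasing (arbitrary) radius of the cut-off before re-estimating the jump.

```r
# "Donut hole" sensitivity on the simulated sharp example: remove observations
# within radius r of the cut-off (bandwidth fixed at 0.5) and re-estimate the jump.
holes <- c(0, 0.02, 0.05, 0.1)
jumps_donut <- sapply(holes, function(r) {
  keep <- abs(x) <= 0.5 & abs(x) > r
  fit  <- lm(y ~ d * x, subset = keep)
  unname(coef(fit)["d"])
})
setNames(round(jumps_donut, 2), holes)
```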

4 Advantages and limitations of RDD in the context of DOHaD research

RDD has several advantages over other non-experimental study designs. Its estimates and validity checks can be easily presented using simple graphical representations that improve transparency and integrity of the results. The interpretation of the results is intuitive and straightforward. RDD can provide valid causal estimates under weaker assumptions compared to other non-experimental study designs, and many of the assumptions can be assessed empirically. When an assignment variable for an exposure is found, it is possible to identify a (local) causal effect of that exposure for multiple outcomes and, if the assignment variable is not context-specific, in multiple populations.

Most of the previous RDD studies on the health effects of exposures acting early in life were conducted on data from registries and administrative databases, which often lack important details and individual-level data. With a recent exception ( 42 ), the existing birth cohorts, which collect a wealth of detailed data on pregnancy outcomes and on newborn, infant, and later childhood health outcomes and represent a unique and valuable source of data for DOHaD research, have not been exploited using RDD. There are several reasons for this. The external validity of RDD studies is often considered the main caveat, as the causal effect estimate is limited to the subpopulation of individuals at the assignment variable cut-off. In a sharp RDD the treatment effect is interpreted as the average treatment effect at the cut-off, and only in some RDD settings and with an additional conditional independence assumption (conditioning on other predictors of the outcome besides the assignment variable) can it be generalized and approximated to the average treatment effect ( 43 ). The LATE obtained in a fuzzy RDD is even less generalizable because it applies only to the subpopulation of compliers at the cut-off. In addition, RDD often addresses very setting-specific research questions (e.g., the effect of country-specific policies) that cannot always be replicated in different populations. The utility of RDD also depends on the practical and clinical relevance of the cut-off being studied.

Estimation in RDD requires adequate power for estimating the regression line on both sides of the cut-off, i.e., many observations near the cut-off. While in the context of registry-based research this may be feasible, the existing birth cohorts, although rich in individual-level data, often lack a sufficient sample size. The exact power will depend on the distribution of the assignment variable, the bandwidth chosen for the analysis, and whether the design is sharp or fuzzy (a fuzzy design needs more power). If we assume that the assignment variable is normally distributed, the percentage of the original birth cohort available for RDD for different assignment variable cut-offs and bandwidths (expressed in terms of standard deviations [SD] from the mean and +/− SDs, respectively) is shown in Table 3 . For example, if the cut-off is at 0.5 SD from the mean of the assignment variable and the bandwidth is +/− 0.5 SD, then only 34% of the original cohort is used for the analysis.


Table 3 . Percentage of the original study available for RDD for different bandwidths and cut-offs, assuming normally distributed assignment variable X .
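The kind of calculation behind Table 3 can be sketched directly from the normal distribution function: with the cut-off and bandwidth expressed in SD units, the share of the cohort falling within one bandwidth of the cut-off is the probability mass of a standard normal variable in that interval.

```r
# Share of a cohort within +/- one bandwidth of the cut-off, assuming a standard
# normal assignment variable (cut-off and bandwidth in SD units), as in Table 3.
coverage <- function(cutoff_sd, bandwidth_sd) {
  pnorm(cutoff_sd + bandwidth_sd) - pnorm(cutoff_sd - bandwidth_sd)
}
round(coverage(0.5, 0.5), 2)   # 0.34, i.e. about 34% of the original cohort
```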

Finally, despite an increasing popularity of RDD in studying health effects of early life exposures, settings other than policy and program evaluations are still relatively rare. Still, the assignment variables’ cut-offs for exposures applied in previous studies can be used, if appropriate, in other settings and with additional outcomes to address different research questions.

Although providing guidelines and recommendations on how to apply RDD in the context of DOHaD research is beyond our scope, we underline some key points that need to be considered when planning an RDD study in this context. First, researchers should look for large databases, drawing on administrative data or international cohort collaborations, and check that there are enough events among subjects around the cut-off. This is a crucial step, also considering the current international difficulties in data access and data sharing. For example, to check the feasibility of an RDD study aiming at, say, estimating the effect of an exposure on a specific outcome using gestational age as the assignment variable within the context of an international birth cohort collaboration, a researcher would first need to ask all participating cohorts to report the number of events in the children born in the week before and the week after the cut-off. A second key point concerns the assignment variables. As reviewed in this article, a number of assignment variables, cut-offs and related exposures and/or interventions have been identified and used in previous RDD studies focused on health effects of early life exposures. These discontinuities have often been employed in multiple studies and with different databases, repeatedly validating the robustness of the RDD settings and verifying the underlying assumptions. We suggest drawing on previous experience to exploit already identified discontinuities, if they are not too setting-specific, for the study of multiple outcomes. The identification of new discontinuities is more complex and may even be considered a separate research objective. Finally, as one of the main strengths of the RDD approach is the possibility of assessing several of the underlying assumptions empirically, it is important to verify that the data required to conduct the corresponding sensitivity analyses are available in the study database and/or are included in the plans to collect or harmonize new data.

5 Conclusion

The regression discontinuity design is a powerful approach for causal inference in DOHaD research. Its widespread use in studying health effects of early life exposures has been hampered by the limited external validity of RDD studies and the rarity of settings outside of program and policy evaluation. The identification of discontinuities and RDD principles should be introduced to researchers who should exploit the utilities of this design in the existing population registries and birth cohorts whenever the setting and research question allow.

Author contributions

MP: Conceptualization, Formal analysis, Methodology, Writing – original draft, Writing – review & editing. DZ: Methodology, Supervision, Writing – review & editing. KT: Methodology, Writing – review & editing. LR: Conceptualization, Funding acquisition, Methodology, Supervision, Writing – review & editing.

Funding

The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This work was funded by the European Union’s Horizon 2020 research and innovation program (grant agreement no. 733206 LifeCycle). This manuscript reflects only the authors’ view, and the Commission is not responsible for any use that may be made of the information it contains.

Acknowledgments

The authors are grateful to Luigi Gagliardi, Ghislaine Scelo, Bianca De Stavola and Mario Pagliero for their stimulating discussions and valuable inputs.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpubh.2024.1377456/full#supplementary-material

1. Gluckman, PD, Hanson, MA, and Buklijas, T. A conceptual framework for the developmental origins of health and disease. J Dev Orig Health Dis . (2010) 1:6–18. doi: 10.1017/S20401744099901712


2. de Stavola, BL, Nitsch, D, dos Santos Silva, I, McCormack, V, Hardy, R, Mann, V, et al. Statistical issues in life course epidemiology. Am J Epidemiol . (2006) 163:84–96. doi: 10.1093/aje/kwj003

3. Lee, DS, and Lemieux, T. Regression discontinuity designs in economics. J Econ Lit . (2010) 48:281–355. doi: 10.1257/jel.48.2.281

4. Hilton Boon, M, Craig, P, Thomson, H, Campbell, M, and Moore, L. Regression discontinuity designs in health: a systematic review. Epidemiology . (2021) 32:87–93. doi: 10.1097/EDE.0000000000001274


5. Venkataramani, AS, Bor, J, and Jena, AB. Regression discontinuity designs in healthcare research. BMJ . (2016) 352:i1216. doi: 10.1136/bmj.i1216

6. Moscoe, E, Bor, J, and Bärnighausen, T. Regression discontinuity designs are underutilized in medicine, epidemiology, and public health: a review of current and best practice. J Clin Epidemiol . (2015) 68:132–43. doi: 10.1016/j.jclinepi.2014.06.021

7. Thistlethwaite, DL, and Campbell, DT. Regression-discontinuity analysis: an alternative to the ex post facto experiment. J Educ Psychol . (1960) 51:309–17. doi: 10.1037/h0044319

8. Imbens, GW, and Lemieux, T. Regression discontinuity designs: a guide to practice. J Econom . (2008) 142:615–35. doi: 10.1016/j.jeconom.2007.05.001

9. Hahn, J, Todd, P, and Klaauw, W. Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica . (2001) 69:201–9. doi: 10.1111/1468-0262.00183

10. Cattaneo, MD, Idrobo, N, and Titiunik, R. A practical introduction to regression discontinuity designs: foundations . Cambridge, United Kingdom: Cambridge University Press (2020).


11. Cattaneo, MD, Idrobo, N, and Titiunik, R. A Practical Introduction to Regression Discontinuity Designs: Extensions. Cambridge Elements: Quantitative and Computational Methods for Social Science, Cambridge University Press, to appear. (2023). Preliminary draft Available at: https://mdcattaneo.github.io/books/Cattaneo-Idrobo-Titiunik_2023_CUP.pdf

12. Calonico, S, Cattaneo, MD, Arbor, A, and Titiunik, R. Robust data-driven inference in the regression-discontinuity design. Stata J . (2014) 14:909–46. doi: 10.1177/1536867x1401400413

13. Calonico, S, Cattaneo, MD, and Titiunik, R. Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica . (2014) 82:2295–326. doi: 10.3982/ECTA11757

14. Calonico, S, Cattaneo, MD, Arbor, A, Farrell, MH, and Titiunik, R. Rdrobust: software for regression-discontinuity designs. Stata J . (2017) 17:372–404. doi: 10.1177/1536867X1701700208

15. Cattaneo, MD, Frandsen, BR, and Titiunik, R. Randomization inference in the regression discontinuity design: an application to party advantages in the U.S. Senate. J Causal Inference . (2014) 3:1–24. doi: 10.1515/jci-2013-0010

16. Cattaneo, MD, Titiunik, R, and Vazquez-Bare, G. Comparing inference approaches for RD designs: a reexamination of the effect of head start on child mortality. J Policy Anal Manage . (2017) 36:643–81. doi: 10.1002/pam.21985

17. Cattaneo, MD, Arbor, A, Titiunik, R, and Vazquez-Bare, G. Inference in regression discontinuity designs under local randomization. Stata J . (2016) 16:331–67. doi: 10.1177/1536867X160160020

18. Cattaneo, MD, Keele, L, and Titiunik, R. A guide to regression discontinuity designs in medical applications. Stat Med . (2023) 42:4484–513. doi: 10.1002/sim.9861

19. Been, J, Burgos Ochoa, L, Bertens, LCM, Schoenmakers, S, Steegers, EAP, and Reiss, IKM. Impact of COVID-19 mitigation measures on the incidence of preterm birth: a national quasi-experimental study. Lancet Public Health. (2020) 5:e604–11. doi: 10.1016/S2468-2667(20)30223-1

20. Bakolis, I, Stewart, R, Baldwin, D, Beenstock, J, Bibby, P, Broadbent, M, et al. Changes in daily mental health service use and mortality at the commencement and lifting of COVID-19 “lockdown” policy in 10 UK sites: a regression discontinuity in time design. BMJ Open . (2021) 11:e049721. doi: 10.1136/bmjopen-2021-049721

21. Takaku, R, and Yokoyama, I. What the COVID-19 school closure left in its wake: evidence from a regression discontinuity analysis in Japan. J Public Econ . (2021) 195:104364. doi: 10.1016/j.jpubeco.2020.104364

22. Daysal, NM, Trandafir, M, and van Ewijk, RJ. Returns to Childbirth Technologies: Evidence from Preterm Births, IZA Discussion Papers, No. 7834. Institute for the Study of Labor (IZA), Bonn. (2013).

23. Card, D, Lee, DS, Pei, Z, and Weber, A. Inference on causal effects in a generalized regression kink design. Econometrica . (2015) 83:2453–83. doi: 10.3982/ECTA11224

24. Cattaneo, MD, Keele, L, Titiunik, R, and Vazquez-Bare, G. Interpreting regression discontinuity designs with multiple cutoffs. J Polit . (2016) 78:1229–48. doi: 10.1086/686802

25. Papay, JP, Willett, JB, and Murnane, RJ. Extending the regression-discontinuity approach to multiple assignment variables. J Econom . (2011) 161:203–7. doi: 10.1016/j.jeconom.2010.12.008

26. Wong, VC, Steiner, PM, and Cook, TD. Analyzing regression-discontinuity designs with multiple assignment variables. J Educ Behav Stat . (2013) 38:107–41. doi: 10.3102/1076998611432172

27. Hausman, C, and Rapson, DS. Regression discontinuity in time: considerations for empirical applications. Annu Rev Resour Econ . (2018) 10:533–52. doi: 10.1146/annurev-resource-121517-033306

28. Daysal, NM, Trandafir, M, and van Ewijk, R. Low-risk isn’t no-risk: perinatal treatments and the health of low-income newborns. J Health Econ . (2019) 64:55–67. doi: 10.1016/j.jhealeco.2019.01.006

29. Hutcheon, JA, Harper, S, Liauw, J, Skoll, MA, Srour, M, and Strumpf, EC. Antenatal corticosteroid administration and early school age child development: a regression discontinuity study in British Columbia, Canada. PLoS Med . (2020) 17:e1003435. doi: 10.1371/journal.pmed.1003435

30. Hutcheon, JA, Strumpf, EC, Liauw, J, Skoll, MA, Socha, P, Srour, M, et al. Antenatal corticosteroid administration and attention-deficit/hyperactivity disorder in childhood: a regression discontinuity study. CMAJ . (2022) 194:E235–41. doi: 10.1503/cmaj.211491

31. Bommer, C, Horn, S, and Vollmer, S. The effect of routine probiotics supplementation on preterm newborn health: a regression discontinuity analysis. Am J Clin Nutr . (2020) 112:1219–27. doi: 10.1093/ajcn/nqaa196

32. Staiger, D, and Stock, JH. Instrumental variables regression with weak instruments. Econometrica . (1997) 65:557–86. doi: 10.2307/2171753

33. Mccrary, J. Manipulation of the running variable in the regression discontinuity design: a density test. J Econom . (2008) 142:698–714. doi: 10.1016/j.jeconom.2007.05.005

34. Cattaneo, MD, Keele, L, and Titiunik, R. Covariate adjustment in regression discontinuity designs In: JR Zubizarreta, EA Stuart, DS Small, and PR Rosenbaum, editors. Handbook of Matching and Weighting Adjustments for Causal Inference . 1st ed: New York, NY: Chapman and Hall/CRC (2023)

35. Calonico, S, Cattaneo, MD, Farrell, MH, and Titiunik, R. Regression discontinuity designs using covariates. Rev Econ Stat . (2019) 101:442–51. doi: 10.1162/rest_a_00760

36. Imbens, GW, and Angrist, JD. Identification and estimation of local average treatment effects. Econometrica . (1994) 62:467–75. doi: 10.2307/2951620

37. Edwards, B, Fiorini, M, Stevens, K, and Taylor, M. Is monotonicity in an IV and RD design testable? No, but you can still check on it. Working Papers 2013–06. University of Sydney, School of Economics. (2013). Available at: http://hdl.handle.net/2123/9020 (Accessed August 3, 2023).

38. Imbens, G, and Kalyanaraman, K. Optimal bandwidth choice for the regression discontinuity estimator. Rev Econ Stud . (2012) 79:933–59. doi: 10.1093/restud/rdr043

39. Schochet, PZ. Statistical power for regression discontinuity designs in education evaluations. J Educ Behav Stat . (2009) 34:238–66. doi: 10.3102/1076998609332

40. Barreca, AI, Guldi, M, Lindo, JM, and Waddell, GR. Saving babies? Revisiting the effect of very low birth weight classification. Q J Econ . (2011) 126:2117–23. doi: 10.1093/qje/qjr042

41. Barreca, AI, Lindo, JM, and Waddell, GR. Heaping-induced Bias in regression-discontinuity designs. Econ Inq . (2016) 54:268–93. doi: 10.1111/ecin.12225

42. Broughton, T, Langley, K, Tilling, K, and Collishaw, S. Relative age in the school year and risk of mental health problems in childhood, adolescence and young adulthood. J Child Psychol Psychiatry . (2023) 64:185–96. doi: 10.1111/jcpp.13684

43. Angrist, JD, and Rokkanen, M. Wanna get away? Regression discontinuity estimation of exam school effects away from the cutoff. J Am Stat Assoc . (2015) 110:1331–44. doi: 10.1080/01621459.2015.1012259

Keywords: regression discontinuity, epidemiology, review, DOHaD, RDD, early life exposures, health effects

Citation: Popovic M, Zugna D, Tilling K and Richiardi L (2024) Regression discontinuity design for the study of health effects of exposures acting early in life. Front. Public Health . 12:1377456. doi: 10.3389/fpubh.2024.1377456

Received: 27 January 2024; Accepted: 08 April 2024; Published: 19 April 2024.


Copyright © 2024 Popovic, Zugna, Tilling and Richiardi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Maja Popovic, [email protected]



  • Open access
  • Published: 22 April 2024

Optimization of wear parameters for ECAP-processed ZK30 alloy using response surface and machine learning approaches: a comparative study

  • Mahmoud Shaban 1 , 2 ,
  • Fahad Nasser Alsunaydih 1 ,
  • Hanan Kouta 3 ,
  • Samar El-Sanabary 3 ,
  • Abdulrahman Alrumayh 4 ,
  • Abdulrahman I. Alateyah 4 ,
  • Majed O. Alawad 5 ,
  • Waleed H. El-Garaihy 4 , 6 &
  • Yasmine El-Taybany 3  

Scientific Reports volume 14, Article number: 9233 (2024)


The present research applies different statistical analysis and machine learning (ML) approaches to predict and optimize the effect of the processing parameters on the wear behavior of ZK30 alloy processed through the equal channel angular pressing (ECAP) technique. Firstly, the ECAPed ZK30 billets have been examined in the as-annealed (AA) condition and after 1 pass and 4 passes of route Bc (4Bc). Then, the wear output responses in terms of volume loss (VL) and coefficient of friction (COF) have been experimentally investigated by varying load pressure (P) and speed (V) using design of experiments (DOE). In the second step, statistical analysis of variance (ANOVA), 3D response surface plots, and ML have been employed to predict the output responses. Subsequently, genetic algorithm (GA), hybrid DOE–GA, and multi-objective genetic algorithm techniques have been used to optimize the input variables. The experimental results of the ECAP process reveal a significant reduction in the average grain size of 92.7% after processing through 4Bc compared with the AA counterpart. Furthermore, 4Bc exhibited a significant improvement in VL of 99.8% compared with the AA counterpart. Both the regression and ML prediction models establish a significant correlation between the projected and the actual data, indicating that the experimental and predicted values agreed exceptionally well. The minimal VL at the different ECAP passes was obtained at the highest wear-test conditions. Also, the minimal COF for all ECAP passes was obtained at the maximum wear load; however, the optimal speed in the wear process for minimum COF decreased with the number of billet passes. The validation of the predicted ML models and the VL regression under different wear conditions showed an accuracy range of 70–99.7%.


Introduction

Magnesium (Mg) has demonstrated an impressive role in a wide range of engineering sectors due to its unique properties. Mg is the lightest structural metal, with a density of only about two-thirds that of aluminum; therefore, it has countless applications in cases where weight reduction is essential, e.g., in the automotive, aerospace, and structural industries 1 , 2 , 3 . In addition, Mg has a high strength-to-weight ratio, high damping capacity, and good machinability. Furthermore, its remarkable biological and mechanical properties make Mg a promising biodegradable material that has increasingly emerged in recent biomedical applications, including orthopedic implants and cardiovascular stents 4 , 5 . Mg exhibits mechanical properties similar to those of human bone, such as density and elastic modulus, and fully degrades in the human body, so no additional surgery is needed for implant removal after the healing of bone tissue. Moreover, Mg shows excellent compatibility with bone cells and does not pose a toxicity risk 6 , 7 .

Despite all the aforementioned merits, the high corrosion rate remains Mg’s major inherent limitation, which is a significant barrier to further biological applications. For instance, Mg corrodes rapidly in the chloride medium inside the human body, leading to fast degradation, mechanical support distortion, and, consequently, failure of the Mg implant before the healing process. Additionally, the corrosion process releases some toxic elements and produces hydrogen gas bubbles that accumulate in the body, causing damage to the implant sites 8 , 9 . Hence, improving Mg’s mechanical and biological performance is a challenging endeavor that has gained a lot of interest from the scientific and medical communities. From this perspective, many attempts have been made to find effective methods for producing biomaterials with the required properties, high biological safety, and reliable performance to develop new biomedical applications. According to the published articles, enhancing the mechanical and corrosion behavior of Mg used for biomedical applications, where friction and wear are involved, can be accomplished by applying either metallurgical or surface modification techniques 10 , 11 , 12 .

Surface microstructural modification is achieved through mechanical processing, specifically utilizing severe plastic deformation (SPD) techniques such as high-pressure torsion (HPT), multi-directional forging (MDF), and equal channel angular pressing (ECAP) 13. Notably, ECAP surpasses other SPD methods in effectiveness as it produces a homogeneous ultrafine-grained (UFG) structure. This enhancement in microstructure contributes to improved mechanical performance and wear and corrosion resistance without compromising the biological response 14,15. Alateyah et al. demonstrated that processing pure Mg through 2 passes of ECAP led to a significantly refined ultrafine structure 1. Sahoo et al. 16 demonstrated a 70% grain refinement in Mg-RZ5 alloy and a 12% increase in strength and 16% improvement in hardness through four passes of hot ECAP. El-Garaihy et al. experimentally investigated the effect of different ECAP parameters on the performance of ZK30 Mg alloy 7. Additionally, prediction models using a machine learning approach were created to estimate the ECAP parameters, validating the experimental optimum results 17. The impact of varying ECAP processing parameters on the mechanical and electrical behaviors of pure copper (Cu) was studied numerically, experimentally, and statistically. Using ECAP dies with angles of 90° and 120°, processing routes (A, Bc, and C), at room temperature, 100 °C, and 200 °C up to 6 passes, the results demonstrated that processing through 6 passes of route Bc using a die angle of ~110° at ~190 °C was the optimum condition, significantly improving grain size, hardness, and ductility 18,19. In contrast, Vaughan et al. 20 reported that ECAP induced grain refinement in Mg-ZKQX6000 alloy but deteriorated its corrosion resistance. In addition, the significance of the ultrafine structure produced through ECAP for wear characteristics is rarely discussed, highlighting the need to emphasize the importance of adapting the processing parameters.

Prior studies employed various statistical techniques, such as response surface methodology (RSM), genetic algorithms (GA), hybrid design of experiments and GA, and multi-objective genetic algorithms, to optimize the ECAP process. Daryadel 21 verified a finite element simulation of the ECAP process of AA7075 with a copper casing against thirty-one tests created by RSM to investigate the ECAP process parameters. According to the ANOVA analysis, the ECAP angle was the most effective input parameter, as it was anticipated to have the greatest effect on the response. Alateyah et al. 22 optimized the ECAP parameters of pure Mg using RSM, ANOVA, GA, and RSM-GA. They reported that the most significant improvements in grain refinement and Vickers microhardness were obtained via ECAP processing using a die with ɸ = 90° through 4 passes of route Bc.

Machine learning (ML) is a form of artificial intelligence focused on creating algorithms that enable computers to learn and make predictions without explicit programming 23. ML involves constructing systems capable of analyzing and spotting patterns in data to make informed decisions. ML algorithms learn from historical data, using statistical approaches to recognize patterns, connections, and trends 24. There are various ML approaches, such as supervised, unsupervised, semi-supervised, and reinforcement learning. Supervised learning, the most common, analyzes labeled training data to identify trends and make forecasts based on historical data 25. Unsupervised learning, which discovers structure in unlabeled data, involves preparing data, model setup, feature extraction, algorithm selection, training, validation, and testing, and often works best in combination with supervised learning. It uses the training dataset to identify effective model parameters, revealing previously unrecognized relationships.

In an ideal world, data from neither the testing nor the training stages would be used to adjust the hyperparameters 26. Overfitting is a prevalent problem during model training, in which the model matches the training dataset too closely when no regularization method is considered 27. In such cases, the trained model seldom performs well during testing validation. When dealing with a small dataset, like the one employed in this study, cross-validation (CV) is utilized to address overfitting issues. The k-fold CV technique divides the training data into several independent subsets, or "folds". In each iteration, one fold is held out to evaluate the model while the remaining folds are used for training. This procedure is repeated k times, and the model's performance is determined by averaging the evaluation scores across the iterations. Although computationally costly, this strategy makes efficient use of the available data, especially when working with small datasets 28. From this point of view, controlling the ECAP processing parameters is crucial as they directly influence the microstructural, mechanical, and wear behavior. Based on the above literature, the main aim of the current work is to predict and optimize the effect of the ECAP parameters on the wear behavior of ZK30 alloy using statistical analysis and machine learning approaches.
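As an illustration of the k-fold procedure described above, the minimal Python sketch below runs 5-fold cross-validation on a small synthetic dataset; the feature names (load P, speed V), the stand-in volume-loss values, and the choice of a plain linear model are assumptions for demonstration only, not the study's actual pipeline.

```python
# Minimal sketch of k-fold cross-validation on a small dataset; the synthetic features
# (load P, speed V) and volume-loss values stand in for the study's measurements.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
P = rng.uniform(1, 5, 30)            # applied load, N (synthetic)
V = rng.uniform(64.5, 250, 30)       # sliding speed, mm/s (synthetic)
y = 1e-6 / P + 1e-9 * V + rng.normal(0, 5e-8, 30)   # stand-in volume loss, m^3

X = np.column_stack([P, V])
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Each fold is held out once for scoring while the remaining folds train the model;
# averaging the k scores gives the cross-validated estimate of performance.
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print("per-fold R2:", np.round(scores, 3), "mean R2:", round(scores.mean(), 3))
```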

Experimental specifics and methodology

Materials and experimental procedures.

In this study, Mg-3Zn-0.6Zr-0.4Mn (wt%) ZK30 alloy billets were machined to 20 mm diameter and 60 mm length. The ZK30 billets were annealed before ECAP processing for 16 h at 430 °C, followed by furnace cooling. ECAP processing was conducted using a die with an internal channel angle of Φ = 90° and curvature angle of Ψ = 20° (Fig.  1 ). To regulate temperature, the dies were wrapped with a heating element and insulated using a ceramic fiber layer. Temperature measurements were conducted with K-type thermocouples. To ensure uniform temperature distribution during extrusion, monitoring was performed before and throughout the process, revealing a temperature variation of only 3 °C along the inlet channel; ECAP was performed at 250 °C. Prior to extrusion, the samples remained in the die for 15 min to attain a steady-state processing temperature. A universal testing machine (Shimadzu 100kNXplus) applied the pressing load and controlled the speed, with a constant ram speed of 1 mm/s for all experiments. The ZK30 billets were processed through a single pass (1P) and four passes of route Bc (4Bc), with the sample rotated 90° between subsequent passes.

figure 1

Schematic of ECAP process.

Preparing samples for metallographic analysis involved standard mechanical grinding and polishing procedures for both the as-annealed (AA) and ECAP-processed samples. The ZK30 billets were sectioned and mounted in conductive epoxy. The ZK30 samples were ground incrementally using silicon-carbide sandpaper (600/800/1000/1200 grit); the samples were washed with water and dried with alcohol before switching to a higher grit. The samples were then polished using diamond suspensions with particle sizes of 6 μm followed by 1 μm. A final polishing step with a 0.05 μm colloidal silica suspension was then conducted. Between polishing rounds, the specimens were ultrasonically cleaned in ethanol for 10 min. The ZK30 samples were etched using a solution comprising 6 g picric acid, 5 mL acetic acid, 100 mL ethanol, and 10 mL deionized water. For electron backscattered diffraction (EBSD), the longitudinal plane was interfaced with a scanning electron microscope (SEM) to acquire grain size and grain orientation distribution maps for all ZK30 samples. These data were subsequently processed through the HKL Flamenco Channel 5 software program (Hitachi, Ltd., Tokyo, Japan). To ensure robust data acquisition for meaningful statistical analysis, the SEM operated at 15 kV and 1.5 nA, with a 100 nm step size from the extrusion direction (ED) surface during EBSD. Furthermore, X-ray diffraction (XRD; 6100 Shimadzu) equipped with a Cu Kα radiation source with a wavelength (λ) of 1.5418 Å was used to analyze the phase structure over a scanning range of 20° to 80°.

The wear behavior of the ECAPed ZK30 billets was investigated using a ball-on-flat apparatus (Tribolab, Bruker's universal mechanical tester, USA). The wear behavior of the ZK30 alloy was studied under three different applied loads (1, 3, and 5 N), selected based on previous studies and the material response. Because the sample diameter was 20 mm and several tests were performed on each surface, the ball stroke was limited to a maximum of 10 mm. This short stroke constrains the rotary drive of the Tribolab machine, which converts rotation into the reciprocating motion of the sample, and therefore limits the maximum attainable sliding speed. To that end, three different speeds (64.5, 125, and 250 mm/s) were selected to examine the effect of wear speed on both the coefficient of friction and the volume loss. Furthermore, the ZK30 samples were tested for 110, 210, and 410 s. Since the wear test parameters vary in force and speed, the test time was selected to maintain the same sliding distance for all conditions: 410 s at 64.5 mm/s, 210 s at 125 mm/s, and 110 s at 250 mm/s, corresponding to a sliding distance of roughly 26–27.5 m in each case. All ZK30 samples were ground and polished to a mirror-like finish before performing the wear tests. The volume loss and coefficient of friction were measured and analyzed, and the average values were calculated for all the wear parameters of the ECAPed ZK30 samples.

Statistical analysis of variance (ANOVA)

In the current investigation, analysis of variance (ANOVA) was employed to analyze the experimental data and determine which of the input parameters (P and V) had the most significant impact on the output responses (VL and COF). The Design Expert software was used for the statistical analysis. An overview of the ANOVA results is provided in Table 1 . At a 95% confidence level, the adjusted R 2 , predicted R 2 , p value, adequate precision, and F-value are reported. All the responses had p values less than 0.05 and F-values larger than 4, suggesting that the predicted models were adequate and that the independent parameters, individual model coefficients, and interaction terms significantly influenced the obtained responses. For the AA, 1P, and 4Bc conditions, velocity significantly affected VL and COF, followed by pressure. To evaluate the validity of the models, the signal-to-noise ratio (S/N), reported as adequate precision, was estimated; it should be greater than four. Because the adequate precision of the obtained responses was greater than four, indicating a sufficient signal, the models can be used to navigate the design space.
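The ANOVA itself was performed in Design-Expert; purely as a hypothetical Python analogue, the sketch below fits a quadratic response-surface model and tabulates F-values and p-values. The synthetic response values and column names are placeholders, not the measured data.

```python
# Hypothetical sketch of a quadratic-model ANOVA analogous to the Design-Expert analysis;
# the response values below are synthetic placeholders, not the measured VL data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
P, V = np.meshgrid([1.0, 3.0, 5.0], [64.5, 125.0, 250.0])
df = pd.DataFrame({"P": np.tile(P.ravel(), 3), "V": np.tile(V.ravel(), 3)})
df["VL"] = 1e-6 / df["P"] + 1e-9 * df["V"] + rng.normal(0, 2e-8, len(df))  # stand-in volume loss

# Quadratic response-surface model: main effects, interaction, and squared terms
model = ols("VL ~ P + V + P:V + I(P**2) + I(V**2)", data=df).fit()

print(sm.stats.anova_lm(model, typ=2))              # F-values and p-values per term
print("Adjusted R^2:", round(model.rsquared_adj, 3))
```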

Machine learning (ML) approach

In order to predict the properties of the ECAPed ZK30 alloy, precise predictive machine learning models were created. The basic methodologies for constructing these models were linear regression (LR), random forest (RF), Gaussian process regression (GPR), support vector machine for regression (SVR), and gradient boosting (GBoost) algorithms 29,30,31,32,33. The combination of these ML models holds excellent potential for accurately predicting the ECAP-related responses, showing significant promise. These algorithms discover traits, correlations, and patterns in the data being studied. Some of these techniques are discussed in the following subsections.

Linear regression (LR)

Linear regression is a simple ML technique that seeks to predict the connection between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. The aim is to select the best-fitting line that minimizes the difference between anticipated and actual dependent variable values. The linear regression equation is given by:

y = b₀ + b₁x₁ + b₂x₂ + … + bₙxₙ
where y is the dependent variable, x₁, x₂, …, xₙ are the independent variables, b₀ is the intercept, and b₁, b₂, …, bₙ are coefficients that represent the relationship between the independent variables and the dependent variable. The model is trained by estimating the best coefficient values using a method known as ordinary least squares, which minimizes the sum of squared discrepancies between predicted and actual values. Once trained, the model may be used to make predictions by substituting values for the independent variables. In some cases the model overfits the data, so a method called regularized linear regression (RLR) is commonly used to provide a regulated model 27. It seeks a linear connection between the input and target variables while minimizing the sum of squared errors plus a penalty term. By adding a regularization parameter multiplied by the L1 norm (Lasso regularization) or L2 norm (Ridge regularization) of the regression coefficients, the penalty term helps to control the model's complexity. In Ridge regularization, the loss function (L) can be computed using:

L = (y − Xb)′(y − Xb) + λ b′b
where y is the vector of observed dependent variable values, X is the matrix of independent variables, b is the vector of coefficients, λ is the regularization parameter, (y − Xb)′ is the transpose of the difference between the observed and predicted values, and b′b is the squared L2 norm of the coefficient vector.
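To make the Ridge-regularized loss above concrete, the following sketch compares the closed-form solution b = (XᵀX + λI)⁻¹Xᵀy with scikit-learn's Ridge estimator on synthetic data; the data and the value of λ are illustrative assumptions.

```python
# Hypothetical illustration of the Ridge (L2-regularized) loss above; the data are synthetic.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))                     # two standardized independent variables
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.1, 40)

lam = 1.0                                        # regularization parameter (lambda)

# Closed-form minimizer of (y - Xb)'(y - Xb) + lambda * b'b (no intercept term here)
b_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Equivalent scikit-learn estimator (alpha plays the role of lambda)
ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y)

print(b_closed, ridge.coef_)                     # the two solutions coincide
```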

Support vector machines (SVM)

SVM is a supervised learning method used for classification and regression problems. SVM seeks an optimal hyperplane that best divides data points from distinct classes or predicts the target variable. In the case of binary classification, SVM selects the hyperplane with the most significant margin between the two classes' closest data points 34. The linear SVM characteristic equations are as follows:

f(x) = w·x + b

where w represents the weight vector, x is the input vector, and b is the bias term. The optimized values of w and b can be acquired by minimizing the following term:

(1/2)‖w‖² + C Σᵢ (ξᵢ + ξᵢ*)

with the following constraints:

yᵢ − (w·xᵢ + b) ≤ ε + ξᵢ,  (w·xᵢ + b) − yᵢ ≤ ε + ξᵢ*,  ξᵢ, ξᵢ* ≥ 0

where ξᵢ and ξᵢ* are slack variables, ε indicates the error tolerance, and C is a compromise between the empirical error and the generalization term.
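A minimal sketch of the ε-SVR formulation above, using scikit-learn's SVR; the synthetic data, kernel choice, and the C and ε values are assumptions rather than the settings used in the study.

```python
# Sketch of epsilon-SVR as formulated above; data and hyperparameters are illustrative only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = np.column_stack([rng.uniform(1, 5, 50), rng.uniform(64.5, 250, 50)])   # [P, V], synthetic
y = 0.4 - 0.02 * X[:, 0] - 1e-4 * X[:, 1] + rng.normal(0, 0.005, 50)       # stand-in COF values

# C trades off empirical error against flatness; epsilon sets the error-tolerance tube.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.01))
model.fit(X, y)

print("train R2:", round(model.score(X, y), 3))
```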

Gradient boosting (GBoost)

GBoost is an ensemble approach for creating a robust predictive model by combining numerous weak predictive models (usually decision trees). It constructs the model iteratively, with each successive model focusing on correcting the errors made by earlier models. The final prediction is derived by adding the weak models' predictions, weighted by a learning rate. The GBoost approach optimizes a loss function (e.g., mean squared error) by repeatedly fitting weak models to the loss function's negative gradient 33. The goal of GBoost is to create an approximation of the underlying function F*(x) that translates instances x to their associated output values y, denoted as F(x). This goal is accomplished by utilizing a training dataset to minimize the expected value of a specified loss function. The fundamental functions can be represented by models such as decision trees, and the prediction is updated as follows:

yᵢ₊₁ = yᵢ + α Fᵢ(x)
where y i represents the prediction at iteration i , α is the learning rate, which is a hyperparameter controlling the contribution of each weak model, and F i ( x ) is the weak model’s prediction at iteration i .
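The iterative update above can be illustrated with scikit-learn's GradientBoostingRegressor, where learning_rate plays the role of α; the data and hyperparameters below are illustrative assumptions, not the study's settings.

```python
# Sketch of gradient boosting with a learning rate (alpha in the update above); synthetic data only.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = np.column_stack([rng.uniform(1, 5, 60), rng.uniform(64.5, 250, 60)])   # [P, V]
y = 1e-6 / X[:, 0] + 1e-9 * X[:, 1] + rng.normal(0, 2e-8, 60)              # stand-in volume loss

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Each of the 300 shallow trees fits the negative gradient of the squared-error loss;
# learning_rate scales the contribution of every new weak model.
gb = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=2, random_state=0)
gb.fit(X_tr, y_tr)

print("test R2:", round(gb.score(X_te, y_te), 3))
```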

Random forest (RF)

RF is an ensemble learning approach that builds many decision trees and combines their forecasts to generate a final prediction. Each decision tree is constructed using a random selection of the training data's features and samples. The final prediction is generated by aggregating all individual tree forecasts (e.g., a majority vote for classification or the average for regression) 31. The random forest prediction equation is:

ŷ = mode(y₁, y₂, …, yₙ)
where y i represents the prediction of each individual tree, and mode returns the most frequent prediction for classification or the average for regression.
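A short sketch of the aggregation idea using scikit-learn's RandomForestRegressor (regression case, so tree predictions are averaged); the synthetic data and hyperparameters are assumptions.

```python
# Sketch of a random-forest regressor averaging many randomized trees; data are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = np.column_stack([rng.uniform(1, 5, 60), rng.uniform(64.5, 250, 60)])   # [P, V]
y = 0.4 - 0.02 * X[:, 0] - 1e-4 * X[:, 1] + rng.normal(0, 0.005, 60)       # stand-in COF

# Each tree sees a bootstrap sample and a random feature subset; the forest
# prediction is the average of the individual tree predictions (regression case).
rf = RandomForestRegressor(n_estimators=200, max_features=1, random_state=0)
rf.fit(X, y)

print("train R2:", round(rf.score(X, y), 3))
print("single-point prediction:", rf.predict([[3.0, 125.0]]))
```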

Gaussian process regression (GPR)

GPR is a non-parametric probabilistic regression approach that models the connection between input and target variables. It considers predictions as a Gaussian process, with a mean and a covariance function (kernel) defining the range of probable functions. GPR generates a posterior distribution over the predicted functions, allowing for the assessment of uncertainty 32 . A GP is often defined by its mean function, m ( x ), and covariance function (also known as the kernel function), k ( x , x' ), where x and x' are two occurrences inside the input features matrix x . As a result, the expected y* values may be described as a Gaussian process function as follows:

y* ∼ GP(m(x), k(x, x′))
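As a sketch of the GP formulation above, the example below fits a GaussianProcessRegressor with an RBF kernel and returns both the posterior mean and its standard deviation; the kernel choice and data are assumptions, not the configuration used in the study.

```python
# Sketch of Gaussian process regression with an RBF kernel; kernel choice and data are assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

rng = np.random.default_rng(5)
X = np.column_stack([rng.uniform(1, 5, 40), rng.uniform(64.5, 250, 40)])   # [P, V]
y = 0.4 - 0.02 * X[:, 0] - 1e-4 * X[:, 1] + rng.normal(0, 0.005, 40)       # stand-in COF

# Mean function m(x) is taken as zero (after normalization); k(x, x') is a scaled RBF plus noise.
kernel = ConstantKernel(1.0) * RBF(length_scale=[1.0, 100.0]) + WhiteKernel(noise_level=1e-4)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0).fit(X, y)

mean, std = gpr.predict([[3.0, 125.0]], return_std=True)   # posterior mean and uncertainty
print(mean, std)
```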
Overfitting and underfitting are common challenges encountered in machine learning modeling. These problems can be treated as follows. Feature selection involves choosing the most relevant characteristics to reduce model complexity, which helps to keep the model from fitting noise in the data. Adding a regularization term, such as L1 (Lasso) or L2 (Ridge) regularization, to the loss function penalizes the model's complexity, deterring overfitting. Cross-validation evaluates model performance on different data subsets; if the model's performance differs dramatically among folds, this might imply overfitting. Early stopping refers to evaluating the model's performance on a validation set during training and stopping the training process when the validation error begins to rise, indicating that the model is overfitting the training data. To overcome underfitting, on the other hand, feature extraction or modification may be used to represent complex interactions between the input and output variables. Model complexity can be raised to reflect underlying data patterns by including higher-order or interaction terms in the model. Ensemble methods, such as bagging, boosting, or stacking, can be used to combine different models to improve prediction accuracy. Data augmentation strategies may be used to enlarge the training data, exposing the model to a wider range of patterns while decreasing underfitting. To address these issues in our models, we used cross-validation and model complexity techniques.

Results and discussions

Experimental results, microstructure evolution.

The inverse pole figure (IPF) coloring maps and associated band contrast (BC) maps of the ZK30 alloy in the AA and ECAPed conditions are shown in Fig.  2 . High-angle grain boundaries (HAGBs) are colored black, while low-angle grain boundaries (LAGBs) are colored white for the AA condition and red for the 1P and 4Bc conditions, as shown in Fig.  2 . The grain size distribution and misorientation angle distribution of the AA and ECAPed ZK30 samples are shown in Fig.  3 . From Fig.  2 a, it is clear that the AA condition revealed a bimodal structure in which almost equiaxed refined grains coexist with coarse grains; the grain size ranged from 3.4 to 76.7 µm (Fig.  3 a) with an average grain size of 26.69 µm. On the other hand, a low fraction of LAGBs was observed, as depicted in Fig.  3 b. Accordingly, the GB map (Fig.  2 b) showed minimal LAGBs due to the recrystallization resulting from the annealing process. ECAP processing through 1P exhibited elongated grains alongside refined grains, with the grain size ranging from 1.13 to 38.1 µm and an average grain size of 3.24 µm, indicating that 1P resulted in partial recrystallization, as shown in Fig.  2 c,d. As indicated in Fig.  2 b, 1P processing experienced a refinement in the average grain size of 87.8% compared with the AA condition. In addition, from Fig.  2 b it is clear that ECAP processing via 1P resulted in a significant increase in the grain aspect ratio due to the incomplete recrystallization process. In terms of the LAGBs distribution, the GB maps of the 1P condition revealed a significant increase in the LAGBs fraction (Fig.  2 d). A significant increase in the LAGBs density of 225% after processing via 1P was depicted compared to the AA sample (Fig.  2 c). Accordingly, the UFG structure resulting from ECAP processing through 1P led to an increase in the fraction of LAGBs, which agrees with previous studies 35 , 36 . Shana et al. 35 reported that during the early passes of ECAP, generation and multiplication of dislocations occur, followed by entanglement of the dislocations forming LAGBs; hence, the density of LAGBs increased after processing through 1P. The accumulation of plastic strain up to 4Bc revealed an almost fully UFG structure, indicating that 4Bc led to a complete dynamic recrystallization (DRX) process (Fig.  2 e). The grain size ranged from 0.23 to 11.7 µm with an average grain size of 1.94 µm (the average grain size decreased by 92.7% compared to the AA condition). On the other hand, 4Bc revealed a decrease in the LAGBs density of 25.4% compared to the 1P condition due to the dynamic recovery process. The decrease in the LAGBs density after processing through 4Bc was coupled with an increase in the HAGBs of 4.4% compared to the 1P condition (Figs.  2 f, 3 b). Accordingly, the rise of the HAGBs after multiple passes can be attributed to the transformation of LAGBs into HAGBs during the DRX process.

figure 2

IPF coloring maps and their corresponding BC maps, superimposed, for the ZK30 billets in the AA condition ( a , b ) and ECAP processed through ( c , d ) 1P and ( e , f ) 4Bc (with HAGBs in black lines and LAGBs in white lines (AA) and red lines (1P, 4Bc)).

figure 3

Relative frequency of ( a ) grain size and ( b ) misorientation angle of all ZK30 samples.

Similar findings were reported in previous studies. Dumitru et al. 36 reported that ECAP processing resulted in the accumulation and re-arrangement of dislocations, forming subgrains and equiaxed grains with a UFG structure, and that a fully homogeneous, equiaxed grain structure for the ZK30 alloy was attained after the third pass. Furthermore, they reported that LAGBs are transformed into HAGBs during the multiple passes, which leads to the decrease in the LAGBs density. Figueiredo et al. 37 reported that the grains evolve during the early passes of ECAP into a bimodal structure, while further processing passes result in a homogeneous UFG structure. Zhou et al. 38 reported that increasing the number of processing passes generates new grain boundaries and increases the misorientation required to accommodate the deformation; the geometrically necessary dislocations (GNDs) form part of the total dislocation content associated with HAGBs and thus develop misorientations between neighboring grains. Tong et al. 39 reported that the fraction of LAGBs decreases during multiple passes for an Mg–Zn–Ca alloy.

Figure  4 a displays X-ray diffraction (XRD) patterns of the AA ZK30 alloy and of the 1P and 4Bc extruded samples, revealing peaks corresponding to the primary α-Mg phase, Mg₇Zn₃, and MgZn₂ phases in all extruded alloys, with an absence of diffraction peaks corresponding to oxide inclusions. Following 1P-ECAP, the α-Mg peak intensity exhibits an initial increase, succeeded by a decrease and fluctuations, signaling texture alterations along the Bc route. The identification of the MgZn₂ phase is supported by the equilibrium Mg–Zn binary phase diagram 40 . However, the weakened peak intensity detected for the MgZn₂ phase after the 4Bc ECAP process indicates that a significant portion of the MgZn₂ dissolved into the Mg matrix, attributed to its poor thermal stability. Furthermore, the atomic ratio of Mg/Zn for the second phase is approximately 2.33, leading to the deduction that it is the Mg₇Zn₃ compound. This finding aligns with recent research on Mg–Zn alloys 41 . Additionally, the diffraction patterns of the ECAP-processed samples exhibit peak broadening and shifting, indicative of microstructural adjustments during plastic deformation. These alterations were analyzed for crystallite size and micro-strain using the modified Williamson–Hall (W–H) method 42 , as illustrated in Fig.  4 b. After a single pass of ECAP, there is a reduction in crystallite size and an escalation in induced micro-strain. After four passes of route Bc, further reductions in crystallite size and heightened micro-strain (36 nm and 1.94 × 10⁻³, respectively) are observed. Divergent shearing patterns among the four processing routes, stemming from disparities in sample rotation, result in distinct evolutions of the subgrain boundaries. Route Bc, characterized by the most extensive angular range of slip, generates subgrain bands along two shearing directions, expediting the transition of subgrain boundaries into high-angle grain boundaries 43 , 44 . Consequently, dislocation density and induced micro-strains reach their maximum in route Bc, potentially influenced by texture modifications linked to orientation differences among the processing routes. Hence, as the number of ECAP passes increases, a more intense level of deformation is observed, leading to dynamic recrystallization and grain refinement, particularly after 4 ECAP passes. This enhanced deformation effectively impedes grain growth. Consequently, the number of passes in the ECAP process is intricately linked to the equivalent strain, inducing grain boundary pinning and resulting in the formation of finer grains. The grain refinement process can be conceptualized as a repetitive sequence of dynamic recovery and recrystallization in each pass. In the case of the 4Bc ECAP process, dynamic recrystallization dominates, leading to a highly uniform grain reduction and causing the grain boundaries to become less distinct 45 . Figure  4 b indicates that the microstructural features vary with the ECAP processing route, aligning well with the grain size and mechanical properties.

figure 4

( a ) XRD patterns for the AA ZK30 alloy and after 1P and 4Bc ECAP processing, ( b ) variations of crystallite size and lattice strain as a function of processing condition using the Williamson–Hall method.

Wear behavior

Figure  5 shows the volume loss (VL) and average coefficient of friction (COF) for the AA and ECAPed ZK30 alloy. The AA billets exhibited the highest VL at all wear parameters compared to the ECAPed billets, as shown in Fig.  5 . Figure  5 a reveals that performing the wear test at an applied load of 1 N produced a higher VL than the other applied loads. Increasing the applied load to 3 N gave a lower VL than the 1 N counterpart at all wear speeds, and a further increase in the applied load to 5 N revealed a notable decrease in the VL. Similar behavior was attained for the ECAP-processed billets through 1P (Fig.  5 c) and 4Bc (Fig.  5 e). The VL improved with increasing applied load for all samples, as shown in Fig.  5 , indicating an enhancement in the wear resistance. Increasing the applied load increases the strain hardening of the contacting ZK30 surfaces, as reported by Yasmin et al. 46 and Kori et al. 47 . Accordingly, increasing the applied load increases the friction force, which in turn hinders dislocation motion and results in higher deformation, so that ZK30 experiences strain hardening; hence, the resistance to abrasion increases, leading to improved wear resistance 48 . Furthermore, increasing the applied load increases the surface area in contact with the wear ball and hence increases the gripping action of asperities, which helps to reduce the wear rate of the ZK30 alloy, as reported by Thuong et al. 48 . In contrast, increasing the wear speed increased the VL of the AA billets at all wear loads. For the ECAPed billets processed through 1P, the wear speed of 125 mm/s revealed the lowest VL, while the wear speed of 250 mm/s showed the highest VL (Fig.  5 c). Similar behavior was recorded for the 4Bc condition. In addition, from Fig.  5 c it is clear that the 1P condition showed a higher VL than 4Bc (Fig.  5 e) at all wear parameters, indicating that processing via multiple passes resulted in significant grain size refinement (Fig.  2 ); hence, higher hardness and better wear behavior were attained, which agrees with a previous study 7 . In addition, from Fig.  5 , it is clear that increasing the wear speed increased the VL. For the AA billets tested at a 1 N load, the VL was 1.52 × 10⁻⁶ m³. ECAP processing via 1P significantly improved the wear behavior, as the VL was reduced by 85% compared to the AA condition, while straining through 4Bc improved the VL by 99.8% compared to the AA condition, which is accounted for by the considerable refinement that 4Bc provides. A similar trend was observed for the ECAPed ZK30 samples tested at loads of 3 and 5 N (Fig.  5 ). Accordingly, the significant grain refinement after ECAP processing (Fig.  2 ) increased the grain boundary area; hence, a thicker protective oxide layer can form, leading to improved wear resistance of the ECAPed samples. It is worth mentioning here that the grain refinement, coupled with the refinement and redistribution of the secondary phase particles resulting from ECAP processing through multiple passes, improves the hardness, wear behavior, and mechanical properties according to the Hall–Petch equation 7 , 13 , 49 . Similar findings were noted for the ZK30 billets tested at a 3 N load: processing through 1P and 4Bc decreased the VL by 85% and 99.85%, respectively, compared to the AA counterpart. A similar finding was recorded for the ZK30 billets tested at a 5 N load.

figure 5

Volume loss of ZK30 alloy ( a , c , e ) and the average coefficient of friction ( b , d , f ) in its ( a , b ) AA, ( c , d ) 1P and ( e , f ) 4Bc conditions as a function of different wear parameters.

From Fig.  5 , it can be noticed that the COF curves reveal notable fluctuations even after smoothing the data with a least-squares method, confirming that the friction during the testing of the ECAPed ZK30 alloy was not steady over the test duration. The remarkable variation in the COF can be attributed to the relatively small loads applied to the surface of the ZK30 samples. Furthermore, the results in Fig.  5 reveal that ECAP processing reduced the COF and, hence, better wear behavior was attained. Furthermore, for all ZK30 samples, the highest applied load (5 N) coupled with the shortest wear time (110 s) exhibited the best COF and the best wear behavior. These findings agree with Farhat et al. 50 , who reported that decreasing the grain size improves the COF and hence the wear behavior. Furthermore, they reported that plastic deformation occurring due to friction between the contacting surfaces is resisted by the grain boundaries and fine secondary phases. In addition, the strain hardening resulting from ECAP processing decreases the COF and improves the VL 50 . Sankuru et al. 43 reported that ECAP processing of pure Mg resulted in substantial grain refinement, which was reflected in improvements in both the microhardness and the wear rate of the ECAPed billets. Furthermore, they found that increasing the number of passes up to 4Bc reduced the wear rate by 50% compared to the AA condition. Based on the applied load, wear velocity, and distance, the wear mechanism can be classified into mild wear and severe wear regimes 49 . The wear test parameters in the present study (loads up to 5 N and speeds up to 250 mm/s) fall in the mild wear regime, where the delamination and oxidation wear mechanisms predominantly take place 43 , 51 .

The worn surface morphologies of the ZK30 AA billet and the ECAPed billet processed through 4Bc are shown in Fig.  6 . Figure  6 reveals numerous wear grooves aligned parallel to the wear direction on the worn surfaces of both the AA (Fig.  6 a) and 4Bc (Fig.  6 b) conditions. Accordingly, the worn surface comprised a combination of adhesion regions and plastic deformation bands along the wear direction. Furthermore, it can be observed that wear debris adhered to the ZK30 worn surface, which indicates that an abrasion wear mechanism occurred 52 . Lim et al. 53 reported that hard particles between contacting surfaces scratch the samples and remove small fragments, and hence the wear process occurs. In addition, from Fig.  6 a,b it can be seen that the wear grooves on the AA billet are much wider than those on the 4Bc sample, which confirms the effectiveness of ECAP processing in improving the wear behavior of the ZK30 alloy. Based on the aforementioned findings, it can be concluded that the ECAP-processed billets exhibited enhanced wear behavior, which can be attributed to the obtained UFG structure 52 .

figure 6

SEM micrographs of the worn surfaces after the wear test: ( a ) AA alloy; ( b ) ECAP-processed through 4Bc.

Prediction of wear behavior

Regression modeling.

Several regression transformation approaches and associations among the independent variables were investigated in order to model the wear output responses. The association between the supplied parameters and the resulting responses was modeled using quadratic regression. The models created in the course of the experiment are considered statistically significant and can be used to forecast the response parameters in relation to the input control parameters when the coefficient of determination of prediction (R 2 ) is as close as possible to 1. The regression Eqs. (9)–(14) represent the predicted non-linear models of volume loss (VL) and coefficient of friction (COF) at the different passes as functions of velocity (V) and applied load (P), with their associated determination and adjusted coefficients. The current study's adjusted R 2 and correlation coefficient R 2 values ranged between 95.67 and 99.97%, which is extremely close to unity.

The experimental data are plotted in Fig.  7 as a function of the corresponding predicted values for VL and COF for zero pass, one pass, and four passes. The minimal output value is indicated by blue dots, which gradually change to the maximum output value indicated by red points. The effectiveness of the produced regression models was supported by the analysis of these maps, which showed that the practical and projected values matched remarkably well and that the majority of their intersection locations were rather close to the median line.

figure 7

Comparison between VL and COF of experimental and predicted values of ZK30 at AA, 1P, and 4Bc.

As a consequence of the wear characteristics (P and V), Fig.  8 displays 3D response plots created using the regression models to assess changes in VL and COF at the various ECAP passes. For VL, the volume loss and the applied load exhibit an inverse proportionality at the various ECAP passes, as is apparent in Fig.  8 a–c. It was observed that increasing the applied load in the wear process minimizes VL, so the optimal (minimum) VL was obtained at an applied load of 5 N. There is also an inverse relation between the wear speed V and VL at the different ECAP passes. There is a clear need to change the wear speed for billets with varying numbers of passes; as a result, an increased number of passes requires a lower wear speed to minimize VL. The minimal VL at zero passes is 1.50085E−06 m³, obtained at 5 N and 250 mm/s. Also, at a single pass, the optimal VL is 2.2266028E−07 m³, obtained at 5 N and 148 mm/s. Finally, the minimum VL at four passes is 2.07783E−08 m³ at 5 N and 64.5 mm/s.

figure 8

Three-dimensional plot of VL ( a – c ) and COF ( d – f ) of ZK30 at AA, 1P, and 4Bc.

Figure  8 d–f presents the effect of the wear parameters P and V on the COF for the ECAPed ZK30 billets at zero, one, and four passes. There is an inverse proportionality between the applied load in the wear process and the coefficient of friction. As a result, the minimum optimal value of the COF of the ZK30 billet at the different processing passes was obtained at 5 N. On the other hand, the optimal speed in the wear process decreased with the number of billet passes. The optimal wear test speeds for billets at zero, one, and four passes are 250, 64.5, and 64.5 mm/s, respectively. The minimum COF at zero passes is 0.380134639, obtained at 5 N and 250 mm/s. At 5 N and 64.5 mm/s, the lowest COF at one pass is 0.220277466. Finally, the minimum COF at four passes is 0.23130154 at 5 N and 64.5 mm/s.

Machine learning prediction models

The previously mentioned modern ML algorithms have been used here to provide a solid foundation for analyzing the obtained data and gaining significant insights. The following section will give the results acquired by employing these approaches and thoroughly discuss the findings.

The correlation plots and correlation coefficients (Fig.  9 ) between the input variables (force and speed) and the six output variables (VL_P0, VL_P1, VL_P4, COF_P0, COF_P1, and COF_P4), computed for data preprocessing of the ML models, give valuable insights into the interactions between these variables. Correlation charts help to investigate the strength and direction of a linear relationship between model input and output variables. We can initially observe whether there is a positive, negative, or no correlation between each pair of variables by inspecting the scatterplots. This knowledge aids in comprehending how changes in one variable affect changes in the other. In contrast, the correlation coefficient offers a numerical assessment of the strength and direction of the linear relationship. It ranges from −1 to 1: values near −1 indicate a strong negative correlation, values close to 1 indicate a strong positive correlation, and values close to 0 indicate no or weak association. It is critical to examine the size and importance of the correlation coefficients when examining the correlation between the force and speed input variables and the six output variables (VL_P0, VL_P1, VL_P4, COF_P0, COF_P1, and COF_P4). A high positive correlation coefficient implies that a rise in one variable is connected with an increase in the other, whereas a high negative correlation coefficient indicates that an increase in one variable is associated with a decrease in the other. From Fig.  9 it is clear that, for all ZK30 billets, both VL and COF were inversely proportional to the applied load (in the range of 1 to 5 N). Regarding the wear speed, the VL of both the AA and 1P conditions was inversely proportional to the wear speed, while 4Bc exhibited a direct proportionality with the wear speed (in the range of 64.5 to 250 mm/s), even though the COF of all samples was inversely proportional to the wear speed. The VL of the AA condition (P0) revealed a strong negative correlation coefficient of −0.82 with the applied load, while it displayed an intermediate negative coefficient of −0.49 with the wear speed. For the 1P condition, the VL showed a strong negative correlation of −0.74 with the applied load, whereas it showed a very weak negative correlation coefficient of −0.13 with the speed. Furthermore, the VL of the 4Bc condition displayed a strong negative correlation of −0.99 with the applied load, while it displayed a weak positive correlation coefficient of 0.08 with the speed. A similar trend was observed for the COF: the AA, 1P, and 4Bc samples displayed negative coefficients of −0.047, −0.65, and −0.61, respectively, with the applied load, and weak negative coefficients of −0.4, −0.05, and −0.22, respectively, with the wear speed.
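The Pearson coefficients discussed above can be reproduced with a few lines of pandas; the sketch below uses a synthetic dataset and assumed column names (P, V, VL_P0, COF_P0), not the authors' files.

```python
# Sketch of how pairwise correlation coefficients such as those in Fig. 9 can be computed;
# the dataset is synthetic and the column names are assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 30
df = pd.DataFrame({"P": rng.uniform(1, 5, n), "V": rng.uniform(64.5, 250, n)})
df["VL_P0"] = 1.5e-6 - 2e-8 * df["P"] - 1e-9 * df["V"] + rng.normal(0, 1e-8, n)   # stand-in responses
df["COF_P0"] = 0.40 - 0.01 * df["P"] - 1e-4 * df["V"] + rng.normal(0, 0.01, n)

corr = df.corr(method="pearson")            # Pearson (linear) correlation matrix

# Correlations of each response with the two inputs, as discussed in the text
print(corr.loc[["VL_P0", "COF_P0"], ["P", "V"]])
```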

figure 9

Correlation plots of input and output variables showcasing the strength and direction of relationships between each input–output variable using correlation coefficients.

Figure  10 shows the predicted train and test VL values compared to the original data, indicating that the VL prediction model performed well utilizing the LR (Linear Regression) technique. The R 2 -score is a popular statistic for assessing the goodness of fit of a regression model. It runs from 0 to 1, with higher values indicating better performance. In this scenario, the R 2 -scores for both the training and test datasets range from 0.55 to 0.99, indicating that the ML model has established a significant correlation between the projected VL values and the actual data. This shows that the model can account for a considerable percentage of the variability in VL values.

figure 10

Predicted train and predicted test VL versus actual data computed for different applied loads and number of passes of ( a ) 0P (AA), ( b ) 1P, and ( c ) 4Bc: evaluating the performance of the VL prediction best model achieved using LR algorithm.

The R 2 -scores for training and testing three distinct ML models for the output variables 'VL_P0', 'VL_P1', and 'VL_P4' are summarized in Fig.  11 . The R 2 -score, also known as the coefficient of determination, is a number typically ranging from 0 to 1 that indicates how well the model fits the data. For VL_P0, the R 2 for testing is 0.69 and that for training is 0.96, indicating that the ML model predicts the VL_P0 variable with reasonable accuracy on unseen data while fitting the training data rather well. In summary, the performance of the ML models changes depending on the output variable. With R 2 values of 0.98 for both training and testing, the model predicts 'VL_P4' with high accuracy. However, the model's performance for 'VL_P0' is reasonable, with an R 2 score of 0.69 for testing and a high R 2 score of 0.96 for training. The model's performance for 'VL_P1' is relatively poor, with R 2 values of 0.55 for testing and 0.57 for training. Additional assessment measures must be considered to understand the models' prediction capabilities well. Therefore, as presented in the following section, we performed non-linear polynomial fitting and extracted equations that accurately link the output and input variables.

figure 11

Result summary of ML train and test sets displaying R 2 -score for each model.

Furthermore, the data were subjected to polynomial fitting with first- and second-degree models (Fig.  12 ). The fitting accuracy was assessed using the R 2 -score, which ranged from 0.92 to 0.98, indicating a good fit. The following equations (Eqs.  15 to 17 ) were extracted from fitting the experimental volume-loss dataset at the different conditions as a function of applied load (P) and speed (V):

figure 12

Predicted versus actual ( a ) VL_P0 fitted to Eq.  15 with R 2 -score of 0.92, ( b ) VL_P1 fitted to Eq.  16 with R 2 -score of 0.96, ( c ) VL_P4 fitted to Eq.  17 with R 2 -score of 0.98.
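Eqs. (15)–(17) themselves are not reproduced here, but the following sketch shows how first- and second-degree polynomial surfaces of VL against P and V can be fitted and scored with R 2; the dataset is synthetic and purely illustrative.

```python
# Sketch of first- and second-degree polynomial fits of volume loss against P and V,
# mirroring the type of fitting behind Eqs. (15)-(17); the data shown here are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(7)
P = rng.uniform(1, 5, 30)
V = rng.uniform(64.5, 250, 30)
VL = 1e-6 / P + 1e-9 * V + 5e-10 * P * V + rng.normal(0, 2e-8, 30)   # stand-in volume loss

X = np.column_stack([P, V])
for degree in (1, 2):
    Xp = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(X)
    fit = LinearRegression().fit(Xp, VL)
    print(f"degree {degree}: R2 = {r2_score(VL, fit.predict(Xp)):.3f}")
    # fit.intercept_ and fit.coef_ give the coefficients of the extracted polynomial equation
```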

Figure  13 depicts the predicted train and test coefficient of friction (COF) values plotted against the actual data. The figure seeks to assess the performance of the best models obtained using the SVM (support vector machine) and GPR (Gaussian process regression) algorithms for the various applied loads and numbers of passes (0, 1P, and 4Bc). The figure assesses the accuracy and efficacy of the COF prediction models by showing the predicted train and test COF values alongside the actual data. By comparing projected and actual data points, we may see how closely the models match the true values. The ML models trained and evaluated on the output variables 'COF_P0', 'COF_P1', and 'COF_P4' using the SVM and GPR algorithms show high accuracy and performance, as summarized in Fig.  13 . The R 2 scores for testing vary from 0.97 to 0.99, showing that the models efficiently capture the variability of the predicted variables. Furthermore, the training R 2 scores are consistently high at 0.99, demonstrating a solid fit to the training data. These findings imply that the ML models can accurately predict the values of 'COF_P0', 'COF_P1', and 'COF_P4' and generalize well to new unseen data.

figure 13

Predicted train and predicted test COF versus actual data computed for different applied loads and number of passes of ( a ) 0P (AA), ( b ) 1P, and ( c ) 4Bc: evaluating the performance of the COF prediction best model achieved using SVM and GPR algorithms.

Figure  14 presents a summary of the results obtained through machine learning modeling. The R 2 values achieved for COF modeling using SVM and GPR are 0.99 for the training set and range from 0.97 to 0.99 for the testing dataset. These values indicate that the models have successfully captured and accurately represented the trends in the dataset.

figure 14

Optimization of wear behavior

Optimization by response surface methodology (RSM).

The results of the RSM optimization carried out on the volume loss and coefficient of friction at zero passes (AA), along with the relevant variables, are shown in Appendix A-1. The red and blue dots represent the wear conditions (P and V) and responses (VL and COF) for each of the ensuing optimization findings. The volume loss and coefficient of friction optimization objectives were set to "in range," using "minimize" as the solution target, and the expected result of the desirability function was of the "smaller-is-better" type. The values of (A) P = 5 N and (B) V = 250 mm/s were the optimal conditions for volume loss. Appendix A-1(a) shows that this resulted in the lowest attainable volume loss value of 1.50127E−6 m³. Also, the optimal coefficient of friction conditions were (A) P = 2.911 N and (B) V = 250 mm/s. This led to the lowest possible coefficient of friction value of 0.324575, as shown in Appendix A-1(b).

Appendix A-2 displays the outcomes of the RSM optimization performed on the volume loss and coefficient of friction at one pass, together with the appropriate variables. The volume loss and coefficient of friction optimization objectives were designed to be "in range," with "minimize" as the solution objective, and the desirability function was expected to provide "smaller-is-better" characteristics. The ideal conditions for volume loss were (A) P = 4.95 N and (B) V = 136.381 mm/s. This yielded the lowest feasible volume loss value of 2.22725E−7 m³, as seen in Appendix A-2(a). The optimal P and V values for the coefficient of friction were found to be (A) P = 5 N and (B) V = 64.5 mm/s. As demonstrated in Appendix A-2(b), this resulted in the lowest achievable coefficient of friction value, which was 0.220198.

Similarly, Appendix A-3 displays the outcomes of the RSM optimization performed on the volume loss and coefficient of friction at four passes, together with the appropriate variables. The volume loss and coefficient of friction optimization objectives were designed to be "in range," with "minimize" as the solution objective, and the desirability function was expected to provide "smaller-is-better" characteristics. The optimal conditions for volume loss were (A) P = 5 N and (B) V = 77.6915 mm/s. This yielded the lowest feasible volume loss value of 2.12638E−8 m³, as seen in Appendix A-3(a). The optimal P and V values for the coefficient of friction were found to be (A) P = 4.95612 N and (B) V = 64.9861 mm/s. As seen in Appendix A-3(b), this resulted in the lowest achievable coefficient of friction value, which was 0.235109.

Optimization by genetic algorithm and hybrid DOE-GA

The most appropriate combination of wear-independent factors that yields the minimum feasible volume loss and coefficient of friction was determined using a genetic algorithm (GA). Based on the genetic algorithm technique, the objective function for each response was obtained by taking Eqs. (9)–(14) and subjecting them to the wear boundary conditions on P and V. The objective functions can be expressed as: minimize (VL, COF), subject to the ranges of wear conditions 1 ≤ P ≤ 5 (N) and 64.5 ≤ V ≤ 250 (mm/s).
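The optimization in the study was run with MATLAB's GA solver; as a rough Python analogue only, the sketch below minimizes a placeholder quadratic surface (standing in for Eq. (9), whose actual coefficients are not reproduced here) over the same bounds using an evolutionary optimizer (differential evolution).

```python
# The authors used MATLAB's GA solver; this is only a Python analogue with an evolutionary
# optimizer and a placeholder objective standing in for the regression model of Eq. (9).
import numpy as np
from scipy.optimize import differential_evolution

def vl_zero_pass(x):
    """Placeholder quadratic regression surface VL(P, V); illustrative coefficients only."""
    P, V = x
    return 1.51e-6 - 2e-9 * P - 1e-11 * V + 1e-12 * P * V

bounds = [(1.0, 5.0), (64.5, 250.0)]          # 1 <= P <= 5 N, 64.5 <= V <= 250 mm/s

result = differential_evolution(vl_zero_pass, bounds, seed=0, tol=1e-12)
print("optimal P, V:", result.x, "minimum VL:", result.fun)
```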

Figures  15 and 16 show the performance of the GA optimization technique in terms of the fitness value and the running solver view, derived from MATLAB, together with the corresponding wear conditions for the lowest VL and COF at zero passes. The minimization of VL and COF according to Eqs. (9) and (10) was used as the fitness function, subject to the wear boundary limits. According to Fig.  15 a, the lowest value of VL that the GA could find was 1.50085E−6 m³ at P = 5 N and V = 249.993 mm/s. Furthermore, the GA yielded a minimum COF value of 0.322531 at P = 2.91 N and V = 250 mm/s (Fig.  15 b).

figure 15

Optimum VL ( a ) and COF ( b ) by GA at AA condition.

figure 16

Optimum VL ( a ) and COF ( b ) by hybrid DOE-GA at AA condition.

The hybrid DOE–GA analysis was carried out to enhance the GA outcomes. The optimal wear conditions of VL and COF at zero passes were used to determine the initial populations of the hybrid DOE–GA. The hybrid DOE–GA yielded a minimum VL value of 1.50085E−6 m³ at a speed of 249.993 mm/s and a load of 5 N (Fig.  16 a). Similarly, at a load of 2.91 N and a speed of 250 mm/s, the hybrid DOE–GA yielded a minimum COF of 0.322531 (Fig.  16 b).

The fitness function, as defined by Eqs. (11) and (12), was the minimization of VL and COF at 1P, subject to the wear boundary conditions. Figure  17 a,b displays the optimal values of VL and COF obtained by the GA, which were 2.2266E−7 m³ and 0.220278, respectively. The lowest VL was measured at 147.313 mm/s and 5 N, whereas 5 N and 64.5 mm/s were the optimum wear conditions for COF as determined by the GA. The hybrid DOE–GA results for the minimum VL and COF at a single pass were 2.2266E−7 m³ and 0.220278, respectively, obtained at 147.313 mm/s and 5 N for VL, as shown in Fig.  18 a, and at 5 N and 64.5 mm/s for COF, as shown in Fig.  18 b.

figure 17

Optimum VL ( a ) and COF ( b ) by GA at 1P condition.

figure 18

Optimum VL ( a ) and COF ( b ) by hybrid DOE-GA at 1P condition.

Subject to the wear boundary conditions, the fitness function was the minimization of VL and COF at four passes, as defined by Eqs. (13) and (14). The optimum values of VL and COF obtained via the GA, shown in Fig.  19 a,b, were 2.12638E−8 m³ and 0.231302, respectively. The lowest VL was recorded at 5 N and 77.762 mm/s, whereas the GA found that the optimal wear conditions for COF were 5 N and 64.5 mm/s. In Fig.  20 a,b, the hybrid DOE–GA findings for the minimum VL and COF at four passes were 2.12638E−8 m³ and 0.231302, respectively. These results were achieved at 77.762 mm/s and 5 N for VL and at 5 N and 64.5 mm/s for COF.

figure 19

Optimum VL ( a ) and COF ( b ) by GA at 4Bc condition.

figure 20

Optimum VL ( a ) and COF ( b ) by hybrid DOE-GA at 4Bc condition.

Optimization by multi-objective genetic algorithm (MOGA)

A mathematical model whose input process parameters influence the quality of the output responses was solved using the multi-objective genetic algorithm (MOGA) technique 54 . In the current study, multi-objective optimization using a genetic algorithm (MOGA), with the regression models as objective functions, was implemented using the GA Toolbox in MATLAB 2020; the P and V input wear parameter values served as the upper and lower bounds, and the number of parameters was set to three. The following MOGA parameters were then selected: an initial population of fifty individuals, 300 generations, a migration interval of 20, a migration fraction of 0.2, and a Pareto fraction of 0.35. Constraint-dependent mutation and intermediate crossover with a crossover probability of 0.8 were used for the optimization. The result of MOGA is the Pareto optimum, also known as a non-dominated solution set: a group of solutions that consider all of the objectives without sacrificing any of them 55 .
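The MOGA runs were performed in MATLAB; independently of that toolbox, the notion of a Pareto (non-dominated) set can be illustrated with the short sketch below, in which the candidate (VL, COF) pairs are placeholders rather than the Toolbox output reported in Tables 2–4.

```python
# Sketch of extracting the Pareto (non-dominated) set for two minimized objectives (VL, COF);
# the candidate objective values below are placeholders, not the reported MOGA results.
import numpy as np

def pareto_front(points):
    """Return the non-dominated rows of an (n, 2) array where both objectives are minimized."""
    keep = []
    for i, p in enumerate(points):
        dominated = any(
            np.all(q <= p) and np.any(q < p)   # q is at least as good everywhere and better somewhere
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            keep.append(i)
    return points[keep]

candidates = np.array([
    [1.5010e-6, 0.403],    # [VL (m^3), COF]
    [1.5021e-6, 0.372],
    [1.5054e-6, 0.341],
    [1.5060e-6, 0.360],    # dominated by the row above
])
print(pareto_front(candidates))
```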

Both responses were treated as multi-objective functions to identify the lowest possible values of the volume loss and coefficient of friction at zero passes. Equations (9) and (10) were the fitness functions for the volume loss and coefficient of friction at zero passes for ZK30. The Pareto front values for the volume loss and coefficient of friction at zero passes, as determined by MOGA, are listed in Table 2 . The Pareto chart points of the volume loss (Objective 1) and coefficient of friction (Objective 2) at zero passes are shown in Fig.  21 . A trade-off was observed: the friction coefficient is reduced at the cost of an excessive volume loss, so accepting a higher volume loss can buy a decrease in the coefficient of friction. For zero passes, the best volume loss was 1.50096E−06 m³ with a sacrificed coefficient of friction of 0.402941, whereas the worst volume loss was 1.50541E−06 m³ with the best coefficient of friction of 0.341073.

The genetic algorithm was used for the multi-objective functions of minimal volume loss and coefficient of friction. The fitness functions for volume loss and coefficient of friction at one pass were represented by Eqs. (11) and (12), respectively. Table 3 displays the Pareto front points of volume loss and coefficient of friction at one pass. Figure  22 presents the volume loss (Objective 1) and coefficient of friction (Objective 2) Pareto chart points for a single pass. It was discovered that the coefficient of friction decreases as the volume loss increases. As a result, the volume loss can be reduced at the expense of a higher coefficient of friction. The best volume loss for a single pass was 2.22699E−07 m 3 , with the worst maximum coefficient of friction being 0.242371 and the best minimum coefficient of friction being 0.224776 at a volume loss of 2.23405E−07 m 3 .

Equations (13) and (14) served as the fitness functions for the multi-objective minimization of the volume loss and coefficient of friction at four passes, respectively. The Pareto front points of the volume loss and coefficient of friction at four passes are shown in Table 4 . The Pareto chart points for the volume loss (Objective 1) and coefficient of friction (Objective 2) for four passes are shown in Fig.  23 . It was shown that as the volume loss increases, the coefficient of friction decreases; the volume loss can therefore be decreased, however, at the expense of an increased coefficient of friction. The best minimum coefficient of friction was 0.2313046 at a volume loss of 2.12663E−08 m³, and the best minimum volume loss was 2.126397E−08 m³ at a coefficient of friction of 0.245145 for four passes. In addition, Table 5 compares the wear response values obtained by DOE, RSM, GA, hybrid RSM-GA, and MOGA.

Optimization of large space

This section proposes the optimal wear parameters for the different responses, namely the VL and COF of ZK30. The presented optimal wear parameters, P and V, are based on previous studies of ZK30 that recommended applied loads from 1 to 30 N and speeds from 64.5 to 1000 mm/s. Table 6 presents the optimal conditions of the wear process for the different responses obtained by the genetic algorithm (GA).

Validation of RSM and ML models for ZK30 processing through ECAP

Validation of RSM models.

Table 7 displays the validity of the wear regression models for VL under several conditions. The validation of the wear models was achieved under various load and speed conditions. Based on the validation data, the volume loss response models had the lowest error percentage between the experimental and regression values and were the most accurate. Table 7 indicates that the predictive modeling performance has been validated, as shown by the reasonably high accuracy obtained, ranging from 69.7 to 99.9%.

Validation of ML models

Equations ( 15 to 17 ) provide insights into the relationship that links the volume loss with the applied load and speed, allowing us to understand how changes in these factors affect the volume loss in the given system. The validity of this modeling was further examined using a new, unseen dataset with which the prediction error and accuracy were calculated, as shown in Table 8 . Table 8 demonstrates that the predictive modeling performance has been validated, as evidenced by the obtained accuracy ranging from 69.7 to 99.9%, which is reasonably high.
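Assuming the accuracy in Tables 7 and 8 is computed as 100 × (1 − |predicted − actual| / actual), which is an assumption on our part, the validation step can be sketched as follows; the example values are hypothetical.

```python
# Sketch of the validation step, assuming accuracy = 100*(1 - |error|/actual);
# this definition and the example values are assumptions, not taken from Tables 7-8.
import numpy as np

actual    = np.array([1.52e-6, 2.25e-7, 2.10e-8])   # hypothetical measured VL values
predicted = np.array([1.50e-6, 2.30e-7, 2.08e-8])   # corresponding model predictions

error_pct    = 100 * np.abs(predicted - actual) / actual
accuracy_pct = 100 - error_pct
print("error %:", np.round(error_pct, 2), "accuracy %:", np.round(accuracy_pct, 2))
```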

Conclusions

This research presents a comparative study of the wear behavior of ECAPed ZK30 alloys using experimental, statistical, and machine learning techniques. Different ECAP processing conditions were implemented, including the processing routes and the number of pressing passes. The wear behavior of the ECAPed ZK30 alloy was thoroughly examined in terms of volume loss and coefficient of friction under different applied loads and speeds. Prediction and optimization of the wear test parameters of the ECAPed ZK30 samples were performed via different statistical and machine learning approaches. Finally, the models obtained from RSM and ML were validated against another set of experimental conditions. The following conclusions could be drawn:

The ECAP process leads to significant grain refinement, particularly with the 4Bc route, resulting in the formation of fine grains. The average grain size of ECAPed ZK30 decreased significantly, by 92.7% compared to the AA condition, reaching an average size of 1.94 µm.

ECAP processing, through 1P and 4Bc routes, demonstrates a substantial enhancement in wear resistance. The wear volume loss (VL) has shown remarkable reductions of 85% and 99.8%, respectively, compared to the AA condition.

The fluctuation in coefficient of friction (COF) curves during testing of ECAPed ZK30 alloy, attributed to smaller applied loads, indicates non-steady friction behavior. However, overall, ECAP results in a reduction in COF, signifying improved wear behavior.

The regression models of VL and COF have coefficient of determination (R²) and adjusted R² values ranging from 95.67 to 99.97% in the present research, indicating that the experimental and predicted values agree exceptionally well.

The 3D plots reveal that the minimal VL at different ECAP passes was obtained at the highest condition of the wear test.

The minimal COF for all ECAP passes was obtained at maximum wear load. However, the optimal speed in the wear process decreased with the number of passes.

The ML prediction models established a significant correlation between the predicted and the actual data, with R²-scores ranging from 0.92 to 0.98 for VL and from 0.97 to 0.99 for COF.

There is good overlap between the wear response values of the DOE-obtained experimental findings and the optimization results from RSM, GA, MOGA, and hybrid DOE–GA.

The validation of the predicted ML models and the VL regression models under different wear conditions yielded an accuracy range of 70% to 99.7%.

Data availability

Data is provided within the manuscript and the supplementary information files.


Acknowledgements

Researchers would like to thank the Deanship of Scientific Research, Qassim University, for funding the publication of this project.

Author information

Authors and affiliations

Department of Electrical Engineering, College of Engineering, Qassim University, 56452, Unaizah, Saudi Arabia

Mahmoud Shaban & Fahad Nasser Alsunaydih

Department of Electrical Engineering, Faculty of Engineering, Aswan University, Aswan, 81542, Egypt

Mahmoud Shaban

Department of Production Engineering and Mechanical Design, Port Said University, Port Fouad, 42526, Egypt

Hanan Kouta, Samar El-Sanabary & Yasmine El-Taybany

Department of Mechanical Engineering, College of Engineering, Qassim University, 56452, Unaizah, Saudi Arabia

Abdulrahman Alrumayh, Abdulrahman I. Alateyah & Waleed H. El-Garaihy

Center of Excellence for Nanomaterials for Clean Energy Applications, King Abdulaziz City for Science and Technology (KACST), 12354, Riyadh, Saudi Arabia

Majed O. Alawad

Mechanical Engineering Department, Faculty of Engineering, Suez Canal University, Ismailia, 41522, Egypt

Waleed H. El-Garaihy


Contributions

Mahmoud Shaban: Formal analysis, Software, Writing – original draft. Fahad Nasser Alsunaydih: Formal analysis, Software, Writing – review & editing. Hanan Kouta: software, Validation, data curation. Samar El-Sanabary: data curation, Validation, Writing – original draft. Abdulrahman Alrumayh: Methodology, Writing – original draft. Abdulrahman I. Alateyah: Project administration, Conceptualization, Project Supervision. Majed O. Alawad: Methodology, investigation, Writing – original draft. Waleed H. El-Garaihy: Project administration, Conceptualization, Methodology, Investigation, Writing – original draft, Writing – review & editing. Yasmine El-Taybany: Formal analysis, Writing – original draft.

Corresponding authors

Correspondence to Abdulrahman I. Alateyah or Waleed H. El-Garaihy .

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Shaban, M., Alsunaydih, F.N., Kouta, H. et al. Optimization of wear parameters for ECAP-processed ZK30 alloy using response surface and machine learning approaches: a comparative study. Sci Rep 14, 9233 (2024). https://doi.org/10.1038/s41598-024-59880-0


Received : 07 December 2023

Accepted : 16 April 2024

Published : 22 April 2024

DOI : https://doi.org/10.1038/s41598-024-59880-0


Keywords: Equal channel angular pressing; Wear performance; Response surface methodology; Machine learning



  • Open access
  • Published: 20 April 2024

L-shaped association between lean body mass to visceral fat mass ratio with hyperuricemia: a cross-sectional study

  • Longti Li 1 , 2 ,
  • Ya Shao 2 , 3 ,
  • Huiqin Zhong 2 ,
  • Yu Wang 3 ,
  • Rong Zhang 2 ,
  • Boxiong Gong 2 &
  • Xiaoxv Yin 1  

Lipids in Health and Disease volume 23, Article number: 116 (2024)


Insufficient attention has been given to examining the correlation between body composition and hyperuricemia, leading to inconsistent findings. The primary objective of this research is to explore the association between lean body mass index (LMI), visceral fat mass index (VFMI), and hyperuricemia. A specific emphasis will be placed on assessing the link between the ratio of lean body mass to visceral fat mass (LMI/VFMI) and hyperuricemia.

The present study employed a cross-sectional design and involved a total of 9,646 individuals who participated in the National Health and Nutrition Examination Survey (NHANES). To explore the associations among the variables, logistic and linear regressions were employed. Additionally, subgroup analyses and sensitivity analyses were conducted based on various characteristics.

The results showed that LMI was positively associated with hyperuricemia (per SD: OR = 1.88, 95% CI: 1.75, 2.01; for quartiles [Q4:Q1]: OR = 5.37, 95% CI: 4.31, 6.69). Meanwhile, VFMI showed a positive association with hyperuricemia (per SD: OR = 2.02, 95% CI: 1.88, 2.16; for quartiles [Q4:Q1]: OR = 8.37, 95% CI: 6.70, 10.47). When considering the effects of ln LMI/VFMI, an L-shaped negative association with hyperuricemia was observed (per SD: OR = 0.45, 95% CI: 0.42, 0.49; for quartiles [Q4:Q1]: OR = 0.16, 95% CI: 0.13, 0.20). Subgroup and sensitivity analyses demonstrated the robustness of this association across different subgroups. Additionally, the segmented regression analysis indicated a saturation effect of 5.64 for ln LMI/VFMI with hyperuricemia (OR = 0.20, 95% CI: 0.17, 0.24). For every 2.72-fold increase in LMI/VFMI, the risk of hyperuricemia was reduced by 80%.

The LMI/VFMI ratio is non-linearly associated with serum uric acid. Whether this association is causal needs to be confirmed in further longitudinal studies or Mendelian randomization.

Introduction

Globally, hyperuricemia is on the rise, posing a significant health threat: a recent U.S. study reported that 20% of adults aged 20 or older had hyperuricemia [ 1 ]. Similarly, a survey conducted in China among adults aged 18 to 59 reported a hyperuricemia prevalence of 15% [ 2 ]. Health outcomes across a wide variety of diseases are robustly correlated with hyperuricemia [ 3 ], encompassing but not restricted to hypertension [ 4 ], diabetes mellitus [ 5 ], cardiovascular and cerebrovascular disease [ 6 ], and all-cause mortality [ 7 ]. Consequently, it is imperative to ascertain the factors linked to hyperuricemia.

Among the risk factors for hyperuricemia, obesity is an important one. Researchers have investigated the connection between hyperuricemia and conventional body metrics like waist circumference (WC) and body mass index (BMI) [ 8 , 9 ]. Moreover, researchers have examined the correlation between other alternative indicators for assessing obesity and hyperuricemia, such as lipid accumulation product, body roundness index, and visceral adiposity index [ 10 , 11 , 12 ]. However, these proxies are derived indirectly from physical measurements or a combination of physical measures (such as BMI, WC, or height) and blood markers (such as triglycerides or high-density lipoprotein cholesterol). Accordingly, they do not facilitate a comprehensive and precise visual assessment of obesity severity and body fat distribution across the entire body.

Recent progress has been made in assay methodologies for assessing body composition [ 13 ]. These new methods offer enhanced precision in discerning muscle and adipose tissue distribution. Many investigations have substantiated the adverse influence of adipose tissue on hyperuricemia [ 14 , 15 ]. However, it must be acknowledged that there may also be some degree of association between muscle tissue and serum uric acid (SUA) levels, a relationship confirmed by Chen et al. [ 16 ]. Exploring the correlation between adipose tissue and hyperuricemia in isolation may therefore be confounded by other components of body composition, especially muscle tissue. This consideration has usually been overlooked in previous research.

Contemporary investigations have further revealed that maintaining an optimal proportion of lean body mass to adipose tissue yields advantageous outcomes in mitigating metabolic risk [ 17 ]. We are dedicated to researching the relationship between body composition and metabolic health [ 18 , 19 ]. However, the existing evidence is inadequate to establish a correlation between the proportion of lean body mass to visceral fat and hyperuricemia. Consequently, this specific association was the primary objective of our research.

Material and methods

Study design

This cross-sectional study comprised four cycles of the National Health and Nutrition Examination Survey (NHANES) conducted in the U.S. between 2011 and 2018. Approximately 5,000 individuals were selected each year from 15 counties across the U.S. to participate in the survey. For more information about the survey, see the NHANES Plan and Operations manual [ 20 ]. The National Center for Health Statistics Research Ethics Review Board approved the survey (Protocol #2011-17 and Protocol #2018-01). The present study, as a secondary analysis of data from that survey, was not subject to ethical review.

Study population

In the period spanning from 2011 to 2018, 22,617 individuals over 20 years of age participated in the four cycles. Among these participants, 10,896 completed the body composition assessment, while 10,380 completed the SUA measurement. Twenty-eight individuals were excluded because they were missing the height information needed to calculate standardized indices of lean body and visceral fat mass. To account for potential confounding by renal disease, we excluded individuals who underwent dialysis in the previous year or had an estimated glomerular filtration rate (eGFR) lower than 30 mL/min/1.73 m². Additionally, individuals with obesity and a BMI exceeding 40 kg/m² were also excluded, as this population is known to experience a multitude of metabolic disorders that may interfere with the study outcomes. According to the abovementioned criteria, a cumulative count of 9,646 study participants remained for further evaluation. The selection of participants is illustrated in Fig. 1. This study is reported in accordance with the STROBE statement (Supplementary File S1).
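
A minimal pandas sketch of the exclusion cascade described above is given below. The column names (age, has_dxa, has_sua, height_cm, dialysis_last_year, egfr, bmi) are hypothetical placeholders, not actual NHANES variable codes.

```python
import pandas as pd

def apply_exclusions(df: pd.DataFrame) -> pd.DataFrame:
    """Reproduce the participant-selection steps; column names are placeholders."""
    cohort = df[df["age"] >= 20]                              # adults aged 20 years and over
    cohort = cohort[cohort["has_dxa"] & cohort["has_sua"]]    # completed DXA and SUA measurement
    cohort = cohort[cohort["height_cm"].notna()]              # height needed for the indices
    cohort = cohort[~cohort["dialysis_last_year"]]            # no dialysis in the previous year
    cohort = cohort[cohort["egfr"] >= 30]                     # exclude eGFR < 30 mL/min/1.73 m^2
    cohort = cohort[cohort["bmi"] <= 40]                      # exclude BMI > 40 kg/m^2
    return cohort
```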

Figure 1. Flowchart for the selection of subjects.

Measurement of lean and visceral fat mass

Lean and visceral fat mass were measured using dual-energy X-ray absorptiometry (DXA) in a mobile examination center, where lean body mass excluded bone mineral content. In the NHANES survey, individuals aged 8-59 were eligible for the scan, with exclusions for pregnancy, recent ingestion of radiographic material, and individuals weighing over 450 pounds or taller than 6 feet 5 inches. Before the test, participants were instructed to remove all metal objects from their bodies. The scanning process follows stringent quality control procedures, beginning with the involvement of trained and certified radiologic technologists who perform all scans. The Hologic anthropomorphic spine phantom in the mobile examination center was scanned daily to ensure accurate equipment calibration. Additionally, the NHANES Quality Control Center conducts expert reviews of all participant scans to ensure consistency of results [ 21 ]. Considering the potential effect of height on these variables, we calculated the lean body mass index (LMI) and visceral fat mass index (VFMI) following the common practice of previous related studies [ 22 ].
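
The exact index definition follows reference [ 22 ] and is not restated here; a common convention in DXA studies is to divide mass in kilograms by height in metres squared. Under that assumption, a small sketch of the derived variables (including the log-transformed ratio used later) might look as follows, with placeholder column names.

```python
import numpy as np
import pandas as pd

def add_body_composition_indices(df: pd.DataFrame) -> pd.DataFrame:
    """Assumes LMI and VFMI are height-normalized as mass (kg) / height (m)^2;
    this is an assumption about the index definition, and column names are placeholders."""
    out = df.copy()
    height_m = out["height_cm"] / 100.0
    out["lmi"] = out["lean_mass_kg"] / height_m ** 2             # lean body mass index
    out["vfmi"] = out["visceral_fat_mass_kg"] / height_m ** 2    # visceral fat mass index
    out["ln_lmi_vfmi"] = np.log(out["lmi"] / out["vfmi"])        # ln-transformed ratio
    return out
```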

Hyperuricemia assessment

SUA was measured from blood samples collected from participants at the mobile examination center; the samples were cryogenically stored until transported to the collaborating laboratory for analysis. Trained technicians measured SUA concentrations using standardized procedures on the Beckman Coulter UniCel® DxC800 (2011-2016) and the Roche Cobas 6000 (2017-2018) [ 23 ]. Hyperuricemia was primarily defined as SUA levels exceeding 7.0 mg/dL in men and 6.0 mg/dL in women [ 24 ]. However, due to the ongoing debate regarding the appropriate cutoff value for elevated SUA levels, we performed a sensitivity analysis employing a cutoff of SUA ≥ 6.8 mg/dL. This value was chosen because it corresponds to the solubility limit of uric acid at normal physiological pH and temperature [ 25 ].
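
The sex-specific definition translates into a simple flagging rule; the sketch below also accommodates the 6.8 mg/dL sensitivity cutoff. Column names (sex, sua_mgdl) are placeholders.

```python
import pandas as pd

def flag_hyperuricemia(df: pd.DataFrame,
                       cutoff_male: float = 7.0,
                       cutoff_female: float = 6.0) -> pd.Series:
    """Primary definition: SUA > 7.0 mg/dL in men and > 6.0 mg/dL in women.
    The sensitivity analysis (SUA >= 6.8 mg/dL for both sexes) can be approximated
    by calling with cutoff_male = cutoff_female = 6.8."""
    cutoff = df["sex"].map({"male": cutoff_male, "female": cutoff_female})
    return df["sua_mgdl"] > cutoff
```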

Based on prior knowledge and existing literature [ 26 , 27 ], a broad range of confounders were considered, including sex, age, race, education level, poverty-to-income ratio (PIR), WC, and BMI. A health technician measured WC and BMI at the mobile examination center. Participants were divided into three groups based on BMI: normal weight (BMI < 25), overweight (25 ≤ BMI < 30), and obese (BMI ≥ 30). WC was classified as healthy (male WC < 94 cm, female WC < 80 cm) or unhealthy (male WC ≥ 94 cm, female WC ≥ 80 cm) [ 28 ]. Additionally, physical activity, smoking, and alcohol use were taken into account. We assessed participants' activity levels using metabolic equivalent (MET) scores, calculated from the time spent each week on work-related vigorous/moderate activities, recreational physical activity, and walking or cycling. Individuals with MET scores < 600 per week were defined as having low activity levels, scores between 600 and 3000 indicated moderate activity, and scores higher than 3000 were classified as vigorous [ 29 ]. Smoking and alcohol consumption were evaluated through the Alcohol and Cigarette Use Questionnaires. A current smoker was defined as someone who had smoked 100 or more cigarettes in their lifetime and currently smokes [ 30 ]. Alcohol use was categorized according to whether participants consumed more than 12 drinks per year [ 31 ].
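
The covariate recoding described above can be expressed compactly; the sketch below uses placeholder column names and mirrors the BMI, WC, activity, smoking, and alcohol categories as stated.

```python
import pandas as pd

def categorize_covariates(df: pd.DataFrame) -> pd.DataFrame:
    """Recode lifestyle and anthropometric covariates; column names are placeholders."""
    out = df.copy()
    out["bmi_group"] = pd.cut(out["bmi"], bins=[0, 25, 30, float("inf")],
                              labels=["normal", "overweight", "obese"], right=False)
    out["wc_unhealthy"] = ((out["sex"] == "male") & (out["wc_cm"] >= 94)) | \
                          ((out["sex"] == "female") & (out["wc_cm"] >= 80))
    out["activity"] = pd.cut(out["met_min_per_week"], bins=[0, 600, 3000, float("inf")],
                             labels=["low", "moderate", "vigorous"], right=False)
    out["current_smoker"] = out["smoked_100_cigs"] & out["smokes_now"]
    out["alcohol_user"] = out["drinks_per_year"] > 12
    return out
```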

The present study evaluated major disease statuses, including hypertension, diabetes, cardiovascular disease (CVD), and gout. Hypertension was operationally defined as a previous diagnosis by a healthcare professional, use of antihypertensive medication, or systolic or diastolic blood pressure exceeding the current recommended standard (140/90 mmHg) [ 32 ]. Diabetes mellitus was defined by a prior diagnosis, use of glucose-lowering medications or insulin, or fasting blood glucose, 2-hour oral glucose tolerance test, or glycated hemoglobin values above the current diagnostic criteria [ 33 ]. CVD was determined from a medical status questionnaire asking whether a doctor or other health professional had ever told participants that they had heart failure, coronary heart disease, angina/angina pectoris, heart attack, or stroke. Gout status was ascertained through a questionnaire asking participants whether they had previously been diagnosed by healthcare workers. In addition, blood markers, including total cholesterol (TC), alanine aminotransferase (ALT), and eGFR, were considered. The eGFR was calculated using the CKD-EPI 2021 creatinine equation [ 34 ].

A directed acyclic graph (DAG) was utilized to pinpoint potential covariates that required adjustment in the multivariable analysis [ 35 ]. Referring to the DAG (see supplementary figure 1 ), a minimal set of variables was selected for adjustment: sex, age, race, education level, PIR, physical activity, alcohol consumption, and smoking.

Statistical analysis

The mean ± standard deviation (SD) was used to describe characteristics of the participating population that followed a normal distribution, while the median and interquartile range were used for characteristics that deviated from normality. Percentages were used to report categorical variables. Multivariable logistic regression models were applied to investigate the associations of LMI, VFMI, and the ratio of LMI to VFMI with hyperuricemia. To account for the skewed distribution of LMI/VFMI, we applied a natural logarithmic transformation (ln LMI/VFMI) to normalize its distribution. Before the regression analysis, a diagnostic assessment of multicollinearity was conducted to identify any issues of covariance among the independent variables; a variance inflation factor below 10 indicates an acceptable level of multicollinearity [ 36 ]. Missing covariate data were estimated using multiple imputation by chained equations (MICE), and a total of five imputed datasets were created.
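
Both the multicollinearity check and the imputation step have standard implementations. The sketch below uses statsmodels for the variance inflation factors and its MICE utilities for the five imputed datasets; column names are placeholders, and the number of chained-equation cycles is an arbitrary choice.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.imputation.mice import MICEData

def vif_table(df: pd.DataFrame, columns) -> pd.Series:
    """VIF for each chosen covariate; values below 10 are treated as acceptable."""
    X = sm.add_constant(df[list(columns)].astype(float))
    return pd.Series({col: variance_inflation_factor(X.values, i)
                      for i, col in enumerate(X.columns) if col != "const"})

def impute_datasets(df: pd.DataFrame, n_imputations: int = 5, n_cycles: int = 10):
    """Create several imputed copies of the numeric covariates via chained equations."""
    imputed = []
    for _ in range(n_imputations):
        md = MICEData(df.select_dtypes("number"))
        md.update_all(n_cycles)          # run the chained-equation cycles
        imputed.append(md.data.copy())
    return imputed
```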

Three models were used in the regression analysis. Model 1 did not adjust for any variables; model 2 adjusted for sex, age, and race; and model 3 adjusted for sex, age, race, education level, PIR, MET scores, alcohol, and smoking. The dose-response relationship between the exposure variables and hyperuricemia was evaluated using generalized additive models (with a logit link), a method widely used to assess non-linear relationships between variables [ 37 , 38 ]. Threshold effects between ln LMI/VFMI and hyperuricemia were analyzed using smoothed curve fitting. Specifically, segmented regression was used, fitting a separate line within each interval [ 39 ]. The segmented regression model was compared with the single-line model through log-likelihood ratio tests to identify the presence of a critical point.
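
A minimal sketch of the fully adjusted logistic model (model 3) and of a two-slope segmented fit around a candidate threshold is shown below, using the statsmodels formula interface. Variable names are placeholders for the analysis dataset, and in practice the knot value would be chosen by the log-likelihood comparison described above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fit_model3(df: pd.DataFrame):
    """Fully adjusted logistic regression (model 3); column names are placeholders."""
    fit = smf.logit(
        "hyperuricemia ~ ln_lmi_vfmi + C(sex) + age + C(race) + C(education)"
        " + pir + met_score + C(alcohol_user) + C(current_smoker)",
        data=df,
    ).fit(disp=False)
    ors = np.exp(fit.conf_int())
    ors.columns = ["OR 2.5%", "OR 97.5%"]
    ors["OR"] = np.exp(fit.params)
    return fit, ors

def fit_segmented(df: pd.DataFrame, knot: float):
    """Two-slope (segmented) logistic fit around a candidate knot in ln LMI/VFMI."""
    d = df.copy()
    d["below_knot"] = np.minimum(d["ln_lmi_vfmi"], knot)
    d["above_knot"] = np.maximum(d["ln_lmi_vfmi"] - knot, 0.0)
    return smf.logit("hyperuricemia ~ below_knot + above_knot + C(sex) + age",
                     data=d).fit(disp=False)
```

Comparing the log-likelihoods of the segmented and single-line fits then reproduces the threshold test described above.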

The stability of the results was verified with subgroup and sensitivity analyses. Subgroup analyses were conducted with separate stratification by sex, BMI, WC, MET score, smoking, alcohol consumption, hypertension, diabetes mellitus, CVD, gout, and eGFR. We examined five primary scenarios in our sensitivity analyses: varying hyperuricemia thresholds, evaluating SUA as a continuous variable, analyzing the raw data without implementing MICE, variations in SUA testing methods across cycles, and the potential use of SUA-lowering drugs by patients with gout.

Characteristics of participants

Of the 9,646 study subjects, 1,455 were diagnosed with hyperuricemia. Subjects averaged 39 years of age, and 49.4% were female. Statistical analysis revealed significant differences across quartiles of ln LMI/VFMI in several characteristics, including sex, race, education level, MET scores, smoking, alcohol, gout, hypertension, diabetes, CVD, and SUA. In the highest quartile of ln LMI/VFMI, age, BMI, WC, TC, and ALT were lower. More detailed information can be found in Table 1.

Association of ln LMI/VFMI with hyperuricemia

Table 2 presents the logistic regression results. The regression analyses, whether unadjusted, partially adjusted, or fully adjusted, consistently demonstrated a positive association between LMI and hyperuricemia. The OR per SD was 1.88 (95% CI: 1.75, 2.01), while the OR for quartile 4 versus quartile 1 was 5.37 (95% CI: 4.31, 6.69).

At the same time, a positive association was observed between VFMI and hyperuricemia. This association remained significant whether VFMI was examined as a continuous or a categorical variable. The OR per SD was 2.02 (95% CI: 1.88, 2.16), and the ORs for quartile 2, quartile 3, and quartile 4 were 2.54 (95% CI: 2.06, 3.13), 4.61 (95% CI: 3.73, 5.69), and 8.37 (95% CI: 6.70, 10.47), respectively, compared with the reference quartile. Furthermore, a significant trend ( P < 0.001) indicated an increased risk of hyperuricemia with higher quartiles of VFMI.

A negative association was observed between ln LMI/VFMI and hyperuricemia. This association remained consistent in the partially and fully adjusted models. With each SD increase in ln LMI/VFMI, the risk of hyperuricemia decreased by 55% (OR = 0.45; 95% CI: 0.42, 0.49). Likewise, when ln LMI/VFMI was examined as a categorical variable, the prevalence of hyperuricemia decreased with increasing quartiles ( P for trend < 0.001). The ORs for the second, third, and fourth quartiles were 0.60 (95% CI: 0.51, 0.70), 0.40 (95% CI: 0.34, 0.48), and 0.16 (95% CI: 0.13, 0.20), respectively.

Smooth curve fitting and saturation effect analysis

Generalized additive modeling revealed a non-linear positive association of LMI and VFMI with hyperuricemia (Supplementary Figures 2-3). When investigating the association between ln LMI/VFMI and hyperuricemia, an L-shaped negative relationship with a saturation point at 5.64 was identified (Fig. 2). Notably, when ln LMI/VFMI was below 5.64, each 2.72-fold increase in the ratio of LMI to VFMI reduced the risk of hyperuricemia by 80% (OR = 0.20; 95% CI: 0.17, 0.24). However, beyond the critical threshold of 5.64, the association appeared to plateau and was no longer statistically significant (OR = 2.19; 95% CI: 0.86, 5.55). See Table 3.
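
The "2.72-fold" phrasing is a direct consequence of the natural-log transformation: a one-unit increase in ln(LMI/VFMI) multiplies the raw ratio by e ≈ 2.72, so the per-unit OR below the threshold converts to the stated risk reduction, as the short derivation below shows (using the reported OR of 0.20).

```latex
\Delta \ln\!\left(\frac{\mathrm{LMI}}{\mathrm{VFMI}}\right) = 1
\;\Longleftrightarrow\;
\frac{\mathrm{LMI}}{\mathrm{VFMI}} \to e \cdot \frac{\mathrm{LMI}}{\mathrm{VFMI}},
\quad e \approx 2.72,
\qquad
\text{risk reduction} = (1-\mathrm{OR}) \times 100\% = (1-0.20) \times 100\% = 80\%.
```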

Figure 2. Dose-response relationship between ln LMI/VFMI and hyperuricemia. The solid red line depicts the smooth curve, and the blue bands show the 95% confidence interval around the fit.

Subgroup and sensitivity analysis

We performed subgroup analyses to evaluate the consistency of the association between the exposure variables and hyperuricemia. After adjustment for covariates, the results showed that both LMI and VFMI were significantly and positively associated with hyperuricemia across subgroups (Supplementary Figures 4-5). Across all subgroups, a persistent and steady inverse relationship between ln LMI/VFMI and hyperuricemia was observed (Fig. 3).

Figure 3. Forest plot of the ORs of hyperuricemia associated with ln LMI/VFMI according to different subgroups.

The potential impact of different SUA thresholds on the outcomes was assessed in the sensitivity analyses, with hyperuricemia defined as SUA levels equal to or greater than 6.8 mg/dL. Furthermore, SUA was also analyzed as a continuous variable. To address missing data, the original data without MICE were analyzed. Given the change in SUA measurement methods between 2017-2018 and the previous cycles, we excluded data from that cycle in a further analysis. Additionally, we excluded gout patients from the data analysis, considering that their SUA levels might have been influenced by medication. The results of these additional analyses did not differ significantly from the primary analyses (Supplementary Tables 1-5).

In the current research, a positive association of LMI and VFMI with hyperuricemia was found. Conversely, a negative association was observed between ln LMI/VFMI and hyperuricemia. Regression analyses were conducted with careful adjustment for potential confounders, allowing us to control for potential biases as much as possible. Additionally, stratified subgroup and sensitivity analyses were performed. The findings revealed that variations in these clinical characteristics did not substantially affect the relationship between the variables, affirming the robustness and reliability of the results.

Scholars have studied the relationship between muscle mass or strength and SUA. However, the available evidence to date is mixed and inconsistent. A number of researchers have observed a positive correlation between muscle mass or strength and SUA in people of all ages. Alvim et al. [ 40 ] discovered that children and adolescents with higher muscle mass also had higher SUA levels. Dong et al. found that elevated SUA was linked to higher muscle mass in community-dwelling adults over 40 [ 41 ]. Similarly, Xu et al. [ 42 ] surveyed 992 hospitalized patients over 45 years old and found an inverted J-shaped relationship between SUA levels and handgrip strength. According to Nahas et al. [ 43 ] and Molino-Lova et al. [ 44 ], SUA was positively correlated with muscle strength in older adults. As in these previous studies, we found that lean body mass was positively associated with hyperuricemia. However, several other studies have reported contradictory findings. Beavers et al. discovered a strong correlation between high SUA and sarcopenia in a study involving 7,544 adults over 40 from NHANES III [ 45 ]. Similarly, according to a survey of Brazilian adults over 20, muscle mass index was negatively associated with high SUA [ 46 ]. Tanaka et al. [ 47 ] also found a negative correlation between SUA and skeletal muscle mass in individuals with type 2 diabetes.

Several studies have examined the link between obesity and hyperuricemia. In China, Han et al. [ 5 ] conducted two prospective studies comprising 17,044 individuals who were followed for an average duration of 6 years; their findings indicated a positive association between BMI and increased SUA levels. A dose-dependent correlation between hyperuricemia and overweight/obesity was illustrated by Choi et al. [ 48 ], indicating a population-attributable risk of 44%. Additionally, researchers have examined the link between adipose tissue and SUA by evaluating body composition. A previous investigation reported that the distribution of body fat could potentially affect the occurrence of hyperuricemia among individuals with obesity [ 49 ]. Furthermore, Takahashi et al. discovered that visceral adiposity was crucial in increasing SUA concentrations and reducing uric acid clearance, with visceral fat accumulation having a more detrimental impact on uric acid metabolism than BMI or subcutaneous fat accumulation [ 50 ]. However, these studies generally had small sample sizes.

Moreover, researchers have found that different body fat depots may have different metabolic consequences. Bai et al. [ 27 ] examined a cohort comprising 3,683 middle-aged and older individuals; their findings highlighted a significant association between SUA levels and visceral and hepatic adipose tissue, but this association did not hold for subcutaneous fat. Similar findings were observed in other studies [ 51 , 52 , 53 ]. Xie and colleagues [ 54 ] examined a sample of 271 children and adolescents in China who were classified as obese. Their study revealed that skeletal muscle emerged as the strongest indicator of hyperuricemia, surpassing both WC and hip circumference; despite this, no connection was found between hyperuricemia and body fat mass. In a study of individuals with polycystic ovary syndrome, Zhang and colleagues [ 55 ] discovered an unfavorable relationship between SUA and the quantity of visceral adipose tissue, whereas no substantial association was identified between hyperuricemia and other adipose tissue compartments, including overall fat, trunk fat, and subcutaneous abdominal fat. These studies provide valuable insights into the varying effects of different adipose tissues on metabolism, particularly highlighting the significance of visceral fat. This notion is further supported by Li et al.'s study [ 15 ], which revealed a significant positive relationship between SUA and visceral fat area, even among individuals who are not obese (BMI < 30 kg/m²).

Diminished lean body mass and elevated visceral fat are both strongly linked to an elevated risk of metabolic diseases, and when the two are present together, there is likely a synergistic effect on metabolic health [ 18 , 19 , 56 ]. Based on the Chinese National Health Survey, He et al. [ 57 ] found that the total body fat to muscle ratio was positively correlated with hyperuricemia and that the higher the ratio, the higher the SUA. Wang et al. [ 58 ] discovered that the prevalence of hyperuricemia in women, when adjusted for BMI, was positively linked to the ratio of visceral fat area to leg muscle mass; however, this association was not observed in men. Additionally, Zhang et al. [ 59 ] examined 5,158 Chinese medical check-up records and found a positive relationship between the ratio of visceral fat area to skeletal muscle mass and cardiometabolic diseases. Our research identified a negative correlation between ln LMI/VFMI and hyperuricemia. This negative association was observed across various subgroups, including stratified analyses based on sex, BMI, WC, activity intensity, and disease states.

Body composition changes gradually with age. Muscle mass and strength reach their maximum levels during early adulthood and tend to decline after middle age [ 60 ], with muscle mass decreasing at about 0.75% per year [ 61 ]. Body fat, on the other hand, tends to increase, resulting in visceral fat accumulation and ectopic fat deposition. A longitudinal study by Koster et al. [ 62 ] involving 2,307 adults aged 70 and above found that increased fat mass was linked to decreased muscle mass, and that surplus fat contributed to a more rapid decline in lean body mass. Skeletal muscle, a critical endocrine organ, contributes to the body's metabolic health by secreting cytokines and peptides that mediate energy metabolism [ 63 ]. The depletion of muscular tissue can lead to severe outcomes, including weakness, disability, and death [ 64 , 65 ]. In contrast, visceral obesity is often accompanied by significant disorders of glucolipid metabolism and marked insulin resistance, which can have various adverse effects on the body, leading to lower renal clearance of uric acid and elevated SUA [ 66 ]. The current investigation revealed that the prevalence of hyperuricemia declined as ln LMI/VFMI increased, a correlation that remained consistent regardless of the varied attributes of the participants. Further examination of the curve fit indicated a nonlinear connection between the index and hyperuricemia, demonstrating a saturation effect. These findings emphasize the significance of preserving the equilibrium between muscle and visceral fat.

Current methods of assessing obesity using BMI and WC may not accurately reflect an individual's obesity status or the distribution of muscle and fat [ 35 ]. Research has shown that relying solely on BMI may overlook cardiometabolic risk in individuals with normal BMI but excessive body fat [ 67 ]. Our study revealed a strong positive correlation of LMI and VFMI with hyperuricemia across different levels of BMI and WC. When considering the combined effects of lean body and visceral fat mass, the association between ln LMI/VFMI and hyperuricemia remained consistent across different BMI and WC strata, with no statistically significant differences between the subgroups (P for interaction all > 0.05). These findings highlight the intricate relationship between muscle mass, adipose tissue, and metabolic health. Neglecting the role of adipose tissue when studying the association between muscle mass and SUA may lead to an incomplete assessment of these variables; in particular, the threshold effect between ln LMI/VFMI and hyperuricemia reinforces this point. Evaluating the ratio of LMI to VFMI could therefore provide valuable insights beyond traditional obesity assessment methods.

Certain limitations exist in the present research. First, this study adopted a cross-sectional design, which prevents the establishment of causal relationships among the variables; cohort studies are needed to determine whether a specific muscle-to-visceral-fat ratio range implies better metabolic health. Second, the NHANES survey measured body composition only in individuals aged up to 59 years, so this study's findings may not apply to people older than 59. Another consideration is the potential impact of SUA-lowering medications on the results. Although participants with gout were excluded in the sensitivity analyses, bias from potential confounders is still possible.

Conclusions

To summarize, the present investigation revealed a positive association of LMI and VFMI with hyperuricemia. Furthermore, we observed a non-linear inverse relationship, with a saturation effect, between ln LMI/VFMI and hyperuricemia. These findings suggest that the LMI/VFMI ratio may offer valuable perspectives beyond evaluating lean mass or visceral fat mass in isolation.

Availability of data and materials

The data for this research can be accessed on the following websites: https://www.cdc.gov/nchs/nhanes/about_nhanes.htm .

Abbreviations

ALT: Alanine aminotransferase
BMI: Body mass index
CVD: Cardiovascular disease
eGFR: Estimated glomerular filtration rate
LMI: Lean body mass index
MET: Metabolic equivalent
MICE: Multiple imputation by chained equations
NHANES: National Health and Nutrition Examination Survey
SD: Standard deviation
SUA: Serum uric acid
TC: Total cholesterol
VFMI: Visceral fat mass index
WC: Waist circumference

Chen-Xu M, Yokose C, Rai SK, Pillinger MH, Choi HK. Contemporary prevalence of gout and hyperuricemia in the United States and Decadal Trends: The National Health and Nutrition Examination Survey, 2007–2016. Arthritis Rheumatol. 2019;71(6):991–9.


Piao W, Zhao L, Yang Y, et al. The prevalence of hyperuricemia and its correlates among adults in China: results from CNHS 2015–2017. Nutrients. 2022;14(19):4095.

Kuwabara M, Fukuuchi T, Aoki Y, et al. Exploring the multifaceted nexus of uric acid and health: a review of recent studies on diverse diseases. Biomolecules. 2023;13(10):1519.


Borghi C, Agnoletti D, Cicero AFG, Lurbe E, Virdis A. Uric acid and hypertension: a review of evidence and future perspectives for the management of cardiovascular risk. Hypertension. 2022;79(9):1927–36.


Han T, Meng X, Shan R, et al. Temporal relationship between hyperuricemia and obesity, and its association with future risk of type 2 diabetes. Int J Obes. 2018;42(7):1336–44.


Perticone M, Maio R, Shehaj E, et al. Sex-related differences for uric acid in the prediction of cardiovascular events in essential hypertension A population prospective study. Cardiovasc Diabetol. 2023;22(1):298.

Crawley WT, Jungels CG, Stenmark KR, Fini MA. U-shaped association of uric acid to overall-cause mortality and its impact on clinical management of hyperuricemia. Redox Biol. 2022;51:102271.

Zhao P, Shi W, Shi Y, et al. Positive association between weight-adjusted-waist index and hyperuricemia in patients with hypertension: The China H-type hypertension registry study. Front Endocrinol (Lausanne). 2022;13:1007557.


Bae J, Park KY, Son S, Huh Y, Nam GE. Associations between obesity parameters and hyperuricemia by sex, age, and diabetes mellitus: a nationwide study in Korea. Obes Res Clin Pract. 2023;17(5):405–10.

Zhou S, Yu Y, Zhang Z, et al. Association of obesity, triglyceride-glucose and its derivatives index with risk of hyperuricemia among college students in Qingdao, China. Front Endocrinol (Lausanne). 2022;13:1001844.

Wang J, Chen S, Zhao J, et al. Association between nutrient patterns and hyperuricemia: mediation analysis involving obesity indicators in the NHANES. BMC Public Health. 2022;22(1):1981.

Liu XZ, Li HH, Huang S, Zhao DB. Association between hyperuricemia and nontraditional adiposity indices. Clin Rheumatol. 2019;38(4):1055–62.

Barone M, Losurdo G, Iannone A, Leandro G, Di Leo A, Trerotoli P. Assessment of body composition: Intrinsic methodological limitations and statistical pitfalls. Nutrition. 2022;102:111736.

Huang X, Jiang X, Wang L, et al. Visceral adipose accumulation increased the risk of hyperuricemia among middle-aged and elderly adults: a population-based study. J Transl Med. 2019;17(1):341.

Li Z, Gao L, Zhong X, Feng G, Huang F, Xia S. Association of visceral fat area and hyperuricemia in non-obese US adults: a cross-sectional study. Nutrients. 2022;14(19):3992.

Chen L, Wu L, Li Q, et al. Hyperuricemia associated with low skeletal muscle in the middle-aged and elderly population in China. Exp Clin Endocrinol Diabetes. 2022;130(8):546–53.

Yu PC, Hsu CC, Lee WJ, et al. Muscle-to-fat ratio identifies functional impairments and cardiometabolic risk and predicts outcomes: biomarkers of sarcopenic obesity. J Cachexia Sarcopenia Muscle. 2022;13(1):368–76.

Shao Y, Li L, Zhong H, Wang X, Hua Y, Zhou X. Anticipated correlation between lean body mass to visceral fat mass ratio and insulin resistance: NHANES 2011–2018. Front Endocrinol (Lausanne). 2023;14:1232896.

Li L, Zhong H, Shao Y, Zhou X, Hua Y, Chen M. Association between lean body mass to visceral fat mass ratio and bone mineral density in United States population: a cross-sectional study. Arch Public Health. 2023;81(1):180.

Zipf G, Chiappa M, Porter KS, et al. National Health and Nutrition Examination Survey: Plan and operations, 1999–2010. National Center for Health Statistics. Vital Health Stat. 2013;1(56).

National Center for Health Statistics. National Health and Nutrition Examination Survey (NHANES): Body Composition Procedures Manual. 2018. https://wwwn.cdc.gov/nchs/data/nhanes/2017-2018/manuals/Body_Composition_Procedures_Manual_2018.pdf . Accessed 22 Feb 2024.

Lagacé JC, Brochu M, Dionne IJ. A counterintuitive perspective for the role of fat-free mass in metabolic health. J Cachexia Sarcopenia Muscle. 2020;11(2):343–7.

National Center for Health Statistics. National Health and Nutrition Examination Survey (NHANES): MEC Laboratory Procedures Manual 2017. https://wwwn.cdc.gov/nchs/data/nhanes/2017-2018/manuals/2017_MEC_Laboratory_Procedures_Manual.pdf . Accessed 22 Feb 2024.

Feig DI, Kang DH, Johnson RJ. Uric acid and cardiovascular risk. N Engl J Med. 2008;359(17):1811–21.

Neogi T. Clinical practice Gout. N Engl J Med. 2011;364(5):443–52.

He H, Pan L, Wang D, et al. Fat-to-muscle ratio is independently associated with hyperuricemia and a reduced estimated glomerular filtration rate in Chinese adults: the China national health survey. Nutrients. 2022;14(19):4193.

Bai R, Ying X, Shen J, et al. The visceral and liver fat are significantly associated with the prevalence of hyperuricemia among middle age and elderly people: a cross-sectional study in Chongqing China. Front Nutr. 2022;9:961792.

Copeland JK, Chao G, Vanderhout S, et al. The impact of migration on the gut metagenome of South Asian Canadians. Gut Microbes. 2021;13(1):1–29.

DeFina LF, Radford NB, Barlow CE, et al. Association of all-cause and cardiovascular mortality with high levels of physical activity and concurrent coronary artery calcification. JAMA Cardiol. 2019;4(2):174–81.

ALHarthi SSY, Natto ZS, Midle JB, Gyurko R, O'Neill R, Steffensen B. Association between time since quitting smoking and periodontitis in former smokers in the National Health and Nutrition Examination Surveys (NHANES) 2009 to 2012. J Periodontol. 2019;90(1):16-25.

Hao H, Chen Y, Xiaojuan J, et al. The association between METS-IR and serum ferritin level in United States Female: a cross-sectional study based on NHANES. Front Med (Lausanne). 2022;9:925344.

Xu JP, Zeng RX, Zhang YZ, et al. Systemic inflammation markers and the prevalence of hypertension: A NHANES cross-sectional study. Hypertens Res. 2023;46(4):1009–19.

Liu B, Liu J, Pan J, Zhao C, Wang Z, Zhang Q. The association of diabetes status and bone mineral density among US adults: evidence from NHANES 2005–2018. BMC Endocr Disord. 2023;23(1):27.



Funding

This study was supported by grants from the Hubei Provincial Department of Education (18D070, 22D092) and the Shiyan Science and Technology Bureau (22Y31).

Author information

Authors and Affiliations

Department of Social Medicine and Health Management, School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, No. 13 Hangkong Road, Wuhan, Hubei, 430030, PR China

Longti Li & Xiaoxv Yin

Innovation Centre of Nursing Research, TaiHe Hospital, Hubei University of Medicine, Shiyan, Hubei, PR China

Longti Li, Ya Shao, Huiqin Zhong, Rong Zhang & Boxiong Gong

Health Management Center, Wudangshan Campus, TaiHe Hospital, Hubei University of Medicine, Shiyan, Hubei, PR China

Ya Shao & Yu Wang


Contributions

LTL and XY proposed the research design. LTL and YS drafted the manuscript under the supervision of XY. YW, HQZ, RZ, and BXG cleaned the data and performed the statistical analysis. All authors contributed to the writing of the paper and approved the final version of the submitted manuscript.

Corresponding author

Correspondence to Xiaoxv Yin.

Ethics declarations

Ethics approval and consent to participate

The National Center for Health Statistics Institutional Review Board approved the survey.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Material 1.
Supplementary Material 2.
Supplementary Material 3.
Supplementary Material 4.
Supplementary Material 5.
Supplementary Material 6.
Supplementary Material 7.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Li, L., Shao, Y., Zhong, H. et al. L-shaped association between lean body mass to visceral fat mass ratio with hyperuricemia: a cross-sectional study. Lipids Health Dis 23, 116 (2024). https://doi.org/10.1186/s12944-024-02111-2


Received: 23 February 2024

Accepted: 16 April 2024

Published: 20 April 2024

DOI: https://doi.org/10.1186/s12944-024-02111-2


Keywords

  • Lean body mass
  • Visceral fat mass
  • Hyperuricemia


