Save 10% on All AnalystPrep 2024 Study Packages with Coupon Code BLOG10 .

  • Payment Plans
  • Product List
  • Partnerships

AnalystPrep

  • Try Free Trial
  • Study Packages
  • Levels I, II & III Lifetime Package
  • Video Lessons
  • Study Notes
  • Practice Questions
  • Levels II & III Lifetime Package
  • About the Exam
  • About your Instructor
  • Part I Study Packages
  • Parts I & II Packages
  • Part I & Part II Lifetime Package
  • Part II Study Packages
  • Exams P & FM Lifetime Package
  • Quantitative Questions
  • Verbal Questions
  • Data Insight Questions
  • Live Tutoring
  • About your Instructors
  • EA Practice Questions
  • Data Sufficiency Questions
  • Integrated Reasoning Questions

Hypothesis Tests and Confidence Intervals in Multiple Regression

Hypothesis Tests and Confidence Intervals in Multiple Regression

After completing this reading you should be able to:

  • Construct, apply, and interpret hypothesis tests and confidence intervals for a single coefficient in a multiple regression.
  • Construct, apply, and interpret joint hypothesis tests and confidence intervals for multiple coefficients in a multiple regression.
  • Interpret the \(F\)-statistic.
  • Interpret tests of a single restriction involving multiple coefficients.
  • Interpret confidence sets for multiple coefficients.
  • Identify examples of omitted variable bias in multiple regressions.
  • Interpret the \({ R }^{ 2 }\) and adjusted \({ R }^{ 2 }\) in a multiple regression.

Hypothesis Tests and Confidence Intervals for a Single Coefficient

This section is about the calculation of the standard error, hypotheses testing, and confidence interval construction for a single regression in a multiple regression equation.

Introduction

In a previous chapter, we looked at simple linear regression where we deal with just one regressor (independent variable). The response (dependent variable) is assumed to be affected by just one independent variable.  M ultiple regression, on the other hand ,  simultaneously considers the influence of multiple explanatory variables on a response variable Y. We may want to establish the confidence interval of one of the independent variables. We may want to evaluate whether any particular independent variable has a significant effect on the dependent variable. Finally, We may also want to establish whether the independent variables as a group have a significant effect on the dependent variable. In this chapter, we delve into ways all this can be achieved.

Hypothesis Tests for a single coefficient

Suppose that we are testing the hypothesis that the true coefficient \({ \beta }_{ j }\) on the \(j\)th regressor takes on some specific value \({ \beta }_{ j,0 }\). Let the alternative hypothesis be two-sided. Therefore, the following is the mathematical expression of the two hypotheses:

$$ { H }_{ 0 }:{ \beta }_{ j }={ \beta }_{ j,0 }\quad vs.\quad { H }_{ 1 }:{ \beta }_{ j }\neq { \beta }_{ j,0 } $$

This expression represents the two-sided alternative. The following are the steps to follow while testing the null hypothesis:

  • Computing the coefficient’s standard error.

multiple linear regression and hypothesis testing

$$ p-value=2\Phi \left( -|{ t }^{ act }| \right) $$

  • Also, the \(t\)-statistic can be compared to the critical value corresponding to the significance level that is desired for the test.

Confidence Intervals for a Single Coefficient

The confidence interval for a regression coefficient in multiple regression is calculated and interpreted the same way as it is in simple linear regression. 

multiple linear regression and hypothesis testing

The t-statistic has n – k – 1 degrees of freedom where k = number of independents

Supposing that an interval contains the true value of \({ \beta }_{ j }\) with a probability of 95%. This is simply the 95% two-sided confidence interval for \({ \beta }_{ j }\). The implication here is that the true value of \({ \beta }_{ j }\) is contained in 95% of all possible randomly drawn variables.

Alternatively, the 95% two-sided confidence interval for \({ \beta }_{ j }\) is the set of values that are impossible to reject when a two-sided hypothesis test of 5% is applied. Therefore, with a large sample size:

$$ 95\%\quad confidence\quad interval\quad for\quad { \beta }_{ j }=\left[ { \hat { \beta } }_{ j }-1.96SE\left( { \hat { \beta } }_{ j } \right) ,{ \hat { \beta } }_{ j }+1.96SE\left( { \hat { \beta } }_{ j } \right) \right] $$

Tests of Joint Hypotheses

In this section, we consider the formulation of the joint hypotheses on multiple regression coefficients. We will further study the application of an \(F\)-statistic in their testing.

Hypotheses Testing on Two or More Coefficients

Joint null hypothesis.

In multiple regression, we canno t test the null hypothesis that all slope coefficients are equal 0 based on t -tests that each individual slope coefficient equals 0. Why? individual t-tests do not account for the effects of interactions among the independent variables.

For this reason, we conduct the F-test which uses the F-statistic .  The F-test tests the null hypothesis that all of the slope coefficients in the multiple regression model are jointly equal to 0, .i.e.,

\(F\)-Statistic

The F-statistic, which is always a one-tailed test , is calculated as:

multiple linear regression and hypothesis testing

To determine whether at least one of the coefficients is statistically significant, the calculated F-statistic is compared with the one-tailed critical F-value, at the appropriate level of significance.

Decision rule:

multiple linear regression and hypothesis testing

Rejection of the null hypothesis at a stated level of significance indicates that at least one of the coefficients is significantly different than zero, i.e, at least one of the independent variables in the regression model makes a significant contribution to the dependent variable.

An analyst runs a regression of monthly value-stock returns on four independent variables over 48 months.

The total sum of squares for the regression is 360, and the sum of squared errors is 120.

Test the null hypothesis at the 5% significance level (95% confidence) that all the four independent variables are equal to zero.

\({ H }_{ 0 }:{ \beta }_{ 1 }=0,{ \beta }_{ 2 }=0,\dots ,{ \beta }_{ 4 }=0 \)

\({ H }_{ 1 }:{ \beta }_{ j }\neq 0\) (at least one j is not equal to zero, j=1,2… k )

ESS = TSS – SSR = 360 – 120 = 240

The calculated test statistic = (ESS/k)/(SSR/(n-k-1))

=(240/4)/(120/43) = 21.5

\({ F }_{ 43 }^{ 4 }\) is approximately 2.44 at 5% significance level.

Decision: Reject H 0 .

Conclusion: at least one of the 4 independents is significantly different than zero.

Omitted Variable Bias in Multiple Regression

This is the bias in the OLS estimator arising when at least one included regressor gets collaborated with an omitted variable. The following conditions must be satisfied for an omitted variable bias to occur:

  • There must be a correlation between at least one of the included regressors and the omitted variable.
  • The dependent variable \(Y\) must be determined by the omitted variable.

Practical Interpretation of the \({ R }^{ 2 }\) and the adjusted \({ R }^{ 2 }\), \({ \bar { R } }^{ 2 }\)

To determine the accuracy within which the OLS regression line fits the data, we apply the coefficient of determination and the regression’s standard error . 

The coefficient of determination, represented by \({ R }^{ 2 }\), is a measure of the “goodness of fit” of the regression. It is interpreted as the percentage of variation in the dependent variable explained by the independent variables

multiple linear regression and hypothesis testing

\({ R }^{ 2 }\) is not a reliable indicator of the explanatory power of a multiple regression model.Why? \({ R }^{ 2 }\) almost always increases as new independent variables are added to the model, even if the marginal contribution of the new variable is not statistically significant. Thus, a high \({ R }^{ 2 }\) may reflect the impact of a large set of independents rather than how well the set explains the dependent.This problem is solved by the use of the adjusted \({ R }^{ 2 }\) (extensively covered in chapter 8)

The following are the factors to watch out when guarding against applying the \({ R }^{ 2 }\) or the \({ \bar { R } }^{ 2 }\):

  • An added variable doesn’t have to be statistically significant just because the \({ R }^{ 2 }\) or the \({ \bar { R } }^{ 2 }\) has increased.
  • It is not always true that the regressors are a true cause of the dependent variable, just because there is a high \({ R }^{ 2 }\) or \({ \bar { R } }^{ 2 }\).
  • It is not necessary that there is no omitted variable bias just because we have a high \({ R }^{ 2 }\) or \({ \bar { R } }^{ 2 }\).
  • It is not necessarily true that we have the most appropriate set of regressors just because we have a high \({ R }^{ 2 }\) or \({ \bar { R } }^{ 2 }\).
  • It is not necessarily true that we have an inappropriate set of regressors just because we have a low \({ R }^{ 2 }\) or \({ \bar { R } }^{ 2 }\).

An economist tests the hypothesis that GDP growth in a certain country can be explained by interest rates and inflation.

Using some 30 observations, the analyst formulates the following regression equation:

$$ GDP growth = { \hat { \beta } }_{ 0 } + { \hat { \beta } }_{ 1 } Interest+ { \hat { \beta } }_{ 2 } Inflation $$

Regression estimates are as follows:

Is the coefficient for interest rates significant at 5%?

  • Since the test statistic < t-critical, we accept H 0 ; the interest rate coefficient is  not   significant at the 5% level.
  • Since the test statistic > t-critical, we reject H 0 ; the interest rate coefficient is not significant at the 5% level.
  • Since the test statistic > t-critical, we reject H 0 ; the interest rate coefficient is significant at the 5% level.
  • Since the test statistic < t-critical, we accept H 1 ; the interest rate coefficient is significant at the 5% level.

The correct answer is  C .

We have GDP growth = 0.10 + 0.20(Int) + 0.15(Inf)

Hypothesis:

$$ { H }_{ 0 }:{ \hat { \beta } }_{ 1 } = 0 \quad vs \quad { H }_{ 1 }:{ \hat { \beta } }_{ 1 }≠0 $$

The test statistic is:

$$ t = \left( \frac { 0.20 – 0 }{ 0.05 } \right)  = 4 $$

The critical value is t (α/2, n-k-1) = t 0.025,27  = 2.052 (which can be found on the t-table).

t-table-25-29

Conclusion : The interest rate coefficient is significant at the 5% level.

Offered by AnalystPrep

multiple linear regression and hypothesis testing

Modeling Cycles: MA, AR, and ARMA Models

Empirical approaches to risk metrics and hedging, foreign exchange risk.

After completing this reading, you should be able to: Calculate a financial institution’s... Read More

Principles for Sound Stress Testing &# ...

After completing this reading, you should be able to: Describe the rationale for... Read More

Options Markets

After completing this reading, you should be able to: Describe the various types... Read More

Regression with a Single Regressor  ...

After completing this reading you should be able to: Calculate and interpret confidence... Read More

Leave a Comment Cancel reply

You must be logged in to post a comment.

Have a language expert improve your writing

Run a free plagiarism check in 10 minutes, generate accurate citations for free.

  • Knowledge Base

Multiple Linear Regression | A Quick Guide (Examples)

Published on February 20, 2020 by Rebecca Bevans . Revised on June 22, 2023.

Regression models are used to describe relationships between variables by fitting a line to the observed data. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.

Multiple linear regression is used to estimate the relationship between  two or more independent variables and one dependent variable . You can use multiple linear regression when you want to know:

  • How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
  • The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).

Table of contents

Assumptions of multiple linear regression, how to perform a multiple linear regression, interpreting the results, presenting the results, other interesting articles, frequently asked questions about multiple linear regression.

Multiple linear regression makes all of the same assumptions as simple linear regression :

Homogeneity of variance (homoscedasticity) : the size of the error in our prediction doesn’t change significantly across the values of the independent variable.

Independence of observations : the observations in the dataset were collected using statistically valid sampling methods , and there are no hidden relationships among variables.

In multiple linear regression, it is possible that some of the independent variables are actually correlated with one another, so it is important to check these before developing the regression model. If two independent variables are too highly correlated (r2 > ~0.6), then only one of them should be used in the regression model.

Normality : The data follows a normal distribution .

Linearity : the line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.

Prevent plagiarism. Run a free check.

Multiple linear regression formula.

The formula for a multiple linear regression is:

y = {\beta_0} + {\beta_1{X_1}} + … + {{\beta_n{X_n}} + {\epsilon}

  • … = do the same for however many independent variables you are testing

B_nX_n

To find the best-fit line for each independent variable, multiple linear regression calculates three things:

  • The regression coefficients that lead to the smallest overall model error.
  • The t statistic of the overall model.
  • The associated p value (how likely it is that the t statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables was true).

It then calculates the t statistic and p value for each regression coefficient in the model.

Multiple linear regression in R

While it is possible to do multiple linear regression by hand, it is much more commonly done via statistical software. We are going to use R for our examples because it is free, powerful, and widely available. Download the sample dataset to try it yourself.

Dataset for multiple linear regression (.csv)

Load the heart.data dataset into your R environment and run the following code:

This code takes the data set heart.data and calculates the effect that the independent variables biking and smoking have on the dependent variable heart disease using the equation for the linear model: lm() .

Learn more by following the full step-by-step guide to linear regression in R .

To view the results of the model, you can use the summary() function:

This function takes the most important parameters from the linear model and puts them into a table that looks like this:

R multiple linear regression summary output

The summary first prints out the formula (‘Call’), then the model residuals (‘Residuals’). If the residuals are roughly centered around zero and with similar spread on either side, as these do ( median 0.03, and min and max around -2 and 2) then the model probably fits the assumption of heteroscedasticity.

Next are the regression coefficients of the model (‘Coefficients’). Row 1 of the coefficients table is labeled (Intercept) – this is the y-intercept of the regression equation. It’s helpful to know the estimated intercept in order to plug it into the regression equation and predict values of the dependent variable:

The most important things to note in this output table are the next two tables – the estimates for the independent variables.

The Estimate column is the estimated effect , also called the regression coefficient or r 2 value. The estimates in the table tell us that for every one percent increase in biking to work there is an associated 0.2 percent decrease in heart disease, and that for every one percent increase in smoking there is an associated .17 percent increase in heart disease.

The Std.error column displays the standard error of the estimate. This number shows how much variation there is around the estimates of the regression coefficient.

The t value column displays the test statistic . Unless otherwise specified, the test statistic used in linear regression is the t value from a two-sided t test . The larger the test statistic, the less likely it is that the results occurred by chance.

The Pr( > | t | ) column shows the p value . This shows how likely the calculated t value would have occurred by chance if the null hypothesis of no effect of the parameter were true.

Because these values are so low ( p < 0.001 in both cases), we can reject the null hypothesis and conclude that both biking to work and smoking both likely influence rates of heart disease.

When reporting your results, include the estimated effect (i.e. the regression coefficient), the standard error of the estimate, and the p value. You should also interpret your numbers to make it clear to your readers what the regression coefficient means.

Visualizing the results in a graph

It can also be helpful to include a graph with your results. Multiple linear regression is somewhat more complicated than simple linear regression, because there are more parameters than will fit on a two-dimensional plot.

However, there are ways to display your results that include the effects of multiple independent variables on the dependent variable, even though only one independent variable can actually be plotted on the x-axis.

Multiple regression in R graph

Here, we have calculated the predicted values of the dependent variable (heart disease) across the full range of observed values for the percentage of people biking to work.

To include the effect of smoking on the independent variable, we calculated these predicted values while holding smoking constant at the minimum, mean , and maximum observed rates of smoking.

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Chi square test of independence
  • Statistical power
  • Descriptive statistics
  • Degrees of freedom
  • Pearson correlation
  • Null hypothesis

Methodology

  • Double-blind study
  • Case-control study
  • Research ethics
  • Data collection
  • Hypothesis testing
  • Structured interviews

Research bias

  • Hawthorne effect
  • Unconscious bias
  • Recall bias
  • Halo effect
  • Self-serving bias
  • Information bias

A regression model is a statistical model that estimates the relationship between one dependent variable and one or more independent variables using a line (or a plane in the case of two or more independent variables).

A regression model can be used when the dependent variable is quantitative, except in the case of logistic regression, where the dependent variable is binary.

Multiple linear regression is a regression model that estimates the relationship between a quantitative dependent variable and two or more independent variables using a straight line.

Linear regression most often uses mean-square error (MSE) to calculate the error of the model. MSE is calculated by:

  • measuring the distance of the observed y-values from the predicted y-values at each value of x;
  • squaring each of these distances;
  • calculating the mean of each of the squared distances.

Linear regression fits a line to the data by finding the regression coefficient that results in the smallest MSE.

Cite this Scribbr article

If you want to cite this source, you can copy and paste the citation or click the “Cite this Scribbr article” button to automatically add the citation to our free Citation Generator.

Bevans, R. (2023, June 22). Multiple Linear Regression | A Quick Guide (Examples). Scribbr. Retrieved April 15, 2024, from https://www.scribbr.com/statistics/multiple-linear-regression/

Is this article helpful?

Rebecca Bevans

Rebecca Bevans

Other students also liked, simple linear regression | an easy introduction & examples, an introduction to t tests | definitions, formula and examples, types of variables in research & statistics | examples, what is your plagiarism score.

logo

Multiple linear regression

Multiple linear regression #.

Fig. 11 Multiple linear regression #

Errors: \(\varepsilon_i \sim N(0,\sigma^2)\quad \text{i.i.d.}\)

Fit: the estimates \(\hat\beta_0\) and \(\hat\beta_1\) are chosen to minimize the residual sum of squares (RSS):

Matrix notation: with \(\beta=(\beta_0,\dots,\beta_p)\) and \({X}\) our usual data matrix with an extra column of ones on the left to account for the intercept, we can write

Multiple linear regression answers several questions #

Is at least one of the variables \(X_i\) useful for predicting the outcome \(Y\) ?

Which subset of the predictors is most important?

How good is a linear model for these data?

Given a set of predictor values, what is a likely value for \(Y\) , and how accurate is this prediction?

The estimates \(\hat\beta\) #

Our goal again is to minimize the RSS: $ \( \begin{aligned} \text{RSS}(\beta) &= \sum_{i=1}^n (y_i -\hat y_i(\beta))^2 \\ & = \sum_{i=1}^n (y_i - \beta_0- \beta_1 x_{i,1}-\dots-\beta_p x_{i,p})^2 \\ &= \|Y-X\beta\|^2_2 \end{aligned} \) $

One can show that this is minimized by the vector \(\hat\beta\) : $ \(\hat\beta = ({X}^T{X})^{-1}{X}^T{y}.\) $

We usually write \(RSS=RSS(\hat{\beta})\) for the minimized RSS.

Which variables are important? #

Consider the hypothesis: \(H_0:\) the last \(q\) predictors have no relation with \(Y\) .

Based on our model: \(H_0:\beta_{p-q+1}=\beta_{p-q+2}=\dots=\beta_p=0.\)

Let \(\text{RSS}_0\) be the minimized residual sum of squares for the model which excludes these variables.

The \(F\) -statistic is defined by: $ \(F = \frac{(\text{RSS}_0-\text{RSS})/q}{\text{RSS}/(n-p-1)}.\) $

Under the null hypothesis (of our model), this has an \(F\) -distribution.

Example: If \(q=p\) , we test whether any of the variables is important. $ \(\text{RSS}_0 = \sum_{i=1}^n(y_i-\overline y)^2 \) $

The \(t\) -statistic associated to the \(i\) th predictor is the square root of the \(F\) -statistic for the null hypothesis which sets only \(\beta_i=0\) .

A low \(p\) -value indicates that the predictor is important.

Warning: If there are many predictors, even under the null hypothesis, some of the \(t\) -tests will have low p-values even when the model has no explanatory power.

How many variables are important? #

When we select a subset of the predictors, we have \(2^p\) choices.

A way to simplify the choice is to define a range of models with an increasing number of variables, then select the best.

Forward selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step.

Backward selection: Starting from the full model, eliminate variables one at a time, choosing the one with the largest p-value at each step.

Mixed selection: Starting from some model, include variables one at a time, minimizing the RSS at each step. If the p-value for some variable goes beyond a threshold, eliminate that variable.

Choosing one model in the range produced is a form of tuning . This tuning can invalidate some of our methods like hypothesis tests and confidence intervals…

How good are the predictions? #

The function predict in R outputs predictions and confidence intervals from a linear model:

Prediction intervals reflect uncertainty on \(\hat\beta\) and the irreducible error \(\varepsilon\) as well.

These functions rely on our linear regression model $ \( Y = X\beta + \epsilon. \) $

Dealing with categorical or qualitative predictors #

For each qualitative predictor, e.g. Region :

Choose a baseline category, e.g. East

For every other category, define a new predictor:

\(X_\text{South}\) is 1 if the person is from the South region and 0 otherwise

\(X_\text{West}\) is 1 if the person is from the West region and 0 otherwise.

The model will be: $ \(Y = \beta_0 + \beta_1 X_1 +\dots +\beta_7 X_7 + \color{Red}{\beta_\text{South}} X_\text{South} + \beta_\text{West} X_\text{West} +\varepsilon.\) $

The parameter \(\color{Red}{\beta_\text{South}}\) is the relative effect on Balance (our \(Y\) ) for being from the South compared to the baseline category (East).

The model fit and predictions are independent of the choice of the baseline category.

However, hypothesis tests derived from these variables are affected by the choice.

Solution: To check whether region is important, use an \(F\) -test for the hypothesis \(\beta_\text{South}=\beta_\text{West}=0\) by dropping Region from the model. This does not depend on the coding.

Note that there are other ways to encode qualitative predictors produce the same fit \(\hat f\) , but the coefficients have different interpretations.

So far, we have:

Defined Multiple Linear Regression

Discussed how to test the importance of variables.

Described one approach to choose a subset of variables.

Explained how to code qualitative variables.

Now, how do we evaluate model fit? Is the linear model any good? What can go wrong?

How good is the fit? #

To assess the fit, we focus on the residuals $ \( e = Y - \hat{Y} \) $

The RSS always decreases as we add more variables.

The residual standard error (RSE) corrects this: $ \(\text{RSE} = \sqrt{\frac{1}{n-p-1}\text{RSS}}.\) $

Fig. 12 Residuals #

Visualizing the residuals can reveal phenomena that are not accounted for by the model; eg. synergies or interactions:

Potential issues in linear regression #

Interactions between predictors

Non-linear relationships

Correlation of error terms

Non-constant variance of error (heteroskedasticity)

High leverage points

Collinearity

Interactions between predictors #

Linear regression has an additive assumption: $ \(\mathtt{sales} = \beta_0 + \beta_1\times\mathtt{tv}+ \beta_2\times\mathtt{radio}+\varepsilon\) $

i.e. An increase of 100 USD dollars in TV ads causes a fixed increase of \(100 \beta_2\) USD in sales on average, regardless of how much you spend on radio ads.

We saw that in Fig 3.5 above. If we visualize the fit and the observed points, we see they are not evenly scattered around the plane. This could be caused by an interaction.

One way to deal with this is to include multiplicative variables in the model:

The interaction variable tv \(\cdot\) radio is high when both tv and radio are high.

R makes it easy to include interaction variables in the model:

Non-linearities #

Fig. 13 A nonlinear fit might be better here. #

Example: Auto dataset.

A scatterplot between a predictor and the response may reveal a non-linear relationship.

Solution: include polynomial terms in the model.

Could use other functions besides polynomials…

Fig. 14 Residuals for Auto data #

In 2 or 3 dimensions, this is easy to visualize. What do we do when we have too many predictors?

Correlation of error terms #

We assumed that the errors for each sample are independent:

What if this breaks down?

The main effect is that this invalidates any assertions about Standard Errors, confidence intervals, and hypothesis tests…

Example : Suppose that by accident, we duplicate the data (we use each sample twice). Then, the standard errors would be artificially smaller by a factor of \(\sqrt{2}\) .

When could this happen in real life:

Time series: Each sample corresponds to a different point in time. The errors for samples that are close in time are correlated.

Spatial data: Each sample corresponds to a different location in space.

Grouped data: Imagine a study on predicting height from weight at birth. If some of the subjects in the study are in the same family, their shared environment could make them deviate from \(f(x)\) in similar ways.

Correlated errors #

Simulations of time series with increasing correlations between \(\varepsilon_i\)

Non-constant variance of error (heteroskedasticity) #

The variance of the error depends on some characteristics of the input features.

To diagnose this, we can plot residuals vs. fitted values:

If the trend in variance is relatively simple, we can transform the response using a logarithm, for example.

Outliers from a model are points with very high errors.

While they may not affect the fit, they might affect our assessment of model quality.

Possible solutions: #

If we believe an outlier is due to an error in data collection, we can remove it.

An outlier might be evidence of a missing predictor, or the need to specify a more complex model.

High leverage points #

Some samples with extreme inputs have an outsized effect on \(\hat \beta\) .

This can be measured with the leverage statistic or self influence :

Studentized residuals #

The residual \(e_i = y_i - \hat y_i\) is an estimate for the noise \(\epsilon_i\) .

The standard error of \(\hat \epsilon_i\) is \(\sigma \sqrt{1-h_{ii}}\) .

A studentized residual is \(\hat \epsilon_i\) divided by its standard error (with appropriate estimate of \(\sigma\) )

When model is correct, it follows a Student-t distribution with \(n-p-2\) degrees of freedom.

Collinearity #

Two predictors are collinear if one explains the other well:

Problem: The coefficients become unidentifiable .

Consider the extreme case of using two identical predictors limit : $ \( \begin{aligned} \mathtt{balance} &= \beta_0 + \beta_1\times\mathtt{limit} + \beta_2\times\mathtt{limit} + \epsilon \\ & = \beta_0 + (\beta_1+100)\times\mathtt{limit} + (\beta_2-100)\times\mathtt{limit} + \epsilon \end{aligned} \) $

For every \((\beta_0,\beta_1,\beta_2)\) the fit at \((\beta_0,\beta_1,\beta_2)\) is just as good as at \((\beta_0,\beta_1+100,\beta_2-100)\) .

If 2 variables are collinear, we can easily diagnose this using their correlation.

A group of \(q\) variables is multilinear if these variables “contain less information” than \(q\) independent variables.

Pairwise correlations may not reveal multilinear variables.

The Variance Inflation Factor (VIF) measures how predictable it is given the other variables, a proxy for how necessary a variable is:

Above, \(R^2_{X_j|X_{-j}}\) is the \(R^2\) statistic for Multiple Linear regression of the predictor \(X_j\) onto the remaining predictors.

Advanced Statistics using R

Applied Data Science Meeting, July 4-6, 2023, Shanghai, China . Register for the workshops: (1) Deep Learning Using R, (2) Introduction to Social Network Analysis, (3) From Latent Class Model to Latent Transition Model Using Mplus, (4) Longitudinal Data Analysis, and (5) Practical Mediation Analysis. Click here for more information .

  • Example Datasets
  • Basics of R
  • Graphs in R

Hypothesis testing

  • Confidence interval
  • Simple Regression
  • Multiple Regression
  • Logistic regression
  • Moderation analysis
  • Mediation analysis
  • Path analysis
  • Factor analysis
  • Multilevel regression
  • Longitudinal data analysis
  • Power analysis

Multiple Linear Regression

The general purpose of multiple regression (the term was first used by Pearson, 1908), as a generalization of simple linear regression, is to learn about how several independent variables or predictors (IVs) together predict a dependent variable (DV). Multiple regression analysis often focuses on understanding (1) how much variance in a DV a set of IVs explain and (2) the relative predictive importance of IVs in predicting a DV.

In the social and natural sciences, multiple regression analysis is very widely used in research. Multiple regression allows a researcher to ask (and hopefully answer) the general question "what is the best predictor of ...". For example, educational researchers might want to learn what the best predictors of success in college are. Psychologists may want to determine which personality dimensions best predicts social adjustment.

Multiple regression model

A general multiple linear regression model at the population level can be written as

\[y_{i}=\beta_{0}+\beta_{1}x_{1i}+\beta_{2}x_{2i}+\ldots+\beta_{k}x_{ki}+\varepsilon_{i} \]

  • $y_{i}$: the observed score of individual $i$ on the DV.
  • $x_{1},x_{2},\ldots,x_{k}$ : a set of predictors.
  • $x_{1i}$: the observed score of individual $i$ on IV 1; $x_{ki}$: observed score of individual $i$ on IV $k$.
  • $\beta_{0}$: the intercept at the population level, representing the predicted $y$ score when all the independent variables have their values at 0.
  • $\beta_{1},\ldots,\beta_{k}$: regression coefficients at the population level; $\beta_{1}$: representing the amount predicted $y$ changes when $x_{1}$ changes in 1 unit while holding the other IVs constant; $\beta_{k}$: representing the amount predicted $y$ changes when $x_{k}$ changes in 1 unit while holding the other IVs constant.
  • $\varepsilon$: unobserved errors with mean 0 and variance $\sigma^{2}$.

Parameter estimation

The least squares method used for the simple linear regression analysis can also be used to estimate the parameters in a multiple regression model. The basic idea is to minimize the sum of squared residuals or errors. Let $b_{0},b_{1},\ldots,b_{k}$ represent the estimated regression coefficients.The individual $i$'s residual $e_{i}$ is the difference between the observed $y_{i}$ and the predicted $y_{i}$

\[ e_{i}=y_{i}-\hat{y}_{i}=y_{i}-b_{0}-b_{1}x_{1i}-\ldots-b_{k}x_{ki}.\]

The sum of squared residuals is

\[ SSE=\sum_{i=1}^{n}e_{i}^{2}=\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}. \]

By minimizing $SSE$, the regression coefficient estimates can be obtained as

\[ \boldsymbol{b}=(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{y}=(\sum\boldsymbol{x}_{i}\boldsymbol{x}_{i}')^{-1}(\sum\boldsymbol{x}_{i}\boldsymbol{y}_{i}). \]

How well the multiple regression model fits the data can be assessed using the $R^{2}$. Its calculation is the same as for the simple regression

\[\begin{align*} R^{2} & = & 1-\frac{\sum e_{i}^{2}}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}\\& = & \frac{\text{Variation explained by IVs}}{\text{Total variation}} \end{align*}. \]

In multiple regression, $R^{2}$ is the total proportion of variation in $y$ explained by the multiple predictors.

The $R^{2}$ increases or at least is the same with the inclusion of more predictors. However, with more predators, the model becomes more complex and potentially more difficult to interpret. In order to take into consideration of the model complexity, the adjusted $R^{2}$ has been defined, which is calculated as

\[aR^{2}=1-(1-R^{2})\frac{n-1}{n-k-1}.\]

Hypothesis testing of regression coefficient(s)

With the estimates of regression coefficients and their standard errors estimates, we can conduct hypothesis testing for one, a subset, or all regression coefficients.

Testing a single regression coefficient

At first, we can test the significance of the coefficient for a single predictor. In this situation, the null and alternative hypotheses are

\[ H_{0}:\beta_{j}=0\text{ vs }H_{1}:\beta_{j}\neq0 \]

with $\beta_{j}$ denoting the regression coefficient of $x_{j}$ at the population level.

As in the simple regression, we use a test statistic

\[ t_{j}=\frac{b_{j} - \beta{j} }{s.e.(b_{j})}\]

where $b_{j}$ is the estimated regression coefficient of $x_{j}$ using data from a sample. If the null hypothesis is true and $\beta_j = 0$, the test statistic follows a t-distribution with degrees of freedom \(n-k-1\) where \(k\) is the number of predictors.

One can also test the significance of \(\beta_j\) by constructing a confidence interval for it. Based on a t distribution, the \(100(1-\alpha)%\) confidence interval is

\[ [b_{j}+t_{n-k-1}(\alpha/2)*s.e.(b_{j}),\;b_{j}+t_{n-k-1}(1-\alpha/2)*s.e.(b_{j})]\]

where $t_{n-k-1}(\alpha/2)$ is the $\alpha/2$ percentile of the t distribution. As previously discussed, if the confidence interval includes 0, the regression coefficient is not statistically significant at the significance level $\alpha$.

Testing all the regression coefficients together (overall model fit)

Given the multiple predictors, we can also test whether all of the regression coefficients are 0 at the same time. This is equivalent to test whether all predictors combined can explained a significant portion of the variance of the outcome variable. Since $R^2$ is a measure of the variance explained, this test is naturally related to it.

For this hypothesis testing, the null and alternative hypothesis are

\[H_{0}:\beta_{1}=\beta_{2}=\ldots=\beta_{k}=0\]

\[H_{1}:\text{ at least one of the regression coefficients is different from 0}.\]

In this kind of test, an F test is used. The F-statistic is defined as

\[F=\frac{n-k-1}{k}\frac{R^{2}}{1-R^{2}}.\]

It follows an F-distribution with degrees of freedom $k$ and $n-k-1$ when the null hypothesis is true. Given an F statistic, its corresponding p-value can be calculated from the F distribution as shown below. Note that we only look at one side of the distribution because the extreme values should be on the large value side.

Testing a subset of the regression coefficients

We can also test whether a subset of $p$ regression coefficients, e.g., $p$ from 1 to the total number coefficients $k$, are equal to zero. For convenience, we can rearrange all the $p$ regression coefficients to be the first $p$ coefficients. Therefore, the null hypothesis should be

\[H_{0}:\beta_{1}=\beta_{2}=\ldots=\beta_{p}=0\]

and the alternative hypothesis is that at least one of them is not equal to 0.

As for testing the overall model fit, an F test can be used here. In this situation, the F statistic can be calculated as

\[F=\frac{n-k-1}{p}\frac{R^{2}-R_{0}^{2}}{1-R^{2}},\]

which follows an F-distribution with degrees of freedom $p$ and $n-k-1$. $R^2$ is for the regression model with all the predictors and $R_0^2$ is from the regression model without the first $p$ predictors $x_{1},x_{2},\ldots,x_{p}$ but with the rest predictors $x_{p+1},x_{p+2},\ldots,x_{k}$.

Intuitively, this test determine whether the variance explained by the first \(p\) predictors above and beyond the $k-p$ predictors is significance or not. That is also the increase in R-squared.

As an example, suppose that we wanted to predict student success in college. Why might we want to do this? There's an ongoing debate in college and university admission offices (and in the courts) regarding what factors should be considered important in deciding which applicants to admit. Should admissions officers pay most attention to more easily quantifiable measures such as high school GPA and SAT scores? Or should they give more weight to more subjective measures such as the quality of letters of recommendation? What are the pros and cons of the approaches? Of course, how we define college success is also an open question. For the sake of this example, let's measure college success using college GPA.

In this example, we use a set of simulated data (generated by us). The data are saved in the file gpa.csv. As shown below, the sample size is 100 and there are 4 variables: college GPA (c.gpa), high school GPA (h.gpa), SAT, and quality of recommendation letters (recommd).

Graph the data

Before fitting a regression model, we should check the relationship between college GPA and each predictor through a scatterplot. A scatterplot can tell us the form of relationship, e.g., linear, nonlinear, or no relationship, the direction of relationship, e.g., positive or negative, and the strength of relationship, e.g., strong, moderate, or weak. It can also identify potential outliers.

The scatterplots between college GPA and the three potential predictors are given below. From the plots, we can roughly see all three predictors are positively related to the college GPA. The relationship is close to linear and the relationship seems to be stronger for high school GPA and SAT than for the quality of recommendation letters.

Descriptive statistics

Next, we can calculate some summary statistics to explore our data further. For each variable, we calculate 6 numbers: minimum, 1st quartile, median, mean, 3rd quartile, and maximum. Those numbers can be obtained using the summary() function. To look at the relationship among the variables, we can calculate the correlation matrix using the correlation function cor() .

Based on the correlation matrix, the correlation between college GPA and high school GPA is about 0.545, which is larger than that (0.523) between college GPA and SAT, in turn larger than that (0.35) between college GPA and quality of recommendation letters.

Fit a multiple regression model

As for the simple linear regression, The multiple regression analysis can be carried out using the lm() function in R. From the output, we can write out the regression model as

\[ c.gpa = -0.153+ 0.376 \times h.gpa + 0.00122 \times SAT + 0.023 \times recommd \]

Interpret the results / output

From the output, we see the intercept is -0.153. Its immediate meaning is that when all predictors' values are 0, the predicted college GPA is -0.15. This clearly does not make much sense because one would never get a negative GPA, which results from the unrealistic presumption that the predictors can take the value of 0.

The regression coefficient for the predictor high school GPA (h.gpa) is 0.376. This can be interpreted as keeping SAT and recommd scores constant , the predicted college GPA would increase 0.376 with a unit increase in high school GPA.This is again might be problematic because it might be impossible to increase high school GPA while keeping the other two predictors unchanged. The other two regression coefficients can be interpreted in the same way.

From the output, we can also see that the multiple R-squared ($R^2$) is 0.3997. Therefore, about 40% of the variation in college GPA can be explained by the multiple linear regression with h.GPA, SAT, and recommd as the predictors. The adjusted $R^2$ is slightly smaller because of the consideration of the number of predictors. In fact,

\[ \begin{eqnarray*} aR^{2} & = & 1-(1-R^{2})\frac{n-1}{n-k-1}\\& = & 1-(1-.3997)\frac{100-1}{100-3-1}\\& = & .3809 \end{eqnarray*} \]

Testing Individual Regression Coefficient

For any regression coefficients for the three predictors (also the intercept), a t test can be conducted. For example, for high school GPA, the estimated coefficient is 0.376 with the standard error 0.114. Therefore, the corresponding t statistic is \(t = 0.376/0.114 = 3.294\). Since the statistic follows a t distribution with the degrees of freedom \(df = n - k - 1 = 100 - 3 -1 =96\), we can obtain the p-value as \(p = 2*(1-pt(3.294, 96))= 0.0013\). Since the p-value is less than 0.05, we conclude the coefficient is statistically significant. Note the t value and p-value are directly provided in the output.

Overall model fit (testing all coefficients together)

To test all coefficients together or the overall model fit, we use the F test. Given the $R^2$, the F statistic is

\[ \begin{eqnarray*} F & = & \frac{n-k-1}{k}\frac{R^{2}}{1-R^{2}}\\ & = & \left(\frac{100-3-1}{3}\right)\times \left(\frac{0.3997}{1-.3997}\right )=21.307\end{eqnarray*} \]

which follows the F distribution with degrees of freedom $df1=k=3$ and $df2=n-k-1=96$. The corresponding p-value is 1.160e-10. Note that this information is directly shown in the output as " F-statistic: 21.31 on 3 and 96 DF, p-value: 1.160e-10 ".

Therefore, at least one of the regression coefficients is statistically significantly different from 0. Overall, the three predictors explained a significant portion of the variance in college GPA. The regression model with the 3 predictors is significantly better than the regression model with intercept only (i.e., predict c.gpa by the mean of c.gpa).

Testing a subset of regression coefficients

Suppose we are interested in testing whether the regression coefficients of high school GPA and SAT together are significant or not. Alternative, we want to see above and beyond the quality of recommendation letters, whether the two predictors can explain a significant portion of variance in college GPA. To conduct the test, we need to fit two models:

  • A full model: which consists of all the predictors to predict c.gpa by intercept, h.gpa, SAT, and recommd.
  • A reduced model: obtained by removing the predictors to be tested in the full model.

From the full model, we can get the $R^2 = 0.3997$ with all three predictors and from the reduced model, we can get the $R_0^2 = 0.1226$ with only quality of recommendation letters. Then the F statistic is constructed as

\[F=\frac{n-k-1}{p}\frac{R^{2}-R_{0}^{2}}{1-R^{2}}=\left(\frac{100-3-1}{2}\right )\times\frac{.3997-.1226}{1-.3997}=22.157.\]

Using the F distribution with the degrees of freedom $p=2$ (the number of coefficients to be tested) and $n-k-1 = 96$, we can get the p-value close to 0 ($p=1.22e-08$).

Note that the test conducted here is based on the comparison of two models. In R, if there are two models, they can be compared conveniently using the R function anova() . As shown below, we obtain the same F statistic and p-value.

To cite the book, use: Zhang, Z. & Wang, L. (2017-2022). Advanced statistics using R . Granger, IN: ISDSA Press. https://doi.org/10.35566/advstats. ISBN: 978-1-946728-01-2. To take the full advantage of the book such as running analysis within your web browser, please subscribe .

Library homepage

  • school Campus Bookshelves
  • menu_book Bookshelves
  • perm_media Learning Objects
  • login Login
  • how_to_reg Request Instructor Account
  • hub Instructor Commons
  • Download Page (PDF)
  • Download Full Book (PDF)
  • Periodic Table
  • Physics Constants
  • Scientific Calculator
  • Reference & Cite
  • Tools expand_more
  • Readability

selected template will load here

This action is not available.

Statistics LibreTexts

12.2.1: Hypothesis Test for Linear Regression

  • Last updated
  • Save as PDF
  • Page ID 34850

  • Rachel Webb
  • Portland State University

To test to see if the slope is significant we will be doing a two-tailed test with hypotheses. The population least squares regression line would be \(y = \beta_{0} + \beta_{1} + \varepsilon\) where \(\beta_{0}\) (pronounced “beta-naught”) is the population \(y\)-intercept, \(\beta_{1}\) (pronounced “beta-one”) is the population slope and \(\varepsilon\) is called the error term.

If the slope were horizontal (equal to zero), the regression line would give the same \(y\)-value for every input of \(x\) and would be of no use. If there is a statistically significant linear relationship then the slope needs to be different from zero. We will only do the two-tailed test, but the same rules for hypothesis testing apply for a one-tailed test.

We will only be using the two-tailed test for a population slope.

The hypotheses are:

\(H_{0}: \beta_{1} = 0\) \(H_{1}: \beta_{1} \neq 0\)

The null hypothesis of a two-tailed test states that there is not a linear relationship between \(x\) and \(y\). The alternative hypothesis of a two-tailed test states that there is a significant linear relationship between \(x\) and \(y\).

Either a t-test or an F-test may be used to see if the slope is significantly different from zero. The population of the variable \(y\) must be normally distributed.

F-Test for Regression

An F-test can be used instead of a t-test. Both tests will yield the same results, so it is a matter of preference and what technology is available. Figure 12-12 is a template for a regression ANOVA table,

Template for a regression table, containing equations for the sum of squares, degrees of freedom and mean square for regression and for error, as well as the F value of the data.

where \(n\) is the number of pairs in the sample and \(p\) is the number of predictor (independent) variables; for now this is just \(p = 1\). Use the F-distribution with degrees of freedom for regression = \(df_{R} = p\), and degrees of freedom for error = \(df_{E} = n - p - 1\). This F-test is always a right-tailed test since ANOVA is testing the variation in the regression model is larger than the variation in the error.

Use an F-test to see if there is a significant relationship between hours studied and grade on the exam. Use \(\alpha\) = 0.05.

T-Test for Regression

If the regression equation has a slope of zero, then every \(x\) value will give the same \(y\) value and the regression equation would be useless for prediction. We should perform a t-test to see if the slope is significantly different from zero before using the regression equation for prediction. The numeric value of t will be the same as the t-test for a correlation. The two test statistic formulas are algebraically equal; however, the formulas are different and we use a different parameter in the hypotheses.

The formula for the t-test statistic is \(t = \frac{b_{1}}{\sqrt{ \left(\frac{MSE}{SS_{xx}}\right) }}\)

Use the t-distribution with degrees of freedom equal to \(n - p - 1\).

The t-test for slope has the same hypotheses as the F-test:

Use a t-test to see if there is a significant relationship between hours studied and grade on the exam, use \(\alpha\) = 0.05.

Multiple Linear Regression

We hope you enjoyed this lesson.

Get the Hypothesis Testing course for more great video tutorials.

Start free trial

Cool lesson, huh? Share it with your friends

Facebook Twitter LinkedIn WhatsApp Email

  • Lesson resources Resources
  • Quick reference Reference

About this lesson

Many times there are multiple factors that are influencing the response variable in a problem. Multiple regression determines the relationship between the response factor and multiple control factors. Like with simple linear regression, a formula is created that allows both analysis and prediction of the process and problem.

Exercise files

Download this lesson’s related exercise files.

Quick reference

Multiple linear regression analysis is the creation of an equation with multiple independent X variables that all influence a Y response variable.  This equation is based upon an existing data set and models the conditions represented in the data.

When to use

When there are multiple independent variables that correlate with the system response, a multiple linear regression should be done.  This can be used to predict process performance and identify which factors have the primary impact on process performance.

Instructions

Multiple linear regression is the appropriate technique to use when the data set has multiple continuous independent input variables and a continuous response variable.  The technique determines which variables are statistically significant and creates an equation that shows the relationship of the variables to the response.  To improve the accuracy of the analysis, there should be at least ten data points for each independent variable.  The equation takes on the form:

Y = a + b 1 X 1 + b 2 X 2 + b 3 X 3 + …

Where the absolute value of the “b” coefficients shows the relative importance of each variable.

Multiple linear regression can be used to predict process performance based on the values of the inputs.  Input levels for ideal performance can be defined and tolerance levels that ensure acceptable performance can be determined using the regression equation.  The equation will also be helpful for setting process controls.

Excel does not have a multiple linear regression function.  The analysis can be done in Minitab using the “Fit Regression Model” option in the Regression menu.  This will display an input panel where the response variable and input variables can be selected.  If the analysis shows a variable is not statistically significant, check the residual plots to see if the result is normal.  If not, remove the variable that is not statistically significant and rerun the analysis.  The normality of the residuals should be improved.

Hints & tips

  • Too many variables increase uncertainty in the analysis.  There should be at least ten data points for each variable (e.g. if using three variables have at least 30 data points). 
  • Drop variables that are not statistically significant to improve the accuracy of the equation.
  • The analysis assumes a linear (straight line) effect.  If the residuals indicate a bad fit, you will need to add higher-order terms and create a non-linear analysis.  This is discussed in another lesson.
  • Always check the residual analysis to ensure it is normally distributed with equal variance and indicates independence.
  • 00:04 Hi, I'm Ray Sheen.
  • 00:05 Sometimes the dependent or
  • 00:07 response variable in the analysis depends upon more than one factor.
  • 00:12 When that happens, you may need to do a multi-linear regression analysis.
  • 00:18 >> Once again, I'll start with our decision tree for hypothesis testing.
  • 00:22 When we have continuous variables for the process response and
  • 00:26 the process independent variables, we turn to the regression analysis, and when we
  • 00:30 have multiple variables at the same time, we use the multiple regression technique.
  • 00:36 Let's take a few minutes to explain what we mean by multiple regression analysis.
  • 00:42 Recall that regression analysis determines the relationship between process
  • 00:45 variables.
  • 00:47 And it's no surprise that multiple regression considers multiple independent
  • 00:51 variables instead of just one.
  • 00:53 The analysis determines the impact of each of these independent
  • 00:57 variables upon the dependent variable.
  • 01:00 If you tried a simple linear analysis and it wasn't a good fit,
  • 01:03 you can consider adding some additional terms.
  • 01:07 Now, regardless of the number of terms, the format for
  • 01:09 the hypothesis tests are still the same as a simple linear regression.
  • 01:13 The null hypothesis is that there is no relationship, and
  • 01:17 the alternative hypothesis is that there is a relationship.
  • 01:21 The analysis will determine the relative significance of each of the factors to
  • 01:25 each other in addition to the dependent factor.
  • 01:28 It will show up with coefficients of these factors.
  • 01:31 The form of the multiple linear equation is a dependent variable y is equal to
  • 01:36 a constant indicated by a in this equation plus a term with each of the variables,
  • 01:41 which has a coefficient associated with it.
  • 01:44 So it's beta 1 times x1 + beta 2 times x 2 + beta 3 times x 3 and so on.
  • 01:53 Multiple regression analysis is particularly useful for
  • 01:56 predicting process performance.
  • 01:59 The multiple regression analysis will result in an equation that relates all
  • 02:03 the independent variables to the dependent variable.
  • 02:07 This analysis provides the terms and the coefficients.
  • 02:12 This equation is incredibly helpful when you're designing the solution for
  • 02:15 a problem in a Lean Six Sigma Project.
  • 02:18 The equation predicts the dependent variable performance based upon whatever
  • 02:22 values you've selected for the independent variables.
  • 02:25 So, as you're designing the solution, you may want to design one of your independent
  • 02:30 variables to be in a particular zone that's well controlled.
  • 02:34 This will then help you determine how with the other variables you can achieve
  • 02:38 the desired performance.
  • 02:40 Based upon the scaling constants for each of the factors,
  • 02:44 you can also decide which factors will have the primary control for the process.
  • 02:49 I prefer to use one easily controlled independent factor to control
  • 02:53 an overall process, and if possible to set the other factors and
  • 02:58 zones that are very easy to lock into a standard setting.
  • 03:02 You can't always do that, but if you can, it makes process control much easier.
  • 03:07 So let's look at how we conduct a multiple regression analysis.
  • 03:11 Excel does not have a function for
  • 03:13 conducting multiple linear regression analysis.
  • 03:16 So we will have to rely on Minitab.
  • 03:19 In Minitab, go to the stat pulldown menu, select Regression, select
  • 03:24 Regression again, and then select Fit Regression Model just like is shown here.
  • 03:30 That will bring up this panel.
  • 03:32 Place your cursor in the response window to activate the list of
  • 03:36 data values in the window on the left ,then select the dependent variable
  • 03:41 often referred to as the y factor and click on the selection button.
  • 03:46 The column should move to the response window.
  • 03:48 Now, place your cursor in the continuous predictors and
  • 03:52 then select the appropriate column.
  • 03:54 You can also use categorical or discrete factors if you have them.
  • 03:58 However, if you're using these factors I always recommend you use factors that
  • 04:02 are bimodal such as true, false.
  • 04:04 Set one of those to a value of 1 and the other to a value of 0.
  • 04:08 And one more point, you get the residual plots by selecting the graph button and
  • 04:13 then choosing residual 4 and 1.
  • 04:17 Let's finish off this topic with a few warnings and
  • 04:20 some pitfalls when doing multiple regression analysis.
  • 04:24 This analysis still assumes linear effect, which means straight line effects for
  • 04:28 each of the independent variables.
  • 04:31 We'll look at interactive effects when we look at nonlinear regression in
  • 04:34 another lesson.
  • 04:36 Adding lots of independent variables can increase uncertainty.
  • 04:40 If you find that some factor has virtually no effect,
  • 04:43 then I recommend removing it from your analysis and simplify things.
  • 04:48 Check your residual plots to make sure the residuals are normally distributed.
  • 04:53 This is another indication that you have a good solution.
  • 04:56 And finally, too many factors creates too many potential interactions,
  • 05:00 and it becomes difficult to statistically validate the effect of each independent
  • 05:04 variable.
  • 05:05 A good rule of thumb is that your data set size should have at least 10 times,
  • 05:10 the number of independent factors being analyzed.
  • 05:14 So if you want to analyze four factors at once,
  • 05:17 the data set needs to have a minimum of 40 points.
  • 05:20 Also, when there are many independent factors in the analysis,
  • 05:24 the regression formula becomes much more sensitive to outliers.
  • 05:29 >> In many cases,
  • 05:30 the multiple linear regression analysis is just what is needed to understand
  • 05:34 how the handful of independent variables affects the overall process output.
  • 05:40 The formula created is extremely helpful when
  • 05:44 determining the optimal solution for your problem.

Lesson notes are only available for subscribers.

PMI, PMP, CAPM and PMBOK are registered marks of the Project Management Institute, Inc.

How is your GoSkills experience?

Your feedback has been sent

© 2024 GoSkills Ltd. Skills for career advancement

Multiple Regression Analysis using SPSS Statistics

Introduction.

Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes, the outcome, target or criterion variable). The variables we are using to predict the value of the dependent variable are called the independent variables (or sometimes, the predictor, explanatory or regressor variables).

For example, you could use multiple regression to understand whether exam performance can be predicted based on revision time, test anxiety, lecture attendance and gender. Alternately, you could use multiple regression to understand whether daily cigarette consumption can be predicted based on smoking duration, age when started smoking, smoker type, income and gender.

Multiple regression also allows you to determine the overall fit (variance explained) of the model and the relative contribution of each of the predictors to the total variance explained. For example, you might want to know how much of the variation in exam performance can be explained by revision time, test anxiety, lecture attendance and gender "as a whole", but also the "relative contribution" of each independent variable in explaining the variance.

This "quick start" guide shows you how to carry out multiple regression using SPSS Statistics, as well as interpret and report the results from this test. However, before we introduce you to this procedure, you need to understand the different assumptions that your data must meet in order for multiple regression to give you a valid result. We discuss these assumptions next.

SPSS Statistics

Assumptions.

When you choose to analyse your data using multiple regression, part of the process involves checking to make sure that the data you want to analyse can actually be analysed using multiple regression. You need to do this because it is only appropriate to use multiple regression if your data "passes" eight assumptions that are required for multiple regression to give you a valid result. In practice, checking for these eight assumptions just adds a little bit more time to your analysis, requiring you to click a few more buttons in SPSS Statistics when performing your analysis, as well as think a little bit more about your data, but it is not a difficult task.

Before we introduce you to these eight assumptions, do not be surprised if, when analysing your own data using SPSS Statistics, one or more of these assumptions is violated (i.e., not met). This is not uncommon when working with real-world data rather than textbook examples, which often only show you how to carry out multiple regression when everything goes well! However, don’t worry. Even when your data fails certain assumptions, there is often a solution to overcome this. First, let's take a look at these eight assumptions:

  • Assumption #1: Your dependent variable should be measured on a continuous scale (i.e., it is either an interval or ratio variable). Examples of variables that meet this criterion include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth. You can learn more about interval and ratio variables in our article: Types of Variable . If your dependent variable was measured on an ordinal scale, you will need to carry out ordinal regression rather than multiple regression. Examples of ordinal variables include Likert items (e.g., a 7-point scale from "strongly agree" through to "strongly disagree"), amongst other ways of ranking categories (e.g., a 3-point scale explaining how much a customer liked a product, ranging from "Not very much" to "Yes, a lot").
  • Assumption #2: You have two or more independent variables , which can be either continuous (i.e., an interval or ratio variable) or categorical (i.e., an ordinal or nominal variable). For examples of continuous and ordinal variables , see the bullet above. Examples of nominal variables include gender (e.g., 2 groups: male and female), ethnicity (e.g., 3 groups: Caucasian, African American and Hispanic), physical activity level (e.g., 4 groups: sedentary, low, moderate and high), profession (e.g., 5 groups: surgeon, doctor, nurse, dentist, therapist), and so forth. Again, you can learn more about variables in our article: Types of Variable . If one of your independent variables is dichotomous and considered a moderating variable, you might need to run a Dichotomous moderator analysis .
  • Assumption #3: You should have independence of observations (i.e., independence of residuals ), which you can easily check using the Durbin-Watson statistic, which is a simple test to run using SPSS Statistics. We explain how to interpret the result of the Durbin-Watson statistic, as well as showing you the SPSS Statistics procedure required, in our enhanced multiple regression guide.
  • Assumption #4: There needs to be a linear relationship between (a) the dependent variable and each of your independent variables, and (b) the dependent variable and the independent variables collectively . Whilst there are a number of ways to check for these linear relationships, we suggest creating scatterplots and partial regression plots using SPSS Statistics, and then visually inspecting these scatterplots and partial regression plots to check for linearity. If the relationship displayed in your scatterplots and partial regression plots are not linear, you will have to either run a non-linear regression analysis or "transform" your data, which you can do using SPSS Statistics. In our enhanced multiple regression guide, we show you how to: (a) create scatterplots and partial regression plots to check for linearity when carrying out multiple regression using SPSS Statistics; (b) interpret different scatterplot and partial regression plot results; and (c) transform your data using SPSS Statistics if you do not have linear relationships between your variables.
  • Assumption #5: Your data needs to show homoscedasticity , which is where the variances along the line of best fit remain similar as you move along the line. We explain more about what this means and how to assess the homoscedasticity of your data in our enhanced multiple regression guide. When you analyse your own data, you will need to plot the studentized residuals against the unstandardized predicted values. In our enhanced multiple regression guide, we explain: (a) how to test for homoscedasticity using SPSS Statistics; (b) some of the things you will need to consider when interpreting your data; and (c) possible ways to continue with your analysis if your data fails to meet this assumption.
  • Assumption #6: Your data must not show multicollinearity , which occurs when you have two or more independent variables that are highly correlated with each other. This leads to problems with understanding which independent variable contributes to the variance explained in the dependent variable, as well as technical issues in calculating a multiple regression model. Therefore, in our enhanced multiple regression guide, we show you: (a) how to use SPSS Statistics to detect for multicollinearity through an inspection of correlation coefficients and Tolerance/VIF values; and (b) how to interpret these correlation coefficients and Tolerance/VIF values so that you can determine whether your data meets or violates this assumption.
  • Assumption #7: There should be no significant outliers , high leverage points or highly influential points . Outliers, leverage and influential points are different terms used to represent observations in your data set that are in some way unusual when you wish to perform a multiple regression analysis. These different classifications of unusual points reflect the different impact they have on the regression line. An observation can be classified as more than one type of unusual point. However, all these points can have a very negative effect on the regression equation that is used to predict the value of the dependent variable based on the independent variables. This can change the output that SPSS Statistics produces and reduce the predictive accuracy of your results as well as the statistical significance. Fortunately, when using SPSS Statistics to run multiple regression on your data, you can detect possible outliers, high leverage points and highly influential points. In our enhanced multiple regression guide, we: (a) show you how to detect outliers using "casewise diagnostics" and "studentized deleted residuals", which you can do using SPSS Statistics, and discuss some of the options you have in order to deal with outliers; (b) check for leverage points using SPSS Statistics and discuss what you should do if you have any; and (c) check for influential points in SPSS Statistics using a measure of influence known as Cook's Distance, before presenting some practical approaches in SPSS Statistics to deal with any influential points you might have.
  • Assumption #8: Finally, you need to check that the residuals (errors) are approximately normally distributed (we explain these terms in our enhanced multiple regression guide). Two common methods to check this assumption include using: (a) a histogram (with a superimposed normal curve) and a Normal P-P Plot; or (b) a Normal Q-Q Plot of the studentized residuals. Again, in our enhanced multiple regression guide, we: (a) show you how to check this assumption using SPSS Statistics, whether you use a histogram (with superimposed normal curve) and Normal P-P Plot, or Normal Q-Q Plot; (b) explain how to interpret these diagrams; and (c) provide a possible solution if your data fails to meet this assumption.

You can check assumptions #3, #4, #5, #6, #7 and #8 using SPSS Statistics. Assumptions #1 and #2 should be checked first, before moving onto assumptions #3, #4, #5, #6, #7 and #8. Just remember that if you do not run the statistical tests on these assumptions correctly, the results you get when running multiple regression might not be valid. This is why we dedicate a number of sections of our enhanced multiple regression guide to help you get this right. You can find out about our enhanced content as a whole on our Features: Overview page, or more specifically, learn how we help with testing assumptions on our Features: Assumptions page.

In the section, Procedure , we illustrate the SPSS Statistics procedure to perform a multiple regression assuming that no assumptions have been violated. First, we introduce the example that is used in this guide.

A health researcher wants to be able to predict "VO 2 max", an indicator of fitness and health. Normally, to perform this procedure requires expensive laboratory equipment and necessitates that an individual exercise to their maximum (i.e., until they can longer continue exercising due to physical exhaustion). This can put off those individuals who are not very active/fit and those individuals who might be at higher risk of ill health (e.g., older unfit subjects). For these reasons, it has been desirable to find a way of predicting an individual's VO 2 max based on attributes that can be measured more easily and cheaply. To this end, a researcher recruited 100 participants to perform a maximum VO 2 max test, but also recorded their "age", "weight", "heart rate" and "gender". Heart rate is the average of the last 5 minutes of a 20 minute, much easier, lower workload cycling test. The researcher's goal is to be able to predict VO 2 max based on these four attributes: age, weight, heart rate and gender.

Setup in SPSS Statistics

In SPSS Statistics, we created six variables: (1) VO 2 max , which is the maximal aerobic capacity; (2) age , which is the participant's age; (3) weight , which is the participant's weight (technically, it is their 'mass'); (4) heart_rate , which is the participant's heart rate; (5) gender , which is the participant's gender; and (6) caseno , which is the case number. The caseno variable is used to make it easy for you to eliminate cases (e.g., "significant outliers", "high leverage points" and "highly influential points") that you have identified when checking for assumptions. In our enhanced multiple regression guide, we show you how to correctly enter data in SPSS Statistics to run a multiple regression when you are also checking for assumptions. You can learn about our enhanced data setup content on our Features: Data Setup page. Alternately, see our generic, "quick start" guide: Entering Data in SPSS Statistics .

Test Procedure in SPSS Statistics

The seven steps below show you how to analyse your data using multiple regression in SPSS Statistics when none of the eight assumptions in the previous section, Assumptions , have been violated. At the end of these seven steps, we show you how to interpret the results from your multiple regression. If you are looking for help to make sure your data meets assumptions #3, #4, #5, #6, #7 and #8, which are required when using multiple regression and can be tested using SPSS Statistics, you can learn more in our enhanced guide (see our Features: Overview page to learn more).

Note: The procedure that follows is identical for SPSS Statistics versions 18 to 28 , as well as the subscription version of SPSS Statistics, with version 28 and the subscription version being the latest versions of SPSS Statistics. However, in version 27 and the subscription version , SPSS Statistics introduced a new look to their interface called " SPSS Light ", replacing the previous look for versions 26 and earlier versions , which was called " SPSS Standard ". Therefore, if you have SPSS Statistics versions 27 or 28 (or the subscription version of SPSS Statistics), the images that follow will be light grey rather than blue. However, the procedure is identical .

Menu for a multiple regression analysis in SPSS Statistics

Published with written permission from SPSS Statistics, IBM Corporation.

Note: Don't worry that you're selecting A nalyze > R egression > L inear... on the main menu or that the dialogue boxes in the steps that follow have the title, Linear Regression . You have not made a mistake. You are in the correct place to carry out the multiple regression procedure. This is just the title that SPSS Statistics gives, even when running a multiple regression procedure.

'Linear Regression' dialogue box for a multiple regression analysis in SPSS Statistics. All variables on the left

Interpreting and Reporting the Output of Multiple Regression Analysis

SPSS Statistics will generate quite a few tables of output for a multiple regression analysis. In this section, we show you only the three main tables required to understand your results from the multiple regression procedure, assuming that no assumptions have been violated. A complete explanation of the output you have to interpret when checking your data for the eight assumptions required to carry out multiple regression is provided in our enhanced guide. This includes relevant scatterplots and partial regression plots, histogram (with superimposed normal curve), Normal P-P Plot and Normal Q-Q Plot, correlation coefficients and Tolerance/VIF values, casewise diagnostics and studentized deleted residuals.

However, in this "quick start" guide, we focus only on the three main tables you need to understand your multiple regression results, assuming that your data has already met the eight assumptions required for multiple regression to give you a valid result:

Determining how well the model fits

The first table of interest is the Model Summary table. This table provides the R , R 2 , adjusted R 2 , and the standard error of the estimate, which can be used to determine how well a regression model fits the data:

'Model Summary' table for a multiple regression analysis in SPSS. 'R', 'R Square' & 'Adjusted R Square' highlighted

The " R " column represents the value of R , the multiple correlation coefficient . R can be considered to be one measure of the quality of the prediction of the dependent variable; in this case, VO 2 max . A value of 0.760, in this example, indicates a good level of prediction. The " R Square " column represents the R 2 value (also called the coefficient of determination), which is the proportion of variance in the dependent variable that can be explained by the independent variables (technically, it is the proportion of variation accounted for by the regression model above and beyond the mean model). You can see from our value of 0.577 that our independent variables explain 57.7% of the variability of our dependent variable, VO 2 max . However, you also need to be able to interpret " Adjusted R Square " ( adj. R 2 ) to accurately report your data. We explain the reasons for this, as well as the output, in our enhanced multiple regression guide.

Statistical significance

The F -ratio in the ANOVA table (see below) tests whether the overall regression model is a good fit for the data. The table shows that the independent variables statistically significantly predict the dependent variable, F (4, 95) = 32.393, p < .0005 (i.e., the regression model is a good fit of the data).

'ANOVA' table for a multiple regression analysis in SPSS Statistics. 'df', 'F' & 'Sig.' highlighted

Estimated model coefficients

The general form of the equation to predict VO 2 max from age , weight , heart_rate , gender , is:

predicted VO 2 max = 87.83 – (0.165 x age ) – (0.385 x weight ) – (0.118 x heart_rate ) + (13.208 x gender )

This is obtained from the Coefficients table, as shown below:

'Coefficients' table for a multiple regression analysis in SPSS Statistics. 'Unstandardized Coefficients B' highlighted

Unstandardized coefficients indicate how much the dependent variable varies with an independent variable when all other independent variables are held constant. Consider the effect of age in this example. The unstandardized coefficient, B 1 , for age is equal to -0.165 (see Coefficients table). This means that for each one year increase in age, there is a decrease in VO 2 max of 0.165 ml/min/kg.

Statistical significance of the independent variables

You can test for the statistical significance of each of the independent variables. This tests whether the unstandardized (or standardized) coefficients are equal to 0 (zero) in the population. If p < .05, you can conclude that the coefficients are statistically significantly different to 0 (zero). The t -value and corresponding p -value are located in the " t " and " Sig. " columns, respectively, as highlighted below:

'Coefficients' table for a multiple regression analysis in SPSS Statistics. 't' & 'Sig.' highlighted

You can see from the " Sig. " column that all independent variable coefficients are statistically significantly different from 0 (zero). Although the intercept, B 0 , is tested for statistical significance, this is rarely an important or interesting finding.

Putting it all together

You could write up the results as follows:

A multiple regression was run to predict VO 2 max from gender, age, weight and heart rate. These variables statistically significantly predicted VO 2 max, F (4, 95) = 32.393, p < .0005, R 2 = .577. All four variables added statistically significantly to the prediction, p < .05.

If you are unsure how to interpret regression equations or how to use them to make predictions, we discuss this in our enhanced multiple regression guide. We also show you how to write up the results from your assumptions tests and multiple regression output if you need to report this in a dissertation/thesis, assignment or research report. We do this using the Harvard and APA styles. You can learn more about our enhanced content on our Features: Overview page.

User Preferences

Content preview.

Arcu felis bibendum ut tristique et egestas quis:

  • Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris
  • Duis aute irure dolor in reprehenderit in voluptate
  • Excepteur sint occaecat cupidatat non proident

Keyboard Shortcuts

6.4 - the hypothesis tests for the slopes.

At the beginning of this lesson, we translated three different research questions pertaining to heart attacks in rabbits ( Cool Hearts dataset ) into three sets of hypotheses we can test using the general linear F -statistic. The research questions and their corresponding hypotheses are:

Hypotheses 1

Is the regression model containing at least one predictor useful in predicting the size of the infarct?

  • \(H_{0} \colon \beta_{1} = \beta_{2} = \beta_{3} = 0\)
  • \(H_{A} \colon\) At least one \(\beta_{j} ≠ 0\) (for j = 1, 2, 3)

Hypotheses 2

Is the size of the infarct significantly (linearly) related to the area of the region at risk?

  • \(H_{0} \colon \beta_{1} = 0 \)
  • \(H_{A} \colon \beta_{1} \ne 0 \)

Hypotheses 3

(Primary research question) Is the size of the infarct area significantly (linearly) related to the type of treatment upon controlling for the size of the region at risk for infarction?

  • \(H_{0} \colon \beta_{2} = \beta_{3} = 0\)
  • \(H_{A} \colon \) At least one \(\beta_{j} ≠ 0\) (for j = 2, 3)

Let's test each of the hypotheses now using the general linear F -statistic:

\(F^*=\left(\dfrac{SSE(R)-SSE(F)}{df_R-df_F}\right) \div \left(\dfrac{SSE(F)}{df_F}\right)\)

To calculate the F -statistic for each test, we first determine the error sum of squares for the reduced and full models — SSE ( R ) and SSE ( F ), respectively. The number of error degrees of freedom associated with the reduced and full models — \(df_{R}\) and \(df_{F}\), respectively — is the number of observations, n , minus the number of parameters, p , in the model. That is, in general, the number of error degrees of freedom is n - p . We use statistical software, such as Minitab's F -distribution probability calculator, to determine the P -value for each test.

Testing all slope parameters equal 0 Section  

To answer the research question: "Is the regression model containing at least one predictor useful in predicting the size of the infarct?" To do so, we test the hypotheses:

  • \(H_{0} \colon \beta_{1} = \beta_{2} = \beta_{3} = 0 \)
  • \(H_{A} \colon\) At least one \(\beta_{j} \ne 0 \) (for j = 1, 2, 3)

The full model

The full model is the largest possible model — that is, the model containing all of the possible predictors. In this case, the full model is:

\(y_i=(\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3})+\epsilon_i\)

The error sum of squares for the full model, SSE ( F ), is just the usual error sum of squares, SSE , that appears in the analysis of variance table. Because there are 4 parameters in the full model, the number of error degrees of freedom associated with the full model is \(df_{F} = n - 4\).

The reduced model

The reduced model is the model that the null hypothesis describes. Because the null hypothesis sets each of the slope parameters in the full model equal to 0, the reduced model is:

\(y_i=\beta_0+\epsilon_i\)

The reduced model suggests that none of the variations in the response y is explained by any of the predictors. Therefore, the error sum of squares for the reduced model, SSE ( R ), is just the total sum of squares, SSTO , that appears in the analysis of variance table. Because there is only one parameter in the reduced model, the number of error degrees of freedom associated with the reduced model is \(df_{R} = n - 1 \).

Upon plugging in the above quantities, the general linear F -statistic:

\(F^*=\dfrac{SSE(R)-SSE(F)}{df_R-df_F} \div \dfrac{SSE(F)}{df_F}\)

becomes the usual " overall F -test ":

\(F^*=\dfrac{SSR}{3} \div \dfrac{SSE}{n-4}=\dfrac{MSR}{MSE}\)

That is, to test \(H_{0}\) : \(\beta_{1} = \beta_{2} = \beta_{3} = 0 \), we just use the overall F -test and P -value reported in the analysis of variance table:

Analysis of Variance

Regression equation.

Inf = - 0.135 + 0.613 Area - 0.2435 X2 - 0.0657 X3

There is sufficient evidence ( F = 16.43, P < 0.001) to conclude that at least one of the slope parameters is not equal to 0.

In general, to test that all of the slope parameters in a multiple linear regression model are 0, we use the overall F -test reported in the analysis of variance table.

Testing one slope parameter is 0 Section  

Now let's answer the second research question: "Is the size of the infarct significantly (linearly) related to the area of the region at risk?" To do so, we test the hypotheses:

Again, the full model is the model containing all of the possible predictors:

The error sum of squares for the full model, SSE ( F ), is just the usual error sum of squares, SSE . Alternatively, because the three predictors in the model are \(x_{1}\), \(x_{2}\), and \(x_{3}\), we can denote the error sum of squares as SSE (\(x_{1}\), \(x_{2}\), \(x_{3}\)). Again, because there are 4 parameters in the model, the number of error degrees of freedom associated with the full model is \(df_{F} = n - 4 \).

Because the null hypothesis sets the first slope parameter, \(\beta_{1}\), equal to 0, the reduced model is:

\(y_i=(\beta_0+\beta_2x_{i2}+\beta_3x_{i3})+\epsilon_i\)

Because the two predictors in the model are \(x_{2}\) and \(x_{3}\), we denote the error sum of squares as SSE (\(x_{2}\), \(x_{3}\)). Because there are 3 parameters in the model, the number of error degrees of freedom associated with the reduced model is \(df_{R} = n - 3\).

The general linear statistic:

simplifies to:

\(F^*=\dfrac{SSR(x_1|x_2, x_3)}{1}\div \dfrac{SSE(x_1,x_2, x_3)}{n-4}=\dfrac{MSR(x_1|x_2, x_3)}{MSE(x_1,x_2, x_3)}\)

Getting the numbers from the Minitab output:

we determine that the value of the F -statistic is:

\(F^* = \dfrac{SSR(x_1 \vert x_2, x_3)}{1} \div \dfrac{SSE(x_1, x_2, x_3)}{28} = \dfrac{0.63742}{0.01946}=32.7554\)

The P -value is the probability — if the null hypothesis were true — that we would get an F -statistic larger than 32.7554. Comparing our F -statistic to an F -distribution with 1 numerator degree of freedom and 28 denominator degrees of freedom, Minitab tells us that the probability is close to 1 that we would observe an F -statistic smaller than 32.7554:

F distribution with 1 DF in Numerator and 28 DF in denominator

Therefore, the probability that we would get an F -statistic larger than 32.7554 is close to 0. That is, the P -value is < 0.001. There is sufficient evidence ( F = 32.8, P < 0.001) to conclude that the size of the infarct is significantly related to the size of the area at risk after the other predictors x2 and x3 have been taken into account.

But wait a second! Have you been wondering why we couldn't just use the slope's t -statistic to test that the slope parameter, \(\beta_{1}\), is 0? We can! Notice that the P -value ( P < 0.001) for the t -test ( t * = 5.72):

Coefficients

is the same as the P -value we obtained for the F -test. This will always be the case when we test that only one slope parameter is 0. That's because of the well-known relationship between a t -statistic and an F -statistic that has one numerator degree of freedom:

\(t_{(n-p)}^{2}=F_{(1, n-p)}\)

For our example, the square of the t -statistic, 5.72, equals our F -statistic (within rounding error). That is:

\(t^{*2}=5.72^2=32.72=F^*\)

So what have we learned in all of this discussion about the equivalence of the F -test and the t -test? In short:

Compare the output obtained when \(x_{1}\) = Area is entered into the model last :

Inf = - 0.135 - 0.2435 X2 - 0.0657 X3 + 0.613 Area

to the output obtained when \(x_{1}\) = Area is entered into the model first :

The t -statistic and P -value are the same regardless of the order in which \(x_{1}\) = Area is entered into the model. That's because — by its equivalence to the F -test — the t -test for one slope parameter adjusts for all of the other predictors included in the model.

  • We can use either the F -test or the t -test to test that only one slope parameter is 0. Because the t -test results can be read right off of the Minitab output, it makes sense that it would be the test that we'll use most often.
  • But, we have to be careful with our interpretations! The equivalence of the t -test to the F -test has taught us something new about the t -test. The t -test is a test for the marginal significance of the \(x_{1}\) predictor after the other predictors \(x_{2}\) and \(x_{3}\) have been taken into account. It does not test for the significance of the relationship between the response y and the predictor \(x_{1}\) alone.

Testing a subset of slope parameters is 0 Section  

Finally, let's answer the third — and primary — research question: "Is the size of the infarct area significantly (linearly) related to the type of treatment upon controlling for the size of the region at risk for infarction?" To do so, we test the hypotheses:

  • \(H_{0} \colon \beta_{2} = \beta_{3} = 0 \)
  • \(H_{A} \colon\) At least one \(\beta_{j} \ne 0 \) (for j = 2, 3)

Because the null hypothesis sets the second and third slope parameters, \(\beta_{2}\) and \(\beta_{3}\), equal to 0, the reduced model is:

\(y_i=(\beta_0+\beta_1x_{i1})+\epsilon_i\)

The ANOVA table for the reduced model is:

Because the only predictor in the model is \(x_{1}\), we denote the error sum of squares as SSE (\(x_{1}\)) = 0.8793. Because there are 2 parameters in the model, the number of error degrees of freedom associated with the reduced model is \(df_{R} = n - 2 = 32 – 2 = 30\).

\begin{align} F^*&=\dfrac{SSE(R)-SSE(F)}{df_R-df_F} \div\dfrac{SSE(F)}{df_F}\\&=\dfrac{0.8793-0.54491}{30-28} \div\dfrac{0.54491}{28}\\&= \dfrac{0.33439}{2} \div 0.01946\\&=8.59.\end{align}

Alternatively, we can calculate the F-statistic using a partial F-test :

\begin{align}F^*&=\dfrac{SSR(x_2, x_3|x_1)}{2}\div \dfrac{SSE(x_1,x_2, x_3)}{n-4}\\&=\dfrac{MSR(x_2, x_3|x_1)}{MSE(x_1,x_2, x_3)}.\end{align}

To conduct the test, we regress y = InfSize on \(x_{1}\) = Area and \(x_{2}\) and \(x_{3 }\)— in order (and with "Sequential sums of squares" selected under "Options"):

Inf = - 0.135 + 0.613 Area - 0.2435 X2 - 0.0657 X3

yielding SSR (\(x_{2}\) | \(x_{1}\)) = 0.31453, SSR (\(x_{3}\) | \(x_{1}\), \(x_{2}\)) = 0.01981, and MSE = 0.54491/28 = 0.01946. Therefore, the value of the partial F -statistic is:

\begin{align} F^*&=\dfrac{SSR(x_2, x_3|x_1)}{2}\div \dfrac{SSE(x_1,x_2, x_3)}{n-4}\\&=\dfrac{0.31453+0.01981}{2}\div\dfrac{0.54491}{28}\\&= \dfrac{0.33434}{2} \div 0.01946\\&=8.59,\end{align}

which is identical (within round-off error) to the general F-statistic above. The P -value is the probability — if the null hypothesis were true — that we would observe a partial F -statistic more extreme than 8.59. The following Minitab output:

F distribution with 2 DF in Numerator and 28 DF in denominator

tells us that the probability of observing such an F -statistic that is smaller than 8.59 is 0.9988. Therefore, the probability of observing such an F -statistic that is larger than 8.59 is 1 - 0.9988 = 0.0012. The P -value is very small. There is sufficient evidence ( F = 8.59, P = 0.0012) to conclude that the type of cooling is significantly related to the extent of damage that occurs — after taking into account the size of the region at risk.

Summary of MLR Testing Section  

For the simple linear regression model, there is only one slope parameter about which one can perform hypothesis tests. For the multiple linear regression model, there are three different hypothesis tests for slopes that one could conduct. They are:

  • Hypothesis test for testing that all of the slope parameters are 0.
  • Hypothesis test for testing that a subset — more than one, but not all — of the slope parameters are 0.
  • Hypothesis test for testing that one slope parameter is 0.

We have learned how to perform each of the above three hypothesis tests. Along the way, we also took two detours — one to learn about the " general linear F-test " and one to learn about " sequential sums of squares. " As you now know, knowledge about both is necessary for performing the three hypothesis tests.

The F -statistic and associated p -value in the ANOVA table is used for testing whether all of the slope parameters are 0. In most applications, this p -value will be small enough to reject the null hypothesis and conclude that at least one predictor is useful in the model. For example, for the rabbit heart attacks study, the F -statistic is (0.95927/(4–1)) / (0.54491/(32–4)) = 16.43 with p -value 0.000.

To test whether a subset — more than one, but not all — of the slope parameters are 0, there are two equivalent ways to calculate the F-statistic:

  • Use the general linear F-test formula by fitting the full model to find SSE(F) and fitting the reduced model to find SSE(R) . Then the numerator of the F-statistic is (SSE(R) – SSE(F)) / ( \(df_{R}\) – \(df_{F}\)) .
  • Alternatively, use the partial F-test formula by fitting only the full model but making sure the relevant predictors are fitted last and "sequential sums of squares" have been selected. Then the numerator of the F-statistic is the sum of the relevant sequential sums of squares divided by the sum of the degrees of freedom for these sequential sums of squares. The denominator of the F -statistic is the mean squared error in the ANOVA table.

For example, for the rabbit heart attacks study, the general linear F-statistic is ((0.8793 – 0.54491) / (30 – 28)) / (0.54491 / 28) = 8.59 with p -value 0.0012. Alternatively, the partial F -statistic for testing the slope parameters for predictors \(x_{2}\) and \(x_{3}\) using sequential sums of squares is ((0.31453 + 0.01981) / 2) / (0.54491 / 28) = 8.59.

To test whether one slope parameter is 0, we can use an F -test as just described. Alternatively, we can use a t -test, which will have an identical p -value since in this case, the square of the t -statistic is equal to the F -statistic. For example, for the rabbit heart attacks study, the F -statistic for testing the slope parameter for the Area predictor is (0.63742/1) / (0.54491/(32–4)) = 32.75 with p -value 0.000. Alternatively, the t -statistic for testing the slope parameter for the Area predictor is 0.613 / 0.107 = 5.72 with p -value 0.000, and \(5.72^{2} = 32.72\).

Incidentally, you may be wondering why we can't just do a series of individual t-tests to test whether a subset of the slope parameters is 0. For example, for the rabbit heart attacks study, we could have done the following:

  • Fit the model of y = InfSize on \(x_{1}\) = Area and \(x_{2}\) and \(x_{3}\) and use an individual t-test for \(x_{3}\).
  • If the test results indicate that we can drop \(x_{3}\) then fit the model of y = InfSize on \(x_{1}\) = Area and \(x_{2}\) and use an individual t-test for \(x_{2}\).

The problem with this approach is we're using two individual t-tests instead of one F-test, which means our chance of drawing an incorrect conclusion in our testing procedure is higher. Every time we do a hypothesis test, we can draw an incorrect conclusion by:

  • rejecting a true null hypothesis, i.e., make a type I error by concluding the tested predictor(s) should be retained in the model when in truth it/they should be dropped; or
  • failing to reject a false null hypothesis, i.e., make a type II error by concluding the tested predictor(s) should be dropped from the model when in truth it/they should be retained.

Thus, in general, the fewer tests we perform the better. In this case, this means that wherever possible using one F-test in place of multiple individual t-tests is preferable.

Hypothesis tests for the slope parameters Section  

The problems in this section are designed to review the hypothesis tests for the slope parameters, as well as to give you some practice on models with a three-group qualitative variable (which we'll cover in more detail in Lesson 8). We consider tests for:

  • whether one slope parameter is 0 (for example, \(H_{0} \colon \beta_{1} = 0 \))
  • whether a subset (more than one but less than all) of the slope parameters are 0 (for example, \(H_{0} \colon \beta_{2} = \beta_{3} = 0 \) against the alternative \(H_{A} \colon \beta_{2} \ne 0 \) or \(\beta_{3} \ne 0 \) or both ≠ 0)
  • whether all of the slope parameters are 0 (for example, \(H_{0} \colon \beta_{1} = \beta_{2} = \beta_{3}\) = 0 against the alternative \(H_{A} \colon \) at least one of the \(\beta_{i}\) is not 0)

(Note the correct specification of the alternative hypotheses for the last two situations.)

Sugar beets study

A group of researchers was interested in studying the effects of three different growth regulators ( treat , denoted 1, 2, and 3) on the yield of sugar beets (y = yield , in pounds). They planned to plant the beets in 30 different plots and then randomly treat 10 plots with the first growth regulator, 10 plots with the second growth regulator, and 10 plots with the third growth regulator. One problem, though, is that the amount of available nitrogen in the 30 different plots varies naturally, thereby giving a potentially unfair advantage to plots with higher levels of available nitrogen. Therefore, the researchers also measured and recorded the available nitrogen (\(x_{1}\) = nit , in pounds/acre) in each plot. They are interested in comparing the mean yields of sugar beets subjected to the different growth regulators after taking into account the available nitrogen. The Sugar Beets dataset contains the data from the researcher's experiment.

Preliminary Work

The plot shows a similar positive linear trend within each treatment category, which suggests that it is reasonable to formulate a multiple regression model that would place three parallel lines through the data.

Because the qualitative variable treat distinguishes between the three treatment groups (1, 2, and 3), we need to create two indicator variables, \(x_{2}\) and \(x_{3}\), say, to fit a linear regression model to these data. The new indicator variables should be defined as follows:

Use Minitab's Calc >> Make Indicator Variables command to create the new indicator variables in your worksheet

Minitab creates an indicator variable for each treatment group but we can only use two, for treatment groups 1 and 2 in this case (treatment group 3 is the reference level in this case).

Then, if we assume the trend in the data can be summarized by this regression model:

\(y_{i} = \beta_{0}\) + \(\beta_{1}\)\(x_{1}\) + \(\beta_{2}\)\(x_{2}\) + \(\beta_{3}\)\(x_{3}\) + \(\epsilon_{i}\)

where \(x_{1}\) = nit and \(x_{2}\) and \(x_{3}\) are defined as above, what is the mean response function for plots receiving treatment 3? for plots receiving treatment 1? for plots receiving treatment 2? Are the three regression lines that arise from our formulated model parallel? What does the parameter \(\beta_{2}\) quantify? And, what does the parameter \(\beta_{3}\) quantify?

The fitted equation from Minitab is Yield = 84.99 + 1.3088 Nit - 2.43 \(x_{2}\) - 2.35 \(x_{3}\), which means that the equations for each treatment group are:

  • Group 1: Yield = 84.99 + 1.3088 Nit - 2.43(1) = 82.56 + 1.3088 Nit
  • Group 2: Yield = 84.99 + 1.3088 Nit - 2.35(1) = 82.64 + 1.3088 Nit
  • Group 3: Yield = 84.99 + 1.3088 Nit

The three estimated regression lines are parallel since they have the same slope, 1.3088.

The regression parameter for \(x_{2}\) represents the difference between the estimated intercept for treatment 1 and the estimated intercept for reference treatment 3.

The regression parameter for \(x_{3}\) represents the difference between the estimated intercept for treatment 2 and the estimated intercept for reference treatment 3.

Testing whether all of the slope parameters are 0

\(H_0 \colon \beta_1 = \beta_2 = \beta_3 = 0\) against the alternative \(H_A \colon \) at least one of the \(\beta_i\) is not 0.

\(F=\dfrac{SSR(X_1,X_2,X_3)\div3}{SSE(X_1,X_2,X_3)\div(n-4)}=\dfrac{MSR(X_1,X_2,X_3)}{MSE(X_1,X_2,X_3)}\)

\(F = \dfrac{\frac{16039.5}{3}}{\frac{1078.0}{30-4}} = \dfrac{5346.5}{41.46} = 128.95\)

Since the p -value for this F -statistic is reported as 0.000, we reject \(H_{0}\) in favor of \(H_{A}\) and conclude that at least one of the slope parameters is not zero, i.e., the regression model containing at least one predictor is useful in predicting the size of sugar beet yield.

Tests for whether one slope parameter is 0

\(H_0 \colon \beta_1= 0\) against the alternative \(H_A \colon \beta_1 \ne 0\)

t -statistic = 19.60, p -value = 0.000, so we reject \(H_{0}\) in favor of \(H_{A}\) and conclude that the slope parameter for \(x_{1}\) = nit is not zero, i.e., sugar beet yield is significantly linearly related to the available nitrogen (controlling for treatment).

\(F=\dfrac{SSR(X_1|X_2,X_3)\div1}{SSE(X_1,X_2,X_3)\div(n-4)}=\dfrac{MSR(X_1|X_2,X_3)}{MSE(X_1,X_2,X_3)}\)

Use the Minitab output to calculate the value of this F statistic. Does the value you obtain equal \(t^{2}\), the square of the t -statistic as we might expect?

\(F-statistic= \dfrac{\frac{15934.5}{1}}{\frac{1078.0}{30-4}} = \dfrac{15934.5}{41.46} = 384.32\), which is the same as \(19.60^{2}\).

Because \(t^{2}\) will equal the partial F -statistic whenever you test for whether one slope parameter is 0, it makes sense to just use the t -statistic and P -value that Minitab displays as a default. But, note that we've just learned something new about the meaning of the t -test in the multiple regression setting. It tests for the ("marginal") significance of the \(x_{1}\) predictor after \(x_{2}\) and \(x_{3}\) have already been taken into account.

Tests for whether a subset of the slope parameters is 0

\(H_0 \colon \beta_2=\beta_3= 0\) against the alternative \(H_A \colon \beta_2 \ne 0\) or \(\beta_3 \ne 0\) or both \(\ne 0\).

\(F=\dfrac{SSR(X_2,X_3|X_1)\div2}{SSE(X_1,X_2,X_3)\div(n-4)}=\dfrac{MSR(X_2,X_3|X_1)}{MSE(X_1,X_2,X_3)}\)

\(F = \dfrac{\frac{10.4+27.5}{2}}{\frac{1078.0}{30-4}} = \dfrac{18.95}{41.46} = 0.46\).

F distribution with 2 DF in Numerator and 26 DF in denominator

p-value \(= 1-0.363677 = 0.636\), so we fail to reject \(H_{0}\) in favor of \(H_{A}\) and conclude that we cannot rule out \(\beta_2 = \beta_3 = 0\), i.e., there is no significant difference in the mean yields of sugar beets subjected to the different growth regulators after taking into account the available nitrogen.

Note that the sequential mean square due to regression, MSR(\(X_{2}\),\(X_{3}\)|\(X_{1}\)), is obtained by dividing the sequential sum of square by its degrees of freedom (2, in this case, since two additional predictors \(X_{2}\) and \(X_{3}\) are considered). Use the Minitab output to calculate the value of this F statistic, and use Minitab to get the associated P -value. Answer the researcher's question at the \(\alpha= 0.05\) level.

multiple linear regression and hypothesis testing

Inference for Linear Regression

Kelly McConville

Stat 100 Week 12 | Spring 2024

Announcements

  • Last week of section.

Goals for Today

Recap multiple linear regression

Check assumptions for linear regression inference

Hypothesis testing for linear regression

Estimation and prediction inference for linear regression

Please make sure to fill out the Stat 100 Course Evaluations.

We appreciate constructive feedback.

For all of your course evaluations be mindful of unconscious and unintentional biases .

What does statistical inference (estimation and hypothesis testing) look like when I have more than 0 or 1 explanatory variables?

One route: Multiple Linear Regression!

Multiple Linear Regression

Linear regression is a flexible class of models that allow for:

Both quantitative and categorical explanatory variables.

Multiple explanatory variables.

Curved relationships between the response variable and the explanatory variable.

BUT the response variable is quantitative.

In this week’s p-set you will explore the importance of controlling for key explanatory variables when making inferences about relationships.

Form of the Model:

\[ \begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \end{align} \]

Fitted Model: Using the Method of Least Squares,

\[ \begin{align} \hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p \end{align} \]

Typical Inferential Questions – Hypothesis Testing

Should \(x_2\) be in the model that already contains \(x_1\) and \(x_3\) ? Also often asked as “Controlling for \(x_1\) and \(x_3\) , is there evidence that \(x_2\) has a relationship with \(y\) ?”

\[ \begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \end{align} \]

In other words, should \(\beta_2 = 0\) ?

Typical Inferential Questions – Estimation

After controlling for the other explanatory variables, what is the range of plausible values for \(\beta_3\) (which summarizes the relationship between \(y\) and \(x_3\) )?

Typical Inferential Questions – Prediction

While \(\hat{y}\) is a point estimate for \(y\) , can we also get an interval estimate for \(y\) ? In other words, can we get a range of plausible predictions for \(y\) ?

To answer these questions, we need to add some assumptions to our linear regression model.

Additional Assumptions:

\[ \epsilon \overset{\mbox{ind}}{\sim} N (\mu = 0, \sigma = \sigma_{\epsilon}) \]

\(\sigma_{\epsilon}\) = typical deviations from the model

Let’s unpack these assumptions!

Assumptions – Independence

For ease of visualization, let’s assume a simple linear regression model:

Assumption : The cases are independent of each other.

Question : How do we check this assumption?

Consider how the data were collected.

Assumptions – Normality

\[ \begin{align*} y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}\color{black}{N} \left(0, \sigma_{\epsilon} \right) \end{align*} \]

Assumption : The errors are normally distributed.

Recall the residual: \(e = y - \hat{y}\)

QQ-plot: Plot the residuals against the quantiles of a normal distribution!

multiple linear regression and hypothesis testing

Assumptions – Mean of Errors

Assumption : The points will, on average, fall on the line.

If you use the Method of Least Squares, then you don’t have to check.

It will be true by construction:

\[ \sum e = 0 \]

Assumptions – Constant Variance

Assumption : The variability in the errors is constant.

One option : Scatterplot

multiple linear regression and hypothesis testing

Better option (especially when have more than 1 explanatory variable): Residual Plot

multiple linear regression and hypothesis testing

Assumptions – Model Form

Assumption : The model form is appropriate.

One option : Scatterplot(s)

multiple linear regression and hypothesis testing

Assumption Checking

Question : What if the assumptions aren’t all satisfied?

Try transforming the data and building the model again.

Use a modeling technique beyond linear regression.

Question : What if the assumptions are all (roughly) satisfied?

  • Can now start answering your inference questions!

Let’s now look at an example and learn how to create qq-plots and residual plots in R .

Example: covid and candle ratings.

Kate Petrova created a dataset that made the rounds on Twitter:

multiple linear regression and hypothesis testing

COVID and Candle Ratings

She posted all her data and code to GitHub and I did some light wrangling so that we could answer the question:

Do we have evidence that early in the pandemic the association between time and Amazon rating varies by whether or not a candle is scented and in particular, that scented candles have a steeper decline in ratings over time?

In other words, do we have evidence that we should allow the slopes to vary?

multiple linear regression and hypothesis testing

Checking assumptions:

Question : What needs to be true about the candles sampled?

Assumption Checking in R

The R package we will use to check model assumptions is called gglm and was written by one of my former Reed students, Grayson White.

First need to fit the model:

multiple linear regression and hypothesis testing

Residual Plot

multiple linear regression and hypothesis testing

Hypothesis Testing

Question : What tests is get_regression_table() conducting?

For the moment, let’s focus on the equal slopes model.

In General :

\[ H_o: \beta_j = 0 \quad \mbox{assuming all other predictors are in the model} \] \[ H_a: \beta_j \neq 0 \quad \mbox{assuming all other predictors are in the model} \]

For our Example :

\[ H_o: \beta_1 = 0 \quad \mbox{given Type is already in the model} \] \[ H_a: \beta_1 \neq 0 \quad \mbox{given Type is already in the model} \]

\[ H_o: \beta_2 = 0 \quad \mbox{given Date is already in the model} \] \[ H_a: \beta_2 \neq 0 \quad \mbox{given Date is already in the model} \]

Test Statistic: Let \(p\) = number of explanatory variables.

\[ t = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)} \sim t(df = n - p) \]

when \(H_o\) is true and the model assumptions are met.

Our Example

Test Statistic:

\[ t = \frac{\hat{\beta}_2 - 0}{SE(\hat{\beta}_2)} = \frac{0.831 - 0}{0.063} = 13.2 \]

with p-value \(= P(t \leq -13.2) + P(t \geq 13.2) \approx 0.\)

There is evidence that including whether or not the candle is scented adds useful information to the linear regression model for Amazon ratings that already controls for date.

multiple linear regression and hypothesis testing

One More Example – Prices of Houses in Saratoga Springs, NY

Does whether or not a house has central air conditioning relate to its price for houses in Saratoga Springs?

Potential confounding variables?

  • Notice that you generally don’t include interaction terms for the control variables.

Now let’s shift our focus to estimation and prediction!

Typical Inferential Question:

After controlling for the other explanatory variables, what is the range of plausible values for \(\beta_j\) (which summarizes the relationship between \(y\) and \(x_j\) )?

Confidence Interval Formula:

Two Types of Predictions

Confidence interval for the mean response.

→ Defined at given values of the explanatory variables

→ Estimates the average response

→ Centered at \(\hat{y}\)

→ Smaller SE

Prediction Interval for an Individual Response

→ Predicts the response of a single, new observation

→ Larger SE

CI for mean response at a given level of X:

We want to construct a 95% CI for the average price of Saratoga Houses (in 2006!) where the houses meet the following conditions: 1500 square feet, 20 years old, 2 bathrooms, and have central air.

  • Interpretation : We are 95% confident that the average price of 20 year old, 1500 square feet Saratoga houses with central air and 2 bathrooms is between $199,919 and $211834.

PI for a new Y at a given level of X:

Say we want to construct a 95% PI for the price of an individual house that meets the following conditions: 1500 square feet, 20 years old, 2 bathrooms, and have central air.

Notice : Predicting for a new observation not the mean!

  • Interpretation : For a 20 year old, 1500 square feet Saratoga house with central air and 2 bathrooms, we predict, with 95% confidence, that the price will be between $73,885 and $337,869.

Next Time: Comparing Models and Chi-Squared Tests!

  • How It Works

Multiple Linear Regression in SPSS

Discover Multiple Linear Regression in SPSS ! Learn how to perform, understand SPSS output , and report results in APA style. Check out this simple, easy-to-follow guide below for a quick read!

Struggling with the Multiple Regression in SPSS ? We’re here to help . We offer comprehensive assistance to students , covering assignments , dissertations , research, and more. Request Quote Now !

multiple linear regression and hypothesis testing

Introduction

Welcome to our comprehensive guide on Multiple Linear Regression in SPSS . In the dynamic world of statistics, understanding the nuances of Multiple Linear Regression is key for researchers and analysts seeking a deeper understanding of relationships within their data. This blog post is your roadmap to mastering Multiple Linear Regression using the Statistical Package for the Social Sciences (SPSS).

From unraveling the fundamentals to providing practical insights through examples, this guide aims to demystify the complexities, making Multiple Linear Regression accessible to both beginners and seasoned data enthusiasts.

Definition: Multiple Linear Regression

Multiple Linear Regression expands upon the principles of Simple Linear Regression by accommodating multiple independent variables. In essence, it assesses the linear relationship between the dependent variable and two or more predictors. The model’s flexibility allows for a more realistic representation of real-world scenarios where outcomes are influenced by multiple factors. By incorporating multiple predictors, this technique offers a nuanced understanding of how each variable contributes to the variation in the dependent variable. This section serves as a gateway to the intricacies of Multiple Linear Regression, setting the stage for a detailed exploration of its components and applications in subsequent sections.

Linear Regression Methods

Multiple Linear Regression encompasses various methods for building and refining models to predict a dependent variable based on multiple independent variables. These methods help researchers and analysts tailor regression models to the specific characteristics of their data and research questions. Here are some key methods in Multiple Linear Regression:

Ordinary Least Squares (OLS)

OLS is the most common method used in Multiple Linear Regression . It minimizes the sum of squared differences between observed and predicted values, aiming to find the coefficients that best fit the data. OLS provides unbiased estimates if the assumptions of the regression model are met.

Stepwise Regression

In stepwise regression , the model-building process involves adding or removing predictor variables at each step based on statistical criteria. The algorithm evaluates variables and decides whether to include or exclude them in a stepwise manner. It can be forward (adding variables) or backward (removing variables) stepwise regression.

Backward Regression

Backward regression begins with a model that includes all predictor variables and then systematically removes the least significant variables based on statistical tests. This process continues until the model only contains statistically significant predictors. It’s a simplification approach aimed at retaining only the most influential variables.

Forward Regression

Forward regression starts with an empty model and incrementally adds the most significant predictor variables based on statistical tests. This iterative process continues until the addition of more variables does not significantly improve the model. Forward regression helps identify the most relevant predictors contributing to the model’s explanatory power.

Hierarchical Regression

In hierarchical regression, predictor variables are entered into the model in a pre-defined sequence or hierarchy. This method allows researchers to examine the impact of different sets of variables on the dependent variable, taking into account their hierarchical or logical order. The most common approach involves entering blocks of variables at different steps, and assessing how each set contributes to the overall predictive power of the model.

Understanding these multiple linear regression types is crucial for selecting the most appropriate model-building strategy based on the specific goals of your analysis and the characteristics of your dataset. Each approach has its advantages and considerations, influencing the interpretability and complexity of the final regression model.

Regression Equation

The Multiple Regression Equation in Multiple Linear Regression takes the form of

Y = b0 + b1X1 + b2X2 + … + bnXn , where

  • Y is the predicted value of the dependent variable,
  • b0 is the intercept,
  • b1, b2, …, bn are the regression coefficients for each independent variable (X1, X2, …, Xn).

The regression coefficients represent the change in the dependent variable for a one-unit change in the corresponding independent variable, while the intercept is the predicted value when all independent variables are zero. Understanding the interplay between these components is essential for deciphering the impact of each predictor on the overall model. In the upcoming sections, we’ll delve deeper into specific aspects of Multiple Linear Regression, such as the role of dummy variables and the critical assumptions that underpin this statistical method.

What are Dummy Variables?

In the realm of Multiple Linear Regression , dummy variables are pivotal when dealing with categorical predictors. These variables allow us to include categorical data, like gender or region, in our regression model. Consider a binary categorical variable, such as gender (Male/Female). We represent this in our equation using a dummy variable, where one category is assigned 0 and the other 1. For instance, if Male is our reference category, the dummy variable would be 1 for Female and 0 for Male. This inclusion of categorical information enhances the model’s flexibility, capturing the nuanced impact of different categories on the dependent variable. As we explore Multiple Linear Regression further, understanding the role of dummy variables becomes paramount for robust and accurate analyses.

Assumption of Multiple Linear Regression

Before diving into Multiple Linear Regression analysis, it’s crucial to be aware of the underlying assumptions that bolster the reliability of the results.

  • Linearity : Assumes a linear relationship between the dependent variable and all independent variables. The model assumes that changes in the dependent variable are proportional to changes in the independent variables.
  • Independence of Residuals : Assumes that the residuals (the differences between observed and predicted values) are independent of each other. The independence assumption is crucial to avoid issues of autocorrelation and ensure the reliability of the model.
  • Homoscedasticity : Assumes that the variability of the residuals remains constant across all levels of the independent variables. Homoscedasticity ensures that the spread of residuals is consistent, indicating that the model’s predictions are equally accurate across the range of predictor values.
  • Normality of Residuals : Assumes that the residuals follow a normal distribution. Normality is essential for making valid statistical inferences and hypothesis testing. Deviations from normality may impact the accuracy of confidence intervals and p-values.
  • No Perfect Multicollinearity : Assumes that there is no perfect linear relationship among the independent variables. Perfect multicollinearity can lead to unstable estimates of regression coefficients, making it challenging to discern the individual impact of each predictor.

These assumptions collectively form the foundation of Multiple Linear Regression analysis . Ensuring that these conditions are met enhances the validity and reliability of the statistical inferences drawn from the model. In the subsequent sections, we will delve into hypothesis testing in Multiple Linear Regression, provide practical examples, and guide you through the step-by-step process of performing and interpreting Multiple Linear Regression analyses using SPSS.

Hypothesis of Multiple Linear Regression

The hypothesis in Multiple Linear Regression revolves around the significance of the regression coefficients. Each coefficient corresponds to a specific predictor variable, and the hypothesis tests whether each predictor has a significant impact on the dependent variable.

  • Null Hypothesis (H0): The regression coefficients for all independent variables are simultaneously equal to zero.
  • Alternative Hypothesis (H1): At least one regression coefficient for an independent variable is not equal to zero.

The hypothesis testing in Multiple Linear Regression revolves around assessing whether the collective set of independent variables has a statistically significant impact on the dependent variable. The null hypothesis suggests no overall effect, while the alternative hypothesis asserts the presence of at least one significant relationship. This testing framework guides the evaluation of the model’s overall significance, providing valuable insights into the joint contribution of the predictor variables.

Example of Simple Multiple Regression

To illustrate the concepts of Multiple Linear Regression, let’s consider an example. Imagine you are studying the factors influencing house prices, with predictors such as square footage, number of bedrooms, and distance to the city centre. By applying Multiple Linear Regression, you can model how these factors collectively influence house prices.

The regression equation would look like:

Price = b0 + b1(square footage) + b2(number of bedrooms) + b3(distance to city centre).

Through this example, you’ll gain practical insights into how Multiple Linear Regression can untangle complex relationships and offer a comprehensive understanding of the factors affecting the dependent variable.

How to Perform Multiple Linear Regression using SPSS Statistics

multiple linear regression and hypothesis testing

Step by Step: Running Regression Analysis in SPSS Statistics

Now, let’s delve into the step-by-step process of conducting the Multiple Linear Regression using SPSS Statistics .  Here’s a step-by-step guide on how to perform a Multiple Linear Regression in SPSS :

  • STEP: Load Data into SPSS

Commence by launching SPSS and loading your dataset, which should encompass the variables of interest – a categorical independent variable. If your data is not already in SPSS format, you can import it by navigating to File > Open > Data and selecting your data file.

  • STEP: Access the Analyze Menu

In the top menu, locate and click on “ Analyze .” Within the “Analyze” menu, navigate to “ Regression ” and choose ” Linear ” Analyze > Regression> Linear

  • STEP: Choose Variables

A dialogue box will appear. Move the dependent variable (the one you want to predict) to the “ Dependen t” box and the independent variables to the “ Independent ” box.

  • STEP: Generate SPSS Output

Once you have specified your variables and chosen options, click the “ OK ” button to perform the analysis. SPSS will generate a comprehensive output, including the requested frequency table and chart for your dataset.

Executing these steps initiates the Multiple Linear Regression in SPSS, allowing researchers to assess the impact of the teaching method on students’ test scores while considering the repeated measures. In the next section, we will delve into the interpretation of SPSS output for Multiple Linear Regression .

Conducting a Multiple Linear Regression in SPSS provides a robust foundation for understanding the key features of your data. Always ensure that you consult the documentation corresponding to your SPSS version, as steps might slightly differ based on the software version in use. This guide is tailored for SPSS version 25 , and for any variations, it’s recommended to refer to the software’s documentation for accurate and updated instructions.

SPSS Output for Multiple Regression Analysis

multiple linear regression and hypothesis testing

How to Interpret SPSS Output of Multiple Regression

Deciphering the SPSS output of Multiple Linear Regression is a crucial skill for extracting meaningful insights. Let’s focus on three tables in SPSS output;

Model Summary Table

  • R (Correlation Coefficient): This value ranges from -1 to 1 and indicates the strength and direction of the linear relationship. A positive value signifies a positive correlation, while a negative value indicates a negative correlation.
  • R-Square (Coefficient of Determination) : Represents the proportion of variance in the dependent variable explained by the independent variable. Higher values indicate a better fit of the model.
  • Adjusted R Square : Adjusts the R-squared value for the number of predictors in the model, providing a more accurate measure of goodness of fit.

ANOVA Table

  • F (ANOVA Statistic): Indicates whether the overall regression model is statistically significant. A significant F-value suggests that the model is better than a model with no predictors.
  • df (Degrees of Freedom): Represents the degrees of freedom associated with the F-test.
  • P values : The probability of obtaining the observed F-statistic by random chance. A low p-value (typically < 0.05) indicates the model’s significance.

Coefficient Table

  • Unstandardized Coefficients (B): Provides the individual regression coefficients for each predictor variable.
  • Standardized Coefficients (Beta): Standardizes the coefficients, allowing for a comparison of the relative importance of each predictor.
  • t-values : Indicate how many standard errors the coefficients are from zero. Higher absolute t-values suggest greater significance.
  • P values : Test the null hypothesis that the corresponding coefficient is equal to zero. A low p-value suggests that the predictors are significantly related to the dependent variable.

Understanding these tables in the SPSS output is crucial for drawing meaningful conclusions about the strength, significance, and direction of the relationship between variables in a Simple Linear Regression analysis.

  How to Report Results of Multiple Linear Regression in APA

Effectively communicating the results of Multiple Linear Regression in compliance with the American Psychological Association (APA) guidelines is crucial for scholarly and professional writing

  • Introduction : Begin the report with a concise introduction summarizing the purpose of the analysis and the relationship being investigated between the variables.
  • Assumption Checks: If relevant, briefly mention the checks for assumptions such as linearity, independence, homoscedasticity, and normality of residuals to ensure the robustness of the analysis.
  • Significance of the Model : Comment on the overall significance of the model based on the ANOVA table. For example, “The overall regression model was statistically significant (F = [value], p = [value]), suggesting that the predictors collectively contributed to the prediction of the dependent variable.”
  • Regression Equation : Present the Multiple Regression equation, highlighting the intercept and regression coefficients for each predictor variable.
  • Interpretation of Coefficients : Interpret the coefficients, focusing on the slope (b1..bn) to explain the strength and direction of the relationship. Discuss how a one-unit change in the independent variable corresponds to a change in the dependent variable.
  • R-squared Value: Include the R-squared value to highlight the proportion of variance in the dependent variable explained by the independent variables. For instance, “The R-squared value of [value] indicates that [percentage]% of the variability in [dependent variable] can be explained by the linear relationship with [independent variables].”
  • Conclusion : Conclude the report by summarizing the key findings and their implications. Discuss any practical significance of the results in the context of your study.

multiple linear regression and hypothesis testing

Get Help For Your SPSS Analysis

Embark on a seamless research journey with SPSSAnalysis.com , where our dedicated team provides expert data analysis assistance for students, academicians, and individuals. We ensure your research is elevated with precision. Explore our pages;

  • SPSS Data Analysis Help – SPSS Helper ,
  • Quantitative Analysis Help ,
  • Qualitative Analysis Help ,
  • SPSS Dissertation Analysis Help ,
  • Dissertation Statistics Help ,
  • Statistical Analysis Help ,
  • Medical Data Analysis Help .

Connect with us at SPSSAnalysis.com to empower your research endeavors and achieve impactful results. Get a Free Quote Today !

Expert SPSS data analysis assistance available.

Struggling with Statistical Analysis in SPSS? - Hire a SPSS Helper Now!

  • Prompt Library
  • DS/AI Trends
  • Stats Tools
  • Interview Questions
  • Generative AI
  • Machine Learning
  • Deep Learning

Linear regression hypothesis testing: Concepts, Examples

Simple linear regression model

In relation to machine learning , linear regression is defined as a predictive modeling technique that allows us to build a model which can help predict continuous response variables as a function of a linear combination of explanatory or predictor variables. While training linear regression models, we need to rely on hypothesis testing in relation to determining the relationship between the response and predictor variables. In the case of the linear regression model, two types of hypothesis testing are done. They are T-tests and F-tests . In other words, there are two types of statistics that are used to assess whether linear regression models exist representing response and predictor variables. They are t-statistics and f-statistics. As data scientists , it is of utmost importance to determine if linear regression is the correct choice of model for our particular problem and this can be done by performing hypothesis testing related to linear regression response and predictor variables. Many times, it is found that these concepts are not very clear with a lot many data scientists. In this blog post, we will discuss linear regression and hypothesis testing related to t-statistics and f-statistics . We will also provide an example to help illustrate how these concepts work.

Table of Contents

What are linear regression models?

A linear regression model can be defined as the function approximation that represents a continuous response variable as a function of one or more predictor variables. While building a linear regression model, the goal is to identify a linear equation that best predicts or models the relationship between the response or dependent variable and one or more predictor or independent variables.

There are two different kinds of linear regression models. They are as follows:

  • Simple or Univariate linear regression models : These are linear regression models that are used to build a linear relationship between one response or dependent variable and one predictor or independent variable. The form of the equation that represents a simple linear regression model is Y=mX+b, where m is the coefficients of the predictor variable and b is bias. When considering the linear regression line, m represents the slope and b represents the intercept.
  • Multiple or Multi-variate linear regression models : These are linear regression models that are used to build a linear relationship between one response or dependent variable and more than one predictor or independent variable. The form of the equation that represents a multiple linear regression model is Y=b0+b1X1+ b2X2 + … + bnXn, where bi represents the coefficients of the ith predictor variable. In this type of linear regression model, each predictor variable has its own coefficient that is used to calculate the predicted value of the response variable.

While training linear regression models, the requirement is to determine the coefficients which can result in the best-fitted linear regression line. The learning algorithm used to find the most appropriate coefficients is known as least squares regression . In the least-squares regression method, the coefficients are calculated using the least-squares error function. The main objective of this method is to minimize or reduce the sum of squared residuals between actual and predicted response values. The sum of squared residuals is also called the residual sum of squares (RSS). The outcome of executing the least-squares regression method is coefficients that minimize the linear regression cost function .

The residual e of the ith observation is represented as the following where [latex]Y_i[/latex] is the ith observation and [latex]\hat{Y_i}[/latex] is the prediction for ith observation or the value of response variable for ith observation.

[latex]e_i = Y_i – \hat{Y_i}[/latex]

The residual sum of squares can be represented as the following:

[latex]RSS = e_1^2 + e_2^2 + e_3^2 + … + e_n^2[/latex]

The least-squares method represents the algorithm that minimizes the above term, RSS.

Once the coefficients are determined, can it be claimed that these coefficients are the most appropriate ones for linear regression? The answer is no. After all, the coefficients are only the estimates and thus, there will be standard errors associated with each of the coefficients.  Recall that the standard error is used to calculate the confidence interval in which the mean value of the population parameter would exist. In other words, it represents the error of estimating a population parameter based on the sample data. The value of the standard error is calculated as the standard deviation of the sample divided by the square root of the sample size. The formula below represents the standard error of a mean.

[latex]SE(\mu) = \frac{\sigma}{\sqrt(N)}[/latex]

Thus, without analyzing aspects such as the standard error associated with the coefficients, it cannot be claimed that the linear regression coefficients are the most suitable ones without performing hypothesis testing. This is where hypothesis testing is needed . Before we get into why we need hypothesis testing with the linear regression model, let’s briefly learn about what is hypothesis testing?

Train a Multiple Linear Regression Model using R

Before getting into understanding the hypothesis testing concepts in relation to the linear regression model, let’s train a multi-variate or multiple linear regression model and print the summary output of the model which will be referred to, in the next section. 

The data used for creating a multi-linear regression model is BostonHousing which can be loaded in RStudioby installing mlbench package. The code is shown below:

install.packages(“mlbench”) library(mlbench) data(“BostonHousing”)

Once the data is loaded, the code shown below can be used to create the linear regression model.

attach(BostonHousing) BostonHousing.lm <- lm(log(medv) ~ crim + chas + rad + lstat) summary(BostonHousing.lm)

Executing the above command will result in the creation of a linear regression model with the response variable as medv and predictor variables as crim, chas, rad, and lstat. The following represents the details related to the response and predictor variables:

  • log(medv) : Log of the median value of owner-occupied homes in USD 1000’s
  • crim : Per capita crime rate by town
  • chas : Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • rad : Index of accessibility to radial highways
  • lstat : Percentage of the lower status of the population

The following will be the output of the summary command that prints the details relating to the model including hypothesis testing details for coefficients (t-statistics) and the model as a whole (f-statistics) 

linear regression model summary table r.png

Hypothesis tests & Linear Regression Models

Hypothesis tests are the statistical procedure that is used to test a claim or assumption about the underlying distribution of a population based on the sample data. Here are key steps of doing hypothesis tests with linear regression models:

  • Hypothesis formulation for T-tests: In the case of linear regression, the claim is made that there exists a relationship between response and predictor variables, and the claim is represented using the non-zero value of coefficients of predictor variables in the linear equation or regression model. This is formulated as an alternate hypothesis. Thus, the null hypothesis is set that there is no relationship between response and the predictor variables . Hence, the coefficients related to each of the predictor variables is equal to zero (0). So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then the null hypothesis for each test states that a1 = 0, a2 = 0, a3 = 0 etc. For all the predictor variables, individual hypothesis testing is done to determine whether the relationship between response and that particular predictor variable is statistically significant based on the sample data used for training the model. Thus, if there are, say, 5 features, there will be five hypothesis tests and each will have an associated null and alternate hypothesis.
  • Hypothesis formulation for F-test : In addition, there is a hypothesis test done around the claim that there is a linear regression model representing the response variable and all the predictor variables. The null hypothesis is that the linear regression model does not exist . This essentially means that the value of all the coefficients is equal to zero. So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then the null hypothesis states that a1 = a2 = a3 = 0.
  • F-statistics for testing hypothesis for linear regression model : F-test is used to test the null hypothesis that a linear regression model does not exist, representing the relationship between the response variable y and the predictor variables x1, x2, x3, x4 and x5. The null hypothesis can also be represented as x1 = x2 = x3 = x4 = x5 = 0. F-statistics is calculated as a function of sum of squares residuals for restricted regression (representing linear regression model with only intercept or bias and all the values of coefficients as zero) and sum of squares residuals for unrestricted regression (representing linear regression model). In the above diagram, note the value of f-statistics as 15.66 against the degrees of freedom as 5 and 194. 
  • Evaluate t-statistics against the critical value/region : After calculating the value of t-statistics for each coefficient, it is now time to make a decision about whether to accept or reject the null hypothesis. In order for this decision to be made, one needs to set a significance level, which is also known as the alpha level. The significance level of 0.05 is usually set for rejecting the null hypothesis or otherwise. If the value of t-statistics fall in the critical region, the null hypothesis is rejected. Or, if the p-value comes out to be less than 0.05, the null hypothesis is rejected.
  • Evaluate f-statistics against the critical value/region : The value of F-statistics and the p-value is evaluated for testing the null hypothesis that the linear regression model representing response and predictor variables does not exist. If the value of f-statistics is more than the critical value at the level of significance as 0.05, the null hypothesis is rejected. This means that the linear model exists with at least one valid coefficients. 
  • Draw conclusions : The final step of hypothesis testing is to draw a conclusion by interpreting the results in terms of the original claim or hypothesis. If the null hypothesis of one or more predictor variables is rejected, it represents the fact that the relationship between the response and the predictor variable is not statistically significant based on the evidence or the sample data we used for training the model. Similarly, if the f-statistics value lies in the critical region and the value of the p-value is less than the alpha value usually set as 0.05, one can say that there exists a linear regression model.

Why hypothesis tests for linear regression models?

The reasons why we need to do hypothesis tests in case of a linear regression model are following:

  • By creating the model, we are establishing a new truth (claims) about the relationship between response or dependent variable with one or more predictor or independent variables. In order to justify the truth, there are needed one or more tests. These tests can be termed as an act of testing the claim (or new truth) or in other words, hypothesis tests.
  • One kind of test is required to test the relationship between response and each of the predictor variables (hence, T-tests)
  • Another kind of test is required to test the linear regression model representation as a whole. This is called F-test.

While training linear regression models, hypothesis testing is done to determine whether the relationship between the response and each of the predictor variables is statistically significant or otherwise. The coefficients related to each of the predictor variables is determined. Then, individual hypothesis tests are done to determine whether the relationship between response and that particular predictor variable is statistically significant based on the sample data used for training the model. If at least one of the null hypotheses is rejected, it represents the fact that there exists no relationship between response and that particular predictor variable. T-statistics is used for performing the hypothesis testing because the standard deviation of the sampling distribution is unknown. The value of t-statistics is compared with the critical value from the t-distribution table in order to make a decision about whether to accept or reject the null hypothesis regarding the relationship between the response and predictor variables. If the value falls in the critical region, then the null hypothesis is rejected which means that there is no relationship between response and that predictor variable. In addition to T-tests, F-test is performed to test the null hypothesis that the linear regression model does not exist and that the value of all the coefficients is zero (0). Learn more about the linear regression and t-test in this blog – Linear regression t-test: formula, example .

Recent Posts

Ajitesh Kumar

  • Model Parallelism vs Data Parallelism: Examples - April 11, 2024
  • Model Complexity & Overfitting in Machine Learning: How to Reduce - April 10, 2024
  • 6 Game-Changing Features of ChatGPT’s Latest Upgrade - April 9, 2024

Ajitesh Kumar

One response.

Very informative

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

  • Search for:
  • Excellence Awaits: IITs, NITs & IIITs Journey

ChatGPT Prompts (250+)

  • Generate Design Ideas for App
  • Expand Feature Set of App
  • Create a User Journey Map for App
  • Generate Visual Design Ideas for App
  • Generate a List of Competitors for App
  • Model Parallelism vs Data Parallelism: Examples
  • Model Complexity & Overfitting in Machine Learning: How to Reduce
  • 6 Game-Changing Features of ChatGPT’s Latest Upgrade
  • Self-Prediction vs Contrastive Learning: Examples
  • Free IBM Data Sciences Courses on Coursera

Data Science / AI Trends

  • • Prepend any arxiv.org link with talk2 to load the paper into a responsive chat application
  • • Custom LLM and AI Agents (RAG) On Structured + Unstructured Data - AI Brain For Your Organization
  • • Guides, papers, lecture, notebooks and resources for prompt engineering
  • • Common tricks to make LLMs efficient and stable
  • • Machine learning in finance

Free Online Tools

  • Create Scatter Plots Online for your Excel Data
  • Histogram / Frequency Distribution Creation Tool
  • Online Pie Chart Maker Tool
  • Z-test vs T-test Decision Tool
  • Independent samples t-test calculator

Recent Comments

I found it very helpful. However the differences are not too understandable for me

Very Nice Explaination. Thankyiu very much,

in your case E respresent Member or Oraganization which include on e or more peers?

Such a informative post. Keep it up

Thank you....for your support. you given a good solution for me.

Do you get more food when you order in-person at Chipotle?

April 7, 2024

Inspired by this Reddit post , we will conduct a hypothesis test to determine if there is a difference in the weight of Chipotle orders between in-person and online orders. The data was originally collected by Zackary Smigel , and a cleaned copy be found in data/chipotle.csv .

Throughout the application exercise we will use the infer package which is part of tidymodels to conduct our permutation tests.

Variable type: character

Variable type: Date

Variable type: numeric

The variable we will use in this analysis is weight which records the total weight of the meal in grams.

We wish to test the claim that the difference in weight between in-person and online orders must be due to something other than chance.

multiple linear regression and hypothesis testing

  • Your turn: Write out the correct null and alternative hypothesis in terms of the difference in means between in-person and online orders. Do this in both words and in proper notation.

Null hypothesis: TODO

\[H_0: \mu_{\text{online}} - \mu_{\text{in-person}} = TODO\]

Alternative hypothesis: The difference in means between in-person and online Chipotle orders is not \(0\) .

\[H_A: \mu_{\text{online}} - \mu_{\text{in-person}} TODO\]

Observed data

Our goal is to use the collected data and calculate the probability of a sample statistic at least as extreme as the one observed in our data if in fact the null hypothesis is true.

  • Demo: Calculate and report the sample statistic below using proper notation.

The null distribution

Let’s use permutation-based methods to conduct the hypothesis test specified above.

We’ll start by generating the null distribution.

  • Demo: Generate the null distribution.
  • Your turn: Take a look at null_dist . What does each element in this distribution represent?

Add response here.

Question: Before you visualize the distribution of null_dist – at what value would you expect this distribution to be centered? Why?

Demo: Create an appropriate visualization for your null distribution. Does the center of the distribution match what you guessed in the previous question?

multiple linear regression and hypothesis testing

  • Demo: Now, add a vertical red line on your null distribution that represents your sample statistic.

multiple linear regression and hypothesis testing

Question: Based on the position of this line, does your observed sample difference in means appear to be an unusual observation under the assumption of the null hypothesis?

Above, we eyeballed how likely/unlikely our observed mean is. Now, let’s actually quantify it using a p-value.

Question: What is a p-value?

Guesstimate the p-value

  • Demo: Visualize the p-value.

multiple linear regression and hypothesis testing

Your turn: What is you guesstimate of the p-value?

Calculate the p-value

multiple linear regression and hypothesis testing

Your turn: What is the conclusion of the hypothesis test based on the p-value you calculated? Make sure to frame it in context of the data and the research question. Use a significance level of 5% to make your conclusion.

Demo: Interpret the p-value in context of the data and the research question.

Reframe as a linear regression model

While we originally evaluated the null/alternative hypotheses as a difference in means, we could also frame this as a regression problem where the outcome of interest (weight of the order) is a continuous variable. Framing it this way allows us to include additional explanatory variables in our model which may account for some of the variation in weight.

Single explanatory variable

Demo: Let’s reevaluate the original hypotheses using a linear regression model. Notice the similarities and differences in the code compared to a difference in means, and that the obtained p-value should be nearly identical to the results from the difference in means test.

multiple linear regression and hypothesis testing

Multiple explanatory variables

Demo: Now let’s also account for additional variables that likely influence the weight of the order.

  • Protein type ( meat )
  • Type of meal ( meal_type ) - burrito or bowl
  • Store ( store ) - at which Chipotle location the order was placed

multiple linear regression and hypothesis testing

Your turn: Interpret the p-value for the order in context of the data and the research question.

Compare to CLT-based method

Demo: Let’s compare the p-value obtained from the permutation test to the p-value obtained from that derived using the Central Limit Theorem (CLT).

Your turn: What is the p-value obtained from the CLT-based method? How does it compare to the p-value obtained from the permutation test?

IMAGES

  1. PPT

    multiple linear regression and hypothesis testing

  2. Multiple Linear Regression Hypothesis Testing in Matrix Form

    multiple linear regression and hypothesis testing

  3. Hypothesis Test in Multiple Linear Regression, Part 1

    multiple linear regression and hypothesis testing

  4. Mod-01 Lec-39 Hypothesis Testing in Linear Regression

    multiple linear regression and hypothesis testing

  5. PPT

    multiple linear regression and hypothesis testing

  6. PPT

    multiple linear regression and hypothesis testing

VIDEO

  1. Hypothesis Testing in Simple Linear Regression

  2. Lecture 5. Hypothesis Testing In Simple Linear Regression Model

  3. Case Study: Using SigmaXL and DMAIC to Improve Customer Satisfaction

  4. SPSS 12- Linear Regression [Urdu/Hindi]

  5. Simple linear regression hypothesis testing

  6. اختبارات الفروض : تحليل الانحدار المتعدد Hypothesis tests: multiple regression analysis

COMMENTS

  1. PDF Lecture 5 Hypothesis Testing in Multiple Linear Regression

    As in simple linear regression, under the null hypothesis t 0 = βˆ j seˆ(βˆ j) ∼ t n−p−1. We reject H 0 if |t 0| > t n−p−1,1−α/2. This is a partial test because βˆ j depends on all of the other predictors x i, i 6= j that are in the model. Thus, this is a test of the contribution of x j given the other predictors in the model.

  2. Hypothesis Tests and Confidence Intervals in Multiple Regression

    Confidence Intervals for a Single Coefficient. The confidence interval for a regression coefficient in multiple regression is calculated and interpreted the same way as it is in simple linear regression. The t-statistic has n - k - 1 degrees of freedom where k = number of independents. Supposing that an interval contains the true value of ...

  3. PDF Hypothesis Testing in the Multiple regression model

    Hypothesis Testing in the Multiple regression model. • Testing that individual coefficients take a specific value such as zero or some other value is done in exactly the same way as with the simple two variable regression model. • Now suppose we wish to test that a number of coefficients or combinations of coefficients take some particular ...

  4. Multiple Linear Regression

    The formula for a multiple linear regression is: = the predicted value of the dependent variable. = the y-intercept (value of y when all other parameters are set to 0) = the regression coefficient () of the first independent variable () (a.k.a. the effect that increasing the value of the independent variable has on the predicted y value ...

  5. 5.3

    A population model for a multiple linear regression model that relates a y -variable to p -1 x -variables is written as. y i = β 0 + β 1 x i, 1 + β 2 x i, 2 + … + β p − 1 x i, p − 1 + ϵ i. We assume that the ϵ i have a normal distribution with mean 0 and constant variance σ 2. These are the same assumptions that we used in simple ...

  6. Lesson 5: Multiple Linear Regression (MLR) Model & Evaluation

    a hypothesis test for testing that a subset — more than one, but not all — of the slope parameters are 0. In this lesson, we also learn how to perform each of the above three hypothesis tests. Key Learning Goals for this Lesson: Be able to interpret the coefficients of a multiple regression model. Understand what the scope of the model is ...

  7. Multiple Linear Regression. A complete study

    Here, Y is the output variable, and X terms are the corresponding input variables. Notice that this equation is just an extension of Simple Linear Regression, and each predictor has a corresponding slope coefficient (β).The first β term (βo) is the intercept constant and is the value of Y in absence of all predictors (i.e when all X terms are 0). It may or may or may not hold any ...

  8. Multiple linear regression

    Linear regression has an additive assumption: $ sales = β 0 + β 1 × tv + β 2 × radio + ε $. i.e. An increase of 100 USD dollars in TV ads causes a fixed increase of 100 β 2 USD in sales on average, regardless of how much you spend on radio ads. We saw that in Fig 3.5 above.

  9. Hypothesis Tests in Multiple Linear Regression, Part 1

    Organized by textbook: https://learncheme.com/ See Part 2: https://www.youtube.com/watch?v=ziGbG0dRlsAMade by faculty at the University of Colorado Boulder, ...

  10. Lesson 5: Multiple Linear Regression

    Minitab Help 5: Multiple Linear Regression; R Help 5: Multiple Linear Regression; Lesson 6: MLR Model Evaluation. 6.1 - Three Types of Hypotheses; 6.2 - The General Linear F-Test; 6.3 - Sequential (or Extra) Sums of Squares; 6.4 - The Hypothesis Tests for the Slopes; 6.5 - Partial R-squared; 6.6 - Lack of Fit Testing in the Multiple Regression ...

  11. Multiple linear regression -- Advanced Statistics using R

    Hypothesis testing of regression coefficient(s) With the estimates of regression coefficients and their standard errors estimates, we can conduct hypothesis testing for one, a subset, or all regression coefficients. ... about 40% of the variation in college GPA can be explained by the multiple linear regression with h.GPA, SAT, and recommd as ...

  12. Multiple Regression Analysis: Hypothesis Tests

    This video is an introduction to multiple regression analysis, with a focus on conducting a hypothesis test. If I look tired in the video, it's because I've ...

  13. PDF Lecture 5: Multiple Linear Regression

    the hypothesis gathered by random sampling of the data. 1. State the hypotheses, typically a null hypothesis, P &and an alternative hypothesis, P ', that is the negation of the former. 2. Choose a type of analysis, i.e. how to use sample data to evaluate the null hypothesis. Typically this involves choosing a single test statistic. 3.Compute ...

  14. 12.2.1: Hypothesis Test for Linear Regression

    The formula for the t-test statistic is t = b1 (MSE SSxx)√. Use the t-distribution with degrees of freedom equal to n − p − 1. The t-test for slope has the same hypotheses as the F-test: Use a t-test to see if there is a significant relationship between hours studied and grade on the exam, use α = 0.05.

  15. PDF 13 Multiple Linear( Regression(

    Multiple Linear 13 Regression. Chapter 12. Definition. The multiple regression model equation. Y = b 0 + b 1x1 + b 2x2 + ... +. where E(ε) = 0 and Var(ε) = s 2. b pxp + ε. is. Again, it is assumed that ε is normally distributed.

  16. Multiple Linear Regression

    Multiple linear regression analysis is the creation of an equation with multiple independent X variables that all influence a Y response variable. This equation is based upon an existing data set and models the conditions represented in the data. ... 01:09 the hypothesis tests are still the same as a simple linear regression. 01:13 The null ...

  17. How to Simplify Hypothesis Testing for Linear Regression in Python

    Before I share the 4 assumptions that should be met in order to run a linear regression hypothesis test, there is one important point to keep in mind regarding linear regression. Linear regression can be thought of as a dual purpose tool: To predict future values for the y variable; To infer if the trend is statistically significant; This is ...

  18. Hypothesis testing in the multiple regression model

    4.1 Hypothesis testing: an overview 1 4.1.1 Formulation of the null hypothesis and the alternative hypothesis 2 4.1.2 Test statistic 2 4.1.3 Decision rule 3 4.2 Testing hypotheses using the t test 5 4.2.1 Test of a single parameter 5 4.2.2 Confidence intervals 16 4.2.3 Testing hypothesis about a single linear combination of the parameters 17 4.2.4 Economic importance versus statistical ...

  19. Multiple Regression Analysis using SPSS Statistics

    Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes, the outcome, target or criterion variable). The variables we are using to predict the value ...

  20. 6.4

    For the simple linear regression model, there is only one slope parameter about which one can perform hypothesis tests. For the multiple linear regression model, there are three different hypothesis tests for slopes that one could conduct. They are: Hypothesis test for testing that all of the slope parameters are 0.

  21. Writing hypothesis for linear multiple regression models

    2. I struggle writing hypothesis because I get very much confused by reference groups in the context of regression models. For my example I'm using the mtcars dataset. The predictors are wt (weight), cyl (number of cylinders), and gear (number of gears), and the outcome variable is mpg (miles per gallon). Say all your friends think you should ...

  22. Inference for Regression

    Hypothesis testing for linear regression. Estimation and prediction inference for linear regression. Please make sure to fill out the Stat 100 Course Evaluations. ... One route: Multiple Linear Regression! Multiple Linear Regression. Linear regression is a flexible class of models that allow for:

  23. Multiple Linear Regression in SPSS

    The hypothesis testing in Multiple Linear Regression revolves around assessing whether the collective set of independent variables has a statistically significant impact on the dependent variable. The null hypothesis suggests no overall effect, while the alternative hypothesis asserts the presence of at least one significant relationship.

  24. Linear regression hypothesis testing: Concepts, Examples

    This essentially means that the value of all the coefficients is equal to zero. So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then the null hypothesis states that a1 = a2 = a3 = 0. Determine the test statistics: The next step is to determine the test statistics and calculate the value.

  25. Do you get more food when you order in-person at Chipotle?

    Demo: Let's reevaluate the original hypotheses using a linear regression model. Notice the similarities and differences in the code compared to a difference in means, and that the obtained p-value should be nearly identical to the results from the difference in means test.