Linear Regression Problems with Solutions

Linear regression and modelling problems are presented along with their solutions at the bottom of the page. Also a linear regression calculator and grapher may be used to check answers and create more opportunities for practice.


More references and links:

  • Linear Regression Calculator and Grapher
  • Linear Least Squares Fitting
  • Elementary Statistics and Probabilities


Statistics and probability

Course: Statistics and probability > Unit 5: Fitting a line to data

  • Estimating the line of best fit exercise
  • Eyeballing the line of best fit
  • Estimating with linear regression (linear models)
  • Estimating equations of lines of best fit, and using them to make predictions
  • Line of best fit: smoking in 1945
  • Estimating slope of line of best fit
  • Equations of trend lines: Phone data

Linear regression review

What is linear regression?


Using equations for lines of fit

Example: finding the equation.

  • (Choice A) y = 5x + 1.5
  • (Choice B) y = 1.5x + 5
  • (Choice C) y = −1.5x + 5


BUS204: Business Statistics


Linear Regression and Correlation Homework

Solve these problems, then check your answers against the given solutions.

Exercise 12.1

For each situation below, state the independent variable and the dependent variable.

  • A study is done to determine if elderly drivers are involved in more motor vehicle fatalities than all other drivers. The number of fatalities per 100,000 drivers is compared to the age of drivers.
  • A study is done to determine if the weekly grocery bill changes based on the number of family members.
  • Insurance companies base life insurance premiums partially on the age of the applicant.
  • Utility bills vary according to power consumption.
  • A study is done to determine if a higher education reduces the crime rate in a population.

Exercise 12.3

  • Using "year " as the independent variable and "welfare family size" as the dependent variable, make a scatter plot of the data.
  • Find the correlation coefficient. Is it significant?
  • Pick two years between 1969 and 1991 and find the estimated welfare family sizes.
  • Use the two points in (d) to plot the least squares line on your graph from (b).
  • Based on the above data, is there a linear relationship between the year and the average number of people in a welfare family?
  • Using the least squares line, estimate the welfare family sizes for 1960 and 1995. Does the least squares line give an accurate estimate for those years? Explain why or why not.
  • Are there any outliers in the above data?
  • What is the estimated average welfare family size for 1986? Does the least squares line give an accurate estimate for that year? Explain why or why not.
  • What is the slope of the least squares (best-fit) line? Interpret the slope.

Exercise 12.5

  • Using "stories" as the independent variable and "height" as the dependent variable, make a scatter plot of the data.
  • Does it appear from inspection that there is a relationship between the variables?
  • Find the estimated heights for 32 stories and for 94 stories.
  • Use the two points in (e) to plot the least squares line on your graph from (b).
  • Based on the above data, is there a linear relationship between the number of stories in tall buildings and the height of the buildings?
  • Are there any outliers in the above data? If so, which point(s)?
  • What is the estimated height of a building with 6 stories? Does the least squares line give an accurate estimate of height? Explain why or why not.
  • Based on the least squares line, adding an extra story adds about how many feet to a building?

Exercise 12.7

  •  Using "year" as the independent variable and "percent" as the dependent variable, make a scatter plot of the data.
  • Does it appear from inspection that there is a relationship between the variables? Why or why not?
  • Find the estimated percents for 1991 and 1988.
  • Based on the above data, is there a linear relationship between the year and the percent of female wage and salary earners who are paid hourly rates?
  • Are there any outliers in the above data? What is the estimated percent for the year 2050? Does the least squares line give an accurate estimate for that year? Explain why or why not.

Exercise 12.9

  • Using "size" as the independent variable and "cost" as the dependent variable, make a scatter plot.
  • If the laundry detergent were sold in a 40 ounce size, find the estimated cost.
  • If the laundry detergent were sold in a 90 ounce size, find the estimated cost.
  • Use the two points in (e) and (f) to plot the least squares line on your graph from (a).
  • Does it appear that a line is the best way to fit the data? Why or why not?
  • Is the least squares line valid for predicting what a 300 ounce size of the laundry detergent would cost? Why or why not?

Exercise 12.11

  • Decide which variable should be the independent variable and which should be the dependent variable.
  • Make a scatter plot of the data.
  • Find the estimated total cost for a net taxable estate of $1,000,000. Find the cost for $2,500,000.
  • Use the two points in (f) to plot the least squares line on your graph from (b).
  • Based on the above, what would be the probate fees and taxes for an estate that does not have any assets?

Exercise 12.13

  • Find the estimated average height for a one-year-old. Find the estimated average height for an eleven-year-old.
  • Use the least squares line to estimate the average height for a sixty-two-year-old man. Do you think that your answer is reasonable? Why or why not?

Exercise 12.15

  • Find the correlation coefficient. What does it imply about the significance of the relationship?
  • Find the estimated number of letters (to the nearest integer) a state would have if it entered the Union in 1900. Find the estimated number of letters a state would have if it entered the Union in 1940.
  • Use the least squares line to estimate the number of letters a new state that enters the Union this year would have. Can the least squares line be used to predict it? Why or why not?

Exercise 12.21

Try these multiple choice questions

  • Strong positive correlation
  • Weak negative correlation
  • Strong negative correlation
  • No Correlation

Exercise 12.26

  • Graph a scatterplot of the data.
  • Find the correlation coefficient and determine if it is significant.
  • Find the equation of the best fit line.
  • Write the sentence that interprets the meaning of the slope of the line in the context of the data.
  • What percent of the variation in fuel efficiency is explained by the variation in the weight of the vehicles, using the regression line? (State your answer in a complete sentence in the context of the data.)
  • Accurately graph the best fit line on your scatterplot.
  • For the vehicle that weighs 3,000 pounds, find the residual (y − ŷ). Does the value predicted by the line underestimate or overestimate the observed data value?
  • Identify any outliers, using either the graphical or numerical procedure demonstrated in the textbook.
  • The outlier is a hybrid car that runs on gasoline and electric technology, but all other vehicles in the sample have engines that use gasoline only. Explain why it would be appropriate to remove the outlier from the data in this situation. Remove the outlier from the sample data. Find the new correlation coefficient, coefficient of determination, and best fit line.
  • Compare the correlation coefficients and coefficients of determination before and after removing the outlier, and explain in complete sentences what these numbers indicate about how the model has changed.


Homework 10: Correlation and simple linear regression

Objectives :

  • Explore null and alternative hypothesis concepts for coefficient estimates, linear association, and linear models.
  • Evaluate error rates (Type I, Type II), critical values, and p-values.
  • Learn how to obtain and interpret the product-moment correlation and other correlations in R Commander.
  • Learn how to obtain and interpret regression statistics and diagnostic plots in R Commander.
  • Apply the general linear model approach to regression models.

Homework 10 expectations

Read through the entire homework before starting to answer a question. You are expected to have read the chapter and to have completed the preceding homework. Answers are provided to odd-numbered problems — turn in your work for the even-numbered problems.

How to work this homework

This homework is in two parts and involves two separate data sets. First, you will practice correlation work in R with the data set Animals: brain weight (g) and body weight (kg) for 28 species. Second, you will work with the data set cars, stopping distance by speed of the car, and develop simple linear regression (SLR) models with diagnostic graphs.

Your report will consist of your answers to the bold, numbered questions and supporting statistics from R. Suggested steps in your analysis are provided as numbered items. You may work together or individually, but each of you must turn in your own report. Don’t “plagiarize” from each other. Do include in your report who you worked with.

What to turn in:  Turn in two properly named pdf files:

The files should contain your R code, statistical results, and your answers to the questions. Use of R Markdown is recommended; however, copy/paste into a Word document is also acceptable.

Submit your work to CANVAS. Obey proper file naming formats.

Resources for this homework

Chapter 16. Mike’s Biostatistics Book

Chapter 17. Mike’s Biostatistics Book

Mike’s Workbook for Biostatistics: A quick look at R and R Commander , Part01 – Part10 and previous homework pages presented in this workbook.

Additional R commands and/or code are provided below.

Answers to selected problems

Work on Correlation

Objective 1 : To learn how to obtain and interpret Product moment correlation and other correlations in R Commander

For correlation, use this dataset = Animals. This data set consists of brain weight and body mass of 28 species, including mammals (24 species plus humans) and three dinosaurs (Brachiosaurus, Diplodocus, Triceratops). The ratio of brain weight to body mass is sometimes called the encephalization index; humans have a high ratio, so this measure is sometimes taken as a crude estimate of “intelligence.” The data set is one of the built-in datasets with R (go to Rcmdr: Data in packages → Read data set from attached package… Select the MASS package, then find Animals). Alternatively, and for your convenience, I’ve included the dataset at the end of this page (scroll down or click here).

Note. Since the 1990s, comparative biologists have recognized that such species comparisons without accounting for phylogenetic structure of the data likely violates the assumption of independence among the data points. Many approaches to account for the nonindependence have been published: see Chapter 20.12 in Mike’s Biostatistics Book for an introduction to one method now called phylogenetically independent contrasts (PIC). For the purposes of this homework we ignore this issue.

Questions: The work to do for correlation analysis

1. Make a scatterplot of brain weight on body weight.

a) It would be a good idea to highlight specific points in the graph.

Note. This can be done, but it is an advanced R trick. Here’s one method, a bit crude, but it works. Copy and paste the code into the script window of Rcmdr. Assuming you’ve already loaded the dataset Animals, then submit the following one line at a time.
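The code block itself did not survive extraction. A minimal base-R sketch of one way to highlight points (the labeled species are my choice for illustration):

```r
# Assumes the Animals data set (package MASS) is already loaded
plot(brain ~ body, data = Animals, log = "xy")  # log axes: weights span many orders of magnitude
pick <- rownames(Animals) %in% c("Human", "Brachiosaurus")
points(Animals$body[pick], Animals$brain[pick], col = "red", pch = 19)
text(Animals$body[pick], Animals$brain[pick], labels = rownames(Animals)[pick], pos = 4)
```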

2. Make histograms of body weight and brain weight

3. Conduct a test of normality of brain weight, and another test for body weight

Rcmdr: Statistics → Summaries → Test of normality (compare Shapiro-Wilk against Anderson-Darling)
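Outside the menus, the base-R equivalents look roughly like this (a sketch; Rcmdr’s Anderson-Darling version comes from the nortest package):

```r
hist(Animals$body)           # item 2: histograms
hist(Animals$brain)
shapiro.test(Animals$body)   # item 3: Shapiro-Wilk normality tests
shapiro.test(Animals$brain)
```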

4. Obtain the product moment correlation (parametric) and conduct the two-sided test of the null hypothesis

Rcmdr: Statistics → Summaries → Correlation test

Rcmdr menu correlation

4a. Find in the R output the command for correlation and make sure you include this in your report.

5. Repeat, but this time calculate the (nonparametric) Spearman rank-order correlation, again a two-sided test

5a. Find in the R output the command for Spearman rank-order correlation and make sure you include this in your report.
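For reference, the commands Rcmdr generates are variations on base R’s cor.test(); a sketch of both versions:

```r
cor.test(Animals$body, Animals$brain, alternative = "two.sided", method = "pearson")
cor.test(Animals$body, Animals$brain, alternative = "two.sided", method = "spearman")
```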

Question 1 . Briefly define and contrast parametric and nonparametric statistical tests. (Hint: assumptions!)

Question 2 . Make a preliminary conclusion about whether or not brain weight is correlated with body weight.

Question 3 . Using the results from items 2 – 5, which correlation estimate is most justifiable as a test of the association between brain and body weight, the parametric or the nonparametric correlation?

Question 4. Reviewing your plot and the correlation results, comment on the relationship between body mass and brain weight.

Save your Markdown file and include Corr as part of your properly named pdf file name. Submit your file to this page.

Proceed to the second part of the homework

Work on Ordinary Linear (Simple) Regression

Objective 2 : To learn how to obtain and interpret  regression statistics and diagnostic plots in R Commander

Objective 3 : Introduce the General Linear Model approach

For simple linear regression, use this dataset = cars, Speed and Stopping Distances of Cars (one of the built-in datasets with R; go to Rcmdr: Data in packages → Read data set from attached package… Select the datasets package, then find cars). Alternatively, and for your convenience, I’ve included the dataset at the end of this page (scroll down or click here). The data set is a classic from the 1920s: the distance (feet) required to stop a car at a given speed (mph).

The work to do on simple linear regression

1. Identify the dependent and independent variables. Justify your selection

2. Make a scatterplot of Distance on Speed

3. Make a histogram of Distance

4. Conduct a test of normality on Distance

Rcmdr: Statistics → Summaries → Test of normality (compare Shapiro-Wilk against Anderson-Darling)
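As with the correlation part, the menu actions above reduce to a few base-R commands; a sketch using the cars variable names:

```r
plot(dist ~ speed, data = cars)   # item 2: scatterplot of Distance on Speed
hist(cars$dist)                   # item 3: histogram of Distance
shapiro.test(cars$dist)           # item 4: Shapiro-Wilk normality test
```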

Question 1. Make a preliminary conclusion about whether or not the data conform to the assumptions of linear regression, based on your results from items 1 – 4. Include your justification for choice of independent and dependent variables. If you find the data do not conform, create new log-transformed variables and redo your assumption tests.

5. Based on your answer to Question 1, write out the model (e.g., in R, the model is specified as Y ~ X), then go ahead and conduct the linear regression of Distance on Speed.

Rcmdr: Statistics → Fit models → Linear model

  • an example screen is shown (yours will look different from the example). Note also that you can let Rcmdr assign a name for the model object or you can specify one yourself ( ⮜ DrD recommended! )

Rcmdr General Linear Model menu
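Behind the menu, the fit is a one-line lm() call; a sketch (the object name cars.lm is my choice):

```r
cars.lm <- lm(dist ~ speed, data = cars)  # Y ~ X: stopping distance modeled by speed
summary(cars.lm)                          # intercept, slope, tests, and fit statistics for item 6
```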

6. Obtain the following from the R output ⮜ recommend you organize these into a table

a) value of the Y-intercept

b) Report on whether the Y-intercept is statistically significant (what is the null hypothesis?)

c) value of the slope

d) Report on whether the slope is statistically significant (what is the null hypothesis?)

e) Write out the statistical model

f) Find the fit statistic for this model.

Question 2 . Make a preliminary conclusion about the predictive value of your X (independent) on your Y (dependent) variable

7. Obtain appropriate diagnostic plots and diagnostic statistics for your regression and evaluate your model against the assumptions

Rcmdr : Models → Graphs → Basic diagnostic plots

Rcmdr : Models → Numerical diagnostics → “several to choose from” ⮜ part of the evaluation is whether you select the correct diagnostic tests; hint: most of these are listed in your Chapter 18
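The basic diagnostic plots can also be produced directly from the model object; a sketch, assuming the cars.lm object from above:

```r
par(mfrow = c(2, 2))
plot(cars.lm)   # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))
```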

Question 3 . Make a conclusion about the regression model based on your results from Question 1 and Question 2.

Question 4 . Use your regression model to predict stopping distance at 60 mph. How does your prediction compare to stopping distance of 108 feet for a Ford Mustang at the same speed?

Hint: you can simply calculate this using the appropriate numbers from your regression model. Instead of “hand” calculations, you could alternatively use the following command:
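The command itself was lost in extraction; a sketch of the intended call, using the placeholder names explained below:

```r
predict(modelName, newdata = data.frame(nameOfindependent = 60))
# e.g., for the cars model above: predict(cars.lm, newdata = data.frame(speed = 60))
```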

where modelName is replaced with the object name for your regression model, and nameOfindependent is replaced with the name of your independent variable.

Regardless of the method you choose, if you used a log-transform, don’t forget to report your predicted value in the original raw form. For example, if you used log(x, 10), then back-transform with 10^x, where x is your predicted value. This is a standard recommendation for data analysis — do your statistics on the transformed data, but for graphics or other reports, always back-transform the data.

Save your Markdown file and include SLR as part of your properly named pdf file name. Submit your file to this page.

This concludes work for Homework 10

R or Rcmdr commands

Test normality.

Rcmdr → Statistics → Summaries → Test for normality

Other R/Rcmdr commands provided in text

Dataset = Animals from R package MASS

Dataset = cars
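The data tables themselves are not reproduced here, but both data sets ship with R and can be loaded directly; a sketch:

```r
library(MASS)   # provides Animals (body and brain weights)
data(Animals)
data(cars)      # from the built-in 'datasets' package
head(Animals); head(cars)
```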


Statistics LibreTexts

7.10: Practice problems


  • Mark Greenwood
  • Montana State University

7.1. Treadmill data analysis We will continue with the treadmill data set introduced in Chapter 1 and the SLR fit in the practice problems in Chapter 6. The following code will get you back to where we stopped at the end of Chapter 6:
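The setup code block did not survive extraction. A sketch of the kind of fit intended; the file name and the variable names TreadMillOx and RunTime are assumptions based on the surrounding problems:

```r
# Hypothetical: adjust the file path and variable names to match your copy of the data
treadmill <- read.csv("treadmill.csv")
tm1 <- lm(TreadMillOx ~ RunTime, data = treadmill)
summary(tm1)
```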

7.1.1. Use the output to test for a linear relationship between treadmill oxygen and run time, writing out all 6+ steps of the hypothesis test. Make sure to address scope of inference and interpret the p-value.

7.1.2. Form and interpret a 95% confidence interval for the slope coefficient “by hand” using the provided multiplier:

7.1.3. Use the confint function to find a similar confidence interval, checking your previous calculation.
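A sketch covering 7.1.2 and 7.1.3 (the provided multiplier is not reproduced above; qt() recovers it for a 95% interval), assuming the model object tm1 from the sketch above:

```r
est <- coef(summary(tm1))["RunTime", "Estimate"]
se  <- coef(summary(tm1))["RunTime", "Std. Error"]
est + c(-1, 1) * qt(0.975, df.residual(tm1)) * se  # "by hand": estimate +/- t* x SE
confint(tm1, level = 0.95)                         # should agree
```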

7.1.4. Use the predict function to find fitted values, 95% confidence, and 95% prediction intervals for run times of 11 and 16 minutes.
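A sketch for 7.1.4, again assuming tm1 and the RunTime variable name:

```r
new <- data.frame(RunTime = c(11, 16))
predict(tm1, newdata = new, interval = "confidence")  # fitted values with 95% CIs
predict(tm1, newdata = new, interval = "prediction")  # 95% PIs
```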

7.1.5. Interpret the CI and PI for the 11 minute run time.

7.1.6. Compare the width of either set of CIs and PIs – why are they different? For the two different predictions, why are the intervals wider for 16 minutes than for 11 minutes?

7.1.7. The Residuals vs Fitted plot considered in Chapter 6 should have suggested slight non-constant variance and maybe a little missed nonlinearity. Perform a log-transformation of the treadmill oxygen response variable and re-fit the SLR model. Remake the diagnostic plots and discuss whether the transformation changed any of them.
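A sketch of the refit for 7.1.7, under the same naming assumptions:

```r
tm2 <- lm(log(TreadMillOx) ~ RunTime, data = treadmill)  # natural-log response
par(mfrow = c(2, 2)); plot(tm2); par(mfrow = c(1, 1))    # remake the diagnostics
```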

7.1.8. Summarize the \(\log(y) \sim x\) model and interpret the slope coefficient on the transformed and original scales, regardless of the answer to the previous question.


Simple Linear Regression | An Easy Introduction & Examples

Published on February 19, 2020 by Rebecca Bevans . Revised on June 22, 2023.

Simple linear regression is used to estimate the relationship between two quantitative variables . You can use simple linear regression when you want to know:

  • How strong the relationship is between two variables (e.g., the relationship between rainfall and soil erosion).
  • The value of the dependent variable at a certain value of the independent variable (e.g., the amount of soil erosion at a certain level of rainfall).

Regression models describe the relationship between variables by fitting a line to the observed data. Linear regression models use a straight line, while logistic and nonlinear regression models use a curved line. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.

If you have more than one independent variable, use multiple linear regression instead.

Table of contents

  • Assumptions of simple linear regression
  • How to perform a simple linear regression
  • Interpreting the results
  • Presenting the results
  • Can you predict values outside the range of your data?
  • Other interesting articles
  • Frequently asked questions about simple linear regression

Simple linear regression is a parametric test , meaning that it makes certain assumptions about the data. These assumptions are:

  • Homogeneity of variance (homoscedasticity) : the size of the error in our prediction doesn’t change significantly across the values of the independent variable.
  • Independence of observations : the observations in the dataset were collected using statistically valid sampling methods , and there are no hidden relationships among observations.
  • Normality : The data follows a normal distribution .

Linear regression makes one additional assumption:

  • The relationship between the independent and dependent variable is linear : the line of best fit through the data points is a straight line (rather than a curve or some sort of grouping factor).

If your data do not meet the assumptions of homoscedasticity or normality, you may be able to use a nonparametric test instead, such as the Spearman rank test.

If your data violate the assumption of independence of observations (e.g., if observations are repeated over time), you may be able to perform a linear mixed-effects model that accounts for the additional structure in the data.


Simple linear regression formula

The formula for a simple linear regression is:

\(y = \beta_0 + \beta_1 X + \epsilon\)

  • y is the predicted value of the dependent variable (y) for any given value of the independent variable (x).
  • β0 is the intercept , the predicted value of y when x is 0.
  • β1 is the regression coefficient – how much we expect y to change as x increases.
  • x is the independent variable (the variable we expect is influencing y).
  • ε is the error of the estimate, or how much variation there is in our estimate of the regression coefficient.

Linear regression finds the line of best fit through your data by searching for the regression coefficient (β1) that minimizes the total error (ε) of the model.

While you can perform a linear regression by hand , this is a tedious process, so most people use statistical programs to help them quickly analyze the data.

Simple linear regression in R

R is a free, powerful, and widely-used statistical program. Download the dataset to try it yourself using our income and happiness example.

Dataset for simple linear regression (.csv)

Load the income.data dataset into your R environment, and then run the following command to generate a linear model describing the relationship between income and happiness:
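The command itself was lost in extraction; a reconstruction consistent with the description that follows (the object name income.happiness.lm is my choice):

```r
income.happiness.lm <- lm(happiness ~ income, data = income.data)
```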

This code takes the data you have collected (data = income.data) and calculates the effect that the independent variable income has on the dependent variable happiness, using the linear model function lm().

To learn more, follow our full step-by-step guide to linear regression in R .

To view the results of the model, you can use the summary() function in R:
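Continuing with the model object assumed above:

```r
summary(income.happiness.lm)
```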

This function takes the most important parameters from the linear model and puts them into a table, which looks like this:

Simple linear regression summary output in R

This output table first repeats the formula that was used to generate the results (‘Call’), then summarizes the model residuals (‘Residuals’), which give an idea of how well the model fits the real data.

Next is the ‘Coefficients’ table. The first row gives the estimates of the y-intercept, and the second row gives the regression coefficient of the model.

Row 1 of the table is labeled (Intercept) . This is the y-intercept of the regression equation, with a value of 0.20. You can plug this into your regression equation if you want to predict happiness values across the range of income that you have observed: happiness = 0.20 + 0.71 × income.

The next row in the ‘Coefficients’ table is income. This is the row that describes the estimated effect of income on reported happiness:

The Estimate column is the estimated effect , also called the regression coefficient. The number in the table (0.713) tells us that for every one-unit increase in income (where one unit of income = 10,000) there is a corresponding 0.71-unit increase in reported happiness (where happiness is a scale of 1 to 10).

The Std. Error column displays the standard error of the estimate. This number shows how much variation there is in our estimate of the relationship between income and happiness.

The t value  column displays the test statistic . Unless you specify otherwise, the test statistic used in linear regression is the t value from a two-sided t test . The larger the test statistic, the less likely it is that our results occurred by chance.

The Pr(>| t |)  column shows the p value . This number tells us how likely we are to see the estimated effect of income on happiness if the null hypothesis of no effect were true.

Because the p value is so low ( p < 0.001),  we can reject the null hypothesis and conclude that income has a statistically significant effect on happiness.

The last three lines of the model summary are statistics about the model as a whole. The most important thing to notice here is the p value of the model. Here it is significant ( p < 0.001), which means that this model is a good fit for the observed data.

When reporting your results, include the estimated effect (i.e. the regression coefficient), standard error of the estimate, and the p value. You should also interpret your numbers to make it clear to your readers what your regression coefficient means:

It can also be helpful to include a graph with your results. For a simple linear regression, you can simply plot the observations on the x and y axis and then include the regression line and regression function:

Simple linear regression graph

Can you predict values outside the range of your data?

No! We often say that regression models can be used to predict the value of the dependent variable at certain values of the independent variable. However, this is only true for the range of values where we have actually measured the response.

We can use our income and happiness regression analysis as an example. Between 15,000 and 75,000, we found a regression coefficient of 0.73 ± 0.0193. But what if we did a second survey of people making between 75,000 and 150,000?

Extrapolating data in R

The regression coefficient for the relationship between income and happiness is now 0.21, or a 0.21-unit increase in reported happiness for every 10,000 increase in income. While the relationship is still statistically significant (p < 0.001), the slope is much smaller than before.

Extrapolating data in R graph

What if we hadn’t measured this group, and instead extrapolated the line from the 15–75k incomes to the 75–150k incomes?

You can see that if we simply extrapolated from the 15–75k income data, we would overestimate the happiness of people in the 75–150k income range.

Curved data line

If we instead fit a curve to the data, it seems to fit the actual pattern much better.

It looks as though happiness actually levels off at higher incomes, so we can’t use the same regression line we calculated from our lower-income data to predict happiness at higher levels of income.

Even when you see a strong pattern in your data, you can’t know for certain whether that pattern continues beyond the range of values you have actually measured. Therefore, it’s important to avoid extrapolating beyond what the data actually tell you.

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Chi square test of independence
  • Statistical power
  • Descriptive statistics
  • Degrees of freedom
  • Pearson correlation
  • Null hypothesis

Methodology

  • Double-blind study
  • Case-control study
  • Research ethics
  • Data collection
  • Hypothesis testing
  • Structured interviews

Research bias

  • Hawthorne effect
  • Unconscious bias
  • Recall bias
  • Halo effect
  • Self-serving bias
  • Information bias

A regression model is a statistical model that estimates the relationship between one dependent variable and one or more independent variables using a line (or a plane in the case of two or more independent variables).

A regression model can be used when the dependent variable is quantitative, except in the case of logistic regression, where the dependent variable is binary.

Simple linear regression is a regression model that estimates the relationship between one independent variable and one dependent variable using a straight line. Both variables should be quantitative.

For example, the relationship between temperature and the expansion of mercury in a thermometer can be modeled using a straight line: as temperature increases, the mercury expands. This linear relationship is so certain that we can use mercury thermometers to measure temperature.

Linear regression most often uses mean-square error (MSE) to calculate the error of the model. MSE is calculated by:

  • measuring the distance of the observed y-values from the predicted y-values at each value of x;
  • squaring each of these distances;
  • calculating the mean of the squared distances.

Linear regression fits a line to the data by finding the regression coefficient that results in the smallest MSE.
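As a concrete illustration of those three steps (a sketch using R’s built-in cars data):

```r
fit <- lm(dist ~ speed, data = cars)       # any fitted linear model works here
mse <- mean((cars$dist - fitted(fit))^2)   # distances -> squared -> averaged
mse
```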



Least Squares Regression Line w/ 19 Worked Examples!

// Last Updated: October 10, 2020 - Watch Video //

Did you know that the least squares regression line can be used to predict future values?


Jenn, Founder Calcworkshop ® , 15+ Years Experience (Licensed & Certified Teacher)

Now that’s pretty amazing!

In fact, a least squares regression line (LSRL) helps us to measure the trend and relationship of collected data values and allows us to answer questions like…

  • What happens when we want to study two variables at one time?
  • What is their relationship?
  • Is there an association that exists?
  • What is the strength of the association, if any, and how can it be measured?
  • Is there a way to measure and express this relationship mathematically, and then use this equation to predict future values?

All of these questions, and more, can be expressed using regression as it is a “best fit” for the data!

And that’s why least squares regression is sometimes called the line of best fit .

Now, regression analysis on bivariate (two-variable) data has several key aspects that all help us to explain association and predict relationships:

  • Scatterplots
  • Correlation
  • Least-Squares Regression Lines
  • Residual Plots

Scatterplots are a way for us to visually display a relationship between two quantitative variables, typically written in the form (x,y), where x is the explanatory or independent variable , and y is the response or dependent variable .

Additionally, scatterplots help us to identify outliers and influential points.

The correlation coefficient best measures the strength of this relationship.

Correlation Coefficient Formula

As the graphic to the right indicates, a strong relationship is closer to +1 or -1 and a weaker relationship is closer to zero.

Correlation Coefficient Interpretation

And if a straight-line relationship is observed, we can describe this association with a regression line , also called a least-squares regression line or best-fit line . This trend line, or line of best fit, minimizes the prediction error, called residuals, as discussed by Shafer and Zhang. And the regression equation provides a rule for predicting or estimating the response variable’s values when the two variables are linearly related.

Least Squares Regression Line Equation

We will observe that there are two different methods for calculating the LSRL, depending on whether we are given raw data or summary statistics. But what is important to note about the formulas shown below is that we will always find our slope (b) first, and then we will find the y-intercept (a) second.

Slope Formulas for LSRL (Summary Statistics vs Raw Data)

Y-Intercept Formulas for LSRL (Raw Data vs Summary Statistics)

Now the residuals are the differences between the observed and predicted values . A residual measures the distance between the regression line (predicted value) and the actual observed value. In other words, residuals help us to measure error, or how well our regression line “fits” our data. Moreover, we can then visually display our findings and look for variations on a residual plot.

Residual Plot Interpretation

Residual Scatter Plot

Likewise, we can also calculate the coefficient of determination , also referred to as the R-Squared value , which measures the percent of variation that can be explained by the regression line.

But there is always a word of caution: correlation doesn’t necessarily imply causation . Just because there is a strong relationship, we must be careful not to conclude a cause-and-effect relationship between two variables or use the observed association to extrapolate beyond the data.

Throughout our study, we will see that the least-squares regression equation is the line that best fits the sample data where the sum of the square of the residuals is minimized and fits the mean of the y-coordinates for each x-coordinate. Generally speaking, this line is the best estimate of the line of averages.

Worked Example

Together we use raw data as well as summary statistics to create scatterplots, regression analysis, find the LSRL, correlation coefficients, and determine if the analysis is a “good fit” by calculating the coefficient of determination, as the example below illustrates.

First we will create a scatterplot to determine if there is a linear relationship. Next, we will use our formulas as seen above to calculate the slope and y-intercept from the raw data; thus creating our least squares regression line. Then we will calculate our correlation coefficient to measure the strength of the relationship between the bivariate data and lastly we will determine the residuals, or error, from our predicted value to our observed value and construct a residual plot.
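In R, that slope-then-intercept order looks like this (a sketch; the data values are invented for illustration):

```r
x <- c(1, 2, 3, 4, 5)            # hypothetical raw data
y <- c(2.1, 3.9, 6.2, 7.8, 9.9)
b <- cor(x, y) * sd(y) / sd(x)   # slope first...
a <- mean(y) - b * mean(x)       # ...then the y-intercept
c(a = a, b = b)
coef(lm(y ~ x))                  # same answer from lm()
```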

Finding LSRL for Dataset

Analyzing bivariate data has never been more fun!

Least Squares Regression Line – Lesson & Examples (Video)

2 hr 22 min

  • Introduction to Video: Least-Squares Regression
  • 00:00:38 – Identify Explanatory and Response Variables and How to determine the Correlation Coefficient (Example #1)
  • Exclusive Content for Members Only
  • 00:15:28 – Find the correlation coefficient using both formula methods (Example #2)
  • 00:26:33 – Find the correlation coefficient and create a scatterplot (Example #3)
  • 00:32:23 – Would you expect a positive, negative or no association for the pairs of variables (Example #4)
  • 00:38:13 – Consider the scatterplot and determine the linear association (Example #5)
  • 00:39:59 – How to find the Least Squares Regression Line using raw data or summary statistics
  • 00:50:28 – Find the LSRL (Examples #6-7)
  • 01:01:13 – What are residuals, outliers and influential points? With Example #8
  • 01:14:51 – Use the data to create a scatterplot and find the correlation coefficient, LSRL, residuals and residual plot (Example #9)
  • 01:30:16 – Find the regression line and use it to predict a value (Examples #10-11)
  • 01:36:59 – Using technology find the regression line, correlation coefficient, coefficient of determination and use the LSRL to predict a future value (Example #12-13)
  • 01:53:21 – Using the regression line interpret the slope and r-squared value and find the residual (Example #14)
  • 01:58:13 – Using output data determine the regression line (Example #15)
  • 02:00:38 – Determine if the observation in a regression outlier and has influence on the regression analysis (Example #16)
  • 02:06:06 – Explain what is wrong with the way regression is used in each scenario (Example #17)
  • 02:12:40 – Construct a scatterplot and compute the regression line and determine correlation and coefficient of determination (Example #18)
  • 02:18:29 – Find the regression line and use it to predict future values (Example #19)
  • Practice Problems with Step-by-Step Solutions
  • Chapter Tests with Video Solutions


Simple Linear Regression Examples

Many simple linear regression examples (problems and solutions) from real life can be given to help you understand the core meaning.

On this page:

  • Simple linear regression examples: problems with solutions .
  • Infographic in PDF

In our previous post, linear regression models , we explained in detail what simple and multiple linear regression are. Here, we concentrate on examples of linear regression from real life.

Simple Linear Regression Examples, Problems, and Solutions

Simple linear regression allows us to study the correlation between only two variables:

  • One variable (X) is called the independent variable, or predictor.
  • The other variable (Y) is known as the dependent variable, or outcome.

and the simple linear regression equation is:

Y = β0 + β1X

  • X – the value of the independent variable
  • Y – the value of the dependent variable
  • β0 – a constant (shows the value of Y when X = 0)
  • β1 – the regression coefficient (shows how much Y changes for each unit change in X)

You have to study the relationship between the monthly e-commerce sales and the online advertising costs. You have the survey results for 7 online stores for the last year.

Your task is to find the equation of the straight line that fits the data best.

The following table represents the survey results from the 7 online stores.

We can see that there is a positive relationship between the monthly e-commerce sales (Y) and online advertising costs (X).

The positive correlation means that the values of the dependent variable (y) increase when the values of the independent variable (x) rise.

So, if we want to predict the monthly e-commerce sales from the online advertising costs, the higher the value of advertising costs, the higher our prediction of sales.

We will use the above data to build our Scatter diagram.

Now, let’s see what the scatter diagram looks like:

The scatter plot shows how much one variable affects another. In our example, the scatter plot above shows how much online advertising costs affect monthly e-commerce sales. It shows their correlation.

Let’s see the simple linear regression equation.

Y = 125.8 + 171.5*X

Note: You can easily find the values for β0 and β1 with the help of paid or free statistical software, online linear regression calculators, or Excel. All you need are the values for the independent (X) and dependent (Y) variables (such as those in the above table).
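In R, for instance, the whole fit is one line (a sketch with invented numbers, since the survey table is not reproduced here; these values will not recover the coefficients 125.8 and 171.5 above):

```r
stores <- data.frame(ads   = c(1, 2, 3, 4, 5, 6, 7),                 # hypothetical ad costs (X)
                     sales = c(300, 480, 640, 820, 980, 1150, 1320)) # hypothetical sales (Y)
coef(lm(sales ~ ads, data = stores))   # (Intercept) = B0, ads = B1
```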

Now, we have to see our regression line:

Graph of the Regression Line:

Linear regression aims to find the best-fitting straight line through the points. The best-fitting line is known as the regression line.

If the plotted data points lie close to a straight line, the correlation between the two variables is stronger. In our example, the relationship is strong.

The orange diagonal line in diagram 2 is the regression line and shows the predicted score on e-commerce sales for each possible value of the online advertising costs.

Interpretation of the results:

The slope of 171.5 shows that for each increase of one unit in X, we predict the average of Y to increase by an estimated 171.5 units.

The formula estimates that for each increase of 1 dollar in online advertising costs, the expected monthly e-commerce sales are predicted to increase by $171.5.

This was a simple linear regression example for a positive relationship in business. Let’s see an example of the negative relationship.

You have to examine the relationship between the age and price for used cars sold in the last year by a car dealership company.

Here is the table of the data:

Now, we see that we have a negative relationship between the car price (Y) and car age (X) – as car age increases, price decreases.

When we use the simple linear regression equation, we have the following results:

Y = 7836 – 502.4*X

Let’s use the data from the table and create our Scatter plot and linear regression line:

The above 3 diagrams are made with  Meta Chart .

Result Interpretation:

With an estimated slope of –502.4, we can conclude that the average car price decreases by $502.40 for each year a car increases in age.

If you need more examples in the field of statistics and data analysis or more data visualization types , our posts “ descriptive statistics examples ” and “ binomial distribution examples ” might be useful to you.

Download the following infographic in PDF with the simple linear regression examples:

About The Author


Silvia Valcheva

Silvia Valcheva is a digital marketer with over a decade of experience creating content for the tech industry. She has a strong passion for writing about emerging software and technologies such as big data, AI (Artificial Intelligence), IoT (Internet of Things), process automation, etc.



Mathematics LibreTexts

6.03: Linear Regression


Lesson 1: Introduction to Linear Regression

Learning Objectives

After successful completion of this lesson, you should be able to:

1) Define a residual for a linear regression model,

2) Explain the concept of the least-squares method as an optimization approach,

3) Explain why other criteria for finding the regression model do not work.

Introduction

The problem statement for a regression model is as follows. Given \({n}\) data pairs \(\left( x_{1},y_{1} \right), \left( x_{2},y_{2} \right), \ldots, \left( x_{n},y_{n} \right)\) , best fit \(y = f\left( x \right)\) to the data (Figure \(\PageIndex{1.1}\)).

Fitting a curve through a series of data points, and measuring the shortest distance between each point and the curve.

Linear regression is the most popular regression model. In this model, we wish to predict response to \(n\) data points \(\left( x_{1},y_{1} \right),\left( x_{2},y_{2} \right),\ldots\ldots,\left( x_{n},y_{n} \right)\) by a regression model given by

\[y = a_{0} + a_{1}x\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{1.1}) \nonumber\]

where \(a_{0}\) and \(a_{1}\) are the constants of the regression model.

A measure of goodness of fit, that is, how well \(a_{0} + a_{1}x\) predicts the response variable \(y\), is the magnitude of the residual \(E_{i}\) at each of the \(n\) data points.

\[E_{i} = y_{i} - \left( a_{0} + a_{1}x_{i} \right)\;\;\;\;\;\;\;\;\;\;\;\;(\PageIndex{1.2}) \nonumber\]

Ideally, if all the residuals \(E_{i}\) are zero, one has found an equation in which all the points lie on the model. Thus, minimization of the residuals is an objective of obtaining regression coefficients.

The most popular method to minimize the residual is the least-squares method, where the estimates of the constants of the models are chosen such that the sum of the squared residuals is minimized, that is, minimize

\[S_{r}=\sum_{i = 1}^{n}{E_{i}}^{2}\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{1.3}) \nonumber\]

Why minimize the sum of the square of the residuals, \(S_{r}\) ?

Why not, for instance, minimize the sum of the residual errors or the sum of the absolute values of the residuals? Alternatively, constants of the model can be chosen such that the average residual is zero without making individual residuals small. Would any of these criteria yield unbiased parameters with the smallest variance? All of these questions will be answered. Look at the example data in Table \(\PageIndex{1.1}\).

To explain this data by a straight line regression model,

\[y = a_{0} + a_{1}x\;\;\;\;\;\;\;\;\;\;\;\;(\PageIndex{1.4}) \nonumber\]

Let us use minimizing \(\displaystyle \sum_{i = 1}^{n}E_{i}\) as a criterion to find \(a_{0}\) and \(a_{1}\). Assume, arbitrarily, that

\[y = 4x - 4\;\;\;\;\;\;\;\;\;\;\;\;(\PageIndex{1.5}) \nonumber\]

as the resulting regression model (Figure \(\PageIndex{1.2}\)).

Plot showing the given data points with the regression curve y=4x-4.

The sum of the residuals \(\displaystyle \sum_{i = 1}^{4}E_{i} = 0\) is shown in Table \(\PageIndex{1.2}\).

So does this give us the smallest possible sum of residuals? For this data, it does as \(\displaystyle \sum_{i = 1}^{4}E_{i} = 0,\) and it cannot be made any smaller. But does it give unique values for the parameters of the regression model? No, because, for example, a straight-line model (Figure \(\PageIndex{1.3}\))

\[y = 6\;\;\;\;\;\;\;\;\;\;\;\;(\PageIndex{1.6}) \nonumber\]

also gives \(\displaystyle \sum_{i = 1}^{4}E_{i} = 0\) as shown in Table \(\PageIndex{1.3}\).

In fact, there are many other straight lines for this data for which the sum of the residuals \(\displaystyle \sum_{i = 1}^{4}E_{i} = 0\) . We hence find the regression models are not unique, and therefore this criterion of minimizing the sum of the residuals is a bad one.

Plot showing the given data points with the regression curve y=6.

You may think that the reason the criterion of minimizing \(\displaystyle \sum_{i = 1}^{n}E_{i}\) does not work is because negative residuals cancel with positive residuals. So, is minimizing the sum of absolute values of the residuals, that is, \(\displaystyle \sum_{i = 1}^{n}\left| E_{i} \right|\) better? Let us look at the same example data given in Table \(\PageIndex{1.1}\). For the regression model \(y = 4x - 4\) , the sum of the absolute value of residuals \(\displaystyle \sum_{i = 1}^{4}\left| E_{i} \right| = 4\) is shown in Table \(\PageIndex{1.4}\).

The value of \(\displaystyle \sum_{i = 1}^{4}\left| E_{i} \right| = 4\) also exists for the straight-line model \(y = 6.\) (see Table \(\PageIndex{1.5}\)).

No other straight-line model that you may choose for this data has \(\displaystyle \sum_{i = 1}^{4}\left| E_{i} \right| < 4\) . And there are many other straight lines for which the sum of absolute values of the residuals \(\displaystyle \sum_{i = 1}^{4}\left| E_{i} \right| = 4\) . We hence find that the regression models are not unique, and hence the criterion of minimizing the sum of the absolute values of the residuals is also a bad one.

To get a unique regression model, the least-squares criterion where we minimize the sum of the square of the residuals

\[\begin{split} S_{r} &= \sum_{i = 1}^{n}{E_{i}}^{2}\\ &= \sum_{i = 1}^{n}(y_i-a_0- a_1x_i)^{2}\;\;\;\;\;\;\;\;\;\;\;\;(\PageIndex{1.7}) \end{split}\]

is recommended. The formulas obtained for the regression constants \(a_0\) and \(a_1\) are given below and will be derived in the next lesson.

\[\displaystyle a_{0} = \frac{\displaystyle\sum_{i = 1}^{n}y_{i}\sum_{i = 1}^{n}x_{i}^{2} - \sum_{i = 1}^{n}x_{i}\sum_{i = 1}^{n}{x_{i}y_{i}}}{\displaystyle n\sum_{i = 1}^{n}x_{i}^{2} \ -\left( \sum_{i = 1}^{n}x_{i} \right)^{2}}\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{1.8}) \nonumber\]

\[\displaystyle a_{1} = \frac{\displaystyle n\sum_{i = 1}^{n}{x_{i}y_{i}} - \sum_{i = 1}^{n}x_{i}\sum_{i = 1}^{n}y_{i}}{\displaystyle n\sum_{i = 1}^{n}x_{i}^{2}-\left( \sum_{i = 1}^{n}x_{i} \right)^{2}}\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{1.9}) \nonumber\]

The formula for \(a_0\) can also be written as

\[\begin {split} \displaystyle a_{0} &= \frac{\displaystyle \sum_{i = 1}^{n}y_{i}}{n} -a_1\frac{\displaystyle \sum_{i = 1}^{n}x_{i}}{n} \\ &= \bar{y} - a_{1}\bar{x} \end{split}\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{1.10}) \nonumber\]
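A direct transcription of Equations \((\PageIndex{1.9})\) and \((\PageIndex{1.10})\) into R (a sketch; the data values are invented):

```r
x <- c(1, 2, 3, 4)
y <- c(1.5, 4.1, 6.2, 8.8)
n <- length(x)
a1 <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)  # Eq. (1.9)
a0 <- mean(y) - a1 * mean(x)                                          # Eq. (1.10)
c(a0 = a0, a1 = a1)
```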

Audiovisual Lecture

Title : Linear Regression - Background

Summary : This video is about learning the background of linear regression of how the minimization criterion is selected to find the constants of the model.

Lesson 2: Straight-Line Regression Model without an Intercept

Learning Objectives

After successful completion of this lesson, you should be able to:

1) derive the constants of a linear regression model without an intercept,

2) use the derived formula to find the constants of the regression model without an intercept from given data.

In this model, we wish to predict response to \(n\) data points \(\left( x_{1},y_{1} \right),\left( x_{2},y_{2} \right),\ldots\ldots,\left( x_{n},y_{n} \right)\) by a regression model given by

\[y = a_{1}x\;\;\;\;\;\;\;\;\;\;\;\;(\PageIndex{2.1}) \nonumber\]

where \(a_{1}\) is the only constant of the regression model.

A measure of goodness of fit, that is, how well \(a_{1}x\) predicts the response variable \(y\) is the sum of the square of the residuals, \(S_{r}\)

\[\begin{split} S_{r} &= \sum_{i = 1}^{n}{E_{i}}^{2}\\ &= \sum_{i = 1}^{n}\left( y_{i} - a_{1}x_{i} \right)^{2}\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{2.2}) \end{split}\]

To find \(a_{1},\) we look for the value of \(a_{1}\) for which \(S_{r}\) is the absolute minimum.

We will begin by conducting the first derivative test. Take the derivative of Equation \((\PageIndex{2.2})\)

\[\frac{dS_{r}}{da_{1}} = 2\sum_{i = 1}^{n}{\left( y_{i} - a_{1}x_{i} \right)\left( - x_{i} \right)} = 0\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{2.3}) \nonumber\]

Now putting

\[\frac{dS_{r}}{da_{1}} = 0 \nonumber\]

\[2\sum_{i = 1}^{n}{\left( y_{i} - a_{1}x_{i} \right)\left( - x_{i} \right)} = 0 \nonumber\]

\[- 2\sum_{i = 1}^{n}{y_{i}x_{i} + 2\sum_{i = 1}^{n}{a_{1}x_{i}^{2}}} = 0 \nonumber\]

\[- 2\sum_{i = 1}^{n}{y_{i}x_{i} + {2a}_{1}\sum_{i = 1}^{n}x_{i}^{2}} = 0 \nonumber\]

Solving the above equation for \(a_{1}\) gives

\[a_{1} = \frac{\displaystyle \sum_{i = 1}^{n}{y_{i}x_{i}}}{\displaystyle \sum_{i = 1}^{n}x_{i}^{2}}\;\;\;\;\;\;\;\;\;\;\;\;(\PageIndex{2.4}) \nonumber\]

Let’s conduct the second derivative test.

\[\begin{split} \frac{d^{2}S_{r}}{d{a_{1}}^{2}} &= \frac{d}{da_{1}}\left( 2\sum_{i = 1}^{n}{\left( y_{i} - a_{1}x_{i} \right)\left( - x_{i} \right)} \right)\\ &= \frac{d}{da_{1}} \sum_{i = 1}^{n} (-2 x_{i}y_{i} + 2a_{1}{x_{i}}^{2}) \\ &= \sum_{i = 1}^{n} 2x_{i}^{2} > 0\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{2.5}) \end{split}\]

for at most one \(x_{i} \neq 0,\) which is a pragmatic assumption that all the \(x\) -values are not zero.

This inequality shows that the value of \(a_{1}\) given by Equation \((\PageIndex{2.4})\) corresponds to a local minimum. Moreover, since the sum of the squares of the residuals, \(S_{r}\), is a continuous function of \(a_{1}\), has only one point where \(\displaystyle \frac{dS_{r}}{da_{1}} = 0,\) and satisfies \(\displaystyle \frac{d^{2}S_{r}}{d{a_{1}}^{2}} > 0\) at that point, this local minimum is also the absolute minimum. Hence, Equation \((\PageIndex{2.4})\) gives us the value of the constant, \(a_1\), of the regression model \(y=a_1x\).
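Equation \((\PageIndex{2.4})\) translates directly into code. A minimal Python sketch (NumPy assumed; the data arrays are placeholders):

```python
import numpy as np

# Placeholder data for the no-intercept model y = a1 * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Equation (2.4): a1 = sum(x_i * y_i) / sum(x_i^2)
a1 = np.sum(x * y) / np.sum(x**2)
print(f"a1 = {a1:.4f}")
```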

Example \(\PageIndex{2.1}\)

To find the longitudinal modulus of a composite material, the following data, as given in Table \(\PageIndex{2.1}\), is collected.

Find the longitudinal modulus \(E\) using the following regression model:

\[\sigma = E\varepsilon \nonumber\]

Solution

The data from Table \(\PageIndex{2.1}\), rewritten in the base SI system of units, is given in Table \(\PageIndex{2.2}\).

Using Equation \((\PageIndex{2.4})\) gives

\[E = \frac{\displaystyle \sum_{i = 1}^{n}{\sigma_{i}\varepsilon_{i}}}{\displaystyle \sum_{i = 1}^{n}{\varepsilon_{i}}^{2}}\;\;\;\;\;\;\;\;\;\;\;\;(\PageIndex{2.E1.1}) \nonumber\]

The summations used in Equation \((\PageIndex{2.E1.1})\) are given in Table \(\PageIndex{2.3}\).

\[n = 12 \nonumber\]

\[\sum_{i = 1}^{12}{\varepsilon_{i}^{2} = 1.2764 \times 10^{- 3}} \nonumber\]

\[\sum_{i = 1}^{12}{\sigma_{i}\varepsilon_{i} = 2.3337 \times 10^{8}} \nonumber\]

From Equation \((\PageIndex{2.E1.1})\)

\[\begin{split} E &= \frac{\displaystyle \sum_{i = 1}^{12}{\sigma_{i}\varepsilon_{i}}}{\displaystyle \sum_{i = 1}^{12}{\varepsilon_{i}}^{2}} \\ &= \frac{2.3337 \times 10^{8}}{1.2764 \times 10^{- 3}}\\ &= 1.8284 \times 10^{11}\ \text{Pa}\\ &= 182.84 \text{ GPa}\end{split}\]
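The final division is easy to check in a few lines of Python; only the two summations reported above are used, since the raw stress-strain table is not reproduced here:

```python
# Summations reported in the example (base SI units)
sum_sigma_eps = 2.3337e8   # sum of sigma_i * epsilon_i
sum_eps_sq = 1.2764e-3     # sum of epsilon_i^2

E = sum_sigma_eps / sum_eps_sq   # Equation (2.E1.1)
print(f"E = {E:.4e} Pa = {E / 1e9:.2f} GPa")   # about 1.8284e+11 Pa = 182.84 GPa
```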

Figure: Stress vs. strain data and the linear regression model for the composite material uniaxial test.

Title : Linear Regression with Zero Intercept: Derivation

Summary : This video discusses how to regress data to a linear polynomial with zero constant term (no intercept). This segment shows you the derivation and also explains why using the formula for a general straight line is not valid for this case.

Title : Linear Regression with Zero Intercept: Example

Summary : This video shows an example of how to conduct linear regression with zero intercept.

Lesson 3: Theory of General Straight-Line Regression Model

After successful completion of this lesson, you should be able to:

1) derive the constants of a linear regression model based on the least-squares method criterion.

In this model, we best fit a general straight line \(y=a_0 +a_1x\) to the \(n\) data points \((x_1,y_1),\ (x_2,y_2),\ldots,\ (x_n,y_n)\)

Let us use the least-squares criterion where we minimize the sum of the square of the residuals, \(S_{r}\):

\[\begin{split} S_{r} &= \sum_{i = 1}^{n}{E_{i}}^{2}\\&= \sum_{i = 1}^{n}\left( y_{i} - a_{0} - a_{1}x_{i} \right)^{2}\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{3.1}) \end{split}\]

Figure: Data points fitted with a linear regression model, showing the residual and the square of the residual at each point.

To find \(a_{0}\) and \(a_{1}\) , we need to calculate where the sum of the square of the residuals, \(S_{r}\) is the absolute minimum. We start this process of finding the absolute minimum first by

  • taking the partial derivatives of \(S_{r}\) with respect to \(a_{0}\) and \(a_{1}\) and setting them equal to zero, and
  • conducting the second derivative test.

Taking the partial derivatives of \(S_{r}\) with respect to \(a_{0}\) and \(a_{1}\) and setting them equal to zero gives

\[\frac{\partial S_{r}}{\partial a_{0}} = 2\sum_{i = 1}^{n}{\left( y_{i} - a_{0} - a_{1}x_{i} \right)\left( - 1 \right)} = 0\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{3.2}) \nonumber\]

\[\frac{\partial S_{r}}{\partial a_{1}} = 2\sum_{i = 1}^{n}{\left( y_{i} - a_{0} - a_{1}x_{i} \right)\left( - x_{i} \right)} = 0\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{3.3}) \nonumber\]

Dividing both sides by \(2\) and expanding the summations in Equations \((\PageIndex{3.2})\) and \((\PageIndex{3.3})\) gives,

\[- \sum_{i = 1}^{n}{y_{i} + \sum_{i = 1}^{n}a_{0} + \sum_{i = 1}^{n}{a_{1}x_{i}}} = 0 \nonumber\]

\[- \sum_{i = 1}^{n}{y_{i}x_{i} + \sum_{i = 1}^{n}{a_{0}x_{i}} + \sum_{i = 1}^{n}{a_{1}x_{i}^{2}}} = 0 \nonumber\]

Noting that

\[\sum_{i = 1}^{n}a_{0} = a_{0} + a_{0} + \ldots + a_{0} = na_{0} \nonumber\]

the two equations simplify to

\[na_{0} + a_{1}\sum_{i = 1}^{n}x_{i} = \sum_{i = 1}^{n}y_{i}\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{3.4}) \nonumber\]

\[a_{0}\sum_{i = 1}^{n}x_{i} + a_{1}\sum_{i = 1}^{n}x_{i}^{2} = \sum_{i = 1}^{n}{x_{i}y_{i}}\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{3.5}) \nonumber\]

Solving the above two simultaneous linear equations \((\PageIndex{3.4})\) and \((\PageIndex{3.5})\) gives

\[a_{1} = \frac{n \displaystyle \sum_{i = 1}^{n}{x_{i}y_{i}} - \sum_{i = 1}^{n}x_{i} \sum_{i = 1}^{n}y_{i}}{n \displaystyle \sum_{i = 1}^{n}x_{i}^{2} - \left( \sum_{i = 1}^{n}x_{i} \right)^{2}}\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{3.6}) \nonumber\]

\[a_{0} = \frac{\displaystyle \sum_{i = 1}^{n}x_{i}^{2}\ \sum_{i = 1}^{n}y_{i} - \sum_{i = 1}^{n}x_{i} \sum_{i = 1}^{n}{x_{i}y_{i}}}{n\displaystyle \sum_{i = 1}^{n}x_{i}^{2} - \left( \sum_{i = 1}^{n}x_{i} \right)^{2}}\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{3.7}) \nonumber\]

Defining

\[S_{xy} = \sum_{i = 1}^{n}{x_{i}y_{i}} - n\bar{x}\bar{y} \nonumber\]

\[S_{xx} = \sum_{i = 1}^{n}x_{i}^{2} - n \bar{x}^{2} \nonumber\]

\[\bar{x} = \frac{\displaystyle \sum_{i = 1}^{n}x_{i}}{n} \nonumber\]

\[\bar{y} = \frac{\displaystyle \sum_{i = 1}^{n}y_{i}}{n} \nonumber\]

we can also rewrite the constants \(a_{0}\) and \(a_{1}\) from Equations \((\PageIndex{3.6})\) and \((\PageIndex{3.7})\) as

\[a_{1} = \frac{S_{xy}}{S_{xx}}\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{3.8}) \nonumber\]

\[a_{0} = \bar{y} - a_{1}\bar{x}\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{3.9}) \nonumber\]
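The \(S_{xy}/S_{xx}\) form is usually the most convenient one to code. A minimal Python sketch (NumPy assumed; the data is illustrative) implementing Equations \((\PageIndex{3.8})\) and \((\PageIndex{3.9})\):

```python
import numpy as np

# Illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
n = len(x)

xbar, ybar = np.mean(x), np.mean(y)
Sxy = np.sum(x * y) - n * xbar * ybar   # S_xy
Sxx = np.sum(x**2) - n * xbar**2        # S_xx

a1 = Sxy / Sxx          # Equation (3.8)
a0 = ybar - a1 * xbar   # Equation (3.9)
print(f"y = {a0:.4f} + {a1:.4f} x")
```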

Setting the first-derivative equations equal to zero only gives us a critical point. For a general function, a critical point could be a local minimum, a local maximum, a saddle point, or none of these. The second derivative test, given in the "optional" appendix below, shows that our critical point is a local minimum. Is this local minimum also the absolute minimum? Yes, because the first derivative test gave us only one solution, and \(S_{r}\) is a continuous function of \(a_{0}\) and \(a_{1}\).

Appendix: Absolute Minimum of the Sum of the Squares of the Residuals

Given \(n\) data pairs, \(\left( x_{1},y_{1} \right),\ldots,\left( x_{n},y_{n} \right)\), do the values of the two constants \(a_{0}\) and \(a_{1}\) in the least-squares straight-line regression model \(y = a_{0} + a_{1}x\) correspond to the absolute minimum of the sum of the squares of the residuals? Are these constants of regression unique?

Given \(n\) data pairs \(\left( x_{1},y_{1} \right),\ldots,\left( x_{n},y_{n} \right)\) , the best fit for the straight-line regression model

\[y = a_{0} + a_{1}x\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{A.1}) \nonumber\]

is found by the method of least squares. Starting with the sum of the squares of the residuals \(S_{r}\)

\[S_{r} = \sum_{i = 1}^{n}\left( y_{i} - a_{0} - a_{1}x_{i} \right)^{2}\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{A.2}) \nonumber\]

and setting the partial derivatives of \(S_{r}\) with respect to \(a_{0}\) and \(a_{1}\) equal to zero,

\[\frac{\partial S_{r}}{\partial a_{0}} = 0\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{A.3}) \nonumber\]

\[\frac{\partial S_{r}}{\partial a_{1}} = 0\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{A.4}) \nonumber\]

gives two simultaneous linear equations whose solution is

\[a_{1} = \frac{\displaystyle n\sum_{i = 1}^{n}{x_{i}y_{i}} - \sum_{i = 1}^{n}x_{i}\sum_{i = 1}^{n}y_{i}}{\displaystyle n\sum_{i = 1}^{n}x_{i}^{2} - \left( \sum_{i = 1}^{n}x_{i} \right)^{2}}\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{A.5a}) \nonumber\]

\[a_{0} = \frac{\displaystyle \sum_{i = 1}^{n}x_{i}^{2}\sum_{i = 1}^{n}y_{i} - \sum_{i = 1}^{n}x_{i}\sum_{i = 1}^{n}{x_{i}y_{i}}}{\displaystyle n\sum_{i = 1}^{n}x_{i}^{2} - \left( \sum_{i = 1}^{n}x_{i} \right)^{2}}\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{A.5b}) \nonumber\]

But do these values of \(a_{0}\) and \(a_{1}\) give the absolute minimum value of \(S_{r}\) (Equation \((\PageIndex{A.2})\))? The first-derivative analysis only tells us that these values correspond to a critical point of \(S_{r}\), not whether that point is a local minimum, a local maximum, a saddle point, or an absolute minimum or maximum. So, we still need to figure out if they correspond to an absolute minimum.

We need first to conduct a second derivative test to find out whether the point \((a_{0},a_{1})\) from Equation \((\PageIndex{A.5})\) gives a local minimum of \(S_r\) . Only then can we show if this local minimum also corresponds to the absolute minimum (or maximum).

What is the second derivative test for a local minimum of a function of two variables?

If we have a function \(f\left( x,y \right)\) and have found a critical point \(\left( a,b \right)\) from the first derivative test, then \(\left( a,b \right)\) is a local minimum if

\[\frac{\partial^{2}f}{\partial x^{2}}\frac{\partial^{2}f}{\partial y^{2}} - \left( \frac{\partial^{2}f}{\partial x\partial y} \right)^{2} > 0,\ \text{and}\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{A.6}) \nonumber\]

\[\frac{\partial^{2}f}{\partial x^{2}} > 0\ \text{or}\ \frac{\partial^{2}f}{\partial y^{2}} > 0\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{A.7}) \nonumber\]

From Equation \((\PageIndex{A.2})\)

\[\begin{split} \frac{\partial S_{r}}{\partial a_{0}} &= \sum_{i = 1}^{n}{2\left( y_{i} - a_{0} - a_{1}x_{i} \right)( - 1)}\\ &= - 2\sum_{i = 1}^{n}\left( y_{i} - a_{0} - a_{1}x_{i} \right)\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{A.8}) \end{split}\]

\[\begin{split} \frac{\partial S_{r}}{\partial a_{1}} &= \sum_{i = 1}^{n}{2\left( y_{i} - a_{0} - a_{1}x_{i} \right)}( - x_{i})\\ &= - 2\sum_{i = 1}^{n}\left( x_{i}y_{i} - a_{0}x_{i} - a_{1}x_{i}^{2} \right)\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{A.9}) \end{split}\]

\[\begin{split} \frac{\partial^{2}S_{r}}{\partial a_{0}^{2}} &= - 2\sum_{i = 1}^{n}{- 1}\\ &= 2n\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{A.10}) \end{split}\]

\[\frac{\partial^{2}S_{r}}{\partial a_{1}^{2}} = 2\sum_{i = 1}^{n}x_{i}^{2}\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{A.11}) \nonumber\]

\[\frac{\partial^{2}S_{r}}{\partial a_{0}\partial a_{1}} = 2\sum_{i = 1}^{n}x_{i}\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{A.12}) \nonumber\]

So, condition \((\PageIndex{A.7})\) is satisfied, because from Equation \((\PageIndex{A.10})\) we see that \(2n\) is a positive number. Although not required, from Equation \((\PageIndex{A.11})\) we see that \(\displaystyle 2\sum_{i = 1}^{n}{x_{i}^{2}\ }\) is also positive, under the reasonable assumption that not all the \(x\)-values are zero.

Is the other condition (Equation \((\PageIndex{A.6})\)) for \(S_{r}\) being a minimum met? Yes: the quantity is positive, as the last step of the calculation below shows by rewriting it as a sum of squares (the algebraic identity is stated without proof).

\[\begin{split} \frac{\partial^{2}S_{r}}{\partial a_{0}^{2}}\frac{\partial^{2}S_{r}}{\partial a_{1}^{2}} - \left( \frac{\partial^{2}S_{r}}{\partial a_{0}\partial a_{1}} \right)^{2} &= \left( 2n \right)\left( 2\sum_{i = 1}^{n}x_{i}^{2} \right) - \left( 2\sum_{i = 1}^{n}x_{i} \right)^{2}\\ &= 4\left\lbrack n\sum_{i = 1}^{n}x_{i}^{2} - \left( \sum_{i = 1}^{n}x_{i} \right)^{2} \right\rbrack\\ &= 4\sum_{1 \leq i < j \leq n}{(x_{i} - x_{j})^{2}} > 0\;\;\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{A.13}) \end{split}\]

So, the values of \(a_{0}\) and \(a_{1}\) in Equation \((\PageIndex{A.5})\) do correspond to a local minimum of \(S_{r}\) (provided at least two of the \(x_{i}\) are distinct, so that the final sum in Equation \((\PageIndex{A.13})\) is strictly positive). Is this local minimum also the absolute minimum? Yes, because the first derivative test gave us only one solution, and \(S_{r}\) is a continuous function of \(a_{0}\) and \(a_{1}\).

As a side note, the denominator in Equations \((\PageIndex{A.5a})\) and \((\PageIndex{A.5b})\) is nonzero, as shown by Equation \((\PageIndex{A.13})\). This nonzero value proves that \(a_{0}\) and \(a_{1}\) are finite numbers.
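If the sum-of-squares identity used in the last step of Equation \((\PageIndex{A.13})\) is unfamiliar, it can be spot-checked numerically; a small Python sketch with arbitrary data:

```python
import numpy as np
from itertools import combinations

x = np.array([1.0, 2.5, 4.0, 7.5])   # arbitrary values

lhs = len(x) * np.sum(x**2) - np.sum(x)**2
rhs = sum((xi - xj)**2 for xi, xj in combinations(x, 2))
print(lhs, rhs)   # the two agree, and are positive unless all x_i are equal
```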

Title : Derivation of Linear Regression

Summary : This video shows how the formulas for the constants of the straight-line regression model are derived.

Lesson 4: Application of General Straight-Line Regression Model

After successful completion of this lesson, you should be able to:

1) calculate the constants of a linear regression model.

In the previous lesson, we derived the formulas for the linear regression model. In this lesson, we show the application of the formulas to an applied engineering problem.

Example \(\PageIndex{4.1}\)

The torque \(T\) needed to turn the torsional spring of a mousetrap through an angle \(\theta\) is given below.

Find the constants \(k_{1}\) and \(k_{2}\) of the regression model

\[T = k_{1} + k_{2}\theta\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{4.E1.1}) \nonumber\]

Solution

For the linear regression model,

\[T = k_{1} + k_{2}\theta \nonumber\]

the constants of the regression model are given by

\[k_{2} = \frac{\displaystyle n\sum_{i = 1}^{5}{\theta_{i}T_{i}} - \sum_{i = 1}^{5}\theta_{i}\sum_{i = 1}^{5}T_{i}}{\displaystyle n\sum_{i = 1}^{5}\theta_{i}^{2} - \left( \sum_{i = 1}^{5}\theta_{i} \right)^{2}}\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{4.E1.2}) \nonumber\]

\[k_{1} = \bar{T} - k_{2}\bar{\theta}\;\;\;\;\;\;\;\;\;\;\;\; (\PageIndex{4.E1.3}) \nonumber\]

Table \(\PageIndex{4.2}\) shows the summations needed for the calculation of the above two constants \(k_{1}\) and \(k_{2}\) of the regression model.

Using the summations from the last row of Table \(\PageIndex{4.2}\), we get

\[n = 5 \nonumber\]

From Equation \((\PageIndex{4.E1.2})\),

\[\begin{split} k_{2} &= \frac{\displaystyle n\sum_{i = 1}^{5}{\theta_{i}T_{i}} - \sum_{i = 1}^{5}\theta_{i}\sum_{i = 1}^{5}T_{i}}{\displaystyle n\sum_{i = 1}^{5}\theta_{i}^{2} - \left( \sum_{i = 1}^{5}\theta_{i} \right)^{2}}\\[4pt] &= \frac{5(1.5896) - (6.2831)(1.1921)}{5(8.8491) - (6.2831)^{2}}\\[4pt] &= 9.6091 \times 10^{- 2}\text{N-m/rad} \end{split}\]

To find \(k_{1}\)

\[\begin{split} \bar{T} &= \frac{\displaystyle \sum_{i = 1}^{5}T_{i}}{n}\\ &= \frac{1.1921}{5}\\ &= 2.3842 \times 10^{- 1}\ \text{N-m} \end{split}\]

\[\begin{split} \bar{\theta} &= \frac{\displaystyle \sum_{i = 1}^{5}\theta_{i}}{n}\\ &= \frac{6.2831}{5}\\ &= 1.2566\ \text{radians} \end{split}\]

From Equation \((\PageIndex{4.E1.3})\),

\[\begin{split} k_{1} &= \bar{T} - k_{2}\bar{\theta}\\ &= 2.3842 \times 10^{- 1} - (9.6091 \times 10^{- 2})(1.2566)\\ &= 1.1767 \times 10^{- 1} \text{N-m} \end{split}\]
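The arithmetic of this example can be reproduced from the reported summations alone; a short Python sketch (the four sums are copied from the text above):

```python
n = 5
sum_theta  = 6.2831   # sum of theta_i (radians)
sum_T      = 1.1921   # sum of T_i (N-m)
sum_thetaT = 1.5896   # sum of theta_i * T_i
sum_theta2 = 8.8491   # sum of theta_i^2

# Equation (4.E1.2)
k2 = (n * sum_thetaT - sum_theta * sum_T) / (n * sum_theta2 - sum_theta**2)
# Equation (4.E1.3): k1 = Tbar - k2 * thetabar
k1 = sum_T / n - k2 * sum_theta / n
print(f"T = {k1:.5f} + {k2:.5f} theta")   # approximately 0.11767 + 0.096091 theta
```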

Title : Linear Regression Applications

Summary : This video will teach you, through an example, how to regress data to a straight line.

Multiple Choice Test

(1). Given \(\left( x_{1},y_{1} \right),\left( x_{2},y_{2} \right),\ldots,\left( x_{n},y_{n} \right),\) best fitting data to \(y = f\left( x \right)\) by least squares requires minimization of

(A) \(\displaystyle \sum_{i = 1}^{n}\left\lbrack y_{i} - f\left( x_{i} \right) \right\rbrack\)

(B) \(\displaystyle \sum_{i = 1}^{n}\left| y_{i} - f\left( x_{i} \right) \right|\)

(C) \(\displaystyle \sum_{i = 1}^{n}\left\lbrack y_{i} - f\left( x_{i} \right) \right\rbrack^{2}\)

(D) \(\displaystyle \sum_{i = 1}^{n}(y_{i} - \bar{y})^{2},\ \bar{y} = \frac{\displaystyle \sum_{i = 1}^{n}y_{i}}{n}\)

(2). The following data

is regressed with least squares regression to \(y = a_{0} + a_{1}x\) . The value of \(a_{1}\) most nearly is

(A) \(27.480\)

(B) \(28.956\)

(C) \(32.625\)

(D) \(40.000\)

(3). The following data

is regressed with least squares regression to \(y = a_{1}x\) . The value of \(a_{1}\) most nearly is

(4). An instructor gives the same \(y\) vs. \(x\) data as given below to four students and asks them to regress the data with least squares regression to \(y = a_{0} + a_{1}x\) .

They each come up with four different answers for the straight-line regression model. Only one is correct. Which one is the correct model? (additional exercise - without using the regression formulas for \(a_0\) and \(a_1,\) can you find the correct model?)

(A) \(y = 60x - 1200\)

(B) \(y = 30x - 200\)

(C) \(y = - 139.43 + 29.684x\)

(D) \(y = 1 + 22.782x\)

(5). A torsion spring of a mousetrap is twisted through an angle of \(180^\circ\) . The torque vs. angle data is given below.

The relationship between the torque and the angle is \(T = a_{0} + a_{1}\theta\) .

The amount of strain energy stored in the mousetrap spring in Joules is

(A) \(0.29872\)

(B) \(0.41740\)

(C) \(0.84208\)

(D) \(1561.8\)

(6). A scientist finds that regressing the \(y\) vs. \(x\) data given below to \(y = a_{0} + a_{1}x\) results in the coefficient of determination for the straight-line model, \(r^{2}\), being zero.

The missing value for \(y\) at \(x = 17\) most nearly is

(A) \(-2.4444\)

(B) \(2.0000\)

(C) \(6.8889\)

(D) \(34.000\)

For the complete solutions, go to

http://nm.mathforcollege.com/mcquizzes/06reg/quiz_06reg_linear_solution.pdf

Problem Set

(1). Given the following data of \(y\) vs. \(x\)

The data is regressed to a straight line \(y = - 7 + 6x\) . What is the residual at \(x = 4\) ?

(2). The force vs. displacement data for a linear spring is given below. \(F\) is the force in Newtons and \(x\) is the displacement in meters. Assume displacement data is known more accurately.

If the \(F\) vs \(x\) data is regressed to \(F = a + kx\) , what is the value of \(k\) by minimizing the sum of the square of the residuals?

Answer: \(30\ \text{N}/\text{m}\)

(3). A torsion spring of a mousetrap is twisted through an angle of \(180^{\circ}\) . The torque vs. angle data is given below.

Assuming that the torque and the angle are related via a general straight line as \(T = k_{0} + k_{1}\ \theta\) , regress the above data to the straight-line model.

Answer: \(T = 0.06567 + 1.7750\,\theta\)

(4). The force vs. displacement data for a linear spring is given below. \(F\) is the force in Newtons and \(x\) is the displacement in meters. Assume displacement data is known more accurately.

If the \(F\) vs. \(x\) data is regressed to \(F = kx\) , what is the value of \(k\) by minimizing the sum of the square of the residuals?

Answer: \(16.55\ \text{N}/\text{m}\)

(5). Given the following data of \(y\) vs. \(x\)

If the \(y\) vs. \(x\) data is regressed to a constant line given by \(y = a\), where \(a\) is a constant, what is the value of \(a\) by minimizing the sum of the square of the residuals?

(6). To find the longitudinal modulus of a composite material, the following data is given.

Find the longitudinal modulus, \(E\), using the regression model. (Hint: \(\sigma = E\varepsilon\) )

Answer: \(182.8\ \text{GPa}\)


sklearn.linear_model.LinearRegression

Ordinary least squares Linear Regression.

LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.

Parameters:

  • fit_intercept : bool, default=True. Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).

  • copy_X : bool, default=True. If True, X will be copied; else, it may be overwritten.

  • n_jobs : int, default=None. The number of jobs to use for the computation. This will only provide speedup in case of sufficiently large problems, that is if firstly n_targets > 1 and secondly X is sparse or if positive is set to True. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

  • positive : bool, default=False. When set to True, forces the coefficients to be positive. This option is only supported for dense arrays. New in version 0.24.

Attributes:

  • coef_ : Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features.

  • rank_ : Rank of matrix X. Only available when X is dense.

  • singular_ : Singular values of X. Only available when X is dense.

  • intercept_ : Independent term in the linear model. Set to 0.0 if fit_intercept = False.

  • n_features_in_ : Number of features seen during fit.

  • feature_names_in_ : Names of features seen during fit. Defined only when X has feature names that are all strings. New in version 1.0.

See also:

  • Ridge : Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of the coefficients with l2 regularization.

  • Lasso : The Lasso is a linear model that estimates sparse coefficients with l1 regularization.

  • ElasticNet : Elastic-Net is a linear regression model trained with both l1- and l2-norm regularization of the coefficients.

Notes: From the implementation point of view, this is just plain Ordinary Least Squares (scipy.linalg.lstsq) or Non Negative Least Squares (scipy.optimize.nnls) wrapped as a predictor object.
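For reference, a minimal usage sketch of this estimator (made-up data; the calls shown are the standard fit, predict, and score methods documented below):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # shape (n_samples, n_features)
y = np.array([2.1, 3.9, 6.2, 7.8])

reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)        # estimated slope(s) and intercept
print(reg.score(X, y))                  # coefficient of determination R^2
print(reg.predict(np.array([[5.0]])))   # prediction at a new point
```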

fit(X, y, sample_weight=None)

Fit linear model.

Parameters:

  • X : training data.
  • y : target values. Will be cast to X's dtype if necessary.
  • sample_weight : individual weights for each sample. New in version 0.17: parameter sample_weight support to LinearRegression.

Returns: the fitted Estimator.

get_metadata_routing()

Get metadata routing of this object. Please check the User Guide on how the routing mechanism works. Returns a MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator. If deep is True, will return the parameters for this estimator and contained subobjects that are estimators. Returns parameter names mapped to their values.

predict(X)

Predict using the linear model. Returns predicted values.

score(X, y, sample_weight=None)

Return the coefficient of determination of the prediction.

The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.

Parameters:

  • X : test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
  • y : true values for X.
  • sample_weight : sample weights.

Returns: \(R^2\) of self.predict(X) w.r.t. y.

Notes: The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with the default value of r2_score. This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

set_fit_request(*, sample_weight=...)

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config). Please see the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True : metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
  • False : metadata is not requested and the meta-estimator will not pass it to fit.
  • None : metadata is not requested, and the meta-estimator will raise an error if the user provides it.
  • str : metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others. New in version 1.3.

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters: sample_weight : metadata routing for the sample_weight parameter in fit.

Returns: the updated object.

set_params(**params)

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters: **params : estimator parameters.

Returns: the estimator instance.

set_score_request(*, sample_weight=...)

Request metadata passed to the score method. The options for the parameter are the same as for set_fit_request:

  • True : metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
  • False : metadata is not requested and the meta-estimator will not pass it to score.

Parameters: sample_weight : metadata routing for the sample_weight parameter in score.

Returns: the updated object.

Examples using sklearn.linear_model.LinearRegression:

  • Principal Component Regression vs Partial Least Squares Regression
  • Plot individual and voting regression predictions
  • Comparing Linear Bayesian Regressors
  • Linear Regression Example
  • Logistic function
  • Non-negative least squares
  • Ordinary Least Squares and Ridge Regression Variance
  • Quantile regression
  • Robust linear estimator fitting
  • Robust linear model estimation using RANSAC
  • Sparsity Example: Fitting only features 1 and 2
  • Theil-Sen Regression
  • Failure of Machine Learning to infer causal effects
  • Face completion with a multi-output estimators
  • Isotonic Regression
  • Metadata Routing
  • Plotting Cross-Validated Predictions
  • Underfitting vs. Overfitting
  • Using KBinsDiscretizer to discretize continuous features

