Simple Linear Regression

[DRAFT, IN PROGRESS]

Introduction

As is often true, there is an xkcd cartoon that is appropriate:

We have a set of points to which we want to fit a straight line (not a constellation!!)

\[ y = \beta_0 + \beta_1 x \]

There are \(n\) subjects indexed by \(i\), where \(i = 1, \ldots, n\). We have two data variables \(x\) and \(y\), where \(x\) is called the predictor and \(y\) is called the response.

Example 1 (Galton and Pearson’s original example) The first person to identify and investigate the phenomenon of regression was the English scientist (well, polymath might be more accurate), Francis Galton. Regrettably, he was also the founder of eugenics, and was more generally interested in inheritance. He was probably as brilliant as his more well-known cousin, Charles Darwin, but is not as well-known. He was the first person to “use fingerprints in detective work”, among other things, but he is best known for his research into inheritance, and might also be considered one of the founders of statistics.

The question that he and his student, Karl Pearson, asked was: “Can we use the height of the father to predict the height of the son?”

2 data variables \(x\) and \(y\)
- \(x_i\) is the height of the father in family \(i\)
- \(y_i\) is the height of the son in family \(i\)

Here is a plot showing the famous father-son height data set (first collected by Pearson). Each point is a father-son pair. There are three lines, as you can see. The blue SD line is the line for which the slope is the standard deviation of the \(y\)-data (son’s heights) divided by the standard deviation of the \(x\)-data (father’s heights), whereas the dark red regression line has a flatter slope. This flattening is because the correlation is at most 1, and is called the Regression Effect. What is implies, for instance, is that if a father is \(n\) standard deviations above average, the son is predicted to be only \(r\times n\) standard deviations above average. That is, our prediction is closer to the mean, and this is why Galton called this phenomenon “regression to mediocrity, that is, to the middle*.

Note that each point \((x_i, y_i)\) represents a father–son pair; \(\bar{x} \approx 68''\), \(\bar{y} \approx 69''\), \(\mathrm{SD} \approx 3''\) for both;

The regression line approximates the average heights of the sons given the heights of their fathers. Geometrically, the dark red line above passes roughly through the centers of the vertical strips of the scatter plot.

Simple Linear Regression: The Model

The Statistical Model

\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \ldots, n \]

We call it simple linear regression because there is only one predictor.

Assumptions

\(\epsilon_i \overset{IID}\sim \mathcal{N}(0, \sigma^2)\) are iid random errors.
\(x_i\) are not random — they are known constants, called covariates.
\(\beta_0,\, \beta_1,\, \sigma^2\) are the unknown parameters of the model that we will need to estimate from our data.

Implications of the model:

Since the \(x_i\) are not random, the only randomness, according to our model, comes from the IID errors, which are normally distributed with mean \(0\). Since, by the model, \[ \begin{align*} Y_i &= \beta_0 + \beta_1 x_i + \epsilon_i\\ \implies E(Y_i\mid x_i) &= \beta_0 + \beta_1 x_i + E(\epsilon_i)\\ &= \beta_0 + \beta_1 x_i. \end{align*} \] Further, the \(Y_i\) will be normally distributed with the same variance as the errors, so this means that for every \(i\), \(Var(Y_i) = Var(\epsilon_i) = \sigma^2\). Putting these statements together, we see that \(Y_1, Y_2, \ldots, Y_n\) are independent, with \[ Y_i \sim \mathcal{N}(\beta_0 + \beta_1 x_i,\; \sigma^2). \] We define \(\hat Y_i\) to be this expected value, so \(\hat Y_i = E(Y_i\mid x_i)\). Since the \(Y_i\) are normal, the sample mean \(\bar Y\) must also be normal, with: \[ \bar{Y} \sim \mathcal{N}\!\left(\beta_0 + \beta_1 \bar{x},\; \frac{\sigma^2}{n}\right) \]

The true regression line is \(y = \beta_0 + \beta_1 x\). Our goal is to estimate this line from data, and use \(\sigma^2\), the variance of the errors, to compute the variance and SE of our estimates.

Estimating the intercept and slope (\(\beta_0\) and \(\beta_1\))

We minimize the sum of squared residuals (the squared loss function):

\[ S(\beta_0, \beta_1) = \sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 x_i)^2 = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 \]

where the model says \(Y_i = \beta_0 + \beta_1 x_i + \epsilon_i\), so the fitted value is \(\hat{Y}_i = \beta_0 + \beta_1 x_i = \mathbb{E}[Y_i \mid x_i]\).

Deriving \(\hat{\beta}_0\)

Differentiate with respect to \(\beta_0\) and set equal to zero:

\[ \frac{\partial S}{\partial \beta_0} = -2\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 x_i) = 0 \]

\[ \Rightarrow \sum_{i=1}^{n}(Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \]

\[ \Rightarrow n\hat{\beta}_0 = \sum_{i=1}^{n} Y_i - \hat{\beta}_1 \sum_{i=1}^{n} x_i \]

\[ \boxed{\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}} \]

Deriving \(\hat{\beta}_1\)

Define \(x_i' = x_i - \bar{x}\) (shifting everything horizontally will not change \(\beta_1\), but will change \(\beta_0\), so call the new intercept \(\beta_0'\)). Note that \(\sum_{i=1}^{n} x_i' = 0\).

Differentiate with respect to \(\beta_1\) and set equal to zero:

\[ \frac{\partial S}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i'(Y_i - \beta_0 - \beta_1 x_i') = 0 \quad \circledast \]

Replace \(x_i\) by \(x_i'\) (\(\beta_1\) does not change):

\[ \frac{\partial S}{\partial \beta_1} = 0 \;\Rightarrow\; \sum_{i=1}^{n} x_i' Y_i - \hat{\beta}_0' \underbrace{\sum_{i=1}^{n} x_i'}_{=\,0} - \hat{\beta}_1 \sum_{i=1}^{n}(x_i')^2 = 0 \]

This gives expression ①

\[ \hat{\beta}_1 = \frac{\displaystyle\sum_{i=1}^{n} x_i' Y_i}{\displaystyle\sum_{i=1}^{n}(x_i')^2} = \frac{\displaystyle\sum_{i=1}^{n}(x_i - \bar{x})\,Y_i}{\displaystyle\sum_{i=1}^{n}(x_i - \bar{x})^2} \tag{1} \]

Exercise 1 Show that expression (1) above equals expression (2) below:

\[ \boxed{ \hat{\beta}_1 = \frac{\displaystyle\sum_{i=1}^{n}(x_i - \bar{x})(Y_i - \bar{Y})}{\displaystyle\sum_{i=1}^{n}(x_i - \bar{x})^2} } \tag{2} \]

(Hint: show \(\sum(x_i-\bar x)\bar Y = 0\).)

Equivalent formula via sample correlation

Plugging in sample values, the estimates can be written as

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \qquad \hat{\beta}_1 = r \cdot \frac{s_y}{s_x}, \]

where \(r = r(x, y)\) is the sample correlation coefficient, and \(s_x\) and \(s_y\) are the sample standard deviations. So we can see that for simple linear regression, the line depends on five quantities: the sample means and the sample sd’s of \(x\) and \(y\), and their sample correlation. It is very easy to just write down the slope and the intercept but you should always…

Warning

Look at the data first!

Example 2 (Anscombe’s Data Quartet) Always inspect the data before performing a regression. “Anscombe’s Quartet”, shown below, consists of four datasets with identical \(\hat\beta_0\), \(\hat\beta_1\), and \(r^2\), and therefore all have the same fitted line, The data have radically different shapes, though — only one of which, the top left, is actually suited to a linear model.

Uses of Linear Modeling

Most uses of linear models fall into three categories:

Summarizing associations: What is the relationship between the number of rooms in a house and its market value?
Estimating/predicting the response: A house is going on the market. Given its attributes, what will it sell for?
Causal inference: By how much will the resale price of my house increase if I build an extension and add one room?

Example 3 (Chocolate Consumption and the Nobel Prize)

scatter plot of Nobel laureates per 10M population vs. chocolate consumption (kg/yr/capita) by country

From Messerli (2012), “Chocolate consumption, cognitive function, and Nobel laureates,” New England Journal of Medicine. The author fit a linear model with slope \(\approx 2.5\).

What is the model good for?

Summarizing an association: Each additional kg of annual per-capita chocolate consumption is associated with 2.5 extra Nobel laureates per 10 million population.
Prediction: India’s 2017 per-capita chocolate consumption was about 0.1 kg, so the model predicts roughly one Nobel laureate per 40 million Indian residents.
Causal inference: If Americans ate an extra kilogram of chocolate per year, the Nobel prize rate would increase by 2.5 per 10 million. (This interpretation is clearly unreasonable — correlation \(\neq\) causation!)

Statistical Properties of \(\hat{\beta}_0\) and \(\hat{\beta}_1\)

Since both \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are linear functions of the \(Y_i\), we have

\[ \hat{\beta}_j \sim \mathcal{N}\!\left(E(\hat{\beta}_j),\; \mathrm{Var}(\hat{\beta}_j)\right), \quad j = 0, 1. \]

Theorem 1 (Theorem A (p. 548)) The OLS estimators are unbiased:

\[ E(\hat{\beta}_0) = \beta_0 \qquad \text{and} \qquad E(\hat{\beta}_1) = \beta_1. \]

(Proof: exercise.)

Theorem 2 (Theorem B (pp. 548–549)) \[ \mathrm{Var}(\hat{\beta}_0) = \frac{\sigma^2 \displaystyle\sum_{i=1}^{n} x_i^2}{n\displaystyle\sum_{i=1}^{n} x_i^2 - \left(\displaystyle\sum_{i=1}^{n} x_i\right)^2} \]

\[ \mathrm{Var}(\hat{\beta}_1) = \frac{n\sigma^2}{n\displaystyle\sum_{i=1}^{n} x_i^2 - \left(\displaystyle\sum_{i=1}^{n} x_i\right)^2} \]

\[ \mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) = \frac{-\sigma^2 \displaystyle\sum_{i=1}^{n} x_i}{n\displaystyle\sum_{i=1}^{n} x_i^2 - \left(\displaystyle\sum_{i=1}^{n} x_i\right)^2} \]

Proof. Define \(a_i = x_i - \bar{x}\) and \(A = \displaystyle\sum_{i=1}^{n} a_i^2 = \sum_{i=1}^{n}(x_i-\bar{x})^2\).

We will compute the variances in terms of \(a_i\) and \(A\).

From the formula for \(\hat{\beta}_1\), with \(a_i = x_i - \bar{x}\) and \(A = \sum a_i^2\):

\[ \hat{\beta}_1 = \frac{\displaystyle\sum_{i=1}^{n} a_i Y_i}{A} \]

Since the \(Y_i\) are independent with \(\mathrm{Var}(Y_i) = \sigma^2\):

\[ \mathrm{Var}(\hat{\beta}_1) = \frac{1}{A^2}\sum_{i=1}^{n} a_i^2\,\sigma^2 = \frac{\sigma^2}{A} = \frac{\sigma^2}{\displaystyle\sum_{i=1}^{n}(x_i - \bar{x})^2} \]

This agrees with the expression in Theorem B using the identity

\[ \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2, \qquad n\bar{x} = \sum_{i=1}^{n} x_i. \]

Exercise 2 Use the variance of \(\hat{\beta}_1\) to derive the variance of \(\hat{\beta}_0\) and the covariance \(\mathrm{Cov}(\hat\beta_0, \hat\beta_1)\).

Exercise 3 Show that \(\mathrm{Cov}(\bar Y,\hat\beta_1)=0\).

More About \(\hat{\beta}_1\)

If \(Y_i \sim \mathcal{N}(\beta_0 + \beta_1 x_i,\, \sigma^2)\), then

\[ \hat{\beta}_1 \sim \mathcal{N}\!\left(\beta_1,\; \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right). \] Now, if the \(x_i\) values are clustered together, the denominator becomes small, which causes the variance of the slope estimate (\(\mathrm{Var}\hat\beta_1\)) to be large. The estimate is not precise, and this will make the confidence intervals for the slope to be wider, and unreliable predicted values \(\hat Y\). The data becomes noisy - small changes in the sample can lead to very different slope estimates. Note that there are other reasons that make the variance large, but we if can control the \(x_i\) values, we should spread them out to reduce the variance of our estimate.

Inference About \(\beta_1\)

Estimating \(\sigma^2\)

We need to estimate \(\sigma^2\). Define the Residual Sum of Squares using our observed error \(=y_i-\hat y_i\), which we call the residual.

\[ RSS = \sum_{i=1}^{n} \hat{e}_i^2, \qquad \hat{e}_i = = y_i-\hat y_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \quad (\text{estimated error / residual}) \]

Fact: \(s^2\) is an unbiased estimator of \(\sigma^2\), where we define \(s^2\) as: \[ s^2 = \frac{RSS}{n - 2} \]

We divide by \(n-2\) because we use two estimates (\(\hat\beta_0\) and \(\hat\beta_1\)) in computing \(s\).

Confidence Interval for \(\beta_1\)

The estimated standard error of \(\hat\beta_1\) (using our estimate of \(\sigma\)) is

\[ s_{\hat\beta_1} = \frac{s}{\displaystyle\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}} \]

Then we can define the standardized statistic \(T\), which, since \(\hat \beta_1\) is normally distributed, has a \(t\)-distribution with \(n-2\) degrees of freedom.

\[ T = \frac{\hat{\beta}_1 - \beta_1}{s_{\hat\beta_1}} \sim t_{n-2} \]

Putting all of these pieces together, we can write down a \((1-\alpha)100\%\) confidence interval for \(\beta_1\), the true slope: \[ \hat{\beta}_1 \pm t_{\alpha/2,\, n-2} \cdot s_{\hat\beta_1} \]

Hypothesis Test for \(\beta_1\)

We test if there is really some relationship between \(x\) and \(y\) by testing if the slope of the regression line is different from \(0\). Our hypotheses are: \[ H_0: \beta_1 = 0 \:\mathrm{ against }\: H_1: \beta_1 \neq 0, \] using the test statistic \(T\) given by: \[ T = \frac{\hat{\beta}_1 - 0}{s_{\hat\beta_1}} \]

If we reject \(H_0\), we conclude that \(x\) has predictive value — i.e. there exists a linear relationship between \(x\) and \(Y\).

Correlation Analysis and Regression

We talked about the regression effect observed by Galton, in which he found that sons of fathers taller than average tended to be smaller than their father, while sons of fathers smaller than average tended to be taller. That is, in a test-retest situation, if the test value is above average, the retest value tended to be closer to average.

We can see this easily by writing the slope of the regression line in terms of the sample correlation and the standard deviations of the sample data.

First, let’s recall the sample variances and covariance: \[ s_{xx} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad s_{yy} = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2, \qquad s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}). \]

The sample correlation coefficient is

\[ r = \frac{s_{xy}}{\sqrt{s_{xx}\, s_{yy}}} %= \frac{1}{n}\sum_{i=1}^{n} %\frac{x_i - \bar{x}}{\sqrt{s_{xx}}} %\cdot %\frac{Y_i - \bar{Y}}{\sqrt{s_{yy}}}. \]

Note that the slope estimator can be written as

\[ \hat{\beta}_1 = \frac{\displaystyle\sum_{i=1}^{n}(x_i-\bar{x})(Y_i-\bar{Y})}{\displaystyle\sum_{i=1}^{n}(x_i-\bar{x})^2} = \frac{s_{xy}}{s_{xx}} = \frac{r\sqrt{s_{xx}\,s_{yy}}}{s_{xx}} = r\sqrt{\frac{s_{yy}}{s_{xx}}}. \]

The fitted line can therefore be rewritten as:

\[ \hat{Y} - \bar{Y} = r \cdot \sqrt{\frac{s_{yy}}{s_{xx}}} \cdot (x - \bar{x}) \]

This exhibits the regression effect (“regression to mediocrity”): since \(|r| \leq 1\), the regression line is always flatter than the SD line. A father who is 1 SD above the mean in height will have a son predicted to be only \(r\) SDs above the mean — not a full SD.

Regression in R

Example 4 (Regressing the bismuth II–I transition pressure on temperature) Problem 36 of the text states that the dataset bismuth contains the transition pressure (bar) of the bismuth II–I transition as a function of temperature (°C). Fit a linear relationship between pressure and temperature, examine the residuals, and comment.

To perform a linear regression in R, we use the function lm(), as shown, and then use the function summary() to see the result:

fit_bismuth <- lm(pressure ~ temperature, data = bismuth)

Code

bismuth <- read_csv("bismuth.csv")
fit_bismuth <- lm(pressure ~ temperature, data = bismuth)
summary(fit_bismuth)


Call:
lm(formula = pressure ~ temperature, data = bismuth)

Residuals:
    Min      1Q  Median      3Q     Max 
-73.902 -12.314   2.174  17.727  37.376 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 26067.9955    16.1690 1612.22   <2e-16 ***
temperature   -39.8736     0.5042  -79.08   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.31 on 21 degrees of freedom
Multiple R-squared:  0.9967,    Adjusted R-squared:  0.9965 
F-statistic:  6254 on 1 and 21 DF,  p-value: < 2.2e-16

Let’s take a closer look at this output, and indicate the quantities we know. I have “x-ed” out the quantities that do not concern us at this time. Annotated output from lm()

The slope is estimated as \(\hat\beta_1 \approx -39.87\), and is highly significant (the \(p\)-value is shown to be approximately 0 for testing if the slope is 0). We see that the intercept estimate is just over 26,000, and it also appears to be significant (though the slope is more important). The value of \(R^2 = 0.997\) indicates that temperature explains about \(99.67\%\) of the variability in transition pressure. Note, however, that the residuals should be plotted to verify model assumptions.

Residual plots

Code

ggplot(data.frame(x = fitted(fit_bismuth), e = resid(fit_bismuth)),
       aes(x = x, y = e)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals",
       title = "Residual Plot for Bismuth data") +
  theme_bw()

When we look at the residual plot to see if our assumptions are valid, we see that the variability of the residuals does not appear to be constant and appears to be greater for higher fitted values of pressure (and therefore for lower temperatures). It might be that sets of readings were collected at different temperatures, which would imply that the data, and therefore the errors are not independent. Our estimates, therefore, are not reliable.

Notes

Since one assumption of the simple linear regression model is that \(Y\) is Normal for fixed \(x\), and homoscedastic (the variance of \(Y\) is the same for every \(x\)), we can make probability statements:

\[ P(Y \leq y \mid x) \approx \Phi\!\left(\frac{y - \hat{E}(Y|x)}{\hat{\sigma}}\right). \]

\(R^2\), seen in the output of lm() is the coefficient of determination — the proportion of variability in the response explained by the model, that is, by its linear relationship with \(x\). Note that for simple linear regression, \(R^2\) is the same as \(r^2\), the square of the correlation coefficient.
Note that there are two regression lines (regressing \(Y\) on \(x\) vs. regressing \(X\) on \(y\)). These lines will have different slopes and intercepts and will need to be interpreted differently. Their slopes are not just reciprocals of each other.

Cautions

Regression fallacy: Do not impute important cause to the regression effect. This is an example of a fallacy, and a well-known example is the Sports Illustrated cover jinx.
Extrapolation: Do not predict beyond the range of the observed \(x\)-values. Here is an xkcd cartoon to drive home the point:

xkcd comic showing the perils of extrapolating too far

Outliers: Re-run the regression with and without outliers to assess their influence.
Nonlinearity: Do not fit a straight line to a clearly nonlinear relationship. This means you had better look at the data first! You might need to transform the data to obtain linearity, as we saw in class.
Homoscedasticity: The linear regression model assumes the error variance \(\sigma^2\) is constant (does not depend on \(x\)). If the data are heteroscedastic (the variance is not constant), estimated standard errors and confidence intervals will be unreliable.
Always examine the residual plots (\(\hat{e}\) against \(x\)) in R. They reveal nonlinearity, outliers, and heteroscedasticity far more clearly than the original scatter plot. Philip Stark’s online statistics text SticiGui has an excellent chapter on regression diagnostics, which is worth reading.

References

Adhikari, Ani. 2012. “STAT 135 Notes.” Notes (shared privately).

Pimentel, Sam. 2024. “STAT 135 Lecture Slides.” Lecture slides (shared privately).

Rice, John A. 2006. Mathematical Statistics and Data Analysis. 3rd ed. Duxbury Press.