Section 1 Simple Linear Regression

As a reminder, we consider simple linear regression in this section. My hope is that all of you have seen this material before at some stage, e.g. at school or in a first- or second-year module.

In preparation for notation introduced in the next section, we rename the parameters \(\alpha\) and \(\beta\) from the introduction to the new names \(\beta_0\) for the intercept and \(\beta_1\) for the slope.

1.1 Residual Sum of Squares

In simple linear regression, the aim is to find a regression line \(y = \beta_0 + \beta_1 x\), such that the line is “close” to given data points \((x_1, y_1), \ldots, (x_n, y_n) \in\mathbb{R}^2\). The usual way to find \(\beta_0\) and \(\beta_1\), and thus the regression line, is by minimising the residual sum of squares: \[\begin{equation} r(\beta_0, \beta_1) = \sum_{i=1}^n \bigl( y_i - (\beta_0 + \beta_1 x_i) \bigr)^2. \tag{1.1} \end{equation}\] For given \(\beta_0\) and \(\beta_1\), the value \(r(\beta_0, \beta_1)\) measures how close (in the vertical direction) the given data points \((x_i, y_i)\) are to the regression line \(\beta_0 + \beta_1 x\). By minimising \(r(\beta_0, \beta_1)\) we find the regression line which is “closest” to the data. The solution of this minimisation problem is usually expressed in terms of the sample variance \(\mathrm{s}_x^2\) and the sample covariance \(\mathrm{s}_{xy}\).
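To make the definition concrete, the residual sum of squares can be written out in a few lines of code. The following is a minimal Python/NumPy sketch; the data values x and y are made up purely for illustration and are not taken from the notes.

```python
import numpy as np

def rss(beta0, beta1, x, y):
    """Residual sum of squares r(beta0, beta1) from equation (1.1)."""
    residuals = y - (beta0 + beta1 * x)
    return np.sum(residuals ** 2)

# made-up data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.3, 1.9, 3.2, 3.8, 5.1])

print(rss(0.0, 1.0, x, y))  # RSS of the candidate line y = x
```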

Definition 1.1 The sample covariance of \(x_1, \ldots, x_n \in \mathbb{R}\) and \(y_1, \ldots, y_n\in\mathbb{R}\) is given by \[\begin{equation*} \mathrm{s}_{xy} := \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x) (y_i - \bar y), \end{equation*}\] where \(\bar x\) and \(\bar y\) are the sample means.

The sample variance of \(x_1, \ldots, x_n \in \mathbb{R}\) is given by \[\begin{equation*} \mathrm{s}_{x}^2 := \mathrm{s}_{xx} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2, \end{equation*}\] where, again, \(\bar x\) is the sample mean of the \(x_i\).
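Both quantities can be computed directly from these definitions. The sketch below continues the toy data from the previous snippet and checks the results against NumPy's built-in functions, which use the same divisor \(n-1\).

```python
# continuing with x and y from the previous snippet
n = len(x)
xbar, ybar = x.mean(), y.mean()

s_xy = np.sum((x - xbar) * (y - ybar)) / (n - 1)  # sample covariance
s_xx = np.sum((x - xbar) ** 2) / (n - 1)          # sample variance of x

# np.cov (with its default bias=False) and np.var with ddof=1 also divide by n - 1
assert np.isclose(s_xy, np.cov(x, y)[0, 1])
assert np.isclose(s_xx, np.var(x, ddof=1))
```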

Lemma 1.1 Assume that \(\mathrm{s}_x^2 > 0\). Then the function \(r(\beta_0, \beta_1)\) from (1.1) takes its minimum at the point \((\hat\beta_0, \hat\beta_1)\) given by \[\begin{equation*} \hat\beta_1 = \frac{\mathrm{s}_{xy}}{\mathrm{s}_x^2}, \qquad \hat\beta_0 = \bar y - \hat \beta_1 \bar x, \end{equation*}\] where \(\bar x, \bar y\) are the sample means, \(\mathrm{s}_{xy}\) is the sample covariance and \(\mathrm{s}_x^2\) is the sample variance.
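Using the sample statistics from the previous snippet, the formulas of Lemma 1.1 give the least squares estimates in two lines; as an independent check, np.polyfit with degree 1 fits the same line.

```python
# least squares estimates from Lemma 1.1
beta1_hat = s_xy / s_xx
beta0_hat = ybar - beta1_hat * xbar

# independent check: np.polyfit with degree 1 returns (slope, intercept)
slope, intercept = np.polyfit(x, y, 1)
assert np.isclose(beta1_hat, slope)
assert np.isclose(beta0_hat, intercept)
```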

Proof. We could find the minimum of \(r\) by differentiating and setting the derivatives to zero. Here we follow a different approach which uses a “trick” to simplify the algebra: Let \(\tilde x_i = x_i - \bar x\) and \(\tilde y_i = y_i - \bar y\) for all \(i \in \{1, \ldots, n\}\). Then we have \[\begin{equation*} \sum_{i=1}^n \tilde x_i = \sum_{i=1}^n x_i - n \bar x = 0 \end{equation*}\] and, similarly, \(\sum_{i=1}^n \tilde y_i = 0\). Using the new coordinates \(\tilde x_i\) and \(\tilde y_i\) we find \[\begin{align*} r(\beta_0, \beta_1) &= \sum_{i=1}^n \bigl(y_i - \beta_0 - \beta_1 x_i \bigr)^2 \\ &= \sum_{i=1}^n \bigl( \tilde y_i + \bar y - \beta_0 - \beta_1 \tilde x_i - \beta_1 \bar x \bigr)^2 \\ &= \sum_{i=1}^n \Bigl( \bigl( \tilde y_i - \beta_1 \tilde x_i \bigr) + \bigl( \bar y - \beta_0 - \beta_1 \bar x \bigr) \Bigr)^2 \\ &= \sum_{i=1}^n \bigl( \tilde y_i - \beta_1 \tilde x_i \bigr)^2 + 2 \bigl( \bar y - \beta_0 - \beta_1 \bar x \bigr) \sum_{i=1}^n \bigl( \tilde y_i - \beta_1 \tilde x_i \bigr) + n \bigl( \bar y - \beta_0 - \beta_1 \bar x \bigr)^2. \end{align*}\] Since \(\sum_{i=1}^n \tilde x_i = \sum_{i=1}^n \tilde y_i = 0\), the second term on the right-hand side vanishes and we get \[\begin{equation} r(\beta_0, \beta_1) = \sum_{i=1}^n \bigl( \tilde y_i - \beta_1 \tilde x_i \bigr)^2 + n \bigl( \bar y - \beta_0 - \beta_1 \bar x \bigr)^2. \tag{1.2} \end{equation}\] Both of these terms are non-negative, and we can make the second term equal to zero (without changing the first term) by setting \(\beta_0 = \bar y - \beta_1 \bar x\).

To find the value of \(\beta_1\) which minimises the first term on the right-hand side of (1.2) we now set the (one-dimensional) derivative w.r.t. \(\beta_1\) equal to \(0\). We get the condition \[\begin{align*} 0 &\overset{!}{=} \frac{d}{d\beta_1} \sum_{i=1}^n \bigl( \tilde y_i - \beta_1 \tilde x_i \bigr)^2 \\ &= \sum_{i=1}^n 2 \bigl( \tilde y_i - \beta_1 \tilde x_i \bigr) \frac{d}{d\beta_1} \bigl( \tilde y_i - \beta_1 \tilde x_i \bigr) \\ &= - 2 \sum_{i=1}^n \bigl( \tilde y_i - \beta_1 \tilde x_i \bigr) \tilde x_i \\ &= -2 \sum_{i=1}^n \tilde x_i \tilde y_i + 2 \beta_1 \sum_{i=1}^n \tilde x_i^2. \end{align*}\] The only solution to this equation is \[\begin{align*} \beta_1 &= \frac{\sum_{i=1}^n \tilde x_i \tilde y_i}{\sum_{i=1}^n \tilde x_i^2} \\ &= \frac{\sum_{i=1}^n (x_i - \bar x) (y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} \\ &= \frac{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x) (y_i - \bar y)}{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2} \\ &= \frac{\mathrm{s}_{xy}}{\mathrm{s}_x^2}. \end{align*}\] Since the second derivative is \(2 \sum_{i=1}^n \tilde x_i^2 > 0\), this is indeed a minimum and the proof is complete.

1.2 Linear Regression as a Parameter Estimation Problem

In statistics, any analysis starts by making a statistical model of the data. This is done by writing down random variables which have the same structure as the data, and which are chosen so that the data “looks like” a random sample from these random variables.

To construct a model for the data used in a simple linear regression problem, we consider fixed \(x_1, \ldots, x_n\) and then define random variables \(Y_1, \ldots, Y_n\) as follows: \[\begin{equation} Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \tag{1.3} \end{equation}\] for all \(i \in \{1, 2, \ldots, n\}\), where \(\varepsilon_1, \ldots, \varepsilon_n\) are i.i.d. random variables with \(\mathbb{E}(\varepsilon_i) = 0\) and \(\mathop{\mathrm{Var}}(\varepsilon_i) = \sigma^2\).

  • Here the \(x\)-values are fixed and known. The only random quantities in the model are \(\varepsilon_i\) and \(Y_i\). (There are more complicated models which also allow for randomness of \(x\), but we won’t consider such models here.)
  • The random variables \(\varepsilon_i\) are called residuals or errors. In a scatter plot, the residuals correspond to the vertical distance between the samples and the regression line. Often one assumes that \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\) for all \(i \in \{1, 2, \ldots, n\}\).
  • The values \(\beta_0\), \(\beta_1\) and \(\sigma^2\) are parameters of the model. To fit the model to data, we need to estimate these parameters.
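To make the model concrete, the sketch below simulates one data set from model (1.3), under the common additional assumption \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\); the parameter values and \(x\)-values are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# made-up "true" parameter values, for illustration only
beta0, beta1, sigma = 1.0, 0.5, 0.3

# fixed, known x-values
x = np.linspace(0.0, 10.0, num=20)

# model (1.3): Y_i = beta0 + beta1 * x_i + eps_i with eps_i ~ N(0, sigma^2)
eps = rng.normal(loc=0.0, scale=sigma, size=len(x))
Y = beta0 + beta1 * x + eps
```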

This model is more complex than the models considered in some introductory statistics courses:

  • The data now consists of pairs of numbers, instead of just single numbers.
  • We have \[\begin{equation*} \mathbb{E}(Y_i) = \mathbb{E}\bigl( \beta_0 + \beta_1 x_i + \varepsilon_i \bigr) = \beta_0 + \beta_1 x_i + \mathbb{E}(\varepsilon_i) = \beta_0 + \beta_1 x_i. \end{equation*}\] Thus, the expectation of \(Y_i\) depends on \(x_i\) and, at least for \(\beta_1 \neq 0\), the random variables \(Y_i\) are not identically distributed.

In this setup, we can consider the estimates \(\hat\beta_0\) and \(\hat\beta_1\) from the previous subsection as statistical parameter estimates for the model parameters \(\beta_0\) and \(\beta_1\).

In order to fit a linear model we also need to estimate the residual variance \(\sigma^2\). This can be done using the estimator \[\begin{equation} \hat\sigma^2 = \frac{1}{n-2} \sum_{i=1}^n \hat\varepsilon_i^2 = \frac{1}{n-2} \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2. \tag{1.4} \end{equation}\] To understand the form of this estimator, we have to remember that \(\sigma^2\) is the variance of the \(\varepsilon_i\). Thus, using the standard estimator for the variance, we could estimate \(\sigma^2\) as \[\begin{equation} \sigma^2 \approx \frac{1}{n-1} \sum_{i=1}^n \bigl(\varepsilon_i - \bar\varepsilon\bigr)^2 \approx \frac{1}{n-1} \sum_{i=1}^n \bigl(\hat\varepsilon_i - \overline{\hat\varepsilon}\bigr)^2, \tag{1.5} \end{equation}\]

where \(\bar\varepsilon\) and \(\overline{\hat\varepsilon}\) are the averages of the \(\varepsilon_i\) and the \(\hat\varepsilon_i\), respectively. One can show that \(\overline{\hat\varepsilon} = 0\). The estimates of \(\beta_0\) and \(\beta_1\) are sensitive to fluctuations in the data, with the effect that the estimated regression line is, on average, slightly closer to the data points than the true regression line would be. This causes the sample variance of the \(\hat\varepsilon_i\), on average, to be slightly smaller than the true residual variance \(\sigma^2\), and thus the estimator (1.5) is slightly biased. A more detailed analysis reveals that an unbiased estimator can be obtained if one replaces the pre-factor \(1/(n-1)\) in equation (1.5) with \(1/(n-2)\). This leads to the estimator (1.4).
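In code, the estimator (1.4) simply divides the minimised residual sum of squares by \(n-2\). The sketch below continues the toy example from Section 1.1 and also illustrates the fact that \(\overline{\hat\varepsilon} = 0\).

```python
# continuing with x, y, n, beta0_hat and beta1_hat from the earlier snippets
eps_hat = y - (beta0_hat + beta1_hat * x)    # fitted residuals
assert np.isclose(eps_hat.mean(), 0.0)       # their average is (numerically) zero

sigma2_hat = np.sum(eps_hat ** 2) / (n - 2)  # estimator (1.4): divisor n - 2, not n - 1
```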

The main advantage gained by considering a statistical model is that we can now study how close the estimators \(\hat\beta_0\), \(\hat\beta_1\) and \(\hat\sigma^2\) are to the true values. Results one can obtain include the following:

  • The estimators \(\hat\beta_0\), \(\hat\beta_1\) and \(\hat\sigma^2\) are unbiased: This means that when we plug in random data \((x_i, Y_i)\) from the model (1.3), on average we get the correct answer: \(\mathbb{E}(\hat\beta_0) = \beta_0\), \(\mathbb{E}(\hat\beta_1) = \beta_1\), \(\mathbb{E}(\hat\sigma^2) = \sigma^2\). (A small simulation check of this property is sketched after this list.)

  • One can ask about the average distance between the estimated parameters \(\hat\beta_0\), \(\hat\beta_1\) and \(\hat\sigma^2\) and the (unknown) true values \(\beta_0\), \(\beta_1\) and \(\sigma^2\). One measure for these distances is the root mean squared error of the estimators.

  • One can consider confidence intervals for the parameters \(\beta_0\), \(\beta_1\) and \(\sigma^2\).

  • One can consider statistical hypothesis tests to answer yes/no questions about the parameters. For example, one might ask whether the data could have come from the model with \(\beta_0=0\).

  • One can consider whether the data is compatible with the model at all, irrespective of parameter values. If there is a non-linear relationship between \(x\) and \(y\), the model (1.3) will no longer be appropriate.
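As a minimal numerical illustration of the first point (not a proof), the sketch below repeatedly simulates data from model (1.3) with made-up parameter values, computes the three estimators for each simulated data set, and averages the results; the averages should be close to the true values \(\beta_0 = 1\), \(\beta_1 = 0.5\) and \(\sigma^2 = 0.09\).

```python
import numpy as np

rng = np.random.default_rng(seed=2)
beta0, beta1, sigma = 1.0, 0.5, 0.3        # made-up true parameters
x = np.linspace(0.0, 10.0, num=20)
n = len(x)

estimates = []
for _ in range(10_000):
    # simulate one data set from model (1.3)
    Y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=n)
    # estimates from Lemma 1.1 and equation (1.4)
    s_xy = np.sum((x - x.mean()) * (Y - Y.mean())) / (n - 1)
    s_xx = np.sum((x - x.mean()) ** 2) / (n - 1)
    b1 = s_xy / s_xx
    b0 = Y.mean() - b1 * x.mean()
    s2 = np.sum((Y - b0 - b1 * x) ** 2) / (n - 2)
    estimates.append((b0, b1, s2))

print(np.mean(estimates, axis=0))  # approximately (1.0, 0.5, 0.09)
```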

We will consider most of these topics over the course of the module.

1.3 Matrix Notation

To conclude this section, we will rewrite the results using matrix notation. This approach simplifies our equations and will be used extensively for multiple linear regression in the rest of this module. The idea here is to arrange all quantities in the problem as matrices and vectors. We write \[\begin{equation*} X = \begin{pmatrix} 1 & x_1\\ 1 & x_2\\\vdots & \vdots\\1 & x_n \end{pmatrix} \in \mathbb{R}^{n\times 2}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^n, \qquad \varepsilon= \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} \in \mathbb{R}^n, \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} \in\mathbb{R}^2. \end{equation*}\] Using this notation, we can rewrite the \(n\) equations \(y_i = \beta_0 + x_i \beta_1 + \varepsilon_i\) for \(i \in \{1, \ldots, n\}\) as one vector-valued equation in \(\mathbb{R}^n\): we get \[\begin{equation*} y = X\beta + \varepsilon, \end{equation*}\] and we want to “solve” this vector-valued equation for \(\beta\). The sum of squares can now be written as \[\begin{equation*} r(\beta) = \sum_{i=1}^n \varepsilon_i^2 = \varepsilon^\top \varepsilon = (y - X\beta)^\top (y - X\beta) = y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta. \end{equation*}\] In the next section we will see that the minimum of \(r\) is attained at \[\begin{equation*} \hat\beta = (X^\top X)^{-1} X^\top y, \end{equation*}\] and one can check that the components of this vector \(\hat\beta = (\hat\beta_0, \hat\beta_1)\) coincide with the estimates we obtained above.
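The matrix formula can also be checked numerically. The sketch below builds the design matrix \(X\) for the toy data used earlier and computes \(\hat\beta\) by solving the normal equations \(X^\top X \hat\beta = X^\top y\) (which avoids forming the inverse explicitly), recovering the same estimates as in Section 1.1.

```python
import numpy as np

# toy data, as in the earlier snippets
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.3, 1.9, 3.2, 3.8, 5.1])

# design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# hat(beta) = (X^T X)^{-1} X^T y, computed via the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # (beta0_hat, beta1_hat), matching the estimates from Section 1.1
```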

Summary

  • Simple linear regression is the case where there is only one input.
  • A regression line is fitted by minimising the residual sum of squares.
  • Linear regression is a statistical parameter estimation problem.
  • The problem can be conveniently written in matrix/vector notation.