Section 2 Least Squares Estimates

2.1 Data and Models

For multiple linear regression we assume that there are \(p\) inputs and one output. If we have a sample of \(n\) observations, there are \(np\) input values and \(n\) output values in total. Here we denote the \(i\)th observation of the \(j\)th input by \(x_{i,j}\) and the corresponding output by \(y_i\).

As an example, we consider the mtcars dataset built into R. This is a small dataset, which contains information about 32 automobiles (1973-74 models). The table lists fuel consumption mpg, gross horsepower hp, and 9 other aspects of these cars. Here we consider mpg to be the output, and the other listed aspects to be inputs. Type help(mtcars) in R to learn more about this dataset:

mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

For this dataset we have \(n = 32\) (number of cars), and \(p = 10\) (number of attributes, excluding mpg). The values \(y_1, \ldots, y_{32}\) are listed in the first column of the table, the values \(x_{i,1}\) for \(i \in \{1, \ldots, 32\}\) are shown in the second column, and the values \(x_{i,10}\) are shown in the last column.
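
To make the correspondence between this notation and the R object concrete, we can pick out individual values (a small sketch; the values shown in the comments can be read off the table above):

y <- mtcars[, 1]     # the outputs y_1, ..., y_32 (mpg)
y[1]                 # y_1 = 21.0, the output for the Mazda RX4
mtcars[1, "cyl"]     # x_{1,1} = 6, the first observation of the first input
mtcars[3, "carb"]    # x_{3,10} = 1, the third observation of the last input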

It is easy to make scatter plots which show how a single input affects the output. For example, we can show how the engine power affects fuel consumption:

plot(mtcars$hp, mtcars$mpg,
     xlab = "power [hp]", ylab = "fuel consumption [mpg]")

We can see that cars with stronger engines tend to use more fuel (i.e. a gallon of fuel lasts for fewer miles; the points trend downwards), but a plot of a single input leaves out a lot of information. It is not easy to make a plot which takes all inputs into account, and it is also not immediately obvious which of the variables are most important.
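
One partial remedy is a scatter-plot matrix, which at least shows all pairwise relationships in a single figure (a sketch using base R; it still shows only two variables at a time):

pairs(mtcars)   # scatter plots of every pair of variables; the first row shows mpg against each input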

In linear regression, we assume that the output depends on the inputs in a linear (or more precisely, affine) way. We write this as \[\begin{equation} y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \varepsilon_i \tag{2.1} \end{equation}\] where the errors \(\varepsilon_i\) are assumed to be “small”. Here, \(i\) ranges over all \(n\) samples, so (2.1) represents \(n\) simultaneous equations.

The parameters \(\beta_j\) can be interpreted as the expected change in the response \(y\) per unit change in \(x_j\) when all other regressor variables are held fixed. For this reason the parameters \(\beta_j\) (for \(j=1, \ldots, p\)) are sometimes called partial regression coefficients.

This model describes a hyperplane in the \((p+1)\)-dimensional space of the inputs \(x_j\) and the output \(y\). The hyperplane is easily visualized when \(p=1\) (as a line in \(\mathbb{R}^2\)), and visualisation can be attempted for \(p=2\) (as a plane in \(\mathbb{R}^3\)) but is hard for \(p>2\).

We defer making a proper statistical model for multiple linear regression until the next section.

2.2 The Normal Equations

Similar to what we did in Section 1.3, we rewrite the model using matrix notation. We define the vectors \[\begin{equation*} y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^n, \qquad \varepsilon= \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} \in \mathbb{R}^n, \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} \in\mathbb{R}^{1+p} \end{equation*}\] as well as the matrix \[\begin{equation*} X = \begin{pmatrix} 1 & x_{1,1} & \cdots & x_{1,p} \\ 1 & x_{2,1} & \cdots & x_{2,p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,p} \\ \end{pmatrix} \in \mathbb{R}^{n\times (1+p)}. \end{equation*}\] The parameters \(\beta_i\) are called the regression coefficients and the matrix \(X\) is called the design matrix.

Using this notation, the model (2.1) can be written as \[\begin{equation} y = X \beta + \varepsilon, \tag{2.2} \end{equation}\] where again \(X\beta\) is a matrix-vector multiplication which “hides” the sums in equation (2.1), and (2.2) is an equation of vectors of size \(n\), which combines the \(n\) individual equations from (2.1).

To simplify notation, we index the columns of \(X\) by \(0, 1, \ldots, p\) (instead of the more conventional \(1, \ldots, p+1\)), so that we can for example write \[\begin{equation*} (X \beta)_i = \sum_{j=0}^p x_{i,j} \beta_j = \beta_0 + \sum_{j=1}^p x_{i,j} \beta_j. \end{equation*}\]
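
In R, a design matrix of this form (including the column of ones) can be obtained with model.matrix(); as a sketch, for the mtcars data one might construct \(X\) as follows:

X <- model.matrix(mpg ~ ., data = mtcars)   # design matrix, first column is the intercept column of ones
dim(X)                                      # 32 rows and 11 = 1 + p columns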

As before, we find the regression coefficients by minimising the residual sum of squares: \[\begin{align*} r(\beta) &= \sum_{i=1}^n \varepsilon_i^2 \\ &= \sum_{i=1}^n \bigl( y_i - (\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}) \bigr)^2. \end{align*}\] In practice, this notation turns out to be cumbersome, and we will use matrix notation in the following proof.

Lemma 2.1 Assume that the matrix \(X^\top X \in \mathbb{R}^{(1+p) \times (1+p)}\) is invertible. Then the function \(r(\beta)\) takes its minimum at the vector \(\hat\beta\in\mathbb{R}^{p+1}\) given by \[\begin{equation*} \hat\beta = (X^\top X)^{-1} X^\top y. \end{equation*}\]

Proof. Using the vector equation \(\varepsilon= y - X \beta\), we can also write the residual sum of squares as \[\begin{align*} r(\beta) &= \sum_{i=1}^n \varepsilon_i^2 \\ &= \varepsilon^\top \varepsilon\\ &= (y - X \beta)^\top (y - X \beta) \\ &= y^\top y - y^\top X\beta - (X\beta)^\top y + (X\beta)^\top (X\beta). \end{align*}\] Using the linear algebra rules from Appendix A.2 we find that \(y^\top X\beta = (X\beta)^\top y = \beta^\top X^\top y\) and \((X\beta)^\top (X\beta) = \beta^\top X^\top X \beta\). Thus we get \[\begin{equation*} r(\beta) = y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta. \end{equation*}\] Note that in this equation \(X\) is a matrix, \(y\) and \(\beta\) are vectors, and \(r(\beta)\) is a number.

To find the minimum of this function, we set all partial derivatives \(\frac{\partial}{\partial \beta_i} r(\beta)\) equal to \(0\). Going through the terms in the formula for \(r(\beta)\) we find: (1) \(y^\top y\) does not depend on \(\beta\), so we have \(\frac{\partial}{\partial \beta_i} y^\top y = 0\) for all \(i\), (2) we have \[\begin{equation*} \frac{\partial}{\partial \beta_i} \beta^\top X^\top y = \frac{\partial}{\partial \beta_i} \sum_{j=0}^{p} \beta_j (X^\top y)_j = (X^\top y)_i \end{equation*}\] and (3) finally \[\begin{equation*} \frac{\partial}{\partial \beta_i} \beta^\top X^\top X \beta = \frac{\partial}{\partial \beta_i} \sum_{j,k=0}^{p} \beta_j (X^\top X)_{j,k} \beta_k = 2 \sum_{k=0}^{p} (X^\top X)_{i,k} \beta_k = 2 \bigl( (X^\top X) \beta \bigr)_i. \end{equation*}\] (Some care is needed when checking the middle equality in the previous display: \(\beta_i\) appears both via the index \(j\) and via the index \(k\), and by the product rule and the symmetry of \(X^\top X\) these two contributions are equal, which gives the factor \(2\).) Combining these derivatives, we find \[\begin{equation} \frac{\partial}{\partial \beta_i} r(\beta) = 0 - 2 (X^\top y)_i + 2 \bigl( X^\top X \beta \bigr)_i \tag{2.3} \end{equation}\] for all \(i \in \{0, 1, \ldots, p\}\). At a local minimum of \(r\), all of these partial derivatives must be zero, and writing this as a vector equation we find that a necessary condition for a minimum is \[\begin{equation} X^\top X \beta = X^\top y. \tag{2.4} \end{equation}\] Since we assumed that \(X^\top X\) is invertible, there is exactly one vector \(\beta\) which solves (2.4). This vector is given by \[\begin{equation*} \hat\beta := (X^\top X)^{-1} X^\top y. \end{equation*}\]

As for one-dimensional minimisation, there is a condition on the second derivatives which must be checked to see which local extrema are local minima. Here we only sketch this argument: a sufficient condition for \(\hat\beta\) to be a minimum is that the second derivative matrix (the Hessian matrix) is positive definite (see Appendix A.2.8). Using equation (2.3) we find \[\begin{equation*} \frac{\partial^2}{\partial\beta_i \partial\beta_j} r(\beta) = 2 (X^\top X)_{i,j}, \end{equation*}\] and thus the Hessian matrix is \(H = 2 X^\top X\). Using results from linear algebra, one can show that this matrix is indeed positive definite and thus \(\hat\beta\) is the unique minimum of \(r\).
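
As a numerical illustration (a sketch for the mtcars design matrix; small eigenvalues are subject to rounding error), positive definiteness of the Hessian can be checked by verifying that all of its eigenvalues are strictly positive:

X <- model.matrix(mpg ~ ., data = mtcars)   # mtcars design matrix, as before
Hess <- 2 * t(X) %*% X                      # Hessian matrix of r(beta)
min(eigen(Hess)$values)                     # strictly positive, so the Hessian is positive definite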

Equation (2.4) gives a system of \(p+1\) linear equations with \(p+1\) unknowns. This system of linear equations, \(X^\top X \beta = X^\top y\), is called the normal equations. If \(X^\top X\) is invertible, as assumed in the lemma, this system of equations has \(\hat\beta\) as its unique solution. Otherwise, there may be more than one \(\beta\) which leads to the same value \(r(\beta)\) and the minimum will no longer be unique. This happens, for example, if two of the inputs are identical to each other (or, more generally, if one input is linearly dependent on the other inputs).
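
As a small illustration of this failure mode (a sketch; the choice of hp as the duplicated input is arbitrary), duplicating a column of the design matrix makes \(X^\top X\) singular:

X.bad <- cbind(1, mtcars$hp, mtcars$hp)   # second and third columns are identical
qr(t(X.bad) %*% X.bad)$rank               # rank 2 < 3, so X^T X is not invertible
# solve(t(X.bad) %*% X.bad) would fail with an error here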

The condition that \(X^\top X\) must be invertible in Lemma 2.1 corresponds to the condition \(\mathrm{s}_x^2 > 0\) in Lemma 1.1.

The value \(\hat\beta\) found in the lemma is called the least squares estimator for \(\beta\).
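
As a quick sanity check of the formula (a sketch with simulated data; the sample size, the number of inputs, the true coefficients and the noise level are arbitrary choices), we can generate data from model (2.2) and verify that \(\hat\beta\) is close to the true \(\beta\):

set.seed(1)
n <- 100; p <- 3
X.sim <- cbind(1, matrix(rnorm(n * p), n, p))      # design matrix with intercept column
beta.true <- c(1, 2, -1, 0.5)                      # true coefficients beta_0, ..., beta_3
y.sim <- X.sim %*% beta.true + rnorm(n, sd = 0.1)  # y = X beta + epsilon
solve(t(X.sim) %*% X.sim, t(X.sim) %*% y.sim)      # least squares estimate, close to beta.true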

2.3 Fitted Values

Let us again consider our model \[\begin{equation*} y = X \beta + \varepsilon, \end{equation*}\] using the matrix notation introduced above. Here we can think of \(X\beta\) as the true values, while \(\varepsilon\) are the errors. The design matrix \(X\) (containing the inputs) and the response \(y\) are known to us, while the true coefficients \(\beta\) and the errors \(\varepsilon\) are unknown. Solving for \(\varepsilon\) we find that the errors satisfy \[\begin{equation*} \varepsilon= y - X\beta. \end{equation*}\]

Using the least squares estimate \(\hat\beta\) we can estimate the true values as \[\begin{equation} \hat y = X \hat\beta. \tag{2.5} \end{equation}\] These estimates are called the fitted values. Using the definition of \(\hat\beta\) we get \[\begin{equation*} \hat y = X (X^\top X)^{-1} X^\top y =: Hy. \end{equation*}\] The matrix \(H = X (X^\top X)^{-1} X^\top\) is commonly called the hat matrix (because it “puts the hat on \(y\)”).

Finally, we can estimate the errors using the residuals \[\begin{equation} \hat\varepsilon = y - X \hat\beta = y - \hat y = y - H y = (I - H) y, \tag{2.6} \end{equation}\] where \(I\) is the \(n\times n\) identity matrix.
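
In R, the hat matrix, the fitted values and the residuals can be computed directly from these formulas (a sketch for the mtcars data; the variable names are our own choice):

y <- mtcars[, 1]
X <- cbind(1, data.matrix(mtcars[, 2:11]))  # design matrix with intercept column
H <- X %*% solve(t(X) %*% X) %*% t(X)       # hat matrix
y.hat <- H %*% y                            # fitted values, equation (2.5)
eps.hat <- (diag(nrow(X)) - H) %*% y        # residuals, equation (2.6)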

2.4 Models Without Intercept

Sometimes, it is useful to fit a model without an intercept. This means that we assume that the output \(y\) is a linear function of the inputs \(x_1, \ldots, x_p\) but that the output is zero when all inputs are zero. We write this as \[\begin{equation*} y_i = \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \varepsilon_i \end{equation*}\] for all \(i \in \{1, \ldots, n\}\).

When we write this model in matrix notation, we need to leave out the intercept column in the design matrix \(X\). For this case we get \[\begin{equation*} y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^n, \qquad \varepsilon= \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} \in \mathbb{R}^n, \qquad \beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} \in\mathbb{R}^p \end{equation*}\] as well as the matrix \[\begin{equation*} X = \begin{pmatrix} x_{1,1} & \cdots & x_{1,p} \\ x_{2,1} & \cdots & x_{2,p} \\ \vdots & & \vdots \\ x_{n,1} & \cdots & x_{n,p} \\ \end{pmatrix} \in \mathbb{R}^{n\times p}. \end{equation*}\] The vectors \(y\) and \(\varepsilon\) are unchanged, but the vector \(\beta\) is now shorter and the matrix \(X\) is narrower.

Most of the results in this module carry over to this case, for example the least squares estimate for \(\beta\) is still given by \[\begin{equation*} \hat\beta = (X^\top X)^{-1} X^\top y. \end{equation*}\] However, some results need to be adjusted and to be sure one has to carefully check the details of the derivations when results are needed for models without intercept.
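
In R, a model without intercept is obtained by removing the intercept from the formula, and the manual computation uses the narrower design matrix (a sketch for the mtcars data):

m0 <- lm(mpg ~ . - 1, data = mtcars)                     # "- 1" removes the intercept
X0 <- data.matrix(mtcars[, 2:11])                        # design matrix without the column of ones
beta0.hat <- solve(t(X0) %*% X0, t(X0) %*% mtcars[, 1])  # least squares estimate
max(abs(coef(m0) - beta0.hat))                           # tiny, up to rounding errors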

2.5 Example

To conclude this section, we demonstrate how these methods can be used in R. For this we consider the mtcars example from the beginning of the section again. I will first show how to do the analysis “by hand”, and later show how the same result can be obtained using R’s built-in functions.

We first split mtcars into the response column y (the first column) and the design matrix X (a column of ones, followed by columns 2 to 11 of mtcars):

y <- mtcars[, 1]
X <- cbind(1, data.matrix(mtcars[, 2:11]))

Next we compute \(X^\top X\) and solve the normal equations. Solving the normal equations directly is usually faster, easier, and numerically more accurate than first inverting the matrix \(X^\top X\) and then multiplying by \(X^\top y\).

XtX <- t(X) %*% X
beta.hat <- solve(XtX, t(X) %*% y)
beta.hat
            [,1]
     12.30337416
cyl  -0.11144048
disp  0.01333524
hp   -0.02148212
drat  0.78711097
wt   -3.71530393
qsec  0.82104075
vs    0.31776281
am    2.52022689
gear  0.65541302
carb -0.19941925

Without further checks it is hard to know whether the result is correct, or whether we made a mistake somewhere. One reassuring sign is that we argued earlier that higher hp should lead to lower mpg, and indeed the corresponding coefficient, -0.02148212, is negative.

Finally, we compute the fitted values and plot them against the responses. If everything worked, we would expect the points in this plot to lie close to the diagonal.

y.hat <- X %*% beta.hat
plot(y, y.hat, xlab = "responses", ylab = "fitted values")
abline(a = 0, b = 1) # plot the diagonal

For comparison we now re-do the analysis using built-in R commands. In the lm() command below, we use data=mtcars to tell R where the data is stored, and the formula mpg ~ . states that we want to model mpg as a function of all other variables (this is the meaning of .).

m <- lm(mpg ~ ., data = mtcars) # fit a linear model
coef(m) # get the estimated coefficients
(Intercept)         cyl        disp          hp        drat          wt 
12.30337416 -0.11144048  0.01333524 -0.02148212  0.78711097 -3.71530393 
       qsec          vs          am        gear        carb 
 0.82104075  0.31776281  2.52022689  0.65541302 -0.19941925 

Comparing these coefficients to the vector beta.hat from above shows that we got the same result using both methods. The fitted values can be computed using fitted.values(m). Here we just check that we get the same result as above:

max(abs(fitted.values(m) - y.hat))
[1] 3.410605e-13

This result 3.410605e-13 stands for the number \(3.410605 \cdot 10^{-13}\), which is extremely small. The difference between our results and R’s result is caused by rounding errors.

Summary

  • Multiple linear regression allows for more than one input, but still has only one output.
  • The least squares estimate for the coefficients is found by minimising the residual sum of squares.
  • The least squares estimate can be computed by solving the normal equations.
  • The hat matrix transforms responses into fitted values.