Section 4 Properties of the Least Squares Estimate

In the previous section we introduced random vectors and the multivariate normal distribution. We now use these tools to build a statistical model for multiple linear regression. As in the one-dimensional case, we assume that the errors are random. More precisely, we have \[\begin{equation} Y = X \beta + \varepsilon, \tag{4.1} \end{equation}\] where \(\varepsilon= (\varepsilon_1, \ldots, \varepsilon_n)\) is a random error vector. The individual errors \(\varepsilon_1, \ldots, \varepsilon_n\) are assumed to be i.i.d. random variables with \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\).

Remarks:

  • The parameters in this model are the regression coefficients \(\beta = (\beta_0, \beta_1, \ldots, \beta_p) \in \mathbb{R}^{p+1}\) and the error variance \(\sigma^2\).

  • Using the notation from the previous section, we have \(\varepsilon\sim \mathcal{N}(0, \sigma^2 I)\) and \(Y \sim \mathcal{N}(X\beta, \sigma^2 I)\).

  • The model assumes that the \(x\)-values are fixed and known. The only random quantities in the model are \(\varepsilon_i\) and \(Y_i\). This is a modelling assumption; there are more general models in which the \(x\)-values are also random.
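
As an illustration, the following short Python sketch simulates one data set from the model (4.1) using numpy. The design matrix, the coefficient vector \(\beta\) and the error standard deviation \(\sigma\) used here are arbitrary values chosen for the example only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed, known design: an intercept column plus p = 2 covariates for n = 50 cases.
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, p))])

beta = np.array([1.0, 2.0, -0.5])   # illustrative true coefficients (beta_0, beta_1, beta_2)
sigma = 0.7                         # illustrative true error standard deviation

# One sample from the model: Y = X beta + eps with eps_i i.i.d. N(0, sigma^2).
eps = rng.normal(0.0, sigma, size=n)
Y = X @ beta + eps
```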

Having defined the model, our next task is to understand the properties of the least squares estimator \(\hat\beta\) when applied to data from this model. We study these properties by treating the estimates as random variables.

4.1 Mean and Covariance

We begin by examining the bias of the estimator. To find \(\mathop{\mathrm{bias}}(\hat\beta) = \mathbb{E}(\hat\beta) - \beta\) we need to determine the expectation of \(\hat\beta = (X^\top X)^{-1} X^\top Y\), where \(Y\) is the random vector from the model (4.1).

Lemma 4.1 We have

  1. \(\hat\beta = \beta + (X^\top X)^{-1} X^\top \varepsilon\) and

  2. \(\hat\beta \sim \mathcal{N}\bigl( \beta, \sigma^2 (X^\top X)^{-1} \bigr)\).

Proof. From lemma 2.1 we know \[\begin{equation*} \hat\beta = (X^\top X)^{-1} X^\top Y. \end{equation*}\] Using the definition of \(Y\), we can write this as \[\begin{align*} \hat\beta &= (X^\top X)^{-1} X^\top Y \\ &= (X^\top X)^{-1} X^\top (X\beta + \varepsilon) \\ &= (X^\top X)^{-1} X^\top X \beta + (X^\top X)^{-1} X^\top \varepsilon\\ &= \beta + (X^\top X)^{-1} X^\top \varepsilon. \end{align*}\] This proves the first claim.

Since \(\varepsilon\) follows a multivariate normal distribution, \(\beta + (X^\top X)^{-1} X^\top \varepsilon\) is also normally distributed. Taking expectations we get \[\begin{align*} \mathbb{E}(\hat\beta) &= \mathbb{E}\bigl( \beta + (X^\top X)^{-1} X^\top \varepsilon\bigr) \\ &= \beta + (X^\top X)^{-1} X^\top \mathbb{E}(\varepsilon) \\ &= \beta, \end{align*}\] since \(\mathbb{E}(\varepsilon) = 0\).

For the covariance we find \[\begin{align*} \mathop{\mathrm{Cov}}(\hat\beta) &= \mathop{\mathrm{Cov}}\bigl( \beta + (X^\top X)^{-1} X^\top \varepsilon\bigr) \\ &= \mathop{\mathrm{Cov}}\bigl( (X^\top X)^{-1} X^\top \varepsilon\bigr) \\ &= (X^\top X)^{-1} X^\top \mathop{\mathrm{Cov}}(\varepsilon) \bigl( (X^\top X)^{-1} X^\top \bigr)^\top \\ &= (X^\top X)^{-1} X^\top \mathop{\mathrm{Cov}}(\varepsilon) X (X^\top X)^{-1}. \end{align*}\] Since \(\mathop{\mathrm{Cov}}(\varepsilon) = \sigma^2 I\), this simplifies to \[\begin{align*} \mathop{\mathrm{Cov}}(\hat\beta) &= (X^\top X)^{-1} X^\top \sigma^2 I X (X^\top X)^{-1} \\ &= \sigma^2 (X^\top X)^{-1} X^\top X (X^\top X)^{-1} \\ &= \sigma^2 (X^\top X)^{-1}. \end{align*}\] This completes the proof.

The lemma implies that \(\mathbb{E}(\hat\beta) = \beta\), i.e. the estimator \(\hat\beta\) is unbiased. Note that for this statement we only used \(\mathbb{E}(\varepsilon) = 0\) to compute the expectation of \(\hat\beta\). Thus, the estimator will still be unbiased for correlated noise or for noise which is not normally distributed.

We have seen earlier in equation (3.2) that the diagonal elements of a covariance matrix give the variances of the elements of the random vector. Setting \(C := (X^\top X)^{-1}\) as a shorthand here, we find that the individual estimated coefficients \(\hat\beta_j\) satisfy \[\begin{equation} \hat\beta_j \sim \mathcal{N}\bigl( \beta_j, \sigma^2 C_{jj} \bigr) \tag{4.2} \end{equation}\] for \(j \in \{0, \ldots, p\}\).
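
These distributional results are easy to check by simulation. The following sketch keeps the design matrix fixed, repeatedly draws new errors, refits the model, and compares the empirical mean and covariance of \(\hat\beta\) with \(\beta\) and \(\sigma^2 (X^\top X)^{-1}\); the design and parameter values are again arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

n, p = 50, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, p))])
beta = np.array([1.0, 2.0, -0.5])
sigma = 0.7
XtX_inv = np.linalg.inv(X.T @ X)

# Simulate many data sets from the model for the same fixed design and refit each time.
reps = 20_000
beta_hat = np.empty((reps, p + 1))
for r in range(reps):
    Y = X @ beta + rng.normal(0.0, sigma, size=n)
    beta_hat[r] = XtX_inv @ X.T @ Y

print(beta_hat.mean(axis=0))            # close to beta (unbiasedness)
print(np.cov(beta_hat, rowvar=False))   # close to sigma^2 (X^T X)^{-1}
print(sigma**2 * XtX_inv)
```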

Before we can use these results in practice, we need to address a problem: the error variance \(\sigma^2\) is usually unknown. We will return to estimating \(\sigma^2\) in section 4.4 below.

4.2 Properties of the Hat Matrix

Before we can estimate the error variance \(\sigma^2\), we need to establish some important properties of the hat matrix \(H = X (X^\top X)^{-1} X^\top\). These properties will be essential tools for deriving the distribution of our variance estimator.

Lemma 4.2 The hat matrix \(H\) has the following properties:

  1. \(H^\top = H\)
  2. \((I-H)^\top = I-H\)
  3. \(H^2 = H\)
  4. \((I-H)^2 = I-H\)
  5. \(HX = X\)
  6. \((I-H)X = 0\)

Proof. For property 1 we have \[\begin{align*} H^\top &= \bigl( X (X^\top X)^{-1} X^\top \bigr)^\top \\ &= (X^\top)^\top \bigl((X^\top X)^{-1}\bigr)^\top X^\top \\ &= X (X^\top X)^{-1} X^\top \\ &= H, \end{align*}\] where we used that the inverse of a symmetric matrix is symmetric.

Property 2 follows from property 1, since \[\begin{equation*} (I - H)^\top = I^\top - H^\top = I - H. \end{equation*}\]

For property 3 we have \[\begin{align*} H^2 &= \bigl( X (X^\top X)^{-1} X^\top \bigr) \bigl( X (X^\top X)^{-1} X^\top \bigr) \\ &= X \bigl( (X^\top X)^{-1} X^\top X \bigr) (X^\top X)^{-1} X^\top \\ &= X (X^\top X)^{-1} X^\top \\ &= H. \end{align*}\]

Property 4 follows from property 3, since \[\begin{equation*} (I - H)^2 = I - 2H + H^2 = I - 2H + H = I - H. \end{equation*}\]

For property 5 we have \[\begin{equation*} H X = X (X^\top X)^{-1} X^\top X = X I = X. \end{equation*}\]

Property 6 follows from property 5, since \[\begin{equation*} (I - H) X = IX - HX = X - X = 0. \end{equation*}\] This completes the proof.

Remark. Properties 1 and 2 state that \(H\) and \(I - H\) are symmetric. Properties 3 and 4 state that \(H\) and \(I - H\) are idempotent.

Before proceeding to further properties of \(H\), we note without proof that geometrically, \(H\) can be interpreted as the orthogonal projection onto the subspace of \(\mathbb{R}^n\) which is spanned by the columns of \(X\). This subspace consists of all vectors of the form \(Xb\) with \(b \in \mathbb{R}^{p+1}\), i.e. of all possible fitted value vectors, and the least squares procedure finds the point \(\hat y\) in this subspace which is closest to the observed data \(y\in\mathbb{R}^n\).
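
The properties listed in lemma 4.2 are easy to verify numerically for a concrete design matrix. The following sketch uses an arbitrary simulated design and checks symmetry, idempotence, \(HX = X\) and \((I-H)X = 0\) up to floating point error.

```python
import numpy as np

rng = np.random.default_rng(3)

n, p = 50, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, p))])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # the hat matrix
I = np.eye(n)

print(np.allclose(H, H.T))             # property 1: H is symmetric
print(np.allclose(H @ H, H))           # property 3: H is idempotent
print(np.allclose(H @ X, X))           # property 5: H X = X
print(np.allclose((I - H) @ X, 0))     # property 6: (I - H) X = 0
```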

We can now use the transformation properties of the multivariate normal distribution to find the distributions of the fitted values and residuals.

Lemma 4.3 Under the model (4.1), the fitted values and residuals satisfy \[\begin{equation} \hat Y \sim \mathcal{N}\bigl( X\beta, \sigma^2 H \bigr) \tag{4.3} \end{equation}\] and \[\begin{equation} \hat\varepsilon\sim \mathcal{N}\bigl( 0, \sigma^2 (I - H) \bigr). \tag{4.4} \end{equation}\]

Proof. Since \(Y \sim \mathcal{N}(X\beta, \sigma^2 I)\), we have \(\hat Y = HY \sim \mathcal{N}(HX\beta, H\sigma^2 I H^\top)\) by lemma 3.1. Using properties 1, 3 and 5 of lemma 4.2, we find \(\hat Y \sim \mathcal{N}(X\beta, \sigma^2 H)\). Similarly, \(\hat\varepsilon= (I - H)Y \sim \mathcal{N}\bigl((I - H)X\beta, (I-H)\sigma^2 I (I-H)^\top\bigr)\), and using properties 2, 4 and 6 of lemma 4.2, we obtain \(\hat\varepsilon\sim \mathcal{N}\bigl(0, \sigma^2(I - H)\bigr)\).
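
Equation (4.4) in particular shows that the residuals are correlated and do not all have the same variance: their covariance matrix is \(\sigma^2 (I - H)\) rather than \(\sigma^2 I\). The following Monte Carlo sketch (again with an arbitrary illustrative design and parameters) compares the empirical covariance of the residuals with \(\sigma^2 (I - H)\).

```python
import numpy as np

rng = np.random.default_rng(4)

n, p = 50, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, p))])
beta = np.array([1.0, 2.0, -0.5])
sigma = 0.7
H = X @ np.linalg.inv(X.T @ X) @ X.T

reps = 20_000
resid = np.empty((reps, n))
for r in range(reps):
    Y = X @ beta + rng.normal(0.0, sigma, size=n)
    resid[r] = Y - H @ Y                      # hat(eps) = (I - H) Y

# Empirical covariance of the residuals vs. the theoretical value sigma^2 (I - H).
emp_cov = np.cov(resid, rowvar=False)
print(np.abs(emp_cov - sigma**2 * (np.eye(n) - H)).max())   # small (Monte Carlo error only)
```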

The distributions of \(\hat Y\) and \(\hat\varepsilon\) reveal a deeper geometric structure, described in the next lemma.

Lemma 4.4 For every vector \(v \in \mathbb{R}^n\) we have \[\begin{equation} \|v\|^2 = \|Hv\|^2 + \|(I-H)v\|^2. \tag{4.5} \end{equation}\] In particular, for the data vector \(y\) we have \(\|y\|^2 = \|\hat y\|^2 + \|\hat\varepsilon\|^2\).

Proof. We can write \(v\) as \[\begin{align*} v &= (H + I - H)v \\ &= Hv + (I-H)v. \end{align*}\] The inner product between these two components is \[\begin{align*} (Hv)^\top (I-H)v &= v^\top H^\top (I-H) v \\ &= v^\top H (I-H) v \\ &= v^\top (H-H^2) v \\ &= v^\top (H-H) v \\ &= 0, \end{align*}\] so the two vectors are orthogonal. As a result we get \[\begin{align*} \|v\|^2 &= v^\top v \\ &= \bigl( Hv + (I-H)v \bigr)^\top \bigl( Hv + (I-H)v \bigr) \\ &= (Hv)^\top Hv + 2 (Hv)^\top (I-H)v + \bigl((I-H)v\bigr)^\top (I-H)v \\ &= \| Hv \|^2 + \|(I-H)v \|^2 \end{align*}\] for any vector \(v \in \mathbb{R}^n\). (This is Pythagoras’ theorem in \(\mathbb{R}^n\).) Since \(\hat y = Hy\) and \(\hat\varepsilon= (I - H)y\), we can apply this idea to the vector \(v = y\) to get \(\|y\|^2 = \|\hat y\|^2 + \|\hat\varepsilon\|^2\). This completes the proof.

Some authors define:

  • \(\mathrm{SS}_\mathrm{T} = \sum_{i=1}^n y_i^2\) (where “T” stands for “total”)
  • \(\mathrm{SS}_\mathrm{R} = \sum_{i=1}^n \hat y_i^2\) (where “R” stands for “regression”)
  • \(\mathrm{SS}_\mathrm{E} = \sum_{i=1}^n \hat\varepsilon_i^2 = \sum_{i=1}^n (y_i-\hat y_i)^2\) (where “E” stands for “error”)

Using this notation, our equation \(\|y\|^2 = \|\hat y\|^2 + \|\hat\varepsilon\|^2\) turns into \[\begin{equation*} \mathrm{SS}_\mathrm{T} = \mathrm{SS}_\mathrm{R} + \mathrm{SS}_\mathrm{E}. \end{equation*}\]
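
For a concrete data set the decomposition can be checked directly. The sketch below (with an arbitrary simulated design and response) computes \(\mathrm{SS}_\mathrm{T}\), \(\mathrm{SS}_\mathrm{R}\) and \(\mathrm{SS}_\mathrm{E}\) and verifies the identity up to floating point error.

```python
import numpy as np

rng = np.random.default_rng(5)

n, p = 50, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, p))])
H = X @ np.linalg.inv(X.T @ X) @ X.T

# Any response vector y will do; here we simulate one from the model.
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0.0, 0.7, size=n)
y_hat = H @ y            # fitted values
resid = y - y_hat        # residuals

SS_T = np.sum(y**2)
SS_R = np.sum(y_hat**2)
SS_E = np.sum(resid**2)
print(np.isclose(SS_T, SS_R + SS_E))   # SS_T = SS_R + SS_E
```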

A final algebraic property of the hat matrix concerns its trace.

Lemma 4.5 We have \[\begin{equation} \mathop{\mathrm{tr}}(H) = p+1. \tag{4.6} \end{equation}\]

Proof. Using the property of the trace that \(\mathop{\mathrm{tr}}(ABC) = \mathop{\mathrm{tr}}(BCA)\) (see section A.2.5), we find \[\begin{align*} \mathop{\mathrm{tr}}(H) &= \mathop{\mathrm{tr}}\bigl( X (X^\top X)^{-1} X^\top \bigr) \\ &= \mathop{\mathrm{tr}}\bigl( (X^\top X)^{-1} X^\top X \bigr) \\ &= \mathop{\mathrm{tr}}(I) \\ &= p+1, \end{align*}\] where \(I\) in the penultimate line denotes the \((p+1) \times (p+1)\) identity matrix, since \(X^\top X \in \mathbb{R}^{(p+1)\times(p+1)}\). This completes the proof.
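
The trace identity can be confirmed in the same way for a concrete (and again arbitrary) design matrix:

```python
import numpy as np

rng = np.random.default_rng(6)

n, p = 50, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, p))])
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.trace(H))   # equals p + 1 = 3 up to floating point error
```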

4.3 Cochran’s theorem

To prove that our variance estimator is unbiased, we need to understand the distribution of certain expressions involving the error vector. These expressions are of the form \(\varepsilon^\top A \varepsilon\) for suitable matrices \(A\). Such expressions are called quadratic forms.

Definition 4.1 For \(x\in\mathbb{R}^n\) and \(A\in\mathbb{R}^{n\times n}\), the expression \(x^\top A x\) is called a quadratic form.

We use a simplified version of Cochran’s theorem as our main tool in this and the following section:

Theorem 4.1 Under the model (4.1), the following statements are true:

  1. \(\frac{1}{\sigma^2} \varepsilon^\top H \varepsilon\sim \chi^2(p+1)\)

  2. \(\frac{1}{\sigma^2} \varepsilon^\top (I - H) \varepsilon\sim \chi^2(n - p - 1)\)

  3. \(H \varepsilon\) and \((I-H)\varepsilon\) are independent.

Proof. We transform the error vector \(\varepsilon\) to a coordinate system where \(H\) becomes diagonal. In this new coordinate system, the quadratic forms \(\varepsilon^\top H \varepsilon\) and \(\varepsilon^\top (I-H) \varepsilon\) become sums of squares of independent standard normals, making their distributions easy to determine.

We start by diagonalising \(H\). Since \(H\) is symmetric, we can diagonalise \(H\) (see A.2 in the appendix): there is an orthogonal matrix \(U\) such that \(D := U H U^\top\) is diagonal, and the diagonal elements of \(D\) are the eigenvalues of \(H\).

Next we determine the eigenvalues of \(H\). Since \(H\) is idempotent, we have \(H^2 = H\). This implies that every eigenvalue \(\lambda\) of \(H\) satisfies \(\lambda^2 = \lambda\), and thus \(\lambda \in \{0, 1\}\). Therefore, the diagonal elements of \(D\) can only be \(0\) or \(1\).

Since \(U\) is orthogonal, we have \(U^\top U = I\) and thus \[\begin{equation*} U^\top D U = U^\top U H U^\top U = H. \end{equation*}\] The same matrix \(U\) also diagonalises \(I-H\), since \[\begin{equation*} U (I -H) U^\top = U U^\top - U H U^\top = I - D. \end{equation*}\] Note that exactly one of the diagonal elements \(D_{ii}\) and \((I - D)_{ii}\) is \(1\) and the other is \(0\) for every \(i\). This partitions the index set \(\{1, \ldots, n\}\) into two disjoint subsets: the set \(\mathcal{I}_1 = \{ i : D_{ii} = 1 \}\) contains exactly \(p+1\) elements (corresponding to \(\mathop{\mathrm{rank}}(H) = p+1\)), and the set \(\mathcal{I}_0 = \{ i : D_{ii} = 0 \}\) contains the remaining \(n - p - 1\) elements. This partition is the key to understanding both the degrees of freedom in the \(\chi^2\) distributions and the independence of the two quadratic forms.

Since \(\varepsilon\sim \mathcal{N}(0, \sigma^2 I)\), we find that \(\eta := U \varepsilon\) is normally distributed with mean \(U 0 = 0\) and covariance matrix \(\sigma^2 U I U^\top = \sigma^2 U U^\top = \sigma^2 I\). Thus \(\eta\) has the same distribution as \(\varepsilon\) does: \(\eta \sim \mathcal{N}(0, \sigma^2I)\) and the components \(\eta_i\) are independent of each other. We have \[\begin{equation*} H \varepsilon = U^\top D U \varepsilon = U^\top D \eta \end{equation*}\] and \[\begin{equation*} (I - H) \varepsilon = U^\top (I - D) U \varepsilon = U^\top (I - D) \eta. \end{equation*}\] To prove the independence statement, we examine how the components of \(\eta\) contribute to \(D\eta\) and \((I-D)\eta\). Since \(D\) is diagonal, we have \[\begin{equation*} (D\eta)_i = D_{ii} \eta_i = \begin{cases} \eta_i & \text{if } i \in \mathcal{I}_1 \\ 0 & \text{if } i \in \mathcal{I}_0. \end{cases} \end{equation*}\] Similarly, \[\begin{equation*} \bigl((I - D)\eta\bigr)_i = (1 - D_{ii}) \eta_i = \begin{cases} 0 & \text{if } i \in \mathcal{I}_1 \\ \eta_i & \text{if } i \in \mathcal{I}_0. \end{cases} \end{equation*}\] This shows that each component \(\eta_i\) contributes to exactly one of the two vectors \(D\eta\) and \((I-D)\eta\): it contributes to \(D\eta\) if \(i \in \mathcal{I}_1\), and to \((I-D)\eta\) if \(i \in \mathcal{I}_0\). Since the sets \(\mathcal{I}_1\) and \(\mathcal{I}_0\) are disjoint, the vectors \(D\eta\) and \((I-D)\eta\) depend on disjoint subsets of the independent random variables \(\eta_1, \ldots, \eta_n\). Therefore, \(D\eta\) and \((I-D)\eta\) are independent. Finally, since \(H\varepsilon= U^\top D\eta\) and \((I-H)\varepsilon= U^\top(I-D)\eta\) are deterministic linear transformations of \(D\eta\) and \((I-D)\eta\), it follows that \(H\varepsilon\) and \((I - H)\varepsilon\) are also independent. This proves the third statement of the theorem.

For the first statement, we note that \[\begin{align*} \varepsilon^\top H \varepsilon &= \varepsilon^\top U^\top D U \varepsilon\\ &= \eta^\top D \eta \\ &= \sum_{i=1 \atop D_{ii}=1}^n \eta_i^2. \end{align*}\] Since \((X^\top X) \in\mathbb{R}^{(p+1)\times (p+1)}\) is invertible, one can show that \(\mathop{\mathrm{rank}}(H) = p+1\) and thus that there are \(p+1\) terms contributing to the sum (we skip the proof of this statement here). Thus, \[\begin{equation*} \frac{1}{\sigma^2} \varepsilon^\top H \varepsilon = \sum_{i=1 \atop D_{ii}=1}^n \bigl(\eta_i/\sigma\bigr)^2 \end{equation*}\] is the sum of the squares of \(p+1\) independent standard normals, and hence is \(\chi^2(p+1)\)-distributed. This completes the proof of the first statement.

Finally, the second statement follows in much the same way as the first one, except that \(H\) is replaced with \(I-H\) and the sum is over the \(n - p - 1\) indices \(i\) where \(D_{ii} = 0\). This completes the proof.

Since \(\varepsilon^\top H \varepsilon= (H\varepsilon)^\top (H\varepsilon)\) is a function of \(H\varepsilon\), and similarly for \(I-H\), the independence in part 3 of the theorem extends to the quadratic forms in parts 1 and 2.
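
The statements of the theorem can also be illustrated by simulation. A \(\chi^2(\nu)\) distribution has mean \(\nu\) and variance \(2\nu\), so the empirical moments of the two quadratic forms should be close to these values, and the sample correlation between the two quadratic forms should be close to zero, which is consistent with (though of course does not prove) independence. The design and parameters in the sketch below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)

n, p = 50, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, p))])
H = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(n)
sigma = 0.7

reps = 20_000
q_H = np.empty(reps)     # eps^T H eps / sigma^2
q_IH = np.empty(reps)    # eps^T (I - H) eps / sigma^2
for r in range(reps):
    eps = rng.normal(0.0, sigma, size=n)
    q_H[r] = eps @ H @ eps / sigma**2
    q_IH[r] = eps @ (I - H) @ eps / sigma**2

print(q_H.mean(), q_H.var())          # approx. p + 1 = 3 and 2(p + 1) = 6
print(q_IH.mean(), q_IH.var())        # approx. n - p - 1 = 47 and 2(n - p - 1) = 94
print(np.corrcoef(q_H, q_IH)[0, 1])   # approx. 0, consistent with independence
```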

4.4 Estimating the Error Variance

Recall from equation (4.2) that the variance of each coefficient estimate is \(\sigma^2 C_{jj}\). To use this result in practice—for constructing confidence intervals or performing hypothesis tests—we need to estimate the unknown parameter \(\sigma^2\). A natural approach is to examine the residuals \(y_i - \hat y_i\), which estimate the unobserved errors \(\varepsilon_i\).

Having established the necessary properties of the hat matrix and Cochran’s theorem, we can now show that an unbiased estimator for \(\sigma^2\) is given by \[\begin{equation} \hat\sigma^2 := \frac{1}{n-p-1} \sum_{i=1}^n (y_i - \hat y_i)^2 = \frac{1}{n-p-1} \sum_{i=1}^n \hat\varepsilon_i^2. \tag{4.7} \end{equation}\] As in the case of simple linear regression, see equation (1.4), the estimator does not use the prefactor \(1/n\) which one might expect; instead, the denominator is decreased by one for each component of the vector \(\beta\). The proof that \(\hat\sigma^2\) is unbiased relies on Cochran’s theorem.

Lemma 4.6 We have \[\begin{equation*} \mathbb{E}(\hat\sigma^2) = \sigma^2. \end{equation*}\]

Proof. We first note that for data from the model we have \[\begin{equation*} \hat\varepsilon = (I-H) y = (I-H) (X\beta + \varepsilon) = (I-H) \varepsilon, \end{equation*}\] where we used property 6 of lemma 4.2 to see that \((I-H)X\beta = 0\). Using this relation, we get \[\begin{align*} (n - p - 1) \hat\sigma^2 &= \hat\varepsilon^\top \hat\varepsilon\\ &= \varepsilon^\top (I - H)^\top (I - H) \varepsilon\\ &= \varepsilon^\top (I - H) \varepsilon. \end{align*}\]

Now we can apply Cochran’s theorem. This shows that \[\begin{equation} \frac{1}{\sigma^2} (n - p - 1) \hat\sigma^2 = \frac{1}{\sigma^2} \varepsilon^\top (I-H) \varepsilon \sim \chi^2(n - p - 1). \tag{4.8} \end{equation}\] Since the expectation of a \(\chi^2(\nu)\) distribution equals \(\nu\) (see appendix B.5.1), we find \[\begin{equation*} \frac{1}{\sigma^2} (n - p - 1) \mathbb{E}(\hat\sigma^2) = n - p - 1 \end{equation*}\] and thus \[\begin{equation*} \mathbb{E}(\hat\sigma^2) = \sigma^2. \end{equation*}\] This completes the proof.

The lemma shows that \(\hat\sigma^2\) is an unbiased estimator for \(\sigma^2\).
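
As a final numerical check, the following sketch simulates many data sets from the model (with the same arbitrary illustrative design and parameters as in the earlier sketches) and compares the average of \(\hat\sigma^2\) with the true value \(\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(8)

n, p = 50, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, p))])
beta = np.array([1.0, 2.0, -0.5])
sigma = 0.7
H = X @ np.linalg.inv(X.T @ X) @ X.T

reps = 20_000
sigma2_hat = np.empty(reps)
for r in range(reps):
    y = X @ beta + rng.normal(0.0, sigma, size=n)
    resid = y - H @ y                           # residuals for this simulated data set
    sigma2_hat[r] = resid @ resid / (n - p - 1)

print(sigma2_hat.mean())   # close to sigma^2
print(sigma**2)            # 0.49
```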

This concludes our study of the statistical properties of the least squares estimator. In the next section we will use these properties to construct confidence intervals and hypothesis tests for the regression coefficients.

Summary

  • The least squares estimator for the regression coefficients is unbiased.
  • The hat matrix is idempotent and symmetric.
  • Cochran’s theorem allows us to understand the distribution of some quadratic forms involving the hat matrix.
  • \(\hat\sigma^2\) is an unbiased estimator for \(\sigma^2\).