Section 4 Properties of the Least Squares Estimate

Like in the one-dimensional case, we can build a statistical model for the data. Here we assume that the residuals are random. More precisely we have \[\begin{equation} Y = X \beta + \varepsilon. \tag{4.1} \end{equation}\] where \(\varepsilon= (\varepsilon_1, \ldots, \varepsilon_n)\) is a random error. The individual errors \(\varepsilon_1, \ldots, \varepsilon_n\) are assumed to be i.i.d. random variables with \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\).

Remarks:

The parameters in this model are the regression coefficients \(\beta = (\beta_0, \beta_1, \cdots, \beta_p) \in \mathbb{R}^{p+1}\) and the error variance \(\sigma^2\).
The model assumes that the \(x\)-values are fixed and known. The only random quantities in the model are \(\varepsilon_i\) and \(Y_i\). This is a modelling assumption, there are more general models where the \(x\)-values are also random.
Using the notation from the previous section, we have \(\varepsilon\sim \mathcal{N}(0, \sigma^2 I)\) and \(Y \sim \mathcal{N}(X\beta, \sigma^2 I)\).

The usual approach in statistics to quantify how well an estimator works is to apply it to random samples from the statistical model, where we can assume that we know the parameters, and then to study how well the parameters are reconstructed by the estimators. Since this approach uses random samples as input the the estimator, we obtain random estimates and we need to use statistical methods to quantify how close the estimate is to the truth.

4.1 Mean and Covariance

The bias of an estimator is the difference between the expected value of the estimate and the truth. For the least squares estimator we have \[\begin{equation*} \mathop{\mathrm{bias}}(\hat\beta) = \mathbb{E}(\hat\beta) - \beta, \end{equation*}\] where \[\begin{equation*} \hat\beta = (X^\top X)^{-1} X^\top Y \end{equation*}\] and \(Y\) is the random vector from (4.1).

Lemma 4.1 We have

\(\hat\beta = \beta + (X^\top X)^{-1} X^\top \varepsilon\) and
\(\hat\beta \sim \mathcal{N}\bigl( \beta, \sigma^2 (X^\top X)^{-1} \bigr)\).

Proof. From lemma 2.1 we know \[\begin{equation*} \hat\beta = (X^\top X)^{-1} X^\top Y, \end{equation*}\] Using the definition of \(Y\) we can write this as \[\begin{align*} \hat\beta &= (X^\top X)^{-1} X^\top Y \\ &= (X^\top X)^{-1} X^\top (X\beta + \varepsilon) \\ &= (X^\top X)^{-1} X^\top X \beta + (X^\top X)^{-1} X^\top \varepsilon\\ &= \beta + (X^\top X)^{-1} X^\top \varepsilon. \end{align*}\] This proves the first claim.

Since \(\varepsilon\) follows a multi-variate normal distribution, \(\beta + (X^\top X)^{-1} X^\top \varepsilon\) is also normally distributed. Taking expectations we get \[\begin{align*} \mathbb{E}(\hat\beta) &= \mathbb{E}\bigl( \beta + (X^\top X)^{-1} X^\top \varepsilon\bigr) \\ &= \beta + (X^\top X)^{-1} X^\top \mathbb{E}(\varepsilon) \\ &= \beta, \end{align*}\] since \(\mathbb{E}(\varepsilon) = 0\).

For the covariance we find \[\begin{align*} \mathop{\mathrm{Cov}}(\hat\beta) &= \mathop{\mathrm{Cov}}\bigl( \beta + (X^\top X)^{-1} X^\top \varepsilon\bigr) \\ &= \mathop{\mathrm{Cov}}\bigl( (X^\top X)^{-1} X^\top \varepsilon\bigr) \\ &= (X^\top X)^{-1} X^\top \mathop{\mathrm{Cov}}(\varepsilon) \bigl( (X^\top X)^{-1} X^\top \bigr)^\top \\ &= (X^\top X)^{-1} X^\top \mathop{\mathrm{Cov}}(\varepsilon) X (X^\top X)^{-1}. \end{align*}\] Since \(\mathop{\mathrm{Cov}}(\varepsilon) = \sigma^2 I\), this simplifies to \[\begin{align*} \mathop{\mathrm{Cov}}(\hat\beta) &= (X^\top X)^{-1} X^\top \sigma^2 I X (X^\top X)^{-1} \\ &= \sigma^2 (X^\top X)^{-1} X^\top X (X^\top X)^{-1} \\ &= \sigma^2 (X^\top X)^{-1}. \end{align*}\] This completes the proof.

The lemma implies that \(\mathbb{E}(\hat\beta) = \beta\), i.e. the estimator \(\hat\beta\) is unbiased. Note that for this statement we only used \(\mathbb{E}(\varepsilon) = 0\) to compute the expectation of \(\hat\beta\). Thus, the estimator will still be unbiased for correlated noise or for noise which is not normally distributed.

We have seen earlier that the diagonal elements of a covariance matrix give the variances of the elements of the random vector. Setting \(C := (X^\top X)^{-1}\) as a shorthand here, we find that the individual estimated coefficients \(\hat\beta_i\) satisfy \[\begin{equation} \hat\beta_i \sim \mathcal{N}\bigl( \beta_i, \sigma^2 C_{ii} \bigr) \tag{4.2} \end{equation}\] for \(i \in \{1, \ldots, n\}\).

These results about the (co-)variances of the estimator are not very useful in practice, because the error variance \(\sigma^2\) is usually unknown. To derive more useful results, we will consider how to estimate this variance.

4.2 Properties of the Hat Matrix

In this and the following sections we will use various properties of the hat matrix \(H = X (X^\top X)^{-1} X^\top\).

Lemma 4.2 The hat matrix \(H\) has the following properties:

\(H\) is symmetric, i.e. \(H^\top = H\).
\(H\) is idempotent, i.e. \(H^2 = H\).

Proof. For the first statement we have \[\begin{align*} H^\top &= \bigl( X (X^\top X)^{-1} X^\top \bigr)^\top \\ &= (X^\top)^\top \bigl((X^\top X)^{-1}\bigr)^\top X^\top \\ &= X (X^\top X)^{-1} X^\top \\ &= H, \end{align*}\] where we used that the inverse of a symmetric matrix is symmetric. The second statement follow from \[\begin{align*} H^2 &= \bigl( X (X^\top X)^{-1} X^\top \bigr) \bigl( X (X^\top X)^{-1} X^\top \bigr) \\ &= X \bigl( (X^\top X)^{-1} X^\top X \bigr) (X^\top X)^{-1} X^\top \\ &= X (X^\top X)^{-1} X^\top. \end{align*}\] This completes the proof.

Both properties from the lemma carry over from \(H\) to \(I - H\): we have \((I - H)^\top = I^\top - H^\top = I - H\) and \[\begin{align*} (I - H)^2 &= (I - H)(I - H) \\ &= I^2 - HI - IH + H^2 \\ &= I - H - H + H \\ &= I - H. \end{align*}\]

For future reference we also state two simple results: we have \[\begin{equation} H X = X (X^\top X)^{-1} X^\top X = X \tag{4.3} \end{equation}\] and \[\begin{equation} (I - H) X = IX - HX = X - X = 0. \tag{4.4} \end{equation}\]

Lemma 4.3 For every vector \(v \in \mathbb{R}^n\) we have \[\begin{equation} \|y\|^2 = \|\hat y\|^2 + \|\hat\varepsilon\|^2. \tag{4.5} \end{equation}\]

Proof. We can write \(v\) as \[\begin{align*} v &= (H + I - H)v \\ &= Hv + (I-H)v. \end{align*}\] The inner product between these two components is \[\begin{align*} (Hv)^\top (I-H)v &= v^\top H^\top (I-H) v \\ &= v^\top H (I-H) v \\ &= v^\top (H-H^2) v \\ &= v^\top (H-H) v \\ &= 0, \end{align*}\] so the two vectors are orthogonal. As a result we get \[\begin{align*} \|v\|^2 &= v^\top v \\ &= \bigl( Hv + (I-H)v \bigr)^\top \bigl( Hv + (I-H)v \bigr) \\ &= (Hv)^\top Hv + 2 (Hv)^\top (I-H)v + \bigl((I-H)v\bigr)^\top (I-H)v \\ &= \| Hv \|^2 + \|(I-H)v \|^2 \end{align*}\] for any vector \(v \in \mathbb{R}^n\). (This is Pythagoras’ theorem in \(\mathbb{R}^n\).) Since \(\hat y = Hy\) and \(\hat\varepsilon= (I - H)y\), we can apply this idea to the vector \(v = y\) to get \(\|y\|^2 = \|\hat y\|^2 + \|\hat\varepsilon\|^2\). This completes the proof.

Some authors define:

\(\mathrm{SS}_\mathrm{T} = \sum_{i=1}^n y_i^2\) (where “T” stands for “total”)
\(\mathrm{SS}_\mathrm{R} = \sum_{i=1}^n \hat y_i^2\) (where “R” stands for “regression”)
\(\mathrm{SS}_\mathrm{E} = \sum_{i=1}^n \hat\varepsilon_i^2 = \sum_{i=1}^n (y_i-\hat y_i)^2\) (where “E” stands for “error”)

Using this notation, our equation \(\|y\|^2 = \|\hat y\|^2 + \|\hat\varepsilon\|^2\) turns into \[\begin{equation*} \mathrm{SS}_\mathrm{T} = \mathrm{SS}_\mathrm{R} + \mathrm{SS}_\mathrm{E}. \end{equation*}\]

We note without proof that geometrically, \(H\) can be interpreted as the orthogonal projection onto the subspace of \(\mathbb{R}^n\) which is spanned by the columns of \(X\). This subspace contains the possible output vectors of the linear system and the least squares procedure finds the point \(\hat y\) in this subspace which is closest to the observed data \(y\in\mathbb{R}^n\).

One final property of \(H\) which we will need later is that the trace of \(H\) is equal to the number of parameters in the model plus one

Lemma 4.4 We have \[\begin{equation} \mathop{\mathrm{tr}}(H) = p+1. \tag{4.6} \end{equation}\]

Proof. Using the property of the trace that \(\mathop{\mathrm{tr}}(ABC) = \mathop{\mathrm{tr}}(BCA)\) (see section A.2.5), we find \[\begin{align*} \mathop{\mathrm{tr}}(H) &= \mathop{\mathrm{tr}}\bigl( X (X^\top X)^{-1} X^\top \bigr) \\ &= \mathop{\mathrm{tr}}\bigl( (X^\top X)^{-1} X^\top X \bigr) \\ &= \mathop{\mathrm{tr}}(I) \\ &= p+1. \end{align*}\] This completes the proof.

4.3 Cochran’s theorem

Our main tool in this and the following section will be a simplified version of Cochran’s theorem:

Theorem 4.1 The following statements are true:

\(\frac{1}{\sigma^2} \varepsilon^\top H \varepsilon\sim \chi^2(p+1)\)
\(\frac{1}{\sigma^2} \varepsilon^\top (I - H) \varepsilon\sim \chi^2(n - p - 1)\)
\(H \varepsilon\) and \((I-H)\varepsilon\) are independent.

Proof. Since \(H\) is symmetric, we can diagonalise \(H\) (see A.2 in the appendix): there is an orthogonal matrix \(U\) such that \(D := U H U^\top\) is diagonal, and the diagonal elements of \(D\) are the eigenvalues of \(H\). Since \(H\) is idempotent, these diagonal elements can only be \(0\) or \(1\). Also, since \(U\) is orthogonal, we have \(U^\top U = I\) and thus \[\begin{equation*} U^\top D U = U^\top U H U^\top U = H. \end{equation*}\] The same matrix \(U\) also diagonalises \(I-H\), since \(U (I -H) U^\top = U U^\top - U H U^\top = I - D\). Exactly one of the diagonal elements \(D_{ii}\) and \((I - D)_{ii}\) is \(1\) and the other one is \(0\) for every \(i\).

Since \(\varepsilon\sim \mathcal{N}(0, \sigma^2 I)\) we find that \(\eta := U \varepsilon\) is normally distributed with mean \(U 0 = 0\) and covariance matrix \(\sigma^2 U I U^\top = \sigma^2 U U^\top = \sigma^2 I\). Thus \(\eta\) has the same distribution as \(\varepsilon\) does: \(\eta \sim \mathcal{N}(0, \sigma^2I)\) and the components \(\eta_i\) are independent of each other. We have \[\begin{equation*} H \varepsilon = U^\top D U \varepsilon = U^\top D \eta. \end{equation*}\] and \[\begin{equation*} (I - H) \varepsilon = U^\top (I - D) U \varepsilon = U^\top (I - D) \eta. \end{equation*}\] Since \((D\eta)_i = 0\) if \(D_{ii}=0\) and \(\bigl((I - D) \eta)_i = 0\) otherwise, each component of \(\eta\) contributes to exactly one of the two vectors \(D\eta\) and \((I-D)\eta\). Thus, \(D\eta\) and \((I-D)\eta\) are independent, and thus \(H\varepsilon\) and \((I - H)\varepsilon\) are also independent. This proves the third statement of the theorem.

For the first statement, we note that \[\begin{align*} \varepsilon^\top H \varepsilon &= \varepsilon^\top U^\top D U \varepsilon\\ &= \eta^\top D \eta \\ &= \sum_{i=1 \atop D_{ii}=1}^n \eta_i^2. \end{align*}\] Since \((X^\top X) \in\mathbb{R}^{(p+1)\times (p+1)}\) is invertible, one can show that \(\mathop{\mathrm{rank}}(H) = p+1\) and thus that there are \(p+1\) terms contributing to the sum (we skip the proof of this statement here). Thus, \[\begin{equation*} \frac{1}{\sigma^2} \varepsilon^\top H \varepsilon = \sum_{i=1 \atop D_{ii}=1}^n \bigl(\eta_i/\sigma)^2 \end{equation*}\] is the sum of the squares of \(p+1\) independent standard normals, and thus is \(\chi^2(p+1)\) distributed. This completes the proof of the first statement.

Finally, the second statement follows in much of the same way as the first one, except that \(H\) is replaced with \(I-H\) and the sum is over the \(n - p - 1\) indices \(i\) where \(D_{ii} = 0\). This completes the proof.

Expressions of the form \(x^\top A x\) for \(x\in\mathbb{R}^n\) and \(A\in\mathbb{R}^{n\times n}\) are called quadratic forms.

While the theorem as written only states that \(H \varepsilon\) and \((I - H)\varepsilon\) are independent of each other, we can replace one or both of these terms the corresponding quadratic forms as still keep the independence. Since \((H \varepsilon)^\top (H \varepsilon) = \varepsilon^\top H^\top H \varepsilon= \varepsilon^\top H \varepsilon\), the quadratic form \(\varepsilon^\top H \varepsilon\) is a function of \(H \varepsilon\) and a similar statement holds with \(H-I\) instead of \(H\).

4.4 Estimating the Error Variance

So far we have only considered how to estimate the parameter vector \(\beta\) and we have ignored the parameter \(\sigma^2\). We will see that an unbiased estimator for \(\sigma^2\) is given by \[\begin{equation} \hat\sigma^2 := \frac{1}{n-p-1} \sum_{i=1}^n (y_i - \hat y_i)^2, \tag{4.7} \end{equation}\] where \(\hat y_i\) are the fitted values from equation (2.5). As for the one-dimensional case in (1.4), the estimate does not have the prefactor \(1/n\), which one might naively expect, but the denominator is decreased by one for each component of the vector \(\beta\). Using Cochran’s theorem, we can now show that the estimator \(\hat\sigma^2\) is unbiased.

Lemma 4.5 We have \[\begin{equation*} \mathbb{E}(\hat\sigma^2) = \sigma^2. \end{equation*}\]

Proof. We first note that \[\begin{align*} (n - p - 1) \hat\sigma^2 &= (y - \hat y)^\top (y - \hat y) \\ &= (y - H y)^\top (y - H y) \\ &= y^\top (I - H)^\top (I - H) y \\ &= y^\top (I - H) y. \end{align*}\] To determine the bias, we need to use \(Y = X\beta + \varepsilon\) in place of the data. This gives \[\begin{align*} (n - p - 1) \hat\sigma^2 &= Y^\top (I-H) Y \\ &= (X\beta + \varepsilon)^\top (I-H) (X\beta + \varepsilon) \\ &= \beta^\top X^\top (I-H) X \beta + 2 \varepsilon^\top (I-H) X \beta + \varepsilon^\top (I-H) \varepsilon\\ &= \varepsilon^\top (I-H) \varepsilon, \end{align*}\] where we used equation (4.4) to see that the first two terms in the sum equal zero.

Now we can apply Cochran’s theorem. This shows that \[\begin{equation} \frac{1}{\sigma^2} (n - p - 1) \hat\sigma^2 = \frac{1}{\sigma^2} \varepsilon^\top (I-H) \varepsilon \sim \chi^2(n - p - 1). \tag{4.8} \end{equation}\] Since the expectation of a \(\chi^2(\nu)\) distribution equals \(\nu\) (see appendix B.5.1), we find \[\begin{equation*} \frac{1}{\sigma^2} (n - p - 1) \mathbb{E}(\hat\sigma^2) = n - p - 1 \end{equation*}\] and thus \[\begin{equation*} \mathbb{E}(\hat\sigma^2) = \sigma^2. \end{equation*}\] This completes the proof.

The lemma shows that \(\hat\sigma^2\) is an unbiased estimator for \(\sigma^2\).

Summary

The least squares estimator for the regression coefficients is unbiased.
The hat matrix is idempotent and symmetric.
Cochran’s theorem allows to understand the distribution of some quadratic forms involving the hat matrix.
\(\hat\sigma^2\) is an unbiased estimator for \(\sigma^2\).