Section 11 Measures for Model Fit
11.1 The Coefficient of Multiple Determination
In equation (4.5) we have seen that \[\begin{equation*} \|y\|^2 = \|\hat y\|^2 + \|\hat\varepsilon\|^2. \end{equation*}\] The following lemma shows that a similar relation also holds after the sample means are subtracted.
Lemma 11.1 Assume that the fitted model includes an intercept. Then we have \[\begin{align*} \sum_{i=1}^n (y_i - \overline{y})^2 &= \sum_{i=1}^n (\hat y_i - \overline{y})^2 + \sum_{i=1}^n \hat \varepsilon_i^2 \\ &= \sum_{i=1}^n \bigl( \hat y_i - \overline{\hat y} \bigr)^2 + \sum_{i=1}^n \bigl( \hat \varepsilon_i - \overline{\hat\varepsilon} \bigr)^2, \end{align*}\] where \(\overline{y}\), \(\overline{\hat y}\) and \(\overline{\hat\varepsilon}\) are the sample means of the three vectors \(y\), \(\hat y\) and \(\hat\varepsilon\).
Proof. Define \(\mathbf{1} = (1, \ldots, 1) \in \mathbb{R}^n\). Since the fitted model includes an intercept, we know from exercise 2 on problem sheet 1 that \(H\mathbf{1} = \mathbf{1}\), and since \(H\) is symmetric we also have \(\mathbf{1}^\top H = \mathbf{1}^\top\). This gives \[\begin{equation} \mathbf{1}^\top (I - H) = \mathbf{1}^\top - \mathbf{1}^\top H = \mathbf{1}^\top - \mathbf{1}^\top = 0. \tag{11.1} \end{equation}\]
As we have already seen previously, we have \[\begin{align*} \hat y^\top \hat\varepsilon &= (H y)^\top (I - H) y \\ &= y^\top H (I - H) y \\ &= 0 \end{align*}\] and using equation (11.1) we also get \[\begin{equation} \mathbf{1}^\top \hat\varepsilon = \mathbf{1}^\top (I - H) y = 0. \end{equation}\] Thus, \((\hat y - \overline{y} \mathbf{1})^\top \hat\varepsilon= 0\) and using Pythagoras’ theorem as in the section about Properties of the Hat Matrix, we find \[\begin{align*} \| y - \overline{y} \mathbf{1} \|^2 &= \| y - \hat y + \hat y - \overline{y} \mathbf{1} \|^2 \\ &= \| \hat \varepsilon+ \hat y - \overline{y} \mathbf{1} \|^2 \\ &= \| \hat \varepsilon\|^2 + \| \hat y - \overline{y} \mathbf{1} \|^2. \end{align*}\] This proves the first equality.
Using equation (11.1) again, we find \[\begin{equation*} \sum_{i=1}^n \hat\varepsilon_i = \mathbf{1}^\top \hat\varepsilon = \mathbf{1}^\top (y - \hat y) = \mathbf{1}^\top (I - H) y = 0 \end{equation*}\] and thus \(\overline{\hat\varepsilon} = 0\). Since \(y = \hat y + (y - \hat y) = \hat y + \hat\varepsilon\) we also have \[\begin{equation*} \overline{y} = \overline{\hat y} + \overline{\hat\varepsilon} = \overline{\hat y}. \end{equation*}\] This completes the proof.
We can express the terms in the statement of lemma 11.1 as sample variances. For example, we write \[\begin{equation*} \mathrm{s}_y^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i - \bar y)^2 \end{equation*}\] for the sample variance of \(y\). Using this notation, the statement of the lemma can be written as \[\begin{equation*} \mathrm{s}_y^2 = \mathrm{s}_{\hat y}^2 + \mathrm{s}_{\hat\varepsilon}^2. \end{equation*}\] In contrast to equation (4.5), this relation only holds for models which include an intercept.
Some authors define
- \(\mathrm{SS}_\mathrm{tot} = \sum_{i=1}^n (y_i - \bar y)^2\) (where “tot” stands for “total”)
- \(\mathrm{SS}_\mathrm{reg} = \sum_{i=1}^n (\hat y_i - \overline{\hat y})^2\) (where “reg” stands for “regression”)
- \(\mathrm{SS}_\mathrm{res} = \sum_{i=1}^n (y_i-\hat y_i)^2\) (where “res” stands for “residual”)
Using this notation, the statement of lemma 11.1 becomes \[\begin{equation*} \mathrm{SS}_\mathrm{tot} = \mathrm{SS}_\mathrm{reg} + \mathrm{SS}_\mathrm{res}. \end{equation*}\]
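As a quick numerical check, the following sketch verifies this decomposition for the stackloss data (the model `m` is the same one fitted in example 11.1 below):

```r
m <- lm(stack.loss ~ ., data = stackloss)   # linear model with intercept
y <- stackloss$stack.loss

SS.tot <- sum((y - mean(y))^2)
SS.reg <- sum((fitted(m) - mean(fitted(m)))^2)
SS.res <- sum(resid(m)^2)

SS.tot - (SS.reg + SS.res)   # zero, up to rounding error
```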
Definition 11.1 The coefficient of multiple determination is defined as \[\begin{equation*} R^2 := 1 - \frac{\mathrm{s}_{\hat\varepsilon}^2}{\mathrm{s}_y^2} = 1 - \frac{\sum_{i=1}^n \hat\varepsilon_i^2}{\sum_{i=1}^n (y_i - \bar y)^2}. \end{equation*}\]
Using the result of lemma 11.1, for models which include an intercept we can also express \(R^2\) as \[\begin{equation*} R^2 = \mathrm{s}_{\hat y}^2 / \mathrm{s}_y^2. \end{equation*}\] Since the numerator is non-negative and the denominator is positive, it is clear that \(R^2 \geq 0\). Similarly, from the definition it is clear that \(R^2 \leq 1\), so we always have \(R^2 \in [0, 1]\). A large value of \(R^2\) indicates that the residuals are small compared to the sample variance of the \(y_i\), and thus often indicates good model fit.
Example 11.1 We can easily compute the \(R^2\) value for the stackloss dataset manually:

```r
m <- lm(stack.loss ~ ., data = stackloss)
R.squared <- 1 - var(resid(m)) / var(stackloss$stack.loss)
R.squared
```

```
[1] 0.9135769
```
We can also find this value near the bottom of the output of `summary(m)`, listed as `Multiple R-squared`:

```
Call:
lm(formula = stack.loss ~ ., data = stackloss)

Residuals:
    Min      1Q  Median      3Q     Max
-7.2377 -1.7117 -0.4551  2.3614  5.6978

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -39.9197    11.8960  -3.356  0.00375 **
Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
Water.Temp    1.2953     0.3680   3.520  0.00263 **
Acid.Conc.   -0.1521     0.1563  -0.973  0.34405
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared:  0.9136,	Adjusted R-squared:  0.8983
F-statistic: 59.9 on 3 and 17 DF,  p-value: 3.016e-09
```
One problem with the \(R^2\) value is that it never decreases (and typically increases) when another input variable is added to the model. Thus, the \(R^2\) value cannot be used to compare model fit for models with different numbers of variables. An attempt to compensate for this effect is the adjusted \(R^2\) value:
Definition 11.2 The adjusted \(R^2\) value is given by \[\begin{equation*} R^2_\mathrm{adj} := 1 - \frac{\frac{1}{n-p-1}\mathrm{s}_{\hat\varepsilon}^2}{\frac{1}{n-1}\mathrm{s}_y^2} = 1 - \frac{n-1}{n-p-1}(1 - R^2). \end{equation*}\]
The value \(R^2_\mathrm{adj}\) is always smaller than the \(R^2\) value, and \(R^2_\mathrm{adj}\) can be (slightly) negative.
Example 11.2 The adjusted \(R^2\) value for the stackloss dataset can be found as follows:
```r
n <- nrow(stackloss)
p <- ncol(stackloss) - 1
R.squared.adj <- 1 - (n - 1) / (n - p - 1) * (1 - R.squared)
R.squared.adj
```

```
[1] 0.8983258
```

We can also find this value near the bottom of the output of `summary(m)`, listed as `Adjusted R-squared`.
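To see the difference between the two measures, we can add a predictor which carries no information about the response. The following sketch (the `noise` variable is our own construction, and the exact values depend on the random seed) illustrates that \(R^2\) never decreases when a variable is added, while \(R^2_\mathrm{adj}\) typically decreases for a useless variable:

```r
set.seed(1)                        # for reproducibility
noise <- rnorm(nrow(stackloss))    # a predictor unrelated to the response
m.noise <- lm(stack.loss ~ . + noise, data = stackloss)

summary(m)$r.squared               # R^2 of the original model
summary(m.noise)$r.squared         # never smaller, despite the useless predictor
summary(m.noise)$adj.r.squared     # the adjusted R^2 usually goes down instead
```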
Example 11.3 As suggested in example 9.7, we may want to add a quadratic term in Water.Temp to the model for the stackloss data.
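The extended model can be fitted as follows (the object name `m.quad` is our own choice):

```r
m.quad <- lm(stack.loss ~ . + I(Water.Temp^2), data = stackloss)
summary(m.quad)
```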
```
Call:
lm(formula = stack.loss ~ . + I(Water.Temp^2), data = stackloss)

Residuals:
    Min      1Q  Median      3Q     Max
-4.1217 -1.5205 -0.3091  0.9753  6.0554

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      57.73727   46.08487   1.253  0.22826
Air.Flow          0.53683    0.14708   3.650  0.00216 **
Water.Temp       -7.62030    4.10441  -1.857  0.08188 .
Acid.Conc.       -0.09697    0.14371  -0.675  0.50946
I(Water.Temp^2)   0.21224    0.09738   2.179  0.04459 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.936 on 16 degrees of freedom
Multiple R-squared:  0.9334,	Adjusted R-squared:  0.9167
F-statistic: 56.02 on 4 and 16 DF,  p-value: 3.293e-09
```
As expected, the \(R^2\) value has increased here (from 0.9136 to 0.9334). Since the adjusted \(R^2\) value has also increased (from 0.8983 to 0.9167), the extended model may be better than the original model.
11.2 Error Variance
The estimated error variance, \[\begin{equation*} \hat\sigma^2 = \frac{1}{n - q} \sum_{i=1}^n (y_i - \hat y_i)^2, \end{equation*}\] can also be used as a measure of model fit, where \(q\) is the number of columns of the design matrix. (For the full model with an intercept we have \(q = p+1\).) “Good” models have small \(\hat\sigma^2\). We have \[\begin{align*} R^2_\mathrm{adj} &= 1 - \frac{\frac{1}{n-q}\mathrm{s}_{\hat\varepsilon}^2}{\frac{1}{n-1}\mathrm{s}_y^2} \\ &= 1 - \frac{\frac{n-1}{n-q}\mathrm{s}_{\hat\varepsilon}^2}{\mathrm{s}_y^2} \\ &= 1 - \frac{\frac{1}{n-q} \sum_{i=1}^n (y_i - \hat y_i)^2}{\mathrm{s}_y^2} \\ &= 1 - \frac{\hat\sigma^2}{\mathrm{s}_y^2}, \end{align*}\] where \(\mathrm{s}_{\hat\varepsilon}^2\) and \(\mathrm{s}_y^2\) are the sample variances as before. Thus, minimising \(\hat\sigma^2\) is equivalent to maximising \(R^2_\mathrm{adj}\).
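In R, \(\hat\sigma\) is the “residual standard error” shown in the `summary()` output, and for a fitted `lm` object it can be extracted with `sigma()`. As a quick check of the relation above (a sketch, reusing the stackloss model `m` fitted earlier):

```r
sigma.hat.sq <- sigma(m)^2                    # estimate of the error variance
sigma.hat.sq

# 1 - sigma.hat^2 / s_y^2 reproduces the adjusted R^2
1 - sigma.hat.sq / var(stackloss$stack.loss)
summary(m)$adj.r.squared
```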
11.3 Prediction Error Sum of Squares
The Prediction Error Sum of Squares (PRESS) compares each observed value \(y_i\) to the fitted value \(\hat y^{(i)}_i\) for observation \(i\), computed from the model fitted without observation \(i\): \[\begin{equation*} \mathrm{PRESS} := \sum_{i=1}^n \bigl( y_i - \hat y^{(i)}_i \bigr)^2. \end{equation*}\] In the comparison, \(\hat y^{(i)}_i\) is used instead of \(\hat y_i\) so that overfitting in models with too many parameters does not lead to over-optimistic estimates. “Good” models have small PRESS. Computing PRESS directly requires fitting \(n\) separate models, one for each left-out observation, as in the sketch below; lemma 11.2 then shows how PRESS can be obtained from a single fit.
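A direct computation for the stackloss model (a sketch; slow for large \(n\), since the model is refitted \(n\) times):

```r
n <- nrow(stackloss)
y.hat.loo <- numeric(n)
for (i in 1:n) {
  m.i <- lm(stack.loss ~ ., data = stackloss[-i, ])       # fit without observation i
  y.hat.loo[i] <- predict(m.i, newdata = stackloss[i, ])  # predict observation i
}
PRESS <- sum((stackloss$stack.loss - y.hat.loo)^2)
PRESS
```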
Lemma 11.2 We have \[\begin{equation*} \mathrm{PRESS} = \sum_{i=1}^n \Bigl( \frac{\hat\varepsilon_i}{1 - h_{ii}} \Bigr)^2, \end{equation*}\] where \(h_{ii}\) denotes the diagonal elements of the hat matrix \(H\).
Proof. From equation (8.3) we know how the estimated regression coefficients change when observation \(i\) is deleted from the model: \[\begin{equation*} \hat\beta^{(i)} - \hat\beta = - (X^\top X)^{-1} x_i \frac{\hat\varepsilon_i}{1 - h_{ii}}. \end{equation*}\] Thus we have \[\begin{align*} \hat y^{(i)}_i - \hat y_i &= \bigl( X \hat\beta^{(i)} - X \hat\beta \bigr)_i \\ &= - \bigl( X (X^\top X)^{-1} x_i \frac{\hat\varepsilon_i}{1 - h_{ii}} \bigr)_i \\ &= - x_i^\top (X^\top X)^{-1} x_i \frac{\hat\varepsilon_i}{1 - h_{ii}} \\ &= - \frac{h_{ii}}{1 - h_{ii}} \hat\varepsilon_i. \end{align*}\] From this we get \[\begin{align*} y_i - \hat y^{(i)}_i &= y_i - \hat y_i + \hat y_i - \hat y^{(i)}_i \\ &= \hat\varepsilon_i + \frac{h_{ii}}{1 - h_{ii}} \hat\varepsilon_i \\ &= \frac{1}{1 - h_{ii}} \hat\varepsilon_i. \end{align*}\] Substituting this expression into the definition of PRESS completes the proof.
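Using lemma 11.2, PRESS can be computed from a single model fit. In R, the residuals and the diagonal entries \(h_{ii}\) of the hat matrix are available via `resid()` and `hatvalues()`; for the stackloss model this reproduces the leave-one-out computation above:

```r
PRESS.fast <- sum((resid(m) / (1 - hatvalues(m)))^2)
PRESS.fast   # agrees with the leave-one-out loop above
```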
11.4 Akaike’s Information Criterion
Akaike’s Information Criterion (AIC) is defined as \[\begin{equation*} \mathrm{AIC} = 2q - 2 \log(\hat L), \end{equation*}\] where \(q\) is the number of parameters in the model, and \(\hat L\) is the maximum of the likelihood function when these \(q\) parameters are chosen optimally. “Good” models have small AIC.
Since we assume that the \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\) are i.i.d. and since \(\varepsilon_i = y_i - x_i^\top\beta\), the likelihood function for our model is \[\begin{align*} L(\beta, \sigma^2; X, y) &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Bigl( - \frac{(y_i - (X\beta)_i)^2}{2\sigma^2} \Bigr) \\ &= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\Bigl( - \sum_{i=1}^n \frac{(y_i - (X\beta)_i)^2}{2\sigma^2} \Bigr). \end{align*}\] Taking logarithms we get \[\begin{align*} \log L(\beta, \sigma^2; X, y) &= -\frac{n}{2} \log(2\pi\sigma^2) - \sum_{i=1}^n \frac{(y_i - (X\beta)_i)^2}{2\sigma^2} \\ &= -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} r(\beta), \end{align*}\] where \(r(\beta)\) is the residual sum of squares, as introduced in the section about Least Squares Estimates. To compute the AIC we need to maximise this expression. The values of \(\beta\) and \(\sigma^2\) where the maximum is attained are called the maximum likelihood estimates (MLE) for \(\beta\) and \(\sigma^2\). From lemma 2.1 we know that for fixed \(\sigma^2\), \(L\) takes its maximum when \(\beta\) equals the least squares estimate \(\hat\beta\). Taking derivatives with respect to \(\sigma^2\) we can then find the optimal value of \(\sigma^2\). The result is \[\begin{equation*} \hat\sigma^2_\mathrm{MLE} = \frac1n r(\hat\beta) = \frac{1}{n} \sum_{i=1}^n \bigl( y_i - (X \hat\beta)_i \bigr)^2 = \frac{n-q}{n} \hat\sigma^2. \end{equation*}\] Thus we get \[\begin{align*} \mathrm{AIC} &= 2q + n \log(2\pi\hat\sigma^2_\mathrm{MLE}) + \frac{1}{\hat\sigma^2_\mathrm{MLE}} r(\hat\beta) \\ &= 2q + n \log\bigl( 2\pi r(\hat\beta) / n \bigr) + n \\ &= 2q + n \log\bigl( \|\hat\varepsilon\|^2 \bigr) + n + n \log( 2\pi / n ). \end{align*}\]
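We can evaluate this formula for the stackloss model `m` fitted above and compare it to R's built-in `AIC()`. Note that `AIC()` also counts the variance \(\sigma^2\) as an estimated parameter, so with the convention used here its value is larger by \(2\). A sketch:

```r
q <- length(coef(m))          # number of columns of the design matrix
n <- nrow(stackloss)
rss <- sum(resid(m)^2)        # residual sum of squares

AIC.manual <- 2 * q + n * log(rss) + n + n * log(2 * pi / n)
AIC.manual
AIC(m)                        # equals AIC.manual + 2 (sigma^2 counted as a parameter)
```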
Summary
- The \(R^2\) value is a numerical measure of the goodness of fit of a model.
- If models with different numbers of variables are compared, \(R^2_\mathrm{adj}\) should be used instead of \(R^2\).
- Other measures for the quality of model fit include PRESS and AIC.