# Section 11: Measures for Model Fit

## 11.1 The Coefficient of Multiple Determination

In equation (4.5) we have seen that $\begin{equation*} \|y\|^2 = \|\hat y\|^2 + \|\hat\varepsilon\|^2. \end{equation*}$ The following lemma shows that a similar relation also holds after the sample means are subtracted.

Lemma 11.1 Assume that the fitted model includes an intercept. Then we have \begin{align*} \sum_{i=1}^n (y_i - \overline{y})^2 &= \sum_{i=1}^n (\hat y_i - \overline{y})^2 + \sum_{i=1}^n \hat \varepsilon_i^2 \\ &= \sum_{i=1}^n \bigl( \hat y_i - \overline{\hat y} \bigr)^2 + \sum_{i=1}^n \bigl( \hat \varepsilon_i - \overline{\hat\varepsilon} \bigr)^2, \end{align*} where $$\overline{y}$$, $$\overline{\hat y}$$ and $$\overline{\hat\varepsilon}$$ are the sample means of the three vectors $$y$$, $$\hat y$$ and $$\hat\varepsilon$$.

Proof. Define $$\mathbf{1} = (1, \ldots, 1) \in \mathbb{R}^n$$. From exercise 2 on problem sheet 1 we know that $$H\mathbf{1} = \mathbf{1}$$ and since $$H$$ is symmetric we also have $$\mathbf{1}^\top H = \mathbf{1}^\top$$. This gives $\begin{equation*} \mathbf{1}^\top (I - H) = \mathbf{1}^\top - \mathbf{1}^\top H = \mathbf{1}^\top - \mathbf{1}^\top = 0. \tag{11.1} \end{equation*}$

As we have already seen previously, we have \begin{align*} \hat y^\top \hat\varepsilon &= (H y)^\top (I - H) y \\ &= y^\top H (I - H) y \\ &= 0 \end{align*} and using equation (11.1) we also get $\begin{equation*} \mathbf{1}^\top \hat\varepsilon = \mathbf{1}^\top (I - H) y = 0. \end{equation*}$ Thus, $$(\hat y - \overline{y} \mathbf{1})^\top \hat\varepsilon = 0$$ and using Pythagoras’ theorem as in the section about Properties of the Hat Matrix, we find \begin{align*} \| y - \overline{y} \mathbf{1} \|^2 &= \| y - \hat y + \hat y - \overline{y} \mathbf{1} \|^2 \\ &= \| \hat\varepsilon + \hat y - \overline{y} \mathbf{1} \|^2 \\ &= \| \hat\varepsilon \|^2 + \| \hat y - \overline{y} \mathbf{1} \|^2. \end{align*} This proves the first equality.

Using equation (11.1) again, we find $\begin{equation*} \sum_{i=1}^n \hat\varepsilon_i = \mathbf{1}^\top \hat\varepsilon = \mathbf{1}^\top (y - \hat y) = \mathbf{1}^\top (I - H) y = 0 \end{equation*}$ and thus $$\overline{\hat\varepsilon} = 0$$. Since $$y = \hat y + (y - \hat y) = \hat y + \hat\varepsilon$$ we also have $\begin{equation*} \overline{y} = \overline{\hat y} + \overline{\hat\varepsilon} = \overline{\hat y}. \end{equation*}$ This completes the proof.

We can express the terms in the statement of lemma 11.1 as sample variances. For example, we write $\begin{equation*} \mathrm{s}_y^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i - \bar y)^2 \end{equation*}$ for the sample variance of $$y$$. Using this notation, the statement of the lemma can be written as $\begin{equation*} \mathrm{s}_y^2 = \mathrm{s}_{\hat y}^2 + \mathrm{s}_{\hat\varepsilon}^2. \end{equation*}$ In contrast to equation (4.5), this relation only holds for models which include an intercept.

Some authors define

• $$\mathrm{SS}_\mathrm{tot} = \sum_{i=1}^n (y_i - \bar y)^2$$ (where “tot” stands for “total”)
• $$\mathrm{SS}_\mathrm{reg} = \sum_{i=1}^n (\hat y_i - \overline{\hat y})^2$$ (where “reg” stands for “regression”)
• $$\mathrm{SS}_\mathrm{res} = \sum_{i=1}^n (y_i-\hat y_i)^2$$ (where “res” stands for “residual”)

Using this notation, the statement of lemma 11.1 becomes $\begin{equation*} \mathrm{SS}_\mathrm{tot} = \mathrm{SS}_\mathrm{reg} + \mathrm{SS}_\mathrm{res}. \end{equation*}$
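This decomposition can be checked numerically in R. As a quick sketch, we fit an intercept model to the built-in stackloss dataset (the same data used in the examples below) and compare the two sides:

```r
# verify SS_tot = SS_reg + SS_res for a model with an intercept
m <- lm(stack.loss ~ ., data = stackloss)

SS.tot <- sum((stackloss$stack.loss - mean(stackloss$stack.loss))^2)
SS.reg <- sum((fitted(m) - mean(fitted(m)))^2)
SS.res <- sum(resid(m)^2)

all.equal(SS.tot, SS.reg + SS.res)  # TRUE, up to floating-point error
```

For a model without an intercept the two sides generally differ, in line with the assumption of lemma 11.1.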

Definition 11.1 The coefficient of multiple determination is defined as $\begin{equation*} R^2 := 1 - \frac{\mathrm{s}_{\hat\varepsilon}^2}{\mathrm{s}_y^2} = 1 - \frac{\sum_{i=1}^n \hat\varepsilon_i^2}{\sum_{i=1}^n (y_i - \bar y)^2}. \end{equation*}$

Using the result of lemma 11.1, for models which include an intercept we can also express $$R^2$$ as $\begin{equation*} R^2 = \mathrm{s}_{\hat y}^2 / \mathrm{s}_y^2. \end{equation*}$ Since both numerator and denominator are non-negative, it is clear that $$R^2 \geq 0$$. Similarly, from the definition it is clear that $$R^2 \leq 1$$, so that we always have $$R^2 \in [0, 1]$$. A large value of $$R^2$$ indicates that the residuals are small compared to the sample variance of the $$y_i$$, and thus often indicates good model fit.

Example 11.1 We can easily compute the $$R^2$$ value for the stackloss dataset manually:

m <- lm(stack.loss ~ ., data = stackloss)
R.squared <- 1 - var(resid(m)) / var(stackloss$stack.loss)
R.squared
[1] 0.9135769

We can also find this value near the bottom of the output of summary(m), listed as Multiple R-squared:

summary(m)

Call:
lm(formula = stack.loss ~ ., data = stackloss)

Residuals:
Min      1Q  Median      3Q     Max
-7.2377 -1.7117 -0.4551  2.3614  5.6978

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -39.9197    11.8960  -3.356  0.00375 **
Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
Water.Temp    1.2953     0.3680   3.520  0.00263 **
Acid.Conc.   -0.1521     0.1563  -0.973  0.34405
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared:  0.9136,    Adjusted R-squared:  0.8983
F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09

One problem with the $$R^2$$ value is that it can never decrease when another input variable is added to the model. Thus, the $$R^2$$ value cannot be used to compare model fit between models with different numbers of variables. The adjusted $$R^2$$ value attempts to compensate for this effect:

Definition 11.2 The adjusted $$R^2$$ value is given by $\begin{equation*} R^2_\mathrm{adj} := 1 - \frac{\frac{1}{n-p-1}\mathrm{s}_{\hat\varepsilon}^2}{\frac{1}{n-1}\mathrm{s}_y^2} = 1 - \frac{n-1}{n-p-1}(1 - R^2). \end{equation*}$

The value $$R^2_\mathrm{adj}$$ is always smaller than the $$R^2$$ value, and $$R^2_\mathrm{adj}$$ can be (slightly) negative.

Example 11.2 The adjusted $$R^2$$ value for the stackloss dataset can be found as follows:

n <- nrow(stackloss)
p <- ncol(stackloss) - 1
R.squared.adj <- 1 - (n - 1) / (n - p - 1) * (1 - R.squared)
R.squared.adj
[1] 0.8983258

We can also find this value near the bottom of the output of summary(m), listed as Adjusted R-squared.

Example 11.3 As suggested in example 9.7, we may want to include an additional quadratic term in the model for the stackloss dataset:

m2 <- lm(stack.loss ~ . + I(Water.Temp^2), data = stackloss)
summary(m2)

Call:
lm(formula = stack.loss ~ . + I(Water.Temp^2), data = stackloss)

Residuals:
Min      1Q  Median      3Q     Max
-4.1217 -1.5205 -0.3091  0.9753  6.0554

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)     57.73727   46.08487   1.253  0.22826
Air.Flow         0.53683    0.14708   3.650  0.00216 **
Water.Temp      -7.62030    4.10441  -1.857  0.08188 .
Acid.Conc.      -0.09697    0.14371  -0.675  0.50946
I(Water.Temp^2)  0.21224    0.09738   2.179  0.04459 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.936 on 16 degrees of freedom
Multiple R-squared:  0.9334,    Adjusted R-squared:  0.9167
F-statistic: 56.02 on 4 and 16 DF,  p-value: 3.293e-09

As expected, the $$R^2$$ value has increased here (from 0.9136 to 0.9334). Since the adjusted $$R^2$$ value has also increased (from 0.8983 to 0.9167), the extended model may be better than the original model.

## 11.2 Error Variance

The estimate of the error variance, $\begin{equation*} \hat\sigma^2 = \frac{1}{n - q} \sum_{i=1}^n (y_i - \hat y_i)^2, \end{equation*}$ can also be used as a measure of model fit, where $$q$$ is the number of columns of the design matrix. (For the full model with an intercept we have $$q = p+1$$.) “Good” models have small $$\hat\sigma^2$$. We have \begin{align*} R^2_\mathrm{adj} &= 1 - \frac{\frac{1}{n-q}\mathrm{s}_{\hat\varepsilon}^2}{\frac{1}{n-1}\mathrm{s}_y^2} \\ &= 1 - \frac{\frac{n-1}{n-q}\mathrm{s}_{\hat\varepsilon}^2}{\mathrm{s}_y^2} \\ &= 1 - \frac{\frac{1}{n-q} \sum_{i=1}^n (y_i - \hat y_i)^2}{\mathrm{s}_y^2} \\ &= 1 - \frac{\hat\sigma^2}{\mathrm{s}_y^2}, \end{align*} where $$\mathrm{s}_{\hat\varepsilon}^2$$ and $$\mathrm{s}_y^2$$ are the sample variances as before. Thus, minimising $$\hat\sigma^2$$ is equivalent to maximising $$R^2_\mathrm{adj}$$.
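As a quick sketch in R, the identity $$R^2_\mathrm{adj} = 1 - \hat\sigma^2 / \mathrm{s}_y^2$$ can be verified for the stackloss model from example 11.1, using sigma() to obtain the residual standard error $$\hat\sigma$$:

```r
m <- lm(stack.loss ~ ., data = stackloss)

# 1 - sigma^2 / s_y^2 reproduces the adjusted R-squared from summary(m)
1 - sigma(m)^2 / var(stackloss$stack.loss)  # 0.8983258
summary(m)$adj.r.squared                    # same value
```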

## 11.3 Prediction Error Sum of Squares

The Prediction Error Sum of Squares (PRESS) compares each output $$y_i$$ to the fitted value $$\hat y^{(i)}_i$$ for observation $$i$$, computed without using sample $$i$$: $\begin{equation*} \mathrm{PRESS} := \sum_{i=1}^n \bigl( y_i - \hat y^{(i)}_i \bigr)^2. \end{equation*}$ In the comparison, $$\hat y^{(i)}_i$$ is used instead of $$\hat y_i$$ so that overfitting in models with too many parameters does not lead to over-optimistic estimates. “Good” models have small PRESS. The following lemma helps to compute PRESS without having to fit $$n$$ separate models.

Lemma 11.2 We have $\begin{equation*} \mathrm{PRESS} = \sum_{i=1}^n \Bigl( \frac{\hat\varepsilon_i}{1 - h_{ii}} \Bigr)^2, \end{equation*}$ where $$h_{ii}$$ denotes the diagonal elements of the hat matrix $$H$$.

Proof. From equation (8.3) we know how the estimated regression coefficients change when observation $$i$$ is deleted from the model: $\begin{equation*} \hat\beta^{(i)} - \hat\beta = - (X^\top X)^{-1} x_i \frac{\hat\varepsilon_i}{1 - h_{ii}}. \end{equation*}$ Thus we have \begin{align*} \hat y^{(i)}_i - \hat y_i &= \bigl( X \hat\beta^{(i)} - X \hat\beta \bigr)_i \\ &= - \bigl( X (X^\top X)^{-1} x_i \frac{\hat\varepsilon_i}{1 - h_{ii}} \bigr)_i \\ &= - x_i^\top (X^\top X)^{-1} x_i \frac{\hat\varepsilon_i}{1 - h_{ii}} \\ &= - \frac{h_{ii}}{1 - h_{ii}} \hat\varepsilon_i. \end{align*} From this we get \begin{align*} y_i - \hat y^{(i)}_i &= y_i - \hat y_i + \hat y_i - \hat y^{(i)}_i \\ &= \hat\varepsilon_i + \frac{h_{ii}}{1 - h_{ii}} \hat\varepsilon_i \\ &= \frac{1}{1 - h_{ii}} \hat\varepsilon_i. \end{align*} Substituting this expression into the definition of PRESS completes the proof.
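The lemma makes PRESS cheap to compute in R, since the $$h_{ii}$$ are available via hatvalues(). As a sketch for the stackloss data, we can check the formula against the naive leave-one-out computation, which refits the model $$n$$ times:

```r
m <- lm(stack.loss ~ ., data = stackloss)

# PRESS via lemma 11.2: only residuals and hat-matrix diagonals needed
PRESS <- sum((resid(m) / (1 - hatvalues(m)))^2)

# naive computation: refit the model n times, each time leaving out one row
press.loo <- sum(sapply(seq_len(nrow(stackloss)), function(i) {
  mi <- lm(stack.loss ~ ., data = stackloss[-i, ])
  (stackloss$stack.loss[i] - predict(mi, newdata = stackloss[i, ]))^2
}))

all.equal(PRESS, press.loo)  # TRUE
```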

## 11.4 Akaike’s Information Criterion

Akaike’s Information Criterion (AIC) is defined as $\begin{equation*} \mathrm{AIC} = 2q - 2 \log(\hat L), \end{equation*}$ where $$q$$ is the number of parameters in the model, and $$\hat L$$ is the maximum of the likelihood function when these $$q$$ parameters are chosen optimally. “Good” models have small AIC.

Since we assume that the $$\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$$ are i.i.d. and that $$\varepsilon_i = y_i - x_i^\top\beta$$, the likelihood function for our model is \begin{align*} L(\beta, \sigma^2; X, y) &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Bigl( - \frac{(y_i - (X\beta)_i)^2}{2\sigma^2} \Bigr) \\ &= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\Bigl( - \sum_{i=1}^n \frac{(y_i - (X\beta)_i)^2}{2\sigma^2} \Bigr). \end{align*} Taking logarithms we get \begin{align*} \log L(\beta, \sigma^2; X, y) &= -\frac{n}{2} \log(2\pi\sigma^2) - \sum_{i=1}^n \frac{(y_i - (X\beta)_i)^2}{2\sigma^2} \\ &= -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} r(\beta), \end{align*} where $$r(\beta)$$ is the residual sum of squares, as introduced in the section about Least Squares Estimates. To compute the AIC we need to maximise this expression. The values of $$\beta$$ and $$\sigma^2$$ at which the maximum is attained are called the maximum likelihood estimates (MLE) of $$\beta$$ and $$\sigma^2$$. From lemma 2.1 we know that for fixed $$\sigma^2$$, $$L$$ attains its maximum when $$\beta$$ equals the least squares estimate $$\hat\beta$$. Taking derivatives we can then find the optimal value of $$\sigma^2$$. The result is $\begin{equation*} \hat\sigma^2_\mathrm{MLE} = \frac1n r(\hat\beta) = \frac{1}{n} \sum_{i=1}^n \bigl( y_i - (X \hat\beta)_i \bigr)^2 = \frac{n-q}{n} \hat\sigma^2. \end{equation*}$ Thus we get \begin{align*} \mathrm{AIC} &= 2q + n \log(2\pi\hat\sigma^2_\mathrm{MLE}) + \frac{1}{\hat\sigma^2_\mathrm{MLE}} r(\hat\beta) \\ &= 2q + n \log\bigl( 2\pi r(\hat\beta) / n \bigr) + n \\ &= 2q + n \log\bigl( \|\hat\varepsilon\|^2 \bigr) + n + n \log( 2\pi / n ). \end{align*}
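As a sketch, this formula can be compared with R's built-in AIC() for the stackloss model. One caveat: AIC() (via logLik()) counts $$\sigma^2$$ as an additional estimated parameter, so its value exceeds the formula above, which uses only the $$q$$ regression coefficients, by exactly 2:

```r
m <- lm(stack.loss ~ ., data = stackloss)
n <- nrow(stackloss)
q <- length(coef(m))       # number of columns of the design matrix
rss <- sum(resid(m)^2)     # residual sum of squares r(beta-hat)

aic.manual <- 2 * q + n * log(rss) + n + n * log(2 * pi / n)
AIC(m) - aic.manual  # 2, since AIC() also counts sigma^2 as a parameter
```

Since the constant offset is the same for all models fitted to the same data, both versions lead to the same model ranking.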

## Summary

• The $$R^2$$ value is a numerical measure of the goodness of fit of a model.
• If models with different numbers of variables are compared, $$R^2_\mathrm{adj}$$ should be used instead of $$R^2$$.
• Other measures for the quality of model fit include PRESS and AIC.