Section 10 Measures for Model Fit

10.1 The Coefficient of Multiple Determination

A widely used measure of model fit is the coefficient of multiple determination, commonly denoted \(R^2\). This measure is not specific to linear regression, but can be used for any regression model which produces fitted values \(\hat y_i\) and residuals \(\hat\varepsilon_i\).

Definition 10.1 The coefficient of multiple determination is defined as \[\begin{equation*} R^2 := 1 - \frac{\mathrm{s}_{\hat\varepsilon}^2}{\mathrm{s}_y^2} = 1 - \frac{\frac{1}{n-1} \sum_{i=1}^n (\hat\varepsilon_i - \overline{\hat\varepsilon})^2} {\frac{1}{n-1} \sum_{i=1}^n (y_i - \bar y)^2}. \end{equation*}\]

The coefficient of multiple determination provides a numerical measure of how well the model fits the data. The numerator \(\mathrm{s}_{\hat\varepsilon}^2\) is the sample variance of the residuals, while the denominator \(\mathrm{s}_y^2\) is the sample variance of the response variable \(y\). Thus, \(R^2\) measures the proportion of total variance in \(y\) which is explained by the regression model. Large values of \(R^2\) indicate good model fit.

We will use the following lemma to simplify the formula for \(R^2\) for the special case of linear regression models.

Lemma 10.1 For multiple linear regression models with an intercept term, the following relations hold:

  1. We have \[\begin{equation*} \sum_{i=1}^n \hat\varepsilon_i = 0, \end{equation*}\] and thus \(\overline{\hat\varepsilon} = 0\).

  2. We have \[\begin{equation*} \sum_{i=1}^n (y_i - \overline{y})^2 = \sum_{i=1}^n \bigl( \hat y_i - \overline{\hat y} \bigr)^2 + \sum_{i=1}^n \bigl( \hat \varepsilon_i - \overline{\hat\varepsilon} \bigr)^2. \end{equation*}\]

Proof. Define \(\mathbf{1} = (1, \ldots, 1) \in \mathbb{R}^n\). From exercise 2 on problem sheet 1 we know that \(H\mathbf{1} = \mathbf{1}\) and since \(H\) is symmetric we also have \(\mathbf{1}^\top H = \mathbf{1}^\top\). This gives \[\begin{equation} \mathbf{1}^\top (I - H) = \mathbf{1}^\top - \mathbf{1}^\top H = \mathbf{1}^\top - \mathbf{1}^\top = 0. \tag{10.1} \end{equation}\]

We first show that the residuals sum to zero. Using equation (10.1) we have \[\begin{equation*} \sum_{i=1}^n \hat\varepsilon_i = \mathbf{1}^\top \hat\varepsilon = \mathbf{1}^\top (I - H) y = 0, \end{equation*}\] which proves part 1 of the lemma. Thus we have \(\overline{\hat\varepsilon} = 0\). Since \(y = \hat y + \hat\varepsilon\), taking sample means we also find \[\begin{equation} \overline{y} = \overline{\hat y} + \overline{\hat\varepsilon} = \overline{\hat y}. \tag{10.2} \end{equation}\]

For part 2, we note that as shown previously, the fitted values and residuals are orthogonal: \[\begin{equation*} \hat y^\top \hat\varepsilon = (H y)^\top (I - H) y = y^\top H (I - H) y = 0. \end{equation*}\] Combining this with \(\mathbf{1}^\top \hat\varepsilon= 0\), we find that \((\hat y - \overline{y} \mathbf{1})^\top \hat\varepsilon= 0\). Using Pythagoras’ theorem as in section 4.2, we obtain \[\begin{align*} \| y - \overline{y} \mathbf{1} \|^2 &= \| (y - \hat y) + (\hat y - \overline{y} \mathbf{1}) \|^2 \\ &= \| \hat \varepsilon\|^2 + \| \hat y - \overline{y} \mathbf{1} \|^2, \end{align*}\] which proves the first equality in part 2. The second equality follows by noting that \(\overline{\hat\varepsilon} = 0\) and using equation (10.2).

Note that in equation (4.5) we have seen that \[\begin{equation*} \|y\|^2 = \|\hat y\|^2 + \|\hat\varepsilon\|^2. \end{equation*}\] The lemma shows that a similar relation also holds after the sample means are subtracted.

We can express the terms in the statement of lemma 10.1 as sample variances. For example, we write \[\begin{equation*} \mathrm{s}_y^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i - \bar y)^2 \end{equation*}\] for the sample variance of \(y\). Using this notation, the statement of the lemma can be written as \[\begin{equation*} \mathrm{s}_y^2 = \mathrm{s}_{\hat y}^2 + \mathrm{s}_{\hat\varepsilon}^2. \end{equation*}\] Unlike equation (4.5), this relation is only true for models which include an intercept.

Some authors define

  • \(\mathrm{SS}_\mathrm{tot} = \sum_{i=1}^n (y_i - \bar y)^2\) (where “tot” stands for “total”)
  • \(\mathrm{SS}_\mathrm{reg} = \sum_{i=1}^n (\hat y_i - \overline{\hat y})^2\) (where “reg” stands for “regression”)
  • \(\mathrm{SS}_\mathrm{res} = \sum_{i=1}^n (y_i-\hat y_i)^2\) (where “res” stands for “residual”)

Using this notation, the statement of lemma 10.1 becomes \[\begin{equation*} \mathrm{SS}_\mathrm{tot} = \mathrm{SS}_\mathrm{reg} + \mathrm{SS}_\mathrm{res}. \end{equation*}\]
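
As a quick numerical check (a sketch, using the same model for the stackloss dataset that is fitted in example 10.1 below), we can verify this decomposition in R:

m <- lm(stack.loss ~ ., data = stackloss)
y <- stackloss$stack.loss
SS.tot <- sum((y - mean(y))^2)
SS.reg <- sum((fitted(m) - mean(fitted(m)))^2)
SS.res <- sum(resid(m)^2)
all.equal(SS.tot, SS.reg + SS.res)   # TRUE, up to rounding error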

Using the result of lemma 10.1, for linear regression models with an intercept we can also express \(R^2\) as \[\begin{equation*} R^2 = \mathrm{s}_{\hat y}^2 / \mathrm{s}_y^2. \end{equation*}\] Since the numerator is non-negative and the denominator is positive, we have \(R^2 \geq 0\). Similarly, it is clear from the definition that \(R^2 \leq 1\), so that \(R^2 \in [0, 1]\) always holds for such models.

Example 10.1 We can easily compute the \(R^2\) value for the stackloss dataset manually:

m <- lm(stack.loss ~ ., data = stackloss)
R.squared <- 1 - var(resid(m)) / var(stackloss$stack.loss)
R.squared
[1] 0.9135769
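
For models with an intercept, the alternative formula \(R^2 = \mathrm{s}_{\hat y}^2 / \mathrm{s}_y^2\) gives the same value:

var(fitted(m)) / var(stackloss$stack.loss)   # agrees with R.squared above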

We can also find this value near the bottom of the output of summary(m), listed as Multiple R-squared:

summary(m)

Call:
lm(formula = stack.loss ~ ., data = stackloss)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.2377 -1.7117 -0.4551  2.3614  5.6978 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
Water.Temp    1.2953     0.3680   3.520  0.00263 ** 
Acid.Conc.   -0.1521     0.1563  -0.973  0.34405    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared:  0.9136,    Adjusted R-squared:  0.8983 
F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09

Note that the alternative formula \(R^2 = \mathrm{s}_{\hat y}^2 / \mathrm{s}_y^2\) and the variance decomposition in lemma 10.1 require the model to include an intercept term.

10.2 Adjusted \(R^2\)

One problem with the \(R^2\) value is that it never decreases, and typically increases, when another input variable is added to the model. Thus, the \(R^2\) value cannot be used to compare model fit for models with different numbers of variables. The adjusted \(R^2\) value attempts to compensate for this effect.

Definition 10.2 The adjusted \(R^2\) value is given by \[\begin{equation*} R^2_\mathrm{adj} := 1 - \frac{\frac{1}{n-p-1}\mathrm{s}_{\hat\varepsilon}^2}{\frac{1}{n-1}\mathrm{s}_y^2} = 1 - \frac{n-1}{n-p-1}(1 - R^2). \end{equation*}\]

The value \(R^2_\mathrm{adj}\) is never larger than the \(R^2\) value, and unlike \(R^2\) it can be (slightly) negative.
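
To see that negative values do occur, consider the following small toy example (not part of the stackloss analysis; the data are made up for illustration). The predictor x0 is uncorrelated with the response y0, so \(R^2 = 0\) and \(R^2_\mathrm{adj} = 1 - \frac{n-1}{n-p-1} = 1 - \frac{5}{4} = -0.25\):

y0 <- c(1, 2, 1, 2, 1, 2)
x0 <- c(1, 1, 2, 2, 3, 3)
summary(lm(y0 ~ x0))$r.squared       # 0 (up to rounding): x0 explains none of the variation
summary(lm(y0 ~ x0))$adj.r.squared   # -0.25: a negative adjusted R^2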

Example 10.2 The adjusted \(R^2\) value for the stackloss dataset can be found as follows:

n <- nrow(stackloss)
p <- ncol(stackloss) - 1
R.squared.adj <- 1 - (n - 1) / (n - p - 1) * (1 - R.squared)
R.squared.adj
[1] 0.8983258

We can also find this value near the bottom of the output of summary(m), listed as Adjusted R-squared.

Example 10.3 As suggested in example 9.7, we may want to add a quadratic term to the model for the stackloss dataset:

m2 <- lm(stack.loss ~ . + I(Water.Temp^2), data = stackloss)
summary(m2)

Call:
lm(formula = stack.loss ~ . + I(Water.Temp^2), data = stackloss)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1217 -1.5205 -0.3091  0.9753  6.0554 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
(Intercept)     57.73727   46.08487   1.253  0.22826   
Air.Flow         0.53683    0.14708   3.650  0.00216 **
Water.Temp      -7.62030    4.10441  -1.857  0.08188 . 
Acid.Conc.      -0.09697    0.14371  -0.675  0.50946   
I(Water.Temp^2)  0.21224    0.09738   2.179  0.04459 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.936 on 16 degrees of freedom
Multiple R-squared:  0.9334,    Adjusted R-squared:  0.9167 
F-statistic: 56.02 on 4 and 16 DF,  p-value: 3.293e-09

As expected, the \(R^2\) value has increased here (from 0.9136 to 0.9334). Since the adjusted \(R^2\) value has also increased (from 0.8983 to 0.9167), the extended model may be better than the original model.

10.3 Error Variance

Another measure for model fit is the estimated error variance, \[\begin{equation*} \hat\sigma^2 = \frac{1}{n - p - 1} \sum_{i=1}^n (y_i - \hat y_i)^2. \end{equation*}\] Good models have small \(\hat\sigma^2\). This measure is closely related to the adjusted \(R^2\) value. We have \[\begin{align*} R^2_\mathrm{adj} &= 1 - \frac{\frac{1}{n-p-1}\mathrm{s}_{\hat\varepsilon}^2}{\frac{1}{n-1}\mathrm{s}_y^2} \\ &= 1 - \frac{\frac{n-1}{n-p-1}\mathrm{s}_{\hat\varepsilon}^2}{\mathrm{s}_y^2} \\ &= 1 - \frac{\frac{1}{n-p-1} \sum_{i=1}^n (y_i - \hat y_i)^2}{\mathrm{s}_y^2} \\ &= 1 - \frac{\hat\sigma^2}{\mathrm{s}_y^2}, \end{align*}\] where \(\mathrm{s}_{\hat\varepsilon}^2\) and \(\mathrm{s}_y^2\) are the sample variances as before. Thus, minimising \(\hat\sigma^2\) is equivalent to maximising \(R^2_\mathrm{adj}\).
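
As a brief check (a sketch, reusing the model m and the value R.squared.adj from the examples above), the estimated error variance can be obtained with the sigma() function in R, and the identity \(R^2_\mathrm{adj} = 1 - \hat\sigma^2 / \mathrm{s}_y^2\) can be verified numerically:

sigma.hat.sq <- sigma(m)^2                      # equals sum(resid(m)^2) / (n - p - 1)
1 - sigma.hat.sq / var(stackloss$stack.loss)    # agrees with R.squared.adj from example 10.2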

10.4 Prediction Error Sum of Squares

The Prediction Error Sum of Squares (PRESS) compares each output \(y_i\) to the fitted value \(\hat y^{(i)}_i\) for observation \(i\), computed without using sample \(i\): \[\begin{equation*} \mathrm{PRESS} := \sum_{i=1}^n \bigl( y_i - \hat y^{(i)}_i \bigr)^2. \end{equation*}\] By using \(\hat y^{(i)}_i\) instead of \(\hat y_i\) in the comparison, we avoid overoptimistic estimates that would arise from overfitting in models with too many parameters. Good models have small PRESS. The following lemma provides a convenient formula for computing PRESS without having to fit \(n\) separate models.

Lemma 10.2 We have \[\begin{equation*} \mathrm{PRESS} = \sum_{i=1}^n \Bigl( \frac{\hat\varepsilon_i}{1 - h_{ii}} \Bigr)^2, \end{equation*}\] where \(h_{ii}\) denotes the diagonal elements of the hat matrix \(H\).

Proof. From equation (8.3) we know how the estimated regression coefficients change, when observation \(i\) is deleted from the model: \[\begin{equation*} \hat\beta^{(i)} - \hat\beta = - (X^\top X)^{-1} x_i \frac{\hat\varepsilon_i}{1 - h_{ii}}. \end{equation*}\] Thus we have \[\begin{align*} \hat y^{(i)}_i - \hat y_i &= \bigl( X \hat\beta^{(i)} - X \hat\beta \bigr)_i \\ &= - \bigl( X (X^\top X)^{-1} x_i \frac{\hat\varepsilon_i}{1 - h_{ii}} \bigr)_i \\ &= - x_i^\top (X^\top X)^{-1} x_i \frac{\hat\varepsilon_i}{1 - h_{ii}} \\ &= - \frac{h_{ii}}{1 - h_{ii}} \hat\varepsilon_i. \end{align*}\] From this we get \[\begin{align*} y_i - \hat y^{(i)}_i &= y_i - \hat y_i + \hat y_i - \hat y^{(i)}_i \\ &= \hat\varepsilon_i + \frac{h_{ii}}{1 - h_{ii}} \hat\varepsilon_i \\ &= \frac{1}{1 - h_{ii}} \hat\varepsilon_i. \end{align*}\] Substituting this expression into the definition of PRESS completes the proof.
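
In R, the formula from lemma 10.2 lets us compute PRESS for the stackloss model directly from the residuals and the diagonal of the hat matrix (via hatvalues()). A brute-force leave-one-out computation, included here as a sketch to double-check the lemma, gives the same value:

# shortcut from lemma 10.2
PRESS <- sum((resid(m) / (1 - hatvalues(m)))^2)

# brute-force check: refit the model n times, leaving out one observation each time
loo.errors <- sapply(seq_len(nrow(stackloss)), function(i) {
  m.i <- lm(stack.loss ~ ., data = stackloss[-i, ])
  stackloss$stack.loss[i] - predict(m.i, newdata = stackloss[i, ])
})
all.equal(PRESS, sum(loo.errors^2))   # TRUE: both computations agree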

10.5 Akaike’s Information Criterion

Akaike’s Information Criterion (AIC) is defined as \[\begin{equation*} \mathrm{AIC} = 2q - 2 \log(\hat L), \end{equation*}\] where \(q\) is the number of parameters in the model, and \(\hat L\) is the maximum of the likelihood function when these \(q\) parameters are chosen optimally. For a model with \(p\) predictors and an intercept, we have \(q = p+1\). Good models have small AIC.

Since we assume \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\), i.i.d. and we have \(\varepsilon_i = y_i - x_i^\top\beta\), the likelihood function for our model is \[\begin{align*} L(\beta, \sigma^2; X, y) &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Bigl( - \frac{(y_i - (X\beta)_i)^2}{2\sigma^2} \Bigr) \\ &= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\Bigl( - \sum_{i=1}^n \frac{(y_i - (X\beta)_i)^2}{2\sigma^2} \Bigr). \end{align*}\] Taking logarithms we get \[\begin{align*} \log L(\beta, \sigma^2; X, y) &= -\frac{n}{2} \log(2\pi\sigma^2) - \sum_{i=1}^n \frac{(y_i - (X\beta)_i)^2}{2\sigma^2} \\ &= -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} r(\beta), \end{align*}\] where \(r(\beta)\) is the residual sum of squares, as introduced in section 2. To compute the AIC we need to maximise this expression. The values of \(\beta\) and \(\sigma^2\) where the maximum is taken are called the maximum likelihood estimate (MLE) for \(\beta\) and \(\sigma^2\). From lemma 2.1 we know that for fixed \(\sigma^2\), \(L\) takes its maximum when \(\beta\) equals the least squares estimate \(\hat\beta\). Taking derivatives we can then find the optimal value of \(\sigma^2\). The result is \[\begin{equation*} \hat\sigma^2_\mathrm{MLE} = \frac1n r(\hat\beta) = \frac{1}{n} \sum_{i=1}^n \bigl( y_i - (X \hat\beta)_i \bigr)^2 = \frac{n-p-1}{n} \hat\sigma^2. \end{equation*}\] Thus we get \[\begin{align*} \mathrm{AIC} &= 2q + n \log(2\pi\hat\sigma^2_\mathrm{MLE}) + \frac{1}{\hat\sigma^2_\mathrm{MLE}} r(\hat\beta) \\ &= 2q + n \log\bigl( 2\pi r(\hat\beta) / n \bigr) + n \\ &= 2q + n \log\bigl( \|\hat\varepsilon\|^2 \bigr) + n + n \log( 2\pi / n ). \end{align*}\]
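
As an illustration (a sketch, reusing the model m for the stackloss data), we can evaluate the last expression numerically and compare it to the value reported by R's AIC() function. One caveat: AIC() for linear models also counts the estimated error variance \(\hat\sigma^2_\mathrm{MLE}\) as a parameter, so the comparison below uses \(q = p + 2\); using \(q = p + 1\) instead only shifts the value by 2 and does not change the ranking of models fitted to the same data.

n <- nrow(stackloss)
rss <- sum(resid(m)^2)        # = r(beta.hat) = ||eps.hat||^2
q <- length(coef(m)) + 1      # p + 2, since AIC() also counts the error variance
2 * q + n * log(rss) + n + n * log(2 * pi / n)
AIC(m)                        # same value as the formula above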

When these measures indicate poor model fit, data transformations discussed in section 11 may help to better satisfy the modelling assumptions.

Summary

  • The \(R^2\) value is a numerical measure of the goodness of fit of a model. Large values indicate good fit.
  • If models with different numbers of variables are compared, \(R^2_\mathrm{adj}\) should be used instead of \(R^2\).
  • Minimising the error variance \(\hat\sigma^2\) is equivalent to maximising \(R^2_\mathrm{adj}\).
  • Other measures for model quality include PRESS and AIC. Good models have small PRESS and small AIC.