Section 12 Model Selection

12.1 Candidate Models

Here we assume that we are given data with input variables \(x_1, \ldots, x_p\). The standard model we have discussed so far adds an intercept to these inputs, so that there are \(q = p+1\) model parameters in total. We can modify this model in various ways:

  • When we discussed hypothesis tests, we allowed for the possibility that some inputs do not influence the output. A reasonable approach is to fit a model with these inputs omitted. Since each input may be either included in or excluded from the model, independently of the others, and similarly the intercept may be included or not, there are \(2^{p+1}\) possible models to consider.

  • In the section about Improving the Model Fit we have seen that it is sometimes useful to transform input variables before using them in the model. This can happen either individually, e.g. \(x_2' = \log x_2\), or collectively, e.g. \(x^\ast = x_1 x_2\); a short R illustration follows this list. The number of models we can obtain in this way is unlimited.
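
As an illustration of the second point, such transformations can be written directly in an R model formula. The following sketch is purely illustrative and borrows the stackloss data from example 12.1 below:

lm(stack.loss ~ Air.Flow + log(Water.Temp) + I(Air.Flow * Water.Temp),
   data = stackloss)

Here log(Water.Temp) transforms an input individually, while I(Air.Flow * Water.Temp) adds the product of two inputs as a new variable.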

If we want to compare these candidate models in a systematic way, we need efficient methods for comparison, since often a large number of models has to be considered.

12.2 Misspecified Models

A model is said to be misspecified if the data we are using comes from a different model than the one we use to estimate the coefficients.

12.2.1 Missing Variables

Assume that the data comes from the model \[\begin{equation*} Y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon, \end{equation*}\] (we can include the intercept as \(x_1 = 1\) if needed), but we are only using the first \(q\) variables \(x_1, \ldots, x_q\), where \(q < p\), to estimate the coefficients. Then the least squares estimate of \(\beta = (\beta_1, \ldots, \beta_q)\) is given by \[\begin{equation*} \hat\beta = (X^\top X)^{-1} X^\top y \end{equation*}\] where \(X \in \mathbb{R}^{n\times q}\), and our estimate for the error variance \(\sigma^2\) is \[\begin{equation*} \hat\sigma^2 = \frac{1}{n-q} y^\top (I - H) y, \end{equation*}\] where \(H = X (X^\top X)^{-1} X^\top\) is the hat matrix computed without using \(x_{q+1}, \ldots, x_p\).

If we write \(\tilde X \in \mathbb{R}^{n\times (p-q)}\) for the matrix with columns \(x_{q+1}, \ldots, x_p\) and \(\tilde\beta = (\beta_{q+1}, \ldots, \beta_p) \in \mathbb{R}^{p-q}\), then the full model can be written as \[\begin{equation*} Y = X \beta + \tilde X \tilde\beta + \varepsilon \end{equation*}\] and we get \[\begin{align*} \hat\beta &= (X^\top X)^{-1} X^\top Y \\ &= (X^\top X)^{-1} X^\top (X \beta + \tilde X \tilde\beta + \varepsilon) \\ &= (X^\top X)^{-1} X^\top X \beta + (X^\top X)^{-1} X^\top \tilde X \tilde\beta + (X^\top X)^{-1} X^\top \varepsilon\\ &= \beta + (X^\top X)^{-1} X^\top \tilde X \tilde\beta + (X^\top X)^{-1} X^\top \varepsilon. \end{align*}\] Thus, we have \[\begin{equation*} \mathbb{E}(\hat\beta) = \beta + (X^\top X)^{-1} X^\top \tilde X \tilde\beta \end{equation*}\] and \[\begin{equation*} \mathop{\mathrm{bias}}(\hat\beta) = (X^\top X)^{-1} X^\top \tilde X \tilde\beta. \end{equation*}\] This shows that now the estimate is biased in general. There are two special cases when omitting variables does not introduce a bias: The first is when all of the omitted coefficients equal zero, i.e. when we have \(\tilde\beta = 0\). The second case is when the omitted inputs are orthogonal to the included inputs, since then we have \(X^\top \tilde X = 0\).

Using the formula for \(\hat\beta\) we find \[\begin{align*} \mathop{\mathrm{Cov}}( \hat\beta ) &= \mathop{\mathrm{Cov}}\Bigl( \beta + (X^\top X)^{-1} X^\top \tilde X \tilde\beta + (X^\top X)^{-1} X^\top \varepsilon\Bigr) \\ &= \mathop{\mathrm{Cov}}\Bigl( (X^\top X)^{-1} X^\top \varepsilon\Bigr) \\ &= \sigma^2 (X^\top X)^{-1} X^\top I X (X^\top X)^{-1} \\ &= \sigma^2 (X^\top X)^{-1}. \end{align*}\] Thus, the variance of \(\hat\beta\) is not affected by the omitted variables. Similar to our derivation in the section Estimating the Error Variance one can show that \[\begin{equation*} \mathbb{E}\bigl( \hat\sigma^2 \bigr) = \sigma^2 + \frac{\tilde\beta^\top \tilde X^\top (I - H) \tilde X \tilde \beta}{n-q}. \end{equation*}\] Thus, the estimate of the error variance is in general also biased.

This shows that our estimates can become biased if some of the inputs are missing from our model. As in the previous section, it may happen that the MSE of the parameter estimates in the reduced model is smaller than the MSE in the correct model; this is the case if the variance of the estimates decreases enough to compensate for the introduced bias.
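
The following short simulation is not part of the derivation above, but checks both formulas numerically. We use an assumed toy design with two correlated inputs, fit the model using only the first one, and keep the design fixed while redrawing the noise:

set.seed(1)
n <- 100
q <- 1
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)               # omitted input, correlated with x1
beta <- 2
beta.tilde <- 1.5
sigma <- 1

X <- cbind(x1)                           # design matrix of the reduced model
X.tilde <- cbind(x2)                     # omitted column
H <- X %*% solve(t(X) %*% X) %*% t(X)    # hat matrix of the reduced model

# theoretical values, from the formulas derived above
bias.theory <- drop(solve(t(X) %*% X, t(X) %*% X.tilde)) * beta.tilde
sigma2.mean.theory <- sigma^2 +
  drop(t(X.tilde %*% beta.tilde) %*% (diag(n) - H) %*% (X.tilde %*% beta.tilde)) / (n - q)

# Monte Carlo estimates of E(beta.hat) - beta and E(sigma^2.hat)
res <- replicate(10000, {
  y <- X %*% beta + X.tilde %*% beta.tilde + sigma * rnorm(n)
  beta.hat <- drop(solve(t(X) %*% X, t(X) %*% y))
  sigma2.hat <- drop(t(y) %*% (diag(n) - H) %*% y) / (n - q)
  c(beta.hat, sigma2.hat)
})
c(simulated = mean(res[1, ]) - beta, theory = bias.theory)
c(simulated = mean(res[2, ]), theory = sigma2.mean.theory)

Since \(x_2\) is positively correlated with \(x_1\) and \(\tilde\beta > 0\), the bias of \(\hat\beta_1\) is positive in this example.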

12.2.2 Unnecessary Variables

If the data comes from a model with fewer inputs than the ones we are using, the model is not “misspecified” in the strict sense. It is just the case that some of the true \(\beta_j\) exactly equal zero. In this case, our previous results show that the least squares estimate is unbiased.

Including additional variables in a model can still cause problems: one can show that every unnecessary variable added increases the variance of the estimates \(\hat\beta_j\). Instead of giving a proof of this fact, we illustrate the effect using a numerical example.

Example 12.1 Here we consider the stackloss dataset with an added column of noise. We have seen that \[\begin{equation*} \mathop{\mathrm{Var}}( \hat\beta_j ) = \sigma^2 \bigl( X^\top X \bigr)^{-1}_{jj}. \end{equation*}\] While we don’t know the true value of \(\sigma^2\), adding a column does not change \(\sigma^2\), so \(\sigma^2\) cancels when we compute the relative change in variance.

We first compute the diagonal elements of \(\bigl( X^\top X \bigr)^{-1}\) for the original dataset:

X1 <- model.matrix(stack.loss ~ ., data = stackloss)
d1 <- diag(solve(t(X1) %*% X1))
d1
 (Intercept)     Air.Flow   Water.Temp   Acid.Conc. 
13.452726695  0.001728874  0.012875424  0.002322167 

Now we add a column of noise and re-compute the values:

set.seed(20211115)
n <- nrow(X1)
X2 <- cbind(X1, rnorm(n))
d2 <- diag(solve(t(X2) %*% X2))
d2
 (Intercept)     Air.Flow   Water.Temp   Acid.Conc.              
14.397515774  0.001730570  0.015195467  0.002796487  0.064828744 

Finally, we determine the change in variance in percent:

round(100 * (d2[-5] - d1) / d1, 3)
(Intercept)    Air.Flow  Water.Temp  Acid.Conc. 
      7.023       0.098      18.019      20.426 

We see that the variances of the \(\hat\beta_j\) increased by up to \(20\%\) when we added the unnecessary input.

The example illustrates that it is important to keep the model as small as possible.

12.3 Choosing Between Models

12.3.1 Criteria for Model Selection

When comparing different models, we need a criterion to decide which is “best”. Several options exist:

  • \(R^2\) (coefficient of determination): Measures the proportion of variance explained by the model. However, \(R^2\) always increases when variables are added, even if they are irrelevant. This makes it unsuitable for model selection.

  • Adjusted \(R^2\) (written \(R^2_\mathrm{adj}\)): Adjusts \(R^2\) to penalize model complexity. This is the criterion we will use throughout this section and the next.

  • AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion): Alternative criteria that also penalize model complexity, discussed in section 10.
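
All three criteria can be read off a fitted model in R. For instance, for the full model of the stackloss data used in example 12.1:

m <- lm(stack.loss ~ ., data = stackloss)
summary(m)$r.squared       # R^2
summary(m)$adj.r.squared   # adjusted R^2
AIC(m)                     # Akaike information criterion
BIC(m)                     # Bayesian information criterion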

In the following subsections, we describe algorithms that use these criteria to select models efficiently.

12.3.2 Stepwise Forward Selection

Here the idea is to start with a minimal model, and then add variables one by one, until no further (large enough) improvements can be achieved.

Start with the intercept-only model \(y = \beta_0 + \varepsilon\), then consider each of the \(p\) models \[\begin{equation*} y = \beta_0 + \beta_j x_j + \varepsilon, \end{equation*}\] for \(j \in \{1, \ldots, p\}\).

Choose the model with the smallest residual sum of squares, provided that the “significance of the fitted model” achieves a specified threshold. The process continues by adding more variables, one at a time, until either

  • All variables are in the model.
  • The significance level cannot be achieved by any variable not in the model.

The “significance of the model” can be examined by considering a \(t\)-test as in lemma 5.4.
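
R’s built-in step() function implements a variant of this procedure: it uses the AIC (see section 12.3.1) instead of a \(t\)-test threshold as stopping criterion. A minimal sketch, using the stackloss data from example 12.1:

m0 <- lm(stack.loss ~ 1, data = stackloss)   # intercept-only model
step(m0, scope = ~ Air.Flow + Water.Temp + Acid.Conc.,
     direction = "forward")

The scope argument lists the variables that may be added; step() prints the models it considers and returns the final fit.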

12.3.3 Stepwise Backward Selection

Here the idea is to start with the full model, and to remove variables one by one until no good enough candidates for removal are left.

In each step we consider the test statistics \(|T_j|\) for the tests \[\begin{equation*} H_0\colon \beta_j = 0 \quad \textit{vs.} \quad H_1\colon \beta_j \neq 0 \end{equation*}\] for \(j \in \{1, \ldots, p\}\), again as in lemma 5.4. The method selects the \(x_j\) corresponding to the smallest \(|T_j|\). If this value is below a given threshold, we remove \(x_j\) and re-fit the model. Repeat until either:

  • No variables are left in the model.
  • The smallest \(|T_j|\) is above the threshold.
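
In R, the individual tests can be inspected with drop1(); for a single term the \(F\)-statistic reported there equals \(T_j^2\). Backward selection itself is again available through step(), which uses the AIC in place of a fixed threshold on \(|T_j|\). A minimal sketch:

m.full <- lm(stack.loss ~ ., data = stackloss)   # start from the full model
drop1(m.full, test = "F")                        # tests for each removable term
step(m.full, direction = "backward")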

12.3.4 Hybrid Methods

Either start with the full model or with just the intercept. At each stage consider:

  • removing least significant variable already in the model,
  • adding most significant variable not currently in the model,

with significance levels set so as to avoid cyclic behaviour.
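
In R, step() with direction = "both" performs such a hybrid search, again using the AIC rather than significance levels; here starting from the full model:

m.full <- lm(stack.loss ~ ., data = stackloss)
step(m.full, direction = "both")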

Summary

  • Model selection requires choosing a criterion such as \(R^2_\mathrm{adj}\), AIC, or BIC to compare models.
  • Stepwise methods (forward, backward, hybrid) provide computationally efficient heuristics for large \(p\).
  • These methods do not guarantee finding the optimal model.
  • Exhaustive search can find the optimal model when \(p\) is small to moderate.
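
For such an exhaustive search, the regsubsets() function from the leaps package (which is not used elsewhere in these notes) examines all subsets of the inputs; a minimal sketch, assuming the package is installed:

library(leaps)
fits <- regsubsets(stack.loss ~ ., data = stackloss)
summary(fits)$adjr2   # adjusted R^2 of the best model of each size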