Section 12 Model Selection
12.1 Candidate Models
Here we assume that we are given data with input variables \(x_1, \ldots, x_p\). The standard model we have discussed so far adds an intercept to these inputs, so that there are \(q = p+1\) model parameters in total. We can modify this model in various ways:
When we discussed hypothesis tests, we allowed for the possibility that some inputs do not influence the output. A reasonable approach would be to fit a model with these inputs omitted. Since each input may either be included in or excluded from the model, independently of the others, and similarly the intercept may be included or not, there are \(2^{p+1}\) possible models to consider.
In section 11 we saw that it is sometimes useful to transform input variables before using them in the model. This can be done for individual inputs, e.g. \(x_2' = \log x_2\), or by combining inputs, e.g. \(x^\ast = x_1 x_2\). The number of models we can obtain by adding transformed versions of the inputs is unlimited.
If we want to compare these candidate models in a systematic way, we need to use efficient methods for comparison, since often a large number of models needs to be considered.
12.2 Misspecified Models
A model is said to be misspecified if the data we are using come from a different model than the one we use to estimate the coefficients.
12.2.1 Missing Variables
Assume that the data comes from the model \[\begin{equation*} Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon, \end{equation*}\] where \(p\) denotes the number of predictors (not including the intercept), but we are only using the first \(q\) predictors \(x_1, \ldots, x_q\), where \(q < p\), to estimate the coefficients. Then the least squares estimate of \(\beta = (\beta_0, \beta_1, \ldots, \beta_q)\) is given by \[\begin{equation*} \hat\beta = (X^\top X)^{-1} X^\top y \end{equation*}\] where \(X \in \mathbb{R}^{n\times (q+1)}\) has columns \([1, x_1, \ldots, x_q]\), and our estimate for the error variance \(\sigma^2\) is \[\begin{equation*} \hat\sigma^2 = \frac{1}{n-q-1} y^\top (I - H) y, \end{equation*}\] where \(H = X (X^\top X)^{-1} X^\top\) is the hat matrix computed using only the first \(q\) predictors.
If we write \(\tilde X \in \mathbb{R}^{n\times (p-q)}\) for the matrix with columns \(x_{q+1}, \ldots, x_p\) and \(\tilde\beta = (\beta_{q+1}, \ldots, \beta_p) \in \mathbb{R}^{p-q}\), then the full model can be written as \[\begin{equation*} Y = X \beta + \tilde X \tilde\beta + \varepsilon \end{equation*}\] and we get \[\begin{align*} \hat\beta &= (X^\top X)^{-1} X^\top Y \\ &= (X^\top X)^{-1} X^\top (X \beta + \tilde X \tilde\beta + \varepsilon) \\ &= (X^\top X)^{-1} X^\top X \beta + (X^\top X)^{-1} X^\top \tilde X \tilde\beta + (X^\top X)^{-1} X^\top \varepsilon \\ &= \beta + (X^\top X)^{-1} X^\top \tilde X \tilde\beta + (X^\top X)^{-1} X^\top \varepsilon. \end{align*}\] Thus, we have \[\begin{equation*} \mathbb{E}(\hat\beta) = \beta + (X^\top X)^{-1} X^\top \tilde X \tilde\beta \end{equation*}\] and \[\begin{equation*} \mathop{\mathrm{bias}}(\hat\beta) = (X^\top X)^{-1} X^\top \tilde X \tilde\beta. \end{equation*}\] This shows that the estimate is now biased in general. There are two special cases in which omitting variables does not introduce a bias: the first is when all of the omitted coefficients equal zero, i.e. when \(\tilde\beta = 0\); the second is when the omitted inputs are orthogonal to the included inputs, since then \(X^\top \tilde X = 0\).
Using the formula for \(\hat\beta\) we find \[\begin{align*} \mathop{\mathrm{Cov}}( \hat\beta ) &= \mathop{\mathrm{Cov}}\Bigl( \beta + (X^\top X)^{-1} X^\top \tilde X \tilde\beta + (X^\top X)^{-1} X^\top \varepsilon\Bigr) \\ &= \mathop{\mathrm{Cov}}\Bigl( (X^\top X)^{-1} X^\top \varepsilon\Bigr) \\ &= \sigma^2 (X^\top X)^{-1} X^\top I X (X^\top X)^{-1} \\ &= \sigma^2 (X^\top X)^{-1}. \end{align*}\] Thus, the variance of \(\hat\beta\) is not affected by the omitted variables. Similarly to our derivation in section 4.4, one can show that \[\begin{equation*} \mathbb{E}\bigl( \hat\sigma^2 \bigr) = \sigma^2 + \frac{\tilde\beta^\top \tilde X^\top (I - H) \tilde X \tilde \beta}{n-q-1}. \end{equation*}\] Thus, the estimate of the error variance is also biased in general.
This shows that our estimates can become biased if some of the inputs are missing from our model. As in the previous section, it may happen that the MSE of the parameter estimates in the reduced model is smaller than the MSE in the correct model; this is the case if the variance of the estimates decreases enough to compensate for the introduced bias.
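To see the bias concretely, the following small simulation (with made-up data and coefficients, not taken from these notes) fits a model that omits a predictor correlated with an included one:

```r
# Omitted-variable bias on simulated data (all values are hypothetical).
# True model: y = 1 + 2*x1 + 3*x2 + eps, but we fit y ~ x1 only.
# Since x2 is correlated with x1, we have X^T Xtilde != 0 and the
# slope estimate for x1 is biased.
set.seed(1)
n <- 10000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)
y <- 1 + 2 * x1 + 3 * x2 + rnorm(n)
coef(lm(y ~ x1))  # slope close to 2 + 3*0.8 = 4.4 rather than 2
```

Here the bias formula gives \(\mathop{\mathrm{bias}}(\hat\beta_1) \approx \beta_2 \mathop{\mathrm{Cov}}(x_1, x_2)/\mathop{\mathrm{Var}}(x_1) = 3 \cdot 0.8 = 2.4\), matching the simulated estimate.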
12.2.2 Unnecessary Variables
If the data comes from a model with fewer inputs than the ones we are using, the model is not “misspecified” in the strict sense. It is just the case that some of the true \(\beta_j\) exactly equal zero. In this case, our previous results show that the least squares estimate is unbiased.
Including unnecessary variables in a model can still cause problems: one can show that each unnecessary variable added increases the variance of the estimates \(\hat\beta_j\). Instead of giving a proof of this fact, we illustrate the effect using a numerical example.
Example 12.1 Here we consider the stackloss dataset with an added column of noise. We have seen that
\[\begin{equation*}
\mathop{\mathrm{Var}}( \hat\beta_j )
= \sigma^2 \bigl( X^\top X \bigr)^{-1}_{jj}.
\end{equation*}\]
While we don't know the true value of \(\sigma^2\), we can still determine the relative change in variance when a column is added: adding a column does not change \(\sigma^2\), so \(\sigma^2\) cancels when we take the ratio of the variances.
We first compute the diagonal elements of \(\bigl( X^\top X \bigr)^{-1}\) for the original dataset:
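These diagonal elements can be computed, for example, as follows (a sketch in R; the original code is not shown in the notes):

```r
# Design matrix of the stackloss data (intercept plus three predictors)
# and the diagonal of (X^T X)^{-1}.
X <- model.matrix(stack.loss ~ ., data = stackloss)
diag(solve(crossprod(X)))  # crossprod(X) computes t(X) %*% X
```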
(Intercept) Air.Flow Water.Temp Acid.Conc.
13.452726695 0.001728874 0.012875424 0.002322167
Now we add a column of noise and re-compute the values:
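One way to add the noise column (here named noise, a naming choice of this sketch) and recompute the diagonal; the exact numbers below depend on the random draw, so the seed here is hypothetical:

```r
set.seed(1)  # hypothetical seed; the values shown used a different draw
stackloss2 <- cbind(stackloss, noise = rnorm(nrow(stackloss)))
X2 <- model.matrix(stack.loss ~ ., data = stackloss2)
diag(solve(crossprod(X2)))
```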
 (Intercept)    Air.Flow  Water.Temp  Acid.Conc.       noise
14.397515774 0.001730570 0.015195467 0.002796487 0.064828744
Finally, we determine the change in variance in percent:
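The relative changes can be obtained, e.g. as follows:

```r
# Percent increase of Var(beta_j) for the four original coefficients.
d1 <- diag(solve(crossprod(X)))
d2 <- diag(solve(crossprod(X2)))
round(100 * (d2[names(d1)] - d1) / d1, 3)
```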
(Intercept) Air.Flow Water.Temp Acid.Conc.
7.023 0.098 18.019 20.426
We see that the variances of the \(\hat\beta_j\) increased by up to \(20\%\) when we added the unnecessary input.
The example illustrates that it is important to keep the model as small as possible.
12.3 Testing Nested Models
In the preceding section, we discussed the bias-variance trade-off when choosing between a full model and a reduced model. A natural question arises: how can we formally test whether the additional predictors in the full model are needed? In section 6, we developed the \(F\)-test for testing linear combinations of coefficients. Here we show that for nested models, this test simplifies to a formula involving only the residual sums of squares.
We use the same setup as in the preceding section: the reduced model uses only the first \(q\) predictors, with design matrix \(X \in \mathbb{R}^{n\times (q+1)}\) (including intercept), while the full model uses all \(p\) predictors, with additional predictors in \(\tilde X \in \mathbb{R}^{n\times (p-q)}\). We wish to test \[\begin{equation*} H_0\colon \tilde\beta = 0 \quad \text{vs.} \quad H_1\colon \tilde\beta \neq 0. \end{equation*}\] Let \(X_{\text{full}} = [X\; \tilde X] \in \mathbb{R}^{n \times (p+1)}\) be the design matrix for the full model. We define the hat matrices \[\begin{equation*} H_{\text{reduced}} = X(X^\top X)^{-1} X^\top \quad \text{and} \quad H_{\text{full}} = X_{\text{full}}(X_{\text{full}}^\top X_{\text{full}})^{-1} X_{\text{full}}^\top, \end{equation*}\] and the residual sums of squares as \[\begin{equation*} \mathrm{RSS}_{\text{reduced}} := \| y - H_{\text{reduced}} y \|^2 \quad \text{and} \quad \mathrm{RSS}_{\text{full}} := \| y - H_{\text{full}} y \|^2. \end{equation*}\]
Theorem 12.1 Under the model \(Y = X\beta + \tilde X \tilde\beta + \varepsilon\) with \(\varepsilon\sim \mathcal{N}(0, \sigma^2 I)\), to test \(H_0\colon \tilde\beta = 0\), the \(F\)-statistic is given by \[\begin{equation} F = \frac{(\mathrm{RSS}_{\text{reduced}} - \mathrm{RSS}_{\text{full}})/(p-q)} {\mathrm{RSS}_{\text{full}}/(n-p-1)}. \tag{12.1} \end{equation}\] Under \(H_0\), we have \(F \sim F_{p-q, n-p-1}\).
Proof. In section 6, we have seen that the \(F\)-statistic for testing \(k\) linear combinations of the coefficients is given by \[\begin{equation*} F = \frac{\bigl\| K \hat\beta - K \beta \bigr\|_{K(X_{\text{full}}^\top X_{\text{full}})^{-1} K^\top}^2} {k \hat\sigma^2}, \end{equation*}\] where \(K\) is a matrix that selects the coefficients of interest and \(\hat\beta\) is the least squares estimate in the full model. For testing \(H_0\colon \tilde\beta = 0\) we choose \(K\) to select the last \(p-q\) coefficients, so that \(k = p-q\) and, under \(H_0\), \(K\beta = \tilde\beta = 0\). It can be shown (though we omit the details) that for this choice of \(K\), we have \[\begin{equation*} \bigl\| K \hat\beta \bigr\|_{K(X_{\text{full}}^\top X_{\text{full}})^{-1} K^\top}^2 = \mathrm{RSS}_{\text{reduced}} - \mathrm{RSS}_{\text{full}}. \end{equation*}\] Furthermore, under the full model, we have \[\begin{equation*} \hat\sigma^2 = \frac{\mathrm{RSS}_{\text{full}}}{n-p-1}. \end{equation*}\] Substituting these expressions into the \(F\)-statistic, we obtain \[\begin{equation*} F = \frac{(\mathrm{RSS}_{\text{reduced}} - \mathrm{RSS}_{\text{full}})/(p-q)} {\mathrm{RSS}_{\text{full}}/(n-p-1)}. \end{equation*}\] By lemma 6.1, this statistic follows an \(F_{p-q, n-p-1}\) distribution under \(H_0\).
This formula is particularly useful in practice: we can test whether predictors should be included simply by comparing residual sums of squares, without computing the Mahalanobis norm from section 6.
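As an illustration, the nested-model \(F\)-test can be carried out in R with anova(); the choice of reduced model below (only Air.Flow, using the stackloss data again) is arbitrary and for demonstration only:

```r
# F-test for nested models via equation (12.1).
reduced <- lm(stack.loss ~ Air.Flow, data = stackloss)
full <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
anova(reduced, full)  # compares the two residual sums of squares

# The same F-statistic computed directly from the RSS values:
rss_reduced <- sum(resid(reduced)^2)
rss_full <- sum(resid(full)^2)
n <- nrow(stackloss); p <- 3; q <- 1
((rss_reduced - rss_full) / (p - q)) / (rss_full / (n - p - 1))
```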
12.4 Choosing Between Models
12.4.1 Criteria for Model Selection
When comparing different models, we need a criterion to decide which is “best”. Several options exist:
\(R^2\) (coefficient of determination): Measures the proportion of variance explained by the model. However, \(R^2\) never decreases when variables are added, even if they are irrelevant. This makes it unsuitable for model selection.
Adjusted \(R^2\) (written \(R^2_\mathrm{adj}\)): Adjusts \(R^2\) to penalize model complexity, via \(R^2_\mathrm{adj} = 1 - (1 - R^2) \frac{n-1}{n-q}\), where \(q\) is the number of model parameters (including the intercept). Unlike \(R^2\), this can decrease when an irrelevant variable is added. This is the criterion we will use throughout this section and section 13.
AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion): Alternative criteria that also penalize model complexity, discussed in section 10.
In the following subsections, we describe algorithms that use these criteria to select models efficiently.
12.4.2 Exhaustive Search
When \(p\) is small (typically \(p < 20\)), we can fit all \(2^p\) possible models and choose the one with the best criterion value (e.g. the highest \(R^2_\mathrm{adj}\)). This guarantees finding the optimal model, but becomes computationally infeasible for larger \(p\), since \(2^{20}\) is already more than a million models. See section 13 for efficient algorithms.
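A minimal sketch of such a search, using the stackloss data and \(R^2_\mathrm{adj}\) as the criterion (packages such as leaps implement this more efficiently):

```r
# Fit all 2^p subsets of the predictors (intercept always included)
# and keep the model with the highest adjusted R^2.
predictors <- c("Air.Flow", "Water.Temp", "Acid.Conc.")
subsets <- expand.grid(rep(list(c(FALSE, TRUE)), length(predictors)))
best <- NULL
for (i in seq_len(nrow(subsets))) {
  vars <- predictors[unlist(subsets[i, ])]
  rhs <- if (length(vars) == 0) "1" else paste(vars, collapse = " + ")
  fit <- lm(as.formula(paste("stack.loss ~", rhs)), data = stackloss)
  r2a <- summary(fit)$adj.r.squared
  if (is.null(best) || r2a > best$r2a) best <- list(model = rhs, r2a = r2a)
}
best
```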
12.4.3 Stepwise Forward Selection
Forward selection starts with the intercept-only model \(y = \beta_0 + \varepsilon\) and then iteratively adds variables. At each step, for each variable not yet in the model, we compute the \(t\)-statistic for testing whether that variable’s coefficient is zero (lemma 5.4). We add the variable (if any) with the largest \(|t|\)-value, provided this value exceeds a chosen threshold (commonly \(|t| > 2\), corresponding approximately to a 5% significance level). The process continues until no remaining variable exceeds the threshold.
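A sketch of this \(t\)-based procedure (the function name and threshold are illustrative; R’s built-in step() uses AIC instead, see below):

```r
# t-based forward selection: add the variable with the largest |t|,
# as long as it exceeds the threshold tmin.
forward_select <- function(data, response, candidates, tmin = 2) {
  included <- character(0)
  repeat {
    remaining <- setdiff(candidates, included)
    if (length(remaining) == 0) break
    tvals <- sapply(remaining, function(v) {
      f <- reformulate(c(included, v), response = response)
      abs(summary(lm(f, data = data))$coefficients[v, "t value"])
    })
    if (max(tvals) <= tmin) break  # no candidate is above the threshold
    included <- c(included, remaining[which.max(tvals)])
  }
  included
}
forward_select(stackloss, "stack.loss", c("Air.Flow", "Water.Temp", "Acid.Conc."))
```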
12.4.4 Stepwise Backward Selection
Backward selection starts with the full model containing all \(p\) predictors, then iteratively removes variables. At each step, we compute the \(t\)-statistic \(T_j\) for testing \(H_0\colon \beta_j = 0\) for each variable currently in the model (lemma 5.4). We remove the variable with the smallest \(|T_j|\), provided this is below a chosen threshold (commonly \(|t| < 2\)). The process continues until all remaining variables have \(|t|\)-values above the threshold.
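The corresponding backward sketch, under the same assumptions:

```r
# t-based backward elimination: remove the variable with the smallest
# |t| while it is below the threshold tmin.
backward_select <- function(data, response, candidates, tmin = 2) {
  included <- candidates
  while (length(included) > 0) {
    f <- reformulate(included, response = response)
    tvals <- abs(summary(lm(f, data = data))$coefficients[included, "t value"])
    if (min(tvals) >= tmin) break  # all remaining variables are significant
    included <- included[-which.min(tvals)]
  }
  included
}
backward_select(stackloss, "stack.loss", c("Air.Flow", "Water.Temp", "Acid.Conc."))
```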
12.4.5 Bidirectional Selection
Bidirectional (or hybrid) methods combine forward and backward selection. At each step, we consider both adding the most significant variable not in the model and removing the least significant variable currently in the model. Typically, different significance thresholds are used for entry and removal to avoid cycling.
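In practice, R’s built-in step() function implements stepwise selection with AIC as the criterion rather than \(t\)-thresholds; for example:

```r
# Bidirectional stepwise selection by AIC, starting from the full model.
fit_full <- lm(stack.loss ~ ., data = stackloss)
step(fit_full, direction = "both", trace = 0)
# direction = "forward" or "backward" gives the one-directional variants;
# forward selection additionally needs a 'scope' of candidate terms.
```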
Remark. While computationally efficient, stepwise procedures have important limitations: the resulting test statistics and confidence intervals are biased (since variable selection and testing use the same data), the methods are prone to overfitting, and small changes in the data can lead to very different selected models. These procedures should be used primarily for exploratory analysis, and any final model should ideally be validated on independent data.
Summary
- Misspecified models affect estimates: missing variables introduce bias \(\mathop{\mathrm{bias}}(\hat\beta) = (X^\top X)^{-1} X^\top \tilde X \tilde\beta\), while unnecessary variables increase variance.
- The bias-variance tradeoff: smaller models may have lower MSE despite bias if variance reduction is sufficient.
- For nested models, the F-test simplifies to comparing residual sums of squares: \(F = \frac{(\mathrm{RSS}_{\text{reduced}} - \mathrm{RSS}_{\text{full}})/(p-q)}{\mathrm{RSS}_{\text{full}}/(n-p-1)}\).
- Model selection requires choosing a criterion such as \(R^2_\mathrm{adj}\), AIC, or BIC to compare models.
- Stepwise methods (forward, backward, hybrid) provide computationally efficient heuristics for large \(p\), but do not guarantee finding the optimal model.
- Exhaustive search can find the optimal model when \(p\) is small to moderate.