Section 5 Uncertainty for Individual Regression Coefficients

In this section we will consider different ways to study the uncertainty in the estimates \(\hat\beta_i\) for the regression coefficient \(\beta_i\), individually. In the following sections we will then consider the problem of simultaneously estimating several or all coefficients.

5.1 Measuring the Estimation Error

We have seen that \(\hat\beta \sim \mathcal{N}(\beta, \sigma^2 C)\), where \(C := (X^\top X)^{-1}\). Restricting this to a single coefficient, we find \[\begin{equation*} \hat\beta_i \sim \mathcal{N}\bigl( \beta_i, \sigma^2 C_{ii} \bigr), \end{equation*}\] since the diagonal elements of the covariance matrix contain the variances of the elements of a random vector. In practice we will not know the value of \(\sigma^2\), so we have to estimate this from data, using the estimator \[\begin{equation*} \hat\sigma^2 = \frac{1}{n-p-1} \sum_{i=1}^n (y_i - \hat y_i)^2, \end{equation*}\] from equation (4.6). As a first application of Cochran’s theorem we showed in equation (4.7) that \[\begin{equation*} \frac{1}{\sigma^2} (n - p - 1) \hat\sigma^2 \sim \chi^2(n - p - 1). \end{equation*}\]
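We can check this sampling distribution numerically. The following simulation is an illustrative sketch (the design matrix and parameter values here are invented, not taken from the text): it generates many data sets from a simple linear model and compares the empirical variance of \(\hat\beta_1\) with the theoretical value \(\sigma^2 C_{11}\).

```r
# Monte Carlo check of Var(hat.beta_1) = sigma^2 * C_{11}.
# Note: R indices start at 1, so C[2, 2] corresponds to C_{11}
# in the 0-based indexing convention used in the text.
set.seed(1)
n <- 50
x <- runif(n)
X <- cbind(1, x)                     # design matrix (intercept + one input)
C <- solve(t(X) %*% X)
beta <- c(1, 3)                      # invented true coefficients
sigma <- 2                           # invented true error std. dev.
hat.beta1 <- replicate(10000, {
    y <- X %*% beta + rnorm(n, sd = sigma)
    (C %*% t(X) %*% y)[2]            # least squares estimate of beta_1
})
var(hat.beta1)                       # empirical variance
sigma^2 * C[2, 2]                    # theoretical variance
```

The two printed values should agree up to Monte Carlo error.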

Note that in the equations above, we index the rows and columns of \(C\) using \(i,j\in \{0, 1, \ldots, p\}\), i.e. the first row and the first column each carry the index 0. This is to match the convention for the components of \(\beta = (\beta_0, \beta_1, \ldots, \beta_p)\).

Lemma 5.1 The random vector \(\hat\beta\) and the random variable \(\hat\sigma^2\) are independent of each other.

Proof. We will show that \(\hat\beta\) can be written as a function of \(H\varepsilon\) and that \(\hat\sigma^2\) can be written as a function of \((I-H)\varepsilon\). The result then follows from Cochran’s theorem.

From lemma 4.1 we know that the least squares estimate \(\hat\beta\) can be written as \[\begin{equation*} \hat\beta = \beta + (X^\top X)^{-1} X^\top \varepsilon \end{equation*}\] and that \(H = X (X^\top X)^{-1} X^\top\). Thus we can write \(\hat\beta\) as \[\begin{align*} \hat\beta &= \beta + (X^\top X)^{-1} X^\top X \, (X^\top X)^{-1} X^\top \varepsilon\\ &= \beta + (X^\top X)^{-1} X^\top H \varepsilon, \end{align*}\] which is a function of \(H\varepsilon\).

Similar to the argument at the end of the previous section, we can write \(\hat\sigma^2\) as \[\begin{align*} (n - p - 1) \hat\sigma^2 &= (Y - \hat Y)^\top (Y - \hat Y) \\ &= (Y - H Y)^\top (Y - H Y) \\ &= Y^\top (I - H)^\top (I - H) Y \\ &= \bigl\| (I - H) Y \bigr\|^2. \end{align*}\] Since \(Y = X\beta + \varepsilon\) and since we know \((I-H)X = 0\) from equation (4.4), we find \[\begin{align*} \hat\sigma^2 &= \frac{1}{n-p-1} \bigl\| (I - H) (X\beta + \varepsilon) \bigr\|^2 \\ &= \frac{1}{n-p-1} \bigl\| (I - H) \varepsilon \bigr\|^2, \end{align*}\] which is a function of \((I - H)\varepsilon\).

From Cochran’s theorem we know that \(H \varepsilon\) and \((I-H)\varepsilon\) are independent and thus we can conclude that \(\hat\beta\) and \(\hat\sigma^2\) are also independent of each other. This completes the proof.
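Lemma 5.1 can also be illustrated in a simulation. The following sketch (again with an invented model, for illustration only) records \(\hat\beta_1\) and \(\hat\sigma^2\) over many simulated data sets and checks that their sample correlation is close to zero. (Uncorrelatedness is of course only a necessary consequence of independence, not a proof of it.)

```r
# Simulate repeatedly and record both hat.beta_1 and hat.sigma^2;
# by lemma 5.1 the two quantities should be uncorrelated.
set.seed(2)
n <- 30
p <- 1
x <- runif(n)
X <- cbind(1, x)
C <- solve(t(X) %*% X)
beta <- c(1, 3)                      # invented true coefficients
sims <- replicate(10000, {
    y <- X %*% beta + rnorm(n)
    hat.beta <- C %*% t(X) %*% y
    hat.sigma2 <- sum((y - X %*% hat.beta)^2) / (n - p - 1)
    c(hat.beta[2], hat.sigma2)
})
cor(sims[1, ], sims[2, ])            # should be close to 0
```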

We now construct a quantity \(T\) which measures the distance between the estimated value \(\hat\beta_i\) and the unknown true value \(\beta_i\): \[\begin{equation} T := \frac{\hat\beta_i - \beta_i}{\sqrt{\hat\sigma^2 C_{ii}}}. \tag{5.1} \end{equation}\] While there are many ways to measure this distance, the \(T\) constructed here has two main advantages:

  • The value of \(T\) can be computed from the data together with the candidate value \(\beta_i\), without reference to the unknown variance \(\sigma^2\).

  • Below, we will be able to find the distribution of \(T\). This will allow us to use \(T\) to construct confidence intervals and statistical tests.

Lemma 5.2 Assume that the data follows the model (4.1). Then \(T \sim t(n-p-1)\), i.e. \(T\) follows a \(t\)-distribution with \(n-p-1\) degrees of freedom (see appendix B.3).

Proof. We have \[\begin{align*} T &= \frac{(\hat\beta_i - \beta_i) / \sqrt{C_{ii}}} {\sqrt{\hat\sigma^2}} \\ &= \frac{(\hat\beta_i - \beta_i) / \sqrt{\sigma^2 C_{ii}}} {\sqrt{\hat\sigma^2 / \sigma^2}} \\ &= \frac{(\hat\beta_i - \beta_i) / \sqrt{\sigma^2 C_{ii}}} {\sqrt{((n - p - 1) \hat\sigma^2 / \sigma^2) / (n - p -1)}} \\ &=: \frac{Z}{\sqrt{Y / (n - p - 1)}}, \end{align*}\] where \(Z = (\hat\beta_i - \beta_i) / \sqrt{\sigma^2 C_{ii}} \sim \mathcal{N}(0,1)\) and \(Y = (n - p - 1) \hat\sigma^2/\sigma^2 \sim \chi^2(n-p-1)\) are independent, by lemma 5.1. Thus, \(T \sim t(n-p-1)\) as required.

The quantity \(\sqrt{\hat\sigma^2 C_{ii}}\) is sometimes called the standard error of the estimator \(\hat\beta_i\), denoted by \(\mathop{\mathrm{se}}(\hat\beta_i)\).

5.2 Confidence Intervals

Using the scaled distance \(T\), it is easy to construct a confidence interval for \(\hat\beta_i\): For \(\alpha \in (0, 1)\), say \(\alpha = 5\%\), lemma 5.2 shows that \[\begin{equation*} P\Bigl( T \in \bigl[-t_{n-p-1}(\alpha/2), +t_{n-p-1}(\alpha/2)\bigr] \Bigr) = 1 - \alpha, \end{equation*}\] where \(t_{n-p-1}(\alpha/2)\) is the \((1 - \alpha/2)\)-quantile of the \(t(n-p-1)\)-distribution. Rewriting this expression as a condition on \(\hat\beta_i\) instead of on \(T\) gives a confidence interval for \(\beta_i\).

Lemma 5.3 The interval \[\begin{equation*} [U, V] := \Bigl[ \hat\beta_i - \sqrt{\hat\sigma^2 C_{ii}}t_{n-p-1}(\alpha/2), \hat\beta_i + \sqrt{\hat\sigma^2 C_{ii}}t_{n-p-1}(\alpha/2) \Bigr] \end{equation*}\] is a \((1-\alpha)\)-confidence interval for \(\beta_i\).

Proof. We have to show that \(P(\beta_i \in [U, V]) \geq 1-\alpha\). We have \[\begin{align*} &\hskip-5mm \beta_i \in [U, V] \\ &\Longleftrightarrow \bigl| \hat\beta_i - \beta_i \bigr| \leq \sqrt{\hat\sigma^2 C_{ii}}t_{n-p-1}(\alpha/2) \\ &\Longleftrightarrow \Bigl| \frac{\hat\beta_i - \beta_i}{\sqrt{\hat\sigma^2 C_{ii}}} \Bigr| \leq t_{n-p-1}(\alpha/2) \\ &\Longleftrightarrow T \in \bigl[-t_{n-p-1}(\alpha/2), +t_{n-p-1}(\alpha/2)\bigr] \end{align*}\] and thus \(P(\beta_i \in [U, V]) = 1 - \alpha\). This completes the proof.
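The coverage probability from lemma 5.3 can be verified numerically. The following sketch (with an invented model, for illustration only) repeatedly generates data sets, computes the interval \([U, V]\) for \(\beta_1\), and counts how often the true value falls inside:

```r
# Monte Carlo estimate of the coverage probability P(beta_1 in [U, V]).
set.seed(3)
n <- 20
p <- 1
alpha <- 0.05
x <- runif(n)
X <- cbind(1, x)
C <- solve(t(X) %*% X)
beta <- c(1, 3)                      # invented true coefficients
t.crit <- qt(1 - alpha / 2, n - p - 1)
covered <- replicate(10000, {
    y <- X %*% beta + rnorm(n)
    hat.beta <- C %*% t(X) %*% y
    hat.sigma2 <- sum((y - X %*% hat.beta)^2) / (n - p - 1)
    half <- sqrt(hat.sigma2 * C[2, 2]) * t.crit   # half-width of [U, V]
    abs(hat.beta[2] - beta[2]) <= half
})
mean(covered)                        # should be close to 1 - alpha = 0.95
```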

5.3 Hypothesis Tests

Very similar to the argument for confidence intervals, we can derive a hypothesis test to test the hypothesis \[\begin{equation*} H_0\colon \beta_i = b \end{equation*}\] against the alternative \[\begin{equation*} H_1\colon \beta_i \neq b. \end{equation*}\]

Here we redefine \(T\) as \[\begin{equation} T := \frac{\hat\beta_i - b}{\sqrt{\hat\sigma^2 C_{ii}}}, \tag{5.2} \end{equation}\] using \(b\) in place of the \(\beta_i\) above. When \(H_0\) is true, this new definition of \(T\) is the same as (5.1).

Lemma 5.4 The test which rejects \(H_0\) if and only if \(|T| > t_{n-p-1}(\alpha/2)\) has significance level \(\alpha\).

Proof. We have to show that the probability of type I errors (i.e. of wrongly rejecting \(H_0\) when it is true) is less than or equal to \(\alpha\). Assume that \(H_0\) is true. Then we have \(\beta_i = b\) and thus the \(T\) defined in this section coincides with the expression from equation (5.1). From lemma 5.2 we know that \(T \sim t(n-p-1)\). Thus we have \[\begin{align*} P( \mbox{type I error} ) &= P\bigl( |T| > t_{n-p-1}(\alpha/2) \bigr) \\ &= P\bigl(T < -t_{n-p-1}(\alpha/2) \bigr) + P\bigl(T > t_{n-p-1}(\alpha/2) \bigr) \\ &= 2 P\bigl(T > t_{n-p-1}(\alpha/2) \bigr) \\ &= 2 \frac{\alpha}{2} \\ &= \alpha. \end{align*}\] This completes the proof.

If we use \(b = 0\) in the test, we can test whether \(\beta_i = 0\). If \(\beta_i = 0\) is true, the corresponding input \(x_i\) has no influence on the output.

As usual with statistical tests, one needs to be extremely careful when performing several tests on the same data. In particular, it would be unwise to test more than one component of \(\beta\) using this procedure for the same data. Instead, in the next section we will consider how to perform tests for several components of \(\beta\) simultaneously. Before we do this, we will perform some experiments with R.

5.4 R Experiments

5.4.1 Fitting the model

m <- lm(stack.loss ~ ., data = stackloss)
summary(m)

Call:
lm(formula = stack.loss ~ ., data = stackloss)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.2377 -1.7117 -0.4551  2.3614  5.6978 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
Water.Temp    1.2953     0.3680   3.520  0.00263 ** 
Acid.Conc.   -0.1521     0.1563  -0.973  0.34405    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared:  0.9136,    Adjusted R-squared:  0.8983 
F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09

5.4.2 Estimating the Variance of the Error

We can get the design matrix \(X\) and the matrix \(C = (X^\top X)^{-1}\) as follows:

X <- model.matrix(m)
C <- solve(t(X) %*% X)
round(C, 4)
            (Intercept) Air.Flow Water.Temp Acid.Conc.
(Intercept)     13.4527   0.0273    -0.0620    -0.1594
Air.Flow         0.0273   0.0017    -0.0035    -0.0007
Water.Temp      -0.0620  -0.0035     0.0129     0.0000
Acid.Conc.      -0.1594  -0.0007     0.0000     0.0023

Next we need to estimate the variance \(\sigma^2\):

y <- stackloss$stack.loss
n <- nrow(stackloss)
p <- ncol(stackloss) - 1
hat.sigma2 <- sum((y - fitted(m))^2) / (n - p - 1)
hat.sigma2
[1] 10.51941

The square root of this number gives the estimated standard deviation of the \(\varepsilon_i\). We check that we get the same result as R:

sqrt(hat.sigma2)
[1] 3.243364

This result is also listed as the Residual standard error near the bottom of the summary(m) output, above.
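As a further cross-check, the residual standard error can be obtained directly with the accessor function sigma(); squaring it should reproduce hat.sigma2 (the model is refitted here so that the snippet is self-contained):

```r
# Cross-check: sigma() returns the residual standard error of a fitted model.
m <- lm(stack.loss ~ ., data = stackloss)   # the model from above
sigma(m)                                     # residual standard error
sigma(m)^2                                   # agrees with hat.sigma2 above
```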

5.4.3 Estimating the Standard Errors

We can compute the standard errors, i.e. the standard deviations \(\mathop{\mathrm{stdev}}(\hat\beta_i)\), as \(\sqrt{\hat\sigma^2 C_{ii}}\):

se <- sqrt(hat.sigma2 * diag(C))
se
(Intercept)    Air.Flow  Water.Temp  Acid.Conc. 
 11.8959969   0.1348582   0.3680243   0.1562940 

These values are also listed in the Std. Error column of the summary(m) output.
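Instead of computing \(C\) by hand, we could also read the estimated covariance matrix \(\hat\sigma^2 C\) of \(\hat\beta\) directly from vcov() and take the square roots of its diagonal (refitting the model so that the snippet is self-contained):

```r
# Cross-check: vcov(m) equals hat.sigma2 * C, so the square roots of its
# diagonal are the standard errors from the summary output.
m <- lm(stack.loss ~ ., data = stackloss)
se2 <- sqrt(diag(vcov(m)))
se2
```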

5.4.4 Hypothesis tests

Let us now test the hypothesis \(H_0\colon \beta_i = 0\). The test statistic for this case is the following:

T <- coef(m) / se
T
(Intercept)    Air.Flow  Water.Temp  Acid.Conc. 
 -3.3557234   5.3066130   3.5195672  -0.9733098 

These values are also listed in the t value column of the summary(m) output.
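The Pr(>|t|) column of the summary contains the corresponding \(p\)-values, which we can reproduce from \(T\) as \(2 P\bigl(t(n-p-1) > |T|\bigr)\) (everything is recomputed here so that the snippet is self-contained):

```r
# Reproduce the p-values from the Pr(>|t|) column of summary(m).
m <- lm(stack.loss ~ ., data = stackloss)
n <- nrow(stackloss)
p <- ncol(stackloss) - 1
se <- sqrt(diag(vcov(m)))
T <- coef(m) / se
p.values <- 2 * pt(-abs(T), n - p - 1)   # two-sided p-values
p.values
```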

Before we can perform the test, we need to choose \(\alpha\) and to find the corresponding critical value:

alpha <- 0.05
t <- qt(1 - alpha/2 , n - p - 1)
t
[1] 2.109816

Using the critical value \(t\) we can decide whether \(H_0\) should be accepted or rejected. For example, looking at the intercept \(\beta_0\), we find \(|T_0| = 3.3557234 > 2.109816 = t_{n-p-1}(\alpha/2)\) and thus we can reject the hypothesis \(H_0\colon \beta_0 = 0\). This means that the intercept is significantly different from 0.

5.4.5 Confidence Intervals

Using the quantile t we can also get confidence intervals. Here we only show the confidence interval for the intercept \(\beta_0\):

c(coef(m)[1] - se[1] * t, coef(m)[1] + se[1] * t)
(Intercept) (Intercept) 
  -65.01803   -14.82131 

Confidence intervals for the remaining coefficients can be obtained similarly.
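Alternatively, R's built-in confint() function computes these intervals for all coefficients at once; its row for the intercept should match the interval computed by hand above:

```r
# Cross-check: confint() gives one (1 - alpha)-confidence interval per row.
m <- lm(stack.loss ~ ., data = stackloss)
ci <- confint(m, level = 0.95)
ci
```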

Summary

  • We know how to scale the distance between individual parameter estimates and the truth.
  • We have seen how to construct confidence intervals for \(\beta_i\).
  • We have seen how to construct statistical tests for \(\beta_i\).
  • We have understood some more of the summary output for the lm() function in R.