Section 7 Examples

To illustrate the results of the last two sections, we consider a series of examples.

7.1 Simple Confidence Interval

In this subsection we illustrate how to compute a simple confidence interval for a single regression coefficient. We use a dataset about the toxicity of chemicals towards aquatic life, from the UCI Machine Learning Repository. The data is available at https://archive.ics.uci.edu/ml/datasets/QSAR+aquatic+toxicity.

There are nine different variables. (The web page explains the interpretation of these variables.) We will fit a model which describes LC50 as a function of the remaining eight variables:

# data from https://archive.ics.uci.edu/ml/datasets/QSAR+aquatic+toxicity
qsar <- read.csv("data/qsar_aquatic_toxicity.csv", sep = ";", header = FALSE)
names(qsar) <- c(
    "TPSA",
    "SAacc",
    "H050",
    "MLOGP",
    "RDCHI",
    "GATS1p",
    "nN",
    "C040",
    "LC50"
)
m <- lm(LC50 ~ ., data = qsar)
summary(m)

Call:
lm(formula = LC50 ~ ., data = qsar)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4934 -0.7579 -0.1120  0.5829  4.9778 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.698887   0.244554  11.036  < 2e-16 ***
TPSA         0.027201   0.002661  10.220  < 2e-16 ***
SAacc       -0.015081   0.002091  -7.211 1.90e-12 ***
H050         0.040619   0.059787   0.679 0.497186    
MLOGP        0.446108   0.063296   7.048 5.60e-12 ***
RDCHI        0.513928   0.135565   3.791 0.000167 ***
GATS1p      -0.571313   0.153882  -3.713 0.000227 ***
nN          -0.224751   0.048301  -4.653 4.12e-06 ***
C040         0.003194   0.077972   0.041 0.967340    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.203 on 537 degrees of freedom
Multiple R-squared:  0.4861,    Adjusted R-squared:  0.4785 
F-statistic:  63.5 on 8 and 537 DF,  p-value: < 2.2e-16

Our aim is to find a 95% confidence interval for the regression coefficient of GATS1p.

7.1.1 From First Principles

To compute the confidence interval from first principles, we can use Lemma 5.3. The centre of the confidence interval is given by \(\hat\beta_i\):

mid <- coef(m)[7]
mid
    GATS1p 
-0.5713131 

To find the half-width of the interval, we first need to construct the matrix \(C = (X^\top X)^{-1}\) and the estimated residual variance \(\hat\sigma^2\). We will also need the corresponding quantile of the \(t\)-distribution.

X <- model.matrix(m)
y <- qsar$LC50
n <- nrow(X)
p <- ncol(X) - 1

Cii <- solve(t(X) %*% X)[7, 7]
cat("Cii =", Cii, "\n")
Cii = 0.01637454 
sigma.hat.squared <- sum((fitted(m) - y)^2) / (n - p - 1)
cat("hat.sigma.square =", sigma.hat.squared, "\n")
hat.sigma.square = 1.446134 
alpha <- 0.05
t <- qt(1 - alpha/2, n - p - 1)
cat("t =", t, "\n")
t = 1.964391 

With this information in place, we can easily find the confidence interval:

d <- sqrt(Cii * sigma.hat.squared) * t
cat("[", mid - d, ", ", mid + d, "]\n", sep = "")
[-0.8735983, -0.2690279]

7.1.2 From the lm() Output

We can obtain the same result using only the quantile \(t\) and the numbers shown in the summary(m) output. We focus on the row corresponding to GATS1p in the output shown above:

GATS1p      -0.571313   0.153882  -3.713 0.000227 ***

The first number in this row is the centre of the confidence interval; the second number corresponds to \(\sqrt{\hat\sigma^2 C_{ii}}\):

alpha <- 0.05
t <- qt(1 - alpha/2, n - p - 1)
cat("[", -0.571313 - 0.153882*t, ", ", -0.571313 + 0.153882*t, "]\n", sep = "")
[-0.8735975, -0.2690285]

This is the same result as above.
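
As a cross-check, base R's confint() function computes the same interval directly, using the same \(t\)-based formula internally:

confint(m, "GATS1p", level = 0.95)

This reproduces the interval found from first principles above.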

7.2 Confidence Intervals for the Mean

So far we have only seen how to compute confidence intervals for the regression coefficients \(\beta_i\). Using the same techniques we can also compute confidence intervals for the model mean corresponding to an input \((\tilde x_1, \ldots, \tilde x_p)\). If we write \[\begin{equation*} \tilde x = (1, \tilde x_1, \ldots, \tilde x_p), \end{equation*}\] then the model mean corresponding to this input is \[\begin{equation*} \tilde y = \beta_0 + \tilde x_1 \beta_1 + \cdots + \tilde x_p \beta_p = \tilde x^\top \beta. \end{equation*}\] To derive a confidence interval for this quantity, we follow the same steps as we did when we derived a confidence interval for \(\beta_i\).

Our estimator for \(\tilde y\) is \[\begin{equation*} \hat y = \tilde x^\top \hat\beta \sim \mathcal{N}\bigl( \tilde x^\top \beta, \sigma^2 \tilde x^\top (X^\top X)^{-1} \tilde x \bigr). \end{equation*}\] Thus, \[\begin{align*} \tilde T &:= \frac{\tilde x^\top \hat\beta - \tilde x^\top \beta} {\sqrt{\hat\sigma^2 \tilde x^\top (X^\top X)^{-1} \tilde x}} \\ &= \frac{\frac{1}{\sqrt{\sigma^2\tilde x^\top (X^\top X)^{-1} \tilde x}}\bigl(\tilde x^\top \hat\beta - \tilde x^\top \beta\bigr)} {\sqrt{\frac{1}{\sigma^2}\hat\sigma^2}} \end{align*}\] is \(t(n-p-1)\)-distributed, since the numerator is \(\mathcal{N}(0,1)\), the denominator is \(\sqrt{\chi^2(n-p-1) / (n-p-1)}\) and \(\hat\beta\) is independent of \(\hat\sigma^2\). Let \(t_{n-p-1}(\alpha/2)\) be the \((1 - \alpha/2)\)-quantile of the \(t(n-p-1)\)-distribution. Then we can solve the inequality \(|\tilde T| \leq t_{n-p-1}(\alpha/2)\) for \(\tilde x^\top \beta\) to find the required confidence interval: \[\begin{equation*} \Bigl[ \tilde x^\top \hat\beta - \sqrt{\hat\sigma^2 \tilde x^\top (X^\top X)^{-1} \tilde x} \, t_{n-p-1}(\alpha/2), \tilde x^\top \hat\beta + \sqrt{\hat\sigma^2 \tilde x^\top (X^\top X)^{-1} \tilde x} \, t_{n-p-1}(\alpha/2) \Bigr]. \end{equation*}\]

We illustrate this using a numerical example. For the stackloss dataset with input values Air.Flow = 60, Water.Temp = 21 and Acid.Conc. = 87, we get the following result:

m <- lm(stack.loss ~ ., data = stackloss)
X <- model.matrix(m)
n <- nrow(X)
p <- ncol(X) - 1

tilde.x <- c(1, 60, 21, 87)
hat.beta <- coef(m)

c <- t(tilde.x) %*% solve(t(X) %*% X, tilde.x)
sigma.hat.squared <- sum((fitted(m) - stackloss$stack.loss)^2) / (n - p - 1)
t <- qt(0.975, n - p - 1)

mid <- t(tilde.x) %*% hat.beta
d <- sqrt(c * sigma.hat.squared) * t
cat("[", mid - d, ", ", mid + d, "]\n", sep = "")
[15.46463, 18.50554]
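
The same interval can also be obtained with base R's predict() function, by passing the input values as a data frame and requesting a confidence interval:

new.x <- data.frame(Air.Flow = 60, Water.Temp = 21, Acid.Conc. = 87)
predict(m, newdata = new.x, interval = "confidence", level = 0.95)

This returns the fitted value together with the same lower and upper limits as computed above.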

7.3 Testing a Single Coefficient

Continuing with the dataset from the previous example, we can test whether the regression coefficient corresponding to the water temperature might equal zero. We perform the test at the \(5\%\) level.

7.3.1 From First Principles

We can compute the test statistic manually, using the formula from equation (5.2) with \(b = 0\):

T <- hat.beta[3] / sqrt(sigma.hat.squared * solve(t(X) %*% X)[3, 3])
T
Water.Temp 
  3.519567 
t.crit <- qt(0.975, n - p - 1)
abs(T) > t.crit
Water.Temp 
      TRUE 

Since the test statistic is larger than the critical value, we reject the hypothesis that \(\beta_2 = 0\).

7.3.2 Using the lm() Output, I

We can also find the test statistic in the lm() output:

summary(m)

Call:
lm(formula = stack.loss ~ ., data = stackloss)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.2377 -1.7117 -0.4551  2.3614  5.6978 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
Water.Temp    1.2953     0.3680   3.520  0.00263 ** 
Acid.Conc.   -0.1521     0.1563  -0.973  0.34405    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared:  0.9136,    Adjusted R-squared:  0.8983 
F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09

The column t value shows the value of the test statistic \(T\) for the hypothesis that \(\beta_i = 0\). For \(\beta_2\) we find the value \(3.520\), which is a rounded version of the \(3.519567\) we found above.

Again, we can use the command qt(0.975, 17) to get the critical value, where we can find the degrees of freedom in the Residual standard error row near the bottom. Alternatively, we can use the so-called “p-value” listed in the Pr(>|t|) column. This value is the smallest significance level at which \(H_0\) is still rejected. Here we find \(0.00263\), and since we have \(\alpha = 0.05 > 0.00263\), we can see that \(H_0\) is rejected.
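
To see where this number comes from, we can recompute the p-value from the test statistic: for a two-sided test it is twice the tail probability of the \(t(17)\)-distribution beyond \(|T|\):

T <- 3.519567        # test statistic found above
2 * pt(-abs(T), 17)  # two-sided p-value, approximately 0.00263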

If we wanted to test \(\beta_2 = 1\) instead, we could use the value \(\hat\beta_2 = 1.2953\) from the Estimate column and \(\sqrt{\hat\sigma^2 C_{ii}} = 0.3680\) from the Std. Error column, to get \(T = (1.2953 - 1) / 0.3680\) and then we would accept or reject the hypothesis \(\beta_2 = 1\) by comparing the result to the critical value qt(0.975, 17).
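
As a quick sketch of this calculation in R, using the rounded numbers from the output:

T <- (1.2953 - 1) / 0.3680  # test statistic for H0: beta_2 = 1
abs(T) > qt(0.975, 17)      # FALSE, so H0 is not rejected

Since \(|T| \approx 0.802\) is smaller than the critical value \(\approx 2.110\), the hypothesis \(\beta_2 = 1\) is not rejected at the \(5\%\) level.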

7.3.3 Using the lm() Output, II

For the hypotheses \(\beta_i = 0\), we can also use the stars (or sometimes dots) shown in the right margin to perform the test. Since the Water.Temp row ends in **, we would reject \(\beta_2 = 0\) even at the stricter \(\alpha=0.01\) level, and thus we would also reject this hypothesis at \(\alpha=0.05\) level.

7.4 Testing Multiple Coefficients

This section illustrates how a group of coefficients can be tested simultaneously. We consider a small, synthetic dataset:

# data from https://teaching.seehuhn.de/data/small/small.csv
small <- read.csv("data/small.csv")
m <- lm(y ~ ., data = small)
summary(m)

Call:
lm(formula = y ~ ., data = small)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.64664 -0.09369 -0.03420  0.07768  0.63353 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.53329    0.12360  44.768 6.92e-12 ***
x1           0.18903    0.12059   1.568   0.1514    
x2          -0.30731    0.18011  -1.706   0.1221    
x3          -0.12382    0.05639  -2.196   0.0557 .  
x4          -0.11731    0.12619  -0.930   0.3768    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3822 on 9 degrees of freedom
Multiple R-squared:  0.6404,    Adjusted R-squared:  0.4806 
F-statistic: 4.007 on 4 and 9 DF,  p-value: 0.03891

From the R output it is clear that the intercept is non-zero, but none of the remaining regression coefficients are significantly different from zero when testing at the \(5\%\) level. This raises the question of whether the inputs have any effect on the output at all. To answer this question, we test the hypothesis \[\begin{equation*} H_0\colon \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0, \end{equation*}\] omitting only \(\beta_0\) from the hypothesis. We use the method from section 6.3 to perform this test.

7.4.1 From First Principles

The matrix \(K\) which selects the regression coefficients is \[\begin{equation*} K = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix} \in \mathbb{R}^{4\times 5}. \end{equation*}\] The test statistic is \[\begin{equation*} F = \frac{\bigl\| K \hat\beta - 0 \bigr\|_{K(X^\top X)^{-1} K^\top}^2} {4 \hat\sigma^2}, \end{equation*}\] where \(K \hat\beta = (\hat\beta_1, \ldots, \hat\beta_4)\) and \(K(X^\top X)^{-1} K^\top\) is obtained by removing the first row and column from \((X^\top X)^{-1}\):

b <- coef(m)[2:5]
X <- model.matrix(m)
C <- solve(t(X) %*% X)[2:5, 2:5]
F <- t(b) %*% solve(C, b) / (4 * summary(m)$sigma^2)
F
         [,1]
[1,] 4.007393

From Lemma 6.3 we know that under \(H_0\) the test statistic is \(F(k, n-p-1)\)-distributed, and the critical value is the \(1-\alpha\) quantile of this distribution:

n <- nrow(X)      # 14
p <- ncol(X) - 1  # 4
k <- 4
f <- qf(0.95, k, n - p - 1)
f
[1] 3.633089

Since the test statistic is larger than the critical value, we reject \(H_0\) and conclude, at the \(5\%\) significance level, that not all of \(\beta_1, \ldots, \beta_4\) are zero.
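
Alternatively, the same test can be performed in R by comparing the full model to the model containing only the intercept, using the anova() function; the resulting table contains the same \(F\)-statistic:

m0 <- lm(y ~ 1, data = small)  # intercept-only model
anova(m0, m)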

7.4.2 Using the lm() output

The information required to perform this \(F\) test is contained in the last row of the summary(m) output:

F-statistic: 4.007 on 4 and 9 DF,  p-value: 0.03891

Here we can see that the test statistic is \(4.007\) (a rounded version of the value we found above) and the degrees of freedom for the \(F\)-test are \(k = 4\) and \(n - p - 1 = 9\). The “p-value” listed is the smallest significance level at which the test still rejects \(H_0\). We see that \(H_0\) is rejected at significance level \(0.03891\), and thus also at \(\alpha=0.05\).
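
These numbers can also be extracted programmatically: the summary object stores the \(F\)-statistic together with its degrees of freedom, and the p-value can be recomputed using pf():

fstat <- summary(m)$fstatistic                        # value, numdf, dendf
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)  # approximately 0.03891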

Summary

  • We have illustrated the results of sections 5 and 6 with the help of examples.
  • We have learned how to use the lm() output to compute confidence intervals and to perform hypothesis tests.