7 Examples
To illustrate the results of the last two sections, we consider a series of examples.
7.1 Simple Confidence Interval
In this subsection we illustrate how to compute a simple confidence interval for a single regression coefficient. We use a dataset about the toxicity of chemicals towards aquatic life, from the UCI Machine Learning Repository. The data is available at https://archive.ics.uci.edu/ml/datasets/QSAR+aquatic+toxicity.
There are nine different variables. (The web page explains the interpretation of these variables.) We will fit a model which describes LC50 as a function of the remaining variables:
# data from https://archive.ics.uci.edu/ml/datasets/QSAR+aquatic+toxicity
qsar <- read.csv("data/qsar_aquatic_toxicity.csv", sep = ";", header = FALSE)
names(qsar) <- c(
"TPSA",
"SAacc",
"H050",
"MLOGP",
"RDCHI",
"GATS1p",
"nN",
"C040",
"LC50"
)
m <- lm(LC50 ~ ., data = qsar)
summary(m)
Call:
lm(formula = LC50 ~ ., data = qsar)
Residuals:
Min 1Q Median 3Q Max
-4.4934 -0.7579 -0.1120 0.5829 4.9778
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.698887 0.244554 11.036 < 2e-16 ***
TPSA 0.027201 0.002661 10.220 < 2e-16 ***
SAacc -0.015081 0.002091 -7.211 1.90e-12 ***
H050 0.040619 0.059787 0.679 0.497186
MLOGP 0.446108 0.063296 7.048 5.60e-12 ***
RDCHI 0.513928 0.135565 3.791 0.000167 ***
GATS1p -0.571313 0.153882 -3.713 0.000227 ***
nN -0.224751 0.048301 -4.653 4.12e-06 ***
C040 0.003194 0.077972 0.041 0.967340
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.203 on 537 degrees of freedom
Multiple R-squared: 0.4861, Adjusted R-squared: 0.4785
F-statistic: 63.5 on 8 and 537 DF, p-value: < 2.2e-16
Our aim is to find a 95% confidence interval for the coefficient of GATS1p.
7.1.1 From First Principles
To compute the confidence interval from first principles, we can use Lemma 5.3. The centre of the confidence interval is given by \(\hat\beta_i\):
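In R, this value can be read off the fitted model; a minimal sketch, assuming the model m fitted above:

```r
# extract the estimated coefficient for GATS1p by name
coef(m)["GATS1p"]
```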
GATS1p
-0.5713131
To find the half-width of the interval, we need to first construct the covariance matrix \(C\) and the estimated residual variance \(\hat\sigma^2\). We will also need the corresponding quantile of the \(t\)-distribution.
X <- model.matrix(m)
y <- qsar$LC50
n <- nrow(X)
p <- ncol(X) - 1
Cii <- solve(t(X) %*% X)[7, 7]  # row/column 7 corresponds to GATS1p
cat("Cii =", Cii, "\n")
Cii = 0.01637454
sigma.hat.squared <- sum((fitted(m) - y)^2) / (n - p - 1)
cat("hat.sigma.square =", sigma.hat.squared, "\n")
hat.sigma.square = 1.446134
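The quantile \(t\) shown next can be computed with qt(); a sketch, using \(\alpha = 0.05\) and the n and p defined above:

```r
alpha <- 0.05
t <- qt(1 - alpha/2, n - p - 1)  # here n - p - 1 = 537
cat("t =", t, "\n")
```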
t = 1.964391
With this information in place, we can easily find the confidence interval:
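One way to compute this, reusing Cii, sigma.hat.squared and t from above (the variable names here are ours):

```r
hat.beta.i <- coef(m)["GATS1p"]                 # centre of the interval
half.width <- sqrt(sigma.hat.squared * Cii) * t  # half-width of the interval
cat("[", hat.beta.i - half.width, ", ", hat.beta.i + half.width, "]\n", sep = "")
```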
[-0.8735983, -0.2690279]
7.1.2 From the lm() Output
We can obtain the same result using only the quantile \(t\) and the numbers shown in the summary(m) output. We focus on the row corresponding to GATS1p in the output shown above:
GATS1p -0.571313 0.153882 -3.713 0.000227 ***
The first number in this row is the centre of the confidence interval, and the second number corresponds to \(\sqrt{\hat\sigma^2 C_{ii}}\):
alpha <- 0.05
t <- qt(1 - alpha/2, n - p - 1)
cat("[", -0.571313 - 0.153882*t, ", ", -0.571313 + 0.153882*t, "]\n", sep = "")
[-0.8735975, -0.2690285]
This is the same result as above.
7.2 Confidence Intervals for the Mean
So far we have only seen how to compute confidence intervals for the regression coefficients \(\beta_i\). Using the same techniques we can also compute confidence intervals for the model mean corresponding to an input \((\tilde x_1, \ldots, \tilde x_p)\). If we write \[\begin{equation*} \tilde x = (1, \tilde x_1, \ldots, \tilde x_p), \end{equation*}\] then the model mean corresponding to this input is \[\begin{equation*} \tilde y = \beta_0 + \tilde x_1 \beta_1 + \cdots + \tilde x_p \beta_p = \tilde x^\top \beta. \end{equation*}\] To derive a confidence interval for this quantity, we follow the same steps as we did when we derived a confidence interval for \(\beta_i\).
Our estimator for \(\tilde y\) is \[\begin{equation*} \hat y = \tilde x^\top \hat\beta \sim \mathcal{N}\bigl( \tilde x^\top \beta, \sigma^2 \tilde x^\top (X^\top X)^{-1} \tilde x \bigr). \end{equation*}\] Thus, \[\begin{align*} \tilde T &:= \frac{\tilde x^\top \hat\beta - \tilde x^\top \beta} {\sqrt{\hat\sigma^2 \tilde x^\top (X^\top X)^{-1} \tilde x}} \\ &= \frac{\frac{1}{\sqrt{\sigma^2\tilde x^\top (X^\top X)^{-1} \tilde x}}\bigl(\tilde x^\top \hat\beta - \tilde x^\top \beta\bigr)} {\sqrt{\frac{1}{\sigma^2}\hat\sigma^2}} \end{align*}\] is \(t(n-p-1)\)-distributed, since the numerator is \(\mathcal{N}(0,1)\), the denominator is \(\sqrt{\chi^2(n-p-1) / (n-p-1)}\) and \(\hat\beta\) is independent of \(\hat\sigma^2\). Let \(t_{n-p-1}(\alpha/2)\) be the \((1 - \alpha/2)\)-quantile of the \(t(n-p-1)\)-distribution. Then we can solve the inequality \(|\tilde T| \leq t_{n-p-1}(\alpha/2)\) for \(\tilde x^\top \beta\) to find the required confidence interval: \[\begin{equation*} \Bigl[ \tilde x^\top \hat\beta - \sqrt{\hat\sigma^2 \tilde x^\top (X^\top X)^{-1} \tilde x} \, t_{n-p-1}(\alpha/2), \tilde x^\top \hat\beta + \sqrt{\hat\sigma^2 \tilde x^\top (X^\top X)^{-1} \tilde x} \, t_{n-p-1}(\alpha/2) \Bigr]. \end{equation*}\]
We illustrate this using a numerical example. For the stackloss dataset with input values Air.Flow = 60, Water.Temp = 21 and Acid.Conc. = 87, we get the following results:
m <- lm(stack.loss ~ ., data = stackloss)
X <- model.matrix(m)
n <- nrow(X)
p <- ncol(X) - 1
tilde.x <- c(1, 60, 21, 87)
hat.beta <- coef(m)
c <- t(tilde.x) %*% solve(t(X) %*% X, tilde.x)  # tilde.x^T (X^T X)^{-1} tilde.x
sigma.hat.squared <- sum((fitted(m) - stackloss$stack.loss)^2) / (n - p - 1)
t <- qt(0.975, n - p - 1)
mid <- t(tilde.x) %*% hat.beta
d <- sqrt(c * sigma.hat.squared) * t
cat("[", mid - d, ", ", mid + d, "]\n", sep = "")
[15.46463, 18.50554]
7.3 Testing a Single Coefficient
Continuing with the dataset from the previous example, we can test whether the regression coefficient corresponding to the water temperature might equal zero. We will perform the test at the \(5\%\) level.
7.3.1 From First Principles
We can compute the test statistic manually, using the formula from equation (5.2) with \(b = 0\):
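A possible computation, refitting the stackloss model so that the block is self-contained (the variable names are ours):

```r
m <- lm(stack.loss ~ ., data = stackloss)
X <- model.matrix(m)
n <- nrow(X)
p <- ncol(X) - 1
sigma.hat.squared <- sum(residuals(m)^2) / (n - p - 1)
i <- 3  # Water.Temp is the third coefficient, counting the intercept
Cii <- solve(t(X) %*% X)[i, i]
T <- (coef(m)[i] - 0) / sqrt(sigma.hat.squared * Cii)  # test statistic for b = 0
T
T > qt(0.975, n - p - 1)  # compare to the critical value
```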
Water.Temp
3.519567
Water.Temp
TRUE
Since the test statistic is larger than the critical value, we reject the hypothesis that \(\beta_2 = 0\).
7.3.2 Using the lm() Output, I
We can also find the test statistic in the lm() output:
Call:
lm(formula = stack.loss ~ ., data = stackloss)
Residuals:
Min 1Q Median 3Q Max
-7.2377 -1.7117 -0.4551 2.3614 5.6978
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -39.9197 11.8960 -3.356 0.00375 **
Air.Flow 0.7156 0.1349 5.307 5.8e-05 ***
Water.Temp 1.2953 0.3680 3.520 0.00263 **
Acid.Conc. -0.1521 0.1563 -0.973 0.34405
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09
The column t value shows the value of the test statistic \(T\) for the hypothesis that \(\beta_i = 0\). For \(\beta_2\) we find the value \(3.520\), which is a rounded version of the \(3.519567\) we found above. Again, we can use the command qt(0.975, 17) to get the critical value, where we can find the degrees of freedom in the Residual standard error row near the bottom. Alternatively, we can use the so-called “p-value” listed in the Pr(>|t|) column. This value is the smallest significance level at which \(H_0\) is still rejected. Here we find \(0.00263\), and since \(\alpha = 0.05 > 0.00263\), we see that \(H_0\) is rejected.
If we wanted to test \(\beta_2 = 1\) instead, we could use the value \(\hat\beta_2 = 1.2953\) from the Estimate column and \(\sqrt{\hat\sigma^2 C_{ii}} = 0.3680\) from the Std. Error column, to get \(T = (1.2953 - 1) / 0.3680\). We would then accept or reject the hypothesis \(\beta_2 = 1\) by comparing the result to the critical value qt(0.975, 17).
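As a sketch, this test of \(\beta_2 = 1\) looks as follows (the numbers are copied from the summary output above):

```r
T <- (1.2953 - 1) / 0.3680  # test statistic for b = 1
abs(T) > qt(0.975, 17)      # FALSE: we cannot reject beta_2 = 1
```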
7.3.3 Using the lm() Output, II
For the hypotheses \(\beta_i = 0\), we can also use the stars (or sometimes dots) shown in the right margin to perform the test. Since the Water.Temp row ends in **, we would reject \(\beta_2 = 0\) even at the stricter \(\alpha = 0.01\) level, and thus we would also reject this hypothesis at the \(\alpha = 0.05\) level.
7.4 Testing Multiple Coefficients
This section illustrates how a group of coefficients can be tested simultaneously. We consider a small, synthetic dataset:
# data from https://teaching.seehuhn.de/data/small/small.csv
small <- read.csv("data/small.csv")
m <- lm(y ~ ., data = small)
summary(m)
Call:
lm(formula = y ~ ., data = small)
Residuals:
Min 1Q Median 3Q Max
-0.64664 -0.09369 -0.03420 0.07768 0.63353
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.53329 0.12360 44.768 6.92e-12 ***
x1 0.18903 0.12059 1.568 0.1514
x2 -0.30731 0.18011 -1.706 0.1221
x3 -0.12382 0.05639 -2.196 0.0557 .
x4 -0.11731 0.12619 -0.930 0.3768
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3822 on 9 degrees of freedom
Multiple R-squared: 0.6404, Adjusted R-squared: 0.4806
F-statistic: 4.007 on 4 and 9 DF, p-value: 0.03891
From the R output it is clear that the intercept is non-zero, but none of the remaining regression coefficients are significantly different from zero when testing at \(5\%\) level. This poses the question whether the inputs have any effect on the output at all. To answer this question, we test the hypothesis \[\begin{equation*} H_0\colon \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0, \end{equation*}\] omitting only \(\beta_0\) in the hypothesis. We use the method from section 6.3 to perform this test.
7.4.1 From First Principles
The matrix \(K\) which selects the regression coefficients is \[\begin{equation*} K = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix} \in \mathbb{R}^{4\times 5}. \end{equation*}\] The test statistic is \[\begin{equation*} F = \frac{\bigl\| K \hat\beta - 0 \bigr\|_{K(X^\top X)^{-1} K^\top}^2} {4 \hat\sigma^2}, \end{equation*}\] where \(K \hat\beta = (\hat\beta_1, \ldots, \hat\beta_4)\) and \(K(X^\top X)^{-1} K^\top\) is obtained by removing the first row and column from \((X^\top X)^{-1}\):
b <- coef(m)[2:5]
X <- model.matrix(m)
C <- solve(t(X) %*% X)[2:5, 2:5]
F <- t(b) %*% solve(C, b) / (4 * summary(m)$sigma^2)
F
[,1]
[1,] 4.007393
From Lemma 6.3 we know that under \(H_0\) the test statistic is \(F_{k, n-p-1}\)-distributed, and the critical value is the \(1-\alpha\) quantile of this distribution:
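The critical value shown below can be computed with qf(), using \(\alpha = 0.05\), \(k = 4\) and \(n - p - 1 = 9\):

```r
alpha <- 0.05
qf(1 - alpha, 4, 9)  # (1 - alpha) quantile of the F(4, 9) distribution
```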
[1] 3.633089
Since the test statistic is larger than the critical value, we can reject \(H_0\) and conclude at significance level \(5\%\) that not all regression coefficients can be zero.
7.4.2 Using the lm() Output
The information required to perform this \(F\)-test is contained in the last row of the summary(m) output:
F-statistic: 4.007 on 4 and 9 DF, p-value: 0.03891
Here we can see that the test statistic is \(4.007\) (a rounded version of the value we found above) and the degrees of freedom for the \(F\)-test are \(k = 4\) and \(n - p - 1 = 9\). The “p-value” listed is the smallest significance level at which the test still rejects \(H_0\). We see that \(H_0\) is rejected at significance level \(0.03891\), and thus also at \(\alpha = 0.05\).
Summary
- We have illustrated the results of sections 5 and 6 with the help of examples.
- We have learned how to use the lm() output to compute confidence intervals and to perform hypothesis tests.