Section 14 Examples
In this section we illustrate the results of the previous sections with the help of examples.
14.1 Use of Interaction Terms in Modelling
Example 14.1 Researchers in Food Science studied how big people’s mouths tend to be. It transpires that mouth volume is related to more easily measured quantities like height and weight. Here we have data which gives the mouth volume (in cubic centimetres), age, sex (female = 0, male = 1), height (in metres) and weight (in kilograms) for 61 volunteers. The first few rows of the data are shown below:
# data from https://teaching.seehuhn.de/2022/MATH3714/mouth-volume.txt
dd <- read.table("data/mouth-volume.txt", header = TRUE)
head(dd)
Mouth_Volume Age Sex Height Weight
1 56.659 29 1 1.730 70.455
2 47.938 24 0 1.632 60.000
3 60.995 45 0 1.727 102.273
4 68.917 23 0 1.641 78.636
5 60.956 27 0 1.600 57.273
6 76.204 23 1 1.746 66.818
Normally it is important to check that categorical variables are correctly recognised as factors when importing data into R. As an exception, the fact that Sex is stored as an integer variable (instead of a factor) does not matter here, since the default factor coding would be identical to the coding already used in the data.
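If we did want to store Sex as a factor explicitly, a command along the following lines could be used (the labels are our choice; we do not run this here, so that the later output keeps the numeric coding):
# Not run: convert Sex to a factor with descriptive labels
dd$Sex <- factor(dd$Sex, levels = c(0, 1), labels = c("female", "male"))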
Our aim is to fit a model which describes mouth volume in terms of the other variables. We start by fitting a basic model:
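The summary below can be produced with a call like the following (m0 is simply our name for the fitted model):
m0 <- lm(Mouth_Volume ~ ., data = dd)
summary(m0)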
Call:
lm(formula = Mouth_Volume ~ ., data = dd)
Residuals:
Min 1Q Median 3Q Max
-33.242 -10.721 -3.223 8.800 43.218
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.9882 45.2736 0.022 0.9827
Age 0.3617 0.2607 1.387 0.1709
Sex 5.5030 5.9385 0.927 0.3581
Height 15.8980 29.1952 0.545 0.5882
Weight 0.2640 0.1400 1.885 0.0646 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.04 on 56 degrees of freedom
Multiple R-squared: 0.2588, Adjusted R-squared: 0.2059
F-statistic: 4.888 on 4 and 56 DF, p-value: 0.001882
Since no coefficient is significantly different from zero (at the \(5\%\) level), and since the \(R^2\) value is not close to \(1\), the model fit seems poor. To improve it, we try including all pairwise interaction terms:
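In R the model with all pairwise interactions can be specified with the formula Mouth_Volume ~ .^2; a call such as the following (with m1 as our name for the fit) produces the output below:
m1 <- lm(Mouth_Volume ~ .^2, data = dd)
summary(m1)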
Call:
lm(formula = Mouth_Volume ~ .^2, data = dd)
Residuals:
Min 1Q Median 3Q Max
-35.528 -8.168 -1.995 6.704 44.136
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.875e+02 2.337e+02 1.230 0.225
Age -8.370e+00 6.352e+00 -1.318 0.194
Sex -9.318e+01 1.325e+02 -0.703 0.485
Height -1.540e+02 1.505e+02 -1.023 0.311
Weight 5.889e-01 2.713e+00 0.217 0.829
Age:Sex -9.023e-01 1.035e+00 -0.872 0.388
Age:Height 5.374e+00 4.062e+00 1.323 0.192
Age:Weight -1.754e-03 2.027e-02 -0.087 0.931
Sex:Height 5.989e+01 7.536e+01 0.795 0.431
Sex:Weight 3.110e-01 4.177e-01 0.744 0.460
Height:Weight -2.680e-01 1.755e+00 -0.153 0.879
Residual standard error: 15.22 on 50 degrees of freedom
Multiple R-squared: 0.3219, Adjusted R-squared: 0.1863
F-statistic: 2.374 on 10 and 50 DF, p-value: 0.0218
None of the interaction terms look useful when used in combination. We try to select a subset of variables to get a better fit:
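The output below comes from best-subset selection as implemented in the leaps package; a call along the following lines (fits is our name for the result) reproduces it:
library(leaps)
fits <- regsubsets(Mouth_Volume ~ .^2, data = dd, nvmax = 10)
summary(fits)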
Subset selection object
Call: regsubsets.formula(Mouth_Volume ~ .^2, data = dd, nvmax = 10)
10 Variables (and intercept)
Forced in Forced out
Age FALSE FALSE
Sex FALSE FALSE
Height FALSE FALSE
Weight FALSE FALSE
Age:Sex FALSE FALSE
Age:Height FALSE FALSE
Age:Weight FALSE FALSE
Sex:Height FALSE FALSE
Sex:Weight FALSE FALSE
Height:Weight FALSE FALSE
1 subsets of each size up to 10
Selection Algorithm: exhaustive
Age Sex Height Weight Age:Sex Age:Height Age:Weight Sex:Height
1 ( 1 ) " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " "*" " "
3 ( 1 ) " " "*" " " " " " " " " "*" "*"
4 ( 1 ) " " "*" " " " " " " " " "*" "*"
5 ( 1 ) "*" " " "*" " " "*" "*" " " " "
6 ( 1 ) "*" " " "*" " " "*" "*" " " "*"
7 ( 1 ) "*" "*" "*" " " "*" "*" " " "*"
8 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*"
9 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*"
10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
Sex:Weight Height:Weight
1 ( 1 ) " " "*"
2 ( 1 ) "*" " "
3 ( 1 ) " " " "
4 ( 1 ) "*" " "
5 ( 1 ) "*" " "
6 ( 1 ) "*" " "
7 ( 1 ) "*" " "
8 ( 1 ) "*" " "
9 ( 1 ) "*" "*"
10 ( 1 ) "*" "*"
The table shows the optimal model for every number of variables, for \(p\) ranging from \(1\) to \(10\). To choose the number of variables we can use the adjusted \(R^2\)-value:
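One way to find the model size with the largest adjusted \(R^2\), assuming the regsubsets fit is stored in fits as above (s and best are just our names):
s <- summary(fits)
best <- which.max(s$adjr2)  # model size with the largest adjusted R^2
s$which[best, ]             # which variables this model includes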
(Intercept) Age Sex Height Weight
TRUE TRUE FALSE TRUE FALSE
Age:Sex Age:Height Age:Weight Sex:Height Sex:Weight
TRUE TRUE FALSE FALSE TRUE
Height:Weight
FALSE
The optimal model consists of the five variables listed. For further examination, we fit the corresponding model using lm().
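A call along these lines reproduces the summary below (m2 is just a convenient name for the fit):
m2 <- lm(Mouth_Volume ~ Age + Height + Age:Sex + Age:Height + Sex:Weight,
         data = dd)
summary(m2)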
Call:
lm(formula = Mouth_Volume ~ Age + Height + Age:Sex + Age:Height +
Sex:Weight, data = dd)
Residuals:
Min 1Q Median 3Q Max
-36.590 -9.577 -1.545 7.728 42.779
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 265.6236 122.3362 2.171 0.0342 *
Age -8.4905 4.0559 -2.093 0.0409 *
Height -134.0835 72.8589 -1.840 0.0711 .
Age:Sex -0.8633 0.4944 -1.746 0.0864 .
Age:Height 5.3644 2.4047 2.231 0.0298 *
Sex:Weight 0.4059 0.1659 2.446 0.0177 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.65 on 55 degrees of freedom
Multiple R-squared: 0.3096, Adjusted R-squared: 0.2468
F-statistic: 4.932 on 5 and 55 DF, p-value: 0.0008437
This looks much better: most variables are now significantly different from zero, and the adjusted \(R^2\) value has (slightly) improved. Nevertheless, the structure of the model is hard to explain, and the fitted model is difficult to interpret. For comparison, we consider the optimal model with only \(p=2\) variables from the table above:
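This model can be fitted as follows (m3 is our name for it):
m3 <- lm(Mouth_Volume ~ Age:Weight + Sex:Weight, data = dd)
summary(m3)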
Call:
lm(formula = Mouth_Volume ~ Age:Weight + Sex:Weight, data = dd)
Residuals:
Min 1Q Median 3Q Max
-34.161 -10.206 -3.738 9.087 43.747
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.678187 4.788206 9.122 8.35e-13 ***
Age:Weight 0.005554 0.002372 2.342 0.0226 *
Weight:Sex 0.127360 0.048640 2.618 0.0113 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.72 on 58 degrees of freedom
Multiple R-squared: 0.2651, Adjusted R-squared: 0.2398
F-statistic: 10.46 on 2 and 58 DF, p-value: 0.000132
The adjusted \(R^2\)-value of this model is only slightly reduced, and all regression coefficients are significantly different from zero. The fitted model can be written as
\[\begin{equation*} \textrm{mouth volume} = 43.68 + \begin{cases} 0.0056 \, \mathrm{age} \times \mathrm{weight} & \mbox{for females} \\ (0.0056 \, \mathrm{age} + 0.1274) \, \mathrm{weight} & \mbox{for males.} \end{cases} \end{equation*}\]
This model is still not trivial to interpret, but it does show that mouth volume increases with weight and age, and that the dependency on weight is stronger for males.
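As a quick illustration of the fitted equation, consider a hypothetical volunteer aged 30 and weighing 70 kg (these values are made up, not taken from the data); using the rounded coefficients we get:
43.68 + 0.0056 * 30 * 70             # female: approximately 55.4 cm^3
43.68 + (0.0056 * 30 + 0.1274) * 70  # male: approximately 64.4 cm^3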
14.2 Alternative Factor Codings
The above coding of factors using indicator variables is the default procedure in R, but there are other choices of coding. It transpires that any mapping of the \(k\) levels to vectors which (together with the intercept, if present) span a \(k\)-dimensional space will work for a factor with \(k\) levels. The choice of coding does not affect the fitted values or the goodness of fit, but it does affect the estimates \(\hat\beta\), so the coding needs to be known before the parameters can be interpreted.
Different choices of factor coding are built into R. Here we illustrate some of these choices using a factor which represents weekdays:
Indicator variables. This is the default coding:
    Tue Wed Thu Fri Sat Sun
Mon   0   0   0   0   0   0
Tue   1   0   0   0   0   0
Wed   0   1   0   0   0   0
Thu   0   0   1   0   0   0
Fri   0   0   0   1   0   0
Sat   0   0   0   0   1   0
Sun   0   0   0   0   0   1
The seven rows correspond to the weekdays, the six columns show how each weekday is encoded. The regression coefficients can be interpreted as the change of intercept relative to the reference level (in this case Monday).
Deviation coding uses “sum to zero” contrasts:
    [,1] [,2] [,3] [,4] [,5] [,6]
Mon    1    0    0    0    0    0
Tue    0    1    0    0    0    0
Wed    0    0    1    0    0    0
Thu    0    0    0    1    0    0
Fri    0    0    0    0    1    0
Sat    0    0    0    0    0    1
Sun   -1   -1   -1   -1   -1   -1
This contrasts each level with the final level. For example, \(\hat\beta_1\) is the amount the intercept is increased for Mondays and decreased for Sundays. This coding can sometimes reduce collinearity in the design matrix.
Helmert coding contrasts the second level with the first, the third with the average of the first two, and so on:
    [,1] [,2] [,3] [,4] [,5] [,6]
Mon   -1   -1   -1   -1   -1   -1
Tue    1   -1   -1   -1   -1   -1
Wed    0    2   -1   -1   -1   -1
Thu    0    0    3   -1   -1   -1
Fri    0    0    0    4   -1   -1
Sat    0    0    0    0    5   -1
Sun    0    0    0    0    0    6
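The three coding matrices shown above can be generated with R’s built-in contrast functions, for example:
day.names <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
contr.treatment(day.names)  # indicator variables (the default)
contr.sum(day.names)        # deviation ("sum to zero") coding
contr.helmert(day.names)    # Helmert coding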
We can set the coding for each factor separately, using the optional contrasts.arg argument of model.matrix() (for lm(), the corresponding argument is called contrasts):
set.seed(20211127)
n <- 20
x1 <- runif(n)
day.names <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
x2 <- sample(day.names, size = n, replace = TRUE)
# Explicitly specify the levels, in case a day is missing
# from the sample and to fix the order of days:
x2 <- factor(x2, levels = day.names)
dd <- data.frame(x1, x2)
X1 <- model.matrix(~ ., data = dd)
head(X1)
(Intercept) x1 x2Tue x2Wed x2Thu x2Fri x2Sat x2Sun
1 1 0.41745367 1 0 0 0 0 0
2 1 0.88671893 0 0 1 0 0 0
3 1 0.41924327 0 0 1 0 0 0
4 1 0.31986826 0 0 0 0 0 1
5 1 0.27863787 0 0 0 0 0 0
6 1 0.03346104 0 1 0 0 0 0
[1] 10.37243
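The unlabelled number above appears to be the condition number of X1 (for example as computed by kappa(X1, exact = TRUE); the exact call used is not shown). A sketch of how the second design matrix, using deviation coding for x2, can be constructed (X2 is our name for it); the smaller value shown after it suggests the deviation coding here gives a better-conditioned design matrix:
# Same model matrix, but with "sum to zero" contrasts for x2
X2 <- model.matrix(~ ., data = dd,
                   contrasts.arg = list(x2 = "contr.sum"))
head(X2)
# One way to compute the condition number (only a guess at the call
# that produced the numbers shown here):
kappa(X2, exact = TRUE)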
(Intercept) x1 x21 x22 x23 x24 x25 x26
1 1 0.41745367 0 1 0 0 0 0
2 1 0.88671893 0 0 0 1 0 0
3 1 0.41924327 0 0 0 1 0 0
4 1 0.31986826 -1 -1 -1 -1 -1 -1
5 1 0.27863787 1 0 0 0 0 0
6 1 0.03346104 0 0 1 0 0 0
[1] 6.677298
Summary
- We have seen how interaction terms can be used to improve model fit.
- We have seen how different factor codings can be used in R.