Section 14 Examples
In this section we illustrate the results of the previous sections with the help of examples.
14.1 Use of Interaction Terms in Modelling
Example 14.1 Researchers in Food Science studied how big people’s mouths tend to be. It transpires that mouth volume is related to more easily measured quantities like height and weight. Here we have data which gives the mouth volume (in cubic centimetres), age, sex (female = 0, male = 1), height (in metres) and weight (in kilograms) for 61 volunteers. The first few rows of the data are shown below:
# data from https://teaching.seehuhn.de/2022/MATH3714/mouth-volume.txt
dd <- read.table("data/mouth-volume.txt", header = TRUE)
head(dd)
Mouth_Volume Age Sex Height Weight
1 56.659 29 1 1.730 70.455
2 47.938 24 0 1.632 60.000
3 60.995 45 0 1.727 102.273
4 68.917 23 0 1.641 78.636
5 60.956 27 0 1.600 57.273
6 76.204 23 1 1.746 66.818
Normally it is important to check that categorical variables are correctly recognised as factors when importing data into R. As an exception, the fact that Sex is stored as an integer variable (instead of a factor) does not matter here, since the default factor coding would be identical to the coding already used in the data.
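If we did want to store Sex as a factor explicitly, a command along the following lines could be used (the labels are our choice; we do not run this here, so that the later output keeps the numeric coding):
# Not run: convert Sex to a factor with descriptive labels
dd$Sex <- factor(dd$Sex, levels = c(0, 1), labels = c("female", "male"))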
Our aim is to fit a model which describes mouth volume in terms of the other variables. We start by fitting a basic model:
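The summary below can be produced with a call like the following (m0 is simply our name for the fitted model):
m0 <- lm(Mouth_Volume ~ ., data = dd)
summary(m0)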
Call:
lm(formula = Mouth_Volume ~ ., data = dd)
Residuals:
Min 1Q Median 3Q Max
-33.242 -10.721 -3.223 8.800 43.218
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.9882 45.2736 0.022 0.9827
Age 0.3617 0.2607 1.387 0.1709
Sex 5.5030 5.9385 0.927 0.3581
Height 15.8980 29.1952 0.545 0.5882
Weight 0.2640 0.1400 1.885 0.0646 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.04 on 56 degrees of freedom
Multiple R-squared: 0.2588, Adjusted R-squared: 0.2059
F-statistic: 4.888 on 4 and 56 DF, p-value: 0.001882
Since no coefficient is significantly different from zero (at the \(5\%\) level), and since the \(R^2\) value is not close to \(1\), the model fit seems poor. To improve it, we try including all pairwise interaction terms:
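In R the model with all pairwise interactions can be specified with the formula Mouth_Volume ~ .^2; a call such as the following (with m1 as our name for the fit) produces the output below:
m1 <- lm(Mouth_Volume ~ .^2, data = dd)
summary(m1)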
Call:
lm(formula = Mouth_Volume ~ .^2, data = dd)
Residuals:
Min 1Q Median 3Q Max
-35.528 -8.168 -1.995 6.704 44.136
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.875e+02 2.337e+02 1.230 0.225
Age -8.370e+00 6.352e+00 -1.318 0.194
Sex -9.318e+01 1.325e+02 -0.703 0.485
Height -1.540e+02 1.505e+02 -1.023 0.311
Weight 5.889e-01 2.713e+00 0.217 0.829
Age:Sex -9.023e-01 1.035e+00 -0.872 0.388
Age:Height 5.374e+00 4.062e+00 1.323 0.192
Age:Weight -1.754e-03 2.027e-02 -0.087 0.931
Sex:Height 5.989e+01 7.536e+01 0.795 0.431
Sex:Weight 3.110e-01 4.177e-01 0.744 0.460
Height:Weight -2.680e-01 1.755e+00 -0.153 0.879
Residual standard error: 15.22 on 50 degrees of freedom
Multiple R-squared: 0.3219, Adjusted R-squared: 0.1863
F-statistic: 2.374 on 10 and 50 DF, p-value: 0.0218
None of the interaction terms look useful when used in combination. We try to select a subset of variables to get a better fit:
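The output below comes from best-subset selection as implemented in the leaps package; a call along the following lines (fits is our name for the result) reproduces it:
library(leaps)
fits <- regsubsets(Mouth_Volume ~ .^2, data = dd, nvmax = 10)
summary(fits)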
Subset selection object
Call: regsubsets.formula(Mouth_Volume ~ .^2, data = dd, nvmax = 10)
10 Variables (and intercept)
Forced in Forced out
Age FALSE FALSE
Sex FALSE FALSE
Height FALSE FALSE
Weight FALSE FALSE
Age:Sex FALSE FALSE
Age:Height FALSE FALSE
Age:Weight FALSE FALSE
Sex:Height FALSE FALSE
Sex:Weight FALSE FALSE
Height:Weight FALSE FALSE
1 subsets of each size up to 10
Selection Algorithm: exhaustive
Age Sex Height Weight Age:Sex Age:Height Age:Weight Sex:Height
1 ( 1 ) " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " "*" " "
3 ( 1 ) " " "*" " " " " " " " " "*" "*"
4 ( 1 ) " " "*" " " " " " " " " "*" "*"
5 ( 1 ) "*" " " "*" " " "*" "*" " " " "
6 ( 1 ) "*" " " "*" " " "*" "*" " " "*"
7 ( 1 ) "*" "*" "*" " " "*" "*" " " "*"
8 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*"
9 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*"
10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
Sex:Weight Height:Weight
1 ( 1 ) " " "*"
2 ( 1 ) "*" " "
3 ( 1 ) " " " "
4 ( 1 ) "*" " "
5 ( 1 ) "*" " "
6 ( 1 ) "*" " "
7 ( 1 ) "*" " "
8 ( 1 ) "*" " "
9 ( 1 ) "*" "*"
10 ( 1 ) "*" "*"
The table shows the optimal model for every number of variables, for \(p\) ranging from \(1\) to \(10\). To choose the number of variables we can use the adjusted \(R^2\)-value:
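One way to find the model size with the largest adjusted \(R^2\), assuming the regsubsets fit is stored in fits as above (s and best are just our names):
s <- summary(fits)
best <- which.max(s$adjr2)  # model size with the largest adjusted R^2
s$which[best, ]             # which variables this model includes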
(Intercept) Age Sex Height Weight
TRUE TRUE FALSE TRUE FALSE
Age:Sex Age:Height Age:Weight Sex:Height Sex:Weight
TRUE TRUE FALSE FALSE TRUE
Height:Weight
FALSE
The optimal model consists of the five variables listed. For further examination, we fit the corresponding model using lm().
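A call along these lines reproduces the summary below (m2 is just a convenient name for the fit):
m2 <- lm(Mouth_Volume ~ Age + Height + Age:Sex + Age:Height + Sex:Weight,
         data = dd)
summary(m2)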
Call:
lm(formula = Mouth_Volume ~ Age + Height + Age:Sex + Age:Height +
Sex:Weight, data = dd)
Residuals:
Min 1Q Median 3Q Max
-36.590 -9.577 -1.545 7.728 42.779
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 265.6236 122.3362 2.171 0.0342 *
Age -8.4905 4.0559 -2.093 0.0409 *
Height -134.0835 72.8589 -1.840 0.0711 .
Age:Sex -0.8633 0.4944 -1.746 0.0864 .
Age:Height 5.3644 2.4047 2.231 0.0298 *
Sex:Weight 0.4059 0.1659 2.446 0.0177 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.65 on 55 degrees of freedom
Multiple R-squared: 0.3096, Adjusted R-squared: 0.2468
F-statistic: 4.932 on 5 and 55 DF, p-value: 0.0008437
This looks much better: most variables are now significantly different from zero, and the adjusted \(R^2\) value has (slightly) improved. Nevertheless, the structure of the model is hard to explain, and the fitted model is difficult to interpret. For comparison, we consider the optimal model with only \(p=2\) variables from the table above:
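This model can be fitted as follows (m3 is our name for it):
m3 <- lm(Mouth_Volume ~ Age:Weight + Sex:Weight, data = dd)
summary(m3)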
Call:
lm(formula = Mouth_Volume ~ Age:Weight + Sex:Weight, data = dd)
Residuals:
Min 1Q Median 3Q Max
-34.161 -10.206 -3.738 9.087 43.747
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.678187 4.788206 9.122 8.35e-13 ***
Age:Weight 0.005554 0.002372 2.342 0.0226 *
Weight:Sex 0.127360 0.048640 2.618 0.0113 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.72 on 58 degrees of freedom
Multiple R-squared: 0.2651, Adjusted R-squared: 0.2398
F-statistic: 10.46 on 2 and 58 DF, p-value: 0.000132
The adjusted \(R^2\)-value of this model is only slightly reduced, and all regression coefficients are significantly different from zero. The fitted model can be written as
\[\begin{equation*} \textrm{mouth volume} = 43.68 + \begin{cases} 0.0056 \, \mathrm{age} \times \mathrm{weight} & \mbox{for females} \\ (0.0056 \, \mathrm{age} + 0.1274) \, \mathrm{weight} & \mbox{for males.} \end{cases} \end{equation*}\]
This model is still not trivial to interpret, but it does show that mouth volume increases with weight and age, and that the dependency on weight is stronger for males.
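As a quick illustration of the fitted equation, consider a hypothetical volunteer aged 30 and weighing 70 kg (these values are made up, not taken from the data); using the rounded coefficients we get:
43.68 + 0.0056 * 30 * 70             # female: approximately 55.4 cm^3
43.68 + (0.0056 * 30 + 0.1274) * 70  # male: approximately 64.4 cm^3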
14.2 Alternative Factor Codings
The above coding of factors using indicator variables is the default procedure in R, but there are other choices of coding. It transpires that any mapping of the \(k\) levels to vectors which (together with the intercept, if present) span a \(k\)-dimensional space will work for a factor with \(k\) levels. The choice of coding does not affect the fitted values or the goodness of fit, but it does affect the estimates \(\hat\beta\), so the coding needs to be known before the parameters can be interpreted.
Different choices of factor coding are built into R. Here we illustrate some of these choices using a factor which represents weekdays:
Indicator variables. This is the default coding:
    Tue Wed Thu Fri Sat Sun
Mon   0   0   0   0   0   0
Tue   1   0   0   0   0   0
Wed   0   1   0   0   0   0
Thu   0   0   1   0   0   0
Fri   0   0   0   1   0   0
Sat   0   0   0   0   1   0
Sun   0   0   0   0   0   1
The seven rows correspond to the weekdays, the six columns show how each weekday is encoded. The regression coefficients can be interpreted as the change of intercept relative to the reference level (in this case Monday).
Deviation coding uses “sum to zero” contrasts:
    [,1] [,2] [,3] [,4] [,5] [,6]
Mon    1    0    0    0    0    0
Tue    0    1    0    0    0    0
Wed    0    0    1    0    0    0
Thu    0    0    0    1    0    0
Fri    0    0    0    0    1    0
Sat    0    0    0    0    0    1
Sun   -1   -1   -1   -1   -1   -1
This contrasts each level with the final level. For example, \(\hat\beta_1\) is the amount the intercept is increased for Mondays and decreased for Sundays. This coding can sometimes reduce collinearity in the design matrix.
Helmert coding contrasts the second level with the first, the third with the average of the first two, and so on:
    [,1] [,2] [,3] [,4] [,5] [,6]
Mon   -1   -1   -1   -1   -1   -1
Tue    1   -1   -1   -1   -1   -1
Wed    0    2   -1   -1   -1   -1
Thu    0    0    3   -1   -1   -1
Fri    0    0    0    4   -1   -1
Sat    0    0    0    0    5   -1
Sun    0    0    0    0    0    6
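The three coding matrices shown above can be generated with R’s built-in contrast functions, for example:
day.names <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
contr.treatment(day.names)  # indicator variables (the default)
contr.sum(day.names)        # deviation ("sum to zero") coding
contr.helmert(day.names)    # Helmert coding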
We can set the coding for each factor separately, using the optional contrasts.arg argument of model.matrix() (for lm(), the corresponding argument is called contrasts):
set.seed(20211127)
n <- 20
x1 <- runif(n)
day.names <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
x2 <- sample(day.names, size = n, replace = TRUE)
# Explicitly specify the levels, in case a day is missing
# from the sample and to fix the order of days:
x2 <- factor(x2, levels = day.names)
dd <- data.frame(x1, x2)
X1 <- model.matrix(~ ., data = dd)
head(X1)
(Intercept) x1 x2Tue x2Wed x2Thu x2Fri x2Sat x2Sun
1 1 0.41745367 1 0 0 0 0 0
2 1 0.88671893 0 0 1 0 0 0
3 1 0.41924327 0 0 1 0 0 0
4 1 0.31986826 0 0 0 0 0 1
5 1 0.27863787 0 0 0 0 0 0
6 1 0.03346104 0 1 0 0 0 0
[1] 10.37243
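The unlabelled number above appears to be the condition number of X1 (for example as computed by kappa(X1, exact = TRUE); the exact call used is not shown). A sketch of how the second design matrix, using deviation coding for x2, can be constructed (X2 is our name for it); the smaller value shown after it suggests the deviation coding here gives a better-conditioned design matrix:
# Same model matrix, but with "sum to zero" contrasts for x2
X2 <- model.matrix(~ ., data = dd,
                   contrasts.arg = list(x2 = "contr.sum"))
head(X2)
# One way to compute the condition number (only a guess at the call
# that produced the numbers shown here):
kappa(X2, exact = TRUE)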
(Intercept) x1 x21 x22 x23 x24 x25 x26
1 1 0.41745367 0 1 0 0 0 0
2 1 0.88671893 0 0 0 1 0 0
3 1 0.41924327 0 0 0 1 0 0
4 1 0.31986826 -1 -1 -1 -1 -1 -1
5 1 0.27863787 1 0 0 0 0 0
6 1 0.03346104 0 0 1 0 0 0
[1] 6.677298
Summary
- We have seen how interaction terms can be used to improve model fit.
- We have seen how different factor codings can be used in R.