Interlude: Linear Regression in R

Fitting a Model

The function lm() is used to fit linear models in R. There are several ways to specify the form of the model and the data to be used for fitting it.

  • The most basic way to call lm() is the case where the explanatory variables and the response variable are stored as separate vectors. Assuming, for example, that the explanatory variables are x1, x2, x3 and that the response variable is y in R, we can tell R to fit the linear model \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\) by using the following command:

      lm(y ~ x1 + x2 + x3)

    Note that R automatically adds the intercept term \(\beta_0\) to this model. If we want to fit a model without an intercept, i.e. the model \(y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\), we have to add 0 + in front of the explanatory variables:

      lm(y ~ 0 + x1 + x2 + x3)

    The general form of a model specification is the response variable, followed by ~, followed by a plus-separated list of explanatory variables. For this form of calling lm(), the variables y, x1, x2, and x3 in the examples above must already be defined before lm() is called, so it is a good idea to double-check that they contain the correct values first. (A complete example using simulated data is shown after this list.)

  • For both the response and the explanatory variables we can specify arbitrary R expressions to compute the numeric values to be used. For example, to fit the model \(\log(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon\) (assuming that all \(y_i\) are positive) we can use the following command:

      lm(log(y) ~ x1 + x2)

    Some care is needed, because +, * and ^ have a special meaning inside the first argument of lm(); any time we want to compute a variable for lm() using these operations, we need to surround the corresponding expression with I(), to tell R that +, * or ^ should have their usual, arithmetic meaning. For example, to fit a model of the form \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon\), we can use the following R command:

      lm(y ~ x + I(x^2))

    Here, the use of I() tells R that x^2 is to be interpreted as the vector \((x_1^2, \ldots, x_n^2)\). Similarly, we can fit a model of the form \(y = \beta_0 + \beta_1 (x_1 + x_2) + \varepsilon\):

      lm(y ~ I(x1+x2))

    Here, the use of I() tells R that x1+x2 indicates the vector \((x_{1,1}+x_{2,1}, \ldots, x_{1,n}+x_{2,n})\) instead of two separate explanatory variables.

    Details about how to specify models in calls to lm() can be found by using the command help(formula) in R.

  • If the response and the explanatory variables are stored in the columns of a data frame, we can use the data=... argument to lm() to specify this data frame and then just use the column names to specify the regression model. For example, the stackloss dataset built into R is a data frame with columns Air.Flow, Water.Temp, Acid.Conc., and stack.loss. To predict stackloss$stack.loss from stackloss$Air.Flow we can write

      lm(stack.loss ~ Air.Flow, data=stackloss)

    As a special case, a single dot “.” can be used in place of the explanatory variables in the model to indicate that all columns except for the given response should be used. Thus, the following two commands are equivalent:

      lm(stack.loss ~ ., data=stackloss)
      lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data=stackloss)
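
To see these forms of model specification in action, here is a minimal self-contained sketch using simulated data; the variable names and coefficient values are chosen purely for illustration:

  set.seed(1)                      # make the simulated data reproducible
  n <- 50
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  x3 <- rnorm(n)
  y <- 1 + 2*x1 - x2 + 0.5*x3 + rnorm(n)

  lm(y ~ x1 + x2 + x3)             # model with intercept
  lm(y ~ 0 + x1 + x2 + x3)         # model without intercept
  lm(y ~ x1 + I(x2 + x3))          # x2 + x3 combined into a single variable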

Understanding the Model

The output of the lm() function is an R object which can be used to extract information about the fitted model. A good way to work with this object is to store it in a variable and then use commands like the ones listed below. For example, the following R command fits a model for the stackloss dataset and stores it in the variable m:

  m <- lm(stack.loss ~ ., data=stackloss)

Many operations are available to use with this object m:

  • Printing m to the screen:

      m
    
    Call:
    lm(formula = stack.loss ~ ., data = stackloss)
    
    Coefficients:
    (Intercept)     Air.Flow   Water.Temp   Acid.Conc.  
       -39.9197       0.7156       1.2953      -0.1521  

    This shows the estimated values of the regression coefficients.

  • The command summary() can be used to print additional information about the fitted model:

      summary(m)
    
    Call:
    lm(formula = stack.loss ~ ., data = stackloss)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -7.2377 -1.7117 -0.4551  2.3614  5.6978 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
    Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
    Water.Temp    1.2953     0.3680   3.520  0.00263 ** 
    Acid.Conc.   -0.1521     0.1563  -0.973  0.34405    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 3.243 on 17 degrees of freedom
    Multiple R-squared:  0.9136,    Adjusted R-squared:  0.8983 
    F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09

    We will learn over the course of this module how to interpret all of this output.

  • The vector \(\hat\beta\) of estimated regression coefficients can be obtained using coef(m):

      coef(m)
    (Intercept)    Air.Flow  Water.Temp  Acid.Conc. 
    -39.9196744   0.7156402   1.2952861  -0.1521225 

  • The fitted values \(\hat y_i\) can be obtained using the command fitted(m):

      fitted(m)
            1         2         3         4         5         6         7         8 
    38.765363 38.917485 32.444467 22.302226 19.711654 21.006940 21.389491 21.389491 
            9        10        11        12        13        14        15        16 
    18.144379 12.732806 11.363703 10.220540 12.428561 12.050499  5.638582  6.094949 
           17        18        19        20        21 
     9.519951  8.455093  9.598257 13.587853 22.237713 

  • The estimated residuals \(\hat\varepsilon_i = y_i - \hat y_i\) can be obtained using the command resid(m):

      resid(m)
              1           2           3           4           5           6 
     3.23463723 -1.91748529  4.55553300  5.69777417 -1.71165358 -3.00693970 
              7           8           9          10          11          12 
    -2.38949071 -1.38949071 -3.14437890  1.26719408  2.63629676  2.77946036 
             13          14          15          16          17          18 
    -1.42856088 -0.05049929  2.36141836  0.90505080 -1.51995059 -0.45509295 
             19          20          21 
    -0.59825656  1.41214728 -7.23771286 

  • The design matrix \(X\) can be found using model.matrix(m):

      model.matrix(m)
       (Intercept) Air.Flow Water.Temp Acid.Conc.
    1            1       80         27         89
    2            1       80         27         88
    3            1       75         25         90
    4            1       62         24         87
    5            1       62         22         87
    6            1       62         23         87
    7            1       62         24         93
    8            1       62         24         93
    9            1       58         23         87
    10           1       58         18         80
    11           1       58         18         89
    12           1       58         17         88
    13           1       58         18         82
    14           1       58         19         93
    15           1       50         18         89
    16           1       50         18         86
    17           1       50         19         72
    18           1       50         19         79
    19           1       50         20         80
    20           1       56         20         82
    21           1       70         20         91
    attr(,"assign")
    [1] 0 1 2 3

    Instead of giving a fitted model m, we can also just give the formula and data as in the call to lm(), e.g. model.matrix(y ~ x1 + x2 + x3). (The sketch after this list shows how the design matrix relates to the other quantities above.)
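
As a consistency check, the quantities above fit together as \(\hat y = X \hat\beta\) and \(\hat\varepsilon = y - \hat y\). The following sketch, using the model m fitted above, verifies this numerically and also recomputes the estimated coefficients directly from the design matrix and the response:

  X <- model.matrix(m)
  beta.hat <- coef(m)

  # fitted values equal X times beta-hat
  all.equal(as.vector(X %*% beta.hat), unname(fitted(m)))

  # residuals equal the observed responses minus the fitted values
  all.equal(unname(stackloss$stack.loss - fitted(m)), unname(resid(m)))

  # beta-hat solves the normal equations (X'X) beta = X'y
  solve(t(X) %*% X, t(X) %*% stackloss$stack.loss)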

Making Predictions

One of the main aims of fitting a linear model is to use the model to make predictions for new, not previously observed \(x\)-values, i.e. to compute \(\hat y_{\mathrm{new}} = X_{\mathrm{new}} \hat\beta\). The general form of the command for prediction is predict(m, newdata), where m is the model previously fitted using lm(), and newdata specifies the new \(x\)-values to predict responses for. The argument newdata must be a data.frame with one column for each explanatory variable in the original model; each column must have the name of the original variable and contain the new values. For example, if the model was fitted using

  m <- lm(y ~ x + I(x^2))

and if the new \(x\)-values are stored in a vector x.new, then the corresponding responses can be predicted using the following command:

  predict(m, data.frame(x=x.new))
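
For instance, here is a small self-contained version of this example, using simulated data (all numbers are illustrative only):

  x <- runif(30, 0, 10)                  # simulated x-values
  y <- 1 + 0.5*x - 0.1*x^2 + rnorm(30)   # quadratic trend plus noise
  m <- lm(y ~ x + I(x^2))

  x.new <- c(2.5, 7.5)                   # new x-values to predict responses for
  predict(m, data.frame(x=x.new))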

As a second example, for the stackloss dataset, the following commands can be used to predict stack.loss for two new \(x\)-values:

  m <- lm(stack.loss ~ ., data = stackloss)
  newdata <- data.frame(Air.Flow=c(70, 73), Water.Temp=c(25,24), Acid.Conc.=c(78,90))
  predict(m, newdata)
       1        2 
30.69174 29.71790 
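
By default predict() returns point predictions. The optional interval argument can be used to request interval estimates instead; for example, the following command computes 95% prediction intervals for the same two new observations, returning columns fit, lwr and upr with the point predictions and interval bounds:

  predict(m, newdata, interval="prediction", level=0.95)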

More information about predict() can be found by reading the output of help(predict.lm).