Interlude: Linear Regression in R
Fitting a Model
The function lm()
is used to fit a linear model in R. There are different
ways to specify the form of the model and the data to be used for fitting the
model.
The most basic way to call lm() is the case where the explanatory variables and the response variable are stored as separate vectors. Assuming, for example, that the explanatory variables are x1, x2, x3 and that the response variable is y in R, we can tell R to fit the linear model \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\) by using the following command:

    lm(y ~ x1 + x2 + x3)

Note that R automatically added the intercept term \(\beta_0\) to this model. If we want to fit a model without an intercept, i.e. the model \(y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\), we have to add 0 + in front of the explanatory variables:

    lm(y ~ 0 + x1 + x2 + x3)

The general form of a model specification is the response variable, followed by ~, followed by a plus-separated list of explanatory variables. For this form of calling lm(), the variables y, x1, x2, and x3 in the examples above must already be defined before lm() is called. It may be a good idea to double-check that the variables have the correct values before calling lm().

Both for the response and for the explanatory variables we can specify arbitrary R expressions to compute the numeric values to be used. For example, to fit the model \(\log(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon\) (assuming that all \(y_i\) are positive) we can use the following command:

    lm(log(y) ~ x1 + x2)
Some care is needed, because +, * and ^ have a special meaning inside the first argument of lm(); any time we want to compute a variable for lm() using these operations, we need to surround the corresponding expression with I(), to tell R that +, * or ^ should have their usual, arithmetic meaning. For example, to fit a model of the form \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon\), we can use the following R command:

    lm(y ~ x + I(x^2))

Here, the use of I() tells R that x^2 is to be interpreted as the vector \((x_1^2, \ldots, x_n^2)\). Similarly, we can fit a model of the form \(y = \beta_0 + \beta_1 (x_1 + x_2) + \varepsilon\):

    lm(y ~ I(x1 + x2))

Here, the use of I() tells R that x1 + x2 indicates the vector \((x_{1,1}+x_{2,1}, \ldots, x_{1,n}+x_{2,n})\) instead of two separate explanatory variables.

Details about how to specify models in calls to lm() can be found by using the command help(formula)
in R.

If the response and the explanatory variables are stored in the columns of a data frame, we can use the data=... argument to lm() to specify this data frame and then just use the column names to specify the regression model. For example, the stackloss dataset built into R consists of a data frame with columns Air.Flow, Water.Temp, Acid.Conc. and stack.loss. To predict stackloss$stack.loss from stackloss$Air.Flow we can write

    lm(stack.loss ~ Air.Flow, data = stackloss)

As a special case, a single dot "." can be used in place of the explanatory variables in the model to indicate that all columns except for the given response should be used. Thus, the following two commands are equivalent:

    lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
    lm(stack.loss ~ ., data = stackloss)
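As a quick sanity check, this equivalence can be verified by fitting both forms on the built-in stackloss data and comparing the estimated coefficients:

```r
# "." expands to all columns of stackloss other than the response stack.loss,
# so both calls specify the same model and give identical estimates.
m1 <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
m2 <- lm(stack.loss ~ ., data = stackloss)
all.equal(coef(m1), coef(m2))  # TRUE
```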
Understanding the Model
The output of the lm() function is an R object which can be used to extract information about the fitted model. A good way to work with this object is to store it in a variable and then use commands like the ones listed below to work with this variable. For example, the following R command fits a model for the stackloss dataset and stores it in the variable m:

    m <- lm(stack.loss ~ ., data = stackloss)

Many operations are available for use with this object m:
Printing m to the screen:

    Call:
    lm(formula = stack.loss ~ ., data = stackloss)

    Coefficients:
    (Intercept)     Air.Flow   Water.Temp   Acid.Conc.
       -39.9197       0.7156       1.2953      -0.1521

This shows the estimated values of the regression coefficients.
The command summary() can be used to print additional information about the fitted model:

    Call:
    lm(formula = stack.loss ~ ., data = stackloss)

    Residuals:
        Min      1Q  Median      3Q     Max
    -7.2377 -1.7117 -0.4551  2.3614  5.6978

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) -39.9197    11.8960  -3.356  0.00375 **
    Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
    Water.Temp    1.2953     0.3680   3.520  0.00263 **
    Acid.Conc.   -0.1521     0.1563  -0.973  0.34405
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 3.243 on 17 degrees of freedom
    Multiple R-squared:  0.9136,    Adjusted R-squared:  0.8983
    F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09
We will learn over the course of this module how to interpret all of this output.
The estimated coefficient vector \(\hat\beta\) can be obtained using coef(m):

    (Intercept)    Air.Flow  Water.Temp  Acid.Conc.
    -39.9196744   0.7156402   1.2952861  -0.1521225
The fitted values \(\hat y_i\) can be obtained using the command fitted(m):

            1         2         3         4         5         6         7         8
    38.765363 38.917485 32.444467 22.302226 19.711654 21.006940 21.389491 21.389491
            9        10        11        12        13        14        15        16
    18.144379 12.732806 11.363703 10.220540 12.428561 12.050499  5.638582  6.094949
           17        18        19        20        21
     9.519951  8.455093  9.598257 13.587853 22.237713
The estimated residuals \(\hat\varepsilon_i = y_i - \hat y_i\) can be obtained using the command resid(m):

              1           2           3           4           5           6
     3.23463723 -1.91748529  4.55553300  5.69777417 -1.71165358 -3.00693970
              7           8           9          10          11          12
    -2.38949071 -1.38949071 -3.14437890  1.26719408  2.63629676  2.77946036
             13          14          15          16          17          18
    -1.42856088 -0.05049929  2.36141836  0.90505080 -1.51995059 -0.45509295
             19          20          21
    -0.59825656  1.41214728 -7.23771286
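These quantities fit together as expected; a small check (a sketch using the stackloss model from above) confirms that resid(m) equals the observed responses minus fitted(m):

```r
m <- lm(stack.loss ~ ., data = stackloss)
# Residuals are the observed responses minus the fitted values.
all.equal(unname(resid(m)), stackloss$stack.loss - unname(fitted(m)))  # TRUE
```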
The design matrix \(X\) can be found using model.matrix(m):

       (Intercept) Air.Flow Water.Temp Acid.Conc.
    1            1       80         27         89
    2            1       80         27         88
    3            1       75         25         90
    4            1       62         24         87
    5            1       62         22         87
    6            1       62         23         87
    7            1       62         24         93
    8            1       62         24         93
    9            1       58         23         87
    10           1       58         18         80
    11           1       58         18         89
    12           1       58         17         88
    13           1       58         18         82
    14           1       58         19         93
    15           1       50         18         89
    16           1       50         18         86
    17           1       50         19         72
    18           1       50         19         79
    19           1       50         20         80
    20           1       56         20         82
    21           1       70         20         91
    attr(,"assign")
    [1] 0 1 2 3
Instead of giving a fitted model m, we can also just give the data and formula as in the call to lm(), e.g. model.matrix(y ~ x1 + x2 + x3).
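Combining these pieces, the identity \(\hat y = X \hat\beta\) can be checked directly; a sketch, again using the stackloss model:

```r
m <- lm(stack.loss ~ ., data = stackloss)
X <- model.matrix(m)   # 21 x 4 design matrix, including the intercept column
beta.hat <- coef(m)    # estimated coefficient vector
y.hat <- as.vector(X %*% beta.hat)
all.equal(y.hat, unname(fitted(m)))  # TRUE
```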
Making Predictions
One of the main aims of fitting a linear model is to use the model to make
predictions for new, not previously observed \(x\)-values, i.e. to compute
\(\hat y_{\mathrm{new}} = X_{\mathrm{new}} \hat\beta\). The general form of the
command for prediction is predict(m, newdata)
, where m
is the model
previously fitted using lm()
, and newdata
specifies the new \(x\)-values to
predict responses for. The argument newdata
should be a data.frame
and
for each variable in the original model there should be a column in newdata
which has the name of the original variable and contains the new values. For
example, if the model was fitted using a command such as m <- lm(y ~ x), and
if the new samples are stored in x.new, then responses for the \(x\)-values
in x.new can be predicted using the following command:
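A self-contained sketch of this pattern; the data and the single explanatory variable x are made up for illustration, since only the name x.new is fixed by the text:

```r
# Illustrative data: one explanatory variable x and a response y.
x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.8)
m <- lm(y ~ x)

# New x-values: the column name in the data frame passed to predict()
# must match the variable name used when the model was fitted.
x.new <- c(6, 7)
predict(m, data.frame(x = x.new))
```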
As a second example, for the stackloss
dataset, the following
commands can be used to predict stack.loss
for two new
\(x\)-values:
    m <- lm(stack.loss ~ ., data = stackloss)
    newdata <- data.frame(Air.Flow = c(70, 73), Water.Temp = c(25, 24), Acid.Conc. = c(78, 90))
    predict(m, newdata)

           1        2
    30.69174 29.71790
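Beyond point predictions, predict() can also return interval estimates via its interval argument; a sketch, continuing the stackloss example:

```r
m <- lm(stack.loss ~ ., data = stackloss)
newdata <- data.frame(Air.Flow = c(70, 73), Water.Temp = c(25, 24), Acid.Conc. = c(78, 90))
# Result has columns fit (point prediction), lwr and upr (interval bounds).
P <- predict(m, newdata, interval = "prediction", level = 0.95)
P
```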
More information about predict()
can be found by reading
the output of help(predict.lm)
.