Interlude: Linear Regression in R
Fitting a Model
The function lm() is used to fit a linear model in R. There are different
ways to specify the form of the model and the data to be used for fitting the
model.
The most basic way to call lm() is the case where the explanatory variables and the response variable are stored as separate vectors. Assuming, for example, that the explanatory variables are x1, x2, x3 and that the response variable is y in R, we can tell R to fit the linear model \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\) by using the following command:

    lm(y ~ x1 + x2 + x3)

Note that R automatically adds the intercept term \(\beta_0\) to this model. If we want to fit a model without an intercept, i.e. the model \(y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\), we have to add 0 + in front of the explanatory variables:

    lm(y ~ 0 + x1 + x2 + x3)

The general form of a model specification is the response variable, followed by ~, followed by a plus-separated list of explanatory variables. For this form of calling lm(), the variables y, x1, x2, and x3 in the examples above must already be defined before lm() is called. It is a good idea to double-check that these variables have the correct values before calling lm(). Both for the response and for the explanatory variables we can specify arbitrary R expressions to compute the numeric values to be used. For example, to fit the model \(\log(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon\) (assuming that all \(y_i\) are positive) we can use the following command:

    lm(log(y) ~ x1 + x2)
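To try these calls out, we can generate small artificial vectors; this is only a sketch, and the names x1, x2, x3 and y below simply mirror the text:

```r
# Simulated data for illustration only.
set.seed(1)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- exp(1 + 2 * x1 - x2 + 0.1 * rnorm(n))  # positive, so log(y) is defined

m1 <- lm(y ~ x1 + x2 + x3)      # with intercept: four coefficients
m2 <- lm(y ~ 0 + x1 + x2 + x3)  # without intercept: three coefficients
m3 <- lm(log(y) ~ x1 + x2)      # transformed response

names(coef(m1))  # "(Intercept)" "x1" "x2" "x3"
names(coef(m2))  # "x1" "x2" "x3"
```

Comparing the coefficient names of m1 and m2 makes the effect of the 0 + visible: the intercept term is simply absent from the second fit.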
Some care is needed, because +, * and ^ have a special meaning inside the first argument of lm(); any time we want to compute a variable for lm() using these operations, we need to surround the corresponding expression with I(), to tell R that +, * or ^ should have their usual, arithmetic meaning. For example, to fit a model of the form \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon\), we can use the following R command:

    lm(y ~ x + I(x^2))

Here, the use of I() tells R that x^2 is to be interpreted as the vector \((x_1^2, \ldots, x_n^2)\). Similarly, we can fit a model of the form \(y = \beta_0 + \beta_1 (x_1 + x_2) + \varepsilon\):

    lm(y ~ I(x1 + x2))

Here, the use of I() tells R that x1 + x2 indicates the vector \((x_{1,1}+x_{2,1}, \ldots, x_{1,n}+x_{2,n})\) instead of two separate explanatory variables. Details about how to specify models in calls to lm() can be found by using the command help(formula) in R.

If the response and the explanatory variables are stored in the columns of a data frame, we can use the data=... argument to lm() to specify this data frame and then just use the column names to specify the regression model. For example, the stackloss dataset built into R consists of a data frame with columns Air.Flow, Water.Temp, Acid.Conc. and stack.loss. To predict stackloss$stack.loss from stackloss$Air.Flow we can write

    lm(stack.loss ~ Air.Flow, data = stackloss)

As a special case, a single dot "." can be used in place of the explanatory variables in the model to indicate that all columns except for the given response should be used. Thus, the following two commands are equivalent:

    lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
    lm(stack.loss ~ ., data = stackloss)
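As a quick sanity check of these constructions, the following snippet can be run directly; it uses simulated data for the I() examples (the names x, x1, x2, y are only for this sketch) and the built-in stackloss data for the dot shorthand:

```r
# I() makes +, * and ^ act on the data rather than on the formula.
set.seed(1)
x <- rnorm(40)
x1 <- rnorm(40)
x2 <- rnorm(40)
y <- 1 + x - 0.5 * x^2 + rnorm(40)

m.quad <- lm(y ~ x + I(x^2))   # intercept, x and x^2: three coefficients
m.sum <- lm(y ~ I(x1 + x2))    # one combined variable: two coefficients
m.two <- lm(y ~ x1 + x2)       # two separate variables: three coefficients

# The dot expands to all columns except the response, so these two
# fits have identical coefficients:
m.full <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
m.dot <- lm(stack.loss ~ ., data = stackloss)
all.equal(coef(m.full), coef(m.dot))  # TRUE
```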
Understanding the Model
The output of the lm() function is an R object which can be used to extract information about the fitted model. A good way to work with this object is to store it in a variable and then use commands like the ones listed below to work with this variable. For example, the following R command fits a model for the stackloss dataset and stores it in the variable m:

    m <- lm(stack.loss ~ ., data = stackloss)
Many operations are available to use with this object m:
Printing m to the screen:

    Call:
    lm(formula = stack.loss ~ ., data = stackloss)

    Coefficients:
    (Intercept)     Air.Flow   Water.Temp   Acid.Conc.
       -39.9197       0.7156       1.2953      -0.1521

This shows the estimated values for the regression coefficients.
The command summary() can be used to print additional information about the fitted model:

    Call:
    lm(formula = stack.loss ~ ., data = stackloss)

    Residuals:
        Min      1Q  Median      3Q     Max
    -7.2377 -1.7117 -0.4551  2.3614  5.6978

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) -39.9197    11.8960  -3.356  0.00375 **
    Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
    Water.Temp    1.2953     0.3680   3.520  0.00263 **
    Acid.Conc.   -0.1521     0.1563  -0.973  0.34405
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 3.243 on 17 degrees of freedom
    Multiple R-squared:  0.9136,    Adjusted R-squared:  0.8983
    F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09

We will learn over the course of this module how to interpret all of this output.
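The numbers in this printout can also be extracted programmatically: summary() returns a list, and components such as coefficients, r.squared and sigma (documented in help(summary.lm)) hold the pieces of the table above:

```r
m <- lm(stack.loss ~ ., data = stackloss)
s <- summary(m)

s$coefficients  # the coefficient table (estimates, standard errors, t, p)
s$r.squared     # multiple R-squared (0.9136 above)
s$sigma         # residual standard error (3.243 above)
```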
The estimated coefficient vector \(\hat\beta\) can be obtained using coef(m):

    (Intercept)    Air.Flow  Water.Temp  Acid.Conc.
    -39.9196744   0.7156402   1.2952861  -0.1521225

The fitted values \(\hat y_i\) can be obtained using the command fitted(m):

            1         2         3         4         5         6         7         8
    38.765363 38.917485 32.444467 22.302226 19.711654 21.006940 21.389491 21.389491
            9        10        11        12        13        14        15        16
    18.144379 12.732806 11.363703 10.220540 12.428561 12.050499  5.638582  6.094949
           17        18        19        20        21
     9.519951  8.455093  9.598257 13.587853 22.237713

The estimated residuals \(\hat\varepsilon_i = y_i - \hat y_i\) can be obtained using the command resid(m):

              1           2           3           4           5           6
     3.23463723 -1.91748529  4.55553300  5.69777417 -1.71165358 -3.00693970
              7           8           9          10          11          12
    -2.38949071 -1.38949071 -3.14437890  1.26719408  2.63629676  2.77946036
             13          14          15          16          17          18
    -1.42856088 -0.05049929  2.36141836  0.90505080 -1.51995059 -0.45509295
             19          20          21
    -0.59825656  1.41214728 -7.23771286

The design matrix \(X\) can be found using model.matrix(m):

       (Intercept) Air.Flow Water.Temp Acid.Conc.
    1            1       80         27         89
    2            1       80         27         88
    3            1       75         25         90
    4            1       62         24         87
    5            1       62         22         87
    6            1       62         23         87
    7            1       62         24         93
    8            1       62         24         93
    9            1       58         23         87
    10           1       58         18         80
    11           1       58         18         89
    12           1       58         17         88
    13           1       58         18         82
    14           1       58         19         93
    15           1       50         18         89
    16           1       50         18         86
    17           1       50         19         72
    18           1       50         19         79
    19           1       50         20         80
    20           1       56         20         82
    21           1       70         20         91
    attr(,"assign")
    [1] 0 1 2 3

Instead of giving a fitted model m, we can also just give the data and formula as in the call to lm(), e.g. model.matrix(y ~ x1 + x2 + x3).
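These quantities fit together as in the theory: the fitted values are \(\hat y = X\hat\beta\) and the residuals are \(y - \hat y\). A short numerical check:

```r
m <- lm(stack.loss ~ ., data = stackloss)
X <- model.matrix(m)
beta.hat <- coef(m)

# fitted(m) equals X %*% beta.hat (up to numerical rounding) ...
isTRUE(all.equal(as.vector(X %*% beta.hat), unname(fitted(m))))  # TRUE
# ... and resid(m) equals the observed responses minus the fitted values.
isTRUE(all.equal(unname(resid(m)),
                 stackloss$stack.loss - unname(fitted(m))))      # TRUE
```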
Making Predictions
One of the main aims of fitting a linear model is to use the model to make
predictions for new, not previously observed \(x\)-values, i.e. to compute
\(\hat y_{\mathrm{new}} = X_{\mathrm{new}} \hat\beta\). The general form of the
command for prediction is predict(m, newdata), where m is the model
previously fitted using lm(), and newdata specifies the new \(x\)-values to
predict responses for. The argument newdata should be a data.frame and
for each variable in the original model there should be a column in newdata
which has the name of the original variable and contains the new values. For
example, if the model was fitted using a command of the form

    m <- lm(y ~ x)

and if the new values for x are stored in a vector x.new, then responses for
the \(x\)-values in x.new can be predicted using the following
command:

    predict(m, data.frame(x = x.new))
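A minimal self-contained illustration of this pattern, using simulated data (the names x, y and x.new here are only for the sketch):

```r
set.seed(1)
x <- rnorm(30)
y <- 2 + 3 * x + rnorm(30)
m <- lm(y ~ x)

x.new <- c(-1, 0, 1)
# The column name in newdata must match the variable name in the model:
predict(m, data.frame(x = x.new))
```

Since the fitted slope is positive here, the three predictions come out in increasing order.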
As a second example, for the stackloss dataset, the following
commands can be used to predict stack.loss for two new
\(x\)-values:

    m <- lm(stack.loss ~ ., data = stackloss)
    newdata <- data.frame(Air.Flow = c(70, 73),
                          Water.Temp = c(25, 24),
                          Acid.Conc. = c(78, 90))
    predict(m, newdata)

This produces the output

           1        2
    30.69174 29.71790
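The same predictions can be computed by hand as \(X_{\mathrm{new}} \hat\beta\), where \(X_{\mathrm{new}}\) carries a leading column of ones for the intercept; this is a useful consistency check:

```r
m <- lm(stack.loss ~ ., data = stackloss)
newdata <- data.frame(Air.Flow = c(70, 73),
                      Water.Temp = c(25, 24),
                      Acid.Conc. = c(78, 90))

X.new <- cbind(1, as.matrix(newdata))  # prepend the intercept column
as.vector(X.new %*% coef(m))           # same values as predict(m, newdata)
```

Note that the columns of newdata are in the same order as the entries of coef(m) after the intercept, which is what makes the matrix product line up.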
More information about predict() can be found by reading
the output of help(predict.lm).