Problem Sheet 1

This problem sheet covers material from Sections 1-3 of the course. You should attempt all these questions and write up your solutions in advance of the workshop in week 2. Your solutions are not assessed. Instead, the answers will be discussed in the workshop and will be made available online afterwards.

You may find it helpful to review the following key concepts:

  • Simple linear regression model: \(y_i = \beta_0 + x_i \beta_1 + \varepsilon_i\)
  • Least squares estimation for regression coefficients
  • Matrix notation for linear regression: \(y = X\beta + \varepsilon\)
  • Properties of the design matrix \(X\) and the hat matrix \(H = X(X^\top X)^{-1}X^\top\)
  • Expectation and covariance of random vectors
  • Basic properties of the multivariate normal distribution

1. Consider a simple linear regression model fitted to a dataset with \(n\) pairs of observations \((x_1, y_1), \ldots, (x_n, y_n)\). Let \(\hat\beta = (\hat\beta_0, \hat\beta_1)\) be the least squares estimate of the coefficients in the model.

  1. Show that if we shift all \(y\)-values by a constant \(c\) (i.e., replace each \(y_i\) with \(y_i+c\)), the new least squares estimate for the slope \(\hat\beta_1\) remains unchanged, while the new estimate for the intercept becomes \(\hat\beta_0+c\).

Let us denote the new \(y\)-values by \(y'_i = y_i + c\). Then the new RSS is \[\begin{align*} \text{RSS}'(\beta_0, \beta_1) &= \sum_{i=1}^n (y'_i - (\beta_0 + \beta_1 x_i))^2 \\ &= \sum_{i=1}^n ((y_i + c) - (\beta_0 + \beta_1 x_i))^2 \\ &= \sum_{i=1}^n (y_i - ((\beta_0 - c) + \beta_1 x_i))^2. \end{align*}\] This is the original RSS evaluated at \((\beta_0 - c, \beta_1)\). It is therefore minimised when \(\beta_0 - c = \hat\beta_0\) and \(\beta_1 = \hat\beta_1\), so the new slope estimate equals \(\hat\beta_1\) and the new intercept estimate equals \(\hat\beta_0 + c\).
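As a quick numerical sanity check (not required for the proof), the following R sketch with arbitrary simulated data confirms that adding a constant to the response changes only the intercept.

```r
# Sanity check: shifting y by a constant c changes only the intercept.
set.seed(1)
x <- runif(20)
y <- 2 + 3 * x + rnorm(20)
c <- 5

coef(lm(y ~ x))
coef(lm(I(y + c) ~ x))  # slope unchanged, intercept increased by c = 5
```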

  2. Now consider shifting all \(x\)-values by a constant \(d\) (i.e., replace each \(x_i\) with \(x_i+d\)). How does this affect the least squares estimates \(\hat\beta_0\) and \(\hat\beta_1\)? Express the new estimates in terms of the original estimates and the constant \(d\).

Let us denote the new \(x\)-values by \(x'_i = x_i + d\). Then the new RSS is \[\begin{align*} \text{RSS}'(\beta_0, \beta_1) &= \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 (x_i + d)))^2 \\ &= \sum_{i=1}^n (y_i - ((\beta_0 + \beta_1 d) + \beta_1 x_i))^2. \end{align*}\] This is the original RSS evaluated at \((\beta_0 + \beta_1 d, \beta_1)\), so it is minimised when \(\beta_0 + \beta_1 d = \hat\beta_0\) and \(\beta_1 = \hat\beta_1\). The new estimates are therefore \[\begin{align*} \hat\beta_1' &= \hat\beta_1, \\ \hat\beta_0' &= \hat\beta_0 - \hat\beta_1 d. \end{align*}\] The slope remains unchanged, but the intercept changes so as to describe the same fitted line in the shifted coordinate system.
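Similarly, a short R check with arbitrary simulated data illustrates the effect of shifting the covariate:

```r
# Sanity check: shifting x by a constant d leaves the slope unchanged and
# changes the intercept from beta0-hat to beta0-hat - beta1-hat * d.
set.seed(2)
x <- runif(20)
y <- 2 + 3 * x + rnorm(20)
d <- 10

fit <- lm(y ~ x)
coef(fit)
coef(lm(y ~ I(x + d)))            # new intercept and slope
coef(fit)[1] - coef(fit)[2] * d   # predicted new intercept
```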

2. Consider the simple linear regression model \(y_i = \beta_0 + x_{i} \beta_1 + \varepsilon_i\) for \(i \in \{1, 2, \ldots, n\}\) and let \(X\) be the design matrix.

  1. Show that \(\displaystyle X^\top X = \begin{pmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{pmatrix} \in \mathbb{R}^{2\times 2}\).

For simple linear regression, we have \(p=1\) and the design matrix is \[\begin{equation*} X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}. \end{equation*}\] Thus we have \[\begin{align*} X^\top X &= \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix} \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \\ &= \begin{pmatrix} \sum_{i=1}^n 1 & \sum_{i=1}^n 1 \cdot x_i \\ \sum_{i=1}^n x_i \cdot 1 & \sum_{i=1}^n x_i^2 \end{pmatrix} \\ &= \begin{pmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{pmatrix}. \end{align*}\]
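This identity is easy to verify numerically; the following R sketch uses a small, arbitrary vector of \(x\)-values.

```r
# Verify the form of X^T X on a small example.
x <- c(1.2, 0.7, 3.1, 2.4, 1.8)
n <- length(x)
X <- cbind(1, x)   # design matrix with an intercept column

crossprod(X)                                       # computes t(X) %*% X
matrix(c(n, sum(x), sum(x), sum(x^2)), nrow = 2)   # same entries
```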

  2. Using the formula \[\begin{equation*} \begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad-bc} \begin{pmatrix} \;d & -b \\ -c & \,a \end{pmatrix}, \end{equation*}\] find \((X^\top X)^{-1}\).

Using the formula for the inverse of a \(2\times 2\)-matrix, we find \[\begin{align*} (X^\top X)^{-1} &= \begin{pmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{pmatrix}^{-1} \\ &= \frac{1}{n \sum_{i=1}^n x_i^2 - \bigl(\sum_{i=1}^n x_i\bigr)^2} \begin{pmatrix} \sum_{i=1}^n x_i^2 & -\sum_{i=1}^n x_i \\ -\sum_{i=1}^n x_i & n \end{pmatrix}. \end{align*}\]
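We can also compare this expression with R's built-in matrix inverse; the example below reuses the same arbitrary \(x\)-values as above but is self-contained.

```r
# Compare the explicit 2x2 inverse with solve().
x <- c(1.2, 0.7, 3.1, 2.4, 1.8)
n <- length(x)
X <- cbind(1, x)

delta <- n * sum(x^2) - sum(x)^2   # determinant of X^T X
matrix(c(sum(x^2), -sum(x), -sum(x), n), nrow = 2) / delta
solve(crossprod(X))                # agrees with the line above
```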

  3. Find \(X^\top y\) and use this to derive an explicit formula for the least squares estimate \(\hat\beta = (X^\top X)^{-1} X^\top y\).

Omitting the indices in the sums for brevity, we have \[\begin{equation*} X^\top y = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix} \end{equation*}\] and thus \[\begin{align*} \hat\beta &= (X^\top X)^{-1} X^\top y \\ &= \frac{1}{n \sum x_i^2 - \bigl(\sum x_i\bigr)^2} \begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix} \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix} \\ &= \frac{1}{n \sum x_i^2 - \bigl(\sum x_i\bigr)^2} \begin{pmatrix} \sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i \\ -\sum x_i \sum y_i + n \sum x_i y_i \end{pmatrix} \\ &= \frac{1}{\frac1n \sum_{i=1}^n x_i^2 - \bigl(\frac1n \sum_{i=1}^n x_i\bigr)^2} \begin{pmatrix} \frac1n \sum_{i=1}^n x_i^2 \cdot \frac1n\sum_{i=1}^n y_i - \frac1n\sum_{i=1}^n x_i \cdot \frac1n\sum_{i=1}^n x_i y_i \\ \frac1n\sum_{i=1}^n x_i y_i - \frac1n\sum_{i=1}^n x_i \cdot \frac1n\sum_{i=1}^n y_i \end{pmatrix}, \end{align*}\] where in the last step we divided both the numerator and the denominator by \(n^2\).

Inspection of the final result shows that we have recovered the traditional formula for the coefficients in simple linear regression, only written in a slightly unusual form. For example, the term \(\frac1n \sum_{i=1}^n x_i^2 - \bigl(\frac1n \sum_{i=1}^n x_i\bigr)^2\) equals the sample variance of the \(x_i\), up to a factor of \(n/(n-1)\). The algebra in this answer could be simplified slightly by changing to centred coordinates \(\tilde x_i = x_i - \bar x\) and \(\tilde y_i = y_i - \bar y\) before fitting the regression model.
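To check the explicit formula, we can compare it with the coefficients returned by lm() on simulated data (the simulation setup below is arbitrary).

```r
# Compare the explicit least squares formula with lm().
set.seed(3)
x <- runif(30)
y <- 1 + 2 * x + rnorm(30)

denom <- mean(x^2) - mean(x)^2
beta0 <- (mean(x^2) * mean(y) - mean(x) * mean(x * y)) / denom
beta1 <- (mean(x * y) - mean(x) * mean(y)) / denom

c(beta0, beta1)
coef(lm(y ~ x))   # agrees with the explicit formula
```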

3. Let \(X\) be the design matrix of a model including an intercept, and let \(H = X (X^\top X)^{-1} X^\top \in\mathbb{R}^{n\times n}\) be the hat matrix. Finally, let \(\mathbf{1} = (1, 1, \ldots, 1) \in\mathbb{R}^n\). Show that \(H \mathbf{1} = \mathbf{1}\).

We already know that \(H X = X (X^\top X)^{-1} X^\top X = X\). Since the model includes an intercept, the first column of \(X\) is \(\mathbf{1}\), and comparing the first columns on both sides of \(HX = X\) gives \(H\mathbf{1} = \mathbf{1}\). This completes the proof.
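A quick numerical illustration, using an arbitrary design matrix that contains an intercept column, is given below.

```r
# Numerical check that H %*% 1 = 1 when X contains an intercept column.
set.seed(4)
X <- cbind(1, runif(10), rnorm(10))       # intercept plus two covariates
H <- X %*% solve(crossprod(X)) %*% t(X)   # hat matrix
ones <- rep(1, nrow(X))

range(H %*% ones - ones)   # zero, up to floating-point error
```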

4. For the stackloss dataset built into R, predict a value for stack.loss when the inputs are Air.Flow = 60, Water.Temp = 21 and Acid.Conc = 87.

We can fit the model using lm() as usual:

m <- lm(stack.loss ~ ., data = stackloss)
m

Call:
lm(formula = stack.loss ~ ., data = stackloss)

Coefficients:
(Intercept)     Air.Flow   Water.Temp   Acid.Conc.  
   -39.9197       0.7156       1.2953      -0.1521  

To predict a new value, we use the predict() command. Note that Acid.Conc. is spelled with a trailing dot, which we need to include in the name when we supply the new input values here.

predict(m, data.frame(Air.Flow = 60, Water.Temp = 21, Acid.Conc. = 87))
       1 
16.98509 

Thus, the predicted value for stack.loss is \(16.98509\).
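Equivalently, since the new input corresponds to the row \((1, 60, 21, 87)\) of the design matrix, we can compute the same prediction directly from the fitted coefficients (continuing with the model m fitted above):

```r
# Compute the prediction by hand from the coefficient vector.
x_new <- c(1, 60, 21, 87)   # intercept, Air.Flow, Water.Temp, Acid.Conc.
sum(coef(m) * x_new)        # matches the output of predict()
```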

5. Let \(\varepsilon_1, \ldots, \varepsilon_n \sim \mathcal{N}(\mu, \sigma^2)\) be independent. Then \(\varepsilon= (\varepsilon_1, \ldots, \varepsilon_n)\) is a random vector. Determine \(\mathbb{E}(\varepsilon)\), \(\mathop{\mathrm{Cov}}(\varepsilon)\) and \(\mathbb{E}\bigl( \|\varepsilon\|^2 \bigr)\).

We immediately find \(\mathbb{E}(\varepsilon) = \mu \mathbf{1}\), where \(\mathbf{1} = (1, \ldots, 1) \in\mathbb{R}^n\). Since the \(\varepsilon_i\) are independent, the off-diagonal entries of \(\mathop{\mathrm{Cov}}(\varepsilon)\) are zero, and the diagonal entries equal \(\mathop{\mathrm{Var}}(\varepsilon_i) = \sigma^2\), so \(\mathop{\mathrm{Cov}}(\varepsilon) = \sigma^2 I\). Finally we have \[\begin{align*} \mathbb{E}\bigl(\|\varepsilon\|^2\bigr) &= \mathbb{E}\bigl( \sum_{i=1}^n \varepsilon_i^2 \bigr) \\ &= \sum_{i=1}^n \mathbb{E}(\varepsilon_i^2) \\ &= n \mathbb{E}(\varepsilon_1^2). \end{align*}\] Since \(\mathbb{E}(\varepsilon_1^2) = \mathop{\mathrm{Var}}(\varepsilon_1) + \mathbb{E}(\varepsilon_1)^2 = \sigma^2 + \mu^2\), we get \(\mathbb{E}\bigl(\|\varepsilon\|^2\bigr) = n (\sigma^2 + \mu^2)\).
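A small Monte Carlo experiment in R can be used to check the final result; the parameter values below are arbitrary.

```r
# Monte Carlo check of E(||eps||^2) = n * (sigma^2 + mu^2).
set.seed(5)
n <- 10; mu <- 2; sigma <- 3
eps <- matrix(rnorm(n * 1e5, mean = mu, sd = sigma), ncol = n)

mean(rowSums(eps^2))    # simulated average of ||eps||^2
n * (sigma^2 + mu^2)    # theoretical value: 10 * (9 + 4) = 130
```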