  • The MATH3714 and MATH5714M modules are assessed by an examination (80%) and a practical (20%). This is the practical, worth 20% of your final module mark.
  • You must hand in your solution via Gradescope by Thursday, 14th December 2023, 2pm.
  • Reports must be typeset (not handwritten) and should be no more than 7 pages in length (9 pages for MATH5714M).
  • Within reason you may talk to your friends about this piece of work, but you should not send R code (or output) to each other. Your report must be your own work.


In this practical we examine a dataset which lists house prices in California (from the 1990 census) together with a variety of related variables. You can download the dataset from here:

For most tasks you will only need the “training dataset”. The “test dataset” is only used in task 3.


The aim of the practical is to fit an appropriate model to these data, which predicts the median house value medianHouseValue from the other variables.

This practical is deliberately open-ended, with little guidance on how to proceed. There is no single right or wrong answer to this practical. The important thing is that you justify your approach.

Task 1. We start by considering medianHouseValue as a function of medianIncome only.

  • Using the training dataset, fit a linear model which can describe the relationship between medianHouseValue (response) and medianIncome (input).

  • Using appropriate diagnostics, discuss how well the model fits the data.

  • Determine a 95% confidence interval for the intercept in your model.
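
As a reminder, and assuming the usual model with independent, normally distributed errors of constant variance, a 95% confidence interval for the intercept has the standard form sketched below. The notation is an assumption and is not taken from this sheet: $\hat\beta_0$ is the estimated intercept, $n$ is the number of observations in the training dataset, $t_{n-2}(0.975)$ is the 0.975-quantile of the $t$-distribution with $n-2$ degrees of freedom, and $\widehat{\mathrm{se}}(\hat\beta_0)$ is the estimated standard error of $\hat\beta_0$.

\[ \hat\beta_0 \pm t_{n-2}(0.975) \, \widehat{\mathrm{se}}(\hat\beta_0) \]

How far such an interval can be trusted depends on whether the diagnostics support the model assumptions.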

Task 2. Now we also consider the remaining variables in the dataset.

  • With due consideration to

    • transformations (where, and if, necessary)
    • appropriate choice of variables
    • model selection
    • model checking
    • etc.

    obtain a model which is able to predict medianHouseValue using some or all of these additional variables.

  • Discuss the special role of latitude and longitude and whether/how these two variables should be included in a model.

  • Justify your choice of model by comparing at least two “competing” models. The comparison should take note of at least (a) model selection criteria, (b) diagnostics, and (c) interpretability.

  • Interpret the parameters in your preferred model.

Task 3. Now also load the test dataset into R.

  • For each sample in the test dataset, use your model (still trained on the training dataset) to predict medianHouseValue.

  • Compare the predicted values to the actual values by computing the mean squared error \[ \mathrm{MSE} = \frac{1}{n_\mathrm{test}} \sum_{i=1}^{n_\mathrm{test}} (\hat y_i - y_i)^2 \]

  • Comment on the result.
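
The MSE formula above can be illustrated with a minimal R sketch. The data frames `train` and `test`, the column names `x` and `y`, and all numbers are made up for illustration. They are not the practical's datasets, and the one-variable model is only a placeholder for whichever model you choose.

```r
# Toy stand-ins for the training and test data (made-up numbers).
train <- data.frame(x = c(1, 2, 3, 4, 5),
                    y = c(2.1, 3.9, 6.2, 7.8, 10.1))
test <- data.frame(x = c(1.5, 2.5, 3.5),
                   y = c(3.0, 5.1, 6.9))

# Fit the model using the training data only.
fit <- lm(y ~ x, data = train)

# Predict the response for the test inputs.
y_hat <- predict(fit, newdata = test)

# Mean squared error: average squared difference between
# predicted and observed responses in the test data.
mse <- mean((y_hat - test$y)^2)
mse
```

Note that the MSE is measured in squared units of the response. If the response was transformed before fitting, the predictions would need to be transformed back before being compared with the actual values.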

Task 4 (only for MATH5714M). Use the Nadaraya-Watson estimator to estimate the average median house value as a function of latitude.

  • Create a plot of your estimate.

  • Carefully explain how you chose the bandwidth for your estimator.

  • Discuss your results.
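
For reference, one common way of writing the Nadaraya-Watson estimator is sketched below. The notation is an assumption and is not taken from this sheet: $x_i$ denotes the latitude and $y_i$ the median house value of observation $i$, $K$ is a kernel function, and $h > 0$ is the bandwidth.

\[ \hat m_h(x) = \frac{\sum_{i=1}^n K_h(x - x_i) \, y_i}{\sum_{j=1}^n K_h(x - x_j)}, \qquad K_h(u) = \frac{1}{h} K\Bigl(\frac{u}{h}\Bigr) \]

The estimate at $x$ is a weighted average of the $y_i$, with the largest weights on observations whose latitude is close to $x$. A small $h$ follows the data closely and gives a rough curve, while a large $h$ averages over a wider range of latitudes and gives a smoother one.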