• The MATH3714 and MATH5714M modules are assessed by an examination (80%) and a practical (20%). This is the practical, worth 20% of your final module mark.
  • You must hand in your solution (printed on paper) by Friday, 16th December 2022, 2pm. The easiest way to hand in is to pass your report to me after one of the lectures. Alternatively, I will also set up a pigeonhole for you to drop your report into.
  • Reports must be typeset (not handwritten) and should be no more than 8 pages in length (but can be shorter). Your report must show your name and student ID on the front page.
  • Within reason you may talk to your friends about this piece of work, but you should not send R code (or output) to each other. Your report must be your own work.


In this practical we are interested in factors which influence the life expectancy at birth. We consider the following dataset:

You can read the data into R using the following command:

d <- read.csv("",
              stringsAsFactors = TRUE)

The dataset contains the following variables:

  • country: country
  • region: geographic region the country is in
  • sub.region: geographic sub-region the country is in
  • year: year the data refers to
  • GDP.per.capita: Per capita GDP at current prices (USD)
  • health.spending: Per capita government expenditure on health at average exchange rate (USD, only until 2010)
  • HIV: Prevalence of HIV among adults aged 15 to 49 (%)
  • alcohol: Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
  • tobacco: Prevalence of current tobacco use among adults aged \(\geq 15\) years
  • population.size: population size, for the given country and year (thousands)
  • population.density: Population per square kilometre, for the given country and year (thousands)
  • life.expectancy: Life expectancy at birth (years). This is the variable we are interested in.


The aim of this practical is to fit an appropriate model to these data, which predicts life expectancy (life.expectancy) from the other variables.

This practical is deliberately open-ended, with little guidance on how to proceed.

Task 1. We start by considering life.expectancy as a function of GDP.per.capita only.

  • Using appropriate transformations of the data, find a linear model which can describe the relationship between life.expectancy (response) and GDP.per.capita (input).

  • Using appropriate diagnostics, confirm that your model is acceptable. In your answer, you only need to describe your final model and one other model which you have examined, but deemed less appropriate.

  • Using your model, obtain a 95% confidence interval for the mean (expected) life expectancy at birth, for a country with a per capita GDP of 5000 USD.
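
The mechanics of these steps can be sketched in R as follows. This is only an illustration under stated assumptions: the data frame toy is synthetic (the numbers are made up and are not the practical's data), and the log transformation is just one candidate which you would need to justify or replace using your own diagnostics.

```r
# Synthetic stand-in for the real dataset (assumption: values are invented)
set.seed(1)
n <- 100
gdp <- exp(runif(n, log(500), log(50000)))
toy <- data.frame(
  GDP.per.capita = gdp,
  life.expectancy = 40 + 4 * log(gdp) + rnorm(n, sd = 3)
)

# Linear model with a transformed input; the log is only an example choice
m <- lm(life.expectancy ~ log(GDP.per.capita), data = toy)

# Two standard diagnostic plots: residuals vs fitted values, and a normal Q-Q plot
par(mfrow = c(1, 2))
plot(m, which = 1)
plot(m, which = 2)

# 95% confidence interval for the mean response at a per capita GDP of 5000 USD
new <- data.frame(GDP.per.capita = 5000)
ci <- predict(m, newdata = new, interval = "confidence", level = 0.95)
print(ci)
```

Note that interval = "confidence" gives an interval for the mean response, whereas interval = "prediction" would give a (wider) interval for a single new observation. If the response itself is transformed, the interval endpoints need to be transformed back to the original scale.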

Task 2. Now we also consider the remaining variables in the dataset.

  • With due consideration to

    • transformations (where, and if, necessary)
    • appropriate choice of variables
    • model selection
    • model checking
    • missing data
    • etc.

    obtain a model which is able to predict life.expectancy using some or all of these additional variables.

  • Note that some variables have missing values. Missing data is indicated by the value NA (not available). You should decide how to deal with missing data, e.g. by ignoring the corresponding samples. You should explain your choice in your report.

  • Justify your choice by comparing at least two “competing” models. The comparison should take note of at least (a) model selection criteria, (b) diagnostics, and (c) interpretability.

  • Interpret the parameters in your preferred model.
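
As a rough sketch of how missing values can be handled and how two competing models can be compared in R, consider the following. The data frame toy is again synthetic, the two models m1 and m2 are arbitrary illustrations, and dropping incomplete rows is only one possible way to treat missing data; none of this is meant as a recommended solution.

```r
# Synthetic stand-in for the real data (assumption: values are invented),
# with some entries deliberately set to NA
set.seed(2)
n <- 120
toy <- data.frame(
  GDP.per.capita = exp(runif(n, log(500), log(50000))),
  HIV = rexp(n, rate = 1),
  alcohol = runif(n, 0, 15)
)
toy$life.expectancy <- 45 + 3.5 * log(toy$GDP.per.capita) -
  1.5 * toy$HIV + rnorm(n, sd = 3)
toy$alcohol[sample(n, 15)] <- NA

# Count the missing values in each variable
print(colSums(is.na(toy)))

# Complete-case analysis: drop every row which contains an NA, so that
# both models below are fitted to exactly the same observations
cc <- na.omit(toy)

m1 <- lm(life.expectancy ~ log(GDP.per.capita), data = cc)
m2 <- lm(life.expectancy ~ log(GDP.per.capita) + HIV + alcohol, data = cc)

# Model selection criteria (smaller AIC/BIC is better) and adjusted R^2
comparison <- data.frame(
  model = c("m1", "m2"),
  AIC = c(AIC(m1), AIC(m2)),
  BIC = c(BIC(m1), BIC(m2)),
  adj.R2 = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared)
)
print(comparison)

# Estimated coefficients with standard errors, as a basis for interpretation
print(summary(m2)$coefficients)
```

Fitting both models to the same rows matters here: lm() silently drops rows with NA in any variable it uses, so models with different inputs could otherwise be fitted to different subsets of the data, and their AIC and BIC values would not be comparable.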

There is no single right or wrong answer to this practical. The important thing is that you justify your approach.