• The MATH3714 and MATH5714M modules are assessed by an examination (80%) and a practical (20%). This is the practical, worth 20% of your final module mark.
  • You must hand in your solution by Thursday, 9th December 2021, 5pm, via Gradescope.
  • Reports must be typeset (not handwritten) and should be no more than 10 pages in length (but can be shorter).
  • Within reason you may talk to your friends about this piece of work, but you should not send R code (or output) to each other. Your report must be only your own work.


In this practical we are interested in factors which influence life expectancy at birth. We consider the following dataset:

You can read the data into R using the following command:

d <- read.csv("",
              stringsAsFactors = TRUE)

The dataset contains the following variables:

  • country: country
  • region: geographic region the country is in
  • sub.region: geographic sub-region the country is in
  • year: year the data refers to
  • gdp.per.capita: Per capita GDP at current prices (USD)
  • health.spending: Per capita government expenditure on health at average exchange rate (USD, only until 2010)
  • hiv: Prevalence of HIV among adults aged 15 to 49 (%)
  • alcohol: Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
  • tobacco: Prevalence of current tobacco use among adults aged \(\geq 15\) years
  • pop.size: population size, for the given country and year (thousands)
  • pop.density: Population per square kilometre, for the given country and year (thousands)
  • life.expectancy: Life expectancy at birth (years). This is the variable we are interested in.
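
Before fitting anything, it usually helps to check variable types and missing values. The following is a minimal sketch on a made-up stand-in data frame: d_toy and all of its values are hypothetical, and only the column names are taken from the list above. With the real data you would apply the same calls to d.

```r
# Toy stand-in for the real data: column names follow the list above,
# but every value here is invented for illustration.
d_toy <- data.frame(
  country = factor(c("A", "B", "C", "D")),
  year = c(2010, 2010, 2015, 2015),
  gdp.per.capita = c(800, 4000, 12000, 45000),
  health.spending = c(20, 150, NA, NA),  # mimics "only until 2010"
  life.expectancy = c(58.1, 69.4, 74.0, 81.2)
)

str(d_toy)      # variable types (factor vs. numeric)
summary(d_toy)  # ranges, plus the number of NAs in each column

# Number of missing values per variable
na_counts <- colSums(is.na(d_toy))
print(na_counts)
```

A variable that is missing for whole blocks of years (as health.spending is described above) needs a deliberate decision: drop the variable, or restrict the analysis to the years where it is available.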


The aim of this practical is to fit an appropriate model to these data, which predicts life expectancy (life.expectancy) from the other variables.

This practical is deliberately open-ended, with little guidance on how to proceed.

Task 1. We start by considering life.expectancy as a function of gdp.per.capita only.

  • Using appropriate transformations of the data, find a linear model which can describe the relationship between life.expectancy (response) and gdp.per.capita (input).

  • Using appropriate diagnostics, confirm that your model is acceptable. In your answer, you only need to describe your final model, and one other model which you have examined, but deemed less appropriate.

  • Using your model, obtain a 95% confidence interval for the mean (expected) life expectancy at birth, for a country with a per capita GDP of 4000 USD.
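
The mechanics of the three steps above can be sketched in R as follows. This is an illustration on simulated data, not a model answer: the data frame sim, the assumed log-linear relationship, and the object names m_raw, m_log and ci are all hypothetical, and the transformation that suits the real data has to be justified from your own plots.

```r
# Simulated stand-in data: life expectancy is generated as linear in
# log(GDP), which is an assumption made purely for this illustration.
set.seed(1)
n <- 200
gdp <- exp(runif(n, log(300), log(60000)))
sim <- data.frame(
  gdp.per.capita = gdp,
  life.expectancy = 30 + 4.5 * log(gdp) + rnorm(n, sd = 3)
)

# Two candidate models: untransformed input vs. log-transformed input
m_raw <- lm(life.expectancy ~ gdp.per.capita, data = sim)
m_log <- lm(life.expectancy ~ log(gdp.per.capita), data = sim)
r2 <- c(raw = summary(m_raw)$r.squared, log = summary(m_log)$r.squared)
print(r2)

# Basic diagnostics for the log model: residuals vs. fitted values,
# and a normal Q-Q plot of the residuals
if (!interactive()) pdf(NULL)  # avoid writing a plot file in script mode
par(mfrow = c(1, 2))
plot(fitted(m_log), resid(m_log),
     xlab = "fitted values", ylab = "residuals")
abline(h = 0, lty = 2)
qqnorm(resid(m_log))
qqline(resid(m_log))

# 95% confidence interval for the mean response at 4000 USD per capita
new <- data.frame(gdp.per.capita = 4000)
ci <- predict(m_log, newdata = new, interval = "confidence", level = 0.95)
print(ci)
```

Note that interval = "confidence" gives an interval for the mean response, which is what the task asks for. Using interval = "prediction" instead would give the wider interval for the life expectancy of a single new country.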

Task 2. Now we also consider the remaining variables in the dataset.

  • With due consideration to

    • transformations (where, and if, necessary)
    • appropriate choice of variables
    • model selection
    • model checking
    • variable selection
    • missing data
    • etc.

    obtain a model which is able to predict life.expectancy from some or all of the other variables.

  • Justify your choice by comparing at least two “competing” models. The comparison should take note of at least (a) model selection criteria, (b) diagnostics, and (c) interpretability.

  • Interpret the parameters in your preferred model.
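
One way to set up such a comparison is sketched below, again on simulated data. The data frame sim, the coefficients used to generate it, and the names m_small, m_big and m_step are hypothetical; the sketch uses complete-case analysis and backward stepwise selection as one possible approach among several, not as the recommended one.

```r
# Simulated stand-in data with some missing values in one input
set.seed(2)
n <- 300
sim <- data.frame(
  gdp.per.capita = exp(runif(n, log(300), log(60000))),
  hiv = rexp(n, rate = 1),
  alcohol = runif(n, 0, 15)
)
sim$life.expectancy <- 32 + 4.5 * log(sim$gdp.per.capita) -
  1.5 * sim$hiv + rnorm(n, sd = 3)
sim$hiv[sample(n, 30)] <- NA

# lm() silently drops rows containing NA, so two models using different
# variables may be fitted to different rows. Criteria such as AIC are
# only comparable when both models use the same observations.
cc <- sim[complete.cases(sim), ]

m_small <- lm(life.expectancy ~ log(gdp.per.capita), data = cc)
m_big <- lm(life.expectancy ~ log(gdp.per.capita) + hiv + alcohol,
            data = cc)

# (a) model selection criteria, side by side
comparison <- data.frame(
  model = c("m_small", "m_big"),
  adj.r2 = c(summary(m_small)$adj.r.squared,
             summary(m_big)$adj.r.squared),
  AIC = c(AIC(m_small), AIC(m_big)),
  BIC = c(BIC(m_small), BIC(m_big))
)
print(comparison)

# Backward stepwise selection by AIC, starting from the larger model
m_step <- step(m_big, direction = "backward", trace = 0)
print(coef(summary(m_step)))
```

For interpretation, recall that with a log-transformed input the coefficient describes the effect of a relative change: if the coefficient of log(gdp.per.capita) is b, then a 1% increase in per capita GDP is associated with a change of about b/100 years in expected life expectancy, holding the other inputs fixed.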

There is no single right or wrong answer to this practical. The important thing is that you justify your approach.