Practical

The MATH3714 and MATH5714M modules are assessed by an examination (80%) and a practical (20%). This is the practical, worth 20% of your final module mark.
You must hand in your solution via Gradescope by Thursday, 5th December 2023, 5pm.
Reports must be typeset (not handwritten) and should be no more than 6 pages in length (8 pages for MATH5714M).
Within reason you may talk to your friends about this piece of work, but you should not send R code (or output) to each other. Your report must be your own work.

Dataset

The Met Office historic station dataset contains long-term weather measurements from a network of weather stations across the UK. The full dataset includes records from 37 stations, with some series extending more than 100 years – the oldest stations (Oxford and Armagh) began recording in 1853. Each station is identified by its name and precise geographical coordinates.

The dataset contains five key monthly measurements:

Mean daily maximum temperature (tmax)
Mean daily minimum temperature (tmin)
Days of air frost (af)
Total rainfall (rain)
Total sunshine duration (sun)

For the practical we will consider a subset of this dataset. Please download the practical data from here:

https://teaching.seehuhn.de/data/historic-station-data/

Guidance

This practical is deliberately open-ended: there is no single right or wrong answer. You will sometimes need to choose your own approach and justify your decisions.
Include relevant R code in your report. The reader should be able to replicate your analysis and get the same results.
Use the page limit to guide you in deciding how much detail to include.
Only include code, plots and output that are relevant to your discussion.
Where multiple plots demonstrate similar points, include only the most illustrative example.

Tasks

Task 1 (15 marks)

Fit a linear model relating average maximum temperature (tmax) to year. Create a scatter plot showing the data and fitted line. Test whether there is a significant relationship between year and maximum temperature at the 5% significance level. State your conclusion in context.

Marking criteria:

Correct model fitting
Appropriate scatter plot
Correct hypothesis test
Clear conclusion
Presentation of results
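
As a rough illustration of the mechanics only, the sketch below fits a straight line and extracts the slope test on made-up data. The data frame d, its column names (year, month, tmax) and all numbers in it are assumptions for illustration; they are not the practical dataset, and the practical data may need different handling.

```r
# Synthetic stand-in data (NOT the practical dataset): 30 years of monthly
# values with a seasonal cycle and a small upward trend.
# The column names year, month and tmax are assumptions.
set.seed(1)
d <- data.frame(year = rep(1990:2019, each = 12),
                month = rep(1:12, times = 30))
d$tmax <- 10 + 0.03 * (d$year - 1990) -
  7 * cos(2 * pi * (d$month - 1) / 12) + rnorm(nrow(d), sd = 1.5)

# Simple linear regression of tmax on year.
m1 <- lm(tmax ~ year, data = d)

# Scatter plot of the data with the fitted line.
plot(tmax ~ year, data = d, xlab = "Year", ylab = "tmax")
abline(m1, col = "red", lwd = 2)

# t-test of H0: slope = 0; compare the p-value with 0.05.
slope_test <- summary(m1)$coefficients["year", ]
print(slope_test)
p_value <- slope_test[["Pr(>|t|)"]]
```

The p-value on its own is not a conclusion; the report still needs to say what it means for maximum temperature over time.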

Task 2 (25 marks)

The data shows strong seasonal variation. Explain how this affects the results from task 1. Fit an improved model for maximum temperature, using the year and month as inputs. Using plots and numerical summaries, demonstrate that the new model is an improvement over the original model.

Marking criteria:

Brief explanation of how seasonality affects task 1 results
Correct model specification with month as a factor
Appropriate model diagnostics
Clear explanation of model improvements
Presentation of results
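
One possible way to set up such a comparison is sketched below, again on made-up data with assumed column names (year, month, tmax). It treats month as a factor and compares the two nested models with an F-test, adjusted R-squared and residual plots; other summaries may be equally appropriate.

```r
# Synthetic stand-in data (NOT the practical dataset).
set.seed(1)
d <- data.frame(year = rep(1990:2019, each = 12),
                month = rep(1:12, times = 30))
d$tmax <- 10 + 0.03 * (d$year - 1990) -
  7 * cos(2 * pi * (d$month - 1) / 12) + rnorm(nrow(d), sd = 1.5)

m1 <- lm(tmax ~ year, data = d)                  # year only
m2 <- lm(tmax ~ year + factor(month), data = d)  # year plus month as a factor

# F-test for the nested models and adjusted R-squared for each.
print(anova(m1, m2))
adj_r2 <- c(m1 = summary(m1)$adj.r.squared,
            m2 = summary(m2)$adj.r.squared)
print(adj_r2)

# Residuals against fitted values, side by side.
par(mfrow = c(1, 2))
plot(fitted(m1), resid(m1), main = "Year only",
     xlab = "Fitted", ylab = "Residual")
plot(fitted(m2), resid(m2), main = "Year + month",
     xlab = "Fitted", ylab = "Residual")
```

Wrapping month in factor() gives each month its own level shift; leaving it numeric would instead force a linear effect across months 1 to 12.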

Task 3 (20 marks)

Using the weather station as another categorical variable, fit a model for maximum temperature which uses the year, month and station as inputs. Discuss whether there is any evidence that the temperature trend varies between stations. Discuss your results in context.

Marking criteria:

Correct model specification including station effects
Appropriate analysis and discussion
Presentation of results
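
A minimal sketch of one way to examine station-specific trends, under the assumption that the question is framed as an interaction between year and station. The data, the station labels A, B and C, and the column names are made up for illustration.

```r
# Synthetic stand-in data (NOT the practical dataset): three hypothetical
# stations sharing the same trend but with different overall levels.
set.seed(1)
d <- expand.grid(month = 1:12, year = 1990:2019,
                 station = c("A", "B", "C"))
offset <- unname(c(A = 0, B = 1, C = -1)[as.character(d$station)])
d$tmax <- 10 + offset + 0.03 * (d$year - 1990) -
  7 * cos(2 * pi * (d$month - 1) / 12) + rnorm(nrow(d), sd = 1.5)

# Additive model: a common trend and a separate level for each station.
m3 <- lm(tmax ~ year + factor(month) + station, data = d)

# Interaction model: the slope for year may differ between stations.
m3i <- lm(tmax ~ year * station + factor(month), data = d)

# F-test of the additive model against the interaction model.
cmp <- anova(m3, m3i)
print(cmp)
```

A small p-value in the comparison would suggest that the year slope differs between stations. Whether that difference is practically meaningful is a separate question for the discussion.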

Task 4 (20 marks)

Modify the model from task 2 (using the year and month as inputs) to include a quadratic term for the year. Explain why this might be a good idea. Using plots and numerical summaries, discuss whether the quadratic term improves the model fit.

Marking criteria:

Correct model specification with clear justification
Appropriate model diagnostics
Presentation of results
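
The following sketch shows one way to add a quadratic term, using made-up data and assumed column names. It uses poly(year, 2) rather than year + I(year^2); this is a design choice in the sketch, not a requirement of the task, made because raw year and year squared are almost perfectly collinear when the years are around 2000.

```r
# Synthetic stand-in data (NOT the practical dataset).
set.seed(1)
d <- data.frame(year = rep(1990:2019, each = 12),
                month = rep(1:12, times = 30))
d$tmax <- 10 + 0.03 * (d$year - 1990) -
  7 * cos(2 * pi * (d$month - 1) / 12) + rnorm(nrow(d), sd = 1.5)

m2 <- lm(tmax ~ year + factor(month), data = d)           # linear in year
m4 <- lm(tmax ~ poly(year, 2) + factor(month), data = d)  # quadratic in year

# F-test for the extra quadratic term, plus AIC for both models.
cmp <- anova(m2, m4)
print(cmp)
print(AIC(m2, m4))

# Residuals of the linear-trend model against year; curvature left in this
# plot would hint that a quadratic term is worth trying.
plot(d$year, resid(m2), xlab = "Year", ylab = "Residual (linear trend)")
abline(h = 0, lty = 2)
```

Both parameterisations span the same model, so the fitted values and the F-test agree; only the individual coefficients differ in interpretation.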

Task 5 (MATH5714M only, 10 marks)

Use kernel density estimation to estimate the distribution of the residuals for the model from task 2. Based on your results, discuss whether the residuals are normally distributed.

Marking criteria:

Correct use of kernel density estimation
Appropriate choice of bandwidth
Clear discussion of normality
Presentation of results
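
As a sketch of the mechanics, the code below computes a kernel density estimate of the residuals and overlays a normal density with matching standard deviation. The data are made up, the column names are assumptions, and the Sheather-Jones bandwidth (bw = "SJ") is just one option among several; the choice of bandwidth should be justified in the report.

```r
# Synthetic stand-in data (NOT the practical dataset).
set.seed(1)
d <- data.frame(year = rep(1990:2019, each = 12),
                month = rep(1:12, times = 30))
d$tmax <- 10 + 0.03 * (d$year - 1990) -
  7 * cos(2 * pi * (d$month - 1) / 12) + rnorm(nrow(d), sd = 1.5)

m2 <- lm(tmax ~ year + factor(month), data = d)
r <- resid(m2)

# Kernel density estimate with a Gaussian kernel (the default) and the
# Sheather-Jones bandwidth selector.
kde <- density(r, bw = "SJ")
print(kde$bw)

# Compare the estimate with a normal density of the same spread.
plot(kde, main = "Residual density")
curve(dnorm(x, mean = 0, sd = sd(r)), add = TRUE, lty = 2)
legend("topright", legend = c("KDE", "Normal"), lty = c(1, 2))
```

Repeating the plot with a few different bandwidths shows how sensitive the visual impression is to that choice.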

Task 6 (MATH5714M only, 10 marks)

The column af in the dataset records the number of days with air frost. Using either Nadaraya-Watson kernel regression or local polynomial regression, fit a model for af as a function of tmax. Create a plot showing both the data and the fitted regression curve. Justify your choice of regression method and bandwidth. Discuss your results in context.

Marking criteria:

Correct model fitting
Appropriate choice of regression bandwidth
Clear justification of regression method used
Appropriate analysis and discussion
Presentation of results
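
The two methods named in the task can be sketched in base R as follows, using ksmooth() for the Nadaraya-Watson estimator and loess() with degree = 1 for a local linear fit. The data frame w is made up, and the bandwidth and span values are arbitrary placeholders, not recommendations.

```r
# Synthetic stand-in data (NOT the practical dataset): frost days fall as
# tmax rises and are cut off at zero.
set.seed(1)
w <- data.frame(tmax = runif(300, min = 0, max = 25))
w$af <- pmax(0, round(20 - 2.5 * w$tmax + rnorm(300, sd = 2)))

# Evaluate both fits on a grid inside the range of the data.
grid <- seq(min(w$tmax), max(w$tmax), length.out = 200)

# Nadaraya-Watson estimate with a Gaussian kernel.
nw <- ksmooth(w$tmax, w$af, kernel = "normal", bandwidth = 3,
              x.points = grid)

# Local linear regression; span plays the role of the bandwidth.
ll <- loess(af ~ tmax, data = w, span = 0.5, degree = 1)
ll_fit <- predict(ll, newdata = data.frame(tmax = grid))

plot(af ~ tmax, data = w, col = "grey", xlab = "tmax", ylab = "af")
lines(nw, col = "blue", lwd = 2)
lines(grid, ll_fit, col = "red", lwd = 2, lty = 2)
legend("topright", legend = c("Nadaraya-Watson", "Local linear"),
       col = c("blue", "red"), lty = c(1, 2), lwd = 2)
```

Note that ksmooth() scales its kernel so that the quartiles sit at plus and minus 0.25 times bandwidth, so its bandwidth argument is not directly comparable with the bw value used by density().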