Section 6 Spline Smoothing

In the previous sections we have seen how kernel methods and \(k\)-nearest neighbour methods can be used to estimate the regression function \(m\colon \mathbb{R}\to \mathbb{R}\) from data \((x_1, y_1), \ldots, (x_n, y_n)\). While these methods work well in many situations, they have some limitations:

  • The choice of bandwidth \(h\) or neighbourhood size \(k\) can be difficult.
  • Kernel methods can suffer from boundary effects, where the estimate is biased near the edges of the data range.
  • Both methods are purely local, averaging only nearby observations, which can lead to high variance in regions where the data are sparse.

In this section we introduce an alternative approach based on spline smoothing. Instead of local averaging, we fit a smooth function globally to the entire dataset, using a penalty term to control the smoothness.

6.1 Smoothing Splines

The key idea of spline smoothing is to find a function \(m\) which balances two competing goals:

  1. The function should fit the data well, i.e. the residual sum of squares \(\sum_{i=1}^n \bigl( y_i - m(x_i) \bigr)^2\) should be small.
  2. The function should be smooth, i.e. it should not have too much curvature.

To measure the curvature of a function \(m\), we use the integral of the squared second derivative: \[\begin{equation*} \int_{-\infty}^\infty \bigl( m''(x) \bigr)^2 \,dx. \end{equation*}\] This integral is large if \(m\) has high curvature, and small if \(m\) is nearly linear. (For a linear function \(m(x) = a + bx\) we have \(m''(x) = 0\) and thus the integral equals zero.)
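
As a non-linear example, consider the function which equals \(\sin(x)\) on \([0, 2\pi]\) and continues linearly, with matching value and slope, outside this interval. Then \(m''(x) = -\sin(x)\) on \([0, 2\pi]\) and \(m''(x) = 0\) outside, so the curvature penalty is \[\begin{equation*} \int_{-\infty}^\infty \bigl( m''(x) \bigr)^2 \,dx = \int_0^{2\pi} \sin^2(x) \,dx = \pi. \end{equation*}\]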

Definition 6.1 The smoothing spline estimate for the regression function \(m\) is the function \(\hat m_\lambda\) which minimizes \[\begin{equation} \sum_{i=1}^n \bigl( y_i - m(x_i) \bigr)^2 + \lambda \int_{-\infty}^\infty \bigl( m''(x) \bigr)^2 \,dx \tag{6.1} \end{equation}\] over all twice differentiable functions \(m\colon \mathbb{R}\to \mathbb{R}\). The parameter \(\lambda \geq 0\) is called the smoothing parameter.

The smoothing parameter \(\lambda\) controls the trade-off between fitting the data and smoothness:

  • For \(\lambda = 0\), only the residual sum of squares matters, and the estimate interpolates the data, i.e. it passes through all data points (assuming the \(x_i\) are distinct).
  • For \(\lambda \to \infty\), any curvature is penalized infinitely strongly, so the solution is the least squares regression line (which has zero curvature).
  • For intermediate values of \(\lambda\), the solution balances fit and smoothness.

This is similar to ridge regression (see section 16 of the level 3 notes), where a penalty term \(\lambda \|\beta\|^2\) is added to the residual sum of squares to control the size of the coefficients.
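
For comparison, ridge regression chooses the coefficient vector \(\beta\) to minimize \[\begin{equation*} \|y - X\beta\|^2 + \lambda \|\beta\|^2, \end{equation*}\] so in both cases a single parameter \(\lambda\) controls how strongly the estimate is pulled towards a simple fit: the zero coefficient vector for ridge regression, and a straight line for the smoothing spline.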

6.2 Cubic Splines

Before we can state the solution to the optimization problem in definition 6.1, we need to introduce the concept of a cubic spline.

Definition 6.2 A cubic spline with knots \(\kappa_1 < \kappa_2 < \cdots < \kappa_k\) is a function \(s\colon \mathbb{R}\to \mathbb{R}\) such that

  • \(s\) is a cubic polynomial on each interval \((-\infty, \kappa_1)\), \((\kappa_1, \kappa_2)\), …, \((\kappa_{k-1}, \kappa_k)\), and \((\kappa_k, \infty)\), and
  • \(s\) is twice continuously differentiable, i.e. \(s\), \(s'\) and \(s''\) are all continuous.

A natural cubic spline is a cubic spline which is linear on the intervals \((-\infty, \kappa_1)\) and \((\kappa_k, \infty)\).

The points \(\kappa_1, \ldots, \kappa_k\) are called knots. At each knot, the function \(s\) transitions from one cubic polynomial to another, but the transition is smooth because \(s\), \(s'\) and \(s''\) are all continuous.

Example 6.1 TODO: add a simple example showing a cubic spline with 2-3 knots
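
A simple way to write down such functions uses truncated powers: for a knot \(\kappa\), the function \((x - \kappa)_+^3\), which equals \(0\) for \(x \le \kappa\) and \((x - \kappa)^3\) for \(x > \kappa\), is itself a cubic spline with the single knot \(\kappa\), since the function and its first two derivatives are continuous at \(\kappa\). Adding such terms to an ordinary cubic polynomial gives a cubic spline with the corresponding knots. The following R sketch plots one such spline; the knots and coefficients are arbitrary illustration choices:

    # a cubic spline on [0, 1] with knots at 0.3, 0.6 and 0.8,
    # written in the truncated power form described above
    knots <- c(0.3, 0.6, 0.8)
    s <- function(x) {
      1 + 0.5 * x - 2 * x^3 +
        5 * pmax(x - knots[1], 0)^3 -
        8 * pmax(x - knots[2], 0)^3 +
        6 * pmax(x - knots[3], 0)^3
    }
    curve(s, from = 0, to = 1, n = 401)
    abline(v = knots, lty = 2)  # mark the knot positions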

The following theorem shows that the solution to the smoothing spline problem is always a natural cubic spline.

Theorem 6.1 The solution \(\hat m_\lambda\) to the optimization problem in definition 6.1 is a natural cubic spline with knots at the data points \(x_1, \ldots, x_n\).

Proof. We sketch the main idea of the proof, showing that the solution must be a cubic spline. (The full proof, including the boundary conditions which make the spline “natural”, is more technical.)

Suppose \(\hat m_\lambda\) is the solution but \(\hat m_\lambda\) is not a cubic polynomial on some interval \([x_i, x_{i+1}]\). We will show that this leads to a contradiction.

Let \(p\) be the cubic polynomial on \([x_i, x_{i+1}]\) which matches the values and first derivatives of \(\hat m_\lambda\) at both endpoints, i.e. \(p(x_i) = \hat m_\lambda(x_i)\), \(p'(x_i) = \hat m_\lambda'(x_i)\), \(p(x_{i+1}) = \hat m_\lambda(x_{i+1})\) and \(p'(x_{i+1}) = \hat m_\lambda'(x_{i+1})\). (These are four conditions, so such a polynomial exists and is unique.) Now consider the function \(\tilde m\) which equals \(\hat m_\lambda\) outside \([x_i, x_{i+1}]\) and equals \(p\) on \([x_i, x_{i+1}]\). By construction, \(\tilde m\) is continuously differentiable; its second derivative may jump at \(x_i\) and \(x_{i+1}\), but the penalty in equation (6.1) is still well-defined and we ignore this technicality in the sketch.

Since \(p\) agrees with \(\hat m_\lambda\) at \(x_i\) and \(x_{i+1}\), the function \(\tilde m\) agrees with \(\hat m_\lambda\) at all data points, so the residual sum of squares is the same for both functions. However, the penalty term is smaller for \(\tilde m\): \[\begin{equation*} \int_{-\infty}^\infty \bigl( \tilde m''(x) \bigr)^2 \,dx < \int_{-\infty}^\infty \bigl( \hat m_\lambda''(x) \bigr)^2 \,dx. \end{equation*}\] To see this, write \(h = \hat m_\lambda - p\) on \([x_i, x_{i+1}]\), so that \(h\) and \(h'\) vanish at both endpoints. Integrating by parts twice, and using \(p'''' = 0\) together with these boundary conditions, gives \(\int_{x_i}^{x_{i+1}} p''(x) h''(x) \,dx = 0\), and therefore \[\begin{equation*} \int_{x_i}^{x_{i+1}} \bigl( \hat m_\lambda''(x) \bigr)^2 \,dx = \int_{x_i}^{x_{i+1}} \bigl( p''(x) \bigr)^2 \,dx + \int_{x_i}^{x_{i+1}} \bigl( h''(x) \bigr)^2 \,dx > \int_{x_i}^{x_{i+1}} \bigl( p''(x) \bigr)^2 \,dx, \end{equation*}\] where the inequality is strict because \(h'' \not\equiv 0\) (otherwise \(\hat m_\lambda\) would already be a cubic polynomial on \([x_i, x_{i+1}]\)). Since \(\tilde m = \hat m_\lambda\) outside \([x_i, x_{i+1}]\) and \(\tilde m = p\) on the interval, this establishes the claimed inequality.

This contradicts the assumption that \(\hat m_\lambda\) minimizes equation (6.1). Therefore, \(\hat m_\lambda\) must be a cubic polynomial on each interval \([x_i, x_{i+1}]\).

A similar argument shows that \(\hat m_\lambda\) must be twice continuously differentiable at the knots, completing the proof that \(\hat m_\lambda\) is a cubic spline.

The theorem shows that the smoothing spline has knots at all data points \(x_1, \ldots, x_n\). This might seem surprising: a cubic spline with \(n\) knots has \(n + 4\) free parameters (four coefficients for each of the \(n+1\) pieces, minus \(3n\) constraints from continuity of \(s\), \(s'\) and \(s''\), giving \(4(n+1) - 3n = n + 4\)), and even after the four linearity constraints which make the spline natural, \(n\) free parameters remain, enough to interpolate all \(n\) data points. However, the penalty term \(\lambda \int (m'')^2 \,dx\) prevents overfitting by penalizing functions with high curvature.

6.3 Degrees of Freedom

As with linear regression, we can write the smoothing spline estimate in matrix form. Let \(y = (y_1, \ldots, y_n)^\top\) be the vector of responses, and let \(\hat y = (\hat m_\lambda(x_1), \ldots, \hat m_\lambda(x_n))^\top\) be the vector of fitted values. Then we can write \[\begin{equation} \hat y = S_\lambda y, \tag{6.2} \end{equation}\] where \(S_\lambda \in \mathbb{R}^{n\times n}\) is the smoother matrix. This is analogous to the hat matrix \(H = X(X^\top X)^{-1}X^\top\) in linear regression (see section 2 of the level 3 notes).

The smoother matrix \(S_\lambda\) depends on the smoothing parameter \(\lambda\) and on the data points \(x_1, \ldots, x_n\), but not on the responses \(y_1, \ldots, y_n\). We will not give the explicit formula for \(S_\lambda\) here, but note that it can be computed efficiently.

Definition 6.3 The effective degrees of freedom of the smoothing spline estimate is \[\begin{equation*} \mathop{\mathrm{df}}\nolimits(\lambda) = \mathop{\mathrm{tr}}\nolimits(S_\lambda), \end{equation*}\] where \(\mathop{\mathrm{tr}}\nolimits\) denotes the trace of a matrix (the sum of the diagonal elements).

The effective degrees of freedom measure the complexity of the fitted model:

  • For \(\lambda = 0\), the smoother matrix is \(S_0 = I\) (the identity matrix), and we have \(\mathop{\mathrm{df}}\nolimits(0) = n\), assuming the \(x_i\) are distinct. This corresponds to interpolation of the data.
  • For \(\lambda \to \infty\), the smoother matrix converges to the hat matrix of linear regression, and we have \(\mathop{\mathrm{df}}\nolimits(\infty) = 2\). This corresponds to fitting a straight line.
  • For intermediate values of \(\lambda\), we have \(2 < \mathop{\mathrm{df}}\nolimits(\lambda) < n\).

The effective degrees of freedom provide an alternative way to specify the amount of smoothing: instead of choosing \(\lambda\) directly, we can choose a target value for \(\mathop{\mathrm{df}}\nolimits(\lambda)\) and then find the corresponding \(\lambda\).
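
Because \(\hat y = S_\lambda y\) is linear in \(y\), the \(j\)-th column of the smoother matrix can be obtained by smoothing the \(j\)-th standard basis vector with the same amount of smoothing. The following sketch uses this to reconstruct \(S_\lambda\) numerically and to check that its trace matches the reported degrees of freedom; it uses the R function smooth.spline() described in section 6.4 below, and the simulated data and the target of 6 degrees of freedom are arbitrary choices:

    set.seed(1)
    x <- sort(runif(50))                    # simulated design points (no ties)
    y <- sin(2 * pi * x) + rnorm(50, sd = 0.2)
    fit <- smooth.spline(x, y, df = 6)      # fixes the amount of smoothing

    n <- length(x)
    S <- matrix(0, n, n)
    for (j in 1:n) {
      e <- numeric(n)
      e[j] <- 1                             # j-th standard basis vector
      S[, j] <- smooth.spline(x, e, spar = fit$spar)$y
    }
    sum(diag(S))                            # trace of S_lambda, approximately 6
    fit$df                                  # degrees of freedom reported by the fit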

6.4 Smoothing Splines in R

Smoothing splines can be computed in R using the built-in function smooth.spline(). The function takes the following arguments:

  • x and y: the data points.
  • spar: the smoothing parameter. This is related to \(\lambda\) but uses a different scale. If not specified, a default value is chosen.
  • df: the target degrees of freedom. If specified, the function finds the value of spar which gives the desired degrees of freedom.

The return value is an object which contains the fitted spline. The most important components are:

  • $x and $y: the distinct \(x\) values of the data (in increasing order) and the corresponding fitted values of the spline at these points.
  • $df: the effective degrees of freedom of the fitted spline.
  • $lambda: the smoothing parameter \(\lambda\) (on a different scale than spar).
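
As a minimal usage sketch, the following fits a smoothing spline to the built-in faithful data set, with the waiting time between eruptions as the covariate and the eruption duration as the response; the choice of 5 degrees of freedom is an arbitrary illustration:

    # Old Faithful data: eruption duration as a function of waiting time
    fit <- smooth.spline(faithful$waiting, faithful$eruptions, df = 5)
    fit$df       # effective degrees of freedom (close to the target of 5)
    fit$lambda   # the corresponding smoothing parameter

    plot(faithful$waiting, faithful$eruptions,
         xlab = "waiting time [min]", ylab = "eruption duration [min]")
    lines(fit$x, fit$y, lwd = 2)  # fitted smoothing spline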

Example 6.2 TODO: add example using faithful dataset, comparing to kernel methods

6.5 Comparison with Kernel Methods

Both smoothing splines and kernel methods can be used to estimate the regression function \(m\) from data. The two approaches have different strengths and weaknesses.

Advantages of smoothing splines:

  • Global method: The spline is fitted to the entire dataset at once, which can give better results at the boundaries of the data range.
  • Automatic smoothness: The solution is guaranteed to be twice continuously differentiable.
  • Efficient computation: The spline can be computed by solving a system of linear equations, without the need to perform local fits at many points.

Advantages of kernel methods:

  • Local adaptation: Kernel methods can adapt to local features of the data, for example by using different bandwidths in different regions (as in \(k\)-NN methods).
  • Interpretability: The weighted average form of kernel estimates is easier to understand than the penalized least squares formulation of splines.
  • Robustness: Kernel methods can be made robust to outliers by using robust kernels or by trimming extreme values.

In practice, the choice between smoothing splines and kernel methods often depends on the specific application and the properties of the data.

Example 6.3 TODO: add comparison example showing spline vs. NW vs. local linear
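
One possible version of such a comparison, again using the faithful data: the smoothing spline is computed with smooth.spline(), the Nadaraya-Watson estimate with ksmooth(), and the local linear estimate with locpoly() from the KernSmooth package (one of the recommended packages shipped with R). The bandwidths and the degrees of freedom below are arbitrary illustration choices:

    library(KernSmooth)  # provides locpoly()

    x <- faithful$waiting
    y <- faithful$eruptions

    spline.fit <- smooth.spline(x, y, df = 5)                  # smoothing spline
    nw.fit <- ksmooth(x, y, kernel = "normal", bandwidth = 5)  # Nadaraya-Watson
    ll.fit <- locpoly(x, y, degree = 1, bandwidth = 3)         # local linear

    plot(x, y, col = "grey",
         xlab = "waiting time [min]", ylab = "eruption duration [min]")
    lines(spline.fit$x, spline.fit$y, lty = 1, lwd = 2)
    lines(nw.fit, lty = 2, lwd = 2)
    lines(ll.fit, lty = 3, lwd = 2)
    legend("topleft", lty = 1:3, lwd = 2,
           legend = c("smoothing spline", "Nadaraya-Watson", "local linear"))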

Summary

  • Smoothing splines balance fit and smoothness using a penalty term.
  • The solution is a natural cubic spline with knots at the data points.
  • The smoothing parameter \(\lambda\) controls the trade-off between fit and smoothness.
  • Effective degrees of freedom measure model complexity.
  • Smoothing splines are a global alternative to local kernel methods.