Problem Sheet 3

This problem sheet is for self-study only. It is not assessed.

8. The $k$-nearest neighbour estimator is defined as \[\begin{equation*} \hat m_k(x) = \frac{1}{k} \sum_{i \in J_k(x)} y_i, \end{equation*}\] where $J_k(x) = \bigl\{ i \bigm| \text{$x_i$ is one of the $k$ nearest observations to $x$} \bigr\}$. Assuming the $x_i$ are distinct, show that $\hat m_1(x_j) = y_j$ for each data point $x_j$.

For $k = 1$, the set $J_1(x_j)$ contains the single index $i$ for which $x_i$ is closest to $x_j$. Since the $x_i$ are distinct, the closest point to $x_j$ is $x_j$ itself (with distance zero). Thus $J_1(x_j) = \{j\}$ and \[\begin{equation*} \hat m_1(x_j) = \frac{1}{1} \sum_{i \in \{j\}} y_i = y_j. \end{equation*}\]

9. Consider the $k$-nearest neighbour estimator with $k = 1$ and ordered data $x_1 < x_2 < \cdots < x_n$.

For $x \in (x_i, x_{i+1})$, find the value of $x$ at which the nearest neighbour changes from $x_i$ to $x_{i+1}$.

The nearest neighbour is $x_i$ when $|x - x_i| < |x - x_{i+1}|$, i.e. when $x - x_i < x_{i+1} - x$, which simplifies to $x < \frac{x_i + x_{i+1}}{2}$. Similarly, the nearest neighbour is $x_{i+1}$ when $x > \frac{x_i + x_{i+1}}{2}$. Thus the nearest neighbour changes at the midpoint $\frac{x_i + x_{i+1}}{2}$.

Deduce that $\hat m_1$ is a step function, and state the intervals on which it is constant.

From part (a), the nearest neighbour changes only at the midpoints between consecutive data points. For $x < x_1$, the nearest neighbour is $x_1$; for $x > x_n$, it is $x_n$. Thus $\hat m_1$ is constant on each of the $n$ intervals \[\begin{equation*} \Bigl( -\infty, \frac{x_1 + x_2}{2} \Bigr), \quad \Bigl( \frac{x_1 + x_2}{2}, \frac{x_2 + x_3}{2} \Bigr), \quad \ldots, \quad \Bigl( \frac{x_{n-1} + x_n}{2}, \infty \Bigr), \end{equation*}\] taking values $y_1, y_2, \ldots, y_n$ respectively. This makes $\hat m_1$ a step function with jumps at the $n - 1$ midpoints.

10. Among smooth interpolants, the natural cubic spline minimises roughness $\int (m''(x))^2 \, dx$. This question illustrates this for two points within a family of cubics.

Consider fitting the points $(0, 0)$ and $(1, 1)$. For any constants $p$ and $q$, the cubic \[\begin{equation*} m(x) = x + x(1-x)\bigl(p(1-x) + qx\bigr) \end{equation*}\] passes through both points, with the linear function $m(x) = x$ corresponding to $p = q = 0$.

Show that the roughness $\int_0^1 (m''(x))^2 \, dx$ equals $4(p^2 - pq + q^2)$.

Expanding $m(x)$: \[\begin{equation*} m(x) = (1+p)x + (q-2p)x^2 + (p-q)x^3. \end{equation*}\] Differentiating twice: \[\begin{align*} m'(x) &= (1+p) + 2(q-2p)x + 3(p-q)x^2, \\ m''(x) &= 2(q-2p) + 6(p-q)x. \end{align*}\] Writing $m''(x) = \alpha + \beta x$ with $\alpha = 2(q-2p)$ and $\beta = 6(p-q)$, \[\begin{equation*} \int_0^1 (m''(x))^2 \, dx = \int_0^1 (\alpha + \beta x)^2 \, dx = \alpha^2 + \alpha\beta + \frac{\beta^2}{3}. \end{equation*}\] Substituting $\alpha = 2q - 4p$ and $\beta = 6p - 6q$ and simplifying gives $4(p^2 - pq + q^2)$.

By completing the square (or otherwise), show that $p^2 - pq + q^2 \geq 0$ with equality if and only if $p = q = 0$.

Completing the square: \[\begin{equation*} p^2 - pq + q^2 = \Bigl(p - \frac{q}{2}\Bigr)^{\!2} + \frac{3q^2}{4}. \end{equation*}\] This is a sum of two squares, hence non-negative. Equality holds if and only if both $p - q/2 = 0$ and $q = 0$, which requires $p = q = 0$.

Alternatively, $p^2 - pq + q^2 = \frac{1}{2}(p^2 + q^2 + (p-q)^2) \geq 0$.

Since every member of this family passes through both data points, $\mathrm{RSS} = 0$ for each. Deduce that for any $\lambda > 0$, the linear function $m(x) = x$ uniquely minimises $I(m) = \mathrm{RSS} + \lambda \int_0^1 (m''(x))^2 \, dx$ within this family.

With $\mathrm{RSS} = 0$, the objective becomes \[\begin{equation*} I(m) = \lambda \int_0^1 (m''(x))^2 \, dx = 4\lambda(p^2 - pq + q^2). \end{equation*}\] For $\lambda > 0$, minimising $I(m)$ is equivalent to minimising $p^2 - pq + q^2$. By part (b), this is minimised uniquely at $p = q = 0$, giving $m(x) = x$.