Problem Sheet 2

This problem sheet is for self-study only. It is not assessed.

5. Consider the Nadaraya-Watson estimator \[\begin{equation*} \hat m_h(x) = \frac{\sum_{i=1}^n K_h(x - x_i) y_i}{\sum_{j=1}^n K_h(x - x_j)}, \end{equation*}\] where the Gaussian kernel is used for \(K\). Assuming the \(x_i\) are distinct, show that \(\hat m_h(x_i) \to y_i\) as \(h \downarrow 0\) for each data point \(x_i\).

Evaluating the estimator at the data point \(x_i\), we have \[\begin{equation*} \hat m_h(x_i) = \frac{\sum_{j=1}^n K_h(x_i - x_j) y_j}{\sum_{j=1}^n K_h(x_i - x_j)} = \frac{K_h(0) y_i + \sum_{j \neq i} K_h(x_i - x_j) y_j}{K_h(0) + \sum_{j \neq i} K_h(x_i - x_j)}. \end{equation*}\] Dividing both numerator and denominator by \(K_h(0) = \frac{1}{h} K(0)\) gives \[\begin{equation*} \hat m_h(x_i) = \frac{y_i + \sum_{j \neq i} \frac{K_h(x_i - x_j)}{K_h(0)} y_j}{1 + \sum_{j \neq i} \frac{K_h(x_i - x_j)}{K_h(0)}}. \end{equation*}\] For \(j \neq i\) we have \(x_i - x_j \neq 0\), and thus \(|(x_i - x_j)/h| \to \infty\) as \(h\downarrow 0\). For the Gaussian kernel, \(K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2} \to 0\) as \(|u| \to \infty\). Therefore \[\begin{equation*} \frac{K_h(x_i - x_j)}{K_h(0)} = \frac{K\bigl((x_i - x_j)/h\bigr)}{K(0)} \to \frac{0}{K(0)} = 0 \end{equation*}\] as \(h \downarrow 0\), using \(K(0) > 0\). Since there are only finitely many terms in the sums (namely \(n-1\) terms), we conclude that \[\begin{equation*} \hat m_h(x_i) \to \frac{y_i + 0}{1 + 0} = y_i \end{equation*}\] as \(h \downarrow 0\).
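As a quick numerical check (a small Python sketch with artificial data, not required by the question), the interpolation property can be observed directly: the estimate at a data point approaches \(y_i\) as \(h\) shrinks.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=8)                    # distinct design points
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=8)

def nw(x0, x, y, h):
    """Nadaraya-Watson estimate at x0 with a Gaussian kernel and bandwidth h."""
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)       # normalising constants cancel in the ratio
    return np.sum(w * y) / np.sum(w)

i = 3
for h in [0.5, 0.1, 0.02, 0.005]:
    print(f"h = {h:<6} estimate = {nw(x[i], x, y, h): .6f}   y_i = {y[i]: .6f}")
```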

6. The bias of the Nadaraya-Watson estimator satisfies \[\begin{equation*} \mathop{\mathrm{bias}}\bigl(\hat m_h(x)\bigr) \approx \frac{\mu_2(K) f'(x)}{f(x)} m'(x) h^2 + \frac{\mu_2(K)}{2} m''(x) h^2, \end{equation*}\] while for the local linear estimator (section 5), the bias is \[\begin{equation*} \mathop{\mathrm{bias}}\approx \frac{\mu_2(K)}{2} m''(x) h^2 \end{equation*}\] for small \(h\).

  (a) Suppose observations are more dense to the left of a point \(x_0\) than they are to the right, i.e. \(f'(x_0) < 0\), and that the regression function \(m\) has positive slope at \(x_0\). Explain geometrically why the Nadaraya-Watson estimator underestimates \(m(x_0)\).

The Nadaraya-Watson estimator computes a weighted average \(\hat m_h(x_0) = \sum_i w_i(x_0) y_i\), where the kernel weights depend only on how far each \(x_i\) lies from \(x_0\). Since \(m\) has positive slope at \(x_0\), the observations to the left of \(x_0\) lie, on average, below \(m(x_0)\), while those to the right lie above it. Because the design is denser to the left, more data points fall within the kernel window on that side, and the left-hand observations collectively receive greater weight than the right-hand ones. The weighted average is therefore pulled below \(m(x_0)\), so the estimator underestimates it.
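This can be illustrated numerically with a toy design (an assumed setup, not part of the sheet): take noiseless observations \(y_i = m(x_i) = 2x_i\) and points that are denser to the left of \(x_0 = 0.5\). The left-hand points then carry most of the kernel weight, and the weighted average falls below \(m(x_0) = 1\).

```python
import numpy as np

x0, h = 0.5, 0.1
# design denser to the left of x0 than to the right
x = np.concatenate([np.linspace(0.0, 0.5, 40, endpoint=False),
                    np.linspace(0.5, 1.0, 10)])
y = 2 * x                                        # noiseless y_i = m(x_i)

w = np.exp(-0.5 * ((x0 - x) / h) ** 2)           # Gaussian kernel weights
w = w / w.sum()

print("weight on points left of x0: ", w[x < x0].sum())
print("weight on points right of x0:", w[x >= x0].sum())
print("NW estimate:", np.sum(w * y), "  true m(x0) =", 2 * x0)
```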

  (b) Let \(m(x) = 2x\) and \(f(x) = 2(1-x)\) for \(x \in [0,1]\). Compute the Nadaraya-Watson bias at \(x_0 = 0.5\) as a multiple of \(\mu_2(K) h^2\).

We have \(m'(x_0) = 2\), \(m''(x_0) = 0\), \(f(0.5) = 1\), and \(f'(0.5) = -2\). The Nadaraya-Watson bias is \[\begin{equation*} \mathop{\mathrm{bias}}\bigl(\hat m_h(0.5)\bigr) \approx \frac{\mu_2(K) \cdot (-2) \cdot 2}{1} h^2 + \frac{\mu_2(K)}{2} \cdot 0 \cdot h^2 = -4 \mu_2(K) h^2. \end{equation*}\] The negative sign confirms the underestimation predicted in part (a).
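A Monte Carlo sketch (with an assumed sample size and noise level, neither of which is specified in the question) suggests the approximation is reasonable: for the standard Gaussian kernel \(\mu_2(K) = 1\), so with \(h = 0.1\) the predicted bias is about \(-0.04\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, x0, reps = 2000, 0.1, 0.5, 500             # assumed simulation settings

def nw(x0, x, y, h):
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)
    return np.sum(w * y) / np.sum(w)

est = np.empty(reps)
for r in range(reps):
    u = rng.uniform(size=n)
    x = 1.0 - np.sqrt(1.0 - u)                   # inverse-CDF sample from f(x) = 2(1 - x)
    y = 2 * x + rng.normal(scale=0.1, size=n)    # m(x) = 2x plus assumed noise
    est[r] = nw(x0, x, y, h)

print("simulated bias:", est.mean() - 2 * x0)    # roughly -0.04
print("theory, -4 h^2:", -4 * h ** 2)
```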

  (c) Show that the bias for local linear regression at the same point equals zero. Explain geometrically why fitting a local line, rather than taking a local average, eliminates this bias.

The local linear bias is \[\begin{equation*} \mathop{\mathrm{bias}}\approx \frac{\mu_2(K)}{2} \cdot 0 \cdot h^2 = 0. \end{equation*}\] Local linear regression fits a line to the weighted observations rather than averaging them. Even when observations are concentrated on one side of \(x_0\), the fitted line can extrapolate to \(x_0\) using its estimated slope. Since the true function \(m(x) = 2x\) is linear, the locally fitted line matches \(m\) exactly in expectation, eliminating the bias.
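The contrast can be seen numerically with the same assumed toy design as above and noiseless \(y_i = 2x_i\): the local linear fit at \(x_0\) recovers \(m(x_0) = 1\) exactly, while the Nadaraya-Watson average is pulled below it.

```python
import numpy as np

x0, h = 0.5, 0.1
x = np.concatenate([np.linspace(0.0, 0.5, 40, endpoint=False),
                    np.linspace(0.5, 1.0, 10)])  # denser to the left of x0
y = 2 * x                                        # noiseless, exactly linear
w = np.exp(-0.5 * ((x0 - x) / h) ** 2)           # Gaussian kernel weights

# local linear fit: weighted least squares of y on (1, x - x0)
X = np.column_stack([np.ones_like(x), x - x0])
beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

print("local linear:   ", beta[0])                        # 1.0, i.e. m(0.5)
print("Nadaraya-Watson:", np.sum(w * y) / np.sum(w))      # below 1.0
```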

7. Consider local linear regression with kernel weights \(W = \mathrm{diag}(K_h(\tilde x - x_1), \ldots, K_h(\tilde x - x_n))\) and design matrix \(X\) with rows \((1, x_i - \tilde x)\). The local linear estimate for \(m(\tilde x)\) can be written as \[\begin{equation*} \hat m_h(\tilde x) = e_0^\top (X^\top W X)^{-1} X^\top W y, \end{equation*}\] where \(e_0 = (1,0)^\top\).

  (a) Show that \(\hat m_h(\tilde x) = \sum_{i=1}^n w_i y_i\) for weights \(w_i\) that do not depend on the \(y_j\).

The expression \(e_0^\top (X^\top W X)^{-1} X^\top W y\) is linear in \(y\). Writing \(w^\top = e_0^\top (X^\top W X)^{-1} X^\top W\), we have \(\hat m_h(\tilde x) = w^\top y = \sum_{i=1}^n w_i y_i\). The vector \(w\) depends on \(\tilde x\) and the \(x_i\) (through \(X\) and \(W\)) but not on the \(y_j\).
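A short numerical check (assumed example data, not from the sheet): the weight vector \(w\) is computed from \(\tilde x\), the \(x_i\) and \(h\) alone, and reproduces the weighted-least-squares fit for any response vector \(y\).

```python
import numpy as np

rng = np.random.default_rng(2)
xt, h = 0.3, 0.15                                # evaluation point (x tilde) and bandwidth
x = rng.uniform(0, 1, size=50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=50)

X = np.column_stack([np.ones_like(x), x - xt])   # rows (1, x_i - x~)
k = np.exp(-0.5 * ((xt - x) / h) ** 2)           # diagonal of W (Gaussian kernel)

# w^T = e0^T (X^T W X)^{-1} X^T W, computed without forming the n-by-n matrix W
A = X.T @ (k[:, None] * X)                       # X^T W X  (2 x 2)
w = np.linalg.solve(A, X.T * k)[0]               # first row of (X^T W X)^{-1} X^T W

beta = np.linalg.solve(A, X.T @ (k * y))         # direct weighted least squares fit
print(w @ y, beta[0])                            # identical: the estimate is linear in y
```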

  (b) Show that if \(y_i = a + b(x_i - \tilde x)\) for constants \(a\) and \(b\), then \(\hat m_h(\tilde x) = a\).

If \(y_i = a + b(x_i - \tilde x)\), then taking \(\beta_0 = a\) and \(\beta_1 = b\) makes every residual \(y_i - \beta_0 - \beta_1(x_i - \tilde x)\) equal to zero. The weighted residual sum of squares therefore vanishes, which is its minimum, so \(\hat\beta = (a, b)^\top\) (the minimiser is unique whenever \(X^\top W X\) is invertible) and \(\hat m_h(\tilde x) = \hat\beta_0 = a\).
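Equivalently, the result can be read off from the matrix form given in the question: the assumption says \(y = X\beta\) with \(\beta = (a, b)^\top\), so \[\begin{equation*} \hat m_h(\tilde x) = e_0^\top (X^\top W X)^{-1} X^\top W X \beta = e_0^\top \beta = a. \end{equation*}\]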

  (c) Deduce that \(\sum_{i=1}^n w_i = 1\) and \(\sum_{i=1}^n w_i(x_i - \tilde x) = 0\).

From part (a), \(\hat m_h(\tilde x) = \sum w_i y_i\) where the weights \(w_i\) do not depend on the \(y_j\). Thus we can choose \(y_i = a + b(x_i - \tilde x)\) as in part (b), giving \(\hat m_h(\tilde x) = a\). This gives \[\begin{equation*} a = \hat m_h(\tilde x) = \sum_{i=1}^n w_i (a + b(x_i - \tilde x)) = a \sum_{i=1}^n w_i + b \sum_{i=1}^n w_i(x_i - \tilde x). \end{equation*}\] Rearranging gives \[\begin{equation*} a\Bigl( \sum_{i=1}^n w_i - 1 \Bigr) + b \sum_{i=1}^n w_i(x_i - \tilde x) = 0. \end{equation*}\] Since this holds for all \(a, b\), both coefficients must vanish: \(\sum w_i = 1\) and \(\sum w_i(x_i - \tilde x) = 0\).
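The two identities can also be verified numerically (assumed example data, reusing the construction of \(w\) from the sketch in part (a)):

```python
import numpy as np

rng = np.random.default_rng(3)
xt, h = 0.3, 0.15
x = rng.uniform(0, 1, size=50)

X = np.column_stack([np.ones_like(x), x - xt])
k = np.exp(-0.5 * ((xt - x) / h) ** 2)
A = X.T @ (k[:, None] * X)
w = np.linalg.solve(A, X.T * k)[0]               # local linear weights w_i

print(w.sum())                                   # 1.0 up to rounding error
print(w @ (x - xt))                              # 0.0 up to rounding error
```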