Section 3 The Variance of Kernel Density Estimates

In the previous section we considered the bias of the estimator \[\begin{equation*} \hat f_h(x) = \frac{1}{n} \sum_{i=1}^n K_h(x - x_i). \end{equation*}\] We found \[\begin{equation} \mathbb{E}\bigl( \hat f_h(x) \bigr) = \mathbb{E}\bigl( K_h(x - X_i) \bigr) \tag{3.1} \end{equation}\] for all \(i \in \{1, \ldots, n\}\) (we used \(i=1\)), and we used this relation to compute the bias. In this section, we will use similar arguments to compute the variance and the mean squared error of the estimator.
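Before turning to the variance, it may help to recall what this estimator looks like in practice. The following is a minimal sketch in Python, using a Gaussian kernel; the function name `kde`, the seed, and the bandwidth are purely illustrative choices, not part of the notes.

```python
import numpy as np

def kde(x, data, h):
    """Evaluate the kernel density estimate (1/n) * sum_i K_h(x - x_i)
    at the points x, using a Gaussian kernel with bandwidth h."""
    x = np.atleast_1d(x)
    u = (x[:, None] - data[None, :]) / h            # (x - x_i) / h
    K = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)      # Gaussian kernel K
    return K.mean(axis=1) / h                       # K_h(u) = K(u/h) / h, averaged over i

# example: estimate a standard normal density from 1000 samples
rng = np.random.default_rng(1)
data = rng.normal(size=1000)
print(kde(0.0, data, h=0.3))   # should be close to f(0) = 1/sqrt(2*pi) ≈ 0.399
```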

3.1 Variance

We use the formula \[\begin{equation*} \mathop{\mathrm{Var}}\bigl( \hat f_h(x) \bigr) = \mathbb{E}\bigl( \hat f_h(x)^2 \bigr) - \mathbb{E}\bigl( \hat f_h(x) \bigr)^2 \end{equation*}\] and consider the two terms separately.

3.1.1 Second Moment

For the second moment term in the definition of the variance we get \[\begin{align*} \mathbb{E}\bigl( \hat f_h(x)^2 \bigr) &= \mathbb{E}\Bigl( \frac{1}{n} \sum_{i=1}^n K_h(x - X_i) \frac{1}{n} \sum_{j=1}^n K_h(x - X_j) \Bigr) \\ &= \frac{1}{n^2} \mathbb{E}\Bigl( \sum_{i,j=1}^n K_h(x - X_i) K_h(x - X_j) \Bigr) \\ &= \frac{1}{n^2} \sum_{i,j=1}^n \mathbb{E}\Bigl( K_h(x - X_i) K_h(x - X_j) \Bigr). \end{align*}\] Since the \(X_i\) are independent and identically distributed, the value of each expectation depends only on whether \(i = j\) or \(i \neq j\). For the \(n\) terms where \(i=j\) we can assume that both indices equal 1, and for the \(n(n-1)\) terms where \(i\neq j\) we can assume \(i=1\) and \(j=2\), without changing the value of the expectation. So we get \[\begin{align*} \mathbb{E}\bigl( \hat f_h(x)^2 \bigr) &= \frac{1}{n^2} \Bigl( n \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) + n(n-1) \mathbb{E}\bigl( K_h(x - X_1) K_h(x - X_2) \bigr) \Bigr) \\ &= \frac{1}{n^2} \Bigl( n \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) + n(n-1) \mathbb{E}\bigl( K_h(x - X_1) \bigr) \mathbb{E}\bigl( K_h(x - X_2) \bigr) \Bigr) \\ &= \frac{1}{n^2} \Bigl( n \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) + n(n-1) \mathbb{E}\bigl( K_h(x - X_1) \bigr)^2 \Bigr) \\ &= \frac{1}{n} \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) + \frac{n-1}{n} \mathbb{E}\bigl( \hat f_h(x) \bigr)^2, \end{align*}\] where the independence of \(X_1\) and \(X_2\) gives the factorisation in the second step, and equation (3.1) gives the last term. Using this result, we can write the variance as \[\begin{align*} \mathop{\mathrm{Var}}\bigl( \hat f_h(x) \bigr) &= \mathbb{E}\bigl( \hat f_h(x)^2 \bigr) - \mathbb{E}\bigl( \hat f_h(x) \bigr)^2 \\ &= \frac{1}{n} \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) + \Bigl(\frac{n-1}{n} - 1\Bigr) \mathbb{E}\bigl( \hat f_h(x) \bigr)^2. \end{align*}\] Since \((n-1)/n - 1 = -1/n\) we get \[\begin{equation} \mathop{\mathrm{Var}}\bigl( \hat f_h(x) \bigr) = \frac{1}{n} \Bigl( \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) - \mathbb{E}\bigl( \hat f_h(x) \bigr)^2 \Bigr). \tag{3.2} \end{equation}\]
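Equation (3.2) is exact, so it can be checked directly by simulation. The sketch below (in Python, with a Gaussian kernel and standard normal data; all names and concrete values are illustrative) compares the empirical variance of \(\hat f_h(x)\) over many simulated samples with the right-hand side of (3.2), where the two expectations are computed by numerical integration.

```python
import numpy as np

rng = np.random.default_rng(2)
x, n, h = 0.0, 100, 0.5

def K_h(u, h):
    """Scaled Gaussian kernel K_h(u) = K(u/h) / h."""
    return np.exp(-(u / h)**2 / 2) / (h * np.sqrt(2 * np.pi))

f = lambda y: np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)   # true density: N(0, 1)

# right-hand side of (3.2): expectations computed by numerical integration
y = np.linspace(-10, 10, 200001)
dy = y[1] - y[0]
EK2 = np.sum(K_h(x - y, h)**2 * f(y)) * dy    # E( K_h(x - X_1)^2 )
Ef = np.sum(K_h(x - y, h) * f(y)) * dy        # E( f_hat_h(x) )
print((EK2 - Ef**2) / n)

# left-hand side: empirical variance of f_hat_h(x) over many simulated samples
est = [K_h(x - rng.normal(size=n), h).mean() for _ in range(20000)]
print(np.var(est))                            # should be close to the value above
```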

3.1.2 The Roughness of a Kernel

As in the previous section, we will use a Taylor expansion of \(f\) around the point \(x\) to understand the behaviour of the variance for small \(h\). Some more care is needed here, because this time the result also depends on the sample size \(n\) and we will consider the joint limit \(n \to \infty\) and \(h\to 0\). For the first term in equation (3.2) we find \[\begin{align*} \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) &= \int K_h(x - y)^2 f(y) \,dy \\ &= \int K_h(y - x)^2 f(y) \,dy \\ &= \int \frac{1}{h^2} K\Bigl( \frac{y - x}{h} \Bigr)^2 f(y) \,dy, \end{align*}\] using the symmetry of the kernel in the second step. Now we use the substitution \(z = (y - x) / h\), so that \(dy = h\,dz\). This gives \[\begin{align*} \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) &= \int \frac{1}{h^2} K(z)^2 f(x + hz) \,h \,dz. \end{align*}\] Finally, we use a Taylor approximation to get \[\begin{align*} \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) &\approx \int \frac{1}{h} K(z)^2 \Bigl( f(x) + hz\,f'(x) + \frac12 h^2 z^2 \,f''(x) \Bigr) \,dz \\ &= \frac{1}{h} f(x) \int K(z)^2 \,dz + f'(x) \int z K(z)^2 \,dz + \frac12 h f''(x) \int z^2 K(z)^2 \,dz \\ &= \frac{1}{h} f(x) \int K(z)^2 \,dz + \frac12 h f''(x) \int z^2 K(z)^2 \,dz, \end{align*}\] where the first-order term disappears since \(z \mapsto z K(z)^2\) is an odd function (because \(K\) is symmetric), so its integral vanishes. As a shorthand we use the following definition.

Definition 3.1 The roughness of a kernel \(K\) is given by \[\begin{equation*} R(K) := \int_{-\infty}^\infty K(x)^2 \,dx. \end{equation*}\]
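For the Gaussian kernel the roughness can be computed in closed form, \(R(K) = 1/(2\sqrt{\pi}) \approx 0.282\). A quick numerical check (a Python sketch; the grid and its endpoints are arbitrary choices):

```python
import numpy as np

u = np.linspace(-8, 8, 400001)
du = u[1] - u[0]
K = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)   # Gaussian kernel

print(np.sum(K**2) * du)                     # numerical value of R(K)
print(1 / (2 * np.sqrt(np.pi)))              # closed-form value 1/(2*sqrt(pi)) ≈ 0.2821
```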

Combining the Taylor approximation above with Definition 3.1, we obtain \[\begin{equation} \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) \approx \frac{1}{h} f(x) R(K) + \frac12 h f''(x) \int z^2 K(z)^2 \,dz \tag{3.3} \end{equation}\] for small \(h\).

3.1.3 The Variance for Small Bandwidth

Here we consider the term \(\mathbb{E}\bigl( \hat f_h(x) \bigr)^2\) in the formula for the variance. From the previous section we know \[\begin{equation*} \mathbb{E}\bigl( \hat f_h(x) \bigr) \approx f(x) + \frac12 h^2 \mu_2(K) f''(x) \end{equation*}\] and thus we get \[\begin{equation} \mathbb{E}\bigl( \hat f_h(x) \bigr)^2 \approx f(x)^2 + h^2 \mu_2(K) f(x) f''(x) + \frac14 h^4 \mu_2(K)^2 f''(x)^2 \tag{3.4} \end{equation}\] for small \(h\).

Substituting (3.3) and (3.4) into equation (3.2) we finally find \[\begin{equation*} \mathop{\mathrm{Var}}\bigl( \hat f_h(x) \bigr) = \frac{1}{n} \Bigl( \frac{1}{h} f(x) R(K) - f(x)^2 + \cdots \Bigr), \end{equation*}\] where all the omitted terms go to zero as \(h \downarrow 0\). The term \(-f(x)^2\) does not vanish, but it is dominated by \(f(x) R(K) / h\) as \(h \downarrow 0\). Dropping this term as well and keeping only the leading term, we find \[\begin{equation} \mathop{\mathrm{Var}}\bigl( \hat f_h(x) \bigr) \approx \frac{1}{nh} f(x) R(K) \tag{3.5} \end{equation}\] as \(h\downarrow 0\).
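As a sanity check of (3.5), we can compare it with the exact variance formula (3.2) for decreasing \(h\). The sketch below does this for a standard normal target density at \(x = 0\) with a Gaussian kernel (again, all concrete values are only illustrative). The two values only come together for fairly small \(h\), which illustrates that (3.5) is an asymptotic statement.

```python
import numpy as np

f = lambda y: np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)   # true density: N(0, 1)
x, n = 0.0, 500
RK = 1 / (2 * np.sqrt(np.pi))                          # R(K) for the Gaussian kernel

y = np.linspace(-10, 10, 400001)
dy = y[1] - y[0]
for h in [0.5, 0.2, 0.05]:
    Kh = np.exp(-((x - y) / h)**2 / 2) / (h * np.sqrt(2 * np.pi))
    EK2 = np.sum(Kh**2 * f(y)) * dy      # E( K_h(x - X_1)^2 ), by numerical integration
    Ef = np.sum(Kh * f(y)) * dy          # E( f_hat_h(x) ), by numerical integration
    exact = (EK2 - Ef**2) / n            # exact variance, equation (3.2)
    approx = f(x) * RK / (n * h)         # leading-order approximation (3.5)
    print(h, exact, approx)
```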

3.2 Mean Squared Error

The Mean Squared Error of the estimator \(\hat f_h(x)\) for \(f(x)\) is given by \[\begin{equation*} \mathop{\mathrm{MSE}}\nolimits\bigl( \hat f_h(x) \bigr) = \mathbb{E}\Bigl( \bigl( \hat f_h(x) - f(x) \bigr)^2 \Bigr). \end{equation*}\] It is an easy exercise to show that this can equivalently be written as \[\begin{equation*} \mathop{\mathrm{MSE}}\nolimits\bigl( \hat f_h(x) \bigr) = \mathop{\mathrm{Var}}\bigl( \hat f_h(x) \bigr) + \mathop{\mathrm{bias}}\bigl( \hat f_h(x) \bigr)^2. \end{equation*}\] Substituting equations (2.5) and (3.5) into the formula for the MSE, we get \[\begin{equation} \mathop{\mathrm{MSE}}\nolimits\bigl( \hat f_h(x) \bigr) \approx \frac{1}{nh} f(x) R(K) + \frac14 \mu_2(K)^2 f''(x)^2 h^4. \tag{3.6} \end{equation}\]
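Both the decomposition into variance and squared bias and the approximation (3.6) can be checked by simulation. The sketch below (Python, Gaussian kernel, standard normal data; all concrete numbers are illustrative) estimates the MSE at \(x = 0\) directly, via the variance-plus-squared-bias decomposition, and via formula (3.6). The first two agree exactly up to rounding, and the third should be close when \(h\) is small and \(n\) is large.

```python
import numpy as np

rng = np.random.default_rng(3)
x, n, h = 0.0, 10000, 0.05
RK, mu2 = 1 / (2 * np.sqrt(np.pi)), 1.0   # R(K) and mu_2(K) for the Gaussian kernel
f0 = 1 / np.sqrt(2 * np.pi)               # f(0) for the N(0, 1) target density
f2 = -1 / np.sqrt(2 * np.pi)              # f''(0) for the N(0, 1) target density

def K_h(u, h):
    """Scaled Gaussian kernel K_h(u) = K(u/h) / h."""
    return np.exp(-(u / h)**2 / 2) / (h * np.sqrt(2 * np.pi))

# many independent realisations of f_hat_h(0)
est = np.array([K_h(x - rng.normal(size=n), h).mean() for _ in range(5000)])

print(np.mean((est - f0)**2))                              # MSE, estimated directly
print(np.var(est) + (np.mean(est) - f0)**2)                # variance + squared bias
print(f0 * RK / (n * h) + 0.25 * mu2**2 * f2**2 * h**4)    # approximation (3.6)
```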

Some care is needed to make sure that the omitted terms from the Taylor approximations really don’t make a significant contribution in this formula for the MSE: The additional contributions from the variance have the form \(e_1(h) / n\), where the error term \(e_1\) does not depend on \(n\) and is negligible compared to \(f(x) R(K) / h\) as \(h \downarrow 0\). Using little-o notation, this is sometimes written as \(e_1(h) = o(1/h)\), which indicates that \(e_1(h) / (1/h) \to 0\) as \(h \downarrow 0\). The additional terms from the squared bias, say \(e_2(h)\), do not depend on \(n\) and are negligible compared to \(\mu_2(K)^2 f''(x)^2 h^4\). To reflect this fact we can write \(e_2(h) = o(h^4)\) as \(h \downarrow 0\). We can summarise these results as \[\begin{equation*} \mathop{\mathrm{MSE}}\nolimits\bigl( \hat f_h(x) \bigr) = \frac{1}{nh} f(x) R(K) + \frac14 \mu_2(K)^2 f''(x)^2 h^4 + o(1/nh) + o(h^4) \end{equation*}\] as \(h \downarrow 0\), with the understanding that the constants in the definition of \(o(h^4)\) do not depend on \(n\) and that \(o(1/nh)\) really means “\(o(1/h)\), where the constants are proportional to \(1/n\).”

3.3 Optimal Bandwidth

The two terms on the right-hand side of (3.6) balance each other: the first term decreases as \(h\) increases, while the second term decreases as \(h\) decreases. By taking the derivative with respect to \(h\), we can find the optimal value of \(h\). Ignoring the higher-order terms, we get \[\begin{equation*} \frac{\partial}{\partial h} \mathop{\mathrm{MSE}}\nolimits\bigl( \hat f_h(x) \bigr) = -\frac{1}{nh^2} f(x) R(K) + \mu_2(K)^2 f''(x)^2 h^3 \end{equation*}\] and thus the derivative equals zero if and only if \[\begin{equation*} \frac{1}{nh^2} f(x) R(K) = \mu_2(K)^2 f''(x)^2 h^3 \end{equation*}\] or, equivalently, \[\begin{equation*} h = h_\mathrm{opt} := \Bigl( \frac{f(x) R(K)}{n \mu_2(K)^2 f''(x)^2} \Bigr)^{1/5}. \end{equation*}\] It is easy to check that this \(h\) corresponds to the minimum of the MSE, for example by considering the sign of the derivative on either side of \(h_\mathrm{opt}\). This shows how the optimal bandwidth depends both on the kernel and on the target density \(f\). In practice, this formula is hard to use, since \(f''\) is unknown. (We don’t even know \(f\)!)
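As an illustration, for the standard normal density and the Gaussian kernel everything in this formula is known explicitly, so we can evaluate \(h_\mathrm{opt}\) at \(x = 0\). A Python sketch (the function name `h_opt` and the sample sizes are just illustrative):

```python
import numpy as np

def h_opt(fx, f2x, RK, mu2, n):
    """Pointwise optimal bandwidth h_opt from the formula above."""
    return (fx * RK / (n * mu2**2 * f2x**2)) ** (1 / 5)

fx = 1 / np.sqrt(2 * np.pi)               # f(0) for the N(0, 1) density
f2x = -1 / np.sqrt(2 * np.pi)             # f''(0) for the N(0, 1) density
RK, mu2 = 1 / (2 * np.sqrt(np.pi)), 1.0   # R(K) and mu_2(K) for the Gaussian kernel

for n in [100, 1000, 10000]:
    print(n, h_opt(fx, f2x, RK, mu2, n))  # shrinks like n^(-1/5)
```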

Substituting the optimal value of \(h\) back into equation (3.6), we get \[\begin{align*} \mathop{\mathrm{MSE}}\nolimits_\mathrm{opt} &= \frac{1}{n} f(x) R(K) \Bigl( \frac{n \mu_2(K)^2 f''(x)^2}{f(x) R(K)} \Bigr)^{1/5} + \frac14 \mu_2(K)^2 f''(x)^2 \Bigl( \frac{f(x) R(K)}{n \mu_2(K)^2 f''(x)^2} \Bigr)^{4/5} \\ &= \frac54 \; \frac{1}{n^{4/5}} \; \Bigl( R(K)^2 \mu_2(K) \Bigr)^{2/5} \; \Bigl( f(x)^2 |f''(x)| \Bigr)^{2/5}. \end{align*}\] This expression clearly shows the contribution of \(n\), \(K\) and \(f\):

  • If the bandwidth is chosen optimally, then as \(n\) increases the bandwidth \(h\) decreases proportionally to \(1/n^{1/5}\) and the MSE decreases proportionally to \(1 / n^{4/5}\). For comparison, in a Monte Carlo estimate of an expectation the MSE decreases proportionally to \(1/n\), so the error in kernel density estimation decreases slightly more slowly than for Monte Carlo estimates.

  • The error is proportional to \(\bigl( R(K)^2 \mu_2(K) \bigr)^{2/5}\). Thus we should use kernels for which the value \(R(K)^2 \mu_2(K)\) is small; a short numerical comparison of this quantity for some common kernels is sketched after this list.

  • The error is proportional to \(\bigl( f(x)^2 |f''(x)| \bigr)^{2/5}\). We cannot influence this term, but we can see that the estimate is less accurate at points \(x\) where \(f(x)\) is large or where \(f\) has high curvature.
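To compare kernels as in the second bullet point above, we can evaluate \(R(K)^2 \mu_2(K)\) numerically for a few common choices. A Python sketch (the selection of kernels and the integration grid are just illustrative choices):

```python
import numpy as np

u = np.linspace(-8, 8, 800001)
du = u[1] - u[0]

kernels = {
    "Gaussian": np.exp(-u**2 / 2) / np.sqrt(2 * np.pi),
    "Epanechnikov": np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0),
    "uniform": np.where(np.abs(u) <= 1, 0.5, 0.0),
}
for name, K in kernels.items():
    RK = np.sum(K**2) * du        # roughness R(K)
    mu2 = np.sum(u**2 * K) * du   # second moment mu_2(K)
    print(name, RK**2 * mu2)
# the Epanechnikov kernel gives the smallest value (about 0.072);
# the Gaussian (about 0.080) and uniform (about 0.083) kernels are only slightly worse
```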

Summary

  • We found the variance of the kernel density estimate.
  • We studied the mean squared error for small \(h\).
  • We derived a formula for the optimal value of the bandwidth \(h\).