Section 2 The Error of Kernel Density Estimates

In the previous section we introduced the kernel density estimate \[\begin{equation} \hat f_h(x) = \frac{1}{n} \sum_{i=1}^n K_h(x - x_i) \tag{2.1} \end{equation}\] for estimating the density \(f\), and we argued that \(\hat f_h(x) \approx f(x)\). The aim of the current section is to quantify the error of this approximation and to understand how this error depends on the true density \(f\) and on the bandwidth \(h > 0\).
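To make formula (2.1) concrete, here is a short Python sketch which evaluates \(\hat f_h(x)\) on a grid of points; the Gaussian kernel and the toy data set are chosen only for illustration.

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel K(u) = exp(-u^2/2) / sqrt(2*pi)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x, data, h):
    """Kernel density estimate (2.1): the average of K_h(x - x_i) over the data."""
    x = np.atleast_1d(x)
    # K_h(u) = K(u / h) / h; broadcast over evaluation points and data points
    u = (x[:, None] - data[None, :]) / h
    return gaussian_kernel(u).mean(axis=1) / h

rng = np.random.default_rng(1)
data = rng.normal(size=100)        # toy data set
grid = np.linspace(-3, 3, 5)
print(kde(grid, data, h=0.5))      # estimated density values on the grid
```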

2.1 A Statistical Model

As usual, we will make a statistical model for the data \(x_1, \ldots, x_n\), and then use this model to analyse how well the estimator performs. The statistical model we will consider here is extremely simple: we model the \(x_i\) using random variables \[\begin{equation} X_1, \ldots, X_n \sim f, \tag{2.2} \end{equation}\] which we assume to be independent and identically distributed (i.i.d.). Here, the notation \(X \sim f\), where \(f\) is a probability density, simply denotes that the random variable \(X\) has density \(f\).

It is important not to confuse \(x\) (the point at which we evaluate the densities in our analysis) with the data \(x_i\). A statistical model describes the data, so here we use random variables \(X_1, \ldots, X_n\) to describe the behaviour of \(x_1, \ldots, x_n\), but the model does not describe \(x\). The number \(x\) is not part of the data, so it is never modelled by a random variable.

While the model is very simple (much simpler, for example, than the model we use in the level 3 part of the module for linear regression), the associated estimation problem is more challenging. The only “parameter” in this model is the function \(f \colon\mathbb{R}\to \mathbb{R}\), rather than just a vector of numbers. The space of all possible density functions \(f\) is infinite-dimensional, so this is a harder estimation problem than, say, the one for linear regression. Since \(f\) is not a parameter in the usual sense, this is often called a “non-parametric” estimation problem.

Our estimate for the density \(f\) is the function \(\hat f_h\colon \mathbb{R}\to \mathbb{R}\), where \(\hat f_h(x)\) is given by (2.1) for every \(x \in\mathbb{R}\).

2.2 Moments of Kernels

For our analysis we will need to consider properties of \(K\) and \(K_h\) in more detail.

Definition 2.1 The \(k\)th moment of a kernel \(K\), for \(k \in \mathbb{N}_0 = \{0, 1, 2, \ldots\}\), is given by \[\begin{equation*} \mu_k(K) = \int_{-\infty}^\infty x^k K(x) \,dx. \end{equation*}\]

The second moment \(\mu_2\) is sometimes also called the variance of the kernel \(K\).

Using the properties of \(K\), we find the following results:

  • Since \(x^0 = 1\) for all \(x\in\mathbb{R}\), the \(0\)th moment is \(\mu_0(K) = \int_{-\infty}^\infty K(x) \,dx = 1\) for every kernel \(K\).

  • Since \(K\) is symmetric, the function \(x \mapsto x K(x)\) is antisymmetric and we have \[\begin{equation*} \mu_1(K) = \int_{-\infty}^\infty x K(x) \,dx = 0 \end{equation*}\] for every kernel \(K\).

The moments of the rescaled kernel \(K_h\) can be computed from the moments of \(K\).

Lemma 2.1 Let \(K\) be a kernel, \(k \in \mathbb{N}_0\) and \(h > 0\). Then \[\begin{equation*} \mu_k(K_h) = h^k \mu_k(K). \end{equation*}\]

Proof. We have \[\begin{align*} \mu_k(K_h) &= \int_{-\infty}^\infty x^k K_h(x) \,dx \\ &= \int_{-\infty}^\infty x^k \frac1h K\Bigl(\frac{x}{h}\Bigr) \,dx. \end{align*}\] Using the substitution \(y = x/h\) we find \[\begin{align*} \mu_k(K_h) &= \int_{-\infty}^\infty (hy)^k \frac1h K(y) \, h \,dy \\ &= h^k \int_{-\infty}^\infty y^k K(y) \,dy \\ &= h^k \mu_k(K). \end{align*}\] This completes the proof.
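As a quick numerical illustration (using the Gaussian kernel and numerical integration; any other kernel would work equally well), we can check the values \(\mu_0(K) = 1\) and \(\mu_1(K) = 0\) together with the scaling from Lemma 2.1:

```python
import numpy as np
from scipy.integrate import quad

def K(u):
    """Gaussian kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def K_h(u, h):
    """Rescaled kernel K_h(u) = K(u / h) / h."""
    return K(u / h) / h

def moment(g, k):
    """k-th moment of a density g, computed by numerical integration."""
    return quad(lambda u: u**k * g(u), -np.inf, np.inf)[0]

h = 0.3
for k in range(3):
    lhs = moment(lambda u: K_h(u, h), k)
    rhs = h**k * moment(K, k)
    print(k, round(lhs, 6), round(rhs, 6))   # mu_k(K_h) vs h^k * mu_k(K)
```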

By lemma 1.1, if \(K\) is a kernel, then \(K_h\) is also a kernel which implies that \(K_h\) is a probability density. If \(Y\) is a random variable with density \(K_h\), written as \(Y \sim K_h\) in short, then we find \[\begin{equation*} \mathbb{E}(Y) = \int y K_h(y) \,dy = \mu_1(K_h) = 0 \end{equation*}\] and \[\begin{equation} \mathop{\mathrm{Var}}(Y) = \mathbb{E}(Y^2) = \int y^2 K_h(y) \,dy = \mu_2(K_h) = h^2 \, \mu_2(K). \tag{2.3} \end{equation}\] Thus, \(Y\) is centred and the variance of \(Y\) is proportional to \(h^2\).

2.3 The Roughness of Kernels

In addition to the moments of a kernel, we will need one more kernel property for our analysis.

Definition 2.2 The roughness of a kernel \(K\) is given by \[\begin{equation*} R(K) := \int_{-\infty}^\infty K(x)^2 \,dx. \end{equation*}\]

The roughness measures how concentrated the kernel is: a kernel with higher roughness has more mass concentrated in a smaller region, while a kernel with lower roughness is more spread out.

Similar to the moments, the roughness of the rescaled kernel \(K_h\) can be expressed in terms of the roughness of \(K\).

Lemma 2.2 Let \(K\) be a kernel and \(h > 0\). Then \[\begin{equation*} R(K_h) = \frac{1}{h} R(K). \end{equation*}\]

Proof. We have \[\begin{align*} R(K_h) &= \int_{-\infty}^\infty K_h(x)^2 \,dx \\ &= \int_{-\infty}^\infty \Bigl( \frac{1}{h} K\Bigl(\frac{x}{h}\Bigr) \Bigr)^2 \,dx \\ &= \int_{-\infty}^\infty \frac{1}{h^2} K\Bigl(\frac{x}{h}\Bigr)^2 \,dx. \end{align*}\] Using the substitution \(y = x/h\) we find \[\begin{align*} R(K_h) &= \int_{-\infty}^\infty \frac{1}{h^2} K(y)^2 \, h \,dy \\ &= \frac{1}{h} \int_{-\infty}^\infty K(y)^2 \,dy \\ &= \frac{1}{h} R(K). \end{align*}\] This completes the proof.
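A similar numerical check (again with the Gaussian kernel, chosen only for illustration) confirms Lemma 2.2:

```python
import numpy as np
from scipy.integrate import quad

def K(u):
    """Gaussian kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

h = 0.3
R_K = quad(lambda u: K(u)**2, -np.inf, np.inf)[0]             # R(K) = 1/(2*sqrt(pi))
R_Kh = quad(lambda u: (K(u / h) / h)**2, -np.inf, np.inf)[0]  # R(K_h)
print(R_Kh, R_K / h)   # the two values agree
```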

2.4 Mean Squared Error

In this section we will quantify the error of the kernel density estimate \(\hat f_h(x)\) for estimating \(f(x)\) by computing its mean squared error. The mean squared error combines two sources of error: bias (systematic error) and variance (random error due to sampling). We will see how the bandwidth \(h\) affects both of these components, and how they must be balanced to minimise the overall error.

2.4.1 Result

The following theorem characterises the mean squared error of the kernel density estimate for small bandwidth \(h\).

Theorem 2.1 Let \(X_1, \ldots, X_n\) be i.i.d. random variables with density \(f\), and let \(\hat f_h(x)\) be the kernel density estimate defined in (2.1). Then, as \(h \downarrow 0\), the mean squared error of \(\hat f_h(x)\) is given by \[\begin{equation} \mathop{\mathrm{MSE}}\nolimits\bigl( \hat f_h(x) \bigr) = \frac{1}{nh} f(x) R(K) + \frac14 \mu_2(K)^2 f''(x)^2 h^4 + o(1/nh) + o(h^4), \tag{2.4} \end{equation}\] where \(R(K) = \int K(x)^2 \,dx\) is the roughness of the kernel and \(\mu_2(K) = \int x^2 K(x) \,dx\) is the second moment of the kernel.

Here we use little-o notation: the term \(o(1/nh)\) represents error terms of the form \(e_1(h)/n\), where \(e_1(h) = o(1/h)\) as \(h \downarrow 0\), meaning that \(e_1(h) / (1/h) \to 0\), with constants that do not depend on \(n\). Thus \(o(1/nh)\) really means “\(o(1/h)\), where the constants are proportional to \(1/n\)”. Similarly, \(o(h^4)\) represents error terms \(e_2(h)\) which do not depend on \(n\) and satisfy \(e_2(h) / h^4 \to 0\) as \(h \downarrow 0\).

The theorem shows that the MSE has two main terms with opposing behaviour: the first term \(\frac{1}{nh} f(x) R(K)\) comes from the variance of the estimate and decreases as \(h\) increases, while the second term \(\frac14 \mu_2(K)^2 f''(x)^2 h^4\) comes from the squared bias and increases as \(h\) increases. These competing effects lead to an optimal choice of bandwidth that balances bias and variance.

We will not give a complete formal proof of this theorem, but we will derive the main components in the following subsections. First we compute the bias of the estimate and show that it is approximately \(\frac{\mu_2(K) f''(x)}{2} h^2\) for small \(h\). Then we compute the variance and show that it is approximately \(\frac{f(x) R(K)}{nh}\) for small \(h\). Combining these results using the standard identity \(\mathop{\mathrm{MSE}}\nolimits= \mathop{\mathrm{Var}}+ \mathop{\mathrm{bias}}^2\) yields the theorem.
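Before we start the derivation, the theorem can be illustrated by simulation. The following sketch assumes a Gaussian kernel and a standard normal density \(f\) (so that \(f''\) is known explicitly), estimates the MSE at a fixed point \(x\) from repeated samples, and compares it with the two main terms of (2.4); the remaining discrepancy corresponds to the \(o(1/nh)\) and \(o(h^4)\) terms.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, h, x = 2000, 0.08, 0.5
K = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel

def f_hat(x, data, h):
    """Kernel density estimate (2.1) at a single point x."""
    return K((x - data) / h).mean() / h

# Monte Carlo estimate of MSE(f_hat(x)) from repeated samples of size n
reps = 2000
estimates = np.array([f_hat(x, rng.normal(size=n), h) for _ in range(reps)])
mse_sim = np.mean((estimates - norm.pdf(x))**2)

# main terms of (2.4); the Gaussian kernel has R(K) = 1/(2 sqrt(pi)) and mu_2(K) = 1
R_K, mu2 = 1 / (2 * np.sqrt(np.pi)), 1.0
f, f2 = norm.pdf(x), (x**2 - 1) * norm.pdf(x)            # f(x) and f''(x), standard normal
mse_approx = f * R_K / (n * h) + 0.25 * mu2**2 * f2**2 * h**4

# the two values agree up to the o(1/nh) and o(h^4) terms
print(mse_sim, mse_approx)
```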

2.4.2 Bias

2.4.2.1 The Bias of the Estimate

As usual, the bias of our estimate is the difference between what the estimator gives on average and the truth. For our estimation problem we get \[\begin{equation*} \mathop{\mathrm{bias}}\bigl(\hat f_h(x)\bigr) = \mathbb{E}\bigl(\hat f_h(x)\bigr) - f(x). \end{equation*}\] The expectation on the right-hand side averages over the randomness in the data, by using \(X_1, \ldots, X_n\) from the model in place of the data.

Substituting in the definition of \(\hat f_h(x)\) from equation (2.1) we find \[\begin{align*} \mathbb{E}\bigl(\hat f_h(x)\bigr) &= \mathbb{E}\Bigl( \frac{1}{n} \sum_{i=1}^n K_h(x - X_i) \Bigr) \\ &= \frac{1}{n} \sum_{i=1}^n \mathbb{E}\bigl( K_h(x - X_i) \bigr) \end{align*}\] and since the \(X_i\) are identically distributed, we can replace all \(X_i\) with \(X_1\) (or any other of them) to get \[\begin{align*} \mathbb{E}\bigl(\hat f_h(x)\bigr) &= \frac{1}{n} \sum_{i=1}^n \mathbb{E}\bigl( K_h(x - X_1) \bigr) \\ &= \frac{1}{n} n \, \mathbb{E}\bigl( K_h(x - X_1) \bigr) \\ &= \mathbb{E}\bigl( K_h(x - X_1) \bigr). \end{align*}\] Since the model assumes \(X_1\) (and all the other \(X_i\)) to have density \(f\), we can write this expectation as an integral to get \[\begin{align*} \mathbb{E}\bigl(\hat f_h(x)\bigr) &= \int_{-\infty}^\infty K_h(x - y) \, f(y) \, dy \\ &= \int_{-\infty}^\infty f(y) \, K_h(y - x) \, dy \\ &= \int_{-\infty}^\infty f(z+x) \, K_h(z) \, dz \end{align*}\] where we used the symmetry of \(K_h\) and the substitution \(z = y - x\).

2.4.2.2 The Bias for Small Bandwidth

Considering again the formula \[\begin{equation*} \mathbb{E}\bigl(\hat f_h(x)\bigr) = \int_{-\infty}^\infty f(x+y) \, K_h(y) \, dy, \end{equation*}\] we see that we can interpret this integral as an expectation with respect to a random variable \(Y \sim K_h\): \[\begin{equation} \mathbb{E}\bigl(\hat f_h(x)\bigr) = \mathbb{E}\bigl( f(x+Y) \bigr). \tag{2.5} \end{equation}\] Equation (2.3) shows that for \(h \downarrow 0\) the random variable \(Y\) concentrates more and more around \(0\), and thus \(x+Y\) concentrates more and more around \(x\). For this reason we expect \(\mathbb{E}\bigl(\hat f_h(x)\bigr) \approx f(x)\) for small \(h\).

To get a more quantitative version of this argument, we consider the Taylor approximation of \(f\) around the point \(x\): \[\begin{equation*} f(x + y) \approx f(x) + y f'(x) + \frac{y^2}{2} f''(x). \end{equation*}\] Substituting this into equation (2.5) we find \[\begin{align*} \mathbb{E}\bigl(\hat f_h(x)\bigr) &\approx \mathbb{E}\Bigl( f(x) + Y f'(x) + \frac{Y^2}{2} f''(x) \Bigr) \\ &= f(x) + \mathbb{E}(Y) f'(x) + \frac12 \mathbb{E}(Y^2) f''(x) \\ &= f(x) + \frac12 h^2 \mu_2(K) f''(x) \end{align*}\] for small \(h\). Considering the bias again, this gives \[\begin{equation} \mathop{\mathrm{bias}}\bigl( \hat f_h(x) \bigr) = \mathbb{E}\bigl( \hat f_h(x) \bigr) - f(x) \approx \frac{\mu_2(K) f''(x)}{2} h^2 \tag{2.6} \end{equation}\] which shows that the bias of the estimator decreases quadratically as \(h\) gets smaller.
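As a numerical check of (2.6), we can compute the exact expectation \(\mathbb{E}\bigl(\hat f_h(x)\bigr) = \int K_h(x - y) f(y)\,dy\) by numerical integration and compare the resulting bias with the approximation. The sketch below assumes a Gaussian kernel and a standard normal density \(f\), so that \(f''(x) = (x^2 - 1) f(x)\).

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

K = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel
x = 0.0

for h in [0.5, 0.25, 0.125]:
    # exact bias: E(f_hat(x)) - f(x) = int K_h(x - y) f(y) dy - f(x)
    mean_exact = quad(lambda y: K((x - y) / h) / h * norm.pdf(y), -np.inf, np.inf)[0]
    bias_exact = mean_exact - norm.pdf(x)
    # approximation (2.6) with mu_2(K) = 1 and f''(x) = (x^2 - 1) f(x)
    bias_approx = 0.5 * (x**2 - 1) * norm.pdf(x) * h**2
    # the agreement improves as h decreases
    print(h, round(bias_exact, 6), round(bias_approx, 6))
```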

2.4.3 Variance

We now turn to computing the variance of the estimator. Recall from the bias analysis above that \[\begin{equation} \mathbb{E}\bigl( \hat f_h(x) \bigr) = \mathbb{E}\bigl( K_h(x - X_i) \bigr) \tag{2.7} \end{equation}\] for all \(i \in \{1, \ldots, n\}\). We will use similar arguments to derive the variance.

We use the formula \[\begin{equation*} \mathop{\mathrm{Var}}\bigl( \hat f_h(x) \bigr) = \mathbb{E}\bigl( \hat f_h(x)^2 \bigr) - \mathbb{E}\bigl( \hat f_h(x) \bigr)^2 \end{equation*}\] and consider the two terms separately.

2.4.3.1 Second Moment

For the second moment term in the definition of the variance we get \[\begin{align*} \mathbb{E}\bigl( \hat f_h(x)^2 \bigr) &= \mathbb{E}\Bigl( \frac{1}{n} \sum_{i=1}^n K_h(x - X_i) \frac{1}{n} \sum_{j=1}^n K_h(x - X_j) \Bigr) \\ &= \frac{1}{n^2} \mathbb{E}\Bigl( \sum_{i,j=1}^n K_h(x - X_i) K_h(x - X_j) \Bigr) \\ &= \frac{1}{n^2} \sum_{i,j=1}^n \mathbb{E}\Bigl( K_h(x - X_i) K_h(x - X_j) \Bigr). \end{align*}\] Since the \(X_i\) are i.i.d., the value of each expectation depends only on whether \(i = j\) or \(i \neq j\). For the \(n\) terms where \(i=j\) we can assume that both indices equal 1, and for the \(n(n-1)\) terms where \(i\neq j\) we can assume \(i=1\) and \(j=2\), without changing the value of the expectation. Using the independence of \(X_1\) and \(X_2\) to factor the expectation in the cross terms, we get \[\begin{align*} \mathbb{E}\bigl( \hat f_h(x)^2 \bigr) &= \frac{1}{n^2} \Bigl( n \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) + n(n-1) \mathbb{E}\bigl( K_h(x - X_1) K_h(x - X_2) \bigr) \Bigr) \\ &= \frac{1}{n^2} \Bigl( n \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) + n(n-1) \mathbb{E}\bigl( K_h(x - X_1) \bigr) \mathbb{E}\bigl( K_h(x - X_2) \bigr) \Bigr) \\ &= \frac{1}{n^2} \Bigl( n \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) + n(n-1) \mathbb{E}\bigl( K_h(x - X_1) \bigr)^2 \Bigr) \\ &= \frac{1}{n} \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) + \frac{n-1}{n} \mathbb{E}\bigl( \hat f_h(x) \bigr)^2, \end{align*}\] where we used equation (2.7) for the last term. Using this equation, we can write the variance as \[\begin{align*} \mathop{\mathrm{Var}}\bigl( \hat f_h(x) \bigr) &= \mathbb{E}\bigl( \hat f_h(x)^2 \bigr) - \mathbb{E}\bigl( \hat f_h(x) \bigr)^2 \\ &= \frac{1}{n} \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) + \Bigl(\frac{n-1}{n} - 1\Bigr) \mathbb{E}\bigl( \hat f_h(x) \bigr)^2. \end{align*}\] Since \((n-1)/n - 1 = -1/n\) we get \[\begin{equation} \mathop{\mathrm{Var}}\bigl( \hat f_h(x) \bigr) = \frac{1}{n} \Bigl( \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) - \mathbb{E}\bigl( \hat f_h(x) \bigr)^2 \Bigr). \tag{2.8} \end{equation}\]

2.4.3.2 Approximating \(\mathbb{E}\bigl( K_h(x - X_1)^2 \bigr)\)

Similar to the bias analysis, we will use Taylor expansion of \(f\) around the point \(x\) to understand the behaviour of the variance for small \(h\). Some more care is needed here, because this time the result also depends on the sample size \(n\) and we will consider the joint limit of \(n \to \infty\) and \(h\to 0\). For the first term in equation (2.8) we find \[\begin{align*} \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) &= \int K_h(x - y)^2 f(y) \,dy \\ &= \int K_h(y - x)^2 f(y) \,dy \\ &= \int \frac{1}{h^2} K\Bigl( \frac{y - x}{h} \Bigr)^2 f(y) \,dy. \end{align*}\] Now we use the substitution \(z = (y - x) / h\). This gives \[\begin{align*} \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) &= \int \frac{1}{h^2} K(z)^2 f(x + hz) \,h \,dz \end{align*}\] Finally, we use Taylor approximation to get \[\begin{align*} \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) &\approx \int \frac{1}{h} K(z)^2 \Bigl( f(x) + hz\,f'(x) + \frac12 h^2 z^2 \,f''(x) \Bigr) \,dz \\ &= \frac{1}{h} f(x) \int K(z)^2 \,dz + f'(x) \int z K(z)^2 \,dz + \frac12 h f''(x) \int z^2 K(z)^2 \,dz \\ &= \frac{1}{h} f(x) \int K(z)^2 \,dz + \frac12 h f''(x) \int z^2 K(z)^2 \,dz \end{align*}\] where the first-order term disappears since \(z \mapsto z K(z)^2\) is an antisymmetric function. Using the definition of the roughness of a kernel, \(R(K) = \int K(x)^2 \,dx\), this gives the result \[\begin{equation} \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) \approx \frac{1}{h} f(x) R(K) + \frac12 h f''(x) \int z^2 K(z)^2 \,dz \tag{2.9} \end{equation}\] for small \(h\).

2.4.3.3 The Variance for Small Bandwidth

Here we consider the term \(\mathbb{E}\bigl( \hat f_h(x) \bigr)^2\) in the formula for the variance. From the bias analysis above we know \[\begin{equation*} \mathbb{E}\bigl( \hat f_h(x) \bigr) \approx f(x) + \frac12 h^2 \mu_2(K) f''(x) \end{equation*}\] and thus we get \[\begin{equation} \mathbb{E}\bigl( \hat f_h(x) \bigr)^2 \approx f(x)^2 + h^2 \mu_2(K) f(x) f''(x) + \frac14 h^4 \mu_2(K)^2 f''(x)^2 \tag{2.10} \end{equation}\] for small \(h\).

Substituting equations (2.9) and (2.10) into equation (2.8) we finally find \[\begin{equation*} \mathop{\mathrm{Var}}\bigl( \hat f_h(x) \bigr) = \frac{1}{n} \Bigl( \frac{1}{h} f(x) R(K) - f(x)^2 + \cdots \Bigr), \end{equation*}\] where all the omitted terms go to zero as \(h \downarrow 0\). Since the term \(f(x)^2 / n\) is of smaller order than \(1/nh\) as \(h \downarrow 0\), keeping only the leading term we find \[\begin{equation} \mathop{\mathrm{Var}}\bigl( \hat f_h(x) \bigr) \approx \frac{1}{nh} f(x) R(K) \tag{2.11} \end{equation}\] as \(h\downarrow 0\).
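As for the bias, we can check (2.11) numerically: the sketch below (again assuming a Gaussian kernel and a standard normal \(f\)) computes the exact variance from (2.8) by numerical integration and compares it with \(f(x) R(K) / nh\).

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

K = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel
x, n = 1.0, 500
R_K = 1 / (2 * np.sqrt(np.pi))                           # roughness of the Gaussian kernel

for h in [0.5, 0.25, 0.125]:
    # the two expectations appearing in (2.8)
    EK2 = quad(lambda y: (K((x - y) / h) / h)**2 * norm.pdf(y), -np.inf, np.inf)[0]
    EK = quad(lambda y: K((x - y) / h) / h * norm.pdf(y), -np.inf, np.inf)[0]
    var_exact = (EK2 - EK**2) / n
    var_approx = norm.pdf(x) * R_K / (n * h)
    # the agreement improves as h decreases
    print(h, round(var_exact, 8), round(var_approx, 8))
```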

2.4.4 Result

Recall that the mean squared error can be decomposed as \[\begin{equation*} \mathop{\mathrm{MSE}}\nolimits\bigl( \hat f_h(x) \bigr) = \mathbb{E}\Bigl( \bigl( \hat f_h(x) - f(x) \bigr)^2 \Bigr) = \mathop{\mathrm{Var}}\bigl( \hat f_h(x) \bigr) + \mathop{\mathrm{bias}}\bigl( \hat f_h(x) \bigr)^2. \end{equation*}\] Substituting equations (2.6) and (2.11) into the formula for the MSE, we obtain the main two terms in theorem 2.1: \[\begin{equation*} \mathop{\mathrm{MSE}}\nolimits\bigl( \hat f_h(x) \bigr) \approx \frac{1}{nh} f(x) R(K) + \frac14 \mu_2(K)^2 f''(x)^2 h^4. \end{equation*}\] The additional error terms \(o(1/nh)\) and \(o(h^4)\) in the theorem statement account for the higher-order terms neglected in the Taylor approximations used to derive the bias and variance.

These two main terms have opposing behaviour: the variance term \(\frac{1}{nh} f(x) R(K)\) increases as \(h \downarrow 0\), whilst the squared bias term \(\frac14 \mu_2(K)^2 f''(x)^2 h^4\) decreases as \(h \downarrow 0\). We will need to balance these two effects to find the optimal value of \(h\).

2.5 Optimal Bandwidth for Pointwise MSE

The two terms on the right-hand side of (2.4) are balanced in that the first term decreases for large \(h\) while the second term decreases for small \(h\). By taking derivatives with respect to \(h\), we can find the optimal value of \(h\). Ignoring the higher-order terms, we get \[\begin{equation*} \frac{\partial}{\partial h} \mathop{\mathrm{MSE}}\nolimits\bigl( \hat f_h(x) \bigr) = -\frac{1}{nh^2} f(x) R(K) + \mu_2(K)^2 f''(x)^2 h^3 \end{equation*}\] and thus the derivative equals zero if \[\begin{equation*} \frac{1}{nh^2} f(x) R(K) = \mu_2(K)^2 f''(x)^2 h^3 \end{equation*}\] or, equivalently, \[\begin{equation*} h = h_\mathrm{opt} := \Bigl( \frac{f(x) R(K)}{n \mu_2(K)^2 f''(x)^2} \Bigr)^{1/5}. \end{equation*}\] It is easy to check that this \(h\) corresponds to the minimum of the MSE. This shows how the optimal bandwidth depends both on the kernel and on the target density \(f\). In practice, this formula is hard to use, since \(f''\) is unknown. (We don’t even know \(f\)!)
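For a density with known second derivative the formula can be evaluated directly; the following sketch computes \(h_\mathrm{opt}\) at a fixed point for the (assumed) standard normal density and the Gaussian kernel.

```python
import numpy as np
from scipy.stats import norm

def h_opt_pointwise(x, n, R_K=1 / (2 * np.sqrt(np.pi)), mu2=1.0):
    """Pointwise optimal bandwidth for a standard normal f at the point x."""
    f = norm.pdf(x)
    f2 = (x**2 - 1) * norm.pdf(x)           # f''(x) for the standard normal density
    return (f * R_K / (n * mu2**2 * f2**2))**(1 / 5)

print(h_opt_pointwise(x=1.5, n=1000))
```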

Substituting the optimal value of \(h\) back into equation (2.4), we get \[\begin{align*} \mathop{\mathrm{MSE}}\nolimits_\mathrm{opt} &= \frac{1}{n} f(x) R(K) \Bigl( \frac{n \mu_2(K)^2 f''(x)^2}{f(x) R(K)} \Bigr)^{1/5} + \frac14 \mu_2(K)^2 f''(x)^2 \Bigl( \frac{f(x) R(K)}{n \mu_2(K)^2 f''(x)^2} \Bigr)^{4/5} \\ &= \frac54 \; \frac{1}{n^{4/5}} \; \Bigl( R(K)^2 \mu_2(K) \Bigr)^{2/5} \; \Bigl( f(x)^2 |f''(x)| \Bigr)^{2/5}. \end{align*}\] This expression clearly shows the contribution of \(n\), \(K\) and \(f\):

  • If the bandwidth is chosen optimally, as \(n\) increases the bandwidth \(h\) decreases proportionally to \(1/n^{1/5}\) and the MSE decreases proportionally to \(1 / n^{4/5}\). For comparison, in a Monte Carlo estimate for an expectation, the MSE decreases proportionally to \(1/n\). The error in kernel density estimation therefore decreases slightly more slowly than for Monte Carlo estimates.

  • The error is proportional to \(\bigl( R(K)^2 \mu_2(K) \bigr)^{2/5}\). Thus we should use kernels where the value \(R(K)^2 \mu_2(K)\) is small; a numerical comparison of this value for some common kernels is given after this list.

  • The error is proportional to \(\bigl( f(x)^2 |f''(x)| \bigr)^{2/5}\). We cannot influence this term, but we can see that the estimation error is higher at points \(x\) where \(f\) is large or strongly curved.
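To make the comparison of kernels concrete, the following sketch computes the value \(R(K)^2 \mu_2(K)\) by numerical integration for three common kernels (the uniform, Epanechnikov and Gaussian kernels):

```python
import numpy as np
from scipy.integrate import quad

kernels = {
    "uniform": lambda u: 0.5 * (np.abs(u) <= 1),
    "epanechnikov": lambda u: 0.75 * np.maximum(1 - u**2, 0),
    "gaussian": lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi),
}

for name, K in kernels.items():
    # integrate over [-10, 10]; all three kernels have (essentially) all mass there
    R = quad(lambda u: K(u)**2, -10, 10, points=[-1, 1])[0]
    mu2 = quad(lambda u: u**2 * K(u), -10, 10, points=[-1, 1])[0]
    print(name, round(R**2 * mu2, 5))   # the Epanechnikov kernel gives the smallest value
```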

2.6 Integrated Error

Equation (2.4) gives the mean squared error when estimating the density value \(f(x)\) at a fixed point \(x\). Usually we are interested in estimating the whole function \(f\) rather than individual values \(f(x)\). In this case, we consider the integrated mean squared error (IMSE): \[\begin{equation*} \mathrm{IMSE}\bigl( \hat f_h \bigr) := \int_{-\infty}^\infty \mathop{\mathrm{MSE}}\nolimits\bigl( \hat f_h(x) \bigr) \,dx. \end{equation*}\] Using our result from above we find \[\begin{align*} \mathrm{IMSE}\bigl( \hat f_h \bigr) &\approx \int \Bigl( \frac{1}{nh} f(x) R(K) + \frac14 \mu_2(K)^2 f''(x)^2 h^4 \Bigr) \,dx \\ &= \frac{1}{nh} R(K) \int f(x) \,dx + \frac{h^4}{4} \mu_2(K)^2 \int f''(x)^2 \,dx \\ &= \frac{1}{nh} R(K) + \frac{1}{4} \mu_2(K)^2 R(f'') h^4, \end{align*}\] where we used \(\int f(x) \,dx = 1\) and (mis-)used the roughness notation as an abbreviation for the integral of the squared second derivative, \(R(f'') = \int f''(x)^2 \,dx\).

As in the pointwise case above, we can use differentiation to find the optimal value of \(h\). Here we get \[\begin{equation*} h_\mathrm{opt} = \Bigl( \frac{R(K)}{n \mu_2(K)^2 R(f'')} \Bigr)^{1/5} \end{equation*}\] and the corresponding error is \[\begin{equation} \mathrm{IMSE}_\mathrm{opt} = \frac54 \; \frac{1}{n^{4/5}} \; \Bigl( R(K)^2 \mu_2(K) \Bigr)^{2/5} \; R(f'')^{1/5}. \tag{2.12} \end{equation}\] Thus, in order to minimise the error we still need to choose \(h \propto n^{-1/5}\), and we should choose a kernel \(K\) which minimises the value \(R(K)^2 \mu_2(K)\).
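For a normal reference density everything in this formula can be computed explicitly. The sketch below assumes a \(N(0, \sigma^2)\) density and the Gaussian kernel, computes \(R(f'')\) by numerical integration and evaluates \(h_\mathrm{opt}\); the result agrees with the well-known normal-reference rule of thumb \(h \approx 1.06\, \sigma\, n^{-1/5}\).

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def h_opt_imse(n, sigma=1.0):
    """IMSE-optimal bandwidth for the Gaussian kernel under a N(0, sigma^2) reference."""
    R_K = 1 / (2 * np.sqrt(np.pi))     # roughness of the Gaussian kernel
    mu2 = 1.0                          # second moment of the Gaussian kernel
    # f''(x) for the N(0, sigma^2) density, then R(f'') by numerical integration
    f2 = lambda x: (x**2 / sigma**2 - 1) / sigma**2 * norm.pdf(x, scale=sigma)
    R_f2 = quad(lambda x: f2(x)**2, -np.inf, np.inf)[0]
    return (R_K / (n * mu2**2 * R_f2))**(1 / 5)

n = 1000
print(h_opt_imse(n), 1.06 * n**(-1 / 5))   # close to the 1.06 * sigma * n^(-1/5) rule of thumb
```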

Summary

  • We introduced a statistical model for density estimation where data are i.i.d. samples from the unknown density \(f\).
  • Kernel moments \(\mu_k(K) = \int x^k K(x) dx\) characterise kernel properties and scale with bandwidth: \(\mu_k(K_h) = h^k \mu_k(K)\).
  • For small bandwidth \(h\), the bias is approximately \(\frac{\mu_2(K) f''(x)}{2} h^2\), decreasing quadratically as \(h \to 0\).
  • The variance for small \(h\) is approximately \(\frac{f(x) R(K)}{nh}\), where \(R(K) = \int K(x)^2 dx\) is the kernel roughness, and increases as \(h \to 0\).
  • The mean squared error (MSE) combines both bias and variance. The integrated mean squared error (IMSE) integrates the MSE over all points and provides a global measure of estimation quality.