Section 2 The Error of Kernel Density Estimates
In the previous section we introduced the kernel density estimate \[\begin{equation} \hat f_h(x) = \frac{1}{n} \sum_{i=1}^n K_h(x - x_i) \tag{2.1} \end{equation}\] for estimating the density \(f\), and we argued that \(\hat f_h(x) \approx f(x)\). The aim of the current section is to quantify the error of this approximation and to understand how this error depends on the true density \(f\) and on the bandwidth \(h > 0\).
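Although these notes focus on the mathematical analysis, it may help to see the estimator in code. The following is a minimal sketch (not part of the derivation) of how one might evaluate \(\hat f_h(x)\) from equation (2.1) in Python, assuming a Gaussian kernel; the names `gaussian_kernel` and `kde` are illustrative choices.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u) = exp(-u^2/2) / sqrt(2 pi)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x, data, h):
    """Evaluate the kernel density estimate (2.1) at the point(s) x."""
    x = np.atleast_1d(x)
    # K_h(x - x_i) = K((x - x_i) / h) / h, averaged over the sample
    u = (x[:, None] - data[None, :]) / h
    return gaussian_kernel(u).mean(axis=1) / h

# toy example: 100 simulated data points, bandwidth h = 0.5
rng = np.random.default_rng(1)
data = rng.normal(size=100)
print(kde([-1.0, 0.0, 1.0], data, h=0.5))
```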
2.1 A Statistical Model
As usual, we will make a statistical model for the data \(x_1, \ldots, x_n\), and then use this model to analyse how well the estimator performs. The statistical model we will consider here is extremely simple: we model the \(x_i\) using random variables \[\begin{equation} X_1, \ldots, X_n \sim f, \tag{2.2} \end{equation}\] which we assume to be independent and identically distributed (i.i.d.). Here, the notation \(X \sim f\), where \(f\) is a probability density, simply denotes that the random variable \(X\) has density \(f\).
It is important not to confuse \(x\) (the point at which we evaluate the densities during our analysis) with the data \(x_i\). A statistical model describes the data, so here we use random variables \(X_1, \ldots, X_n\) to describe the behaviour of \(x_1, \ldots, x_n\), but the model does not describe \(x\). The number \(x\) is not part of the data, so it will never be modelled by a random variable.
While the model is very simple (for example, it is much simpler than the model we use in the level 3 part of the module for linear regression), the associated parameter estimation problem is more challenging. The only “parameter” in this model is the function \(f \colon\mathbb{R}\to \mathbb{R}\), rather than just a vector of numbers. The space of all possible density functions \(f\) is infinite dimensional, so this is a more challenging estimation problem than the one we consider, for example, for linear regression. Since \(f\) is not a “parameter” in the usual sense, this problem is sometimes called a “non-parametric” estimation problem.
Our estimate for the density \(f\) is the function \(\hat f_h\colon \mathbb{R}\to \mathbb{R}\), where \(\hat f_h(x)\) is given by (2.1) for every \(x \in\mathbb{R}\).
2.2 Bias
2.2.1 The Bias of the Estimate
As usual, the bias of our estimate is the difference between what the estimator gives on average and the truth. For our estimation problem we get \[\begin{equation*} \mathop{\mathrm{bias}}\bigl(\hat f_h(x)\bigr) = \mathbb{E}\bigl(\hat f_h(x)\bigr) - f(x). \end{equation*}\] The expectation on the right-hand side averages over the randomness in the data, by using \(X_1, \ldots, X_n\) from the model in place of the data.
Substituting in the definition of \(\hat f_h(x)\) from equation (2.1) we find \[\begin{align*} \mathbb{E}\bigl(\hat f_h(x)\bigr) &= \mathbb{E}\Bigl( \frac{1}{n} \sum_{i=1}^n K_h(x - X_i) \Bigr) \\ &= \frac{1}{n} \sum_{i=1}^n \mathbb{E}\bigl( K_h(x - X_i) \bigr) \end{align*}\] and since the \(X_i\) are identically distributed, we can replace all \(X_i\) with \(X_1\) (or any other of them) to get \[\begin{align*} \mathbb{E}\bigl(\hat f_h(x)\bigr) &= \frac{1}{n} \sum_{i=1}^n \mathbb{E}\bigl( K_h(x - X_1) \bigr) \\ &= \frac{1}{n} n \, \mathbb{E}\bigl( K_h(x - X_1) \bigr) \\ &= \mathbb{E}\bigl( K_h(x - X_1) \bigr). \end{align*}\] Since the model assumes \(X_1\) (and all the other \(X_i\)) to have density \(f\), we can write this expectation as an integral to get \[\begin{align*} \mathbb{E}\bigl(\hat f_h(x)\bigr) &= \int_{-\infty}^\infty K_h(x - y) \, f(y) \, dy \\ &= \int_{-\infty}^\infty f(y) \, K_h(y - x) \, dy \\ &= \int_{-\infty}^\infty f(z+x) \, K_h(z) \, dz \end{align*}\] where we used the symmetry of \(K_h\) and the substitution \(z = y - x\).
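The identity \(\mathbb{E}\bigl(\hat f_h(x)\bigr) = \int K_h(x - y) f(y) \,dy\) can also be checked numerically. Below is a small sketch which, assuming a standard normal density \(f\) and a Gaussian kernel (choices made purely for illustration), compares a Monte Carlo estimate of the left-hand side with a quadrature evaluation of the right-hand side.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

x, h, n = 0.5, 0.3, 50
f = stats.norm.pdf                            # illustrative choice: f is the N(0,1) density
K_h = lambda u: stats.norm.pdf(u / h) / h     # rescaled Gaussian kernel K_h

# left-hand side: Monte Carlo approximation of E(f_hat_h(x)),
# averaging K_h(x - X_i) over many simulated data sets
rng = np.random.default_rng(0)
samples = rng.normal(size=(10_000, n))
lhs = np.mean(K_h(x - samples))

# right-hand side: the integral of K_h(x - y) f(y) dy, by quadrature
rhs, _ = quad(lambda y: K_h(x - y) * f(y), -np.inf, np.inf)
print(lhs, rhs)   # the two values should agree up to Monte Carlo error
```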
2.2.2 Moments of Kernels
To understand how the bias changes as \(h\) varies, we will need to consider properties of \(K\) and \(K_h\) in more detail.
Definition 2.1 The \(k\)th moment of a kernel \(K\), for \(k \in \mathbb{N}_0 = \{0, 1, 2, \ldots\}\), is given by \[\begin{equation*} \mu_k(K) = \int_{-\infty}^\infty x^k K(x) \,dx. \end{equation*}\]
The second moment \(\mu_2\) is sometimes also called the variance of the kernel \(K\).
Using the properties of \(K\), we find the following results:
Since \(x^0 = 1\) for all \(x\in\mathbb{R}\), the \(0\)th moment is \(\mu_0(K) = \int_{-\infty}^\infty K(x) \,dx = 1\) for every kernel \(K\).
Since \(K\) is symmetric, the function \(x \mapsto x K(x)\) is antisymmetric and we have \[\begin{equation*} \mu_1(K) = \int_{-\infty}^\infty x K(x) \,dx = 0 \end{equation*}\] for every kernel \(K\).
The moments of the rescaled kernel \(K_h\), given by \[\begin{equation*} K_h(x - y) = \frac{1}{h} K\Bigl( \frac{x-y}{h} \Bigr), \end{equation*}\] can be computed from the moments of \(K\).
Lemma 2.1 Let \(K\) be a kernel, \(k \in \mathbb{N}_0\) and \(h > 0\). Then \[\begin{equation*} \mu_k(K_h) = h^k \mu_k(K). \end{equation*}\]
Proof. We have \[\begin{align*} \mu_k(K_h) &= \int_{-\infty}^\infty x^k K_h(x) \,dx \\ &= \int_{-\infty}^\infty x^k \frac1h K\Bigl(\frac{x}{h}\Bigr) \,dx. \end{align*}\] Using the substitution \(y = x/h\) we find \[\begin{align*} \mu_k(K_h) &= \int_{-\infty}^\infty (hy)^k \frac1h K(y) \, h \,dy \\ &= h^k \int_{-\infty}^\infty y^k K(y) \,dy \\ &= h^k \mu_k(K). \end{align*}\] This completes the proof.
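Lemma 2.1 is also easy to verify numerically. The following sketch (assuming a Gaussian kernel, purely as an example) computes \(\mu_k(K_h)\) by numerical integration and compares it with \(h^k \mu_k(K)\).

```python
import numpy as np
from scipy.integrate import quad

K = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel (illustrative choice)

def moment(kernel, k):
    """k-th moment of a kernel, computed by numerical integration."""
    value, _ = quad(lambda x: x**k * kernel(x), -np.inf, np.inf)
    return value

h = 0.7
K_h = lambda x: K(x / h) / h
for k in range(3):
    # the two printed values should agree, illustrating mu_k(K_h) = h^k mu_k(K)
    print(k, moment(K_h, k), h**k * moment(K, k))
```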
By Lemma 1.1, if \(K\) is a kernel, then \(K_h\) is also a kernel, which implies that \(K_h\) is a probability density. If \(Y\) is a random variable with density \(K_h\), written \(Y \sim K_h\) for short, then we find \[\begin{equation*} \mathbb{E}(Y) = \int y K_h(y) \,dy = \mu_1(K_h) = 0 \end{equation*}\] and \[\begin{equation} \mathop{\mathrm{Var}}(Y) = \mathbb{E}(Y^2) = \int y^2 K_h(y) \,dy = \mu_2(K_h) = h^2 \, \mu_2(K). \tag{2.3} \end{equation}\] Thus, \(Y\) is centred and the variance of \(Y\) is proportional to \(h^2\).
2.2.3 The Bias for Small Bandwidth
Considering again the formula \[\begin{equation*} \mathbb{E}\bigl(\hat f_h(x)\bigr) = \int_{-\infty}^\infty f(x+y) \, K_h(y) \, dy, \end{equation*}\] we see that we can interpret this integral as an expectation with respect to a random variable \(Y \sim K_h\): \[\begin{equation} \mathbb{E}\bigl(\hat f_h(x)\bigr) = \mathbb{E}\bigl( f(x+Y) \bigr). \tag{2.4} \end{equation}\] Equation (2.3) shows that for \(h \downarrow 0\) the random variable \(Y\) concentrates more and more around \(0\), and thus \(x+Y\) concentrates more and more around \(x\). For this reason we expect \(\mathbb{E}\bigl(\hat f_h(x)\bigr) \approx f(x)\) for small \(h\).
To get a more quantitative version of this argument, we consider the Taylor approximation of \(f\) around the point \(x\): \[\begin{equation*} f(x + y) \approx f(x) + y f'(x) + \frac{y^2}{2} f''(x). \end{equation*}\] Substituting this into equation (2.4) we find \[\begin{align*} \mathbb{E}\bigl(\hat f_h(x)\bigr) &\approx \mathbb{E}\Bigl( f(x) + Y f'(x) + \frac{Y^2}{2} f''(x) \Bigr) \\ &= f(x) + \mathbb{E}(Y) f'(x) + \frac12 \mathbb{E}(Y^2) f''(x) \\ &= f(x) + \frac12 h^2 \mu_2(K) f''(x) \end{align*}\] for small \(h\). Considering the bias again, this gives \[\begin{equation} \mathop{\mathrm{bias}}\bigl( \hat f_h(x) \bigr) = \mathbb{E}\bigl( \hat f_h(x) \bigr) - f(x) \approx \frac{\mu_2(K) f''(x)}{2} h^2 \tag{2.5} \end{equation}\] which shows that the bias of the estimator decreases quadratically as \(h\) gets smaller.
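We can check the quality of the approximation (2.5) numerically. For a Gaussian kernel and a standard normal density \(f\) (assumptions made only for this illustration), the convolution \(K_h * f\) is again a normal density, so the exact bias is available in closed form and can be compared with \(\tfrac12 h^2 \mu_2(K) f''(x)\):

```python
import numpy as np
from scipy import stats

x = 0.5
f = stats.norm.pdf                                # illustrative choice: f is the N(0,1) density
f2 = lambda t: (t**2 - 1) * stats.norm.pdf(t)     # its second derivative f''

for h in [0.4, 0.2, 0.1, 0.05]:
    # for a Gaussian kernel, K_h * f is the N(0, 1 + h^2) density,
    # so E(f_hat_h(x)) is available in closed form
    exact_bias = stats.norm.pdf(x, scale=np.sqrt(1 + h**2)) - f(x)
    approx_bias = 0.5 * h**2 * f2(x)              # equation (2.5) with mu_2(K) = 1
    print(h, exact_bias, approx_bias)             # the ratio approaches 1 as h decreases
```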
In contrast, the variance of the estimator (computed below) increases as \(h \downarrow 0\). We will need to balance these two effects to find the optimal value of \(h\).
2.3 Variance
We now turn to computing the variance of the estimator. Recall from above that \[\begin{equation} \mathbb{E}\bigl( \hat f_h(x) \bigr) = \mathbb{E}\bigl( K_h(x - X_i) \bigr) \tag{2.6} \end{equation}\] for all \(i \in \{1, \ldots, n\}\). We will use similar arguments to derive the variance.
We use the formula \[\begin{equation*} \mathop{\mathrm{Var}}\bigl( \hat f_h(x) \bigr) = \mathbb{E}\bigl( \hat f_h(x)^2 \bigr) - \mathbb{E}\bigl( \hat f_h(x) \bigr)^2 \end{equation*}\] and consider the two terms separately.
2.3.1 Second Moment
For the second moment term in the definition of the variance we get \[\begin{align*} \mathbb{E}\bigl( \hat f_h(x)^2 \bigr) &= \mathbb{E}\Bigl( \frac{1}{n} \sum_{i=1}^n K_h(x - X_i) \frac{1}{n} \sum_{j=1}^n K_h(x - X_j) \Bigr) \\ &= \frac{1}{n^2} \mathbb{E}\Bigl( \sum_{i,j=1}^n K_h(x - X_i) K_h(x - X_j) \Bigr) \\ &= \frac{1}{n^2} \sum_{i,j=1}^n \mathbb{E}\Bigl( K_h(x - X_i) K_h(x - X_j) \Bigr). \end{align*}\] Since the \(X_i\) are independent and identically distributed, the value of each term in this sum depends only on whether \(i = j\) or \(i \neq j\), not on the particular values of \(i\) and \(j\). For the \(n\) terms where \(i=j\) we can assume that both indices equal 1, and for the \(n(n-1)\) terms where \(i\neq j\) we can assume \(i=1\) and \(j=2\), without changing the value of the expectation. So we get \[\begin{align*} \mathbb{E}\bigl( \hat f_h(x)^2 \bigr) &= \frac{1}{n^2} \Bigl( n \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) + n(n-1) \mathbb{E}\bigl( K_h(x - X_1) K_h(x - X_2) \bigr) \Bigr) \\ &= \frac{1}{n^2} \Bigl( n \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) + n(n-1) \mathbb{E}\bigl( K_h(x - X_1) \bigr) \mathbb{E}\bigl( K_h(x - X_2) \bigr) \Bigr) \\ &= \frac{1}{n^2} \Bigl( n \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) + n(n-1) \mathbb{E}\bigl( K_h(x - X_1) \bigr)^2 \Bigr) \\ &= \frac{1}{n} \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) + \frac{n-1}{n} \mathbb{E}\bigl( \hat f_h(x) \bigr)^2, \end{align*}\] where we used the independence of \(X_1\) and \(X_2\) in the second step and equation (2.6) for the last term. Using this expression, we can write the variance as \[\begin{align*} \mathop{\mathrm{Var}}\bigl( \hat f_h(x) \bigr) &= \mathbb{E}\bigl( \hat f_h(x)^2 \bigr) - \mathbb{E}\bigl( \hat f_h(x) \bigr)^2 \\ &= \frac{1}{n} \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) + \Bigl(\frac{n-1}{n} - 1\Bigr) \mathbb{E}\bigl( \hat f_h(x) \bigr)^2. \end{align*}\] Since \((n-1)/n - 1 = -1/n\) we get \[\begin{equation} \mathop{\mathrm{Var}}\bigl( \hat f_h(x) \bigr) = \frac{1}{n} \Bigl( \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) - \mathbb{E}\bigl( \hat f_h(x) \bigr)^2 \Bigr). \tag{2.7} \end{equation}\]
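Equation (2.7) is exact, so it can be checked directly by simulation. The sketch below (assuming standard normal data and a Gaussian kernel, purely for illustration) compares the Monte Carlo variance of \(\hat f_h(x)\) over many simulated data sets with the right-hand side of (2.7).

```python
import numpy as np
from scipy import stats

x, h, n, reps = 0.5, 0.3, 50, 20_000
K_h = lambda u: stats.norm.pdf(u / h) / h       # rescaled Gaussian kernel (illustrative choice)
rng = np.random.default_rng(2)

k = K_h(x - rng.normal(size=(reps, n)))         # values K_h(x - X_i) for many simulated data sets
f_hat = k.mean(axis=1)                          # one value of f_hat_h(x) per data set

lhs = f_hat.var()                               # Monte Carlo estimate of Var(f_hat_h(x))
rhs = (np.mean(k**2) - np.mean(k)**2) / n       # right-hand side of equation (2.7)
print(lhs, rhs)                                 # should agree up to Monte Carlo error
```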
2.3.2 The Roughness of a Kernel
Similar to the bias analysis, we will use a Taylor expansion of \(f\) around the point \(x\) to understand the behaviour of the variance for small \(h\). Some more care is needed here, because this time the result also depends on the sample size \(n\) and we will consider the joint limit of \(n \to \infty\) and \(h\to 0\). For the first term in equation (2.7) we find \[\begin{align*} \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) &= \int K_h(x - y)^2 f(y) \,dy \\ &= \int K_h(y - x)^2 f(y) \,dy \\ &= \int \frac{1}{h^2} K\Bigl( \frac{y - x}{h} \Bigr)^2 f(y) \,dy. \end{align*}\] Now we use the substitution \(z = (y - x) / h\). This gives \[\begin{align*} \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) &= \int \frac{1}{h^2} K(z)^2 f(x + hz) \,h \,dz. \end{align*}\] Finally, we use the Taylor approximation to get \[\begin{align*} \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) &\approx \int \frac{1}{h} K(z)^2 \Bigl( f(x) + hz\,f'(x) + \frac12 h^2 z^2 \,f''(x) \Bigr) \,dz \\ &= \frac{1}{h} f(x) \int K(z)^2 \,dz + f'(x) \int z K(z)^2 \,dz + \frac12 h f''(x) \int z^2 K(z)^2 \,dz \\ &= \frac{1}{h} f(x) \int K(z)^2 \,dz + \frac12 h f''(x) \int z^2 K(z)^2 \,dz, \end{align*}\] where the first-order term disappears since \(z \mapsto z K(z)^2\) is an antisymmetric function. As a shorthand we use the following definition.
Definition 2.2 The roughness of a kernel \(K\) is given by \[\begin{equation*} R(K) := \int_{-\infty}^\infty K(x)^2 \,dx. \end{equation*}\]
This gives the result \[\begin{equation} \mathbb{E}\bigl( K_h(x - X_1)^2 \bigr) \approx \frac{1}{h} f(x) R(K) + \frac12 h f''(x) \int z^2 K(z)^2 \,dz \tag{2.8} \end{equation}\] for small \(h\).
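For example, the Gaussian kernel has roughness \(R(K) = 1/(2\sqrt{\pi}) \approx 0.282\). A quick quadrature check (a sketch, included only as an illustration):

```python
import numpy as np
from scipy.integrate import quad

K = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel
R_K, _ = quad(lambda x: K(x)**2, -np.inf, np.inf)         # roughness by quadrature
print(R_K, 1 / (2 * np.sqrt(np.pi)))                      # both approximately 0.2821
```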
2.3.3 The Variance for Small Bandwidth
Here we consider the term \(\mathbb{E}\bigl( \hat f_h(x) \bigr)^2\) in the formula for the variance. From the bias analysis above we know \[\begin{equation*} \mathbb{E}\bigl( \hat f_h(x) \bigr) \approx f(x) + \frac12 h^2 \mu_2(K) f''(x) \end{equation*}\] and thus we get \[\begin{equation} \mathbb{E}\bigl( \hat f_h(x) \bigr)^2 \approx f(x)^2 + h^2 \mu_2(K) f(x) f''(x) + \frac14 h^4 \mu_2(K)^2 f''(x)^2 \tag{2.9} \end{equation}\] for small \(h\).
Substituting (2.8) and (2.9) into equation (2.7) we finally find \[\begin{equation*} \mathop{\mathrm{Var}}\bigl( \hat f_h(x) \bigr) = \frac{1}{n} \Bigl( \frac{1}{h} f(x) R(K) - f(x)^2 + \cdots \Bigr), \end{equation*}\] where all the omitted terms go to zero as \(h \downarrow 0\). Omitting one more term and keeping only the leading term we find \[\begin{equation} \mathop{\mathrm{Var}}\bigl( \hat f_h(x) \bigr) \approx \frac{1}{nh} f(x) R(K) \tag{2.10} \end{equation}\] as \(h\downarrow 0\).
2.4 Mean Squared Error
The Mean Squared Error of the estimator \(\hat f_h(x)\) for \(f(x)\) is given by \[\begin{equation*} \mathop{\mathrm{MSE}}\nolimits\bigl( \hat f_h(x) \bigr) = \mathbb{E}\Bigl( \bigl( \hat f_h(x) - f(x) \bigr)^2 \Bigr). \end{equation*}\] It is an easy exercise to show that this can equivalently be written as \[\begin{equation*} \mathop{\mathrm{MSE}}\nolimits\bigl( \hat f_h(x) \bigr) = \mathop{\mathrm{Var}}\bigl( \hat f_h(x) \bigr) + \mathop{\mathrm{bias}}\bigl( \hat f_h(x) \bigr)^2. \end{equation*}\] Substituting equations (2.5) and (2.10) into the formula for the MSE, we get \[\begin{equation} \mathop{\mathrm{MSE}}\nolimits\bigl( \hat f_h(x) \bigr) \approx \frac{1}{nh} f(x) R(K) + \frac14 \mu_2(K)^2 f''(x)^2 h^4. \tag{2.11} \end{equation}\]
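As a sanity check of equation (2.11), the following sketch (again assuming a standard normal \(f\) and a Gaussian kernel, purely for illustration) compares the MSE estimated by simulation with the approximation on the right-hand side for a few bandwidths.

```python
import numpy as np
from scipy import stats

x, n, reps = 0.5, 200, 5_000
f = stats.norm.pdf                                    # illustrative choice: f is the N(0,1) density
f2 = lambda t: (t**2 - 1) * stats.norm.pdf(t)         # its second derivative f''
R_K = 1 / (2 * np.sqrt(np.pi))                        # roughness of the Gaussian kernel
rng = np.random.default_rng(3)

for h in [0.6, 0.3, 0.15]:
    K_h = lambda u: stats.norm.pdf(u / h) / h
    f_hat = K_h(x - rng.normal(size=(reps, n))).mean(axis=1)
    mse_mc = np.mean((f_hat - f(x))**2)               # MSE estimated by simulation
    mse_formula = f(x) * R_K / (n * h) + 0.25 * f2(x)**2 * h**4   # equation (2.11), mu_2(K) = 1
    print(h, mse_mc, mse_formula)
```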
Some care is needed to make sure that the omitted terms from the Taylor approximations really don’t make a significant contribution in this formula for the MSE: the additional contributions from the variance have the form \(e_1(h) / n\), where the error term \(e_1\) does not depend on \(n\) and is negligible compared to \(f(x) R(K) / h\) as \(h \downarrow 0\). Using little-o notation, this is sometimes denoted by \(e_1(h) = o(1/h)\), which indicates that \(e_1(h) / (1/h) \to 0\) as \(h \downarrow 0\). The additional terms from the squared bias, say \(e_2(h)\), do not depend on \(n\) and are negligible compared to \(\mu_2(K)^2 f''(x)^2 h^4\). We can write \(e_2(h) = o(h^4)\) as \(h \downarrow 0\) to reflect this fact. We can summarise these results as \[\begin{equation*} \mathop{\mathrm{MSE}}\nolimits\bigl( \hat f_h(x) \bigr) = \frac{1}{nh} f(x) R(K) + \frac14 \mu_2(K)^2 f''(x)^2 h^4 + o(1/nh) + o(h^4) \end{equation*}\] as \(h \downarrow 0\), with the understanding that the constants in the definition of \(o(h^4)\) do not depend on \(n\) and that \(o(1/nh)\) really means “\(o(1/h)\), where the constants are proportional to \(1/n\).”
2.5 Integrated Error
From equation (2.11) we know \[\begin{equation*} \mathop{\mathrm{MSE}}\nolimits\bigl( \hat f_h(x) \bigr) \approx \frac{1}{nh} f(x) R(K) + \frac14 \mu_2(K)^2 f''(x)^2 h^4. \end{equation*}\] This gives the mean squared error when trying to estimate the density \(f(x)\) at a fixed point \(x\). Usually we are interested in estimating the function \(f\) rather than individual values \(f(x)\). In this case, we consider the integrated mean squared error (IMSE): \[\begin{equation*} \mathrm{IMSE}\bigl( \hat f_h \bigr) := \int_{-\infty}^\infty \mathop{\mathrm{MSE}}\nolimits\bigl( \hat f_h(x) \bigr) \,dx. \end{equation*}\] Using our result from above we find \[\begin{align*} \mathrm{IMSE}\bigl( \hat f_h \bigr) &\approx \int \Bigl( \frac{1}{nh} f(x) R(K) + \frac14 \mu_2(K)^2 f''(x)^2 h^4 \Bigr) \,dx \\ &= \frac{1}{nh} R(K) \int f(x) \,dx + \frac{h^4}{4} \mu_2(K)^2 \int f''(x)^2 \,dx \\ &= \frac{1}{nh} R(K) + \frac{1}{4} \mu_2(K)^2 R(f'') h^4, \end{align*}\] where we (mis-)use the roughness notation as an abbreviation, writing \(R(f'') = \int f''(x)^2 \,dx\).
As before, we can use differentiation to find the optimal value of \(h\). Here we get \[\begin{equation*} h_\mathrm{opt} = \Bigl( \frac{R(K)}{n \mu_2(K)^2 R(f'')} \Bigr)^{1/5}, \end{equation*}\] and the corresponding error is \[\begin{equation} \mathrm{IMSE}_\mathrm{opt} = \frac54 \; \frac{1}{n^{4/5}} \; \Bigl( R(K)^2 \mu_2(K) \Bigr)^{2/5} \; R(f'')^{1/5}. \tag{2.12} \end{equation}\] Thus, in order to minimise the error we still need to choose \(h \propto n^{-1/5}\) and we should choose a kernel \(K\) which minimises the value \(R(K)^2 \mu_2(K)\).
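As an illustration of the formula for \(h_\mathrm{opt}\): for a Gaussian kernel and a normal density with standard deviation \(\sigma\) we have \(R(K) = 1/(2\sqrt\pi)\), \(\mu_2(K) = 1\) and \(R(f'') = 3/(8\sqrt\pi\,\sigma^5)\), so the formula reduces to \(h_\mathrm{opt} = (4/3)^{1/5}\,\sigma n^{-1/5} \approx 1.06\,\sigma n^{-1/5}\), the well-known normal reference rule. A short sketch confirming this numerically (the helper `h_opt` is just an illustrative name):

```python
import numpy as np

def h_opt(n, sigma=1.0):
    """Optimal bandwidth for a Gaussian kernel and an N(0, sigma^2) density (illustrative helper)."""
    R_K = 1 / (2 * np.sqrt(np.pi))              # roughness of the Gaussian kernel
    mu2_K = 1.0                                 # second moment of the Gaussian kernel
    R_f2 = 3 / (8 * np.sqrt(np.pi) * sigma**5)  # roughness of f'' for the N(0, sigma^2) density
    return (R_K / (n * mu2_K**2 * R_f2))**0.2

for n in [50, 200, 1000]:
    print(n, h_opt(n), 1.06 * n**(-0.2))        # matches the 1.06 n^(-1/5) rule of thumb
```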
Summary
- We introduced a statistical model for density estimation where data are i.i.d. samples from the unknown density \(f\).
- Kernel moments \(\mu_k(K) = \int x^k K(x) dx\) characterise kernel properties and scale with bandwidth: \(\mu_k(K_h) = h^k \mu_k(K)\).
- For small bandwidth \(h\), the bias is approximately \(\frac{\mu_2(K) f''(x)}{2} h^2\), decreasing quadratically as \(h \to 0\).
- The variance for small \(h\) is approximately \(\frac{f(x) R(K)}{nh}\), where \(R(K) = \int K(x)^2 dx\) is the kernel roughness, and increases as \(h \to 0\).
- The mean squared error (MSE) combines both bias and variance. The integrated mean squared error (IMSE) integrates the MSE over all points and provides a global measure of estimation quality.