Section 2 The Bias of Kernel Density Estimates
In the previous section we introduced the kernel density estimate
\begin{equation}
\hat f_h(x) = \frac{1}{n} \sum_{i=1}^n K_h(x - x_i) \tag{2.1}
\end{equation}
for estimating the density $f$, and we argued that $\hat f_h(x) \approx f(x)$. The aim of the current section is to quantify the error of this approximation and to understand how this error depends on the true density $f$ and on the bandwidth $h > 0$.
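For concreteness, the following minimal Python sketch shows how $\hat f_h(x)$ from (2.1) can be evaluated. The Gaussian kernel, the simulated data and the bandwidth are illustrative assumptions, not part of the notes.

```python
import numpy as np

def K(u):
    """Gaussian kernel (an illustrative choice): a symmetric probability density."""
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def f_hat(x, data, h):
    """Kernel density estimate (1/n) * sum_i K_h(x - x_i), where K_h(u) = K(u/h) / h."""
    return np.mean(K((x - data) / h) / h)

rng = np.random.default_rng(1)
data = rng.normal(size=100)        # illustrative data x_1, ..., x_n
print(f_hat(0.0, data, h=0.3))     # estimate of the density at x = 0
```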
2.1 A Statistical Model
As usual, we will make a statistical model for the data $x_1, \ldots, x_n$, and then use this model to analyse how well the estimator performs. The statistical model we will consider here is extremely simple: we model the $x_i$ using random variables $X_1, \ldots, X_n \sim f$, which we assume to be independent and identically distributed (i.i.d.). Here, the notation $X \sim f$, where $f$ is a probability density, simply denotes that the random variable $X$ has density $f$.
It is important not to confuse $x$ (the point where we evaluate the densities during our analysis) with the data $x_i$. A statistical model describes the data, so here we get random variables $X_1, \ldots, X_n$ to describe the behaviour of $x_1, \ldots, x_n$, but the model does not describe $x$. The number $x$ is not part of the data, so it will never be modelled by a random variable.
While the model is very simple, much simpler for example than the model we use in the level 3 part of the module for linear regression, the associated parameter estimation problem is more challenging. The only “parameter” in this model is the function $f\colon \mathbb{R} \to \mathbb{R}$, rather than just a vector of numbers. The space of all possible density functions $f$ is infinite-dimensional, so this is a more challenging estimation problem than the one we consider, for example, for linear regression. Since $f$ is not a “parameter” in the usual sense, this problem is sometimes called a “non-parametric” estimation problem.
Our estimate for the density $f$ is the function $\hat f_h\colon \mathbb{R} \to \mathbb{R}$, where $\hat f_h(x)$ is given by (2.1) for every $x \in \mathbb{R}$.
2.2 The Bias of the Estimate
As usual, the bias of our estimate is the difference between what the estimator gives on average and the truth. For our estimation problem we get
\[
\mathop{\mathrm{bias}}\bigl( \hat f_h(x) \bigr) = \mathbb{E}\bigl( \hat f_h(x) \bigr) - f(x).
\]
The expectation on the right-hand side averages over the randomness in the data, by using $X_1, \ldots, X_n$ from the model in place of the data.
Substituting in the definition of $\hat f_h(x)$ from equation (2.1) we find
\[
\mathbb{E}\bigl( \hat f_h(x) \bigr)
= \mathbb{E}\Bigl( \frac{1}{n} \sum_{i=1}^n K_h(x - X_i) \Bigr)
= \frac{1}{n} \sum_{i=1}^n \mathbb{E}\bigl( K_h(x - X_i) \bigr)
\]
and since the $X_i$ are identically distributed, we can replace all $X_i$ with $X_1$ (or any other of them) to get
\[
\mathbb{E}\bigl( \hat f_h(x) \bigr)
= \frac{1}{n} \sum_{i=1}^n \mathbb{E}\bigl( K_h(x - X_1) \bigr)
= \frac{1}{n} \, n \, \mathbb{E}\bigl( K_h(x - X_1) \bigr)
= \mathbb{E}\bigl( K_h(x - X_1) \bigr).
\]
Since the model assumes $X_1$ (and all the other $X_i$) to have density $f$, we can write this expectation as an integral to get
\[
\mathbb{E}\bigl( \hat f_h(x) \bigr)
= \int_{-\infty}^\infty K_h(x - y) f(y) \,dy
= \int_{-\infty}^\infty f(y) K_h(y - x) \,dy
= \int_{-\infty}^\infty f(z + x) K_h(z) \,dz,
\]
where we used the symmetry of $K_h$ and the substitution $z = y - x$.
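The identity $\mathbb{E}\bigl( \hat f_h(x) \bigr) = \int K_h(x - y) f(y) \,dy$ can be checked numerically. The sketch below assumes a standard normal density $f$, a Gaussian kernel, and arbitrary illustrative choices of $x$, $h$ and $n$; it compares a Monte Carlo average of $\hat f_h(x)$ over many simulated data sets with a Riemann-sum approximation of the integral.

```python
import numpy as np

def K(u):
    """Gaussian kernel."""
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def f(y):
    """Assumed true density: standard normal."""
    return np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)

x, h, n = 0.5, 0.4, 50
rng = np.random.default_rng(2)

# Monte Carlo estimate of E(f_hat_h(x)), averaging over many simulated data sets
vals = [np.mean(K((x - rng.normal(size=n)) / h) / h) for _ in range(2000)]
mc_mean = np.mean(vals)

# Riemann-sum approximation of the integral  int K_h(x - y) f(y) dy
y = np.linspace(-8.0, 8.0, 4001)
integral = np.sum(K((x - y) / h) / h * f(y)) * (y[1] - y[0])

print(mc_mean, integral)           # the two values should be close
```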
2.3 Moments of Kernels
To understand how the bias changes as $h$ varies, we will need to consider properties of $K$ and $K_h$ in more detail.
Definition 2.1 The $k$th moment of a kernel $K$, for $k \in \mathbb{N}_0 = \{0, 1, 2, \ldots\}$, is given by
\[
\mu_k(K) = \int_{-\infty}^\infty x^k K(x) \,dx.
\]
The second moment $\mu_2$ is sometimes also called the variance of the kernel $K$.
Using the properties of $K$, we find the following results (a small numerical check follows after the list):
- Since $x^0 = 1$ for all $x \in \mathbb{R}$, the 0th moment is $\mu_0(K) = \int_{-\infty}^\infty K(x) \,dx = 1$ for every kernel $K$.
- Since $K$ is symmetric, the function $x \mapsto x K(x)$ is antisymmetric and we have $\mu_1(K) = \int_{-\infty}^\infty x K(x) \,dx = 0$ for every kernel $K$.
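The following short sketch checks these moment properties numerically for the Gaussian kernel; the kernel choice and the integration grid are illustrative assumptions, not part of the notes.

```python
import numpy as np

def K(u):
    """Gaussian kernel."""
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

# Riemann-sum approximation of mu_k(K) = int x^k K(x) dx for k = 0, 1, 2
x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
for k in range(3):
    mu_k = np.sum(x**k * K(x)) * dx
    print(k, mu_k)                 # expect approximately 1, 0, 1 for this kernel
```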
The moments of the rescaled kernel $K_h$, given by $K_h(x - y) = \frac{1}{h} K\bigl( \frac{x - y}{h} \bigr)$, can be computed from the moments of $K$.
Lemma 2.1 Let $K$ be a kernel, $k \in \mathbb{N}_0$ and $h > 0$. Then $\mu_k(K_h) = h^k \mu_k(K)$.
Proof. We have
\[
\mu_k(K_h) = \int_{-\infty}^\infty x^k K_h(x) \,dx = \int_{-\infty}^\infty x^k \frac{1}{h} K\Bigl( \frac{x}{h} \Bigr) \,dx.
\]
Using the substitution $y = x / h$ we find
\[
\mu_k(K_h) = \int_{-\infty}^\infty (hy)^k \frac{1}{h} K(y) \, h \,dy = h^k \int_{-\infty}^\infty y^k K(y) \,dy = h^k \mu_k(K).
\]
This completes the proof.
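A quick numerical illustration of Lemma 2.1 is given below; it assumes a Gaussian kernel, and the bandwidth and integration grid are arbitrary illustrative choices.

```python
import numpy as np

def K(u):
    """Gaussian kernel."""
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def Kh(u, h):
    """Rescaled kernel K_h(u) = K(u / h) / h."""
    return K(u / h) / h

h = 0.7                                     # illustrative bandwidth
x = np.linspace(-15.0, 15.0, 30001)
dx = x[1] - x[0]
for k in range(3):
    lhs = np.sum(x**k * Kh(x, h)) * dx      # mu_k(K_h)
    rhs = h**k * np.sum(x**k * K(x)) * dx   # h^k * mu_k(K)
    print(k, lhs, rhs)                      # the two columns should agree
```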
It is easy to check that if $K$ is a kernel, then $K_h$ is also a kernel, which implies that $K_h$ is a probability density. If $Y$ is a random variable with density $K_h$, written as $Y \sim K_h$ in short, then we find
\[
\mathbb{E}(Y) = \int y K_h(y) \,dy = \mu_1(K_h) = 0
\]
and, since $\mathbb{E}(Y) = 0$,
\begin{equation}
\mathop{\mathrm{Var}}(Y) = \mathbb{E}(Y^2) = \int y^2 K_h(y) \,dy = \mu_2(K_h) = h^2 \mu_2(K). \tag{2.3}
\end{equation}
Thus, $Y$ is centred and the variance of $Y$ is proportional to $h^2$.
2.4 The Bias for Small Bandwidth
Considering again the formula $\mathbb{E}\bigl( \hat f_h(x) \bigr) = \int_{-\infty}^\infty f(x + y) K_h(y) \,dy$, we see that we can interpret this integral as an expectation with respect to a random variable $Y \sim K_h$:
\begin{equation}
\mathbb{E}\bigl( \hat f_h(x) \bigr) = \mathbb{E}\bigl( f(x + Y) \bigr). \tag{2.4}
\end{equation}
Equation (2.3) shows that for $h \downarrow 0$ the random variable $Y$ concentrates more and more around $0$, and thus $x + Y$ concentrates more and more around $x$. For this reason we expect $\mathbb{E}\bigl( \hat f_h(x) \bigr) \approx f(x)$ for small $h$.
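This concentration effect can be illustrated with a short simulation of $\mathbb{E}\bigl( f(x + Y) \bigr)$; the Gaussian kernel (so that $Y \sim N(0, h^2)$) and the standard normal $f$ are assumptions made only for this sketch.

```python
import numpy as np

def f(u):
    """Assumed true density: standard normal."""
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

rng = np.random.default_rng(3)
x = 0.5
for h in [1.0, 0.3, 0.1]:
    Y = h * rng.normal(size=100_000)     # Y ~ K_h when K is the Gaussian kernel
    print(h, np.mean(f(x + Y)), f(x))    # E(f(x + Y)) approaches f(x) as h -> 0
```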
To get a more quantitative version of this argument, we consider the Taylor approximation of $f$ around the point $x$:
\[
f(x + y) \approx f(x) + y f'(x) + \frac{y^2}{2} f''(x).
\]
Substituting this into equation (2.4) we find
\begin{align*}
\mathbb{E}\bigl(\hat f_h(x)\bigr)
&\approx \mathbb{E}\Bigl( f(x) + Y f'(x) + \frac{Y^2}{2} f''(x) \Bigr) \\
&= f(x) + \mathbb{E}(Y) f'(x) + \frac12 \mathbb{E}(Y^2) f''(x) \\
&= f(x) + \frac12 h^2 \mu_2(K) f''(x)
\end{align*}
for small $h$, where we used $\mathbb{E}(Y) = 0$ and $\mathbb{E}(Y^2) = h^2 \mu_2(K)$ from above. Considering the bias again, this gives
\begin{equation}
\mathop{\mathrm{bias}}\bigl( \hat f_h(x) \bigr) = \mathbb{E}\bigl( \hat f_h(x) \bigr) - f(x) \approx \frac{\mu_2(K) f''(x)}{2} h^2 \tag{2.5}
\end{equation}
which shows that the bias of the estimator decreases quadratically as $h$ gets smaller.
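To see approximation (2.5) at work, the sketch below compares the exact bias with the $h^2$ approximation for a standard normal $f$ and a Gaussian kernel; these choices are illustrative assumptions, under which $\mathbb{E}\bigl( \hat f_h(x) \bigr)$ is the $N(0, 1 + h^2)$ density evaluated at $x$ (the convolution of the two normal densities).

```python
import numpy as np

def phi(u, s=1.0):
    """Density of N(0, s^2)."""
    return np.exp(-u**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

x = 0.5
f_x = phi(x)
f2_x = (x**2 - 1) * phi(x)          # f''(x) for the standard normal density
mu2_K = 1.0                         # second moment of the Gaussian kernel

for h in [0.5, 0.2, 0.1, 0.05]:
    # E(f_hat_h(x)) is the convolution of K_h and f, here the N(0, 1 + h^2) density
    exact_bias = phi(x, s=np.sqrt(1 + h**2)) - f_x
    approx_bias = 0.5 * mu2_K * f2_x * h**2      # right-hand side of (2.5)
    print(h, exact_bias, approx_bias)            # the approximation improves as h decreases
```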
In contrast, we will see in the next section that the variance of the estimator increases as $h \downarrow 0$. We will need to balance these two effects to find the optimal value of $h$.
Summary
- We have introduced a statistical model for density estimation.
- The bias for kernel density estimation can be written as an integral.
- We learned how the moments of a kernel are defined.
- The bias for small bandwidth depends on the second moment of the kernel and the second derivative of the density.