Section 2 The Bias of Kernel Density Estimates
In the previous section we introduced the kernel density estimate
\begin{equation}
\hat f_h(x) = \frac{1}{n} \sum_{i=1}^n K_h(x - x_i) \tag{2.1}
\end{equation}
for estimating the density $f$, and we argued that $\hat f_h(x) \approx f(x)$. The aim of the current section is to quantify the error of this approximation and to understand how this error depends on the true density $f$ and on the bandwidth $h > 0$.
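For concreteness, the following minimal Python sketch shows how $\hat f_h(x)$ from (2.1) can be evaluated. The Gaussian kernel, the simulated data and the bandwidth are illustrative assumptions, not part of the notes.

```python
import numpy as np

def K(u):
    """Gaussian kernel (an illustrative choice): a symmetric probability density."""
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def f_hat(x, data, h):
    """Kernel density estimate (1/n) * sum_i K_h(x - x_i), where K_h(u) = K(u/h) / h."""
    return np.mean(K((x - data) / h) / h)

rng = np.random.default_rng(1)
data = rng.normal(size=100)        # illustrative data x_1, ..., x_n
print(f_hat(0.0, data, h=0.3))     # estimate of the density at x = 0
```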
2.1 A Statistical Model
As usual, we will make a statistical model for the data $x_1, \ldots, x_n$, and then use this model to analyse how well the estimator performs. The statistical model we will consider here is extremely simple: we model the $x_i$ using random variables $X_1, \ldots, X_n \sim f$, which we assume to be independent and identically distributed (i.i.d.). Here, the notation $X \sim f$, where $f$ is a probability density, simply denotes that the random variable $X$ has density $f$.
It is important not to confuse $x$ (the point where we evaluate the densities during our analysis) with the data $x_i$. A statistical model describes the data, so here we get random variables $X_1, \ldots, X_n$ to describe the behaviour of $x_1, \ldots, x_n$, but the model does not describe $x$. The number $x$ is not part of the data, so it will never be modelled by a random variable.
While the model is very simple, much simpler for example than the model we use in the level 3 part of the module for linear regression, the associated parameter estimation problem is more challenging. The only “parameter” in this model is the function $f\colon \mathbb{R} \to \mathbb{R}$, rather than just a vector of numbers. The space of all possible density functions $f$ is infinite-dimensional, so this is a more challenging estimation problem than the one we consider, for example, for linear regression. Since $f$ is not a “parameter” in the usual sense, this problem is sometimes called a “non-parametric” estimation problem.
Our estimate for the density $f$ is the function $\hat f_h\colon \mathbb{R} \to \mathbb{R}$, where $\hat f_h(x)$ is given by (2.1) for every $x \in \mathbb{R}$.
2.2 The Bias of the Estimate
As usual, the bias of our estimate is the difference between what the estimator gives on average and the truth. For our estimation problem we get
\[
\mathop{\mathrm{bias}}\bigl( \hat f_h(x) \bigr) = \mathbb{E}\bigl( \hat f_h(x) \bigr) - f(x).
\]
The expectation on the right-hand side averages over the randomness in the data, by using $X_1, \ldots, X_n$ from the model in place of the data.
Substituting in the definition of $\hat f_h(x)$ from equation (2.1) we find
\[
\mathbb{E}\bigl( \hat f_h(x) \bigr)
= \mathbb{E}\Bigl( \frac{1}{n} \sum_{i=1}^n K_h(x - X_i) \Bigr)
= \frac{1}{n} \sum_{i=1}^n \mathbb{E}\bigl( K_h(x - X_i) \bigr)
\]
and since the $X_i$ are identically distributed, we can replace all $X_i$ with $X_1$ (or any other of them) to get
\[
\mathbb{E}\bigl( \hat f_h(x) \bigr)
= \frac{1}{n} \sum_{i=1}^n \mathbb{E}\bigl( K_h(x - X_1) \bigr)
= \frac{1}{n} \, n \, \mathbb{E}\bigl( K_h(x - X_1) \bigr)
= \mathbb{E}\bigl( K_h(x - X_1) \bigr).
\]
Since the model assumes $X_1$ (and all the other $X_i$) to have density $f$, we can write this expectation as an integral to get
\[
\mathbb{E}\bigl( \hat f_h(x) \bigr)
= \int_{-\infty}^\infty K_h(x - y) f(y) \,dy
= \int_{-\infty}^\infty f(y) K_h(y - x) \,dy
= \int_{-\infty}^\infty f(z + x) K_h(z) \,dz,
\]
where we used the symmetry of $K_h$ and the substitution $z = y - x$.
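The identity $\mathbb{E}\bigl( \hat f_h(x) \bigr) = \int K_h(x - y) f(y) \,dy$ can be checked numerically. The sketch below assumes a standard normal density $f$, a Gaussian kernel, and arbitrary illustrative choices of $x$, $h$ and $n$; it compares a Monte Carlo average of $\hat f_h(x)$ over many simulated data sets with a Riemann-sum approximation of the integral.

```python
import numpy as np

def K(u):
    """Gaussian kernel."""
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def f(y):
    """Assumed true density: standard normal."""
    return np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)

x, h, n = 0.5, 0.4, 50
rng = np.random.default_rng(2)

# Monte Carlo estimate of E(f_hat_h(x)), averaging over many simulated data sets
vals = [np.mean(K((x - rng.normal(size=n)) / h) / h) for _ in range(2000)]
mc_mean = np.mean(vals)

# Riemann-sum approximation of the integral  int K_h(x - y) f(y) dy
y = np.linspace(-8.0, 8.0, 4001)
integral = np.sum(K((x - y) / h) / h * f(y)) * (y[1] - y[0])

print(mc_mean, integral)           # the two values should be close
```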
2.3 Moments of Kernels
To understand how the bias changes as $h$ varies, we will need to consider properties of $K$ and $K_h$ in more detail.
Definition 2.1 The $k$th moment of a kernel $K$, for $k \in \mathbb{N}_0 = \{0, 1, 2, \ldots\}$, is given by
\[
\mu_k(K) = \int_{-\infty}^\infty x^k K(x) \,dx.
\]
The second moment $\mu_2$ is sometimes also called the variance of the kernel $K$.
Using the properties of $K$, we find the following results (a small numerical check follows after the list):
- Since $x^0 = 1$ for all $x \in \mathbb{R}$, the 0th moment is $\mu_0(K) = \int_{-\infty}^\infty K(x) \,dx = 1$ for every kernel $K$.
- Since $K$ is symmetric, the function $x \mapsto x K(x)$ is antisymmetric and we have $\mu_1(K) = \int_{-\infty}^\infty x K(x) \,dx = 0$ for every kernel $K$.
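The following short sketch checks these moment properties numerically for the Gaussian kernel; the kernel choice and the integration grid are illustrative assumptions, not part of the notes.

```python
import numpy as np

def K(u):
    """Gaussian kernel."""
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

# Riemann-sum approximation of mu_k(K) = int x^k K(x) dx for k = 0, 1, 2
x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
for k in range(3):
    mu_k = np.sum(x**k * K(x)) * dx
    print(k, mu_k)                 # expect approximately 1, 0, 1 for this kernel
```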
The moments of the rescaled kernel $K_h$, given by $K_h(x - y) = \frac{1}{h} K\bigl( \frac{x - y}{h} \bigr)$, can be computed from the moments of $K$.
Lemma 2.1 Let $K$ be a kernel, $k \in \mathbb{N}_0$ and $h > 0$. Then $\mu_k(K_h) = h^k \mu_k(K)$.
Proof. We have
\[
\mu_k(K_h) = \int_{-\infty}^\infty x^k K_h(x) \,dx = \int_{-\infty}^\infty x^k \frac{1}{h} K\Bigl( \frac{x}{h} \Bigr) \,dx.
\]
Using the substitution $y = x / h$ we find
\[
\mu_k(K_h) = \int_{-\infty}^\infty (hy)^k \frac{1}{h} K(y) \, h \,dy = h^k \int_{-\infty}^\infty y^k K(y) \,dy = h^k \mu_k(K).
\]
This completes the proof.
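A quick numerical illustration of Lemma 2.1 is given below; it assumes a Gaussian kernel, and the bandwidth and integration grid are arbitrary illustrative choices.

```python
import numpy as np

def K(u):
    """Gaussian kernel."""
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def Kh(u, h):
    """Rescaled kernel K_h(u) = K(u / h) / h."""
    return K(u / h) / h

h = 0.7                                     # illustrative bandwidth
x = np.linspace(-15.0, 15.0, 30001)
dx = x[1] - x[0]
for k in range(3):
    lhs = np.sum(x**k * Kh(x, h)) * dx      # mu_k(K_h)
    rhs = h**k * np.sum(x**k * K(x)) * dx   # h^k * mu_k(K)
    print(k, lhs, rhs)                      # the two columns should agree
```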
It is easy to check that if $K$ is a kernel, then $K_h$ is also a kernel, which implies that $K_h$ is a probability density. If $Y$ is a random variable with density $K_h$, written as $Y \sim K_h$ in short, then we find
\[
\mathbb{E}(Y) = \int y K_h(y) \,dy = \mu_1(K_h) = 0
\]
and, since $\mathbb{E}(Y) = 0$,
\begin{equation}
\mathop{\mathrm{Var}}(Y) = \mathbb{E}(Y^2) = \int y^2 K_h(y) \,dy = \mu_2(K_h) = h^2 \mu_2(K). \tag{2.3}
\end{equation}
Thus, $Y$ is centred and the variance of $Y$ is proportional to $h^2$.
2.4 The Bias for Small Bandwidth
Considering again the formula $\mathbb{E}\bigl( \hat f_h(x) \bigr) = \int_{-\infty}^\infty f(x + y) K_h(y) \,dy$, we see that we can interpret this integral as an expectation with respect to a random variable $Y \sim K_h$:
\begin{equation}
\mathbb{E}\bigl( \hat f_h(x) \bigr) = \mathbb{E}\bigl( f(x + Y) \bigr). \tag{2.4}
\end{equation}
Equation (2.3) shows that for $h \downarrow 0$ the random variable $Y$ concentrates more and more around $0$, and thus $x + Y$ concentrates more and more around $x$. For this reason we expect $\mathbb{E}\bigl( \hat f_h(x) \bigr) \approx f(x)$ for small $h$.
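This concentration effect can be illustrated with a short simulation of $\mathbb{E}\bigl( f(x + Y) \bigr)$; the Gaussian kernel (so that $Y \sim N(0, h^2)$) and the standard normal $f$ are assumptions made only for this sketch.

```python
import numpy as np

def f(u):
    """Assumed true density: standard normal."""
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

rng = np.random.default_rng(3)
x = 0.5
for h in [1.0, 0.3, 0.1]:
    Y = h * rng.normal(size=100_000)     # Y ~ K_h when K is the Gaussian kernel
    print(h, np.mean(f(x + Y)), f(x))    # E(f(x + Y)) approaches f(x) as h -> 0
```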
To get a more quantitative version of this argument, we consider the Taylor approximation of $f$ around the point $x$:
\[
f(x + y) \approx f(x) + y f'(x) + \frac{y^2}{2} f''(x).
\]
Substituting this into equation (2.4) we find
\begin{align*}
\mathbb{E}\bigl(\hat f_h(x)\bigr)
&\approx \mathbb{E}\Bigl( f(x) + Y f'(x) + \frac{Y^2}{2} f''(x) \Bigr) \\
&= f(x) + \mathbb{E}(Y) f'(x) + \frac12 \mathbb{E}(Y^2) f''(x) \\
&= f(x) + \frac12 h^2 \mu_2(K) f''(x)
\end{align*}
for small $h$, where we used $\mathbb{E}(Y) = 0$ and $\mathbb{E}(Y^2) = h^2 \mu_2(K)$ from above. Considering the bias again, this gives
\begin{equation}
\mathop{\mathrm{bias}}\bigl( \hat f_h(x) \bigr) = \mathbb{E}\bigl( \hat f_h(x) \bigr) - f(x) \approx \frac{\mu_2(K) f''(x)}{2} h^2 \tag{2.5}
\end{equation}
which shows that the bias of the estimator decreases quadratically as $h$ gets smaller.
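To see approximation (2.5) at work, the sketch below compares the exact bias with the $h^2$ approximation for a standard normal $f$ and a Gaussian kernel; these choices are illustrative assumptions, under which $\mathbb{E}\bigl( \hat f_h(x) \bigr)$ is the $N(0, 1 + h^2)$ density evaluated at $x$ (the convolution of the two normal densities).

```python
import numpy as np

def phi(u, s=1.0):
    """Density of N(0, s^2)."""
    return np.exp(-u**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

x = 0.5
f_x = phi(x)
f2_x = (x**2 - 1) * phi(x)          # f''(x) for the standard normal density
mu2_K = 1.0                         # second moment of the Gaussian kernel

for h in [0.5, 0.2, 0.1, 0.05]:
    # E(f_hat_h(x)) is the convolution of K_h and f, here the N(0, 1 + h^2) density
    exact_bias = phi(x, s=np.sqrt(1 + h**2)) - f_x
    approx_bias = 0.5 * mu2_K * f2_x * h**2      # right-hand side of (2.5)
    print(h, exact_bias, approx_bias)            # the approximation improves as h decreases
```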
In contrast, we will see in the next section that the variance of the estimator increases as $h \downarrow 0$. We will need to balance these two effects to find the optimal value of $h$.
Summary
- We have introduced a statistical model for density estimation.
- The bias for kernel density estimation can be written as an integral.
- We learned how the moments of a kernel are defined.
- The bias for small bandwidth depends on the second moment of the kernel and the second derivative of the density.