Interlude: Vectorisation in R

In this section we explore techniques for writing efficient R code, using kernel density estimation as an example. We start with a straightforward implementation using for-loops, then show how to speed it up by replacing loops with operations on entire vectors and matrices at once. These techniques apply to many computational problems in statistics.

For our experiments we will use the same “snowfall” dataset as in the previous section.

# downloaded from https://teaching.seehuhn.de/data/buffalo/
x <- read.csv("data/buffalo.csv")
snowfall <- x$snowfall

Direct Implementation

The kernel density estimate for given data \(x_1, \ldots, x_n \in\mathbb{R}\) is \[\begin{equation*} \hat f_h(x) = \frac{1}{n} \sum_{i=1}^n K_h(x - x_i) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K\Bigl( \frac{x - x_i}{h} \Bigr). \end{equation*}\] Our aim is to implement this formula in R.

We use the triangular kernel from equation (1.2):

K <- function(x) pmax(1 - abs(x), 0)

The use of pmax() allows us to evaluate K() for several \(x\)-values in parallel:

K(c(-1, -0.5, 0, 0.5, 1))
## [1] 0.0 0.5 1.0 0.5 0.0

With this in place, we can now easily compute the kernel density estimate. As mentioned in the previous section, the choice of kernel normalization and bandwidth is not unique: rescaling the kernel and adjusting the bandwidth can give the same \(K_h\). The R function density() uses a specific convention: the bw parameter is defined to be the standard deviation of the smoothing kernel. For the triangular kernel, this means that density() uses a rescaled version with variance 1, which differs from our kernel in equation (1.2) by a factor of \(1/\sqrt{6}\). Thus, to get output comparable to bw=4 (bottom left panel in figure 1.3), we use \(h = 4\sqrt{6}\):

h <- 4 * sqrt(6) # bandwidth
x <- 100 # the point at which to estimate f
mean(1/h * K((x - snowfall) / h))
## [1] 0.0108237

To plot the estimated density, we typically evaluate \(\hat f_h\) on a grid of \(x\)-values. Since \(x_1, \ldots, x_n\) denotes the input data, we use \(\tilde x_1, \ldots, \tilde x_m\) for the evaluation points:

x.tilde <- seq(25, 200, by = 1)
f.hat <- numeric(length(x.tilde)) # empty vector to store the results
for (j in 1:length(x.tilde)) {
  f.hat[j] <- mean(1/h * K((x.tilde[j] - snowfall) / h))
}

Figure 1.4 shows the result is similar to figure 1.3.

plot(x.tilde, f.hat, type = "l",
     xlab = "snowfall", ylab = "density",
     ylim = c(0, 0.025))
Manually computed kernel density estimate for the snowfall data, corresponding to bw=4 in density().

Figure 1.4: Manually computed kernel density estimate for the snowfall data, corresponding to bw=4 in density().

This code is slower than density() (as we will see below), but gives us full control over the computation.

Vectorization

We now explore how to speed up our implementation. First, we encapsulate the code above into a function:

KDE1 <- function(x.tilde, x, K, h) {
  f.hat <- numeric(length(x.tilde))
  for (j in 1:length(x.tilde)) {
    f.hat[j] <- mean(1/h * K((x.tilde[j] - x) / h))
  }
  f.hat
}

A common source of inefficiency in R is the use of loops. The for-loop in KDE1() computes differences \(\tilde x_j - x_i\) for all \(j\). We have already avoided a second loop by treating x.tilde[j] - x as a vector operation, using pmax() in K, and using mean() for the sum. This approach of replacing explicit loops with operations on entire vectors is called vectorisation.

To eliminate the loop over j, we use outer(), which applies a function to all pairs of elements from two vectors. The result is a matrix with rows corresponding to the first vector and columns to the second. For example:

x <- c(1, 2, 3)
y <- c(1, 2)
outer(x, y, "-")
##      [,1] [,2]
## [1,]    0   -1
## [2,]    1    0
## [3,]    2    1

We use outer(x.tilde, x, "-") to compute all \(\tilde x_j - x_i\) at once, giving an \(m \times n\) matrix to which we apply K(). This fully vectorised implementation avoids all explicit loops:

KDE2 <- function(x.tilde, x, K, h) {
  dx <- outer(x.tilde, x, "-")
  rowMeans(1/h * K(dx / h))
}

We can verify that both functions return the same result:

KDE1(c(50, 100, 150), snowfall, K, h)
## [1] 0.0078239091 0.0108236983 0.0007448277
KDE2(c(50, 100, 150), snowfall, K, h)
## [1] 0.0078239091 0.0108236983 0.0007448277

There is another, minor optimisation we can make. Instead of dividing the \(m\times n\) matrix elements of dx by h, we can achieve the same effect by dividing the vectors x and x.tilde, i.e. only \(m+n\) numbers, by \(h\). This reduces the number of operations required. Similarly, instead of multiplying the \(m\times n\) kernel values by \(1/h\), we can divide the \(m\) row means by \(h\):

KDE3 <- function(x.tilde, x, K, h) {
  dx <- outer(x.tilde / h, x / h, "-")
  rowMeans(K(dx)) / h
}

Again, KDE3() returns the same result:

KDE3(c(50, 100, 150), snowfall, K, h)
## [1] 0.0078239091 0.0108236983 0.0007448277

We measure execution times using microbenchmark:

library("microbenchmark")

microbenchmark(
  KDE1 = KDE1(x.tilde, snowfall, K, h),
  KDE2 = KDE2(x.tilde, snowfall, K, h),
  KDE3 = KDE3(x.tilde, snowfall, K, h),
  times = 1000)
## Unit: microseconds
##  expr     min       lq     mean   median       uq      max neval
##  KDE1 807.987 878.4250 968.8089 894.0050 913.5415 5753.694  1000
##  KDE2 108.650 156.4355 206.0156 162.2985 168.1410 4186.510  1000
##  KDE3  97.457 131.7330 162.3908 136.4480 140.9580 3876.263  1000

The output shows execution times for 1000 runs. The introduction of outer() in KDE2() gives a substantial speedup, and KDE3() improves further.

Finally, we compare to density(), using arguments from, to, and n to specify the same grid \(\tilde x\):

KDE.builtin <- function(x.tilde, x, kernel_name, h) {
    density(x,
      kernel = kernel_name, bw = h / sqrt(6),
      from = min(x.tilde), to = max(x.tilde), n = length(x.tilde))$y
}
microbenchmark(density = KDE.builtin(x.tilde, snowfall, "triangular", h), times = 1000)
## Unit: microseconds
##     expr    min     lq     mean  median     uq      max neval
##  density 90.323 98.359 112.5831 100.655 103.73 3564.909  1000

The result shows that density() is faster than KDE3(), but by a factor of less than 2. The reason is that density() uses a more efficient algorithm based on Fast Fourier Transform to compute an approximation to the kernel density estimate. By comparing the outputs of KDE3() and KDE.builtin() we can see that the approximation is very accurate:

f.hat1 <- KDE3(x.tilde, snowfall, K, h)
f.hat2 <- KDE.builtin(x.tilde, snowfall, "triangular", h)
max(abs(f.hat1 - f.hat2))
## [1] 1.935582e-05

Summary

  • Vectorisation is the technique of replacing explicit loops with operations on entire vectors.
  • Vectorised code dramatically improves performance in R.
  • The outer() function enables fully vectorised implementations for pairwise operations.
  • Further speedups can be achieved by reducing the number of operations on large matrices.