Interlude: Vectorisation in R

In this section we explore techniques for writing efficient R code, using kernel density estimation as an example. We start with a straightforward implementation using for-loops, then show how to speed it up by replacing loops with operations on entire vectors and matrices at once. These techniques apply to many computational problems in statistics.

For our experiments we will use the same “snowfall” dataset as in the previous section.

# downloaded from https://teaching.seehuhn.de/data/buffalo/
x <- read.csv("data/buffalo.csv")
snowfall <- x$snowfall

Direct Implementation

The kernel density estimate for given data \(x_1, \ldots, x_n \in\mathbb{R}\) is \[\begin{equation*} \hat f_h(x) = \frac{1}{n} \sum_{i=1}^n K_h(x - x_i) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K\Bigl( \frac{x - x_i}{h} \Bigr). \end{equation*}\] Our aim is to implement this formula in R.

We use the triangular kernel \(K(x) = \max(1 - |x|, 0)\) from equation (1.2):

K <- function(x) pmax(1 - abs(x), 0)

The use of pmax() (the elementwise, or “parallel”, maximum) allows us to evaluate K() for several \(x\)-values at once:

K(c(-1, -0.5, 0, 0.5, 1))
## [1] 0.0 0.5 1.0 0.5 0.0
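
To see why we use pmax() rather than max() here, compare the two directly: max() collapses all of its arguments into a single number, while pmax() takes maxima elementwise.

max(1 - abs(c(-1, -0.5, 0, 0.5, 1)), 0)  # a single overall maximum
## [1] 1
pmax(1 - abs(c(-1, -0.5, 0, 0.5, 1)), 0) # elementwise maxima
## [1] 0.0 0.5 1.0 0.5 0.0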

With this in place, we can now easily compute the kernel density estimate. As mentioned in the previous section, the choice of kernel normalisation and bandwidth is not unique: rescaling the kernel and adjusting the bandwidth can give the same \(K_h\). The R function density() uses a specific convention: the bw parameter is defined to be the standard deviation of the smoothing kernel. Our triangular kernel from equation (1.2) has variance \(1/6\), so \(K_h\) has standard deviation \(h/\sqrt{6}\), and our bandwidth \(h\) corresponds to bw \(= h/\sqrt{6}\) in density(). Thus, to get output comparable to bw=4 (bottom left panel in figure 1.3), we use \(h = 4\sqrt{6}\):

h <- 4 * sqrt(6) # bandwidth
x <- 100 # the point at which to estimate f
mean(1/h * K((x - snowfall) / h))
## [1] 0.0108237
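
As a quick sanity check on the factor \(\sqrt{6}\), we can compute the variance of our kernel numerically; the exact value is \(\int_{-1}^1 x^2 K(x)\,dx = 1/6\), so the standard deviation of \(K\) is \(1/\sqrt{6}\):

integrate(function(x) x^2 * K(x), -1, 1)$value # variance of K
## [1] 0.1666667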

To plot the estimated density, we typically evaluate \(\hat f_h\) on a grid of \(x\)-values. Since \(x_1, \ldots, x_n\) denotes the input data, we use \(\tilde x_1, \ldots, \tilde x_m\) for the evaluation points:

x.tilde <- seq(25, 200, by = 1)
f.hat <- numeric(length(x.tilde)) # empty vector to store the results
for (j in seq_along(x.tilde)) {
  f.hat[j] <- mean(1/h * K((x.tilde[j] - snowfall) / h))
}
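
As an aside, the same computation can be written more compactly with sapply(); this is purely a change of notation rather than an optimisation, since sapply() still iterates element by element:

f.hat.alt <- sapply(x.tilde,
                    function(xj) mean(1/h * K((xj - snowfall) / h)))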

Figure 1.4 shows the result, which closely matches the bottom left panel of figure 1.3.

plot(x.tilde, f.hat, type = "l",
     xlab = "snowfall", ylab = "density",
     ylim = c(0, 0.025))

Figure 1.4: Manually computed kernel density estimate for the snowfall data, corresponding to bw=4 in density().

This code is slower than density() (as we will see below), but gives us full control over the computation.

Vectorisation

We now explore how to speed up our implementation. First, we encapsulate the code above into a function:

KDE1 <- function(x.tilde, x, K, h) {
  f.hat <- numeric(length(x.tilde))
  for (j in seq_along(x.tilde)) {
    f.hat[j] <- mean(1/h * K((x.tilde[j] - x) / h))
  }
  f.hat
}

A common source of inefficiency in R is the use of explicit loops. The for-loop in KDE1() iterates over the evaluation points: for each \(j\), the differences \(\tilde x_j - x_i\) for all \(i\) are computed in a single step. We have already avoided a second, inner loop over \(i\) by treating x.tilde[j] - x as a vector operation, by using pmax() in K(), and by using mean() for the sum. This approach of replacing explicit loops with operations on entire vectors is called vectorisation.
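
As a minimal illustration of the idea, separate from density estimation, consider a sum of squares: the explicit loop and the vectorised expression give the same result, but the latter pushes the iteration into fast compiled code.

y <- c(3, 1, 4, 1, 5)
total <- 0
for (yi in y) total <- total + yi^2 # explicit loop
total
## [1] 52
sum(y^2)                            # vectorised equivalent
## [1] 52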

To eliminate the loop over j, we use outer(), which applies a function to all pairs of elements from two vectors. The result is a matrix with rows corresponding to the first vector and columns to the second. For example:

x <- c(1, 2, 3)
y <- c(1, 2)
outer(x, y, "-")
##      [,1] [,2]
## [1,]    0   -1
## [2,]    1    0
## [3,]    2    1

We use outer(x.tilde, x, "-") to compute all \(\tilde x_j - x_i\) at once, giving an \(m \times n\) matrix to which we apply K(). This fully vectorised implementation avoids all explicit loops:

KDE2 <- function(x.tilde, x, K, h) {
  dx <- outer(x.tilde, x, "-")
  rowMeans(1/h * K(dx / h))
}
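
One caveat worth noting: KDE2() materialises the full \(m \times n\) matrix of differences, so this form of vectorisation trades memory for speed. For our small dataset this is harmless, and we can inspect the size of the intermediate object directly:

dx <- outer(x.tilde, snowfall, "-")
dim(dx) # m rows (evaluation points), n columns (observations)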

We can verify that both functions return the same result:

KDE1(c(50, 100, 150), snowfall, K, h)
## [1] 0.0078239091 0.0108236983 0.0007448277
KDE2(c(50, 100, 150), snowfall, K, h)
## [1] 0.0078239091 0.0108236983 0.0007448277
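
Instead of comparing a few values by eye, we can check agreement over the whole grid; all.equal() compares two numeric vectors up to a small tolerance:

all.equal(KDE1(x.tilde, snowfall, K, h),
          KDE2(x.tilde, snowfall, K, h))
## [1] TRUE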

There is another, minor optimisation we can make. Instead of dividing each of the \(mn\) elements of the matrix dx by \(h\), we can achieve the same effect by dividing the vectors x.tilde and x, i.e. only \(m + n\) numbers, by \(h\) before taking differences, since \((\tilde x_j - x_i)/h = \tilde x_j/h - x_i/h\). This reduces the number of division operations from \(mn\) to \(m + n\). Similarly, instead of multiplying the \(m \times n\) kernel values by \(1/h\), we can divide the \(m\) row means by \(h\):

KDE3 <- function(x.tilde, x, K, h) {
  dx <- outer(x.tilde / h, x / h, "-")
  rowMeans(K(dx)) / h
}

Again, KDE3() returns the same result:

KDE3(c(50, 100, 150), snowfall, K, h)
## [1] 0.0078239091 0.0108236983 0.0007448277

We measure execution times using the microbenchmark package:

library("microbenchmark")

microbenchmark(
  KDE1 = KDE1(x.tilde, snowfall, K, h),
  KDE2 = KDE2(x.tilde, snowfall, K, h),
  KDE3 = KDE3(x.tilde, snowfall, K, h),
  times = 1000)
## Unit: microseconds
##  expr      min       lq      mean   median       uq      max neval
##  KDE1 1436.107 1449.945 1489.5932 1456.361 1467.103 4007.381  1000
##  KDE2  192.987  194.504  244.4082  195.242  197.661 2429.496  1000
##  KDE3  173.061  174.578  212.9133  175.357  177.858 4469.205  1000

The output summarises execution times over 1000 runs. Replacing the loop with outer() in KDE2() reduces the median time from about 1456 to 195 microseconds, a speedup of more than a factor of seven; moving the divisions out of the matrix operations in KDE3() saves a further 10% or so.

Finally, we compare to density(), using arguments from, to, and n to specify the same grid \(\tilde x\):

KDE.builtin <- function(x.tilde, x, kernel_name, h) {
  density(x,
          kernel = kernel_name, bw = h / sqrt(6),
          from = min(x.tilde), to = max(x.tilde),
          n = length(x.tilde))$y
}
microbenchmark(
  density = KDE.builtin(x.tilde, snowfall, "triangular", h),
  times = 1000)
## Unit: microseconds
##     expr     min      lq     mean  median       uq      max neval
##  density 155.062 157.522 181.6117 169.125 178.1655 5803.345  1000

The result shows that density() is faster than KDE3(), but only by a factor of less than two. The reason density() can be faster is that it uses a more efficient algorithm, based on the Fast Fourier Transform (FFT), to compute an approximation to the kernel density estimate. By comparing the outputs of KDE3() and KDE.builtin() we can see that this approximation is very accurate:

f.hat1 <- KDE3(x.tilde, snowfall, K, h)
f.hat2 <- KDE.builtin(x.tilde, snowfall, "triangular", h)
max(abs(f.hat1 - f.hat2))
## [1] 1.935582e-05
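
We can also overlay the two estimates to confirm the agreement visually:

plot(x.tilde, f.hat1, type = "l",
     xlab = "snowfall", ylab = "density")
lines(x.tilde, f.hat2, lty = 2) # density() output, dashed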

Summary

  • Vectorisation is the technique of replacing explicit loops with operations on entire vectors.
  • Vectorised code can improve performance in R dramatically, as the benchmarks above show.
  • The outer() function enables fully vectorised implementations for pairwise operations.
  • Further speedups can be achieved by reducing the number of operations on large matrices.