# Section 19 Efficiency of Robust Estimators

In theoretical statistics, **efficiency** is a measure of quality of an
estimator. A more efficient estimator achieves smaller estimation error
for a given number of samples. Here we give an informal introduction
to the topic of efficiency and discuss the efficiency of different
methods for linear regression.

## 19.1 Efficiency

In the section about the Mean Squared Error we have seen that \[\begin{equation*} \mathop{\mathrm{MSE}}\nolimits\bigl( \hat\theta \bigr) = \mathop{\mathrm{Var}}\bigl( \hat\theta \bigr) + \mathop{\mathrm{bias}}\bigl( \hat\theta \bigr)^2 \end{equation*}\] can be used as a measure for the estimation error of an estimator \(\hat\theta\), and if the estimator is unbiased, this expression simplifies to \[\begin{equation*} \mathop{\mathrm{MSE}}\nolimits\bigl( \hat\theta \bigr) = \mathop{\mathrm{Var}}\bigl( \hat\theta \bigr). \end{equation*}\] Thus, “good” estimators will have small variance. Instead of considering the full definition from theoretical statistics (which requires the concept of Fisher information), here we simply consider \[\begin{equation*} \mathrm{efficiency}(\hat\theta) = \frac{1}{\mathop{\mathrm{Var}}(\hat\theta)} \end{equation*}\] as a measure for the efficiency of an estimator. We will use this measure of efficiency to compare different estimators.

**Example 19.1** Both the sample mean and the sample median can be used as estimators
for the population mean. Here we show that the median has lower efficiency
than the mean.

Suppose \(X_1,\ldots, X_n\) are i.i.d. with CDF \(F(x)\), *i.e.*
\(P(X_i \leq x) = F(x)\). As before, let \(X_{(1)} \leq \cdots \leq X_{(n)}\) be the samples arranged in order of increasing value.
Then the distribution of \(X_{(k)}\) has CDF
\[\begin{equation*}
P(X_{(k)} \leq x)
= \sum_{i=k}^n {n\choose i} F(x)^i \bigl(1-F(x)\bigr)^{n-i},
\end{equation*}\]
since \(X_{(k)} \leq x\) requires at least \(k\) out of \(n\) samples to be
less than or equal to \(x\).
If we differentiate this (using product rule and chain rule) we get
the density of \(X_{(k)}\):
\[\begin{align*}
f_{(k)}(x)
&= \frac{d}{dx} P(X_{(k)} \leq x) \\
&= \cdots \\
&= f(x)\frac{n!}{(n-k)! (k-1)!} F(x)^{k-1}\bigl(1-F(x)\bigr)^{n-k}.
\end{align*}\]
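
As a numerical sanity check of this formula, the following sketch compares the integral of the density with the binomial-sum form of the CDF. The values `n = 7`, `k = 3` and the evaluation point `x0` are arbitrary choices for illustration, and standard normal samples are assumed.

```r
# Check the order-statistic density against the CDF formula.
# Illustrative assumptions: standard normal samples, n = 7, k = 3.
n <- 7
k <- 3

# CDF of X_(k): probability that at least k of the n samples are <= x
cdf_k <- function(x) sum(dbinom(k:n, size = n, prob = pnorm(x)))

# density of X_(k), as given by the formula above
dens_k <- function(x) {
  dnorm(x) * factorial(n) / (factorial(n - k) * factorial(k - 1)) *
    pnorm(x)^(k - 1) * (1 - pnorm(x))^(n - k)
}

x0 <- 0.3
cdf_from_density <- integrate(dens_k, lower = -Inf, upper = x0)$value
cdf_from_sum <- cdf_k(x0)
print(c(cdf_from_density, cdf_from_sum))  # the two values should agree closely
```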

For simplicity we assume that \(n\) is odd, say \(n = 2m+1\). In this case the median is \(X_\mathrm{median} = X_{(m+1)}\). Thus, the density of the median is \[\begin{align*} f_\mathrm{median}(x) &= f_{(m+1)}(x) \\ &= f(x)\frac{n!}{(n-m-1)! m!} F(x)^m\bigl(1-F(x)\bigr)^{n-m-1} \\ &= f(x)\frac{n!}{m! m!} F(x)^m\bigl(1-F(x)\bigr)^m. \end{align*}\] This density can be used to understand the behaviour of the median.
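
To illustrate how this density can be used, the following sketch computes the variance of the median by numerical integration. It assumes standard normal samples and \(n = 21\), both chosen only for illustration. The code uses the fact that \(\frac{n!}{m!\,m!} u^m (1-u)^m\) is the density of the \(\mathrm{Beta}(m+1, m+1)\) distribution, which avoids evaluating large factorials.

```r
# Variance of the sample median from its density, by numerical integration.
# Illustrative assumptions: X_i ~ N(0, 1) and n = 2m + 1 = 21.
m <- 10
n <- 2 * m + 1

# f_median(x) = f(x) * n!/(m! m!) * F(x)^m * (1 - F(x))^m;
# the last three factors equal the Beta(m+1, m+1) density evaluated at F(x)
dens_median <- function(x) dnorm(x) * dbeta(pnorm(x), m + 1, m + 1)

# by symmetry the median has mean 0, so its variance is E(X_median^2);
# the density is negligible outside [-6, 6]
var_median <- integrate(function(x) x^2 * dens_median(x), -6, 6)$value
var_mean <- 1 / n  # variance of the sample mean for sigma = 1

print(c(var_median = var_median, var_mean = var_mean))
```

For this sample size the variance of the median should come out larger than the variance of the mean.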

Now assume that \(X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2)\), *i.e.*
the density is
\[\begin{equation*}
f(x)
= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Bigl( - \frac{(x-\mu)^2}{2\sigma^2} \Bigr)
\end{equation*}\]
and the CDF is
\[\begin{equation*}
F(x) = \Phi\Bigl( \frac{x - \mu}{\sigma} \Bigr),
\end{equation*}\]
where \(\Phi\) is the CDF of the standard normal distribution. In this case
we know that
\[\begin{equation*}
\mathop{\mathrm{Var}}\bigl( \overline X \bigr)
= \frac{\sigma^2}{n}.
\end{equation*}\]
For comparison, we can use the density of \(X_\mathrm{median}\) to show
(in a series of complicated steps) that
\[\begin{equation*}
\mathop{\mathrm{Var}}\bigl( X_\mathrm{median} \bigr)
\sim \frac{\pi}{2} \frac{\sigma^2}{n},
\end{equation*}\]
as \(n\to\infty\). Thus, for large \(n\), the relative efficiency
of the median compared to the mean is
\[\begin{align*}
\frac{\mathrm{efficiency}(X_\mathrm{median})}{\mathrm{efficiency}(\overline{X})}
&= \frac{\mathop{\mathrm{Var}}\bigl( \overline X \bigr)}{\mathop{\mathrm{Var}}\bigl( X_\mathrm{median} \bigr)}
= \frac{\frac{\sigma^2}{n}}{\frac{\pi}{2} \frac{\sigma^2}{n}} \\
&= \frac{2}{\pi}
\approx 0.637.
\end{align*}\]
Thus, the sample median is a less efficient estimator for the mean than
the sample mean is.
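
The asymptotic value \(2/\pi\) can also be checked by simulation. The following sketch estimates the variances of the sample mean and the sample median from repeated samples. The sample size, the number of repetitions and the seed are arbitrary choices, and the estimated ratio is only expected to be close to \(2/\pi\), not equal to it.

```r
# Monte Carlo estimate of the relative efficiency of the median vs. the mean.
# Illustrative assumptions: N(0, 1) data, n = 101, 10000 repetitions.
set.seed(1)
n <- 101
N <- 10000

# each column of X is one sample of size n
X <- matrix(rnorm(n * N), nrow = n)
means <- colMeans(X)
medians <- apply(X, 2, median)

# relative efficiency = Var(mean) / Var(median)
rel.eff <- var(means) / var(medians)
print(c(estimated = rel.eff, asymptotic = 2 / pi))
```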

The situation in the example carries over to linear regression: more robust estimators tend to be less efficient, and there is a trade-off between robustness and efficiency. Without proof, here we list some examples of this principle:

- One can show that the variance of the least squares estimator usually decreases proportionally to \(1 / n\), and thus the efficiency of the least squares estimator increases proportionally to \(n\).
- The value \(t\) in Huber’s method can be used to control the balance between robustness and efficiency. As \(t\to\infty\), the method converges to least squares regression and becomes less robust, but at the same time efficiency increases. Huber suggests the value \(t = 1.345\sigma\) and showed that for this choice the method is \(95\%\) efficient for large \(n\), compared to least squares regression.
- Similarly, the parameter \(a\) in Tukey’s Bisquare Method controls the trade-off between robustness and efficiency, with smaller \(a\) leading to a more robust method. In the literature it is suggested that \(a = 4.685 \sigma\) leads to \(95\%\) efficiency for large \(n\), compared to least squares regression.
- The efficiency of the least absolute values estimator can be shown to be approximately \(64\%\) for large \(n\), compared to least squares regression with normally distributed errors.
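
The \(95\%\) figure for Huber’s method can be explored in a small simulation. The sketch below uses `rlm()` from the `MASS` package, which by default estimates the scale \(\sigma\) from the data and takes the tuning constant in units of that estimate (argument `k` of `psi.huber`). The simulation setup (sample size, number of repetitions, true coefficients, seed) is an assumption made for illustration, and the result is a noisy finite-sample estimate rather than the asymptotic value.

```r
library("MASS")  # for rlm() and psi.huber

# Illustrative setup: simple linear regression with normal errors
set.seed(1)
n <- 100
N <- 500
x <- runif(n, 0, 10)

# for each simulated data set, record the slope estimates of both methods
slopes <- replicate(N, {
  y <- 1 + 2 * x + rnorm(n)
  c(lsq = unname(coef(lm(y ~ x))[2]),
    huber = unname(coef(rlm(y ~ x, psi = psi.huber, k = 1.345))[2]))
})

# relative efficiency of the Huber slope estimate compared to least squares
rel.eff <- var(slopes["lsq", ]) / var(slopes["huber", ])
print(rel.eff)
```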

## 19.2 Robust estimators

Since the M-estimator only addresses \(y\)-space outliers, the breakdown point of M-estimation is still \(1/n\). We now consider two estimators which have a higher breakdown point, but lower efficiency.

### 19.2.1 Least Median of Squares

Rather than minimising the residual sum of squares (or a weighted version of it), this estimator minimises the median of the squared residuals. The estimate is given by \[\begin{equation*} \hat\beta^\mathrm{(LMS)} := \mathop{\mathrm{arg\,min}}\limits_\beta \mathop{\mathrm{median}}_{i\in\{1,\ldots,n\}} \bigl( y_i- x_i^T \beta \bigr)^2, \end{equation*}\] where \(x_1, \ldots, x_n\) are the rows of the design matrix \(X\). This is very robust with respect to outliers, both in the \(x\)-direction and in the \(y\)-direction: since the median is used instead of the sum, up to half of the squared residuals can increase to infinity while the estimate stays bounded. Thus the asymptotic breakdown point of the method is \(1/2\).
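
To make the objective function concrete, the following sketch evaluates the median of squared residuals for a simple linear model with intercept. The function name `lms_objective` and the simulated data are assumptions made for illustration. The example shows that moving the outliers further away does not change the objective at a line which fits the majority of the data.

```r
# Objective function of LMS for the model y = beta[1] + beta[2] * x
# (hypothetical helper, not part of any package)
lms_objective <- function(beta, x, y) {
  median((y - beta[1] - beta[2] * x)^2)
}

# simulated data with 30% outliers in y-direction (illustrative choice)
set.seed(1)
n <- 50
x <- runif(n, 0, 10)
y <- x + rnorm(n)
idx <- sample.int(n, floor(0.3 * n))
y[idx] <- y[idx] - 25

beta.true <- c(0, 1)         # line used to generate the clean data
beta.lsq <- coef(lm(y ~ x))  # least squares fit, affected by the outliers

obj.true <- lms_objective(beta.true, x, y)
obj.lsq <- lms_objective(beta.lsq, x, y)

# moving the outliers even further away leaves the objective unchanged
y.far <- y
y.far[idx] <- y.far[idx] - 1000
obj.far <- lms_objective(beta.true, x, y.far)

print(c(true = obj.true, lsq = obj.lsq, far = obj.far))
```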

The least median of squares method has poor asymptotic efficiency: its efficiency only increases proportionally to \(n^{2/3}\), whereas the efficiency of the least squares estimator increases proportionally to \(n\). In the limit, the relative efficiency compared to ordinary least squares is \[\begin{equation*} \frac{\mathrm{efficiency}(\hat\beta^\mathrm{(LMS)})}{\mathrm{efficiency}(\hat\beta^\mathrm{(LSQ)})} \propto \frac{n^{2/3}}{n} \longrightarrow 0 \end{equation*}\] as \(n\to\infty\).

In R, the LMS estimate can be computed using the function
`lqs(..., method = "lms")` from the `MASS` package.

**Example 19.2** The following code computes an LMS regression estimate. We
introduce artificial outliers by shifting \(30\%\) of the data
to the bottom right. These artificial outliers are represented
by the red circles in the plot.

```
library("MASS") # for lqs()
set.seed(20211207)

n <- 50
x <- runif(n, 0, 10)
y <- x + rnorm(n)

# add 30% outliers
n.ol <- floor(0.3*n)
idx <- sample.int(n, n.ol)
x[idx] <- x[idx] + 15
y[idx] <- y[idx] - 10

m <- lqs(y~x, method="lms")
plot(x, y, col = ifelse(1:n %in% idx, "red", "black"))
abline(m)
```

### 19.2.2 Least Trimmed Squares

This method takes as its objective function the sum of the \(k\) smallest squared residuals and was proposed as a remedy to the low asymptotic efficiency of LMS. The least trimmed squares estimator \(\hat\beta^\mathrm{(LTS)}\) is defined as \[\begin{equation*} \hat\beta^\mathrm{(LTS)} := \mathop{\mathrm{arg\,min}}\limits_\beta \sum_{i=1}^k r_{[i]}^2(\beta), \end{equation*}\] where \(r_{[i]}^2(\beta)\) represents the \(i\)th smallest value amongst \(r_i(\beta)^2 = \bigl( y_i- x_i^T\beta \bigr)^2\).

The value \(k\) controls the trade-off between robustness and efficiency and must satisfy \(n/2 < k \leq n\). For a given \(k\) the method can tolerate \(n-k\) outliers. The boundary case \(k=n\) corresponds to the ordinary least squares method. The breakdown point of the method is \((n - k +1)/n\).
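
The LTS objective can be written down in a few lines. In the sketch below, the function name `lts_objective`, the simulated data and the choice of `k` are assumptions for illustration. The code only evaluates the objective for a given \(\beta\) and does not perform the minimisation.

```r
# Objective function of LTS for the model y = beta[1] + beta[2] * x:
# the sum of the k smallest squared residuals
# (hypothetical helper, not part of any package)
lts_objective <- function(beta, x, y, k) {
  r2 <- (y - beta[1] - beta[2] * x)^2
  sum(sort(r2)[1:k])
}

set.seed(1)
n <- 50
x <- runif(n, 0, 10)
y <- x + rnorm(n)
beta <- c(0, 1)

k <- floor(n / 2) + 1                      # smallest admissible k
obj.trimmed <- lts_objective(beta, x, y, k)
obj.full <- lts_objective(beta, x, y, n)   # k = n: all residuals are used
rss <- sum((y - beta[1] - beta[2] * x)^2)  # residual sum of squares

print(c(trimmed = obj.trimmed, full = obj.full, rss = rss))
```

For \(k = n\) the objective coincides with the residual sum of squares, in line with the boundary case mentioned above.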

Computing the LTS estimate is a non-trivial problem, which involves fitting the
least-squares estimate to a carefully chosen subset of the samples. In R, the
LTS estimate can be computed using the function `lqs(..., method = "lts")`
from the `MASS` package.

**Example 19.3** The following code computes an LTS regression estimate. The
data is the same as in the previous example.

```
m <- lqs(y~x, method = "lts", quantile = floor(n/2)+1)
plot(x, y, col = ifelse(1:n %in% idx, "red", "black"))
abline(m)
```

One can show that LTS has an efficiency of approximately \(0.08\) compared to ordinary least squares regression.

**Summary**

- In this section we have informally discussed the efficiency of different estimators.
- We have introduced the Least Median of Squares estimator.
- We have introduced the Least Trimmed Squares estimator.