Efficiency

Introduction

Example 1 Suppose \(X_1, \ldots, X_n\) are i.i.d. from a Normal distribution \(N(\mu, \sigma^2)\) with \(\sigma^2 = 1\); that is, the mean \(\mu\) is unknown, but the variance is known. Our goal is to estimate \(\mu\) using one of two estimators: the sample mean or the sample median. Which one should we use? Let’s look at their expected values and variances.

Sample mean:

\[ \hat{\mu} = \bar{X}_n, \quad \mathbb{E}(\hat{\mu}) = \mu, \quad \text{Var}(\hat{\mu}) = \frac{\sigma^2}{n} = \frac{1}{n} \quad (\text{since } \sigma^2 = 1 \text{ is given}) \]

Sample median: Denote the sample median by \(\tilde{\mu}\).

Suppose we know that: \[ \mathbb{E}(\tilde{\mu}) = \mu, \quad \text{Var}(\tilde{\mu}) \approx \frac{\pi}{2n}\;\text{ for large } n. \]

We can compare their mean squared errors, which is a natural way to decide which estimator we would prefer. Recall that the MSE of an estimator is the sum of its variance and the square of its bias. Since both estimators are unbiased, their mean squared errors equal their variances:

\[ \text{MSE}(\hat{\mu}) = 0 + \frac{1}{n}, \] and, for large \(n\), \[ \text{MSE}(\tilde{\mu}) \approx 0 + \frac{\pi}{2n}. \]

\[ \frac{\pi}{2} > 1 \Rightarrow \text{MSE}(\tilde{\mu}) > \text{MSE}(\hat{\mu}) \]

We see that though they are both unbiased, the sample median has a larger variance than the sample mean, and the sample mean is therefore the preferred estimator of the true mean.
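
The comparison in Example 1 can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original notes: the function name `simulate_mean_vs_median`, the sample size `n = 101`, the number of replications, and the seed are all illustrative assumptions. It draws repeated samples from \(N(\mu, 1)\) and compares the empirical variances of the sample mean and the sample median with \(1/n\) and \(\pi/(2n)\).

```python
import math
import random
import statistics


def simulate_mean_vs_median(n=101, reps=5000, mu=0.0, seed=0):
    """Return Monte Carlo variances of the sample mean and the sample median.

    Each replication draws n observations from N(mu, 1). All default
    values here are illustrative choices, not values from the notes.
    """
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [rng.gauss(mu, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.pvariance(means), statistics.pvariance(medians)


n = 101
var_mean, var_median = simulate_mean_vs_median(n=n)
print(f"Var(mean)   approx {var_mean:.5f}, theory 1/n      = {1 / n:.5f}")
print(f"Var(median) approx {var_median:.5f}, theory pi/(2n) = {math.pi / (2 * n):.5f}")
print(f"ratio Var(median)/Var(mean) approx {var_median / var_mean:.3f}")
```

The ratio should come out close to \(\pi/2 \approx 1.57\), up to simulation noise and the fact that \(\pi/(2n)\) is only a large-sample approximation.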

Relative Efficiency of Two Estimators

In general, even if our estimators are not unbiased, a smaller variance is desirable, since it leads to greater precision (narrower confidence intervals). We will always use consistent estimators, which means that for large samples they will be very close to the true parameter value with high probability.

To quantify this idea of “better” estimators, we will define the Relative Efficiency of two estimators: Suppose we want to compare a pair of estimators of a parameter \(\theta\), denoted by \(\hat{\theta}\), \(\tilde{\theta}\).

For a sample of size \(n\), we know that:

\[ \text{MSE}(\hat{\theta}_n) = \left(\text{Bias}(\hat{\theta}_n)\right)^2 + \text{Var}(\hat{\theta}_n) \]

\[ \text{MSE}(\tilde{\theta}_n) = \left(\text{Bias}(\tilde{\theta}_n)\right)^2 + \text{Var}(\tilde{\theta}_n) \]

Definition 1 The efficiency of \(\tilde{\theta}\) relative to \(\hat{\theta}\) is given by:

\[ \text{eff}(\tilde{\theta}_n, \hat{\theta}_n) = \frac{\text{MSE}(\hat{\theta}_n)}{\text{MSE}(\tilde{\theta}_n)} \]

This quantity is called Relative Efficiency.

What does it mean if \(\text{eff}(\tilde{\theta}_n, \hat{\theta}_n) < 1\)? In that case, \(\text{MSE}(\tilde{\theta}_n) > \text{MSE}(\hat{\theta}_n)\), so \(\hat{\theta}_n\) is the preferred estimator.

Remark 1. If \(\hat{\theta}_n\), \(\tilde{\theta}_n\) are unbiased, then: \[ \text{eff}(\tilde{\theta}_n, \hat{\theta}_n) = \frac{\text{Var}(\hat{\theta}_n)}{\text{Var}(\tilde{\theta}_n)} \]
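
To make Definition 1 concrete, here is a small sketch that estimates the MSEs by simulation and forms their ratio. Everything in it is an assumption made for illustration: the helper names `mse`, `bias_sq_plus_var`, and `relative_efficiency`, the normal model, and in particular the shrunken estimator \(0.9\,\bar{X}_n\), which is a hypothetical biased competitor that does not appear in the notes.

```python
import random
import statistics


def mse(estimates, theta):
    """Empirical mean squared error of a list of estimates around theta."""
    return statistics.fmean([(e - theta) ** 2 for e in estimates])


def bias_sq_plus_var(estimates, theta):
    """Empirical (bias)^2 + variance, which should equal the empirical MSE."""
    bias = statistics.fmean(estimates) - theta
    return bias ** 2 + statistics.pvariance(estimates)


def relative_efficiency(tilde_estimates, hat_estimates, theta):
    """eff(tilde, hat) = MSE(hat) / MSE(tilde), as in Definition 1."""
    return mse(hat_estimates, theta) / mse(tilde_estimates, theta)


rng = random.Random(1)
theta, n, reps = 1.0, 10, 5000  # illustrative values
hat, tilde = [], []
for _ in range(reps):
    xbar = statistics.fmean([rng.gauss(theta, 1.0) for _ in range(n)])
    hat.append(xbar)          # sample mean (unbiased)
    tilde.append(0.9 * xbar)  # hypothetical shrunken estimator (biased)

eff = relative_efficiency(tilde, hat, theta)
print(f"MSE(hat)   approx {mse(hat, theta):.4f}")
print(f"MSE(tilde) approx {mse(tilde, theta):.4f}")
print(f"eff(tilde, hat) approx {eff:.3f}")
```

For these particular values the biased estimator has the slightly smaller MSE, so the ratio comes out above 1. This is why the definition compares MSEs rather than variances alone when bias is present.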

Definition 2 If we don’t know the exact variances, we can use the asymptotic variances in the ratio instead, and we call the result the Asymptotic Relative Efficiency (A.R.E.). Note that since our estimators are asymptotically unbiased, for large samples we only need to look at the ratio of the asymptotic variances, not the MSEs.

Suppose we have that:

\[ \sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{D} N(0, \sigma_1^2), \] and also: \[ \sqrt{n}(\tilde{\theta}_n - \theta) \xrightarrow{D} N(0, \sigma_2^2). \] Then we define the Asymptotic Relative Efficiency as: \[ \text{ARE}(\tilde{\theta}, \hat{\theta}) = \frac{\sigma_1^2}{\sigma_2^2} \]

Remark 2. In Example 1, with \(\hat{\theta}\) the sample mean and \(\tilde{\theta}\) the sample median, \[ \text{Var}(\hat{\theta}) = \frac{1}{n}, \quad \text{Var}(\tilde{\theta}) \approx \frac{\pi}{2n}. \]

\[ \Rightarrow \text{A.R.E. of } \tilde{\theta} \text{ to } \hat{\theta} = \text{A.R.E.}(\tilde{\theta}, \hat{\theta}) = \frac{1/n}{\pi/(2n)} = \frac{2}{\pi} \approx 0.64 \]
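
One way to read this number: when both asymptotic variances are proportional to \(1/n\), equating \(\sigma_2^2 / n_2 = \sigma_1^2 / n_1\) gives \(n_2 = n_1 / \text{ARE}\). The sketch below turns that into a sample-size calculation; the function name `required_sample_size` and the reference size of 100 are illustrative assumptions.

```python
import math


def required_sample_size(n_reference, are):
    """Sample size the less efficient estimator needs so that its asymptotic
    variance matches that of the reference estimator at n_reference.

    Assumes both asymptotic variances are proportional to 1/n.
    """
    return math.ceil(n_reference / are)


are_median_vs_mean = 2 / math.pi
print(round(are_median_vs_mean, 3))                    # 0.637
print(required_sample_size(100, are_median_vs_mean))   # 158
```

So, for normal data, the median needs roughly 158 observations to match the precision the mean achieves with 100.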

Example 2 Suppose \(X_1, \ldots, X_n \overset{\text{IID}}{\sim} \text{Exp}\!\left(\tfrac{1}{\theta}\right)\) distribution. Then we know that

\[ \mathbb{E}X = \theta, \quad f(x \mid \theta) = \frac{1}{\theta} e^{-x/\theta}, \text{ and } F(x) = 1 - e^{-x/\theta}, \quad \theta > 0,\ x > 0 \]

Show that \(\tilde{\theta} = n X_{(1)} = n \cdot \min(X_1, \ldots, X_n)\) is an unbiased estimator of \(\theta\). (Hint: show \(X_{(1)} \sim \text{Exp}\!\left(\tfrac{n}{\theta}\right)\).)
An alternate estimator of \(\theta\) is \(\bar{X}_n\). Call this \(\hat{\theta}\). Show that \(\hat{\theta}\) is unbiased.
What is the relative efficiency of \(\tilde{\theta}\) to \(\hat{\theta}\)?

Check your solution!

Let \(\hat{\theta}\) be the sample mean, so \(\hat{\theta}_n = \bar{X}_n\). We are given that \(\tilde{\theta}\) is \(n\) times the minimum, so \(\tilde{\theta}_n = n \cdot X_{(1)}\). To find the distribution of \(X_{(1)}\), write its CDF as \(F_{X_{(1)}}(x) = P(X_{(1)} \leq x) = 1 - P(X_{(1)} > x)\).

First, notice that \(P(X_{(1)} > x) = P(X_1 > x,\ X_2 > x,\ \ldots,\ X_n > x)\), since the minimum exceeds \(x\) exactly when every observation does. By independence, for \(x > 0\): \[ \begin{align*} \Rightarrow 1-F_{X_{(1)}}(x) &= \prod_{i=1}^{n} P(X_i > x) \\ &= \prod_{i=1}^{n} e^{-x/\theta} = e^{-nx/\theta}\\ \Rightarrow P(X_{(1)} \leq x) &= 1 - e^{-nx/\theta} \end{align*} \]

This implies that \(X_{(1)} \sim \text{Exp}\!\left(\frac{n}{\theta}\right) \Rightarrow \mathbb{E}(X_{(1)}) = \frac{\theta}{n}\). Therefore, we get that: \[ \Rightarrow \mathbb{E}(n X_{(1)}) = n \cdot \frac{\theta}{n} = \theta \Rightarrow \tilde{\theta}_n = n X_{(1)} \text{ is unbiased.} \]

Next, let’s consider the sample mean. We will do the same computation we have done many times before to see that it is unbiased: \[ \mathbb{E}(\hat{\theta}_n) = \mathbb{E}(\bar{X}_n) = \mathbb{E}\!\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n}\sum_{i=1}^n \mathbb{E}X_i = \theta \]

Now let’s write down the efficiency of \(\tilde{\theta}\) relative to \(\hat{\theta}\). Since both estimators are unbiased, it is the ratio of variances: \[ \text{eff}(\tilde{\theta}, \hat{\theta}) = \frac{\text{Var}(\hat{\theta})}{\text{Var}(\tilde{\theta})} \]

We see that \[ \text{Var}(\tilde{\theta}) = \text{Var}(n X_{(1)}) = n^2 \text{Var}(X_{(1)}), \] and \[ \begin{align*} X_{(1)} \sim \text{Exp}\!\left(\frac{n}{\theta}\right) &\Rightarrow \text{Var}(X_{(1)}) = \frac{\theta^2}{n^2}\\ &\Rightarrow \text{Var}(n X_{(1)}) = n^2 \cdot \frac{\theta^2}{n^2} = \theta^2. \end{align*} \]

We also know that: \[ \text{Var}(\hat{\theta}) = \text{Var}(\bar{X}_n) = \frac{\text{Var}(X)}{n} = \frac{\theta^2}{n} \] Thus we get that \[ \text{eff}(\tilde{\theta}, \hat{\theta}) = \frac{\theta^2/n}{\theta^2} = \frac{1}{n}. \]

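Thus, \(\hat{\theta}\) is more efficient than \(\tilde{\theta}\) for every \(n > 1\), and the gap widens as \(n\) grows. Note that \(\text{Var}(\tilde{\theta}_n) = \theta^2\) does not depend on \(n\) at all, so \(\tilde{\theta}\) does not become more precise with more data, whereas \(\text{Var}(\hat{\theta}_n) = \theta^2/n\) shrinks to zero. The sample mean based on \(n\) observations is as precise as \(n\) independent copies of \(\tilde{\theta}\) averaged together.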
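
The results of Example 2 can also be checked by simulation. This is a rough sketch under assumed values (\(\theta = 2\), \(n = 20\), 20000 replications, a fixed seed); the function name `simulate_exponential_estimators` is hypothetical. Note that Python's `random.expovariate` takes the rate, so a mean of \(\theta\) corresponds to rate \(1/\theta\).

```python
import random
import statistics


def simulate_exponential_estimators(theta=2.0, n=20, reps=20000, seed=0):
    """Simulate hat(theta) = sample mean and tilde(theta) = n * min
    for samples of size n from an exponential distribution with mean theta."""
    rng = random.Random(seed)
    hat, tilde = [], []
    for _ in range(reps):
        # expovariate expects the rate parameter, i.e. 1 / mean
        sample = [rng.expovariate(1.0 / theta) for _ in range(n)]
        hat.append(statistics.fmean(sample))
        tilde.append(n * min(sample))
    return hat, tilde


theta, n = 2.0, 20
hat, tilde = simulate_exponential_estimators(theta=theta, n=n)

# Both averages should be near theta (unbiasedness).
print(f"mean of hat   approx {statistics.fmean(hat):.3f}")
print(f"mean of tilde approx {statistics.fmean(tilde):.3f}")

# Ratio of variances should be near 1/n.
eff = statistics.pvariance(hat) / statistics.pvariance(tilde)
print(f"eff(tilde, hat) approx {eff:.4f}, theory 1/n = {1 / n:.4f}")
```

Rerunning with a larger `n` should leave the variance of `tilde` near \(\theta^2\) while the variance of `hat` keeps shrinking.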

The Cramér-Rao Inequality

The question that now naturally arises: is some other estimator more efficient than either of the two we are considering?

Relative efficiency tells us how the sample sizes must compare for two (unbiased) estimators to have the same variance, provided their variances are \(\propto \frac{1}{n}\) (which is true in many important cases). However, it only tells us about the two estimators we are considering. Can we tell whether we have reached the most efficient estimator?

The answer is yes! There is a famous inequality that gives us a lower bound on the variance of any unbiased estimator. If an unbiased estimator achieves this lower bound, no other unbiased estimator can have smaller variance.

Let’s state the inequality for unbiased estimators. This inequality gives us what is called the “Cramér-Rao lower bound for an unbiased estimator”:

Theorem 1 Let \(X_1, X_2, \ldots, X_n\) be IID random variables with density function \(f(x \mid \theta_0)\), where \(\theta_0\) is the true parameter value. Suppose \(\hat{\theta}_n\) is an unbiased estimator of \(\theta_0\). Then, under smoothness assumptions on \(f\), we have that: \[ \text{Var}(\hat{\theta}_n) \geq \frac{1}{I_n(\theta_0)} = \frac{1}{n I(\theta_0)} \] where \(I(\theta_0)\) denotes the Fisher information of a single observation and \(I_n(\theta_0) = n I(\theta_0)\) that of the whole sample.
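
As an illustration of the bound, consider the exponential model of Example 2. The sketch below assumes the standard definition of Fisher information, \(I(\theta) = \mathbb{E}\!\left[\left(\frac{\partial}{\partial\theta} \log f(X \mid \theta)\right)^2\right]\), which is not stated in the text above, and estimates it by Monte Carlo. For \(f(x \mid \theta) = \frac{1}{\theta} e^{-x/\theta}\) the score is \(-\frac{1}{\theta} + \frac{x}{\theta^2}\), which gives \(I(\theta) = 1/\theta^2\). The function names, \(\theta = 2\), \(n = 20\), and the number of draws are illustrative assumptions.

```python
import random
import statistics


def score(x, theta):
    """Derivative in theta of log f(x | theta) for f = (1/theta) exp(-x/theta)."""
    return -1.0 / theta + x / theta ** 2


def fisher_information_mc(theta, draws=100000, seed=0):
    """Monte Carlo estimate of I(theta) = E[score(X, theta)^2],
    with X exponential with mean theta (assumed standard definition)."""
    rng = random.Random(seed)
    return statistics.fmean(
        [score(rng.expovariate(1.0 / theta), theta) ** 2 for _ in range(draws)]
    )


theta, n = 2.0, 20
info = fisher_information_mc(theta)
cr_bound = 1.0 / (n * info)

print(f"I(theta) approx {info:.4f}, exact 1/theta^2 = {1 / theta ** 2:.4f}")
print(f"Cramer-Rao bound approx {cr_bound:.4f}")
print(f"Var(sample mean) = theta^2/n = {theta ** 2 / n:.4f}")
```

Up to simulation noise, the bound matches \(\theta^2/n\), the variance of \(\bar{X}_n\) computed in Example 2, which suggests that the sample mean attains the bound in this model.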

References

Chihara, Laura M., and Tim C. Hesterberg. 2018. Mathematical Statistics with Resampling and R. Hoboken, NJ: John Wiley & Sons.

Hogg, Robert V., Joseph W. McKean, and Allen T. Craig. 2005. Introduction to Mathematical Statistics. 6th ed. Upper Saddle River, NJ: Pearson Prentice Hall.

Kerns, Jonny. 2017. “Probability Concepts Explained: Maximum Likelihood Estimation.” Medium. 2017. https://medium.com/data-science/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1.

Pimentel, Sam. 2024. “STAT 135 Lecture Slides.” Lecture slides (shared privately).

Rice, John A. 2006. Mathematical Statistics and Data Analysis. 3rd ed. Duxbury Press.