Mean Squared Error and Consistency

Introduction

Now we will look at the properties of maximum likelihood estimation when we have large samples. We want to consider the properties of the estimator: Is it unbiased? Consistent? What can we say about the distribution of the estimator when the sample size is large?

The two main results that we will establish in this chapter and the next are that the MLE is consistent, and that the MLE is asymptotically normal. The variance of the MLE's asymptotic distribution is related to a quantity called the Fisher information. First, we will establish that the MLE is consistent.

Consistency of the MLE of \(\theta\)

As we might guess, the maximum likelihood estimator from an IID sample with PDF \(f(x \mid \theta)\) is consistent for \(\theta\) (under appropriate smoothness conditions on \(f\): roughly, we can take as many continuous derivatives of \(f\) as we need, and \(f\) has no sharp corners, etc.).

In the following, we will assume \(f\) is smooth. This will ensure that it is “well-behaved” and we can interchange the orders of integration and differentiation.

In our discussion below, \(\theta_0\) is the true unknown value of the parameter \(\theta\), and \(f(x \mid \theta_0)\) is the true PDF (\(\theta_0\) is some fixed, but unknown value).

Recall that \(X_1, \dots, X_n \overset{\text{IID}}{\sim} f(x \mid \theta)\) is a random sample, and \[ \operatorname{lik}(\theta) = \operatorname{lik}(\theta \mid X_1, \dots, X_n) = \prod_{i=1}^n f(X_i \mid \theta) \]

Define \(\ell(\theta) = \log \operatorname{lik}(\theta)\). Then, \[ \frac{1}{n} \ell(\theta) =\frac{1}{n} \log \operatorname{lik}(\theta) = \frac{1}{n} \sum_{i=1}^n \log f(X_i\mid \theta). \]
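To make this concrete, here is a minimal numerical sketch, assuming (purely for illustration; the model is not fixed by the discussion above) a \(\mathcal{N}(\theta, 1)\) model. It evaluates the average log-likelihood \(\frac{1}{n}\ell(\theta)\) on a grid of \(\theta\) values and takes the maximizer as a crude MLE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumption: the data come from a Normal(theta, 1) model.
theta_0 = 2.0                                  # true (unknown in practice) parameter
x = rng.normal(theta_0, 1.0, size=500)

def avg_log_lik(theta, x):
    """(1/n) * sum_i log f(x_i | theta) for the Normal(theta, 1) density."""
    return np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2)

# Evaluate on a grid and take the maximizer as a crude MLE.
grid = np.linspace(0, 4, 2001)
values = np.array([avg_log_lik(t, x) for t in grid])
theta_hat = grid[np.argmax(values)]
print(theta_hat)   # close to theta_0 = 2.0 (for this model the MLE is the sample mean)
```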

Claim: If MSE of an estimator goes to 0, the estimator must be consistent

Recall the following definitions for an estimator \(\hat \theta_n\) of the parameter \(\theta\).

\[\text{Bias}(\hat{\theta}_n) = E(\hat{\theta}_n) - \theta\]

\[\text{Var}(\hat{\theta}_n) = E\left[(\hat{\theta}_n - E(\hat{\theta}_n))^2\right]\]

\[\text{MSE}(\hat{\theta}_n) = E\left[(\hat{\theta}_n - \theta)^2\right] = \text{Bias}(\hat{\theta}_n)^2 + \text{Var}(\hat{\theta}_n)\]
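The last equality follows by adding and subtracting \(E(\hat{\theta}_n)\) inside the square; the cross term vanishes because \(E(\hat{\theta}_n) - \theta\) is a constant and \(E\left[\hat{\theta}_n - E(\hat{\theta}_n)\right] = 0\):

\[E\left[(\hat{\theta}_n - \theta)^2\right] = E\left[\left(\hat{\theta}_n - E(\hat{\theta}_n)\right)^2\right] + \left(E(\hat{\theta}_n) - \theta\right)^2 = \text{Var}(\hat{\theta}_n) + \text{Bias}(\hat{\theta}_n)^2.\]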

We want to show that \(\text{MSE}(\hat{\theta}_n) \to 0 \Rightarrow \hat{\theta}_n \xrightarrow{P} \theta\),

that is, \(\forall \varepsilon > 0\), \(\displaystyle \lim_{n \to \infty} P(|\hat{\theta}_n - \theta| > \varepsilon) = 0\).

Recall Markov’s inequality: if \(Y \geq 0\) and \(t > 0\), then \(P(Y > t) \leq E(Y)/t\). Applying it to \(Y = (X - a)^2 \geq 0\) with \(t = c^2\), for any constant \(a\) and any \(c > 0\), we get:

\[P(Y > c^2) = P\!\left((X-a)^2 > c^2\right) = P(|X-a| > c) \leq \frac{E\!\left[(X-a)^2\right]}{c^2} \]

Now let \(X = \hat{\theta}_n\), \(a = \theta\), \(c = \varepsilon > 0\). We get:

\[P(|\hat{\theta}_n - \theta| > \varepsilon) \leq \frac{E\left[(\hat{\theta}_n - \theta)^2\right]}{\varepsilon^2} = \frac{\text{MSE}(\hat{\theta}_n)}{\varepsilon^2}\]

Since the MSE of \(\hat\theta_n\) goes to \(0\) as \(n \to \infty\), the bound above implies that \[ \lim_{n\to\infty} P(|\hat{\theta}_n - \theta| > \varepsilon) = 0, \quad \text{that is, } \hat{\theta}_n \xrightarrow{P} \theta. \] We have shown that if \(\text{MSE}(\hat\theta_n) \to 0\), then \(\hat\theta_n\) must be a consistent estimator of \(\theta\).
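The bound can also be checked by simulation. The sketch below uses an illustrative setup (\(\hat\theta_n = \overline{X}_n\) under a \(\mathcal{N}(\mu, 1)\) model, chosen for convenience rather than taken from the text) and compares the empirical tail probability \(P(|\hat\theta_n - \theta| > \varepsilon)\) with the bound \(\text{MSE}(\hat\theta_n)/\varepsilon^2\) as \(n\) grows; both shrink toward zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# Empirical check that P(|hat_theta_n - theta| > eps) <= MSE / eps^2,
# and that both sides go to 0 as n grows. Illustrative model: Normal(mu, 1),
# with hat_theta_n = sample mean (an assumed choice, not mandated by the text).
mu, eps, reps = 2.0, 0.2, 10_000
for n in (10, 100, 1000):
    estimates = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
    tail_prob = np.mean(np.abs(estimates - mu) > eps)
    bound = np.mean((estimates - mu) ** 2) / eps**2
    print(n, tail_prob, bound)   # tail_prob <= bound, and both decrease in n
```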

Example

Let \(X_1, \dots, X_n \overset{IID}{\sim} \mathcal{N}(\mu, 1)\). Since \(E(X_i) = \mu\) and \(\mathrm{Var}(X_i) = 1\), we have that \(E(\overline{X}_n) = \mu\) and \(\mathrm{Var}(\overline{X}_n) = \dfrac{1}{n}\).

Let \(\hat\theta_n = \overline{X}_n + \dfrac{1}{n}\). Then \(E(\hat\theta_n) = E\left(\overline{X}_n + \dfrac{1}{n}\right) = \mu + \dfrac{1}{n}\), so the bias of \(\hat\theta_n\) is \(\dfrac{1}{n}\): \[ \mathrm{Bias}(\hat\theta_n) = E(\hat\theta_n) - \theta = \mu + \dfrac{1}{n} - \mu = \dfrac{1}{n}, \] since \(\theta = \mu\) in this example. Therefore, \(\mathrm{Bias}(\hat\theta_n) \to 0\) as \(n\to \infty\); in particular, \(\left(\mathrm{Bias}(\hat\theta_n)\right)^2 = \dfrac{1}{n^2} \to 0\) as \(n \to \infty\).

Further, \[ \mathrm{Var}(\hat\theta_n) = \mathrm{Var}\left(\overline{X}_n + \dfrac{1}{n}\right) = \mathrm{Var}\left(\overline{X}_n \right) = \dfrac{1}{n} \to 0 \text{ as } n \to \infty. \] Combining these results, the MSE of \(\hat\theta_n\) goes to 0 as \(n \to \infty\), and therefore \(\hat\theta_n\) is a consistent estimator of \(\mu\).
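As a quick numerical check of this example (a simulation sketch; the sample sizes and seed are arbitrary choices), the empirical MSE of \(\hat\theta_n = \overline{X}_n + \frac{1}{n}\) can be compared with the theoretical value \(\frac{1}{n} + \frac{1}{n^2}\), which indeed shrinks toward zero.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulation of the example: hat_theta_n = mean(X) + 1/n with X_i ~ Normal(mu, 1).
# The theoretical MSE is Var + Bias^2 = 1/n + 1/n^2, which goes to 0 with n.
mu, reps = 2.0, 10_000
for n in (10, 100, 1000):
    estimates = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1) + 1.0 / n
    empirical_mse = np.mean((estimates - mu) ** 2)
    theoretical_mse = 1.0 / n + 1.0 / n**2
    print(n, empirical_mse, theoretical_mse)   # both shrink toward 0 as n grows
```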


[Rice (2006); Pimentel (2024); Chihara and Hesterberg (2018); Hogg, McKean, and Craig (2005)]

References

Chihara, Laura M., and Tim C. Hesterberg. 2018. Mathematical Statistics with Resampling and R. Hoboken, NJ: John Wiley & Sons.
Hogg, Robert V., Joseph W. McKean, and Allen T. Craig. 2005. Introduction to Mathematical Statistics. 6th ed. Upper Saddle River, NJ: Pearson Prentice Hall.
Pimentel, Sam. 2024. “STAT 135 Lecture Slides.” Lecture slides (shared privately).
Rice, John A. 2006. Mathematical Statistics and Data Analysis. 3rd ed. Duxbury Press.