Mean Squared Error and Consistency
Introduction
Now we will look at the properties of maximum likelihood estimation when we have large samples. We want to consider the properties of the estimator: Is it unbiased? Consistent? What can we say about the distribution of the estimator when the sample size is large?
The two main results that we will establish in this chapter and the next are that the MLE is consistent, and that the MLE is asymptotically normal. The variance of the MLE's limiting distribution is related to a quantity called the Fisher information. First, we will establish that the MLE is consistent.
Consistency of the MLE of \(\theta\).
As we might guess, the maximum likelihood estimator based on an IID sample with PDF \(f(x \mid \theta)\) is consistent for \(\theta\), under appropriate smoothness conditions on \(f\) (roughly, that we can take as many continuous derivatives of \(f\) as we need, and that \(f\) has no sharp corners).
In the following, we will assume \(f\) is smooth. This will ensure that it is “well-behaved” and we can interchange the orders of integration and differentiation.
In our discussion below, \(\theta_0\) is the true unknown value of the parameter \(\theta\), and \(f(x \mid \theta_0)\) is the true PDF (\(\theta_0\) is some fixed, but unknown value).
Recall that \(X_1, \dots, X_n \overset{\text{IID}}{\sim} f(x \mid \theta)\) is a random sample, and \[ \operatorname{lik}(\theta) = \operatorname{lik}(\theta \mid X_1, \dots, X_n) = \prod_{i=1}^n f(X_i \mid \theta) \]
Define \(\ell(\theta) = \log \operatorname{lik}(\theta)\). Then, \[ \frac{1}{n} \ell(\theta) =\frac{1}{n} \log \operatorname{lik}(\theta) = \frac{1}{n} \sum_{i=1}^n \log f(X_i\mid \theta). \]
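As a quick numerical illustration of the average log-likelihood \(\frac{1}{n}\ell(\theta)\), here is a minimal Python sketch for a \(\mathcal{N}(\theta, 1)\) sample. The true value \(\theta_0 = 2\), the sample size, the seed, and the grid of candidate \(\theta\) values are all assumptions made for the demonstration, not part of the development above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup for illustration: IID sample from N(theta_0, 1), theta_0 = 2
theta_0 = 2.0
n = 1000
x = rng.normal(theta_0, 1.0, size=n)

def avg_log_lik(theta, x):
    """(1/n) * sum_i log f(x_i | theta) for the N(theta, 1) density."""
    log_f = -0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2
    return log_f.mean()

# Maximize (1/n) * l(theta) over a grid of candidate values
thetas = np.linspace(0, 4, 401)
vals = [avg_log_lik(t, x) for t in thetas]
theta_hat = thetas[int(np.argmax(vals))]
print(theta_hat)  # close to theta_0 = 2 (and to the sample mean)
```

For the normal model the maximizer is exactly the sample mean, so the grid maximizer lands within one grid step of \(\overline{X}_n\).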
Claim: If MSE of an estimator goes to 0, the estimator must be consistent
Recall the following definitions for an estimator \(\hat \theta_n\) of the parameter \(\theta\).
\[\text{Bias}(\hat{\theta}_n) = E(\hat{\theta}_n) - \theta\]
\[\text{Var}(\hat{\theta}_n) = E\left[(\hat{\theta}_n - E(\hat{\theta}_n))^2\right]\]
\[\text{MSE}(\hat{\theta}_n) = E\left[(\hat{\theta}_n - \theta)^2\right] = \text{Bias}(\hat{\theta}_n)^2 + \text{Var}(\hat{\theta}_n)\]
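The decomposition \(\text{MSE} = \text{Bias}^2 + \text{Var}\) can be checked by Monte Carlo. The sketch below uses an intentionally biased estimator, \(\hat\theta = \max_i X_i\) for Uniform\((0, \theta)\) data; the parameter value, sample size, seed, and replication count are assumptions made for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed setup: theta_hat = max(X_i) for Uniform(0, theta) data, which is
# biased downward (E[max] = n * theta / (n + 1)).
theta, n, reps = 5.0, 20, 200_000
samples = rng.uniform(0, theta, size=(reps, n))
theta_hat = samples.max(axis=1)

bias = theta_hat.mean() - theta                # Monte Carlo estimate of Bias
var = theta_hat.var()                          # Monte Carlo estimate of Var
mse = ((theta_hat - theta) ** 2).mean()        # Monte Carlo estimate of MSE

print(bias**2 + var, mse)  # the two quantities agree
```

In fact the identity holds exactly for the empirical versions of these quantities (it is the usual sum-of-squares decomposition), so the two printed numbers agree up to floating-point error, not just up to Monte Carlo error.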
We want to show that \(\text{MSE}(\hat{\theta}_n) \to 0 \Rightarrow \hat{\theta}_n \xrightarrow{P} \theta\),
that is, \(\forall \varepsilon > 0\), \(\displaystyle \lim_{n \to \infty} P(|\hat{\theta}_n - \theta| > \varepsilon) = 0\).
Applying Markov’s inequality to \(Y = (X - a)^2 \geq 0\), for a constant \(a\) and any \(c > 0\), we get:
\[P(Y > c^2) = P\!\left((X-a)^2 > c^2\right) = P(|X-a| > c) \leq \frac{E\!\left[(X-a)^2\right]}{c^2} \]
Now let \(X = \hat{\theta}_n\), \(a = \theta\), \(c = \varepsilon > 0\). We get:
\[P(|\hat{\theta}_n - \theta| > \varepsilon) \leq \frac{E\left[(\hat{\theta}_n - \theta)^2\right]}{\varepsilon^2} = \frac{\text{MSE}(\hat{\theta}_n)}{\varepsilon^2}\]
Since the MSE of \(\hat\theta_n\) goes to \(0\) as \(n \to \infty\), the bound above implies that \[ \lim_{n\to\infty} P(|\hat{\theta}_n - \theta| > \varepsilon) = 0 \Rightarrow \hat{\theta}_n \xrightarrow{P} \theta. \] We have shown that if \(\text{MSE}(\hat\theta_n) \to 0\), then \(\hat\theta_n\) must be a consistent estimator of \(\theta\).
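The bound \(P(|\hat{\theta}_n - \theta| > \varepsilon) \leq \text{MSE}(\hat{\theta}_n)/\varepsilon^2\) can be seen numerically. The sketch below uses the sample mean of \(\mathcal{N}(0, 1)\) data; the choices of \(\varepsilon\), seed, sample sizes, and replication count are assumptions made for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed setup: theta_hat_n = sample mean of n IID N(0, 1) observations.
# Estimate the tail probability and the MSE/eps^2 bound at several n.
theta, eps, reps = 0.0, 0.2, 10_000
results = []
for n in (10, 100, 1000):
    x_bar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
    tail = np.mean(np.abs(x_bar - theta) > eps)     # est. P(|theta_hat - theta| > eps)
    bound = np.mean((x_bar - theta) ** 2) / eps**2  # est. MSE / eps^2
    results.append((n, tail, bound))
    print(n, tail, bound)
```

The tail probability stays below the bound at every \(n\), and both shrink toward \(0\) as \(n\) grows, which is exactly the mechanism behind the consistency argument above.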
Example
Let \(X_1, \dots, X_n \overset{IID}{\sim} \mathcal{N}(\mu, 1)\). Since \(E(X_i) = \mu\) and \(\mathrm{Var}(X_i) = 1\), we have that \(E(\overline{X}_n) = \mu\) and \(\mathrm{Var}(\overline{X}_n) = \dfrac{1}{n}\).
Let \(\hat\theta_n = \overline{X}_n + \dfrac{1}{n}\). Since \(E(\hat\theta_n) = E\left(\overline{X}_n + \dfrac{1}{n}\right) = \mu + \dfrac{1}{n}\), and \(\theta = \mu\) in this example, the bias of \(\hat\theta_n\) is \[ \mathrm{Bias}(\hat\theta_n) = E(\hat\theta_n) - \theta = \mu + \dfrac{1}{n} - \mu = \dfrac{1}{n}. \] Therefore, \(\mathrm{Bias}(\hat\theta_n) \to 0\) as \(n\to \infty\), and likewise \(\left(\mathrm{Bias}(\hat\theta_n)\right)^2 = \dfrac{1}{n^2} \to 0\) as \(n \to \infty\).
Further, \[ \mathrm{Var}(\hat\theta_n) = \mathrm{Var}\left(\overline{X}_n + \dfrac{1}{n}\right) = \mathrm{Var}\left(\overline{X}_n \right) = \dfrac{1}{n} \to 0 \text{ as } n \to \infty. \] Combining the two statements above, \(\text{MSE}(\hat\theta_n) = \dfrac{1}{n^2} + \dfrac{1}{n} \to 0\) as \(n \to \infty\), and therefore \(\hat\theta_n\) is a consistent estimator of \(\mu\).
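The example above can be checked by simulation. The sketch below estimates the MSE of \(\hat\theta_n = \overline{X}_n + \frac{1}{n}\) at several sample sizes; the value \(\mu = 1\), the seed, and the replication count are assumptions made for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed setup: theta_hat_n = X_bar_n + 1/n for IID N(mu, 1) data, mu = 1.
# Theory above says MSE(theta_hat_n) = 1/n^2 + 1/n, which shrinks to 0.
mu, reps = 1.0, 10_000
mses = []
for n in (10, 100, 1000):
    x_bar = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
    theta_hat = x_bar + 1.0 / n
    mses.append(((theta_hat - mu) ** 2).mean())

print(mses)  # decreasing, close to 1/n^2 + 1/n at each n
```

At \(n = 10\) the theoretical MSE is \(1/100 + 1/10 = 0.11\), and the Monte Carlo estimate lands near that value; by \(n = 1000\) it has fallen to roughly \(0.001\), consistent with \(\hat\theta_n \xrightarrow{P} \mu\).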