Consistency of the Maximum Likelihood Estimator
Introduction
A closer look at what consistency means
For this discussion, we follow Casella and Berger (2002):
Recall that a sequence of estimators \(\hat{\theta}_1, \hat{\theta}_2, \ldots\) is consistent for \(\theta_0\) if \(\hat{\theta}_n \xrightarrow{P} \theta_0\), where \(\theta_0\) is some fixed, but unknown value of \(\theta\).
Suppose \(X_1, X_2, \ldots \overset{\text{IID}}{\sim} f(x \mid \theta_0)\). We construct a sequence of estimators \(\hat{\theta}_1, \hat{\theta}_2, \ldots\)
Example 1 If \(\hat{\theta}_n = \overline{X}_n = \dfrac{\sum_{i=1}^n X_i}{n}\), then:
\[\hat{\theta}_1 = \overline{X}_1 = X_1\]
\[\hat{\theta}_2 = \overline{X}_2 = \frac{X_1 + X_2}{2}\]
\[\hat{\theta}_3 = \overline{X}_3 = \frac{X_1 + X_2 + X_3}{3}, \quad \text{etc.}\]
So really, what we are saying is that a sequence of estimators \(\{\hat{\theta}_n\}\) is a consistent sequence for the parameter \(\theta\) if for every \(\varepsilon > 0\) and every \(\theta \in \Theta\) (the parameter space):
\[\lim_{n \to \infty} P\!\left(|\hat{\theta}_n - \theta| < \varepsilon\right) = 1\]
Note that this means that as the sample size gets larger and sample information gets better, the estimator will be arbitrarily close to the true parameter with high probability.
Note that the definition of consistency deals with the entire family of distributions indexed by \(\theta\).
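To make the definition concrete, here is a small simulation sketch estimating \(P(|\overline{X}_n - \theta_0| < \varepsilon)\) for growing \(n\). The values \(\theta_0 = 5\), \(\varepsilon = 0.1\), and the \(N(\theta_0, 1)\) model are illustration choices, not from the notes:

```python
import numpy as np

# Monte Carlo estimate of P(|X̄_n - θ₀| < ε) for several n.
# Illustration values (not from the notes): θ₀ = 5, ε = 0.1, N(θ₀, 1) data.
rng = np.random.default_rng(0)
theta0, eps, reps = 5.0, 0.1, 2000

def coverage(n):
    """Fraction of replications where the sample mean lands within ε of θ₀."""
    xbar = rng.normal(theta0, 1.0, size=(reps, n)).mean(axis=1)
    return np.mean(np.abs(xbar - theta0) < eps)

probs = [coverage(n) for n in (10, 100, 1000)]
# probs increases toward 1 as n grows, matching the definition of consistency.
```

Each entry of `probs` estimates the probability in the definition at one sample size; consistency says these estimates should approach 1.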
In the following example, we prove consistency of the estimator directly from the definition.
Example 2 Let \(X_i \sim N(\theta, 1)\). This implies that \(f(x \mid \theta) = \dfrac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x-\theta)^2}\).
If \(X_1, \ldots, X_n \overset{\text{IID}}{\sim} N(\theta, 1)\), then \(\overline{X}_n \sim N\!\left(\theta, \tfrac{1}{n}\right)\).
We want to show \(P(|\overline{X}_n - \theta| < \varepsilon) \to 1\). Note that since \(\overline{X}_n \sim N\!\left(\theta, \tfrac{1}{n}\right)\), its density is \(f_{\overline{X}_n}(x) = \sqrt{\dfrac{n}{2\pi}}\, e^{-\frac{n(x-\theta)^2}{2}}\).
\[P(|\overline{X}_n - \theta| < \varepsilon) = P(\theta - \varepsilon < \overline{X}_n < \theta + \varepsilon)\]
\[= \int_{\theta-\varepsilon}^{\theta+\varepsilon} f_{\overline{X}_n}(x)\, dx = \int_{\theta-\varepsilon}^{\theta+\varepsilon} \sqrt{\frac{n}{2\pi}}\, e^{-\frac{n(x-\theta)^2}{2}}\, dx\]
We will do two variable changes: First, let \(y = x - \theta\), so \(dy = dx\); when \(x = \theta - \varepsilon\), \(y = -\varepsilon\), and when \(x = \theta + \varepsilon\), \(y = \varepsilon\):
\[= \sqrt{\frac{n}{2\pi}} \int_{-\varepsilon}^{\varepsilon} e^{-\frac{ny^2}{2}}\, dy\]
Next, let \(t = \sqrt{n}\, y\), so \(dt = \sqrt{n}\, dy\), \(t^2 = ny^2\); when \(y = -\varepsilon\), \(t = -\varepsilon\sqrt{n}\):
\[\Rightarrow \frac{1}{\sqrt{2\pi}} \int_{-\varepsilon\sqrt{n}}^{\varepsilon\sqrt{n}} e^{-t^2/2}\, dt = P(-\varepsilon\sqrt{n} < Z < \varepsilon\sqrt{n})\]
As \(n \to \infty\), \(\varepsilon\sqrt{n} \to \infty\), so \(P(-\varepsilon\sqrt{n} < Z < \varepsilon\sqrt{n}) \to P(-\infty < Z < \infty) = 1\), where \(Z \sim \mathcal{N}(0,1)\).
Therefore, \(\overline{X}_n \text{ is a consistent (sequence of) estimator(s) of } \theta. \quad \blacksquare\).
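The closed form derived above, \(P(|\overline{X}_n - \theta| < \varepsilon) = P(-\varepsilon\sqrt{n} < Z < \varepsilon\sqrt{n}) = \operatorname{erf}(\varepsilon\sqrt{n}/\sqrt{2})\), can be checked numerically. The sketch below uses illustration values \(\theta = 0\), \(\varepsilon = 0.2\), \(n = 50\) (chosen here, not from the notes):

```python
import math
import numpy as np

# Check of Example 2: P(|X̄_n - θ| < ε) = P(-ε√n < Z < ε√n) = erf(ε√n / √2).
# Illustration values (not from the notes): θ = 0, ε = 0.2, n = 50.
rng = np.random.default_rng(1)
theta, eps, n, reps = 0.0, 0.2, 50, 100_000

xbar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
mc = np.mean(np.abs(xbar - theta) < eps)              # Monte Carlo estimate
exact = math.erf(eps * math.sqrt(n) / math.sqrt(2))   # closed form from the proof
```

The Monte Carlo estimate and the closed form should agree up to simulation error.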
Sufficient Condition for Consistency
Last time we showed that if the MSE of an estimator goes to 0, then the estimator must be consistent. This is often easier to show than computing the probability that we need, and that is why the theorem is so useful.
If we can show that, as \(n \to \infty\):
- \(\operatorname{Var}(\hat{\theta}_n) \to 0\)
- \((\operatorname{Bias}(\hat{\theta}_n))^2 \to 0\)
then \(\operatorname{MSE}(\hat{\theta}_n) \to 0\), which implies \(\hat{\theta}_n\) is consistent. In the above example, since \(\overline{X}_n \sim \mathcal{N}\left(\theta, \dfrac{1}{n}\right),\) it is immediate that the MSE \(\to 0\), and therefore \(\overline{X}_n\) is a consistent estimator of \(\theta\).
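A quick numerical sketch of the sufficient condition: for \(\overline{X}_n\) under \(N(\theta, 1)\) data, \(\operatorname{Bias} = 0\) and \(\operatorname{Var} = 1/n\), so the empirical MSE should track \(1/n\). The value \(\theta = 2\) is an illustration choice, not from the notes:

```python
import numpy as np

# Empirical MSE of X̄_n under N(θ, 1) data: Bias = 0 and Var = 1/n,
# so MSE(X̄_n) = 1/n → 0, which is the sufficient condition above.
# Illustration value (not from the notes): θ = 2.
rng = np.random.default_rng(2)
theta, reps = 2.0, 20_000

def empirical_mse(n):
    """Average of (X̄_n - θ)² over many replications."""
    xbar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
    return np.mean((xbar - theta) ** 2)

mses = {n: empirical_mse(n) for n in (5, 50, 500)}
# Each mses[n] tracks 1/n, so the MSE shrinks to 0 as n grows.
```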
Caveat: Consistency \(\not\Rightarrow\) MSE \(\to 0\)
Suppose we are trying to estimate the mean \(\theta = \mu\). Let:
\[\hat{\theta}_n = \begin{cases} \overline{X}_n & \text{w.p. } 1 - \tfrac{1}{n} \\ n & \text{w.p. } \tfrac{1}{n} \end{cases}\]
With probability \(1 - \frac{1}{n}\) we have \(\hat{\theta}_n = \overline{X}_n\), and \(\overline{X}_n \xrightarrow{P} \mu = \theta\); since \(P(\hat{\theta}_n = n) = \frac{1}{n} \to 0\), it follows that \(\hat{\theta}_n \xrightarrow{P} \theta\).
But with probability \(\frac{1}{n}\), \(\hat{\theta}_n = n\), and the MSE satisfies:
\[\operatorname{MSE} = E\!\left[(\hat{\theta}_n - \theta)^2\right]\]
\[= E\!\left[(\overline{X}_n - \theta)^2 \cdot \mathbf{1}(\hat{\theta}_n = \overline{X}_n)\right] + E\!\left[(n - \theta)^2 \cdot \mathbf{1}(\hat{\theta}_n = n)\right]\]
\[\geq E\!\left[(n - \theta)^2 \cdot \mathbf{1}(\hat{\theta}_n = n)\right] = (n - \theta)^2 \, P(\hat{\theta}_n = n) = \frac{(n-\theta)^2}{n} \to \infty \text{ as } n \to \infty\]
Therefore, consistency does not imply that \(\operatorname{MSE} \to 0\).
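The caveat estimator can be simulated directly. With illustration values \(\theta = 0\) and \(n = 400\) (chosen here, not from the notes), the estimator is within \(\varepsilon\) of \(\theta\) with probability near 1, yet its empirical MSE is on the order of \((n - \theta)^2/n\):

```python
import numpy as np

# Simulation of the caveat estimator: θ̂_n = X̄_n w.p. 1 - 1/n, and n w.p. 1/n.
# It is consistent for θ, yet its MSE grows like (n - θ)²/n.
# Illustration values (not from the notes): θ = 0, n = 400, N(θ, 1) data.
rng = np.random.default_rng(3)
theta, n, reps = 0.0, 400, 20_000

xbar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
contaminated = rng.random(reps) < 1.0 / n          # the w.p. 1/n branch
est = np.where(contaminated, float(n), xbar)

prob_close = np.mean(np.abs(est - theta) < 0.2)    # near 1: consistency
emp_mse = np.mean((est - theta) ** 2)              # near (n - θ)²/n = 400
```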
Consistency of the MLE of \(\theta\)
As we might guess, the maximum likelihood estimator from an IID sample with PDF \(f(x \mid \theta)\) is consistent for \(\theta\), under appropriate smoothness conditions on \(f\): we can take as many continuous derivatives of \(f\) as we need, and \(f\) won't have any sharp corners, etc.
In the following, we will assume \(f\) is smooth. This will ensure that it is “well-behaved” and we can interchange the orders of integration and differentiation.
In our discussion below, \(\theta_0\) is the true unknown value of the parameter \(\theta\), and \(f(x \mid \theta_0)\) is the true PDF (\(\theta_0\) is some fixed, but unknown value).
Recall that \(X_1, \dots, X_n \overset{\text{IID}}{\sim} f(x \mid \theta)\) is a random sample, and \[ \operatorname{lik}(\theta) = \operatorname{lik}(\theta \mid X_1, \dots, X_n) = \prod_{i=1}^n f(X_i \mid \theta) \]
Define \(\ell(\theta) = \log \operatorname{lik}(\theta)\). Then, \[ \frac{1}{n} \ell(\theta) =\frac{1}{n} \log \operatorname{lik}(\theta) = \frac{1}{n} \sum_{i=1}^n \log f(X_i\mid \theta). \]
For ease of writing, we will write \(\ell'(\theta) = \dfrac{\partial \ell}{\partial \theta}\) and \(\ell''(\theta) = \dfrac{\partial^2 \ell}{\partial \theta^2}\) where possible.
Theorem 1 Let \(\widehat{\theta}_n\) be the MLE of \(\theta_0\). Then \(\hat \theta_n\) is consistent for \(\theta_0\). That is, \(\hat \theta_n \overset{P}{\rightarrow}\theta_0\). This means that, for any \(\epsilon > 0\), \[ \lim_{n \rightarrow \infty} P(\lvert \hat \theta_n - \theta_0\rvert > \epsilon) = 0. \]
Proof:
We start from \[ \frac{1}{n} \ell(\theta) = \frac{1}{n} \sum_{i=1}^n \log f(X_i \mid \theta). \]
The RHS is the sample mean of the IID random variables \[ Y_i = \log f(X_i \mid \theta). \]
By the WLLN, \[ \frac{1}{n} \sum_{i=1}^n Y_i \xrightarrow{P} E[Y]. \] Now, since \(Y \equiv Y(\theta)= \log f(X\mid \theta)\), we have
\[ E[Y] = E[\log f(X \mid \theta)] = \int \log f(x \mid \theta) \cdot f(x \mid \theta_0)\, dx \] where \(f(x \mid \theta_0)\) is the true PDF from which the sample \(X_1, \dots, X_n\) was generated.
By the WLLN, for large \(n\), with high probability,\(\dfrac{1}{n} \sum_{i=1}^n Y_i\) and \(E[Y]\) will be close, which means that \(\dfrac{1}{n} \ell(\theta)\) and \(E[Y]\) will be close.
The maximizer of \(\ell(\theta)\) is \(\hat{\theta}_n\), and by the above, this should be close to the maximizer of \(\displaystyle \int \log f(x \mid \theta) \cdot f(x \mid \theta_0)\, dx\).
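This WLLN step can be sketched numerically for the \(N(\theta, 1)\) model, where the expectation has the closed form \(E_{\theta_0}[\log f(X \mid \theta)] = -\tfrac12\log(2\pi) - \tfrac12\bigl(1 + (\theta - \theta_0)^2\bigr)\). The values \(\theta_0 = 0\) and \(\theta = 1\) are illustration choices, not from the notes:

```python
import numpy as np

# WLLN check: the average log-likelihood (1/n) Σ log f(X_i | θ) for N(θ, 1)
# should approach E_{θ₀}[log f(X | θ)] = -½log(2π) - ½(1 + (θ - θ₀)²).
# Illustration values (not from the notes): θ₀ = 0, evaluated at θ = 1.
rng = np.random.default_rng(5)
theta0, theta = 0.0, 1.0

def avg_loglik(n):
    """(1/n) Σ log f(X_i | θ) for an IID N(θ₀, 1) sample of size n."""
    x = rng.normal(theta0, 1.0, size=n)
    return np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2)

limit = -0.5 * np.log(2 * np.pi) - 0.5 * (1 + (theta - theta0) ** 2)
gap = abs(avg_loglik(200_000) - limit)  # shrinks as n grows
```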
Let’s maximize \(E[Y]\) (note that \(E \equiv E_{\theta_0}\)): \[ \begin{align*} E[Y] &= E[\log f(X \mid \theta)] \\ &= \int \log f(x \mid \theta) \cdot f(x \mid \theta_0)\, dx \\ \end{align*} \] Taking the derivative with respect to \(\theta\) on both sides (and using the fact that \(f\) is smooth to justify changing the order of operations) we get: \[ \begin{align*} \frac{\partial}{\partial \theta} E[\log f(X \mid \theta)] &= \frac{\partial}{\partial \theta} \int \log f(x \mid \theta) \cdot f(x \mid \theta_0)\, dx\\ &= \int \frac{\partial}{\partial \theta}\log f(x \mid \theta)\cdot f(x \mid \theta_0)\, dx\\ &= \int \frac{\frac{\partial}{\partial \theta} f(x \mid \theta)} {f(x \mid \theta)} \cdot f(x \mid \theta_0)\, dx \end{align*} \]
We now check that \(\theta_0\) is a critical point of \(E[Y] = E[\log f(X \mid \theta)]\), by plugging \(\theta_0\) into the right-hand side above to get:
\[ \int \frac{\left.\frac{\partial}{\partial \theta} f(x \mid \theta)\right|_{\theta=\theta_0}}{f(x \mid \theta_0)} \cdot f(x \mid \theta_0)\, dx = \int \left.\frac{\partial}{\partial \theta} f(x \mid \theta)\right|_{\theta=\theta_0} dx = \left.\frac{\partial}{\partial \theta} \int f(x \mid \theta)\, dx \,\right|_{\theta=\theta_0} = 0. \]
We get \(0\) at the end because \(\int f(x \mid \theta)\, dx = 1\), since \(f(x \mid \theta)\) is a density function and integrates to 1. Therefore, we see that \(\theta_0\) is a stationary point for \(E[\log f(X \mid \theta)]\).
We can show that \(\theta_0\) is a max by showing that \(E[\log f(X \mid \theta)] - E[\log f(X \mid \theta_0)] \le 0\) for all \(\theta\).
\[ \begin{align*} E[\log f(X \mid \theta)] - E[\log f(X \mid \theta_0)] &= \int \log f(x \mid \theta) \cdot f(x \mid \theta_0) \, dx - \int \log f(x \mid \theta_0) \cdot f(x \mid \theta_0) \, dx\\ &= \int \left[ \log f(x \mid \theta) - \log f(x \mid \theta_0) \right] \cdot f(x \mid \theta_0) \, dx\\ &= \int \log \left[ \frac{f(x \mid \theta)}{f(x \mid \theta_0)} \right] \cdot f(x \mid \theta_0) \, dx \\ \end{align*} \]
Note that \(\log t \leq t - 1\) for all \(t > 0\). Applying this with \(t = \displaystyle \frac{f(x \mid \theta)}{f(x \mid \theta_0)}\) gives
\[ \begin{align*} E[\log f(X \mid \theta)] - E[\log f(X \mid \theta_0)] &\leq \int \left[\frac{f(x \mid \theta)}{f(x \mid \theta_0)} - 1 \right] \cdot f(x \mid \theta_0) \, dx\\ &= \int f(x \mid \theta) \, dx \;-\; \int f(x \mid \theta_0) \, dx \\ &= 1 - 1 = 0 \end{align*} \]
Therefore, \[ E[\log f(X \mid \theta)] - E[\log f(X \mid \theta_0)] \leq 0 \quad \text{for all } \theta \]
This means that \(\theta_0\) maximizes \(E_{\theta_0}[\log f(X \mid \theta)]\) over \(\theta \in \Theta\).
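For the \(N(\theta, 1)\) model this maximization can be checked on a grid, using the closed form \(E_{\theta_0}[\log f(X \mid \theta)] = -\tfrac12\log(2\pi) - \tfrac12\bigl(1 + (\theta - \theta_0)^2\bigr)\). The value \(\theta_0 = 1.5\) is an illustration choice, not from the notes:

```python
import numpy as np

# Grid check that θ₀ maximizes E_{θ₀}[log f(X | θ)] in the N(θ, 1) model,
# using the closed form -½log(2π) - ½(1 + (θ - θ₀)²).
# Illustration value (not from the notes): θ₀ = 1.5.
theta0 = 1.5
grid = np.linspace(-2.0, 5.0, 1401)  # step 0.005, so the grid contains θ₀

def expected_loglik(theta):
    """E_{θ₀}[log f(X | θ)] for the N(θ, 1) model."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (1 + (theta - theta0) ** 2)

values = expected_loglik(grid)
argmax = grid[np.argmax(values)]  # lands (numerically) on θ₀
```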
So, by the Weak Law of Large Numbers:
\[\frac{1}{n} \sum_{i=1}^{n} \log f(X_i \mid \theta) \xrightarrow{P} E_{\theta_0} [ \log f(X \mid \theta)]\]
- The left side is \(\dfrac{1}{n}\ell(\theta)\), maximized by \(\hat{\theta}_n\) (the MLE).
- The right side is \(E[Y(\theta)]\), maximized by \(\theta_0\) (the true parameter).
Summary of the Consistency Argument
- \(\hat{\theta}_n\) maximizes \(\displaystyle \sum_{i=1}^n Y_i = \ell(\theta)\).
- \(\theta_0\) maximizes \(E[Y] \equiv E[Y(\theta)]= E\left[\log f(X\mid \theta)\right]\)
- By the WLLN, \(\dfrac{1}{n}\displaystyle\sum_{i=1}^n Y_i \xrightarrow{P} E[Y]\) for every \(\theta\), and therefore the maximizer \(\hat{\theta}_n\) of the left side converges to the maximizer \(\theta_0\) of \(E[Y]\). (This is a consequence of the continuous mapping theorem; see the footnote.)
Therefore \[ \left(\hat{\theta}_n\right)_{MLE} \xrightarrow{P} \theta_0 \]
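As an end-to-end sketch of the theorem in a model other than the normal (an illustration choice, not from the notes): for \(\text{Exponential}(\lambda_0)\) data the MLE is \(\hat{\lambda}_n = 1/\overline{X}_n\), and \(P(|\hat{\lambda}_n - \lambda_0| > \varepsilon)\) should fall toward 0 as \(n\) grows:

```python
import numpy as np

# End-to-end sketch of Theorem 1 for Exponential(λ₀) data, where the MLE
# is λ̂_n = 1 / X̄_n.  P(|λ̂_n - λ₀| > ε) should fall toward 0 as n grows.
# Illustration values (not from the notes): λ₀ = 2, ε = 0.1.
rng = np.random.default_rng(4)
lam0, eps, reps = 2.0, 0.1, 5000

def miss_rate(n):
    """Fraction of replications with |λ̂_n - λ₀| > ε."""
    xbar = rng.exponential(1.0 / lam0, size=(reps, n)).mean(axis=1)
    lam_hat = 1.0 / xbar
    return np.mean(np.abs(lam_hat - lam0) > eps)

miss = {n: miss_rate(n) for n in (20, 200, 2000)}
# miss[n] decreases toward 0, matching λ̂_n →P λ₀.
```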
References
Casella, G. and Berger, R. L. (2002). Statistical Inference, 2nd ed. Duxbury.
Footnotes
Continuous Mapping Theorem: If we have a sequence of random variables \(\{X_n\}\) such that \(X_n \xrightarrow{P} X\) and \(g\) is a continuous function, then \(g(X_n) \xrightarrow{P} g(X)\).