Consistency of the Maximum Likelihood Estimator

Introduction

A closer look at what consistency means

For this discussion, we follow Casella and Berger (2002):

Recall that a sequence of estimators \(\hat{\theta}_1, \hat{\theta}_2, \ldots\) is consistent for \(\theta_0\) if \(\hat{\theta}_n \xrightarrow{P} \theta_0\), where \(\theta_0\) is some fixed, but unknown value of \(\theta\).

Suppose \(X_1, X_2, \ldots \overset{\text{IID}}{\sim} f(x \mid \theta_0)\). We construct a sequence of estimators \(\hat{\theta}_1, \hat{\theta}_2, \ldots\)

Example 1 If \(\hat{\theta}_n = \overline{X}_n = \dfrac{\sum_{i=1}^n X_i}{n}\), then:

\[\hat{\theta}_1 = \overline{X}_1 = X_1\]

\[\hat{\theta}_2 = \overline{X}_2 = \frac{X_1 + X_2}{2}\]

\[\hat{\theta}_3 = \overline{X}_3 = \frac{X_1 + X_2 + X_3}{3}, \quad \text{etc.}\]

So really, what we are saying is that a sequence of estimators \(\{\hat{\theta}_n\}\) is a consistent sequence for the parameter \(\theta\) if for every \(\varepsilon > 0\) and every \(\theta \in \Theta\) (the parameter space):

\[\lim_{n \to \infty} P\!\left(|\hat{\theta}_n - \theta| < \varepsilon\right) = 1\]

Note that this means that as the sample size gets larger and the sample carries more information, the estimator will be within any fixed distance \(\varepsilon\) of the true parameter with probability approaching 1.

Note that the definition of consistency deals with the entire family of distributions indexed by \(\theta\).
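
To make the definition concrete, here is a minimal simulation sketch (assuming NumPy is available; the true value \(\theta_0 = 2\), \(\varepsilon = 0.1\), and the \(N(\theta_0, 1)\) model are illustrative choices, not part of the definition) that estimates \(P(|\overline{X}_n - \theta_0| < \varepsilon)\) for increasing \(n\); the estimated probabilities climb toward 1.

```python
import numpy as np

# Monte Carlo check of the consistency definition for the sample mean.
# Illustrative setup: X_i ~ N(theta0, 1) with theta0 = 2 and eps = 0.1.
rng = np.random.default_rng(0)
theta0, eps, reps = 2.0, 0.1, 5000

for n in [10, 100, 1000, 10000]:
    samples = rng.normal(loc=theta0, scale=1.0, size=(reps, n))
    xbar = samples.mean(axis=1)                       # theta_hat_n for each replication
    coverage = np.mean(np.abs(xbar - theta0) < eps)   # estimates P(|theta_hat_n - theta0| < eps)
    print(f"n = {n:6d}   P(|xbar - theta0| < {eps}) ≈ {coverage:.3f}")
```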


In the following example, we will prove consistency of an estimator directly from the definition.

Example 2 Let \(X_i \sim N(\theta, 1)\). This implies that \(f(x \mid \theta) = \dfrac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x-\theta)^2}\).

If \(X_1, \ldots, X_n \overset{\text{IID}}{\sim} N(\theta, 1)\), then \(\overline{X}_n \sim N\!\left(\theta, \tfrac{1}{n}\right)\).

We want to show \(P(|\overline{X}_n - \theta| < \varepsilon) \to 1\). Note that since \(\overline{X}_n \sim N\!\left(\theta, \tfrac{1}{n}\right)\), its density is \(f_{\overline{X}_n}(x) = \sqrt{\dfrac{n}{2\pi}}\, e^{-\frac{n(x-\theta)^2}{2}}\).

\[P(|\overline{X}_n - \theta| < \varepsilon) = P(\theta - \varepsilon < \overline{X}_n < \theta + \varepsilon)\]

\[= \int_{\theta-\varepsilon}^{\theta+\varepsilon} f_{\overline{X}_n}(x \mid \theta)\, dx = \int_{\theta-\varepsilon}^{\theta+\varepsilon} \sqrt{\frac{n}{2\pi}}\, e^{-\frac{n(x-\theta)^2}{2}}\, dx\]

We will do two changes of variable. First, let \(y = x - \theta\), so \(dy = dx\); when \(x = \theta - \varepsilon\), \(y = -\varepsilon\), and when \(x = \theta + \varepsilon\), \(y = \varepsilon\):

\[= \sqrt{\frac{n}{2\pi}} \int_{-\varepsilon}^{\varepsilon} e^{-\frac{ny^2}{2}}\, dy\]

Next, let \(t = \sqrt{n}\, y\), so \(dt = \sqrt{n}\, dy\) and \(t^2 = ny^2\); when \(y = -\varepsilon\), \(t = -\varepsilon\sqrt{n}\), and when \(y = \varepsilon\), \(t = \varepsilon\sqrt{n}\):

\[\Rightarrow \frac{1}{\sqrt{2\pi}} \int_{-\varepsilon\sqrt{n}}^{\varepsilon\sqrt{n}} e^{-t^2/2}\, dt = P(-\varepsilon\sqrt{n} < Z < \varepsilon\sqrt{n})\]

As \(n \to \infty\), \(\varepsilon\sqrt{n} \to \infty\), so \(P(-\varepsilon\sqrt{n} < Z < \varepsilon\sqrt{n}) \to P(-\infty < Z < \infty) = 1\), where \(Z \sim \mathcal{N}(0,1)\).

Therefore, \(\overline{X}_n \text{ is a consistent (sequence of) estimator(s) of } \theta. \quad \blacksquare\).
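
The change of variables shows that \(P(|\overline{X}_n - \theta| < \varepsilon) = \Phi(\varepsilon\sqrt{n}) - \Phi(-\varepsilon\sqrt{n})\), where \(\Phi\) is the standard normal CDF. As a quick sanity check, the sketch below (assuming SciPy; \(\theta = 0\) and \(\varepsilon = 0.1\) are illustrative values) compares this closed form with a simulation.

```python
import numpy as np
from scipy.stats import norm

# Compare the exact probability Phi(eps*sqrt(n)) - Phi(-eps*sqrt(n)) from the
# change of variables with a Monte Carlo estimate. theta = 0, eps = 0.1 are illustrative.
rng = np.random.default_rng(1)
theta, eps, reps = 0.0, 0.1, 20000

for n in [25, 100, 400, 1600]:
    exact = norm.cdf(eps * np.sqrt(n)) - norm.cdf(-eps * np.sqrt(n))
    xbar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
    simulated = np.mean(np.abs(xbar - theta) < eps)
    print(f"n = {n:5d}   exact = {exact:.4f}   simulated = {simulated:.4f}")
```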

Sufficient Condition for Consistency

Last time we showed that if the MSE of an estimator goes to 0, then the estimator must be consistent. This is often easier to verify than computing the limiting probability directly, which is why the theorem is so useful.

If we can show that, as \(n \to \infty\):

  1. \(\operatorname{Var}(\hat{\theta}_n) \to 0\)
  2. \((\operatorname{Bias}(\hat{\theta}_n))^2 \to 0\)

then \(\operatorname{MSE}(\hat{\theta}_n) \to 0\), which implies \(\hat{\theta}_n\) is consistent. In the above example, since \(\overline{X}_n \sim \mathcal{N}\left(\theta, \dfrac{1}{n}\right)\), it is unbiased with variance \(\dfrac{1}{n}\), so \(\operatorname{MSE}(\overline{X}_n) = \dfrac{1}{n} \to 0\), and therefore \(\overline{X}_n\) is a consistent estimator of \(\theta\).
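
As an illustration of the two conditions, the sketch below (assuming NumPy; the sample sizes and \(\theta_0 = 2\) are arbitrary choices) estimates the squared bias and the variance of \(\overline{X}_n\) by simulation and confirms that both, and hence the MSE, shrink toward 0 like \(1/n\).

```python
import numpy as np

# Estimate Bias(theta_hat_n)^2 and Var(theta_hat_n) for theta_hat_n = Xbar_n,
# X_i ~ N(theta0, 1); the bias is 0 and the variance is 1/n.
rng = np.random.default_rng(2)
theta0, reps = 2.0, 20000

for n in [10, 100, 1000]:
    xbar = rng.normal(theta0, 1.0, size=(reps, n)).mean(axis=1)
    bias_sq = (xbar.mean() - theta0) ** 2
    var = xbar.var()
    print(f"n = {n:5d}   bias^2 ≈ {bias_sq:.2e}   var ≈ {var:.2e}   MSE ≈ {bias_sq + var:.2e}   (1/n = {1/n:.2e})")
```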

Caveat: Consistency \(\not\Rightarrow\) MSE \(\to 0\)

Suppose \(X_1, X_2, \ldots\) are IID with mean \(\mu\), and we are trying to estimate \(\theta = \mu\). Let:

\[\hat{\theta}_n = \begin{cases} \overline{X}_n & \text{w.p. } 1 - \tfrac{1}{n} \\ n & \text{w.p. } \tfrac{1}{n} \end{cases}\]

With probability \(1 - \frac{1}{n}\), \(\hat{\theta}_n = \overline{X}_n\); since this probability tends to 1 and \(\overline{X}_n \xrightarrow{P} \mu\), we have \(\hat{\theta}_n \xrightarrow{P} \theta = \mu\).

But with probability \(\frac{1}{n}\), \(\hat{\theta}_n = n\), and the MSE satisfies:

\[\operatorname{MSE} = E\!\left[(\hat{\theta}_n - \theta)^2\right]\]

\[= E\!\left[(\overline{X}_n - \theta)^2 \cdot \mathbf{1}(\hat{\theta}_n = \overline{X}_n)\right] + E\!\left[(n - \theta)^2 \cdot \mathbf{1}(\hat{\theta}_n = n)\right]\]

\[\geq (n - \theta)^2 \, P(\hat{\theta}_n = n) = \frac{(n-\theta)^2}{n} \to \infty \text{ as } n \to \infty\]

Therefore, consistency does not imply that \(\operatorname{MSE} \to 0\).
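
A small simulation (assuming NumPy; \(\mu = 0\), \(\varepsilon = 0.5\), and the \(N(\mu, 1)\) model are illustrative choices) makes the caveat visible: the fraction of replications with \(|\hat{\theta}_n - \mu| < \varepsilon\) approaches 1, while the empirical MSE grows without bound.

```python
import numpy as np

# The contaminated estimator: theta_hat_n = Xbar_n w.p. 1 - 1/n, and n w.p. 1/n.
rng = np.random.default_rng(3)
mu, eps, reps = 0.0, 0.5, 20000

for n in [10, 100, 1000, 10000]:
    xbar = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
    contaminate = rng.random(reps) < 1.0 / n         # event occurring with probability 1/n
    theta_hat = np.where(contaminate, n, xbar)
    close = np.mean(np.abs(theta_hat - mu) < eps)    # tends to 1 (consistency)
    mse = np.mean((theta_hat - mu) ** 2)             # blows up like (n - mu)^2 / n
    print(f"n = {n:6d}   P(|theta_hat - mu| < {eps}) ≈ {close:.3f}   MSE ≈ {mse:.1f}")
```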

Consistency of the MLE of \(\theta\)

As we might guess, the maximum likelihood estimator from an IID sample with PDF \(f(x \mid \theta)\) is consistent for \(\theta\) (under appropriate smoothness conditions on \(f\), meaning that we can take as many continuous derivatives of \(f\) as we need; it won’t have any sharp corners, etc.).

In the following, we will assume \(f\) is smooth. This will ensure that it is “well-behaved” and that we can interchange the order of integration and differentiation.

In our discussion below, \(\theta_0\) is the true unknown value of the parameter \(\theta\), and \(f(x \mid \theta_0)\) is the true PDF (\(\theta_0\) is some fixed, but unknown value).

Recall that \(X_1, \dots, X_n \overset{\text{IID}}{\sim} f(x \mid \theta)\) is a random sample, and \[ \operatorname{lik}(\theta) = \operatorname{lik}(\theta \mid X_1, \dots, X_n) = \prod_{i=1}^n f(X_i \mid \theta) \]

Define \(\ell(\theta) = \log \operatorname{lik}(\theta)\). Then, \[ \frac{1}{n} \ell(\theta) =\frac{1}{n} \log \operatorname{lik}(\theta) = \frac{1}{n} \sum_{i=1}^n \log f(X_i\mid \theta). \]
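
In code, \(\frac{1}{n}\ell(\theta)\) is just the average of the per-observation log densities. A minimal sketch (assuming SciPy and, purely for illustration, the \(N(\theta, 1)\) model from Example 2):

```python
import numpy as np
from scipy.stats import norm

def avg_loglik(theta, x):
    """(1/n) * l(theta) for the N(theta, 1) model: the mean of log f(x_i | theta)."""
    return np.mean(norm.logpdf(x, loc=theta, scale=1.0))

# Example: a small simulated sample with true theta0 = 1.
rng = np.random.default_rng(4)
x = rng.normal(1.0, 1.0, size=50)
print(avg_loglik(0.5, x), avg_loglik(1.0, x))
```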

For ease of writing, we will write \(\ell'(\theta) = \dfrac{\partial \ell}{\partial \theta}\) and \(\ell''(\theta) = \dfrac{\partial^2 \ell}{\partial \theta^2}\) where possible.

Theorem 1 Let \(\widehat{\theta}_n\) be the MLE of \(\theta_0\). Then \(\hat \theta_n\) is consistent for \(\theta_0\). That is, \(\hat \theta_n \overset{P}{\rightarrow}\theta_0\). This means that, for any \(\epsilon > 0\), \[ \lim_{n \rightarrow \infty} P(\lvert \hat \theta_n - \theta_0\rvert > \epsilon) = 0. \]

Proof:

By definition, \[ \frac{1}{n} \ell(\theta) = \frac{1}{n} \sum_{i=1}^n \log f(X_i \mid \theta). \]

The RHS is the sample mean of the IID random variables \[ Y_i = \log f(X_i \mid \theta). \]

By the WLLN, \[ \frac{1}{n} \sum_{i=1}^n Y_i \xrightarrow{P} E[Y]. \] Now, since \(Y \equiv Y(\theta)= \log f(X\mid \theta)\), we have

\[ E[Y] = E[\log f(X \mid \theta)] = \int \log f(x \mid \theta) \cdot f(x \mid \theta_0)\, dx \] where \(f(x \mid \theta_0)\) is the true PDF from which the sample \(X_1, \dots, X_n\) was generated.

By the WLLN, for large \(n\), with high probability, \(\dfrac{1}{n} \sum_{i=1}^n Y_i\) and \(E[Y]\) will be close, which means that \(\dfrac{1}{n} \ell(\theta)\) and \(E[Y]\) will be close.

The maximizer of \(\ell(\theta)\) is \(\hat{\theta}_n\), and by the above, this should be close to the maximizer of \(\displaystyle \int \log f(x \mid \theta) \cdot f(x \mid \theta_0)\, dx\).
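
Here is a numerical illustration of that claim for the \(N(\theta, 1)\) model (an assumption made only for this sketch; for that model one can check by expanding the square that \(E_{\theta_0}[\log f(X \mid \theta)] = -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}\bigl(1 + (\theta - \theta_0)^2\bigr)\)). The maximizer of the sample curve \(\frac{1}{n}\ell(\theta)\) sits close to the maximizer \(\theta_0\) of the population curve.

```python
import numpy as np
from scipy.stats import norm

# Compare the sample average log-likelihood (1/n) l(theta) with its population
# analogue E_{theta0}[log f(X | theta)] for the N(theta, 1) model.
rng = np.random.default_rng(5)
theta0, n = 1.0, 500
x = rng.normal(theta0, 1.0, size=n)

grid = np.linspace(-1.0, 3.0, 801)
sample_curve = np.array([np.mean(norm.logpdf(x, loc=t, scale=1.0)) for t in grid])
population_curve = -0.5 * np.log(2 * np.pi) - 0.5 * (1 + (grid - theta0) ** 2)

print("argmax of (1/n) l(theta):      ", grid[np.argmax(sample_curve)])       # near Xbar_n
print("argmax of E[log f(X | theta)]: ", grid[np.argmax(population_curve)])   # exactly theta0
```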

Let’s maximize \(E[Y]\) (note that \(E \equiv E_{\theta_0}\)): \[ \begin{align*} E[Y] &= E[\log f(X \mid \theta)] \\ &= \int \log f(x \mid \theta) \cdot f(x \mid \theta_0)\, dx \\ \end{align*} \] Taking the derivative with respect to \(\theta\) on both sides (and using the fact that \(f\) is smooth to justify changing the order of operations) we get: \[ \begin{align*} \frac{\partial}{\partial \theta} E[\log f(X \mid \theta)] &= \frac{\partial}{\partial \theta} \int \log f(x \mid \theta) \cdot f(x \mid \theta_0)\, dx\\ &= \int \frac{\partial}{\partial \theta}\log f(x \mid \theta)\cdot f(x \mid \theta_0)\, dx\\ &= \int \frac{\frac{\partial}{\partial \theta} f(x \mid \theta)} {f(x \mid \theta)} \cdot f(x \mid \theta_0)\, dx \end{align*} \]

Now we will show that \(\theta_0\) is a stationary point of \(E[Y] = E[\log f(X \mid \theta)]\) by plugging \(\theta = \theta_0\) into the right-hand side above:

\[ \int \frac{\frac{\partial}{\partial \theta}f(x \mid \theta_0) } {f(x \mid \theta_0)} \cdot f(x \mid \theta_0)\, dx = \int\frac{\partial f(x \mid \theta_0)}{\partial \theta} \, dx = \frac{\partial}{\partial \theta} \int f(x \mid \theta_0)\, dx = 0. \]

We get \(0\) at the end because \(\int f(x \mid \theta)\, dx = 1\) for every \(\theta\), since \(f(x \mid \theta)\) is a density function; the integral does not depend on \(\theta\), so its derivative with respect to \(\theta\) is \(0\). Therefore, we see that \(\theta_0\) is a stationary point of \(E[\log f(X \mid \theta)]\).

We can show that \(\theta_0\) is a max by showing that \(E[\log f(X \mid \theta)] - E[\log f(X \mid \theta_0)] \le 0\) for all \(\theta\).

\[ \begin{align*} E[\log f(X \mid \theta)] - E[\log f(X \mid \theta_0)] &= \int \log f(x \mid \theta) \cdot f(x \mid \theta_0) \, dx - \int \log f(x \mid \theta_0) \cdot f(x \mid \theta_0) \, dx\\ &= \int \left[ \log f(x \mid \theta) - \log f(x \mid \theta_0) \right] \cdot f(x \mid \theta_0) \, dx\\ &= \int \log \left[ \frac{f(x \mid \theta)}{f(x \mid \theta_0)} \right] \cdot f(x \mid \theta_0) \, dx \\ \end{align*} \]

Note that \(\log t \leq t - 1\) for all \(t > 0\). This implies that (with \(t = \displaystyle \frac{f(x \mid \theta)}{f(x \mid \theta_0)}\))

\[ \begin{align*} E[\log f(X \mid \theta)] - E[\log f(X \mid \theta_0)] &\leq \int \left[\frac{f(x \mid \theta)}{f(x \mid \theta_0)} - 1 \right] \cdot f(x \mid \theta_0) \, dx\\ &= \int f(x \mid \theta) \, dx \;-\; \int f(x \mid \theta_0) \, dx \\ &= 1 - 1 = 0 \end{align*} \]

Therefore, \[ E[\log f(X \mid \theta)] - E[\log f(X \mid \theta_0)] \leq 0 \quad \text{for all } \theta, \]

which means that \(\theta_0\) maximizes \(E_{\theta_0}[\log f(X \mid \theta)]\).
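
As a concrete check of this conclusion, consider the \(N(\theta, 1)\) model from Example 2 (a worked special case, not part of the general proof). Using \(E_{\theta_0}[(X - \theta)^2] = 1 + (\theta - \theta_0)^2\),

\[ E_{\theta_0}[\log f(X \mid \theta)] - E_{\theta_0}[\log f(X \mid \theta_0)] = -\tfrac{1}{2}\, E_{\theta_0}\!\left[(X - \theta)^2 - (X - \theta_0)^2\right] = -\tfrac{1}{2}(\theta - \theta_0)^2 \;\leq\; 0, \]

with equality only at \(\theta = \theta_0\), so \(\theta_0\) is indeed the maximizer.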

So, by the Weak Law of Large Numbers:

\[\frac{1}{n} \sum_{i=1}^{n} \log f(X_i \mid \theta) \xrightarrow{P} E_{\theta_0} [ \log f(X \mid \theta)]\]

  • The left side is \(\dfrac{1}{n}\ell(\theta)\), maximized by \(\hat{\theta}_n\) (the MLE).
  • The right side is \(E[Y(\theta)]\), maximized by \(\theta_0\) (the true parameter).

Summary of the Consistency Argument

  1. \(\hat{\theta}_n\) maximizes \(\displaystyle \sum_{i=1}^n Y_i = \ell(\theta)\).
  2. \(\theta_0\) maximizes \(E[Y] \equiv E[Y(\theta)]= E\left[\log f(X\mid \theta)\right]\)
  3. By the WLLN, \(\displaystyle \frac{1}{n}\sum_{i=1}^n Y_i \xrightarrow{P} E[Y]\) for every \(\theta\), and therefore the maximizer \(\hat{\theta}_n\) of the left-hand side converges to the maximizer \(\theta_0\) of \(E[Y]\). (This is a result of the continuous mapping theorem 1.)

Therefore \[ \left(\hat{\theta}_n\right)_{MLE} \xrightarrow{P} \theta_0 \]
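
To see the theorem in action, here is a small simulation sketch (assuming NumPy; the \(\text{Exponential}(\lambda)\) model with \(\lambda_0 = 2\) is an illustrative choice, and its MLE is \(\hat{\lambda}_n = 1/\overline{X}_n\)): the MLE concentrates around \(\lambda_0\) as \(n\) grows.

```python
import numpy as np

# Consistency of the MLE, illustrated for X_i ~ Exponential(rate = lambda0),
# whose MLE is lambda_hat_n = 1 / Xbar_n.
rng = np.random.default_rng(6)
lambda0, eps, reps = 2.0, 0.1, 10000

for n in [10, 100, 1000, 10000]:
    x = rng.exponential(scale=1.0 / lambda0, size=(reps, n))
    mle = 1.0 / x.mean(axis=1)
    close = np.mean(np.abs(mle - lambda0) < eps)     # tends to 1 as n grows
    print(f"n = {n:6d}   P(|MLE - lambda0| < {eps}) ≈ {close:.3f}")
```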

[Rice (2006); Pimentel (2024); Chihara and Hesterberg (2018); Hogg, McKean, and Craig (2005); Casella and Berger (2002)]

References

Casella, George, and Roger L. Berger. 2002. Statistical Inference. 2nd ed. Pacific Grove, CA: Duxbury.
Chihara, Laura M., and Tim C. Hesterberg. 2018. Mathematical Statistics with Resampling and R. Hoboken, NJ: John Wiley & Sons.
Hogg, Robert V., Joseph W. McKean, and Allen T. Craig. 2005. Introduction to Mathematical Statistics. 6th ed. Upper Saddle River, NJ: Pearson Prentice Hall.
Pimentel, Sam. 2024. “STAT 135 Lecture Slides.” Lecture slides (shared privately).
Rice, John A. 2006. Mathematical Statistics and Data Analysis. 3rd ed. Duxbury Press.

Footnotes

  1. Continuous Mapping Theorem: If we have a sequence of random variables \(\{X_n\}\) such that \(X_n \xrightarrow{P} X\) and \(g\) is a continuous function, then \(g(X_n) \xrightarrow{P} g(X)\). ↩︎