Proof of Asymptotic Normality, and Confidence Intervals
Introduction
In the last lecture, we stated that the MLE is asymptotically normal, with the asymptotic variance of the MLE being the reciprocal of the Fisher information of the sample. We spent some time defining and studying the Fisher information. Now we provide a sketch of the proof of the theorem (we won’t fill in all the details).
Asymptotic Normality of the MLE
Theorem 1 Let \(X_1, X_2, \ldots, X_n \sim f(x|\theta_0)\) be an IID random sample, where \(\theta_0\) is the true value of the parameter, and let \(I_n(\theta_0) = nI(\theta_0)\) be the Fisher information of the sample.
Then, under appropriate smoothness conditions on \(f\), we have that \(\hat{\theta}_n\), the maximum likelihood estimator for \(\theta_0\), has an asymptotically normal distribution. That is, the distribution of the standardized MLE converges to the standard normal distribution:
\[ \sqrt{nI(\theta_0)}\,\bigl(\hat{\theta}_n - \theta_0\bigr) \xrightarrow{D} N(0,1) \]
This implies that for large \(n\):
\[ \hat{\theta}_n \approx N\!\left(\theta_0,\; \frac{1}{nI(\theta_0)}\right) \]
Proof. \(\hat{\theta}_n\) maximizes the log-likelihood function:
\[ \ell(\theta) = \sum_{i=1}^{n} \log f(X_i|\theta) \implies \ell'(\hat{\theta}_n) = 0 \]
We expand \(\ell'\) in a Taylor series about \(\theta_0\) and evaluate it at \(\hat{\theta}_n\):
\[ 0 = \ell'(\hat{\theta}_n) \approx \ell'(\theta_0) + \ell''(\theta_0)\cdot(\hat{\theta}_n - \theta_0) \]
\[ \implies \ell'(\theta_0) + \ell''(\theta_0)(\hat{\theta}_n - \theta_0) \approx 0 \]
\[ \implies \hat{\theta}_n - \theta_0 \approx -\frac{\ell'(\theta_0)}{\ell''(\theta_0)} \]
\[ \implies \sqrt{n}\,(\hat{\theta}_n - \theta_0) \approx -\frac{\ell'(\theta_0)/\sqrt{n}}{\ell''(\theta_0)/n} \tag{1}\]
We will consider the numerator and denominator separately.
Numerator: \(\ell'(\theta_0)/\sqrt{n}\)
\[ E\!\left(\frac{\ell'(\theta_0)}{\sqrt{n}}\right) = \frac{1}{\sqrt{n}}\,E\bigl(\ell'(\theta_0)\bigr) = 0, \] since the expected value of the score is zero. For the variance,
\[ \operatorname{Var}\!\left(\frac{\ell'(\theta_0)}{\sqrt{n}}\right) = \frac{1}{n}\operatorname{Var}\bigl(\ell'(\theta_0)\bigr) = \frac{1}{n}\,E\!\left[(\ell'(\theta_0))^2\right] = \frac{1}{n}\cdot \underbrace{I_n(\theta_0)}_{\text{FI of sample}} = \frac{1}{n}\cdot nI(\theta_0) = I(\theta_0). \]
Consequently, since \(\ell'(\theta) = \sum_{i=1}^{n} \dfrac{\partial\log f(X_i|\theta)}{\partial\theta}\), which is a sum of IID random variables, we can apply the CLT to this sum, and we get that \[ \dfrac{\ell'(\theta_0)/\sqrt{n}}{\sqrt{I(\theta_0)}} = \dfrac{\ell'(\theta_0)-0}{\sqrt{nI(\theta_0)}} \overset{D}{\longrightarrow} \mathcal{N}(0,1) \tag{2}\]
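To make Equation 2 concrete, here is a minimal simulation sketch (the parameter value, sample size, and number of replicates are arbitrary choices for illustration) using the Exponential(\(\lambda_0\)) model from the second example later in these notes: the standardized score should look approximately standard normal for large \(n\).

```python
import numpy as np

# Sketch: for X_1, ..., X_n ~ Exp(lambda0), the score is
# l'(lambda0) = n/lambda0 - sum(x_i) and I(lambda0) = 1/lambda0^2,
# so l'(lambda0) / sqrt(n I(lambda0)) should be approximately N(0, 1).
rng = np.random.default_rng(0)
lambda0, n, reps = 2.0, 400, 10_000          # illustrative values

x = rng.exponential(scale=1 / lambda0, size=(reps, n))
score = n / lambda0 - x.sum(axis=1)          # l'(lambda0) for each replicate
z = score / np.sqrt(n / lambda0**2)          # divide by sqrt(n I(lambda0))

print(z.mean(), z.std())                     # expect values close to 0 and 1
```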
Denominator: \(\ell''(\theta_0)/n\)
\[ \frac{1}{n}\,\ell''(\theta_0) = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial^2}{\partial\theta^2}\log f(X_i|\theta_0) \]
This is the sample mean of the IID random variables \(\dfrac{\partial^2}{\partial\theta^2}\log f(X_i|\theta_0)\), and therefore by the WLLN this converges to:
\[ \begin{align*} E\!\left(\frac{\partial^2}{\partial\theta^2}\log f(X|\theta_0)\right) &= -I(\theta_0)\\ \implies \frac{1}{n}\,\ell''(\theta_0) &\xrightarrow{P} -I(\theta_0) \end{align*} \] Since convergence in probability implies convergence in distribution, we see that \[ \frac{1}{n}\,\ell''(\theta_0) \xrightarrow{D} -I(\theta_0) \tag{3}\]
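Equation 3 can be checked numerically as well; the sketch below (again with arbitrary illustrative values) uses the Bernoulli(\(p_0\)) model worked out in the first example, where \(\ell''(p_0)/n = -\bar{x}/p_0^2 - (1-\bar{x})/(1-p_0)^2\) should be close to \(-I(p_0) = -1/(p_0(1-p_0))\) for large \(n\).

```python
import numpy as np

# Sketch: for X_1, ..., X_n ~ Bern(p0), the averaged second derivative
# l''(p0)/n = -xbar/p0^2 - (1 - xbar)/(1 - p0)^2 should be near -I(p0).
rng = np.random.default_rng(0)
p0, n = 0.3, 100_000                          # illustrative values

xbar = rng.binomial(1, p0, size=n).mean()
second_deriv_over_n = -xbar / p0**2 - (1 - xbar) / (1 - p0) ** 2

print(second_deriv_over_n, -1 / (p0 * (1 - p0)))  # the two should nearly agree
```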
By Slutsky’s theorem¹, we can replace the denominator in Equation 1 by the constant that it converges to.
Putting it all together
Combining the results from Equation 2 and Equation 3, and plugging these into Equation 1, we see that for large \(n\) we have:
\[ \begin{align*} \sqrt{n}\,(\hat{\theta}_n - \theta_0) &\approx -\frac{\ell'(\theta_0)/\sqrt{n}}{\ell''(\theta_0)/n} \\ &\approx \frac{\ell'(\theta_0)/\sqrt{n}}{I(\theta_0)}\\ \Rightarrow \sqrt{nI(\theta_0)}\,(\hat{\theta}_n - \theta_0) &\approx \frac{\ell'(\theta_0)}{\sqrt{nI(\theta_0)}} \overset{D}\longrightarrow \mathcal{N}(0,1) \\ \end{align*} \] \(\blacksquare\)
Examples
Bernoulli\((p)\) distribution
\(X_1, X_2, \ldots, X_n \overset{\text{IID}}{\sim} \text{Bern}(p)\)
\[ \text{lik}(p) = p^{\sum x_i}(1-p)^{n-\sum x_i} = p^{n\bar{x}}(1-p)^{n-n\bar{x}} \quad \text{(let } \textstyle\sum x_i = n\bar{x}\text{)} \] \[ \implies \ell(p) = \log\text{lik}(p) = n\bar{x}\log p + (n - n\bar{x})\log(1-p) \]
\[ \implies \ell'(p) = \frac{n\bar{x}}{p} + \frac{n - n\bar{x}}{1-p} \cdot(-1) = \frac{n\bar{x}}{p} - \frac{n-n\bar{x}}{1-p} \] Setting \(\ell'(p) = 0\) and solving for the maximum likelihood estimate, we see that \(\hat{p} = \bar{x}\); the corresponding estimator is \(\overline{X}\).
Notice that the second derivative of \(\ell(p)\) is negative, confirming that \(\hat{p}\) is indeed a maximum: \[ \ell''(p) = -\frac{n\bar{x}}{p^2} - \frac{n - n\bar{x}}{(1-p)^2} < 0 \]
In terms of the random sample \(X_1, \ldots, X_n\):
\[ \ell''(p) = -\frac{n\overline{X}}{p^2} - \frac{n - n\overline{X}}{(1-p)^2} \] Taking expectations to compute the Fisher information of the sample, we get (recall that \(E(\overline{X}) = p\)): \[ \begin{align*} -E(\ell''(p)) &= \frac{n\,E(\overline{X})}{p^2} + \frac{n\,E(1-\overline{X})}{(1-p)^2}\\ &= \frac{np}{p^2} + \frac{n(1-p)}{(1-p)^2} \\ &= \frac{n}{p(1-p)} \end{align*} \]
\[ \implies I_n(p_0) = \frac{n}{p_0(1-p_0)} \quad \text{and} \quad I(p_0) = \frac{1}{p_0(1-p_0)}, \] since \(I_n(p_0) = nI(p_0)\).
By the asymptotic Normality of the MLE,
\[ \begin{align*} \sqrt{nI(p_0)}\,(\hat{p}_n - p_0) &\overset{D}\longrightarrow \mathcal{N}(0,1)\\ \implies \frac{\sqrt{n}}{\sqrt{p_0(1-p_0)}}(\hat{p}_n - p_0) &\overset{D}\longrightarrow \mathcal{N}(0,1)\\ \end{align*} \]
\[ \implies \hat{p}_n \approx N\!\left(p_0,\; \frac{p_0(1-p_0)}{n}\right) \]
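As a quick check of this approximation, here is a minimal simulation sketch (the values of \(p_0\), \(n\), and the number of replicates are arbitrary): the empirical mean and variance of \(\hat{p}_n = \overline{X}\) across many simulated samples should be close to \(p_0\) and \(p_0(1-p_0)/n\).

```python
import numpy as np

# Sketch: simulate many Bern(p0) samples of size n, compute the MLE
# p_hat = xbar for each, and compare its mean/variance to the
# asymptotic approximation N(p0, p0 (1 - p0) / n).
rng = np.random.default_rng(0)
p0, n, reps = 0.3, 500, 10_000                # illustrative values

p_hat = rng.binomial(1, p0, size=(reps, n)).mean(axis=1)

print(p_hat.mean(), p0)                       # centre: should be close to p0
print(p_hat.var(), p0 * (1 - p0) / n)         # spread: close to p0(1-p0)/n
```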
Exponential\((\lambda)\) distribution
\(X_1, \ldots, X_n \overset{\text{IID}}{\sim} \text{Exp}(\lambda)\)
\[ \implies f(x|\lambda) = \lambda e^{-\lambda x}, \quad x > 0 \]
Using just a single observation \(X \sim f(x|\lambda)\):
\[ \text{lik}(\lambda|x) = \lambda e^{-\lambda x} \implies \ell(\lambda) = \log\lambda - \lambda x \]
\[ \ell'(\lambda) = \frac{1}{\lambda} - x \implies \hat{\lambda} = \frac{1}{x} \]
\[ \ell''(\lambda) = -\frac{1}{\lambda^2} \implies I(\lambda) = -E(\ell''(\lambda)) = \frac{1}{\lambda^2} \]
Thus the Fisher information for a single observation is \(I(\lambda_0) = \dfrac{1}{\lambda_0^2}\), where \(\lambda_0\) is the true value of the parameter. For the full sample, the same calculation gives the MLE \(\hat{\lambda}_n = 1/\overline{X}\) and \(I_n(\lambda_0) = n/\lambda_0^2\).
Therefore:
\[ \sqrt{nI(\lambda_0)}\,(\hat{\lambda}_n - \lambda_0) = \frac{\sqrt{n}}{\lambda_0}\,(\hat{\lambda}_n - \lambda_0) \overset{D}\longrightarrow \mathcal{N}(0,1) \]
\[ \implies \hat{\lambda}_n \approx N\!\left(\lambda_0,\; \frac{\lambda_0^2}{n}\right) \]
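The same kind of check works here; this sketch (with arbitrary illustrative values) simulates many Exponential(\(\lambda_0\)) samples and compares the mean and variance of \(\hat{\lambda}_n = 1/\overline{X}\) with \(\lambda_0\) and \(\lambda_0^2/n\).

```python
import numpy as np

# Sketch: simulate many Exp(lambda0) samples of size n, compute the MLE
# lambda_hat = 1 / xbar for each, and compare with N(lambda0, lambda0^2 / n).
rng = np.random.default_rng(0)
lambda0, n, reps = 2.0, 500, 10_000           # illustrative values

x = rng.exponential(scale=1 / lambda0, size=(reps, n))
lambda_hat = 1 / x.mean(axis=1)

print(lambda_hat.mean(), lambda0)             # centre: close to lambda0 (small finite-n bias)
print(lambda_hat.var(), lambda0**2 / n)       # spread: close to lambda0^2 / n
```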
Confidence Intervals
We use the asymptotic distribution to construct confidence intervals just like before. Suppose we have an IID random sample \(X_1, \ldots, X_n\) with density \(f(x\mid \theta_0)\).
Let \(\hat\theta_n\) be the MLE of \(\theta_0\), and suppose we want to construct a \((1-\alpha)100\%\) confidence interval for \(\theta_0\). By the asymptotic normality of \(\hat\theta_n\), \[ \begin{align*} \hat\theta_n &\approx \mathcal{N}\left(\theta_0, \frac{1}{nI(\theta_0)}\right) \\ \implies \frac{\hat\theta_n-\theta_0}{1/\sqrt{nI(\theta_0)}} &= \sqrt{nI(\theta_0)}\,(\hat\theta_n-\theta_0) \approx \mathcal{N}(0,1). \end{align*} \] We begin with \(P\!\left(-z_{\alpha/2} \le Z \le z_{\alpha/2}\right) = 1-\alpha\), where \(Z \sim\mathcal{N}(0,1)\), and substitute \(\sqrt{nI(\theta_0)}\,(\hat{\theta}_n - \theta_0)\) for \(Z\):
\[P\!\left(-z_{\alpha/2} \le \sqrt{nI(\theta_0)}\,(\hat{\theta}_n - \theta_0) \le z_{\alpha/2}\right) = 1-\alpha\]
\[\implies P\!\left(\frac{-z_{\alpha/2}}{\sqrt{nI(\theta_0)}} \le (\hat{\theta}_n - \theta_0) \le \frac{z_{\alpha/2}}{\sqrt{nI(\theta_0)}}\right) = 1-\alpha\]
\[\implies P\!\left(\hat{\theta}_n - \frac{z_{\alpha/2}}{\sqrt{nI(\theta_0)}} \le \theta_0 \le \hat{\theta}_n + \frac{z_{\alpha/2}}{\sqrt{nI(\theta_0)}}\right) = 1-\alpha\]
Therefore, the (asymptotic) \((1-\alpha)\cdot 100\%\) confidence interval for \(\theta_0\) is given by
\[ \hat{\theta}_n \pm z_{\alpha/2} \cdot \frac{1}{\sqrt{nI(\theta_0)}} \text{ or } \left(\hat{\theta}_n -z_{\alpha/2} \cdot \frac{1}{\sqrt{nI(\theta_0)}}, \hat{\theta}_n + z_{\alpha/2} \cdot \frac{1}{\sqrt{nI(\theta_0)}} \right) \]
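In practice \(\theta_0\) is unknown, so the Fisher information in the interval is usually evaluated at the MLE, i.e. \(I(\hat{\theta}_n)\) replaces \(I(\theta_0)\). The sketch below assumes this plug-in and uses the Bernoulli example from above, with arbitrary illustrative values for \(p_0\), \(n\), and \(\alpha\).

```python
import numpy as np

# Sketch: approximate 95% confidence interval for p0 from Bern(p0) data,
# using p_hat +/- z_{alpha/2} / sqrt(n I(p_hat)) with I(p) = 1/(p(1-p)).
rng = np.random.default_rng(0)
p0, n = 0.3, 500                              # illustrative values
z = 1.96                                      # z_{alpha/2} for alpha = 0.05

x = rng.binomial(1, p0, size=n)
p_hat = x.mean()
half_width = z / np.sqrt(n / (p_hat * (1 - p_hat)))  # z / sqrt(n I(p_hat))

print(p_hat - half_width, p_hat + half_width)  # should cover p0 most of the time
```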
Footnotes
1. Replacing the random variable \(\ell''(\theta_0)/n\) by the constant it converges to uses Slutsky’s Theorem: if \(X_n \xrightarrow{D} X\) and \(Y_n \xrightarrow{P} c\), then \(X_n + Y_n \xrightarrow{D} X+c\), \(X_nY_n \xrightarrow{D} cX\), and, provided \(c \neq 0\), \(X_n/Y_n \xrightarrow{D} X/c\).