Proof of the Cramér-Rao Inequality

Introduction

In the last lecture, we defined the efficiency of one unbiased estimator relative to another unbiased estimator as \[ \mathrm{eff}(\tilde\theta_n,\hat\theta_n) = \frac{\mathrm{Var}(\hat\theta_n)}{\mathrm{Var}(\tilde\theta_n)}. \] If \(\mathrm{eff}(\tilde\theta_n,\hat\theta_n) < 1\), then \(\hat\theta_n\) has a smaller mean squared error than \(\tilde\theta_n\), and we would prefer to use \(\hat\theta_n\) for estimating \(\theta\). The question then arises: how do we know we have the best unbiased estimator, that is, the one with the lowest possible variance? The Cramér-Rao inequality answers this by providing a lower bound on the variance of any estimator. If an estimator achieves the lower bound, then we know that we aren’t going to find anything with a lower variance.

Let’s restate the Cramér-Rao inequality, this time for general estimators that are not necessarily unbiased.

Cramér-Rao Lower Bound

Theorem 1 Let \(X_1, \ldots, X_n \sim f(x \mid \theta)\) be an IID random sample, and let \(T = t(X_1, \ldots, X_n)\) be an estimator of \(\theta\), such that \(\mathbb{E}(T) = \mathbb{E}[t(X_1, \ldots, X_n)] = k(\theta)\). Then for sufficiently smooth \(f\),

\[ \text{Var}(T) \geq \frac{[k'(\theta)]^2}{nI(\theta)}. \]

(Recall that if \(\theta_0\) is the true parameter value, then \(I(\theta_0)\) is the Fisher information in a sample of size 1.)

Further, if \(\mathbb{E}(T) = k(\theta) = \theta\), that is, \(T\) is an unbiased estimator of \(\theta\), then \[ \text{Var}(T) \geq \frac{1}{I_n(\theta)} = \frac{1}{nI(\theta)} \]

Proof. We sketch the proof below in the case of a continuous density function:

\[ \begin{align*} k(\theta) &= \mathbb{E}_\theta(t(X_1, \ldots, X_n)) \\ &= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} t(x_1, \ldots, x_n)\, f(x_1, \ldots, x_n \mid \theta)\, dx_1 \cdots dx_n\\ &= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} t(x_1, \ldots, x_n)\, f(x_1 \mid \theta) \cdots f(x_n \mid \theta)\, dx_1 \cdots dx_n \end{align*} \] Taking the derivative with respect to \(\theta\) and changing the order of differentiation and integration, we get: \[ \begin{align*} k'(\theta) &= \frac{\partial}{\partial \theta}\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} t(x_1, \ldots, x_n)\, f(x_1 \mid \theta) \cdots f(x_n \mid \theta)\, dx_1 \cdots dx_n\\ &=\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} t(x_1, \ldots, x_n) \left[\sum_{i=1}^{n} \frac{\frac{\partial}{\partial\theta} f(x_i \mid \theta)}{f(x_i \mid \theta)}\right] f(x_1 \mid \theta) \cdots f(x_n \mid \theta)\, dx_1 \cdots dx_n \end{align*} \] In the last equality, we differentiated the product \(f(x_1 \mid \theta) \cdots f(x_n \mid \theta)\) term by term with the product rule and then multiplied and divided each term by \(f(x_i \mid \theta)\).

Now let

\[ Y = \sum_{i=1}^{n} \frac{\frac{\partial}{\partial\theta} f(X_i \mid \theta)}{f(X_i \mid \theta)} = \sum_{i=1}^{n} \frac{\partial}{\partial\theta} \log f(X_i \mid \theta), \]

which tells us that \(Y\) is the sum of score functions \(\Rightarrow \mathbb{E}_\theta(Y) = 0\), and \(\text{Var}(Y) = I_n(\theta) = nI(\theta)\).

Back to \(k'(\theta)\):

\[ k'(\theta) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} t(x_1, \ldots, x_n) \cdot y(x_1, \ldots, x_n)\, f(x_1 \mid \theta) \cdots f(x_n \mid \theta)\, dx_1 \cdots dx_n \]

This means that \(k'(\theta) = \mathbb{E}_\theta(TY) = \text{Cov}(T, Y)\) since \(\mathbb{E}_\theta(Y) = 0\).

Let \(\rho_{TY} = \dfrac{\text{Cov}(T,Y)}{\sqrt{\text{Var}(T)}\sqrt{\text{Var}(Y)}}\) denote the correlation of \(T, Y\).

\[ \Rightarrow \rho_{TY} = \frac{\mathbb{E}_\theta(TY)}{\sqrt{\text{Var}(T)}\sqrt{\text{Var}(Y)}} = \frac{k'(\theta)}{\sqrt{\text{Var}(T)} \cdot \sqrt{nI(\theta)}} \]

We know that \(\rho_{TY}^2 \leq 1\), so:

\[ [k'(\theta)]^2 \leq \text{Var}(T) \cdot nI(\theta) \]

\[ \Rightarrow \frac{[k'(\theta)]^2}{nI(\theta)} \leq \text{Var}(T) \qquad \blacksquare \]
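To make the objects in the proof concrete, here is a small simulation sketch (assuming NumPy is available; the parameter values are arbitrary) for the \(\mathcal{N}(\theta, 1)\) case with \(T = \overline{X}_n\): the score sum \(Y\) should have mean near \(0\) and variance near \(nI(\theta) = n\), and \(\mathrm{Cov}(T, Y)\) should be near \(k'(\theta) = 1\).

```python
import numpy as np

# Monte Carlo check of the quantities used in the proof, for X_1, ..., X_n ~ N(theta, 1)
# with T = X-bar, so k(theta) = theta and k'(theta) = 1.
rng = np.random.default_rng(0)
theta, n, reps = 2.0, 10, 200_000

X = rng.normal(theta, 1.0, size=(reps, n))
T = X.mean(axis=1)                       # the estimator T = X-bar
Y = (X - theta).sum(axis=1)              # score sum: d/dtheta log f(x | theta) = x - theta here

print("E(Y)        ≈", Y.mean())                        # close to 0
print("Var(Y)      ≈", Y.var())                         # close to n * I(theta) = n
print("Cov(T, Y)   ≈", np.cov(T, Y)[0, 1])              # close to k'(theta) = 1
print("corr(T,Y)^2 ≈", np.corrcoef(T, Y)[0, 1] ** 2)    # at most 1; equals 1 here since T attains the bound
```

In this particular case \(Y = n(\overline{X}_n - \theta)\), so the correlation is exactly 1, which is why \(\overline{X}_n\) attains the bound (as Example 1 below confirms analytically).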


Remark 1. If \(k(\theta)\) is the identity function, that is \(k(\theta) = \mathbb{E}_\theta(T) = \theta\), we recover the Cramér-Rao Inequality for unbiased estimators.

Definition 1 If \(\hat{\theta}\) is an unbiased estimator of \(\theta_0\) such that \(\text{Var}(\hat{\theta}) = \dfrac{1}{nI(\theta_0)}\), then we say that \(\hat{\theta}\) is an efficient estimator.

Remark 2. If we have an efficient estimator, then the variance is minimized, which means the sampling distribution of \(\hat{\theta}\) is as concentrated about the true value \(\theta_0\) as it can be.

Remark 3. Since the asymptotic variance of the MLE is \((I_n(\theta))^{-1}\), the MLE is asymptotically efficient, but it may not be efficient for small, finite \(n\).

Definition 2 (Hogg, McKean, and Craig (2005)) For a given \(n\), and an IID random sample \(X_1, \ldots, X_n \sim f(x \mid \theta_0)\), we call an unbiased estimator \(\hat{\theta}_n\) for \(\theta_0\) a Minimum Variance Unbiased Estimator (MVUE) of \(\theta_0\) if, for any other unbiased estimator of \(\theta_0\), say \(\tilde \theta_n\), \(\mathrm{Var}(\hat\theta_n) \le \mathrm{Var}(\tilde\theta_n)\).


In the following examples, we will use \(\ell'(\theta)\) to denote \(\dfrac{\partial \ell}{\partial \theta}\) and \(\ell''(\theta)\) to denote \(\dfrac{\partial^2 \ell}{\partial \theta^2}\).

Example 1 (Normal\((\theta,1)\) distribution with unknown mean, known variance:) Suppose \(X_1, \ldots, X_n \overset{IID}{\sim} \mathcal{N}(\theta,\, \sigma^2 = 1)\), so \(f(x\mid \theta) = \displaystyle \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-\theta)^2}\).

  1. \(\mathbb{E}(\overline{X}_n) = \dfrac{1}{n}\sum_{i=1}^n \mathbb{E}(X_i) = \dfrac{1}{n} \cdot n\theta_0 = \theta_0\)

  2. \(\text{Var}(\overline{X}_n) = \dfrac{\sigma^2}{n} = \dfrac{1}{n}\)

  3. \(I(\theta) = -\mathbb{E}\!\left(\dfrac{\partial^2}{\partial\theta^2} \log f(X \mid \theta)\right) = -\mathbb{E}(\ell''(\theta))\)

\[ \begin{align*} \ell(\theta) &= \log f(x \mid \theta) \\ &= -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}(x - \theta)^2 \\ \Rightarrow \ell'(\theta) &= -\tfrac{1}{2} \cdot 2(x-\theta)(-1) = x - \theta\\ \Rightarrow \ell''(\theta) &= -1 \\ \Rightarrow I(\theta_0) &= 1 \end{align*} \] Now, by Cramér-Rao:

\[ \text{Var}(\overline{X}_n) \geq \frac{1}{nI(\theta_0)} = \frac{1}{n} \]

But \(\text{Var}(\overline{X}_n) = \dfrac{1}{n}\)

Therefore, we see that the sample mean is an efficient estimator of the true mean for a normal distribution with known variance.
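As a numerical companion (a sketch assuming NumPy; the comparison with the sample median is an illustration added here, not part of the example above), we can simulate many samples, compare \(\mathrm{Var}(\overline{X}_n)\) with the bound \(1/n\), and contrast it with another unbiased estimator, the sample median, whose variance is roughly \(\pi/(2n)\) for normal data and therefore cannot attain the bound.

```python
import numpy as np

# Variance of the sample mean vs. the CRLB 1/n for N(theta, 1), with the sample
# median (also unbiased here, by symmetry) shown for contrast.
rng = np.random.default_rng(1)
theta, n, reps = 0.5, 25, 100_000

X = rng.normal(theta, 1.0, size=(reps, n))
mean_hat = X.mean(axis=1)
median_hat = np.median(X, axis=1)

print("CRLB 1/(n I(theta)) =", 1 / n)
print("Var(sample mean)    ≈", mean_hat.var())     # ≈ 1/n: attains the bound
print("Var(sample median)  ≈", median_hat.var())   # ≈ pi/(2n) > 1/n: does not attain it
print("eff(median, mean)   ≈", mean_hat.var() / median_hat.var())  # ≈ 2/pi ≈ 0.64
```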


Example 2 (Poisson(\(\theta\)) distribution with unknown rate:) \(X \sim \text{Poisson}(\theta)\), where \(P(X = x) = \dfrac{\theta^x e^{-\theta}}{x!}\), \(\quad x = 0, 1, 2, \ldots\)

Recall that \(\mathbb{E}(X) = \text{Var}(X) = \theta\). What is the MLE of \(\theta\) and is it efficient?


First, let’s find the MLE. Let’s write down \(\mathrm{lik}(\theta)\) first, for an IID sample \(X_1, \ldots, X_n\) from this distribution.

\[ \begin{align*} \mathrm{lik}(\theta) &= P(X_1=x_1, \ldots, X_n=x_n)\\ &=\prod_{i=1}^n P(X_i = x_i)\\ &=\prod_{i=1}^n \dfrac{\theta^{x_i} e^{-\theta}}{x_i!}\\ \Rightarrow \ell_n(\theta) &= \sum_{i=1}^n\log \left(\dfrac{\theta^{x_i} e^{-\theta}}{x_i!}\right)\\ &= \sum_{i=1}^n \left[x_i\log\theta - \log(x_i!) - \theta\right]\\ &= \sum_{i=1}^n x_i\log\theta -\sum_{i=1}^n\log(x_i!)- n\theta\\ \Rightarrow \ell_n'(\theta) &= \sum_{i=1}^n \frac{x_i}{\theta} - n \end{align*} \] Setting \(\ell_n'(\theta) = 0\) and solving gives \(\hat\theta = \dfrac{1}{n}\sum_{i=1}^n x_i\), so the MLE of \(\theta\) is the sample mean \(\overline{X}\). Now let’s compute the Fisher information with \(n=1\).

\[ \begin{align*} \ell(\theta) &= \log f(x \mid \theta)\\ &= x\log\theta - \log(x!) - \theta\\ \Rightarrow \ell'(\theta) &= \frac{x}{\theta} - 1\\ \Rightarrow \ell''(\theta) &= -\frac{x}{\theta^2} \end{align*} \] Therefore, we get that \[ \mathbb{E}\!\left[-\frac{\partial^2 \ell}{\partial \theta^2}\right] = \frac{1}{\theta^2}\mathbb{E}(X) = \frac{\theta}{\theta^2} = \frac{1}{\theta} \]

\[ \Rightarrow I(\theta) = \frac{1}{\theta} \quad \Rightarrow \quad I_n(\theta) = \frac{n}{\theta} \]

\(\mathbb{E}(\overline{X}) = \theta\) and \(\text{Var}(\overline{X}) = \dfrac{\theta}{n}\).

By Cramér-Rao:

\[ \text{Var}(\overline{X}) \geq \frac{1}{I_n(\theta)} = \frac{1}{n/\theta} = \frac{\theta}{n} = \text{Var}(\overline{X}) \]

\(\Rightarrow \overline{X}\) is efficient.
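The same conclusion can be checked by simulation (a sketch assuming NumPy; the values of \(\theta\) and \(n\) below are arbitrary): across many samples of size \(n\), the variance of \(\overline{X}\) should be close to \(\theta/n\), which is exactly the Cramér-Rao bound \(1/I_n(\theta)\).

```python
import numpy as np

# Poisson(theta): check that Var(X-bar) matches the CRLB theta/n.
rng = np.random.default_rng(2)
theta, n, reps = 3.0, 20, 100_000

X = rng.poisson(theta, size=(reps, n))
mle = X.mean(axis=1)                     # the MLE is the sample mean

print("E(MLE)   ≈", mle.mean())          # ≈ theta (unbiased)
print("Var(MLE) ≈", mle.var())           # ≈ theta / n
print("CRLB     =", theta / n)           # 1 / I_n(theta)
```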


Example 3 (Beta(\(\theta,1\)) distribution, with unknown parameter \(\theta\)) Hogg, McKean, and Craig (2005)

Let \(X_1, \ldots, X_n \sim f(x \mid \theta)\), \(\theta > 0\), where

\[ f(x \mid \theta) = \begin{cases} \theta\, x^{\theta - 1}, & 0 < x < 1 \\ 0, & \text{otherwise} \end{cases} \]

\[ \ell(\theta) = \log f(X \mid \theta) = \log\theta + (\theta - 1)\log X \]

\[ \ell'(\theta) = \frac{1}{\theta} + \log X \]

\[ \ell''(\theta) = -\frac{1}{\theta^2} \quad \Rightarrow \quad I(\theta) = -\mathbb{E}_\theta (\ell''(\theta)) = \frac{1}{\theta^2} \]

MLE of \(\theta\):

\[ \begin{align*} \ell_n(\theta) &= \sum_{i=1}^n \log f(x_i \mid \theta)\\ &= \sum_{i=1}^n \left[\log\theta + (\theta-1)\log x_i\right]\\ &= n\log\theta + (\theta - 1)\sum_{i=1}^n \log x_i. \end{align*} \]

\[ \ell_n'(\theta) = \frac{n}{\theta} + \sum_{i=1}^n \log(x_i) \]

Setting \(\ell_n'(\theta) = 0\) and solving for \(\hat{\theta}_{\text{MLE}}\equiv \hat\theta_n\), we get that the maximum likelihood estimator of \(\theta\) is:

\[ \hat{\theta}_n = \frac{-n}{\sum_{i=1}^n \log(X_i)} \tag{1}\]
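Before working with the exact distribution of \(\hat\theta_n\), here is a quick sanity check (a sketch assuming NumPy and SciPy; the sample is simulated): maximizing the log-likelihood numerically for one sample should reproduce the closed form in Equation 1.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Numerical check of the closed-form MLE for f(x | theta) = theta * x^(theta - 1), 0 < x < 1.
rng = np.random.default_rng(3)
theta_true, n = 2.5, 50
X = rng.beta(theta_true, 1.0, size=n)    # Beta(theta, 1) sample

def neg_loglik(theta):
    # negative log-likelihood: -(n log theta + (theta - 1) * sum log x_i)
    return -(n * np.log(theta) + (theta - 1) * np.log(X).sum())

closed_form = -n / np.log(X).sum()                                    # Equation 1
numerical = minimize_scalar(neg_loglik, bounds=(1e-6, 50), method="bounded").x

print("closed-form MLE ≈", closed_form)
print("numerical MLE   ≈", numerical)    # should agree closely with the closed form
```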

Finding the density of \(\hat{\theta}_n\):

Let \(Y_i = -\log X_i\); since \(0 < X_i < 1\), we have \(Y_i > 0\). For \(y > 0\), let’s first figure out the CDF of \(Y_i\).

You can try it yourself first, and then check your answer.
What is the distribution of \(Y_i\)?

\[ \begin{align*} F_{Y_i}(y) &= P(Y_i \leq y)\\ &= P(-\log X_i \leq y)\\ &= P(\log X_i \geq -y)\\ &= P(X_i \geq e^{-y}) \\ &= 1 - P(X_i < e^{-y}) \end{align*} \] Since \(Y_i\) is a continuous random variable, this means that \(F_{Y_i}(y) = 1- F_{X_i}(e^{-y})\). Thus:

\[ \begin{align*} f_{Y_i}(y) &= \frac{d}{dy} F_{Y_i}(y) \\ &= -f_{X_i}(e^{-y})(-e^{-y}) \\ &= \left[\theta(e^{-y})^{\theta-1}\right]\left[e^{-y}\right] \\ &= \theta e^{-\theta y}\cdot e^y\cdot e^{-y}\\ &= \theta e^{-\theta y}, \quad y > 0,\; \theta > 0 \end{align*} \] Therefore, \(Y_i \sim \text{Exp}(\theta)\) which is the same as \(\text{Gamma}(1, \theta)\).

\[ \Rightarrow Y_1 + Y_2 + \cdots + Y_n \sim \text{Gamma}(n, \theta) \]
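This change of variables can also be spot-checked numerically (a sketch assuming NumPy; the values of \(\theta\) and \(n\) are arbitrary): the sample moments of \(Y_i = -\log X_i\) and of the sum should match those of \(\text{Exp}(\theta)\) and \(\text{Gamma}(n, \theta)\) in the rate parameterization.

```python
import numpy as np

# Check that Y = -log X is Exp(theta) when X ~ Beta(theta, 1), and that the sum of n
# such Y's has the moments of Gamma(n, theta) (rate parameterization).
rng = np.random.default_rng(4)
theta, n, reps = 2.0, 15, 200_000

X = rng.beta(theta, 1.0, size=(reps, n))
Y = -np.log(X)
S = Y.sum(axis=1)

print("E(Y)   ≈", Y.mean(), " vs 1/theta   =", 1 / theta)
print("Var(Y) ≈", Y.var(),  " vs 1/theta^2 =", 1 / theta**2)
print("E(S)   ≈", S.mean(), " vs n/theta   =", n / theta)
print("Var(S) ≈", S.var(),  " vs n/theta^2 =", n / theta**2)
```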

Define \(S = \displaystyle \sum_{i=1}^n Y_i\), so \(S\sim \text{Gamma}(n, \theta)\). Then, by Equation 1, \(\hat\theta_n = \dfrac{n}{S}\). We need the expectation and variance of the MLE, so we can use the distribution of \(S\) for this.

We claim that
\(\mathbb{E}\left(\dfrac{1}{S}\right) = \dfrac{\theta}{n-1};\) let’s derive this from the moments of the Gamma distribution.

Let \(T\) have a \(\text{Gamma}(\alpha, \lambda)\) distribution, so \(f_T(t) = \dfrac{\lambda^\alpha}{\Gamma(\alpha)}\, t^{\alpha-1} e^{-\lambda t}, \quad t > 0.\)

We can compute the moments of the distribution: \[ \begin{align*} \mathbb{E}(T^k) &= \int_0^\infty t^k \cdot \frac{\lambda^\alpha}{\Gamma(\alpha)}\, t^{\alpha-1} e^{-\lambda t}\, dt \\ &= \frac{\lambda^\alpha}{\Gamma(\alpha)} \int_0^\infty t^{\alpha+k-1} e^{-\lambda t}\, dt. \end{align*} \] Now we can do a neat trick here by observing that the integral almost looks like the Gamma function. Recall that \(\Gamma(\alpha+k) = \int_0^\infty u^{\alpha+k-1} e^{-u}\, du\), so we need a factor of \((\lambda t)^{\alpha+k-1}\) in the integrand. Multiplying and dividing by \(\lambda^{\alpha+k-1}\), we get: \[ \mathbb{E}(T^k) = \frac{\lambda^\alpha}{\Gamma(\alpha)} \cdot \frac{1}{\lambda^{\alpha+k-1}} \int_0^\infty (\lambda t)^{\alpha+k-1} e^{-\lambda t}\, dt. \] Substituting \(u = \lambda t\), so that \(du = \lambda\, dt\), we get: \[ \begin{align*} \mathbb{E}(T^k) &= \frac{\lambda^\alpha}{\Gamma(\alpha)} \cdot \frac{1}{\lambda^{\alpha+k}} \int_0^\infty u^{\alpha+k-1} e^{-u}\, du \\[4pt] &= \frac{1}{\lambda^k} \cdot \frac{\Gamma(\alpha+k)}{\Gamma(\alpha)}. \end{align*} \] Note that this will be true for any \(k\) as long as \(\alpha + k > 0\), so let’s set \(k = -1, \alpha = n\), and \(\lambda = \theta\): \[ \begin{align*} \mathbb{E}\!\left(\frac{1}{S}\right) &= \frac{1}{\theta^{-1}} \cdot \frac{\Gamma(n-1)}{\Gamma(n)} = \theta \cdot \frac{\Gamma(n-1)}{(n-1)\,\Gamma(n-1)} = \frac{\theta}{n-1}. \end{align*} \]
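This moment formula is easy to verify numerically (a sketch assuming NumPy and SciPy; note that NumPy's Gamma sampler is parameterized by scale \(= 1/\lambda\)).

```python
import numpy as np
from scipy.special import gamma as gamma_fn

# Verify E(T^k) = Gamma(alpha + k) / (lambda^k * Gamma(alpha)) for T ~ Gamma(alpha, rate = lambda),
# in particular E(1/S) = theta / (n - 1) for S ~ Gamma(n, theta).
rng = np.random.default_rng(5)
n, theta, reps = 10, 2.0, 500_000

S = rng.gamma(shape=n, scale=1 / theta, size=reps)   # NumPy uses scale = 1/rate

print("Monte Carlo E(1/S)     ≈", np.mean(1 / S))
print("Gamma-function formula =", gamma_fn(n - 1) / (theta ** (-1) * gamma_fn(n)))
print("theta / (n - 1)        =", theta / (n - 1))
```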

By Equation 1, \(\hat\theta_n = \dfrac{n}{S} \Rightarrow \mathbb{E}(\hat\theta_n)= \dfrac{n}{n-1}\theta\). So we see that the MLE is not unbiased, but since \(\dfrac{n}{n-1}\to 1\) (equivalently, the bias \(\dfrac{\theta}{n-1} \to 0\)) as \(n\to \infty\), we can say that the MLE is asymptotically unbiased.

To compute the variance of the MLE, we need \(\mathbb{E}(\hat\theta_n^2) = n^2\mathbb{E}\left(\dfrac{1}{S^2}\right)\). We know that if \(T\sim \text{Gamma} (\alpha, \lambda)\), then \[ \mathbb{E}(T^k) = \frac{1}{\lambda^k}\frac{\Gamma(\alpha+k)}{\Gamma(\alpha)}. \] Therefore, plugging in the values for \(S\), which are \(\alpha = n, \lambda = \theta, k = -2\), we get: \[ \begin{align*} \mathbb{E}(S^{-2}) &= \theta^2 \dfrac{\Gamma(n-2)}{\Gamma(n)}\\ &= \theta^2 \dfrac{\Gamma(n-2)}{(n-1)(n-2)\Gamma(n-2)}\\ &= \dfrac{\theta^2}{(n-1)(n-2)} \end{align*} \]

Putting this all together, we can compute the variance of the MLE as: \[ \begin{align*} \mathrm{Var}(\hat\theta_n) &= \mathbb{E}(\hat\theta_n^2) - \left(\mathbb{E}(\hat\theta_n)\right)^2\\ &= \frac{n^2\theta^2}{(n-1)(n-2)} - \frac{n^2\theta^2}{(n-1)^2}\\ &= \frac{n^2\theta^2}{(n-1)^2(n-2)}. \end{align*} \] Since \(nI(\theta) = \dfrac{n}{\theta^2}\), we see that \(\mathrm{Var}(\hat\theta_n) > \dfrac{1}{nI(\theta)}\), so the MLE does not attain the lower bound. It is, however, asymptotically efficient, but we already knew that, since we have shown that MLEs are asymptotically efficient.
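All of the pieces of this example can be confirmed with a short simulation (a sketch assuming NumPy; \(\theta\) and \(n\) below are arbitrary): the Monte Carlo mean and variance of \(\hat\theta_n\) should track \(\dfrac{n\theta}{n-1}\) and \(\dfrac{n^2\theta^2}{(n-1)^2(n-2)}\), and the latter sits strictly above the bound \(\theta^2/n\) for small \(n\).

```python
import numpy as np

# Simulation for the Beta(theta, 1) example: bias and variance of the MLE vs. the CRLB.
rng = np.random.default_rng(6)
theta, n, reps = 2.0, 10, 200_000

X = rng.beta(theta, 1.0, size=(reps, n))
theta_hat = -n / np.log(X).sum(axis=1)   # MLE from Equation 1

print("E(theta_hat)   ≈", theta_hat.mean())
print("n*theta/(n-1)  =", n * theta / (n - 1))
print("Var(theta_hat) ≈", theta_hat.var())
print("exact variance =", n**2 * theta**2 / ((n - 1) ** 2 * (n - 2)))
print("CRLB theta^2/n =", theta**2 / n)  # strictly smaller: the bound is not attained
```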

[Rice (2006); Pimentel (2024); Hogg, McKean, and Craig (2005)]

References

Hogg, Robert V., Joseph W. McKean, and Allen T. Craig. 2005. Introduction to Mathematical Statistics. 6th ed. Upper Saddle River, NJ: Pearson Prentice Hall.
Pimentel, Sam. 2024. “STAT 135 Lecture Slides.” Lecture slides (shared privately).
Rice, John A. 2006. Mathematical Statistics and Data Analysis. 3rd ed. Duxbury Press.