Proof of the Cramér-Rao Inequality

Introduction

In the last lecture, we defined the efficiency of one unbiased estimator relative to another unbiased estimator as \[ \mathrm{eff}(\tilde\theta_n,\hat\theta_n) = \frac{\mathrm{Var}(\hat\theta_n)}{\mathrm{Var}(\tilde\theta_n)}. \] If \(\mathrm{eff}(\tilde\theta_n,\hat\theta_n) < 1\), then \(\hat\theta_n\) has a smaller mean squared error than \(\tilde\theta_n\), and we would prefer to use \(\hat\theta_n\) for estimating \(\theta\). The question then arises: how do we know we have the best unbiased estimator, that is, the one with the lowest possible variance? The Cramér-Rao inequality answers this by providing a lower bound on the variance of any estimator. If an estimator achieves the lower bound, then we know that we aren’t going to find anything with a lower variance.

Let’s restate the Cramér-Rao inequality, this time for general estimators that are not necessarily unbiased.

Cramér-Rao Lower Bound

Theorem 1 Let \(X_1, \ldots, X_n \sim f(x \mid \theta)\) be an IID random sample, and let \(T = t(X_1, \ldots, X_n)\) be an estimator of \(\theta\), such that \(\mathbb{E}(T) = \mathbb{E}[t(X_1, \ldots, X_n)] = k(\theta)\). Then for sufficiently smooth \(f\),

\[ \text{Var}(T) \geq \frac{[k'(\theta)]^2}{nI(\theta)}. \]

(Recall that if \(\theta_0\) is the true parameter value, then \(I(\theta_0)\) is the Fisher information in a sample of size 1.)

Further, if \(\mathbb{E}(T) = k(\theta) = \theta\), that is, \(T\) is an unbiased estimator of \(\theta\), then \[ \text{Var}(T) \geq \frac{1}{I_n(\theta)} = \frac{1}{nI(\theta)} \]

Proof. We sketch the proof below in the case of a continuous density function:

\[ \begin{align*} k(\theta) &= \mathbb{E}_\theta(t(X_1, \ldots, X_n)) \\ &= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} t(x_1, \ldots, x_n)\, f(x_1, \ldots, x_n \mid \theta)\, dx_1 \cdots dx_n\\ &= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} t(x_1, \ldots, x_n)\, f(x_1 \mid \theta) \cdots f(x_n \mid \theta)\, dx_1 \cdots dx_n \end{align*} \] Taking the derivative with respect to \(\theta\) and changing the order of differentiation and integration, we get: \[ \begin{align*} k'(\theta) &= \frac{\partial}{\partial \theta}\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} t(x_1, \ldots, x_n)\, f(x_1 \mid \theta) \cdots f(x_n \mid \theta)\, dx_1 \cdots dx_n\\ &=\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} t(x_1, \ldots, x_n) \left[\sum_{i=1}^{n} \frac{\frac{\partial}{\partial\theta} f(x_i \mid \theta)}{f(x_i \mid \theta)}\right] f(x_1 \mid \theta) \cdots f(x_n \mid \theta)\, dx_1 \cdots dx_n \end{align*} \] In the last equality, we differentiated the product \(f(x_1 \mid \theta) \cdots f(x_n \mid \theta)\) term by term with the product rule and then multiplied and divided each term by \(f(x_i \mid \theta)\).

Now let

\[ Y = \sum_{i=1}^{n} \frac{\frac{\partial}{\partial\theta} f(X_i \mid \theta)}{f(X_i \mid \theta)} = \sum_{i=1}^{n} \frac{\partial}{\partial\theta} \log f(X_i \mid \theta), \]

which tells us that \(Y\) is the sum of score functions \(\Rightarrow \mathbb{E}_\theta(Y) = 0\), and \(\text{Var}(Y) = I_n(\theta) = nI(\theta)\).

Back to \(k'(\theta)\):

\[ k'(\theta) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} t(x_1, \ldots, x_n) \cdot y(x_1, \ldots, x_n)\, f(x_1 \mid \theta) \cdots f(x_n \mid \theta)\, dx_1 \cdots dx_n \]

This means that \(k'(\theta) = \mathbb{E}_\theta(TY) = \text{Cov}(T, Y)\) since \(\mathbb{E}_\theta(Y) = 0\).

Let \(\rho_{TY} = \dfrac{\text{Cov}(T,Y)}{\sqrt{\text{Var}(T)}\sqrt{\text{Var}(Y)}}\) denote the correlation of \(T, Y\).

\[ \Rightarrow \rho_{TY} = \frac{\mathbb{E}_\theta(TY)}{\sqrt{\text{Var}(T)}\sqrt{\text{Var}(Y)}} = \frac{k'(\theta)}{\sqrt{\text{Var}(T)} \cdot \sqrt{nI(\theta)}} \]

We know that \(\rho_{TY}^2 \leq 1\), so:

\[ [k'(\theta)]^2 \leq \text{Var}(T) \cdot nI(\theta) \]

\[ \Rightarrow \frac{[k'(\theta)]^2}{nI(\theta)} \leq \text{Var}(T) \qquad \blacksquare \]
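To make the objects in the proof concrete, here is a small simulation sketch (assuming NumPy is available; the parameter values are arbitrary) for the \(\mathcal{N}(\theta, 1)\) case with \(T = \overline{X}_n\): the score sum \(Y\) should have mean near \(0\) and variance near \(nI(\theta) = n\), and \(\mathrm{Cov}(T, Y)\) should be near \(k'(\theta) = 1\).

```python
import numpy as np

# Monte Carlo check of the quantities used in the proof, for X_1, ..., X_n ~ N(theta, 1)
# with T = X-bar, so k(theta) = theta and k'(theta) = 1.
rng = np.random.default_rng(0)
theta, n, reps = 2.0, 10, 200_000

X = rng.normal(theta, 1.0, size=(reps, n))
T = X.mean(axis=1)                       # the estimator T = X-bar
Y = (X - theta).sum(axis=1)              # score sum: d/dtheta log f(x | theta) = x - theta here

print("E(Y)        ≈", Y.mean())                        # close to 0
print("Var(Y)      ≈", Y.var())                         # close to n * I(theta) = n
print("Cov(T, Y)   ≈", np.cov(T, Y)[0, 1])              # close to k'(theta) = 1
print("corr(T,Y)^2 ≈", np.corrcoef(T, Y)[0, 1] ** 2)    # at most 1; equals 1 here since T attains the bound
```

In this particular case \(Y = n(\overline{X}_n - \theta)\), so the correlation is exactly 1, which is why \(\overline{X}_n\) attains the bound (as Example 1 below confirms analytically).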


Remark 1. If \(k(\theta)\) is the identity function, that is \(k(\theta) = \mathbb{E}_\theta(T) = \theta\), we recover the Cramér-Rao Inequality for unbiased estimators.

Definition 1 If \(\hat{\theta}\) is an unbiased estimator of \(\theta_0\) such that \(\text{Var}(\hat{\theta}) = \dfrac{1}{nI(\theta_0)}\), then we say that \(\hat{\theta}\) is an efficient estimator.

Remark 2. If we have an efficient estimator, then the variance is minimized, which means the sampling distribution of \(\hat{\theta}\) is as concentrated about the true value \(\theta_0\) as it can be.

Remark 3. Since the asymptotic variance of the MLE is \((I_n(\theta))^{-1}\), the MLE is asymptotically efficient, but it may not be efficient for small, finite \(n\).

Definition 2 (Hogg, McKean, and Craig (2005)) For a given \(n\), and an IID random sample \(X_1, \ldots, X_n \sim f(x \mid \theta_0)\), we call an unbiased estimator \(\hat{\theta}_n\) for \(\theta_0\) a Minimum Variance Unbiased Estimator (MVUE) of \(\theta_0\) if, for any other unbiased estimator of \(\theta_0\), say \(\tilde \theta_n\), \(\mathrm{Var}(\hat\theta_n) \le \mathrm{Var}(\tilde\theta_n)\).


In the following examples, we will use \(\ell'(\theta)\) to denote \(\dfrac{\partial \ell}{\partial \theta}\) and \(\ell''(\theta)\) to denote \(\dfrac{\partial^2 \ell}{\partial \theta^2}\).

Example 1 (Normal\((\theta,1)\) distribution with unknown mean, known variance:) Suppose \(X_1, \ldots, X_n \overset{IID}{\sim} \mathcal{N}(\theta,\, \sigma^2 = 1)\), so \(f(x\mid \theta) = \displaystyle \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-\theta)^2}\).

  1. \(\mathbb{E}(\overline{X}_n) = \dfrac{1}{n}\sum_{i=1}^n \mathbb{E}(X_i) = \dfrac{1}{n} \cdot n\theta_0 = \theta_0\)

  2. \(\text{Var}(\overline{X}_n) = \dfrac{\sigma^2}{n} = \dfrac{1}{n}\)

  3. \(I(\theta) = -\mathbb{E}\!\left(\dfrac{\partial^2}{\partial\theta^2} \log f(X \mid \theta)\right) = -\mathbb{E}(\ell''(\theta))\)

\[ \begin{align*} \ell(\theta) &= \log f(x \mid \theta) \\ &= -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}(x - \theta)^2 \\ \Rightarrow \ell'(\theta) &= -\tfrac{1}{2} \cdot 2(x-\theta)(-1) = x - \theta\\ \Rightarrow \ell''(\theta) &= -1 \\ \Rightarrow I(\theta_0) &= 1 \end{align*} \] Now, by Cramér-Rao:

\[ \text{Var}(\overline{X}_n) \geq \frac{1}{nI(\theta_0)} = \frac{1}{n} \]

But \(\text{Var}(\overline{X}_n) = \dfrac{1}{n}\)

Therefore, we see that the sample mean is an efficient estimator of the true mean for a normal distribution with known variance.
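As a numerical companion (a sketch assuming NumPy; the comparison with the sample median is an illustration added here, not part of the example above), we can simulate many samples, compare \(\mathrm{Var}(\overline{X}_n)\) with the bound \(1/n\), and contrast it with another unbiased estimator, the sample median, whose variance is roughly \(\pi/(2n)\) for normal data and therefore cannot attain the bound.

```python
import numpy as np

# Variance of the sample mean vs. the CRLB 1/n for N(theta, 1), with the sample
# median (also unbiased here, by symmetry) shown for contrast.
rng = np.random.default_rng(1)
theta, n, reps = 0.5, 25, 100_000

X = rng.normal(theta, 1.0, size=(reps, n))
mean_hat = X.mean(axis=1)
median_hat = np.median(X, axis=1)

print("CRLB 1/(n I(theta)) =", 1 / n)
print("Var(sample mean)    ≈", mean_hat.var())     # ≈ 1/n: attains the bound
print("Var(sample median)  ≈", median_hat.var())   # ≈ pi/(2n) > 1/n: does not attain it
print("eff(median, mean)   ≈", mean_hat.var() / median_hat.var())  # ≈ 2/pi ≈ 0.64
```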


Example 2 (Poisson(\(\theta\)) distribution with unknown rate:) \(X \sim \text{Poisson}(\theta)\), where \(P(X = x) = \dfrac{\theta^x e^{-\theta}}{x!}\), \(\quad x = 0, 1, 2, \ldots\)

Recall that \(\mathbb{E}(X) = \text{Var}(X) = \theta\). What is the MLE of \(\theta\) and is it efficient?


First, let’s find the MLE. Let’s write down \(\mathrm{lik}(\theta)\) first, for an IID sample \(X_1, \ldots, X_n\) from this distribution.

\[ \begin{align*} \mathrm{lik}(\theta) &= P(X_1=x_1, \ldots, X_n=x_n)\\ &=\prod_{i=1}^n P(X_i = x_i)\\ &=\prod_{i=1}^n \dfrac{\theta^{x_i} e^{-\theta}}{x_i!}\\ \Rightarrow \ell_n(\theta) &= \sum_{i=1}^n\log \left(\dfrac{\theta^{x_i} e^{-\theta}}{x_i!}\right)\\ &= \sum_{i=1}^n \left[x_i\log\theta - \log(x_i!) - \theta\right]\\ &= \sum_{i=1}^n x_i\log\theta -\sum_{i=1}^n\log(x_i!)- n\theta\\ \Rightarrow \ell_n'(\theta) &= \sum_{i=1}^n \frac{x_i}{\theta} - n \end{align*} \] Setting \(\ell_n'(\theta) = 0\) and solving gives \(\hat\theta = \dfrac{1}{n}\sum_{i=1}^n x_i\), so the MLE of \(\theta\) is the sample mean \(\overline{X}\). Now let’s compute the Fisher information with \(n=1\).

\[ \begin{align*} \ell(\theta) &= \log f(x \mid \theta)\\ &= x\log\theta - \log(x!) - \theta\\ \Rightarrow \ell'(\theta) &= \frac{x}{\theta} - 1\\ \Rightarrow \ell''(\theta) &= -\frac{x}{\theta^2} \end{align*} \] Therefore, we get that \[ \mathbb{E}\!\left[-\frac{\partial^2 \ell}{\partial \theta^2}\right] = \frac{1}{\theta^2}\mathbb{E}(X) = \frac{\theta}{\theta^2} = \frac{1}{\theta} \]

\[ \Rightarrow I(\theta) = \frac{1}{\theta} \quad \Rightarrow \quad I_n(\theta) = \frac{n}{\theta} \]

\(\mathbb{E}(\overline{X}) = \theta\) and \(\text{Var}(\overline{X}) = \dfrac{\theta}{n}\).

By Cramér-Rao:

\[ \text{Var}(\overline{X}) \geq \frac{1}{I_n(\theta)} = \frac{1}{n/\theta} = \frac{\theta}{n} = \text{Var}(\overline{X}) \]

\(\Rightarrow \overline{X}\) is efficient.
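The same conclusion can be checked by simulation (a sketch assuming NumPy; the values of \(\theta\) and \(n\) below are arbitrary): across many samples of size \(n\), the variance of \(\overline{X}\) should be close to \(\theta/n\), which is exactly the Cramér-Rao bound \(1/I_n(\theta)\).

```python
import numpy as np

# Poisson(theta): check that Var(X-bar) matches the CRLB theta/n.
rng = np.random.default_rng(2)
theta, n, reps = 3.0, 20, 100_000

X = rng.poisson(theta, size=(reps, n))
mle = X.mean(axis=1)                     # the MLE is the sample mean

print("E(MLE)   ≈", mle.mean())          # ≈ theta (unbiased)
print("Var(MLE) ≈", mle.var())           # ≈ theta / n
print("CRLB     =", theta / n)           # 1 / I_n(theta)
```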


Example 3 (Beta(\(\theta,1\)) distribution, with unknown parameter \(\theta\)) Hogg, McKean, and Craig (2005)

Let \(X_1, \ldots, X_n \sim f(x \mid \theta)\), \(\theta > 0\), where

\[ f(x \mid \theta) = \begin{cases} \theta\, x^{\theta - 1}, & 0 < x < 1 \\ 0, & \text{otherwise} \end{cases} \]

\[ \ell(\theta) = \log f(X \mid \theta) = \log\theta + (\theta - 1)\log X \]

\[ \ell'(\theta) = \frac{1}{\theta} + \log X \]

\[ \ell''(\theta) = -\frac{1}{\theta^2} \quad \Rightarrow \quad I(\theta) = -\mathbb{E}_\theta (\ell''(\theta)) = \frac{1}{\theta^2} \]

MLE of \(\theta\):

\[ \begin{align*} \ell_n(\theta) &= \sum_{i=1}^n \log f(x_i \mid \theta)\\ &= \sum_{i=1}^n \left[\log\theta + (\theta-1)\log x_i\right]\\ &= n\log\theta + (\theta - 1)\sum_{i=1}^n \log x_i. \end{align*} \]

\[ \ell_n'(\theta) = \frac{n}{\theta} + \sum_{i=1}^n \log(x_i) \]

Setting \(\ell_n'(\theta) = 0\) and solving for \(\hat{\theta}_{\text{MLE}}\equiv \hat\theta_n\), we get that the maximum likelihood estimator of \(\theta\) is:

\[ \hat{\theta}_n = \frac{-n}{\sum_{i=1}^n \log(X_i)} \tag{1}\]
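Before working with the exact distribution of \(\hat\theta_n\), here is a quick sanity check (a sketch assuming NumPy and SciPy; the sample is simulated): maximizing the log-likelihood numerically for one sample should reproduce the closed form in Equation 1.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Numerical check of the closed-form MLE for f(x | theta) = theta * x^(theta - 1), 0 < x < 1.
rng = np.random.default_rng(3)
theta_true, n = 2.5, 50
X = rng.beta(theta_true, 1.0, size=n)    # Beta(theta, 1) sample

def neg_loglik(theta):
    # negative log-likelihood: -(n log theta + (theta - 1) * sum log x_i)
    return -(n * np.log(theta) + (theta - 1) * np.log(X).sum())

closed_form = -n / np.log(X).sum()                                    # Equation 1
numerical = minimize_scalar(neg_loglik, bounds=(1e-6, 50), method="bounded").x

print("closed-form MLE ≈", closed_form)
print("numerical MLE   ≈", numerical)    # should agree closely with the closed form
```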

Finding the density of \(\hat{\theta}_n\):

Let \(Y_i = -\log X_i\); since \(0 < X_i < 1\), we have \(Y_i > 0\). For \(y > 0\), let’s first figure out the CDF of \(Y_i\).

You can try it yourself first, and then check your answer.
What is the distribution of \(Y_i\)?

\[ \begin{align*} F_{Y_i}(y) &= P(Y_i \leq y)\\ &= P(-\log X_i \leq y)\\ &= P(\log X_i \geq -y)\\ &= P(X_i \geq e^{-y}) \\ &= 1 - P(X_i < e^{-y}) \end{align*} \] Since \(Y_i\) is a continuous random variable, this means that \(F_{Y_i}(y) = 1- F_{X_i}(e^{-y})\). Thus:

\[ \begin{align*} f_{Y_i}(y) &= \frac{d}{dy} F_{Y_i}(y) \\ &= -f_{X_i}(e^{-y})(-e^{-y}) \\ &= \left[\theta(e^{-y})^{\theta-1}\right]\left[e^{-y}\right] \\ &= \theta e^{-\theta y}\cdot e^y\cdot e^{-y}\\ &= \theta e^{-\theta y}, \quad y > 0,\; \theta > 0 \end{align*} \] Therefore, \(Y_i \sim \text{Exp}(\theta)\) which is the same as \(\text{Gamma}(1, \theta)\).

\[ \Rightarrow Y_1 + Y_2 + \cdots + Y_n \sim \text{Gamma}(n, \theta) \]
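This change of variables can also be spot-checked numerically (a sketch assuming NumPy; the values of \(\theta\) and \(n\) are arbitrary): the sample moments of \(Y_i = -\log X_i\) and of the sum should match those of \(\text{Exp}(\theta)\) and \(\text{Gamma}(n, \theta)\) in the rate parameterization.

```python
import numpy as np

# Check that Y = -log X is Exp(theta) when X ~ Beta(theta, 1), and that the sum of n
# such Y's has the moments of Gamma(n, theta) (rate parameterization).
rng = np.random.default_rng(4)
theta, n, reps = 2.0, 15, 200_000

X = rng.beta(theta, 1.0, size=(reps, n))
Y = -np.log(X)
S = Y.sum(axis=1)

print("E(Y)   ≈", Y.mean(), " vs 1/theta   =", 1 / theta)
print("Var(Y) ≈", Y.var(),  " vs 1/theta^2 =", 1 / theta**2)
print("E(S)   ≈", S.mean(), " vs n/theta   =", n / theta)
print("Var(S) ≈", S.var(),  " vs n/theta^2 =", n / theta**2)
```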

Define \(S = \displaystyle \sum_{i=1}^n Y_i\), so \(S\sim \text{Gamma}(n, \theta)\). Then, by Equation 1, \(\hat\theta_n = \dfrac{n}{S}\). We need the expectation and variance of the MLE, so we can use the distribution of \(S\) for this.

We claim that
\(\mathbb{E}\left(\dfrac{1}{S}\right) = \dfrac{\theta}{n-1};\) let’s derive this from the moments of the Gamma distribution.

Let \(T\) have a \(\text{Gamma}(\alpha, \lambda)\) distribution, so \(f_T(t) = \dfrac{\lambda^\alpha}{\Gamma(\alpha)}\, t^{\alpha-1} e^{-\lambda t}, \quad t > 0.\)

We can compute the moments of the distribution: \[ \begin{align*} \mathbb{E}(T^k) &= \int_0^\infty t^k \cdot \frac{\lambda^\alpha}{\Gamma(\alpha)}\, t^{\alpha-1} e^{-\lambda t}\, dt \\ &= \frac{\lambda^\alpha}{\Gamma(\alpha)} \int_0^\infty t^{\alpha+k-1} e^{-\lambda t}\, dt. \end{align*} \] Now we can do a neat trick here by observing that the integral almost looks like the Gamma function. Recall that \(\Gamma(\alpha+k) = \int_0^\infty u^{\alpha+k-1} e^{-u}\, du\), so we need a factor of \((\lambda t)^{\alpha+k-1}\) in the integrand. Multiplying and dividing by \(\lambda^{\alpha+k-1}\), we get: \[ \mathbb{E}(T^k) = \frac{\lambda^\alpha}{\Gamma(\alpha)} \cdot \frac{1}{\lambda^{\alpha+k-1}} \int_0^\infty (\lambda t)^{\alpha+k-1} e^{-\lambda t}\, dt. \] Substituting \(u = \lambda t\), so that \(du = \lambda\, dt\), we get: \[ \begin{align*} \mathbb{E}(T^k) &= \frac{\lambda^\alpha}{\Gamma(\alpha)} \cdot \frac{1}{\lambda^{\alpha+k}} \int_0^\infty u^{\alpha+k-1} e^{-u}\, du \\[4pt] &= \frac{1}{\lambda^k} \cdot \frac{\Gamma(\alpha+k)}{\Gamma(\alpha)}. \end{align*} \] Note that this will be true for any \(k\) as long as \(\alpha + k > 0\), so let’s set \(k = -1, \alpha = n\), and \(\lambda = \theta\): \[ \begin{align*} \mathbb{E}\!\left(\frac{1}{S}\right) &= \frac{1}{\theta^{-1}} \cdot \frac{\Gamma(n-1)}{\Gamma(n)} = \theta \cdot \frac{\Gamma(n-1)}{(n-1)\,\Gamma(n-1)} = \frac{\theta}{n-1}. \end{align*} \]
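This moment formula is easy to verify numerically (a sketch assuming NumPy and SciPy; note that NumPy's Gamma sampler is parameterized by scale \(= 1/\lambda\)).

```python
import numpy as np
from scipy.special import gamma as gamma_fn

# Verify E(T^k) = Gamma(alpha + k) / (lambda^k * Gamma(alpha)) for T ~ Gamma(alpha, rate = lambda),
# in particular E(1/S) = theta / (n - 1) for S ~ Gamma(n, theta).
rng = np.random.default_rng(5)
n, theta, reps = 10, 2.0, 500_000

S = rng.gamma(shape=n, scale=1 / theta, size=reps)   # NumPy uses scale = 1/rate

print("Monte Carlo E(1/S)     ≈", np.mean(1 / S))
print("Gamma-function formula =", gamma_fn(n - 1) / (theta ** (-1) * gamma_fn(n)))
print("theta / (n - 1)        =", theta / (n - 1))
```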

By Equation 1, \(\hat\theta_n = \dfrac{n}{S} \Rightarrow \mathbb{E}(\hat\theta_n)= \dfrac{n}{n-1}\theta\). So we see that the MLE is not unbiased, but since \(\dfrac{n}{n-1}\to 1\) (equivalently, the bias \(\dfrac{\theta}{n-1} \to 0\)) as \(n\to \infty\), we can say that the MLE is asymptotically unbiased.

To compute the variance of the MLE, we need \(\mathbb{E}(\hat\theta_n^2) = n^2\mathbb{E}\left(\dfrac{1}{S^2}\right)\). We know that if \(T\sim \text{Gamma} (\alpha, \lambda)\), then \[ \mathbb{E}(T^k) = \frac{1}{\lambda^k}\frac{\Gamma(\alpha+k)}{\Gamma(\alpha)}. \] Therefore, plugging in the values for \(S\), which are \(\alpha = n, \lambda = \theta, k = -2\), we get: \[ \begin{align*} \mathbb{E}(S^{-2}) &= \theta^2 \dfrac{\Gamma(n-2)}{\Gamma(n)}\\ &= \theta^2 \dfrac{\Gamma(n-2)}{(n-1)(n-2)\Gamma(n-2)}\\ &= \dfrac{\theta^2}{(n-1)(n-2)} \end{align*} \]

Putting this all together, we can compute the variance of the MLE as: \[ \begin{align*} \mathrm{Var}(\hat\theta_n) &= \mathbb{E}(\hat\theta_n^2) - \left(\mathbb{E}(\hat\theta_n)\right)^2\\ &= \frac{n^2\theta^2}{(n-1)(n-2)} - \frac{n^2\theta^2}{(n-1)^2}\\ &= \frac{n^2\theta^2}{(n-1)^2(n-2)}. \end{align*} \] Since \(nI(\theta) = \dfrac{n}{\theta^2}\), we see that \(\mathrm{Var}(\hat\theta_n) > \dfrac{1}{nI(\theta)}\), so the MLE does not attain the lower bound. It is, however, asymptotically efficient, but we already knew that, since we have shown that MLEs are asymptotically efficient.
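All of the pieces of this example can be confirmed with a short simulation (a sketch assuming NumPy; \(\theta\) and \(n\) below are arbitrary): the Monte Carlo mean and variance of \(\hat\theta_n\) should track \(\dfrac{n\theta}{n-1}\) and \(\dfrac{n^2\theta^2}{(n-1)^2(n-2)}\), and the latter sits strictly above the bound \(\theta^2/n\) for small \(n\).

```python
import numpy as np

# Simulation for the Beta(theta, 1) example: bias and variance of the MLE vs. the CRLB.
rng = np.random.default_rng(6)
theta, n, reps = 2.0, 10, 200_000

X = rng.beta(theta, 1.0, size=(reps, n))
theta_hat = -n / np.log(X).sum(axis=1)   # MLE from Equation 1

print("E(theta_hat)   ≈", theta_hat.mean())
print("n*theta/(n-1)  =", n * theta / (n - 1))
print("Var(theta_hat) ≈", theta_hat.var())
print("exact variance =", n**2 * theta**2 / ((n - 1) ** 2 * (n - 2)))
print("CRLB theta^2/n =", theta**2 / n)  # strictly smaller: the bound is not attained
```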

[Rice (2006); Pimentel (2024); Hogg, McKean, and Craig (2005)]

References

Hogg, Robert V., Joseph W. McKean, and Allen T. Craig. 2005. Introduction to Mathematical Statistics. 6th ed. Upper Saddle River, NJ: Pearson Prentice Hall.
Pimentel, Sam. 2024. “STAT 135 Lecture Slides.” Lecture slides (shared privately).
Rice, John A. 2006. Mathematical Statistics and Data Analysis. 3rd ed. Duxbury Press.