Proof of the Rao-Blackwell Theorem and Properties of Estimators
Introduction
In this lecture we will wrap up the classical ideas of inference regarding parameter estimation. We will prove the Rao-Blackwell theorem and see what “Rao-Blackwellization” of an estimator means. Then we will discuss properties of estimators, which can help us choose between two estimators. We stated the Rao-Blackwell theorem in the last lecture: if we have an estimator that is not a function of a sufficient statistic, we can improve it. The result was proved by C. R. Rao in 1945 and independently by David Blackwell a couple of years later (Bera 2003).

The Rao-Blackwell Theorem
So far, we have seen that MLEs are functions of sufficient statistics, and we have seen that if we have a sufficient statistic, we don’t need to store the entire sample. The main theorem we will discuss in this section, the Rao-Blackwell theorem, goes one step further. It says that an estimator should only depend on a sufficient statistic, otherwise we can improve it (by finding an estimator with lower variance).
Theorem 1 Suppose that \(\hat\theta\) is an estimator for \(\theta\) with \(\mathbb{E}(\hat\theta^2) < \infty\). Assume that \(T\) is a sufficient statistic for \(\theta\). Define a new estimator \(\tilde \theta = \mathbb{E}(\hat\theta \mid T)\). Then: \[ MSE(\tilde\theta) = \mathbb{E}(\tilde\theta - \theta)^2 \le \mathbb{E}(\hat\theta - \theta)^2 = MSE(\hat\theta) \]
That is, if we know a sufficient statistic \(T\), and we have an estimator \(\hat\theta\), we can define an even better estimator for \(\theta\), denoted \(\tilde\theta\), which will have a smaller MSE. Note that the inequality is strict unless \(\tilde{\theta}=\hat{\theta}\).
Brief Review of Conditional Expectation
Before we prove the theorem, let’s review conditional expectations.
Conditional Distributions and Expectations
Let \(X\) and \(Y\) be random variables, and let \(f(x,y)\) be the joint density (or mass function, if the random variables are discrete). Recall that the conditional density (or conditional mass function) of \(Y\) given \(X = x\) is defined by: \[ f_{Y\mid X}(y\mid x)=\frac{f(x,y)}{f_{X}(x)}, \quad f_{X}(x)\ne0 \]
Here, \(f_{X}(x)\) is the marginal of \(X\): \[ f_{X}(x)= \begin{cases} \sum_{y}f(x,y), \text{ if } X \text{ and } Y \text{ are discrete}\\ \int_{-\infty}^{\infty}f(x,y)dy, \text{ if } X \text{ and } Y \text{ are continuous} \end{cases} \]
We can similarly define \(f_{X\mid Y}(x\mid y)\). Note that \(f(x,y)=f_{Y\mid X}(y\mid x)f_{X}(x)=f_{X\mid Y}(x\mid y)f_{Y}(y)\).
Further, recall that \(f_{Y\mid X}(y \mid x)\) is a bona fide probability distribution (in \(y\), for each fixed \(x\)):
- \(0\le f_{Y\mid X}\le1\) in the discrete case (a PMF), and \(0\le f_{Y\mid X}\) in the continuous case (a density).
- \(\int_{-\infty}^{\infty}f_{Y\mid X}(y\mid x)dy= \displaystyle\int_{-\infty}^{\infty}\frac{f(x,y)}{f_{X}(x)}dy=1\) (and analogously in the discrete case).
We can therefore define the conditional expectation \(\mathbb{E}(Y\mid X=x)\), which is a function of \(x\): \[ g(x)=\mathbb{E}(Y\mid X=x)=\int_{-\infty}^{\infty}yf_{Y\mid X}(y\mid x)dy \]
Thus, even though \(\mathbb{E}(Y)\) is a constant parameter of the distribution of \(Y\), \(\mathbb{E}(Y\mid X=x)\) is a function of \(x\): we fix \(X=x\) and then sum or integrate over \(y\). Further, before we observe the value \(x\), we don’t know the value of \(\mathbb{E}(Y\mid X=x)\); it is therefore a random variable, denoted by \(\mathbb{E}(Y\mid X)\).
Example 1 Let \(X\sim \text{Unif}(0,1)\) and let \(Y\mid X=x\sim \text{Unif}(0,x)\). That is, for any given \(x\), \(Y\) has the uniform distribution on the interval \((0,x)\). What is \(\mathbb{E}(Y\mid X=x)\)? Now, since \(Y\mid X=x\sim \text{Unif}(0,x)\), we have that \[ f_{Y\mid X}(y\mid x)=\frac{1}{x}, \quad 0<y<x \] Therefore, by the definition of conditional expectation, \[ \begin{align*} \mathbb{E}(Y\mid X=x) &= \int_{0}^{x}y\frac{1}{x}dy \\ &=\frac{1}{x}\left[\frac{y^2}{2}\right]_{0}^{x} \\ &=\frac{1}{x}\cdot\frac{x^2}{2} \\ &=\frac{x}{2}. \end{align*} \] Of course, you could have written this down without actually computing the integral, since the mean of a \(\text{Unif}(0,x)\) distribution is \(\dfrac{0+x}{2}=\dfrac{x}{2}\).
Thus \(g(x)=\mathbb{E}(Y\mid X=x)=\dfrac{x}{2}\) for \(0<x<1\), and the random variable \(g(X)=\dfrac{X}{2}\).
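As a quick numerical check, here is a minimal Monte Carlo sketch (using NumPy; the particular value \(x = 0.7\) is an arbitrary choice) that simulates \(Y\mid X=x\) and confirms that the conditional mean is \(x/2\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Fix a value of x and simulate Y | X = x ~ Unif(0, x).
x = 0.7
y = rng.uniform(0.0, x, size=1_000_000)

print(y.mean())   # ~ 0.35, matching E(Y | X = x) = x / 2
```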
Now we will state and prove an enormously useful theorem. It lets us compute the (unconditional) expectation of a random variable \(Y\) by conditioning on another random variable \(X\) and taking the expected value of the random variable \(\mathbb{E}(Y\mid X)\). It goes by various names, such as the law of total expectation, the tower property, and the name we use below. (In the text by Blitzstein and Hwang, it is called Adam’s Law.)
Theorem 2 (Law of Iterated Expectations)
For random variables \(X, Y\): \[ \mathbb{E}\left(\mathbb{E}(Y\mid X)\right)=\mathbb{E}(Y), \] where the outer expectation is taken over the distribution of \(X\).
Proof. (Continuous case): \[ \begin{align*} \mathbb{E}[\mathbb{E}(Y\mid X)] &= \mathbb{E}\left[\int_{-\infty}^{\infty}yf_{Y\mid X}(y\mid x)dy\right]\\ &=\int_{-\infty}^{\infty}\left[\int_{-\infty}^{\infty}y\frac{f(x,y)}{f_X(x)}dy\right]f_X(x)dx\\ &=\int_{-\infty}^{\infty}y\left[\int_{-\infty}^{\infty}f(x,y)dx\right]dy=\int_{-\infty}^{\infty}yf_Y(y)dy \\ &=\mathbb{E}(Y) \qquad \qquad \blacksquare \end{align*} \]
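To see the law of iterated expectations in action, here is a short simulation sketch of Example 1 (again using NumPy): \(\mathbb{E}(Y)=\mathbb{E}\left(\mathbb{E}(Y\mid X)\right)=\mathbb{E}(X/2)=1/4\), and the direct Monte Carlo estimate of \(\mathbb{E}(Y)\) agrees.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Simulate from the joint distribution of Example 1:
# X ~ Unif(0, 1), then Y | X = x ~ Unif(0, x).
x = rng.uniform(0.0, 1.0, size=n)
y = rng.uniform(0.0, x)           # each Y_i is uniform on (0, X_i)

print(y.mean())            # direct estimate of E(Y)
print(np.mean(x / 2))      # E[E(Y | X)] = E(X / 2); both ~ 0.25
```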
Conditional Variance
Having defined the conditional expectation of \(Y\) given a random variable \(X\), we can now define the conditional variance of \(Y\) given a random variable \(X\) similarly.
Definition 1 The conditional variance of a random variable \(Y\) given \(X=x\) is given by: \[ \begin{align*} \operatorname{Var}(Y\mid X=x) &= \mathbb{E}\left[\left(Y-\mathbb{E}(Y\mid X=x)\right)^2\mid X=x\right]\\ &= \mathbb{E}(Y^2\mid X=x)-[\mathbb{E}(Y\mid X=x)]^2 \end{align*} \]
\(\operatorname{Var}(Y\mid X)\) is a random variable with expected value \(\mathbb{E}[\operatorname{Var}(Y\mid X)]\). Note that it is a function of \(X\), so the expectation is over the values of \(X\).
Now we can relate the unconditional variance of \(Y\) to the conditional variance. This result is analogous to the law of total expectation and is called the law of total variance (in Blitzstein and Hwang (2019) it is called Eve’s Law). It says that to compute the total variance of \(Y\), we can split the observations into groups according to the value taken by \(X\), and add up the variation within groups and the variation between groups.
Theorem 3 \[ \operatorname{Var}(Y)=\mathbb{E}(\operatorname{Var}(Y\mid X)) + \operatorname{Var}(\mathbb{E}(Y\mid X)) \] (Expectation of conditional Variance + Variance of conditional Expectation).
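We will not prove this here, but a short simulation of Example 1 (again a NumPy sketch) illustrates the decomposition: since \(\operatorname{Var}(Y\mid X)=X^2/12\) and \(\mathbb{E}(Y\mid X)=X/2\), the two terms sum to \(\mathbb{E}(X^2)/12+\operatorname{Var}(X)/4=1/36+1/48=7/144\), which matches \(\operatorname{Var}(Y)\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Same joint distribution as Example 1: X ~ Unif(0, 1), Y | X ~ Unif(0, X).
x = rng.uniform(0.0, 1.0, size=n)
y = rng.uniform(0.0, x)

within  = np.mean(x**2 / 12)   # E[Var(Y | X)], since Var(Unif(0, x)) = x^2 / 12
between = np.var(x / 2)        # Var(E(Y | X)) = Var(X / 2)

print(np.var(y))               # total variance of Y
print(within + between)        # both ~ 7/144 = 0.0486
```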
Proving the Rao-Blackwell theorem
First, note that because \(T\) is sufficient, the conditional distribution of \(\hat{\theta}\) given \(T\) does not depend on \(\theta\), so \(\tilde{\theta}=\mathbb{E}(\hat{\theta}\mid T)\) is a genuine estimator: it can be computed from the data alone. Next, an immediate consequence of the law of iterated expectations is that: \[ \mathbb{E}(\tilde{\theta})=\mathbb{E}(\mathbb{E}(\hat{\theta}\mid T))=\mathbb{E}(\hat{\theta}). \] Therefore, \(\operatorname{Bias}(\tilde\theta) = \operatorname{Bias}(\hat\theta)\), and if \(\hat{\theta}\) is unbiased, so is \(\tilde{\theta}\). This means that in order to compare MSEs, we only need to compare the variances of the two estimators.
Using the law of total variance on \(\hat{\theta}\), with \(\tilde{\theta}=\mathbb{E}(\hat{\theta}\mid T)\): \[ \begin{align*} \operatorname{Var}(\hat{\theta}) &= \operatorname{Var}(\mathbb{E}(\hat{\theta}\mid T))+\mathbb{E}(\operatorname{Var}(\hat{\theta}\mid T))\\ &= \operatorname{Var}(\tilde{\theta})+\mathbb{E}(\operatorname{Var}(\hat{\theta}\mid T)) \end{align*} \] Since \(\mathbb{E}(\operatorname{Var}(\hat{\theta}\mid T))\ge0\), it follows that \(\operatorname{Var}(\hat{\theta})-\operatorname{Var}(\tilde{\theta})\ge0\), and hence \(MSE(\tilde\theta)\le MSE(\hat\theta)\). \(\quad \blacksquare\)
This theorem tells us that when searching for a Minimum Variance Unbiased Estimator (MVUE), if a sufficient statistic exists, we can restrict our search to functions of that sufficient statistic.
Examples of “Rao-Blackwellization”
Example 2 Let \(X_1, \dots, X_n \overset{IID}{\sim} \text{Poisson}(\lambda)\) and define \(\theta=e^{-\lambda} \Rightarrow \lambda=-\log\theta\). \[ P(X=x)=\frac{e^{-\lambda}\lambda^x}{x!} = \frac{\theta(-\log\theta)^x}{x!} \] Define \(\hat{\theta}=\mathbb{1}_{\{X_1=0\}}\). Use the Rao-Blackwell theorem to find an estimator with smaller MSE. Let \(\displaystyle T=\sum_{i=1}^n X_i\). \[ \begin{align*} \tilde{\theta} &= \mathbb{E}(\hat{\theta}\mid T)\\ &=\mathbb{E}(\mathbb{1}_{\{X_1=0\}}\mid T)=P\left(X_1=0\mid\sum_{i=1}^n X_i=t\right)\\ &=\dfrac{P(X_1=0, \sum_{i=2}^n X_i=t)}{P(\sum_{i=1}^n X_i=t)} \\[8pt] &= \dfrac{P(X_1=0)P(\sum_{i=2}^n X_i=t)}{P(\sum_{i=1}^n X_i=t)} \end{align*} \]
Since \(\displaystyle \sum_{i=2}^n X_i \sim \text{Poisson}((n-1)\lambda)\) and \(\displaystyle\sum_{i=1}^n X_i \sim \text{Poisson}(n\lambda)\), we get: \[ \begin{align*} \tilde{\theta} &= \dfrac{e^{-\lambda} \dfrac{[(n-1)\lambda]^t e^{-(n-1)\lambda}}{t!}}{\dfrac{(n\lambda)^t e^{-n\lambda}}{t!}} = \left(\dfrac{n-1}{n}\right)^t \end{align*} \]
\[ \Rightarrow \tilde{\theta}=\left(\frac{n-1}{n}\right)^{\sum_{i=1}^n X_i} \]
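The following minimal simulation sketch (NumPy; the choices \(\lambda = 2\) and \(n = 10\) are arbitrary illustrations) compares the Monte Carlo MSEs of the crude estimator \(\hat{\theta}=\mathbb{1}_{\{X_1=0\}}\) and the Rao-Blackwellized \(\tilde{\theta}\). Both are unbiased, but \(\tilde{\theta}\) has a noticeably smaller variance, as the theorem guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)

lam, n, reps = 2.0, 10, 200_000
theta = np.exp(-lam)                       # target: theta = e^{-lambda}

x = rng.poisson(lam, size=(reps, n))       # reps independent samples of size n

theta_hat   = (x[:, 0] == 0).astype(float)        # crude estimator 1{X_1 = 0}
theta_tilde = ((n - 1) / n) ** x.sum(axis=1)       # Rao-Blackwellized estimator

print(np.mean((theta_hat   - theta) ** 2))   # MSE of the crude estimator
print(np.mean((theta_tilde - theta) ** 2))   # much smaller MSE
```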
Properties of Estimators
We wrap up our study of classical inference by summarizing the desirable properties of an estimator. We have seen all of these previously, but now I have added another property, that of equivariance:
- Consistent: \(\hat{\theta}_n \xrightarrow{P} \theta_0\).
- Unbiased: \(\mathbb{E}(\hat{\theta}_n)=\theta_0\).
- Asymptotically Normal: \(\dfrac{(\hat{\theta}_n-\theta_0)}{\sqrt{1/I_n(\theta_0)}} \xrightarrow{D} N(0,1)\).
- Efficient / Asymptotically Efficient: \(\operatorname{Var}(\hat{\theta}_n)=\dfrac{1}{I_n(\theta_0)}\), i.e., the variance attains the Cramér-Rao lower bound (exactly, or in the limit as \(n\to\infty\)).
- Equivariant / Invariant: the estimator transforms in a predictable way when the parameterization (or the data) is transformed. If \(g(\theta)\) is a transformation of the parameter, then \(g(\hat{\theta}_n)\) should estimate \(g(\theta)\). The MLE is equivariant: if \(\hat{\theta}_n\) is the MLE of \(\theta\), then \(g(\hat{\theta}_n)\) is the MLE of \(g(\theta)\) (see the sketch below).
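As a small illustration of equivariance, the following sketch (NumPy; the sample size, \(\lambda=2\), and the grid are arbitrary illustrations) estimates \(\theta=e^{-\lambda}\) from Poisson data in two ways: by transforming the MLE \(\hat\lambda=\bar X\), and by maximizing the log-likelihood directly in the \(\theta\) parameterization over a fine grid. The two answers coincide (up to grid resolution).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(2.0, size=50)            # a Poisson(lambda = 2) sample

lam_mle = x.mean()                        # MLE of lambda is the sample mean

# Maximize the Poisson log-likelihood over a fine grid of theta = e^{-lambda}.
thetas = np.linspace(1e-4, 1 - 1e-4, 100_000)
lams = -np.log(thetas)
loglik = x.sum() * np.log(lams) - len(x) * lams   # log-likelihood up to a constant

print(np.exp(-lam_mle))                   # MLE of theta via equivariance
print(thetas[np.argmax(loglik)])          # direct maximization agrees
```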