Proof of the Factorization Theorem and the Rao-Blackwell Theorem
The Factorization Theorem:
In the last lecture note, we stated the factorization theorem, which establishes a necessary and sufficient condition for a statistic to be sufficient for a population parameter \(\theta\). Let’s restate it, and then we will sketch the proof in the discrete case.
Theorem 1 A necessary and sufficient condition for \(T\) to be a sufficient statistic for \(\theta\) is that there exist functions \(g\) and \(h\) such that:
\[ f(x_1, \ldots, x_n \mid \theta) = g\!\left(T(x_1, \ldots, x_n) \mid \theta\right) \cdot h(x_1, \ldots, x_n) \]
Sketch of Proof (discrete case):
Proof. First, let’s show that factorization \(\implies\) sufficiency: if we can factor \(f\), then \(T\) is sufficient.
Let \(T = T(X_1, \ldots, X_n)\). Let \(\vec{X} = (X_1, \ldots, X_n)\). We need the PMF of \(T\).
\[ \begin{align*} P(T = t) &= \sum_{\vec{x}:\, T(\vec{x}) = t} P(\vec{X} = \vec{x}) \quad \text{(summing over all $\vec{x}$ that are mapped to $t$)}\\[4pt] &= \sum_{\vec{x}:\, T(\vec{x}) = t} g(T(\vec{x}) \mid \theta)\, h(\vec{x}) \quad \text{(since $f$ factors into $g$ and $h$)}\\[4pt] &= \sum_{\vec{x}:\, T(\vec{x}) = t} g(t \mid \theta) \cdot h(\vec{x}) \\[4pt] &= g(t \mid \theta) \sum_{\vec{x}:\, T(\vec{x}) = t} h(\vec{x}). \end{align*} \]
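(As a small concrete illustration of this sum: with \(n = 2\) Bernoulli trials and \(T(\vec x) = x_1 + x_2\), the set \(\{\vec x : T(\vec x) = 1\}\) is \(\{(0,1), (1,0)\}\), so \(P(T = 1) = P(\vec X = (0,1)) + P(\vec X = (1,0))\).)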
Now let’s check that the conditional distribution of \(\vec{X}\) given \(T\) does not depend on \(\theta\). For any \(\vec{x}\) with \(T(\vec{x}) = t\) (otherwise the conditional probability is \(0\)): \[ \begin{align*} P(\vec{X} = \vec{x} \mid T = t) &= \frac{P(\vec{X} = \vec{x},\, T = t)}{P(T = t)} = \frac{P(\vec{X} = \vec{x})}{P(T = t)}, \quad \text{since $\{\vec{X} = \vec{x}\} \subseteq \{T = t\}$ when $T(\vec{x}) = t$} \\[8pt] &= \frac{g(t \mid \theta)\cdot h(\vec{x})}{g(t \mid \theta)\sum_{\vec{x}':\, T(\vec{x}') = t} h(\vec{x}')} \\[8pt] &= \frac{h(\vec{x})}{\sum_{\vec{x}':\, T(\vec{x}') = t} h(\vec{x}')} \quad \longrightarrow \text{no dependence on } \theta \end{align*} \]
\(\implies T\) is sufficient for \(\theta\).
Now the other direction: sufficiency \(\implies\) factorization. That is, if \(T\) is sufficient, we can factor \(f\) into \(g\) and \(h\).
If \(T\) is sufficient, then \(P(\vec{X} = \vec{x} \mid T = t)\) does not depend on \(\theta\).
Define \(g(t \mid \theta) = P(T = t \mid \theta)\) and, writing \(t = T(\vec x)\), define \(h(\vec x) = P(\vec{X} = \vec{x} \mid T = t)\), which does not depend on \(\theta\) by sufficiency. Then we have that \[ \begin{align*} f(\vec x \mid \theta) &= P(\vec{X} = \vec{x} \mid \theta) \\ &= P(\vec{X} = \vec{x},\, T = t \mid \theta) \quad \text{(since $\{\vec{X} = \vec{x}\} \subseteq \{T = t\}$ for $t = T(\vec x)$)}\\ &= P(\vec{X} = \vec{x} \mid T = t, \theta)\, P(T = t \mid \theta) \\ &= \underbrace{P(\vec{X} = \vec{x} \mid T = t)}_{h(\vec x)}\cdot \underbrace{P(T = t \mid \theta)}_{g(t \mid \theta)}, \end{align*} \] where the last equality holds because \(T\) is a sufficient statistic. \(\blacksquare\)
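To make the factorization concrete, here is a standard illustrative example (not needed for the proof): let \(X_1, \ldots, X_n\) be IID Bernoulli(\(\theta\)). Then \[ f(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum_i x_i}(1-\theta)^{\,n-\sum_i x_i}, \] which has the form \(g(T(\vec x) \mid \theta)\cdot h(\vec x)\) with \(T(\vec x) = \sum_{i=1}^{n} x_i\), \(g(t \mid \theta) = \theta^{t}(1-\theta)^{n-t}\), and \(h(\vec x) = 1\). By the theorem, \(T = \sum_i X_i\) is sufficient for \(\theta\).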
We now have a nice corollary to the theorem that says that all we need for the MLE is \(T\), and we don’t need the entire sample.
Corollary 1 If \(T\) is sufficient for \(\theta\), then the MLE \(\hat \theta\) is a function of \(T\).
Proof. Let \(X_1, \ldots, X_n\) be an IID sample with density function (or PMF) \(f(x\mid \theta)\). Then \[ \begin{align*} \mathrm{lik}(\theta) &= f(x_1, \ldots, x_n\mid \theta)\\ &= g(T(x_1, \ldots, x_n)\mid \theta)\cdot h(x_1, \ldots, x_n)\quad \text{(by the factorization theorem, since $T$ is sufficient)}\\ \end{align*} \] Now, in order to compute the MLE, we take logs (and define \(\ell(\theta)\)), differentiate, etc. When we take logs and differentiate with respect to \(\theta\), the \(h\)-term vanishes, and we are left with a function of \(T\): \[ \begin{align*} \ell(\theta) &= \log\bigl(g(T(x_1, \ldots, x_n)\mid \theta)\cdot h(x_1, \ldots, x_n)\bigr)\\ &= \log g(T(x_1, \ldots, x_n)\mid \theta) +\log h(x_1, \ldots, x_n)\\ \implies \ell'(\theta) &= \frac{\partial}{\partial \theta}\log g(T(x_1, \ldots, x_n)\mid \theta) + 0\\ \end{align*} \] Thus, the maximizer of \(\ell(\theta)\) depends on the data only through \(T\). \(\blacksquare\)
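Continuing the Bernoulli illustration from above: there \(g(T \mid \theta) = \theta^{T}(1-\theta)^{n-T}\) and \(h \equiv 1\), so \[ \ell(\theta) = T\log\theta + (n-T)\log(1-\theta), \qquad \ell'(\theta) = \frac{T}{\theta} - \frac{n-T}{1-\theta} = 0 \;\implies\; \hat\theta = \frac{T}{n}, \] and indeed the MLE depends on the data only through \(T\).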
The Rao-Blackwell Theorem
So far, we have seen that MLEs are functions of sufficient statistics, and that if we have a sufficient statistic, we don’t need to store the entire sample. The main theorem we will discuss in this section, the Rao-Blackwell theorem, goes one step further. It says that an estimator should depend on the data only through a sufficient statistic; otherwise we can improve it (by finding an estimator with lower variance).
Theorem 2 Suppose that \(\hat\theta\) is an estimator for \(\theta\) with \(\mathbb{E}(\hat\theta^2) < \infty\). Assume that \(T\) is a sufficient statistic for \(\theta\). Define a new estimator \(\tilde \theta = \mathbb{E}(\hat\theta \mid T)\); because \(T\) is sufficient, this conditional expectation does not depend on \(\theta\), so \(\tilde\theta\) is a genuine estimator. Then: \[ MSE(\tilde\theta) = \mathbb{E}(\tilde\theta - \theta)^2 \le \mathbb{E}(\hat\theta - \theta)^2 = MSE(\hat\theta) \] That is, if we know a sufficient statistic \(T\) and we have an estimator \(\hat\theta\), we can define an estimator \(\tilde\theta\) for \(\theta\) whose MSE is no larger than that of \(\hat\theta\).
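As a numerical illustration of the theorem, here is a minimal simulation sketch; the Poisson setup, the parameter values, and the variable names below are chosen purely for illustration and are not from the notes. For an IID Poisson(\(\lambda\)) sample, \(T = \sum_i X_i\) is sufficient for \(\lambda\), and a crude unbiased estimator of \(\theta = P(X = 0) = e^{-\lambda}\) is \(\hat\theta = \mathbf{1}\{X_1 = 0\}\). Since \(X_1 \mid T = t \sim \mathrm{Binomial}(t, 1/n)\), Rao-Blackwellization gives \(\tilde\theta = \mathbb{E}(\hat\theta \mid T) = \left(\tfrac{n-1}{n}\right)^{T}\). The simulation compares their MSEs.

```python
import numpy as np

# Illustrative sketch (assumed example, not from the notes):
# X_1, ..., X_n IID Poisson(lam); T = sum(X_i) is sufficient for lam.
# Crude unbiased estimator of theta = P(X = 0) = exp(-lam):
#     theta_hat   = 1{X_1 = 0}
# Rao-Blackwellized estimator:
#     theta_tilde = E(theta_hat | T) = ((n - 1) / n) ** T
rng = np.random.default_rng(0)
lam, n, reps = 2.0, 10, 100_000
theta = np.exp(-lam)                              # true value being estimated

samples = rng.poisson(lam, size=(reps, n))        # reps independent samples of size n
theta_hat = (samples[:, 0] == 0).astype(float)    # crude estimator, one per sample
T = samples.sum(axis=1)                           # sufficient statistic, one per sample
theta_tilde = ((n - 1) / n) ** T                  # Rao-Blackwellized estimator

print("MSE(theta_hat)   =", np.mean((theta_hat - theta) ** 2))
print("MSE(theta_tilde) =", np.mean((theta_tilde - theta) ** 2))  # much smaller
```

Both estimators are unbiased here, so the MSE reduction is exactly the variance reduction promised by the theorem.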