Proof of the Factorization Theorem and the Rao-Blackwell Theorem
The Factorization Theorem:
In the last lecture note, we stated the factorization theorem, which establishes a necessary and sufficient condition for a statistic to be sufficient for a population parameter \(\theta\). Let’s restate it, and then we will sketch the proof in the discrete case.
Theorem 1 A necessary and sufficient condition for \(T\) to be a sufficient statistic for \(\theta\) is that there exist functions \(g\) and \(h\) such that:
\[ f(x_1, \ldots, x_n \mid \theta) = g\!\left(T(x_1, \ldots, x_n) \mid \theta\right) \cdot h(x_1, \ldots, x_n) \]
Sketch of Proof (discrete case):
Proof. First, let’s show that factorization \(\implies\) sufficiency: if we can factor \(f\), then \(T\) is sufficient.
Let \(T = T(X_1, \ldots, X_n)\). Let \(\vec{X} = (X_1, \ldots, X_n)\). We need the PMF of \(T\).
\[ \begin{align*} P(T = t) &= \sum_{\vec{x}:\, T(\vec{x}) = t} P(\vec{X} = \vec{x}) \quad \text{(summing over all $\vec{x}$ that are mapped to $t$)}\\[4pt] &= \sum_{\vec{x}:\, T(\vec{x}) = t} g(T(\vec{x}) \mid \theta)\, h(\vec{x}) \quad \text{(since $f$ factors into $g$ and $h$)}\\[4pt] &= \sum_{\vec{x}:\, T(\vec{x}) = t} g(t \mid \theta) \cdot h(\vec{x}) \\[4pt] &= g(t \mid \theta) \sum_{\vec{x}:\, T(\vec{x}) = t} h(\vec{x}). \end{align*} \]
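(As a small concrete illustration of this sum: with \(n = 2\) Bernoulli trials and \(T(\vec x) = x_1 + x_2\), the set \(\{\vec x : T(\vec x) = 1\}\) is \(\{(0,1), (1,0)\}\), so \(P(T = 1) = P(\vec X = (0,1)) + P(\vec X = (1,0))\).)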
Now let’s check that the conditional distribution of \(\vec{X}\) given \(T\) does not depend on \(\theta\). For any \(\vec{x}\) with \(T(\vec{x}) = t\) (otherwise the conditional probability is \(0\)): \[ \begin{align*} P(\vec{X} = \vec{x} \mid T = t) &= \frac{P(\vec{X} = \vec{x},\, T = t)}{P(T = t)} = \frac{P(\vec{X} = \vec{x})}{P(T = t)}, \quad \text{since $\{\vec{X} = \vec{x}\} \subseteq \{T = t\}$ when $T(\vec{x}) = t$} \\[8pt] &= \frac{g(t \mid \theta)\cdot h(\vec{x})}{g(t \mid \theta)\sum_{\vec{x}':\, T(\vec{x}') = t} h(\vec{x}')} \\[8pt] &= \frac{h(\vec{x})}{\sum_{\vec{x}':\, T(\vec{x}') = t} h(\vec{x}')} \quad \longrightarrow \text{no dependence on } \theta \end{align*} \]
\(\implies T\) is sufficient for \(\theta\).
Now the other direction: sufficiency \(\implies\) factorization. That is, if \(T\) is sufficient, we can factor \(f\) into \(g\) and \(h\).
If \(T\) is sufficient, then \(P(\vec{X} = \vec{x} \mid T = t)\) does not depend on \(\theta\).
Define \(g(t \mid \theta) = P(T = t \mid \theta)\) and, writing \(t = T(\vec x)\), define \(h(\vec x) = P(\vec{X} = \vec{x} \mid T = t)\), which does not depend on \(\theta\) by sufficiency. Then we have that \[ \begin{align*} f(\vec x \mid \theta) &= P(\vec{X} = \vec{x} \mid \theta) \\ &= P(\vec{X} = \vec{x},\, T = t \mid \theta) \quad \text{(since $\{\vec{X} = \vec{x}\} \subseteq \{T = t\}$ for $t = T(\vec x)$)}\\ &= P(\vec{X} = \vec{x} \mid T = t, \theta)\, P(T = t \mid \theta) \\ &= \underbrace{P(\vec{X} = \vec{x} \mid T = t)}_{h(\vec x)}\cdot \underbrace{P(T = t \mid \theta)}_{g(t \mid \theta)}, \end{align*} \] where the last equality holds because \(T\) is a sufficient statistic. \(\blacksquare\)
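To make the factorization concrete, here is a standard illustrative example (not needed for the proof): let \(X_1, \ldots, X_n\) be IID Bernoulli(\(\theta\)). Then \[ f(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum_i x_i}(1-\theta)^{\,n-\sum_i x_i}, \] which has the form \(g(T(\vec x) \mid \theta)\cdot h(\vec x)\) with \(T(\vec x) = \sum_{i=1}^{n} x_i\), \(g(t \mid \theta) = \theta^{t}(1-\theta)^{n-t}\), and \(h(\vec x) = 1\). By the theorem, \(T = \sum_i X_i\) is sufficient for \(\theta\).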
We now have a nice corollary to the theorem that says that all we need for the MLE is \(T\), and we don’t need the entire sample.
Corollary 1 If \(T\) is sufficient for \(\theta\), then the MLE \(\hat \theta\) is a function of \(T\).
Proof. Let \(X_1, \ldots, X_n\) be an IID sample with density function (or PMF) \(f(x\mid \theta)\). Then \[ \begin{align*} \mathrm{lik}(\theta) &= f(x_1, \ldots, x_n\mid \theta)\\ &= g(T(x_1, \ldots, x_n)\mid \theta)\cdot h(x_1, \ldots, x_n)\quad \text{(by the factorization theorem, since $T$ is sufficient)}\\ \end{align*} \] Now, in order to compute the MLE, we take logs (and define \(\ell(\theta)\)), differentiate, etc. When we take logs and differentiate with respect to \(\theta\), the \(h\)-term vanishes, and we are left with a function of \(T\): \[ \begin{align*} \ell(\theta) &= \log\bigl(g(T(x_1, \ldots, x_n)\mid \theta)\cdot h(x_1, \ldots, x_n)\bigr)\\ &= \log g(T(x_1, \ldots, x_n)\mid \theta) +\log h(x_1, \ldots, x_n)\\ \implies \ell'(\theta) &= \frac{\partial}{\partial \theta}\log g(T(x_1, \ldots, x_n)\mid \theta) + 0\\ \end{align*} \] Thus, the maximizer of \(\ell(\theta)\) depends on the data only through \(T\). \(\blacksquare\)
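Continuing the Bernoulli illustration from above: there \(g(T \mid \theta) = \theta^{T}(1-\theta)^{n-T}\) and \(h \equiv 1\), so \[ \ell(\theta) = T\log\theta + (n-T)\log(1-\theta), \qquad \ell'(\theta) = \frac{T}{\theta} - \frac{n-T}{1-\theta} = 0 \;\implies\; \hat\theta = \frac{T}{n}, \] and indeed the MLE depends on the data only through \(T\).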
The Rao-Blackwell Theorem
So far, we have seen that MLEs are functions of sufficient statistics, and that if we have a sufficient statistic, we don’t need to store the entire sample. The main theorem we will discuss in this section, the Rao-Blackwell theorem, goes one step further. It says that an estimator should depend on the data only through a sufficient statistic; otherwise we can improve it (by finding an estimator with lower variance).
Theorem 2 Suppose that \(\hat\theta\) is an estimator for \(\theta\) with \(\mathbb{E}(\hat\theta^2) < \infty\). Assume that \(T\) is a sufficient statistic for \(\theta\). Define a new estimator \(\tilde \theta = \mathbb{E}(\hat\theta \mid T)\); because \(T\) is sufficient, this conditional expectation does not depend on \(\theta\), so \(\tilde\theta\) is a genuine estimator. Then: \[ MSE(\tilde\theta) = \mathbb{E}(\tilde\theta - \theta)^2 \le \mathbb{E}(\hat\theta - \theta)^2 = MSE(\hat\theta) \] That is, if we know a sufficient statistic \(T\) and we have an estimator \(\hat\theta\), we can define an estimator \(\tilde\theta\) for \(\theta\) whose MSE is no larger than that of \(\hat\theta\).
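As a numerical illustration of the theorem, here is a minimal simulation sketch; the Poisson setup, the parameter values, and the variable names below are chosen purely for illustration and are not from the notes. For an IID Poisson(\(\lambda\)) sample, \(T = \sum_i X_i\) is sufficient for \(\lambda\), and a crude unbiased estimator of \(\theta = P(X = 0) = e^{-\lambda}\) is \(\hat\theta = \mathbf{1}\{X_1 = 0\}\). Since \(X_1 \mid T = t \sim \mathrm{Binomial}(t, 1/n)\), Rao-Blackwellization gives \(\tilde\theta = \mathbb{E}(\hat\theta \mid T) = \left(\tfrac{n-1}{n}\right)^{T}\). The simulation compares their MSEs.

```python
import numpy as np

# Illustrative sketch (assumed example, not from the notes):
# X_1, ..., X_n IID Poisson(lam); T = sum(X_i) is sufficient for lam.
# Crude unbiased estimator of theta = P(X = 0) = exp(-lam):
#     theta_hat   = 1{X_1 = 0}
# Rao-Blackwellized estimator:
#     theta_tilde = E(theta_hat | T) = ((n - 1) / n) ** T
rng = np.random.default_rng(0)
lam, n, reps = 2.0, 10, 100_000
theta = np.exp(-lam)                              # true value being estimated

samples = rng.poisson(lam, size=(reps, n))        # reps independent samples of size n
theta_hat = (samples[:, 0] == 0).astype(float)    # crude estimator, one per sample
T = samples.sum(axis=1)                           # sufficient statistic, one per sample
theta_tilde = ((n - 1) / n) ** T                  # Rao-Blackwellized estimator

print("MSE(theta_hat)   =", np.mean((theta_hat - theta) ** 2))
print("MSE(theta_tilde) =", np.mean((theta_tilde - theta) ** 2))  # much smaller
```

Both estimators are unbiased here, so the MSE reduction is exactly the variance reduction promised by the theorem.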