Sufficiency and the Factorization Theorem

Introduction

Let’s think about the following problem: We have two observed random variables \(X_1, X_2 \overset{\text{iid}}{\sim} \text{Bernoulli}(p)\), where \(p\) is unknown. Statistician A is told the observed values of both \(X_1\) and \(X_2\) and computes an MLE, \(\hat{p}_A\). Statistician B is told only the value of \(T = X_1 + X_2\) and computes an MLE, \(\hat{p}_B\). Which estimator will be larger? What is your guess?

First, let’s check what Statistician A computes:

\(X_1, X_2 \overset{\text{iid}}{\sim} \text{Bernoulli}(p)\)

\[ \begin{align*} \text{lik}(p) &= f(x_1, x_2 \mid p)\\ &= p^{x_1+x_2}(1-p)^{2-(x_1+x_2)} \\ \Rightarrow \ell(p) &= (x_1+x_2)\log p + (2-x_1-x_2)\log(1-p) \end{align*} \] Differentiating with respect to \(p\) gives: \[ \ell'(p) = \frac{x_1+x_2}{p} - \frac{2-(x_1+x_2)}{1-p} = 0 \implies \hat{p}_A = \frac{x_1+x_2}{2} \] Therefore, the maximum likelihood estimator is given by \(\dfrac{X_1+X_2}{2}\).

Now let’s consider Statistician B:

\(T = X_1 + X_2 \sim \text{Bin}(2, p) \implies P(T=t) = \displaystyle\binom{2}{t}p^t(1-p)^{2-t}\), for \(t = 0,1,2\).

Since there is only one observed value, we have \[ \begin{align*} \mathrm{lik}(p) &= \displaystyle\binom{2}{t}p^t(1-p)^{2-t}\\ \implies \ell(p) &= \log\binom{2}{t} + t \log p +(2-t)\log(1-p)\\ \implies \ell'(p) &= \frac{t}{p} - \frac{2-t}{1-p} \end{align*} \] Setting \(\ell'(p) = 0\) and solving gives \(\hat p_B = \dfrac{t}{2} = \dfrac{x_1+x_2}{2}\)! Thus we see that both statisticians would get the same estimator \(\dfrac{X_1+X_2}{2} =\dfrac{T}{2}\). So \(T = X_1+X_2\) had all the information we needed, and we didn’t need the original sample.
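As a quick numerical sanity check of the two derivations, here is a minimal Python sketch that maximizes both log-likelihoods over a grid of \(p\) values. The function names and the grid are illustrative choices, not part of the notes. Because the grid excludes the endpoints 0 and 1 to avoid \(\log 0\), the boundary cases \(t = 0\) and \(t = 2\) land on the nearest grid point rather than exactly on 0 or 1.

```python
import math

def loglik_full(p, x1, x2):
    """Log-likelihood seen by Statistician A (both Bernoulli draws observed)."""
    s = x1 + x2
    return s * math.log(p) + (2 - s) * math.log(1 - p)

def loglik_total(p, t):
    """Log-likelihood seen by Statistician B (only T ~ Bin(2, p) observed)."""
    return math.log(math.comb(2, t)) + t * math.log(p) + (2 - t) * math.log(1 - p)

# Grid over the open interval (0, 1); an assumption made for illustration.
GRID = [k / 1000 for k in range(1, 1000)]

def argmax_on_grid(f):
    """Return the grid point at which f is largest."""
    return max(GRID, key=f)

def compare_mles():
    """For each possible sample, return ((x1, x2), grid MLE of A, grid MLE of B)."""
    rows = []
    for x1 in (0, 1):
        for x2 in (0, 1):
            p_a = argmax_on_grid(lambda p: loglik_full(p, x1, x2))
            p_b = argmax_on_grid(lambda p: loglik_total(p, x1 + x2))
            rows.append(((x1, x2), p_a, p_b))
    return rows

for sample, p_a, p_b in compare_mles():
    print(sample, p_a, p_b)
```

The two log-likelihoods differ only by the constant \(\log\binom{2}{t}\), which is why the maximizers coincide for every sample.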

Let’s check the conditional joint probability of \(X_1\) and \(X_2\) given \(T\).

\[ \begin{align*} P(X_1 = x_1,\, X_2 = x_2 \mid T = t) &= \frac{P(X_1 = x_1,\, X_2 = x_2,\, T = t)}{P(T = t)}\\ &= \frac{P(X_1 = x_1,\, X_2 = x_2)}{P(T = t)}\\ &= \frac{p^{x_1+x_2}(1-p)^{2-(x_1+x_2)}}{\dbinom{2}{t} p^t (1-p)^{2-t}}\\ &= \frac{1}{\dbinom{2}{t}} \qquad \leftarrow \text{No } p\text{, only } t! \end{align*} \] This holds for \(x_1 + x_2 = t\); if \(x_1 + x_2 \neq t\), the conditional probability is 0.

The conditional joint distribution of \(X_1, X_2\) given \(T = t\) does not depend on \(p\), just on \(t\). The precise values of \(X_1\) and \(X_2\) are not needed, since the probability depends only on the total: all arrangements of the \(X_i\) with the same total have the same probability. All we need is how many of the \(X_i\)’s were 1, that is, the value of \(T\).
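The claim can also be checked by brute force. The sketch below, with illustrative function names and exact rational arithmetic from Python's `fractions` module, enumerates all four outcomes and computes the conditional probability given the total for a few arbitrarily chosen values of \(p\).

```python
from fractions import Fraction
from itertools import product
from math import comb

def joint_pmf(x1, x2, p):
    """P(X1 = x1, X2 = x2) for two iid Bernoulli(p) draws."""
    s = x1 + x2
    return p**s * (1 - p)**(2 - s)

def conditional_given_total(x1, x2, p):
    """P(X1 = x1, X2 = x2 | T = x1 + x2), by enumerating outcomes with that total."""
    t = x1 + x2
    p_t = sum(joint_pmf(y1, y2, p)
              for y1, y2 in product((0, 1), repeat=2) if y1 + y2 == t)
    return joint_pmf(x1, x2, p) / p_t

def conditional_table(p):
    """Conditional probability of every outcome given its own total."""
    return {(x1, x2): conditional_given_total(x1, x2, p)
            for x1, x2 in product((0, 1), repeat=2)}

# The value 1 / C(2, t) derived above, for comparison.
expected = {xs: Fraction(1, comb(2, sum(xs))) for xs in product((0, 1), repeat=2)}

# Three arbitrary values of p strictly between 0 and 1.
tables = [conditional_table(Fraction(k, 10)) for k in (1, 5, 9)]
print(all(table == expected for table in tables))
```

Using `Fraction` instead of floats makes the comparison exact, so equality across different \(p\) is not an artifact of rounding.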

\[\implies \hat{p}_B = \frac{t}{2} = \hat{p}_A\]

We did not need the values of \(X_1, X_2\). The statistic \(T\) was sufficient for our inference, and is an example of a sufficient statistic for \(p\).

Definition 1 Let \(X_1, \ldots, X_n \overset{\text{iid}}{\sim} f(x \mid \theta)\). We say a statistic \(T = T(X_1, \ldots, X_n)\) is sufficient for \(\theta\) if the conditional distribution of \(X_1, \ldots, X_n\) given \(T\) does not depend on \(\theta\), for any \(t\).

This means that \(T\) contains all the information about \(\theta\) that’s in the sample. We can use \(T\) to compute the MLE for \(\theta\) and do not need to store all the \(X_i\)’s. This is very useful if, for example, the \(X_i\)’s are high-dimensional and expensive to store.

Note that a sample statistic is a function of the sample only, so it can be computed from the observed data. For example, \(\bar{X}\) is a statistic, while \(\bar{X} - \mu\) is not when \(\mu\) is an unknown parameter.

Intuition: \(T = T(X_1, \ldots, X_n)\) is a sufficient statistic for \(\theta\) if the statistician who knows only \(T\) can do just as good a job of estimating \(\theta\) as the statistician who knows the entire sample \(X_1, X_2, \ldots, X_n\). So the statistic \(T = T(X_1, \ldots, X_n)\) is sufficient for the purpose of inference.

Further, once the value \(T = t\) is fixed, that is, once we know the set \(\{T = t\}\) of samples \((x_1, \ldots, x_n)\) with \(T(x_1, \ldots, x_n) = t\), no other statistic \(W = W(X_1, \ldots, X_n)\) can give us any additional information about \(\theta\). \(T\) has exhausted or used up all the sample information about \(\theta\).

Example 1 \(X_1, \ldots, X_n \overset{\text{iid}}{\sim} \text{Bernoulli}(\theta)\), \(T = \sum_{i=1}^n X_i\). Show \(T\) is sufficient for \(\theta\).

Say \(n = 2\):

\[ \begin{array}{ccc} (x_1, x_2) & T = \sum X_i & f(x_1, x_2 \mid T, \theta) \\ \hline (0, 0) & 0 & 1 \big/ \binom{2}{0} = 1 \\ (0, 1) & 1 & 1 \big/ \binom{2}{1} = \tfrac{1}{2} \\ (1, 0) & 1 & \tfrac{1}{2} \\ (1, 1) & 2 & 1 \big/ \binom{2}{2} = 1 \end{array} \]

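There is no dependence on \(\theta\). If I know \(T=0\) or \(T=2\), then I know exactly what \(X_1\) and \(X_2\) must be; if I know \(T=1\), the two arrangements \((0,1)\) and \((1,0)\) are equally likely whatever \(\theta\) is.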

Now consider a different statistic \(W = W(X_1, X_2) = X_1\), and suppose \((x_1, x_2) = (0, 0)\). This implies that \(W = 0\).

\[ \begin{align*} f(x_1, x_2 \mid W, \theta) &= \frac{f(x_1, x_2, W \mid \theta)}{f(W \mid \theta)} = \frac{f(x_1, x_2 \mid \theta)}{f(W \mid \theta)} \\[6pt] \implies f(0, 0 \mid W, \theta) &= \frac{(1-\theta)^2}{P(W=0 \mid \theta)} = \frac{(1-\theta)^2}{(1-\theta)^2 + \theta(1-\theta)} \\[6pt] &= \frac{1-\theta}{1-\theta+\theta} = 1 - \theta \end{align*} \]

which is still a function of \(\theta\), so we see that \(W\) is not a sufficient statistic.
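To see the dependence on \(\theta\) concretely, the following sketch evaluates the same conditional probability for a few values of \(\theta\). The function names and the chosen \(\theta\) values are illustrative assumptions.

```python
from fractions import Fraction

def joint_pmf(x1, x2, theta):
    """P(X1 = x1, X2 = x2) for two iid Bernoulli(theta) draws."""
    return theta**(x1 + x2) * (1 - theta)**(2 - x1 - x2)

def conditional_given_first(x1, x2, theta):
    """P(X1 = x1, X2 = x2 | W = x1), where W = X1."""
    # P(W = x1) is obtained by summing the joint pmf over both values of X2.
    p_w = sum(joint_pmf(x1, y2, theta) for y2 in (0, 1))
    return joint_pmf(x1, x2, theta) / p_w

# Arbitrary illustrative values of theta.
thetas = (Fraction(1, 4), Fraction(1, 2), Fraction(3, 4))
values = {theta: conditional_given_first(0, 0, theta) for theta in thetas}
for theta, value in values.items():
    print(theta, value)
```

Each value equals \(1 - \theta\), so the conditional distribution given \(W\) changes with \(\theta\), in contrast with conditioning on \(T\).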

In general, following the same computation that we did for \(n=2\), we get: \[ f(x_1, x_2, \ldots, x_n \mid T, \theta) = \frac{f(x_1, x_2, \ldots, x_n \mid \theta)}{f(T \mid \theta)} = \frac{\theta^{\sum x_i}(1-\theta)^{n - \sum x_i}}{\dbinom{n}{t} \theta^t (1-\theta)^{n-t}} = \frac{1}{\dbinom{n}{t}} \leftarrow \text{no dependence on } \theta \]
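One way to read this result: a statistician who knows only \(T\) could rebuild a sample with the same distribution as the original one, without knowing \(\theta\), by placing the \(t\) ones uniformly at random among the \(n\) positions. The sketch below checks this identity exactly for a small \(n\). The names `regenerated_pmf` and `same_distribution`, and the particular \(n\) and \(\theta\), are hypothetical choices made for illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb

def joint_pmf(xs, theta):
    """P(X_1 = x_1, ..., X_n = x_n) for iid Bernoulli(theta) draws."""
    s = sum(xs)
    return theta**s * (1 - theta)**(len(xs) - s)

def binomial_pmf(t, n, theta):
    """P(T = t) for T ~ Bin(n, theta)."""
    return comb(n, t) * theta**t * (1 - theta)**(n - t)

def regenerated_pmf(xs, theta):
    """P(X' = xs) when T ~ Bin(n, theta) is drawn first and X' is then chosen
    uniformly among the C(n, t) arrangements with total t."""
    n, t = len(xs), sum(xs)
    return binomial_pmf(t, n, theta) * Fraction(1, comb(n, t))

def same_distribution(n, theta):
    """True if the regenerated sample has the same pmf as the original sample."""
    return all(regenerated_pmf(xs, theta) == joint_pmf(xs, theta)
               for xs in product((0, 1), repeat=n))

print(same_distribution(4, Fraction(3, 10)))
```

The second stage of the construction uses only \(t\), never \(\theta\), which is exactly what the \(1/\binom{n}{t}\) formula allows.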

Factorization Theorem

This is great, but how do we figure out which statistic might be sufficient for \(\theta\)? There is a convenient factorization theorem to help (hinted at in the example above).

Theorem 1 A necessary and sufficient condition for \(T\) to be a sufficient statistic for \(\theta\) is that there exist functions \(g\) and \(h\) such that:

\[ f(x_1, \ldots, x_n \mid \theta) = g\!\left(T(x_1, \ldots, x_n) \mid \theta\right) \cdot h(x_1, \ldots, x_n) \]

That is: the joint density or mass function can be factored into a product of 2 functions \(g\) and \(h\) such that one factor (\(h\)) does not depend on \(\theta\) and is only a function of the \(x_i\)’s, and the other (\(g\)) depends on \(\theta\) and on the \(x_i\)’s only through \(T(x_1, \ldots, x_n)\).

If \(T\) is sufficient, we can factorize the PDF/PMF into a product of such functions; and if we can factor the PDF/PMF thus, then \(T\) is sufficient.
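For intuition, here is a sketch of why the factorization implies sufficiency in the discrete case. The shorthand \(x = (x_1, \ldots, x_n)\) and \(X = (X_1, \ldots, X_n)\) is introduced only for this sketch. Suppose \(T(x) = t\) and \(P(T = t \mid \theta) > 0\). Summing the joint mass function over all samples \(y\) with \(T(y) = t\) gives \(P(T = t \mid \theta)\), so

\[ \begin{align*} P(X = x \mid T = t, \theta) &= \frac{P(X = x \mid \theta)}{P(T = t \mid \theta)} = \frac{g(t \mid \theta)\, h(x)}{\sum_{y:\, T(y) = t} g(t \mid \theta)\, h(y)} \\[6pt] &= \frac{h(x)}{\sum_{y:\, T(y) = t} h(y)} \end{align*} \]

The factor \(g(t \mid \theta)\) is the same in every term of the sum, so it cancels, and what remains does not involve \(\theta\).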

We can therefore look for sufficient statistics in either of two ways:

  • Using the definition: computing \(P(X_1 = x_1, \ldots, X_n = x_n \mid T(X_1, \ldots, X_n) = t)\) and showing that it does not depend on \(\theta\) (which is what we did above), or
  • Using the theorem: showing that the density (or mass function) can be factored into \(f(x_1, \ldots, x_n \mid \theta) = g(T(x_1, \ldots, x_n) \mid \theta) \cdot h(x_1, \ldots, x_n)\).

What sufficiency implies is that: \[ f(x_1, \ldots, x_n \mid T, \theta) = f(x_1, \ldots, x_n \mid T) \]

Back to our Bernoulli Example 1. What are \(h\) and \(g\)?

\[ \begin{align*} P(X_1 = x_1, \ldots, X_n = x_n \mid \theta) &= \prod_{i=1}^n P(X_i = x_i \mid \theta) \\ &= \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} \\ &= \theta^{\sum x_i}(1-\theta)^{n - \sum x_i} \\ &= \theta^{T(x_1,\ldots,x_n)}(1-\theta)^{n - T(x_1,\ldots,x_n)} \\ &= \underbrace{\theta^{T(x_1,\ldots,x_n)}(1-\theta)^{n-T(x_1,\ldots,x_n)}}_{\large{g(T(x_1,\ldots,x_n)\mid\theta)}} \cdot \underbrace{1}_{\large{h(x_1,\ldots,x_n)}}, \end{align*} \]

where \(T = \sum x_i\).

Example 2 \(X_1, X_2, \ldots, X_n \overset{\text{iid}}{\sim} \text{Poisson}(\lambda)\)

\[ P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots \]

Let \(\theta = e^{-\lambda}\). Find a sufficient statistic for \(\theta\).

Note: if \(\theta = e^{-\lambda}\), then \(\lambda = -\log\theta\).

\[ \begin{align*} P(X_1 = x_1, \ldots, X_n = x_n) &= \prod_{i=1}^n \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \prod_{i=1}^n \frac{(-\log\theta)^{x_i} \cdot \theta}{x_i!} \\[6pt] &= \theta^n \cdot \frac{(-\log\theta)^{\sum x_i}}{\prod_{i=1}^n x_i!} \end{align*} \]

Note that \(-\log\theta = \lambda > 0\), and that \((-\log\theta)^{x_i} = (-1)^{x_i}(\log\theta)^{x_i}\), so the sign cannot be pulled out as a single factor of \(-1\) per observation; we keep \(-\log\theta\) intact.

Let \(T = \sum_{i=1}^n X_i\), so \(t = \sum_{i=1}^n x_i\). Then:

\[ f(x_1, \ldots, x_n \mid \theta) = \underbrace{\theta^n\, (-\log\theta)^t}_{\large{g(t \mid \theta)}} \cdot \underbrace{\dfrac{1}{\prod_{i=1}^n x_i!}}_{\large{h(x_1,\ldots,x_n)}} \]

Since \(f\) can be factorized, \(T\) is sufficient.
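A factorization like this can be checked numerically. The sketch below compares the joint Poisson mass function with the product \(g(t \mid \theta) \cdot h(x_1, \ldots, x_n)\), taking \(g(t \mid \theta) = \theta^n (-\log\theta)^t\) (which follows from \(\lambda = -\log\theta\)) and \(h = 1/\prod x_i!\). The sample and the \(\lambda\) values are made up for illustration.

```python
import math

def poisson_joint_pmf(xs, lam):
    """Joint pmf of iid Poisson(lam) observations xs."""
    return math.prod(lam**x * math.exp(-lam) / math.factorial(x) for x in xs)

def g(t, theta, n):
    """theta-dependent factor: theta^n * (-log theta)^t, with theta = exp(-lam)."""
    return theta**n * (-math.log(theta))**t

def h(xs):
    """Factor that depends on the data only: 1 / prod(x_i!)."""
    return 1 / math.prod(math.factorial(x) for x in xs)

def factorization_holds(xs, lam):
    """Check f(xs | theta) == g(sum(xs) | theta) * h(xs) up to rounding."""
    theta = math.exp(-lam)
    return math.isclose(poisson_joint_pmf(xs, lam), g(sum(xs), theta, len(xs)) * h(xs))

xs = (2, 0, 3, 1)  # hypothetical sample
print(all(factorization_holds(xs, lam) for lam in (0.5, 1.0, 2.5)))
```

A consequence worth noting: for two samples with the same total, the ratio of their joint probabilities is \(h\) of one divided by \(h\) of the other, which is free of \(\theta\).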

Example 3 Let \(X_1, \ldots, X_n\) be an IID sample with density

\[ f(x \mid \theta) = \begin{cases} \theta x^{\theta-1} & 0 < x < 1 \\ 0 & \text{otherwise} \end{cases} \]

and \(\theta > 0\). Find a sufficient statistic for \(\theta\).

\[ \begin{align*} f(x_1, \ldots, x_n \mid \theta) &= \prod_{i=1}^n f(x_i \mid \theta) = \prod_{i=1}^n \theta\, x_i^{\theta-1} \\[4pt] &= \theta^n \prod_{i=1}^n x_i^{\theta-1} = \theta^n \left(\prod_{i=1}^n x_i\right)^\theta \cdot \frac{1}{\prod_{i=1}^n x_i} \end{align*} \]

Let \(T = \prod_{i=1}^n X_i\), so \(t = \prod_{i=1}^n x_i\). Then:

\[ f(x_1, \ldots, x_n \mid \theta) = \underbrace{\theta^n\, t^\theta}_{\large{g(t \mid \theta)}} \cdot \underbrace{\dfrac{1}{\prod_{i=1}^n x_i}}_{\large{h(x_1,\ldots,x_n)}} \]

Since \(f\) can be factorized, \(T\) is sufficient.
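As a numerical check, and to illustrate that the MLE can be computed from \(T\) alone, the sketch below verifies the factorization on a made-up sample and maximizes \(g(t \mid \theta) = \theta^n t^\theta\) in closed form. Setting the derivative of \(n\log\theta + \theta\log t\) to zero gives \(\hat\theta = -n/\log t\); this step and the function names are additions for illustration, not taken from the notes.

```python
import math

def joint_density(xs, theta):
    """Joint density of iid draws from f(x | theta) = theta * x^(theta - 1) on (0, 1)."""
    return math.prod(theta * x**(theta - 1) for x in xs)

def g(t, theta, n):
    """theta-dependent factor: theta^n * t^theta, where t is the product of the data."""
    return theta**n * t**theta

def h(xs):
    """Factor that depends on the data only: 1 / prod(x_i)."""
    return 1 / math.prod(xs)

def mle_from_product(t, n):
    """Maximizer of n*log(theta) + theta*log(t); valid for 0 < t < 1."""
    return -n / math.log(t)

xs = (0.2, 0.5, 0.9, 0.7)  # hypothetical sample in (0, 1)
t, n = math.prod(xs), len(xs)

print(all(math.isclose(joint_density(xs, th), g(t, th, n) * h(xs))
          for th in (0.5, 1.0, 3.0)))
print(mle_from_product(t, n))
```

Because \(h\) does not involve \(\theta\), maximizing the full likelihood is the same as maximizing \(g\), so only \(t\) and \(n\) are needed.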

[Rice (2006); Pimentel (2024); Hogg, McKean, and Craig (2005); Wasserman (2004)]

References

Hogg, Robert V., Joseph W. McKean, and Allen T. Craig. 2005. Introduction to Mathematical Statistics. 6th ed. Upper Saddle River, NJ: Pearson Prentice Hall.
Pimentel, Sam. 2024. “STAT 135 Lecture Slides.” Lecture slides (shared privately).
Rice, John A. 2006. Mathematical Statistics and Data Analysis. 3rd ed. Duxbury Press.
Wasserman, Larry. 2004. All of Statistics: A Concise Course in Statistical Inference. New York: Springer.