The Bayesian Approach to Estimation

Introduction

So far, this class has focused on the frequentist approach, one of the two major approaches to statistical inference. The other is Bayesian inference.

Reviewing the Frequentist Approach

  • Parameters are fixed (but unknown) quantities.
  • Any probabilistic statements we make are about the (sampled) data — that is, over all possible samples.
  • Probability refers to limits of relative frequencies, which means our statistical procedures, such as confidence intervals, should have well-defined long-run frequency properties. For example: over repeated samples, a 95% CI procedure is “successful” (it captures the true parameter) about 95% of the time.
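The long-run reading of a 95% CI can be checked by simulation. The sketch below is only an illustration under assumptions not in these notes: normal data with known \(\sigma\), the usual z-interval for the mean, and arbitrary values of `mu`, `sigma`, `n`, and `reps`.

```python
import math
import random

def ci_coverage(mu=0.0, sigma=1.0, n=30, reps=10_000, seed=0):
    """Fraction of 95% z-intervals for the mean that contain the true mu.

    Assumes normal data with known sigma (an illustrative choice).
    """
    rng = random.Random(seed)
    half_width = 1.96 * sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        # Draw a fresh sample and compute its mean.
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        # The interval "succeeds" if it covers the fixed, true mu.
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / reps

coverage = ci_coverage()
print(f"empirical coverage: {coverage:.3f}")
```

Note that \(\mu\) never changes inside the loop. Only the sample, and therefore the interval, is random, which is exactly the frequentist viewpoint.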

The Bayesian Approach

In the Bayesian approach, probability is not a limiting frequency. Instead, it describes a “degree of belief.”

One way to make this concrete is to put your money where your mouth is, for example on a prediction market like Polymarket. Markets aggregate beliefs and assign probabilities to uncertain events:

  • Arsenal will beat Atlético Madrid in the 2nd leg
  • Bayern will beat PSG (57% is the probability; 26% is the chance of a draw; and 19% is the probability of a Bayern win)
  • 51% chance Democrats take the Senate
  • 84% chance Democrats take the House

This framework works for all kinds of situations, but it is subjective — my beliefs may differ from yours.

We use this subjective notion of probability to establish a framework for estimation.

The Bayesian Estimation Framework

The main steps are:

  1. Prior distribution \(f_\Theta(\theta)\): State our prior belief about a parameter before seeing any data.
  2. Likelihood \(f_{X|\Theta}(x|\theta)\): Observe data \(X\) and compute the likelihood.
  3. Posterior distribution \(f_{\Theta|X}(\theta|x)\): Update our beliefs based on the data, giving us a new distribution called the posterior.

This updating is performed via Bayes’ Theorem.

What prior should we use? Ideally, one that expresses our actual beliefs. In practice, when we have no strong prior information, a common default is the uniform prior.

Example 1 (Coin Tossing) We toss a coin \(n = 20\) times and count the number of heads \(X\). Let \(\theta = P(\text{heads})\).

The two approaches treat \(\theta\) differently:

  • Frequentist: \(\theta\) is fixed but unknown. Figure out the distribution of \(X\) given \(\theta\), and use this for inference and estimation of \(\theta\).
  • Bayesian: \(\theta\) is unknown, so treat it as a random variable \(\Theta\). Give it a prior distribution based on current beliefs, then update with the data.

Setting Up the Prior

Since we have no prior knowledge of the coin’s bias, we use the uniform prior. Because we now treat the parameter as a random variable, we denote it with a capital letter, \(\Theta\):

\[ \Theta \sim \text{Unif}[0, 1], \qquad \text{so } f_\Theta(\theta) = 1, \quad 0 \le \theta \le 1. \]

Setting Up the Likelihood

For a given value \(\theta \in [0,1]\), the conditional distribution of \(X\) given \(\Theta = \theta\) is Binomial:

\[ X \mid \Theta = \theta \;\sim\; \text{Bin}(n,\, \theta). \]

Note we have two random variables: \(X\) is discrete (Binomial) and \(\Theta\) is continuous (Uniform). The conditional PMF of \(X\) given \(\Theta = \theta\) is

\[ f_{X|\Theta}(x|\theta) = P(X = x \mid \Theta = \theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}. \]
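As a rough sketch, the conditional PMF can be coded directly from the formula. The helper name `binom_pmf`, the observed count `x = 13`, and the particular \(\theta\) values are illustrative choices, not part of the derivation.

```python
import math

def binom_pmf(x, n, theta):
    """P(X = x | Theta = theta) for X | Theta = theta ~ Bin(n, theta)."""
    return math.comb(n, x) * theta**x * (1 - theta) ** (n - x)

n = 20

# For a fixed theta, the PMF sums to 1 over x = 0, ..., n.
total = sum(binom_pmf(x, n, 0.3) for x in range(n + 1))
print(f"sum over x: {total:.6f}")

# For fixed data (here x = 13, an illustrative value), the same formula
# read as a function of theta is the likelihood.
likelihood = {theta: binom_pmf(13, n, theta) for theta in (0.3, 0.5, 0.65, 0.8)}
for theta, value in likelihood.items():
    print(f"theta = {theta:.2f}  likelihood = {value:.4f}")
```

The same expression plays two roles: a PMF in \(x\) when \(\theta\) is held fixed, and a likelihood in \(\theta\) when \(x\) is held fixed. As a function of \(\theta\) it does not need to integrate to 1.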

Joint Density

The joint density of \((X, \Theta)\) is

\[ f_{X,\Theta}(x,\theta) = f_{X|\Theta}(x|\theta)\cdot f_\Theta(\theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x} \cdot 1, \]

for \(x \in \{0, 1, \ldots, n\}\) and \(\theta \in [0,1]\).

Marginal Distribution of \(X\)

To find the marginal PMF of \(X\), we integrate out \(\theta\) (keep in mind that what we ultimately want is \(f_{\Theta|X}\)):

\[ f_X(x) = \int_0^1 f_{X,\Theta}(x,\theta)\,d\theta = \int_0^1 \binom{n}{x}\theta^x(1-\theta)^{n-x}\,d\theta = \binom{n}{x}\int_0^1 \theta^x(1-\theta)^{n-x}\,d\theta. \tag{$\star$} \]

We evaluate this using the Beta distribution. Recall that the Beta density on \([0,1]\), for \(t \in [0,1]\) and \(a, b > 0\), is

\[ g(t) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,t^{a-1}(1-t)^{b-1}, \qquad \int_0^1 g(t)\,dt = 1, \]

which implies

\[ \int_0^1 t^{a-1}(1-t)^{b-1}\,dt = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)}. \]
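The Beta integral identity can be sanity-checked numerically. The sketch below compares a midpoint-rule approximation against the Gamma-function formula. The values `a = 3.0`, `b = 4.5` and the number of steps are arbitrary choices for illustration.

```python
import math

def beta_integral_numeric(a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of t^(a-1) (1-t)^(b-1) over [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h  # midpoint of the i-th subinterval
        total += t ** (a - 1) * (1 - t) ** (b - 1)
    return total * h

def beta_integral_gamma(a, b):
    """Closed form Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

a, b = 3.0, 4.5
numeric = beta_integral_numeric(a, b)
exact = beta_integral_gamma(a, b)
print(f"numeric = {numeric:.8f}, gamma formula = {exact:.8f}")
```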

Comparing with (\(\star\)), we match exponents by setting \(a - 1 = x\) and \(b - 1 = n - x\), i.e., \(a = x+1\) and \(b = n-x+1\):

\[ f_X(x) = \binom{n}{x}\cdot\frac{\Gamma(x+1)\,\Gamma(n-x+1)}{\Gamma(n+2)}. \]

Since \(\Gamma(k+1) = k!\) for non-negative integers \(k\),

\[ f_X(x) = \frac{n!}{x!\,(n-x)!}\cdot\frac{x!\,(n-x)!}{(n+1)!} = \frac{1}{n+1}, \qquad x = 0, 1, \ldots, n. \]

Therefore, if \(\Theta \sim \text{Unif}[0,1]\), then \(X\) follows a discrete uniform distribution on \(\{0, 1, \ldots, n\}\).
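This result is easy to check by simulation: draw \(\Theta\) from the uniform prior, then draw \(X\) given \(\Theta\), and tabulate \(X\). The sketch below uses only the standard library. The number of replications and the seed are arbitrary.

```python
import random
from collections import Counter

def simulate_marginal(n=20, reps=50_000, seed=1):
    """Empirical PMF of X when Theta ~ Unif[0, 1] and X | Theta ~ Bin(n, Theta)."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(reps):
        theta = rng.random()  # Theta ~ Unif[0, 1]
        # X | Theta = theta is a sum of n Bernoulli(theta) tosses.
        x = sum(rng.random() < theta for _ in range(n))
        counts[x] += 1
    return [counts[x] / reps for x in range(n + 1)]

freqs = simulate_marginal()
for x, f in enumerate(freqs):
    print(f"x = {x:2d}  frequency = {f:.4f}")
print(f"target 1/(n+1) = {1 / 21:.4f}")
```

Each of the 21 frequencies should land close to \(1/21\), up to simulation noise.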

Posterior Distribution via Bayes’ Theorem

We apply Bayes’ Theorem to find \(f_{\Theta|X}(\theta|x)\):

\[ f_{\Theta|X}(\theta|x) = \frac{f_{X,\Theta}(x,\theta)}{f_X(x)} = \frac{f_{X|\Theta}(x|\theta)\cdot f_\Theta(\theta)}{f_X(x)}. \]

Substituting:

\[ f_{\Theta|X}(\theta|x) = \frac{\dbinom{n}{x}\theta^x(1-\theta)^{n-x}\cdot 1}{\dfrac{1}{n+1}} = \frac{(n+1)!}{x!\,(n-x)!}\,\theta^x(1-\theta)^{n-x}. \]

We can rewrite this in terms of Gamma functions:

\[ f_{\Theta|X}(\theta|x) = \frac{\Gamma(n+2)}{\Gamma(x+1)\,\Gamma(n-x+1)}\,\theta^{(x+1)-1}(1-\theta)^{(n-x+1)-1}, \]

which is exactly the \(\text{Beta}(x+1,\, n-x+1)\) density.

Posterior Distribution

\[ \Theta \mid X = x \;\sim\; \text{Beta}(x+1,\; n-x+1) \]

Numerical example: Suppose \(n = 20\) and we observe \(x = 13\) heads. Starting from a Uniform prior:

\[ \text{Prior: } \Theta \sim \text{Unif}[0,1]; \qquad \text{Posterior: } \Theta \mid X = 13 \;\sim\; \text{Beta}(14,\, 8). \]

(Note: \(n - x + 1 = 20 - 13 + 1 = 8\).)
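As a quick check on the numerical example, the sketch below evaluates the \(\text{Beta}(14, 8)\) density in two ways: from the Gamma-function form and from the factorial form obtained in the derivation. The helper `beta_pdf` and the evaluation point `theta = 0.6` are illustrative choices.

```python
import math

def beta_pdf(theta, a, b):
    """Beta(a, b) density at theta in (0, 1), computed on the log scale."""
    log_const = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(
        log_const + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
    )

n, x = 20, 13
a, b = x + 1, n - x + 1  # posterior parameters: Beta(14, 8)

# Normalizing constant (n+1)! / (x! (n-x)!) from the derivation.
const = math.factorial(n + 1) // (math.factorial(x) * math.factorial(n - x))

theta = 0.6  # arbitrary evaluation point
from_beta = beta_pdf(theta, a, b)
from_derivation = const * theta**x * (1 - theta) ** (n - x)

print(f"posterior: Beta({a}, {b}), constant = {const}")
print(f"density at {theta}: {from_beta:.6f} vs {from_derivation:.6f}")
```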

Summary: The Bayesian Approach

Bayesian Estimation — General Framework
  • Treat the unknown parameter \(\theta\) as the value of some random variable \(\Theta\).
  • Define a prior density \(f_\Theta(\theta)\).
  • Adjust \(f_\Theta(\theta)\) based on observed data \(X\) to get the posterior density \(f_{\Theta|X}(\theta|x)\).

The key relationships are:

\[ f_{X,\Theta}(x,\theta) = f_{X|\Theta}(x|\theta)\cdot f_\Theta(\theta) \]

\[ f_X(x) = \int f_{X,\Theta}(x,\theta)\,d\theta = \int f_{X|\Theta}(x|\theta)\,f_\Theta(\theta)\,d\theta \qquad \text{(marginal likelihood)} \]

\[ f_{\Theta|X}(\theta|x) = \frac{f_{X,\Theta}(x,\theta)}{f_X(x)} = \frac{f_{X|\Theta}(x|\theta)\cdot f_\Theta(\theta)}{\displaystyle\int f_{X|\Theta}(x|\theta)\,f_\Theta(\theta)\,d\theta} \]

Since \(f_X(x)\) is a constant with respect to \(\theta\), we have the fundamental result:

\[ \boxed{f_{\Theta|X}(\theta|x) \;\propto\; f_{X|\Theta}(x|\theta)\cdot f_\Theta(\theta)} \]

\[ \text{posterior} \;\propto\; \text{likelihood} \times \text{prior} \]

(Rice 2006; Pimentel 2024)
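The proportionality form suggests a simple numerical recipe: evaluate likelihood times prior on a grid of \(\theta\) values, then normalize so the result integrates to 1. The following is a minimal sketch of that idea, not a method taken from the cited sources. The function name `grid_posterior` and the grid size `m` are assumptions for illustration.

```python
def grid_posterior(likelihood, prior, m=1000):
    """Approximate the posterior density on a midpoint grid over [0, 1].

    likelihood and prior are functions of theta. Constant factors in
    either one cancel in the normalization step.
    """
    h = 1.0 / m
    grid = [(i + 0.5) * h for i in range(m)]
    unnorm = [likelihood(t) * prior(t) for t in grid]
    z = sum(unnorm) * h  # numerical stand-in for the normalizing integral
    return grid, [u / z for u in unnorm]

n, x = 20, 13
grid, post = grid_posterior(
    # The binomial coefficient is dropped: it is constant in theta.
    likelihood=lambda t: t**x * (1 - t) ** (n - x),
    prior=lambda t: 1.0,  # Unif[0, 1] prior
)

# Compare with the exact Beta(14, 8) density at one grid point.
i = 600
exact = 1627920 * grid[i] ** 13 * (1 - grid[i]) ** 7
print(f"theta = {grid[i]:.4f}: grid = {post[i]:.4f}, exact = {exact:.4f}")
```

For the coin example the grid values should match the \(\text{Beta}(14, 8)\) density closely. The same recipe applies to priors for which the normalizing integral has no closed form, although the accuracy then depends on the grid.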

References

Pimentel, Sam. 2024. “STAT 135 Lecture Slides.” Lecture slides (shared privately).
Rice, John A. 2006. Mathematical Statistics and Data Analysis. 3rd ed. Duxbury Press.