MLEs for the Multinomial, and Large Sample Theory of the MLE
Introduction
We will continue with our exploration of the method of maximum likelihood estimation. We want to consider the properties of the estimator: Is it unbiased? Consistent? What can we say about the distribution of the estimator when the sample size is large?
But first, we are going to look at maximum likelihood estimation in a special, but important case.
The Multinomial Distribution
Recall the Multinomial distribution. It is a generalization of the Binomial distribution. A random variable with the \(\mathrm{Bin}(n,p)\) distribution counts the number of trials in which we see one particular outcome (that we call a “success”), when we have a fixed number (\(n\)) of trials, each with two possible outcomes, and a probability of success (\(p\)) that stays the same for each trial.
Now we can see that this setup does not cover the case of rolling a fair six-sided die, say, 20 times. If we are only interested in the number of sixes that we see (for example), then we can certainly model this using the Binomial distribution, with \(n=20\) trials, where a success is rolling a 6, so \(p = \dfrac{1}{6}\) and the probability of a failure is \(q = \dfrac{5}{6}\). We know how to compute these probabilities.
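To make this concrete, here is a minimal sketch (assuming SciPy is available) that computes such Binomial probabilities for the sixes example; the particular values of \(k\) below are purely illustrative.

```python
from scipy.stats import binom

n, p = 20, 1 / 6  # 20 rolls; success = rolling a 6

print(binom.pmf(3, n, p))  # P(exactly 3 sixes), about 0.238
print(binom.cdf(3, n, p))  # P(at most 3 sixes), about 0.567
```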
But what if we want to know the probability that \(x_1\) rolls land on 1, \(x_2\) rolls land on 2, …, and \(x_6\) rolls land on 6? The Binomial distribution does not cover the computation of these probabilities, but we just have to generalize the idea of the Binomial. In the Binomial setup, we have two possible outcomes for each trial, with probabilities \(p\) and \(q=1-p\). In the Multinomial setup, if we have \(m\) possible outcomes for each trial, we let \(p_1, p_2, \dots, p_m\) be the probabilities of the \(m\) outcomes, respectively, so that \(p_1+p_2 + \dots+ p_m = 1\), and the counts satisfy \(x_1 + x_2 + \dots + x_m = n\).
In the Binomial case, we define a random variable \(X\sim \mathrm{Bin}(n,p)\) as the number of successes in \(n\) trials, where the probability of success is \(p\). In the Multinomial setting, if we have \(m\) possible outcomes, we let \(X_i\) be the number of times we see outcome \(i\). Therefore, we have that \(X_1+X_2+\dots+X_m = n\). We can then compute the probability that we see the first outcome \(x_1\) times, the second \(x_2\) times, etc., which is the joint probability \(P(X_1 = x_1, X_2 = x_2, \dots, X_m = x_m)\).
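As a quick illustration of this setup, here is a small simulation sketch, assuming NumPy; the fair-die probabilities and the seed are illustrative choices, not part of the definition above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
p = np.full(6, 1 / 6)  # fair die: each face has probability 1/6

counts = rng.multinomial(n, p)  # one draw of (X_1, ..., X_6)
print(counts, counts.sum())     # the counts always sum to n
```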
Example: To make it more concrete, consider rolling a six-sided die ten times. What is the probability that we see the face ⚀ four times, the face ⚃ three times, and the face ⚅ three times?
If we let \(p_1, p_2 \dots, p_6\) be the probabilities of seeing the faces ⚀ ⚁ \(\dots\) ⚅ respectively, then we need: \[ P(X_1 = 4, X_2 = 0, X_3 = 0, X_4 = 3, X_5 = 0, X_6 = 3). \]
Note that \(X_1, \dots, X_m\) are not independent! (Why not?)
Check your answer
They are not independent because they need to sum to \(n\), which is the total number of rolls.
We can compute this probability as follows: \[ P(X_1 = 4, X_2 = X_3 = X_5 = 0, X_4 = 3, X_6 = 3) = p_1^4 \,p_2^0 \, p_3^0\, p_4^3\, p_5^0 \,p_6^3 \times \dfrac{10!}{4!0!0!3!0!3!} = p_1^4 \,p_4^3\, p_6^3 \times \dfrac{10!}{4!3!3!}. \] The term \(\dfrac{10!}{4!0!0!3!0!3!}\) is the Multinomial coefficient \(\displaystyle \binom{n}{x_1, x_2, \dots, x_m} = \dfrac{n!}{x_1!x_2!\dots x_m!}\), which counts the number of ways that \(n\) objects (the rolls) can be partitioned into \(m\) distinct groups (the faces rolled).
Simplifying the notation a bit, we can write the joint PMF of \(X_1, \dots, X_m\) as \[ f(x_1, \dots, x_m \vert p_1, \dots, p_m) = \dfrac{n!}{x_1!\dots x_m!} \prod_{i=1}^m p_i^{x_i} \]
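As a sanity check, here is a sketch (assuming SciPy) that evaluates this joint PMF for the dice example above, both directly from the formula and with `scipy.stats.multinomial`. Since the true \(p_i\) are unspecified in the example, we assume a fair die purely for illustration.

```python
from math import factorial, prod
from scipy.stats import multinomial

x = [4, 0, 0, 3, 0, 3]  # counts for faces 1 through 6
p = [1 / 6] * 6         # fair die, assumed for illustration
n = sum(x)              # n = 10 rolls

# Direct evaluation: n!/(x_1! ... x_m!) * prod_i p_i^{x_i}
coef = factorial(n) // prod(factorial(xi) for xi in x)
direct = coef * prod(pi**xi for pi, xi in zip(p, x))

print(direct)                        # about 6.95e-05
print(multinomial.pmf(x, n=n, p=p))  # the same value via SciPy
```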
Note that the marginal distributions of the \(X_i\) are \(\mathrm{Bin}(n, p_i)\).
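This fact can be checked empirically with a short simulation sketch (assuming NumPy and SciPy); the fair die and \(n = 10\) are illustrative choices.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(1)
n = 10
p = np.full(6, 1 / 6)  # fair die, for illustration

draws = rng.multinomial(n, p, size=100_000)  # 100,000 multinomial samples
x1 = draws[:, 0]                             # marginal counts of face 1

# Empirical frequencies of X_1 vs the Bin(n, 1/6) PMF
for k in range(5):
    print(k, (x1 == k).mean(), binom.pmf(k, n, 1 / 6))
```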
Computing the MLE of the multinomial likelihood function
We have a joint PMF as defined above, and we want to estimate the probabilities \(p_1, \dots, p_m\) using maximum likelihood. We can do this, even though the \(X_i\) are not independent. The lack of independence means that we will not be able to factor the joint PMF into a product of marginals. Also note that the \(p_i\)'s are not free to vary, because they are subject to the constraint that their sum is 1. This means that we need to maximize \(\mathrm{lik}(p_1, \dots, p_m)\) subject to the constraint that \(p_1 + \dots + p_m = 1\).
As we do often, we simplify the problem by taking logs of the likelihood function. Therefore, we need to maximize \(\ell(p_1, \dots, p_m) = \log \mathrm{lik}(p_1, \dots, p_m)\) subject to the constraint \(\displaystyle \sum_{i=1}^m p_i = 1\).
We will use Lagrange multipliers to do this. Define the function \(\mathcal{L}\) by: \[ \mathcal{L}(p_1, p_2, \dots, p_m, \lambda) = \ell(p_1, p_2, \dots, p_m) + \lambda(\sum_{i=1}^m p_i -1). \]
Writing out \(\ell\), we get: \[ \begin{align*} \mathcal{L}(p_1, p_2, \dots, p_m, \lambda) &= \log \mathrm{lik}(p_1, p_2, \dots, p_m) + \lambda(\sum_{i=1}^m p_i -1)\\ &= \log \left(\dfrac{n!}{x_1!\dots x_m!} \prod_{i=1}^m p_i^{x_i}\right) + \lambda(\sum_{i=1}^m p_i -1)\\ &= \log (n!) - \sum_{i=1}^m \log (x_i!) + \sum_{i=1}^m x_i \log p_i + \lambda(\sum_{i=1}^m p_i -1)\\ \end{align*} \]
Take partial derivatives with respect to each \(p_i, \; i = 1, \dots, m\). Remember that \(x_i\) and \(n\) are constants with respect to \(p_i\), as is \(\lambda\). \[ \dfrac{\partial \mathcal{L}}{\partial p_i} = \dfrac{x_i}{p_i} + \lambda \] Setting each of these partial derivatives equal to 0 gives us that for each \(i\), we have: \[ \dfrac{x_i}{\hat p_i} + \lambda = 0 \Rightarrow \hat p_i = -\dfrac{x_i}{\lambda}. \] We know that the sum of all the \(\hat p_i\)’s is equal to 1. Summing over \(i\) gives \[ 1 = \sum_{i=1}^m \hat p_i = -\dfrac{1}{\lambda}\sum_{i=1}^m x_i = -\dfrac{n}{\lambda}, \] so \(\lambda = -n\). Therefore, the maximum likelihood estimates of the \(p_i\)’s are given by \(\hat p_i = \dfrac{x_i}{n}\).
The corresponding statistics, that is, the maximum likelihood estimators, are \(\hat p_i = \dfrac{X_i}{n}\).
Using the numbers in our example above, we have that the maximum likelihood estimates are: \[ \hat p_1 = \dfrac{4}{10}, \; \hat p_2 = \hat p_3 = 0, \; \hat p_4 = \dfrac{3}{10}, \; \hat p_5 = 0,\; \hat p_6 = \dfrac{3}{10}. \]
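As a sanity check on the derivation, here is a sketch (assuming SciPy) that maximizes the log-likelihood numerically under the constraint \(\sum_i p_i = 1\) and compares the result to the closed form \(\hat p_i = x_i/n\). The SLSQP settings and the small lower bound on the \(p_i\) are illustrative choices to keep the logarithms finite.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import xlogy  # xlogy(0, 0) = 0, so zero counts are handled

x = np.array([4, 0, 0, 3, 0, 3])  # observed counts from the example
n = x.sum()

def neg_loglik(p):
    # Negative log-likelihood, up to the constant log n! - sum_i log x_i!,
    # which does not depend on p and so does not affect the maximizer.
    return -xlogy(x, p).sum()

res = minimize(
    neg_loglik,
    x0=np.full(6, 1 / 6),    # start from the fair die
    method="SLSQP",
    bounds=[(1e-9, 1)] * 6,  # keep p_i > 0 so the log stays finite
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1}],
)

print(res.x)   # approximately [0.4, 0, 0, 0.3, 0, 0.3]
print(x / n)   # the closed-form MLE
```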