Survey Sampling: Inference

Inference in Sampling

Inference involves using a sample to compute an estimate of a population parameter; the population should always be defined in terms of the context in which the results will be applied.

Estimator: The function (or algorithm) that maps sample data to a number.

Estimate: The actual observed value after applying the estimator on the sample (observed) data.

Then of course the question is: how good is our estimator, and therefore our estimate? We want \(\mu\), and we have the estimate \(\hat{\mu}\). How close is this estimate to the true value \(\mu\)? We need a measure of goodness of our estimator. Note that our estimator is random: each time we take a random sample, we will get a different value of the estimate. We want to know, on average, what the error of our estimator is. To compute this, we need to consider the sampling distribution of our estimator. This is just a special name for the probability distribution of the estimator, which is a random variable; the randomness of the estimator is rooted in the randomness of the sampling. The spread of this distribution, measured by its standard deviation, is one of the determinants of the accuracy of our estimator.

If we hit our target (the population parameter) on average, meaning that the expected value of our estimator is the population parameter, then we only need to consider how much the estimator's sampling distribution spreads about its mean. The tighter the spread (the smaller the standard deviation), the more accurate the estimator. Because we are measuring the error of an estimator, we call the square root of its variance the standard error rather than the standard deviation.

What if, though, the expected value of our estimator is not the target parameter? In this case the difference between the expected value of the estimator and the true value of the population parameter will also contribute to the error. Because of this, we use a measure of goodness that incorporates both the spread (the standard error) and the difference, on average, between the estimator and the parameter (we call this the bias). This measure is called the Mean Squared Error.

Mean Squared Error

Mean Squared Error: The mean squared error is the expected value of the squared difference between the estimator \(\hat{\theta}\) and the true value of the population parameter \(\theta\). We denote it by \(MSE\): \[ MSE = E\left[\left(\hat{\theta} - \theta \right)^2\right] \]

Bias: The bias of an estimator is the difference between its expected value and the true value of the population parameter: \[ \operatorname{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta \] We call an estimator unbiased if the bias is 0, that is if \(E(\hat{\theta}) = \theta\).

Exercise Show that \(MSE = \text{Variance} + \text{Bias}^2\).

Solution \[ \begin{align*} \mathrm{MSE}(\hat\theta) &= E\big[(\hat\theta - \theta)^2\big] \\ &= E\big(\hat\theta ^2\big) -2 \theta E\big(\hat\theta\big) + \theta^2\\ &= \mathrm{Var}\big(\hat\theta\big) + \big[E\big(\hat\theta\big)\big]^2 -2 \theta E\big(\hat\theta\big) + \theta^2\\ &= \mathrm{Var}\big(\hat\theta\big) + \big[E\big(\hat\theta\big) - \theta\big]^2\\ &= \mathrm{Var}(\hat\theta) + \big[\mathrm{Bias}(\hat\theta)\big]^2. \end{align*} \]
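To see the decomposition in action, here is a minimal simulation sketch in Python/NumPy (the normal population, sample size, and number of replications are arbitrary illustrative choices). It estimates the MSE of the divisor-\(n\) variance estimator, which is biased for \(\sigma^2\), both directly and via the variance-plus-squared-bias decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed normal population and a (small) sample size.
mu, sigma, n = 10.0, 3.0, 15
theta = sigma**2                      # target parameter: the population variance
reps = 200_000                        # number of simulated samples

# theta_hat: the divisor-n variance estimator, which is biased for sigma^2.
samples = rng.normal(mu, sigma, size=(reps, n))
theta_hat = samples.var(axis=1, ddof=0)

mse_direct = np.mean((theta_hat - theta) ** 2)                   # E[(theta_hat - theta)^2]
decomposed = theta_hat.var() + (theta_hat.mean() - theta) ** 2   # Var + Bias^2

print(mse_direct, decomposed)   # the two numbers should agree up to simulation noise
```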

Here is a figure from Lohr’s text that shows the difference between low bias, low variance, and low MSE:

[Figure from Lohr's text: archery targets illustrating unbiased archers, precise archers, and accurate archers.]

This diagram shows that an estimator \(\hat{\theta}\) is unbiased if \(E\big(\hat{\theta}\big) = \theta\) and precise if \(\mathrm{Var}(\hat{\theta})\) is small; for the estimator to be accurate, both the bias and the variance must be small, and therefore the Mean Squared Error \(MSE = E\big[\big(\hat{\theta} - \theta\big)^2\big]\), which is the sum of the squared bias and the variance, must be small.

\(\mathrm{Var}(\overline{X})\) and the finite population correction

Recall that if \(X_1, X_2, \ldots, X_n\) are independent and identically distributed (IID) random variables with common expected value \(\mu\) and variance \(\sigma^2\), and \(\overline{X}\) is the sample mean \(\big(\displaystyle \overline{X} = \dfrac{1}{n} \sum_{i=1}^n X_i\big)\), then \(E(\overline{X}) = \mu\) and \(\mathrm{Var}(\overline{X}) = \sigma^2/n\). (Note: You should be able to show this.)
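As a quick check of the IID case, here is a short simulation sketch in Python/NumPy (the exponential population and the constants below are arbitrary choices): the average of many sample means should be close to \(\mu\), and their empirical variance close to \(\sigma^2/n\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary IID setting: exponential population with mean 2, so sigma^2 = 4.
mu, sigma2, n, reps = 2.0, 4.0, 25, 100_000

xbars = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)

print(xbars.mean(), mu)            # E(Xbar) should be close to mu
print(xbars.var(), sigma2 / n)     # Var(Xbar) should be close to sigma^2 / n
```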

Now suppose we have a finite population of size \(N\), and we take a simple random sample of size \(n\) from this population: \(X_1, X_2, \ldots, X_n\). Now the \(X_i\) are no longer independent (although each draw has the same marginal distribution), because we are sampling without replacement. It is easily shown (Theorem A on page 206) that the expected value of the sample mean is still \(\mu\), where \(\mu\) is the population mean. What about \(\mathrm{Var}(\overline{X})\)?

It turns out that (Theorem B on page 208): \[ \mathrm{Var}(\overline{X}) = \dfrac{\sigma^2}{n}\left( \dfrac{N-n}{N-1}\right). \]
Proof

\[ \begin{align*} \mathrm{Var}(\overline{X}) &= \mathrm{Var}\left(\dfrac{1}{n}\sum_{i = 1}^n X_i\right)\\ &= \dfrac{1}{n^2} \mathrm{Var}\left(\sum_{i = 1}^n X_i\right), \: \text{because }\mathrm{Var}(aX) = a^2\mathrm{Var}(X)\\ &= \dfrac{1}{n^2}\mathrm{Cov}\left(\sum_{i = 1}^n X_i,\sum_{j = 1}^n X_j \right), \: \text{because } \mathrm{Var}(X) = \mathrm{Cov}(X,X)\\ &= \dfrac{1}{n^2}\left(\sum_{i = 1}^n \mathrm{Var}(X_i) + \sum_{i = 1}^n \sum_{\substack{j=1 \\ j \ne i}}^n \mathrm{Cov}(X_i,X_j)\right) \\ &= \dfrac{1}{n^2} \left( n\sigma^2 + n(n-1) \mathrm{Cov}(X_1, X_2) \right), \: \text{since, by symmetry, all } n(n-1) \text{ pairs have the same covariance.} \end{align*} \] This computation implies that if we figure out \(\mathrm{Cov}(X_1, X_2)\), we will be able to figure out the variance we need. So let’s compute this covariance. Recall that \(\mathrm{Cov}(X_1, X_2) = E(X_1 X_2) - E(X_1)E(X_2)\).

We know that the possible values of the \(X_i\) are the population values: \(x_1, x_2, \ldots, x_N\). But some of these could be repeated, which can mess up the probability computations. To simplify our computations, we will define new values \(u_1, u_2, \ldots, u_m\) to be the distinct values in the population, \(m\) the number of distinct values, and let \(n_i\) be the number of times we see the value \(u_i\).

For example, suppose \(N = 6\) and the population values are \(1, 1, 4, 4, 4, 7\). Then \(x_1 = x_2 = 1\), \(x_3 = x_4 = x_5 = 4\), and \(x_6 = 7\). Using the \(u_j\)'s, we have \(u_1 = 1, u_2 = 4, u_3 = 7\), and \(m=3\). Further, \(n_1 = 2, n_2 = 3, n_3 = 1\).

Now, if \(X_i\) is the \(i\)th sample value drawn, then \(X_i\) is a discrete random variable with \(P(X_i = u_j) = \dfrac{n_j}{N}\) for each \(j\). This is because there are still \(N\) total units in the population, and we have just grouped them by value.

For example, using the numbers above, \(P(X_i = 4) = \dfrac{3}{6}\).

You can check that \(E(X_i) = \mu\) and \(\mathrm{Var}(X_i) = \sigma^2\), using the fact that \(\displaystyle \sum_{j=1}^m u_j n_j = \sum_{i=1}^N x_i\).
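For the small example population above, a few lines of Python confirm this: the mean and variance of a single draw, computed from the \(u_j\) and \(n_j\), match \(\mu\) and \(\sigma^2\) computed directly from the population values.

```python
import numpy as np

x = np.array([1, 1, 4, 4, 4, 7])               # the example population values
u, counts = np.unique(x, return_counts=True)   # distinct values u_j and counts n_j
N = x.size

p = counts / N                                 # P(X_i = u_j) = n_j / N
mean_draw = np.sum(u * p)                      # E(X_i)
var_draw = np.sum(u**2 * p) - mean_draw**2     # Var(X_i)

print(mean_draw, x.mean())                     # both equal mu
print(var_draw, x.var(ddof=0))                 # both equal sigma^2 (divisor-N variance)
```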

Now let’s compute \(\mathrm{Cov}(X_1, X_2) = E(X_1 X_2) - \mu^2\).

\[ \begin{align*} E(X_1 X_2) &= \sum_{i=1}^m \sum_{j=1}^m u_i u_j P(X_1 = u_i, X_2 = u_j) \\ &= \sum_{i=1}^m \sum_{j=1}^m u_i u_j P(X_1 = u_i) P(X_2 = u_j \vert X_1 = u_i) \\ &= \sum_{i=1}^m u_i P(X_1 = u_i) \sum_{j=1}^m u_j P(X_2 = u_j \vert X_1 = u_i)\\ \end{align*} \] Now, as we discussed earlier, \(P(X_1 = u_i) = \dfrac{n_i}{N}\). But the second draw from the population, \(X_2\) will depend on the first. \(P(X_2 = u_j \vert X_1 = u_i) = \dfrac{n_j}{N-1}\) if \(j \ne i\) and \(P(X_2 = u_j \vert X_1 = u_i) = \dfrac{n_i-1}{N-1}\) if \(j = i\).

Thus, we can simplify the interior sum to: \[ \begin{align*} \sum_{j=1}^m u_j P(X_2 = u_j \vert X_1 = u_i) &= \sum_{\substack{j=1 \\ j \ne i}}^m u_j \cdot \dfrac{n_j}{N-1} + u_i\cdot \dfrac{n_i-1}{N-1}\\ &= \sum_{\substack{j=1 \\ j \ne i}}^m u_j \cdot \dfrac{n_j}{N-1} + u_i\cdot \dfrac{n_i}{N-1} - u_i\cdot \dfrac{1}{N-1}\\ &= \sum_{j=1}^m \dfrac{u_j n_j}{N-1} - \dfrac{u_i}{N-1} \end{align*} \] Back to \(E(X_1 X_2)\), noting that \(\displaystyle \sum_{i=1}^m u_i n_i = \sum_{k=1}^N x_k = \tau = N\mu\), that is, the sum total of all the population values, and also note that \(\displaystyle \sum_{i=1}^m u_i^2 n_i = \sum_{i=1}^N x_i^2 = N(\sigma^2 + \mu^2)\):

\[ \begin{align*} E(X_1 X_2) &= \sum_{i=1}^m u_i \dfrac{n_i}{N}\left[ \sum_{j=1}^m \dfrac{u_j n_j}{N-1} - \dfrac{u_i}{N-1}\right]\\ &= \dfrac{1}{N(N-1)}\left[ \left( \sum_{i=1}^m u_i n_i\right)\left( \sum_{j=1}^m u_j n_j\right) - \sum_{i=1}^m u_i^2 n_i\right]\\ &= \dfrac{1}{N(N-1)} \left[ \left(N\mu\right)^2 -\sum_{i=1}^m u_i^2 n_i\right]\\ &= \dfrac{1}{N(N-1)}\left(N^2 \mu^2 - N(\sigma^2 + \mu^2) \right)\\ &= \mu^2 -\dfrac{\sigma^2}{N-1} \end{align*} \] This implies that: \[ \begin{align*} \mathrm{Cov}(X_1, X_2) &= E(X_1X_2) - \mu^2\\ &= \mu^2 -\dfrac{\sigma^2}{N-1} - \mu^2\\ &= -\dfrac{\sigma^2}{N-1} \end{align*} \]

Now we can put it all together: \[ \begin{align*} \mathrm{Var}(\overline{X}) &= \dfrac{1}{n^2} \left( n\sigma^2 + n(n-1) \mathrm{Cov}(X_1, X_2) \right)\\ &= \dfrac{1}{n^2} \left( n\sigma^2 - n(n-1)\dfrac{\sigma^2}{N-1}\right)\\ &= \dfrac{\sigma^2}{n}\left(1- \dfrac{n-1}{N-1}\right)\\ &= \dfrac{\sigma^2}{n}\left(\dfrac{N-n}{N-1}\right) \end{align*} \]
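This result is easy to verify empirically. Below is a minimal simulation sketch in Python/NumPy (the gamma-distributed finite population and the sizes are arbitrary choices) that draws many simple random samples without replacement and compares the empirical variance of \(\overline{X}\) with \(\dfrac{\sigma^2}{n}\left(\dfrac{N-n}{N-1}\right)\).

```python
import numpy as np

rng = np.random.default_rng(2)

# An arbitrary finite population of size N.
N, n, reps = 500, 40, 50_000
population = rng.gamma(shape=2.0, scale=5.0, size=N)

mu = population.mean()
sigma2 = population.var(ddof=0)          # divisor-N population variance

# Draw many simple random samples WITHOUT replacement and record the sample mean.
xbars = np.array([
    rng.choice(population, size=n, replace=False).mean() for _ in range(reps)
])

theory = (sigma2 / n) * (N - n) / (N - 1)
print(xbars.var(), theory)               # should agree up to simulation noise
```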

Finite population correction

The quantity \(\displaystyle \left( \dfrac{N-n}{N-1}\right)=\left(1- \dfrac{n-1}{N-1}\right)\) is called the finite population correction. Note that \(\displaystyle \dfrac{n-1}{N-1} \approx \dfrac{n}{N}\), which is called the sampling fraction. The larger the sampling fraction, the larger the sample relative to the population, which means we have more information about the population. This should reduce the variability. The extreme case is when \(n=N\), and the sample mean has no variability. In practice, the sampling fraction is very small, and so the finite population correction is approximately 1. This means that the precision of the estimator (determined by the variance) depends only on the sample size, and not on the population size.
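A quick numerical illustration (with an arbitrary population size of \(N = 10{,}000\)): for small sampling fractions the correction is essentially 1, and it shrinks toward 0 as \(n\) approaches \(N\).

```python
# Finite population correction (N - n) / (N - 1) for a few sampling fractions.
N = 10_000
for n in (10, 100, 1_000, 5_000, 10_000):
    fpc = (N - n) / (N - 1)
    print(f"n/N = {n / N:5.3f}   fpc = {fpc:.4f}")
```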

Estimating the Population Variance

We know that the population variance \(\sigma^2\) satisfies: \[ \sigma^2 = \dfrac{1}{N}\sum_{i=1}^N (x_i - \mu)^2 = \dfrac{1}{N}\sum_{i=1}^N x_i^2 - \mu^2. \] We can define the analogous quantity \(\hat{\sigma}^2\), which is a function of the sample \(X_1, X_2, \ldots, X_n\): \[ \begin{align*} \hat{\sigma}^2 &= \dfrac{1}{n} \sum_{i=1}^n (X_i -\overline{X})^2 \\ &= \dfrac{1}{n} \sum_{i=1}^n X_i^2 -\overline{X}^2 \\ \end{align*} \] and use this to estimate \(\sigma^2\). The question is then whether this estimator is unbiased. Is \(E(\hat{\sigma}^2) = \sigma^2\)?

\[ \begin{align*} E(\hat{\sigma}^2) &= E\left( \dfrac{1}{n} \sum_{i=1}^n X_i^2 -\overline{X}^2 \right) \\ &= \dfrac{1}{n} \sum_{i=1}^n E\left(X_i^2\right) - E\big(\overline{X}^2 \big)\\ &= (\sigma^2 + \mu^2) - E\big(\overline{X}^2 \big)\\ \end{align*} \] The last line is because \(\mathrm{Var}(X_i) = \sigma^2 = E(X_i^2) - \mu^2\). Doing a similar computation with \(E\big(\overline{X}^2 \big)\), we see that (for a simple random sample): \[ E\big(\overline{X}^2 \big) = \mathrm{Var}(\overline{X}) + \mu^2 = \dfrac{\sigma^2}{n}\left(\dfrac{N-n}{N-1} \right) + \mu^2. \] Putting these together, and doing some algebra, we have \[ \begin{align*} E(\hat{\sigma}^2) &= (\sigma^2 + \mu^2) - \left[\frac{\sigma^2}{n}\left(\dfrac{N-n}{N-1} \right) + \mu^2\right] \\ &= \dfrac{n\sigma^2}{n} - \frac{\sigma^2}{n}\left(\dfrac{N-n}{N-1} \right) + \mu^2 - \mu^2 \\ &= \frac{\sigma^2}{n} \left[n - \left(\dfrac{N-n}{N-1} \right)\right] \\ &= \frac{\sigma^2}{n} \left[\dfrac{n(N-1) - (N-n)}{N-1}\right] \\ &= \frac{\sigma^2}{n} \left[\dfrac{N(n-1)}{N-1}\right] \\ &= \sigma^2 \left[\left(\frac{n-1}{n}\right) \left(\dfrac{N}{N-1} \right)\right] \\ &= \sigma^2 \left[\dfrac{nN-N}{nN-n}\right]\\ \end{align*} \] This means that \(E(\hat{\sigma}^2) \ne \sigma^2\), and also that \(\hat{\sigma}^2\) underestimates \(\sigma^2\) on average (since \(N > n\)). Therefore, to get an unbiased estimator of \(\sigma^2\), we need to multiply \(\hat{\sigma}^2\) by the appropriate correction factor. Note that: \[ E\left[\left(\frac{n}{n-1}\right) \left(\dfrac{N-1}{N} \right)\hat{\sigma}^2\right] = \sigma^2 \] Our goal, of course, is to get an unbiased estimator for the variance of the sample mean. Recall that, for a simple random sample, \(\mathrm{Var}(\overline{X}) = \dfrac{\sigma^2}{n}\left(\dfrac{N-n}{N-1}\right)\). Let’s substitute the unbiased estimator of \(\sigma^2\) that we just derived: \[ \begin{align*} \mathrm{(Estimated)\, Var}(\overline{X}) &= \left(\frac{n}{n-1}\right) \left(\dfrac{N-1}{N} \right)\hat{\sigma}^2 \cdot \frac{1}{n} \left(\dfrac{N-n}{N-1}\right) \\ &= \frac{\hat{\sigma}^2}{n-1}\left(\frac{N-n}{N}\right)\\ &= \frac{s^2}{n}\left(1-\frac{n}{N}\right) \end{align*} \] where \(\displaystyle s^2 = \dfrac{1}{n-1} \sum_{i=1}^n \big(X_i - \overline{X}\big)^2\), so that \(\displaystyle \dfrac{s^2}{n} = \dfrac{\hat{\sigma}^2}{n-1}\).

Thus we have that \(s_{\overline{X}}^2 = \dfrac{s^2}{n}\left(1-\dfrac{n}{N}\right)\) is an unbiased estimator of \(\sigma_{\overline{X}}^2=\mathrm{Var}(\overline{X})\).
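A simulation sketch along the same lines as before (Python/NumPy, with an arbitrary finite population) illustrates this: the average of \(\dfrac{s^2}{n}\left(1-\dfrac{n}{N}\right)\) over many simple random samples should be close to the true \(\mathrm{Var}(\overline{X})\).

```python
import numpy as np

rng = np.random.default_rng(3)

N, n, reps = 400, 30, 50_000
population = rng.normal(50.0, 12.0, size=N)   # arbitrary finite population

sigma2 = population.var(ddof=0)
true_var_xbar = (sigma2 / n) * (N - n) / (N - 1)

est_vars = np.empty(reps)
for r in range(reps):
    sample = rng.choice(population, size=n, replace=False)
    s2 = sample.var(ddof=1)                   # divisor-(n-1) sample variance
    est_vars[r] = (s2 / n) * (1 - n / N)      # estimated Var(Xbar)

print(est_vars.mean(), true_var_xbar)         # close => (approximately) unbiased
```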

If the population is dichotomous (each \(x_i\) is 0 or 1), then \(\overline{X} = \hat{p}\) and \(s^2 = \dfrac{n}{n-1}\,\hat{p}(1-\hat{p})\), so the estimated variance of \(\hat{p}\) becomes: \[ s_{\hat{p}}^2 = \frac{s^2}{n}\left(1-\frac{n}{N}\right) = \frac{\hat{p}(1-\hat{p})}{n-1}\left(1-\frac{n}{N}\right). \]

Putting all this together gives us the table on page 214 of the text, reproduced here:

Summary of estimators

| Population Parameter | Estimator | Variance of Estimator (Square of Standard Error) | Estimated Variance of the Estimator (Square of Estimated SE) |
|---|---|---|---|
| \(\mu\) | \(\overline{X}\) | \(\sigma_{\overline{X}}^2 = \displaystyle \dfrac{\sigma^2}{n}\left(\dfrac{N-n}{N-1}\right)\) | \(s_{\overline{X}}^2 = \displaystyle \dfrac{s^2}{n}\left(1-\dfrac{n}{N}\right)\) |
| \(p\) | \(\hat{p}\) | \(\sigma_{\hat{p}}^2 = \displaystyle \dfrac{p(1-p)}{n}\left(\dfrac{N-n}{N-1}\right)\) | \(s_{\hat{p}}^2 = \displaystyle \dfrac{\hat{p}(1-\hat{p})}{n-1}\left(1-\dfrac{n}{N}\right)\) |
| \(\tau\) | \(T = N\overline{X}\) | \(\sigma_{\tau}^2 = N^2\sigma_{\overline{X}}^2\) | \(s_{\tau}^2 = N^2 s_{\overline{X}}^2\) |
| \(\sigma^2\) | \(\left(1-\dfrac{1}{N}\right)s^2\) | | |
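To tie the table together, here is a minimal sketch of a helper function (the function name and the example data are made up for illustration) that computes the point estimates and estimated standard errors for \(\mu\) and \(\tau\) from a single simple random sample; fed 0/1 data, it also covers \(\hat{p}\).

```python
import numpy as np

def srs_estimates(sample, N):
    """Point estimates and estimated SEs from a simple random sample of a
    population of size N, following the summary table above."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    fpc_hat = 1 - n / N                       # (1 - n/N) factor from the table

    xbar = sample.mean()
    s2 = sample.var(ddof=1)                   # divisor-(n-1) sample variance
    se_xbar = np.sqrt((s2 / n) * fpc_hat)     # estimated SE of Xbar

    total = N * xbar                          # estimate of tau
    se_total = N * se_xbar                    # estimated SE of T = N * Xbar

    return {"mean": (xbar, se_xbar), "total": (total, se_total)}

# Hypothetical example: a sample of n = 5 values from a population of N = 1000.
print(srs_estimates([42.0, 55.0, 38.0, 61.0, 47.0], N=1_000))

# For a dichotomous (0/1) variable the same function returns p-hat and its SE,
# since s^2/n * (1 - n/N) reduces to p-hat(1 - p-hat)/(n-1) * (1 - n/N).
print(srs_estimates([1, 0, 1, 1, 0], N=1_000))
```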

References

Lohr, Sharon L. 2010. Sampling: Design and Analysis. 2nd ed. Cengage.
Pimentel, Sam. 2024. “STAT 135 Lecture Slides.” Lecture slides (shared privately).
Rice, John A. 2006. Mathematical Statistics and Data Analysis. 3rd ed. Duxbury Press.
Wasserman, Larry. 2004. All of Statistics: A Concise Course in Statistical Inference. New York: Springer.