Survey Sampling: Confidence Intervals

The (Asymptotic) Sampling Distribution of the Sample Mean

The Central Limit Theorem

The CLT states that for large \(n\), the sample mean, suitably standardized, will have a CDF that approaches the CDF of the standard Normal. That is, the standardized sample mean converges in distribution to the standard Normal.

If \(X_1, X_2, \ldots, X_n\) is an independent and identically distributed sample from a population with mean \(\mu\) and variance \(\sigma^2\), then:

\[ \left(\frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\right) = \sqrt{n}\left(\frac{\overline{X}-\mu}{\sigma}\right) \overset{dsn}{\longrightarrow}\mathcal{N}(0,1) \text{ as } n\longrightarrow \infty \] The CDF converges to \(\Phi\), which means that: \[ P\left(\frac{\overline{X}-\mu}{\dfrac{\sigma}{\sqrt{n}}} \le z \right) = F_{\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}}(z)\longrightarrow \Phi(z) \]

The CLT is an incredibly important theorem, because it guarantees that for a large enough sample, no matter what the distribution of the random variables \(X_i\), the sample mean behaves as though it is from an approximately \(\mathcal{N}(\mu, \dfrac{\sigma^2}{n})\) distribution.
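To see the theorem in action, here is a small simulation sketch in Python (standard library only; the Exp(1) population, the sample size, and the number of replications are arbitrary choices for illustration). Even though the exponential distribution is strongly skewed, the standardized sample means already look close to standard normal at \(n = 100\):

```python
import random
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

random.seed(0)
n, reps = 100, 5000
mu, sigma = 1.0, 1.0  # mean and sd of the Exp(1) population

# Draw many samples from the skewed population and standardize each
# sample mean; the CLT says these values behave like a standard normal.
zs = []
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    zs.append(sqrt(n) * (xbar - mu) / sigma)

# The empirical CDF at z = 1 should be close to Phi(1), about 0.84.
frac = sum(z <= 1.0 for z in zs) / reps
print(frac, phi(1.0))
```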

Note that the CLT is a limit theorem: it characterizes the asymptotic distribution of the sample mean, and it provides a good approximation for large enough samples, so we can compute probabilities such as a \(P\)-value and construct confidence intervals. In survey sampling we usually have a fixed population size \(N\), so it doesn’t literally make sense to let the sample size \(n \rightarrow \infty\); but as long as \(n\) is large and the sampling fraction \(\dfrac{n}{N}\) is small, the normal approximation is quite good. The following example demonstrates how we use it.

Example A, page 201

This is an example from the text that is used to illustrate many of the ideas from this chapter. Herkson (JASA 1976) presented data on the number of patients discharged from each of a population of \(N=393\) hospitals during January 1968. The mean number of discharges across the population is about 815, and the population standard deviation is about 590.

  1. If a random sample of \(n=100\) is taken from this population with replacement, what is the standard error of \(\overline{X}\), the estimator of the population mean?

  2. What is the standard error of \(\overline{X}\) if instead we take a simple random sample of size \(n=100\)?

  3. If we are sampling with replacement, what is the (approximate) probability that our sample average exceeds 850?

Check your answer
  1. Since we are sampling with replacement, the random variables form an IID sample, and so the standard error of the sample mean is \(\dfrac{\sigma}{\sqrt{n}} = \dfrac{590}{\sqrt{100}} = 59\).

  2. In the case of an SRS, the standard error of the sample mean is given by \(\dfrac{\sigma}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}} = \dfrac{590}{\sqrt{100}}\sqrt{\dfrac{393-100}{393-1}} \approx 51\).

  3. We want to approximate \(P(\overline{X} > 850)\). By the CLT, \[ P(\overline{X} > 850) = P\left(\dfrac{\overline{X}-815}{59} > \dfrac{850-815}{59}\right) = 1-\Phi\left(\dfrac{850-815}{59}\right) \approx 0.2765. \] We used 1-pnorm((850-815)/59) to compute the answer.
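The three computations above can be reproduced in a few lines. A sketch in Python, standard library only (the helper phi is our own implementation of \(\Phi\) via the error function, standing in for R’s pnorm):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

N, n = 393, 100
mu = 815.0     # population mean (approximate, from the example)
sigma = 590.0  # population standard deviation (approximate)

# 1. With replacement: the sample is IID, so SE = sigma / sqrt(n).
se_wr = sigma / sqrt(n)

# 2. SRS without replacement: apply the finite population correction.
se_srs = se_wr * sqrt((N - n) / (N - 1))

# 3. Normal approximation to P(Xbar > 850).
p_exceed = 1.0 - phi((850.0 - mu) / se_wr)

print(se_wr, round(se_srs), round(p_exceed, 4))
```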

Confidence intervals for the Population Mean

Another way we can use the central limit theorem, together with an estimated standard error, is to build a range of plausible values for the population mean from the sample. This range is called a confidence interval. A confidence interval for a population parameter \(\theta\) is a random interval, whose endpoints are constructed from the sample, such that the interval contains \(\theta\) with some specified probability.

It is very important to note the assumption that the population parameter is fixed, and it is the interval that is random, and so the probability of coverage is associated with the random interval.

Since we compute the endpoints using the random sample, the endpoints are random variables. This means that each time we take a sample of size \(n\) and plug in our observed data, we get a different realization of this random interval. But because we can use the CLT to approximate probabilities, we can construct intervals that have probability \(1-\alpha\) of containing the true value. For example, if \(\alpha = 0.05\), we construct an interval that contains the true mean 95% of the time (on average over repeated sampling).
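The coverage interpretation can be checked by simulation. A minimal sketch in Python, assuming a Normal population with known \(\mu\) and \(\sigma\) (all the numerical values here are arbitrary illustrations): about 95% of the realized intervals should capture the fixed, true mean.

```python
import random
from math import sqrt

random.seed(1)
mu, sigma, n, trials = 10.0, 2.0, 50, 2000
z = 1.96  # z(alpha/2) for alpha = 0.05

covered = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    half = z * sigma / sqrt(n)
    # Does this realized interval capture the fixed, true mean?
    if xbar - half <= mu <= xbar + half:
        covered += 1

print(covered / trials)  # should be close to 0.95
```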

For \(0 \le \alpha \le 1\), let \(z(\alpha)\) denote that value on the \(x\)-axis such that the area under the standard normal density curve to the right of \(z(\alpha)\) is \(\alpha\). When we have a particular confidence interval, it is a realization of a random interval, where the random interval has a certain coverage probability of \(1-\alpha\). We call this coverage probability the confidence level.
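Since \(z(\alpha)\) is defined implicitly through the normal CDF, it can be computed numerically. As a minimal sketch in Python (standard library only; the function name z_alpha is our own), bisection on \(\Phi\) recovers the familiar critical values:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def z_alpha(alpha, lo=-10.0, hi=10.0):
    """Upper-tail critical value: the area to the right of z(alpha) is alpha."""
    # Bisection on the monotone CDF: find z with phi(z) = 1 - alpha.
    target = 1.0 - alpha
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if phi(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(round(z_alpha(0.025), 2))  # the familiar 1.96
```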

Let’s derive the confidence interval for the population mean \(\mu\). By the central limit theorem, we know that \(\overline{X}\) is approximately normal, that is, \(\dfrac{\overline{X}-\mu}{\sigma_{\overline{X}}} \approx \mathcal{N}(0,1)\).

If \(Z\) follows the standard Normal distribution, then by the definition of \(z(\alpha)\) above, we see that \[ P\big(-z(\alpha/2) \le Z \le z(\alpha/2)\big) = 1-\alpha. \] Therefore, if \(\dfrac{\overline{X} - \mu}{\sigma/\sqrt{n}}\) is approximately standard normal, we have that: \[ P\left(-z(\alpha/2) \le \dfrac{\overline{X} - \mu}{\sigma/\sqrt{n}} \le z(\alpha/2)\right) \approx 1-\alpha. \] Now we multiply by \(\sigma/\sqrt{n}\), subtract \(\overline{X}\), and multiply by \(-1\) (which reverses the inequalities). This gives us the confidence interval that we need: \[ P\left(\overline{X} - \frac{\sigma}{\sqrt{n}}z(\alpha/2) \le \mu \le \overline{X} + \frac{\sigma}{\sqrt{n}}z(\alpha/2)\right) \approx 1-\alpha. \] This statement says that the chance of the random interval \(\left(\overline{X} - \dfrac{\sigma}{\sqrt{n}}z(\alpha/2) ,\; \overline{X} + \dfrac{\sigma}{\sqrt{n}}z(\alpha/2)\right)\) capturing the mean is approximately \(1-\alpha\), and so the interval is called a \(100(1-\alpha)\%\) confidence interval.

We can then plug in our observed value of \(\overline{X}\) and will get a realization of the random interval: \(\left(\overline{x} - \dfrac{\sigma}{\sqrt{n}}z(\alpha/2) ,\; \overline{x} + \dfrac{\sigma}{\sqrt{n}}z(\alpha/2)\right).\) Note that this interval is not random: it is just an interval on the real line, and as \(\mu\) is just some fixed constant, it either lies in this interval or does not. Therefore, once we plug in the observed sample mean (and, when \(\sigma\) is unknown, the observed value of the estimated standard error \(s_{\overline{X}}\)), we don’t have any randomness. All the randomness is in the sampling procedure.
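As an illustration of the plug-in step, here is a sketch in Python for hypothetical observed values (the sample mean of 820 is invented purely for the example; \(\sigma = 590\) and \(n = 100\) echo the hospital example):

```python
from math import sqrt

# Hypothetical observed values (for illustration only): a sample of
# n = 100 discharges with known sigma = 590 and observed mean 820.
xbar, sigma, n = 820.0, 590.0, 100
z = 1.96  # z(alpha/2) for a 95% interval

half_width = z * sigma / sqrt(n)
lower, upper = xbar - half_width, xbar + half_width
print((round(lower, 1), round(upper, 1)))
```

Once these two numbers are printed, nothing is random any more: the fixed \(\mu\) either lies between them or it does not.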

We can use the confidence interval to plan our data collection. Since the width of the confidence interval is given by \(2\times \dfrac{\sigma}{\sqrt{n}} \times z(\alpha/2)\), it is determined by \(\sigma\), by \(n\), and by the confidence level. Now \(\sigma\) is a constant of the population, so we can’t do much with it, but we can choose \(n\) so that our confidence interval is as narrow as we desire. The margin of error of the confidence interval is given by \(\dfrac{\sigma}{\sqrt{n}} \times z(\alpha/2).\)
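Rearranging the margin-of-error formula gives the required sample size: we need the smallest \(n\) with \(z(\alpha/2)\,\sigma/\sqrt{n} \le m\), i.e. \(n \ge \big(z(\alpha/2)\,\sigma/m\big)^2\). A sketch (the target margin of 50 discharges is an arbitrary illustration, with \(\sigma = 590\) borrowed from the hospital example):

```python
from math import ceil

def required_n(sigma, margin, z=1.96):
    """Smallest n with z * sigma / sqrt(n) <= margin."""
    return ceil((z * sigma / margin) ** 2)

# With sigma = 590, a 95% margin of error of 50 discharges requires:
print(required_n(590.0, 50.0))
```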

Exercise (Problem 8 from section 7.7) A sample of size 100 is taken from a population that has a proportion \(p = 1/5\).

  1. Find \(\delta\) such that \(P\big(\lvert \hat{p}-p\rvert \ge \delta\big)=0.025\)

  2. If, in the sample, \(\hat{p} = 0.25\), will the 95% confidence interval for \(p\) contain the true value for \(p\)?

Check your answer
  1. \(\delta \approx 0.0896\). \(p\) is given to be \(\dfrac{1}{5}\), so \(\sigma_{\hat{p}} = \displaystyle \sqrt{\dfrac{\frac{1}{5}\cdot \frac{4}{5}}{100}} = \frac{2}{50} = 0.04.\)

By the Central Limit Theorem, \(\hat{p}\) is approximately \(\mathcal{N}(p, \sigma_{\hat{p}}^2)\).

\[ \begin{align*} P\big(\lvert \hat{p}-p\rvert \ge \delta\big) &= 0.025 \\ \Rightarrow P\big(\lvert \hat{p}-p\rvert < \delta\big) &= 0.975 \\ \Rightarrow P(-\delta < \hat{p}-p < \delta) &= 0.975 \\ \Rightarrow P\left(-\frac{\delta}{\sigma_{\hat{p}}} < \frac{\hat{p}-p}{\sigma_{\hat{p}}} < \frac{\delta}{\sigma_{\hat{p}}}\right) &= 0.975 \\ \Rightarrow P\left(-\frac{\delta}{\sigma_{\hat{p}}} < Z < \frac{\delta}{\sigma_{\hat{p}}}\right) &\approx 0.975 \\ \Rightarrow \Phi\left(\frac{\delta}{\sigma_{\hat{p}}}\right) - \Phi\left(-\frac{\delta}{\sigma_{\hat{p}}}\right) &\approx 0.975 \\ \Rightarrow 2\Phi\left(\frac{\delta}{\sigma_{\hat{p}}}\right) -1 &\approx 0.975 \\ \Rightarrow \Phi\left(\frac{\delta}{\sigma_{\hat{p}}}\right) &\approx 0.9875 \\ \Rightarrow \frac{\delta}{\sigma_{\hat{p}}} &\approx 2.24\\ \end{align*} \] Where we used qnorm(0.9875) to obtain 2.24. Plugging in the value of \(\sigma_{\hat{p}} = \dfrac{2}{50}\), we get that \(\delta\) is about \(0.0896\).
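The arithmetic in part 1 can be verified numerically. This sketch mirrors the derivation above (phi is our own \(\Phi\) helper via the error function, and 2.24 is the rounded value of qnorm(0.9875) used in the text):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

p, n = 0.2, 100
se = sqrt(p * (1 - p) / n)  # sigma_phat = 2/50 = 0.04

# From the derivation, delta/se solves 2*Phi(delta/se) - 1 = 0.975,
# i.e. Phi(delta/se) = 0.9875; the text rounds qnorm(0.9875) to 2.24.
z = 2.24
delta = z * se

print(round(se, 2), round(delta, 4))
```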

  2. Yes.

\(z(\alpha/2) = 1.96\), and the 95% confidence interval is given by \(\hat{p} \pm 1.96\times \dfrac{2}{50}\). Since \(\hat{p} = 0.25\), this gives us \(0.25 \pm 1.96\times \dfrac{2}{50} = (0.1716, 0.3284)\) which contains \(p = \dfrac{1}{5}.\)
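Part 2 is a one-line check. Note that, as in the solution above, we use the standard error computed from the given \(p = 1/5\) rather than plugging in \(\hat{p}\):

```python
p_hat = 0.25
se = 0.04  # sqrt(p * (1 - p) / n) with the given p = 1/5 and n = 100
z = 1.96   # z(alpha/2) for a 95% interval

lower, upper = p_hat - z * se, p_hat + z * se
print((round(lower, 4), round(upper, 4)), lower <= 0.2 <= upper)
```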

Exercise 20 different polling companies have conducted independent surveys to estimate the proportion of US voters who approve of RFK Jr’s stewardship of Health and Human Services. Each company estimates this proportion using a 95% confidence interval. About how many do you think will be successful in covering the true proportion?

Check your answer

If we let \(Y\) be the number of confidence intervals out of 20 that are successful, then since each interval has a 0.95 chance of success, we see that \(Y\sim Bin(20, 0.95)\). Therefore the expected number of successful intervals is \(E(Y) = 20\times 0.95 = 19.\)
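A quick computation confirms the expectation (the probability that all 20 intervals succeed is an extra illustration of ours, not part of the original question):

```python
# Y ~ Binomial(20, 0.95): the number of the 20 intervals that cover
# the true proportion, since each succeeds independently w.p. 0.95.
n_polls, p_cover = 20, 0.95

expected = n_polls * p_cover    # E(Y) = n * p
all_cover = p_cover ** n_polls  # chance that every interval succeeds

print(expected, round(all_cover, 3))
```

So on average 19 of the 20 companies cover the truth, yet the chance that all 20 do is only about 36%.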

(Rice 2006; Wasserman 2004; Pimentel 2024; Lohr 2010)

References

Lohr, Sharon L. 2010. Sampling: Design and Analysis. 2nd ed. Cengage.
Pimentel, Sam. 2024. “STAT 135 Lecture Slides.” Lecture slides (shared privately).
Rice, John A. 2006. Mathematical Statistics and Data Analysis. 3rd ed. Duxbury Press.
Wasserman, Larry. 2004. All of Statistics: A Concise Course in Statistical Inference. New York: Springer.