Survey Sampling: An Introduction
Introduction
In this chapter, we look at survey sampling, which involves a particular type of inference: saying something about a population, given an observed subset of that population. By now, we are all very used to sample surveys, such as presidential approval polls and polls on various issues, for example: What percentage of Republicans support vaccine requirements for children to attend public schools? Pew Research investigated this question a couple of years ago. They attempted to answer it by taking a sample of Republican voters and then drawing a conclusion about the population of Republican voters.

News outlets are constantly publishing polls, which are certainly not all of the same quality. The FiveThirtyEight site, started by Nate Silver, is now defunct, but it was famous for its pollster ratings. You can read about their methodology, and here are their rankings from a couple of years ago.

In our course, we learn a little bit about survey sampling, and if you like it, you might think about taking Stat 152 the next time it is offered in our department. Chapter 7 in our text discusses probabilistic sampling techniques, in which each population unit has a specified probability of being included in the sample, and the sample consists of randomly selected units from the population. Note that we make no distributional assumptions in this chapter. We will restrict our study to Simple Random Samples: every population unit has the same probability of being selected, and each particular sample of size \(n\) has the same probability. That is, if the size of our population is \(N\), then each of the \(\displaystyle \binom{N}{n}\) possible samples of size \(n\) taken without replacement has the same probability.
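As a small illustration of the definition above, the sketch below enumerates every sample of size \(n = 2\) from a population of size \(N = 5\) and draws one simple random sample. The population labels and the seed are hypothetical choices made only for this example.

```python
import itertools
import math
import random
from fractions import Fraction

# Hypothetical population of N = 5 labeled units (not from the text).
population = ["a", "b", "c", "d", "e"]
n = 2

# All unordered samples of size n drawn without replacement.
all_samples = list(itertools.combinations(population, n))
num_samples = math.comb(len(population), n)

# Under simple random sampling, each sample has probability 1 / C(N, n).
prob_each_sample = Fraction(1, num_samples)

# Probability that one particular unit (here "a") ends up in the sample.
prob_a_included = sum(prob_each_sample for s in all_samples if "a" in s)

# Draw one simple random sample; random.sample samples without replacement.
rng = random.Random(0)  # seeded only so the sketch is reproducible
one_sample = rng.sample(population, n)

print(len(all_samples), prob_each_sample, prob_a_included, one_sample)
```

Replacing `"a"` with any other label gives the same inclusion probability, which is the sense in which every unit is equally likely to be selected.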
Warm up problem
Consider the following problem:
You have a box containing 5 cards. Four of the cards are labeled with the number \(0\) and one of them is labeled with the number \(1\). You pick two cards at random with replacement. Let \(Y\) represent the average of the two cards.
- What is the distribution of the random variable \(Y\)? (Hint: Define \(X\) to be the sum of the two cards. What is the distribution of \(X\)?)
Check your answer
\(X \sim \mathrm{Bin}\left(2, \dfrac{1}{5}\right)\), and \(Y = X/2\).
\(P(Y = 0) = P(X = 0) = \dfrac{16}{25}\)
\(P(Y = \dfrac{1}{2}) = P(X = 1) = \dfrac{8}{25}\)
\(P(Y = 1) = P(X = 2) = \dfrac{1}{25}\)
- Compute \(E(Y)\) and \(\mathrm{Var}(Y)\).
Check your answer
\(E(Y) = \dfrac{1}{5}\) and \(\mathrm{Var}(Y) = \dfrac{2}{25}\).
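One way to double-check these answers is to enumerate all 25 equally likely ordered draws. The sketch below does this with exact fractions; the variable names are our own and are not notation from the text.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

cards = [0, 0, 0, 0, 1]

# With replacement: 5 * 5 = 25 equally likely ordered pairs of cards.
outcomes = list(product(cards, repeat=2))

# Accumulate the probability of each possible value of the average Y.
dist_y = defaultdict(Fraction)
for first, second in outcomes:
    dist_y[Fraction(first + second, 2)] += Fraction(1, len(outcomes))

# Expectation and variance computed directly from the distribution.
mean_y = sum(y * p for y, p in dist_y.items())
var_y = sum((y - mean_y) ** 2 * p for y, p in dist_y.items())

for y in sorted(dist_y):
    print(f"P(Y = {y}) = {dist_y[y]}")
print("E(Y) =", mean_y, " Var(Y) =", var_y)
```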
Now what if I sample without replacement? Let \(Z\) be the average of the two cards in this case. What is the distribution of \(Z\)?
Check your answer
Now the sum is Hypergeometric. (What are the parameters of the distribution?)
You can work out that \(P(Z = 0) = \dfrac{3}{5}, P(Z = \dfrac{1}{2}) = \dfrac{2}{5}\).
Why can’t \(Z\) be \(1\)?
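The same kind of enumeration works without replacement. The sketch below lists the \(\binom{5}{2} = 10\) equally likely pairs of distinct cards and tabulates the average; it is only a numerical check, and the names used are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

cards = [0, 0, 0, 0, 1]

# Without replacement: combinations() treats the five cards as distinct
# objects (by position), giving 10 equally likely unordered pairs.
samples = list(combinations(cards, 2))

# Count how often each value of the average Z occurs.
counts = Counter(Fraction(a + b, 2) for a, b in samples)
dist_z = {z: Fraction(c, len(samples)) for z, c in counts.items()}

# Expectation and variance of Z, for comparison with Y above.
mean_z = sum(z * p for z, p in dist_z.items())
var_z = sum((z - mean_z) ** 2 * p for z, p in dist_z.items())

for z in sorted(dist_z):
    print(f"P(Z = {z}) = {dist_z[z]}")
print("E(Z) =", mean_z, " Var(Z) =", var_z)
```

Comparing the printed mean and variance with those of \(Y\) shows what changes, and what does not, when we stop replacing the first card.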
Definitions and vocabulary
The figure below, adapted from Lohr’s book on sampling, shows that we have to be careful regarding the scope of our conclusions. We can only generalize results from our sample to the sampled population, even if the target population (the population we are interested in) is something different!

Population: the complete set of individuals or entities that we are interested in. We usually only have data on a subset of them (a sample). We will assume that our population is of (finite) size \(N\), and that associated with each member or unit of the population is some numerical value. We will denote these numbers by \(x_1, x_2, \ldots, x_N\). If the values of the \(x_i\) are \(0\) or \(1\) then we are usually investigating the presence or absence of some characteristic, such as a particular party affiliation. In this case, our population is dichotomous or binary.
Parameter: any quantifiable feature of a population. For now, we will assume that the parameter is fixed but unknown.
For example: The mean age of all undergraduate students at UC Berkeley.
The most common population parameters that we are interested in are:
Population mean or average: this is denoted by \(\mu\) and defined to be: \[ \mu = \dfrac{1}{N}\sum_{i = 1}^N x_i \]
Population proportion: This is just the population mean in the binary case, and we represent this special mean by \(p\) rather than \(\mu\).
Other parameters that we will consider:
Population total: this is denoted by \(\tau\) and defined to be: \[ \tau = \sum_{i = 1}^N x_i = N\mu \] Note that for a binary population, \(\tau\) represents how many population units possess the characteristic of interest.
Population variance: this is denoted by \(\sigma^2\) and defined to be: \[ \sigma^2 = \dfrac{1}{N}\sum_{i = 1}^N (x_i-\mu)^2 \] The population standard deviation is the square root of the population variance.
Exercise: Show that \(\sigma^2\) reduces to \(\displaystyle \dfrac{1}{N}\sum_{i = 1}^N x_i^2 -\mu^2\), and if the \(x_i\) are \(0\) or \(1\) only, then \(\sigma^2 = p(1-p)\).
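Before proving the identity, it can help to see it hold numerically. The sketch below computes \(\mu\), \(\tau\), and \(\sigma^2\) from the definitions above for two small populations whose values are made up for illustration, and compares \(\sigma^2\) with the shortcut formulas. The helper `population_parameters` is a hypothetical name, not something from the text.

```python
from fractions import Fraction

def population_parameters(values):
    """Return (mu, tau, sigma^2) using the definitions in these notes."""
    N = len(values)
    tau = sum(values)                                 # population total
    mu = Fraction(tau, N)                             # population mean
    sigma2 = sum((x - mu) ** 2 for x in values) / N   # divide by N, not N - 1
    return mu, tau, sigma2

# Hypothetical numerical population.
values = [2, 4, 4, 6]
mu, tau, sigma2 = population_parameters(values)
shortcut = Fraction(sum(x ** 2 for x in values), len(values)) - mu ** 2
print(mu, tau, sigma2, shortcut)

# Hypothetical binary population: the mean is the proportion p.
binary = [0, 0, 0, 0, 1]
p, tau_b, sigma2_b = population_parameters(binary)
print(p, tau_b, sigma2_b, p * (1 - p))
```

A numerical check on two populations is not a proof, so the exercise still asks for the general algebraic argument.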
Exercise: Go back to the results of the Pew Research survey shown at the beginning of these notes. What are the population and the parameter of interest?
Check your answer
Population: US adults
Parameter: Percentage of US adults that think healthy children should be required to be vaccinated in order to attend public schools.
Exercise: Consider a population of size 4: \(\{x_1, x_2, x_3, x_4\}\).
If we use simple random sampling, how many samples of size 2 will we have? What would be the expected value of the sample mean? Is it equal to the population mean?
Suppose that, instead of a simple random sample (where all samples of size 2 are equally likely), we use a different probabilistic scheme for getting our samples of size 2: only the following four samples are possible, and they are equally likely: \(\{x_1, x_2\}, \{x_2, x_3\}, \{x_3, x_4\}, \{x_1, x_4\}\). What would be the expected value of the sample mean? Is it equal to the population mean?
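After working the exercise by hand, you could check your answers numerically along the following lines. The population values below are hypothetical (the exercise leaves \(x_1, \ldots, x_4\) general), and `expected_sample_mean` is an illustrative helper name.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical values for x_1, ..., x_4; any four numbers would do.
x = [1, 2, 3, 10]
mu = Fraction(sum(x), len(x))

def expected_sample_mean(index_samples):
    """Expected sample mean when the listed samples are equally likely."""
    means = [Fraction(sum(x[i] for i in s), len(s)) for s in index_samples]
    return sum(means) / len(means)

# Scheme 1: simple random sampling, all samples of size 2 equally likely.
srs_samples = list(combinations(range(4), 2))

# Scheme 2: only {x1,x2}, {x2,x3}, {x3,x4}, {x1,x4} (0-based indices here).
restricted_samples = [(0, 1), (1, 2), (2, 3), (0, 3)]

print("number of SRS samples:", len(srs_samples))
print("population mean:", mu)
print("E(sample mean), SRS:", expected_sample_mean(srs_samples))
print("E(sample mean), restricted scheme:", expected_sample_mean(restricted_samples))
```

Trying a few different choices of `x` is a useful sanity check, but the exercise asks for an argument that holds for every population of size 4.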