Introducing Hypothesis Tests

Introduction

We turn from parameter estimation to another cornerstone of statistical inference: hypothesis testing, one of the most commonly used statistical techniques. We first posit a model that might have generated the sample data, and then use probability calculations based on this assumption (called the null hypothesis) to see whether the observed sample is consistent with it. We compute how likely it is that any observed difference is due to random variation alone. If that is too unlikely, we conclude that the data are extreme enough to contradict the assumption, and we reject it; that is, we reject our null hypothesis. Lehmann (1992) describes how the theory of hypothesis testing was developed largely by three people in the first part of the 20th century: Ronald Fisher, Jerzy Neyman, and Egon Pearson. They were not always in agreement: the Fisherian approach, for example, focuses only on the null hypothesis and aims to solve the scientific problem, whereas Neyman and Pearson formulated a decision problem, a different philosophy. In his article, though, Lehmann argues that “… in their main practical aspects the two theories are complementary rather than contradictory and that a unified approach is possible that combines the best features of both.” Indeed, what we will use is a hybrid approach: we specify both null and alternative hypotheses and choose between them, following the Neyman-Pearson paradigm, but we also use \(p\)-values to quantify how likely our sample is under the null assumption.

We will first study the Neyman-Pearson approach. Unlike Fisher, who focused only on the null hypothesis and developed \(p\)-values, Neyman and Pearson defined an alternative hypothesis: in their view it is not enough to reject the null; when you do, you decide in favor of the alternative and accept it. The Neyman-Pearson approach emphasizes the probabilities of the two possible errors: we might reject a true null (Type I error), or we might accept a false null (Type II error).

Questions that might be addressed

Often we have data, and have some assumptions about how the data were generated. Here are some questions that might be addressed by hypothesis testing:

  • Is a coin (or die, or roulette wheel) fair?
  • Does the Covid vaccine work?
  • Does the police force of a city reflect the ethnic makeup of the city residents?
  • Has the proportion of Americans that believe in the theory of evolution declined over the years?
  • Has Americans’ support of Planned Parenthood declined?
  • Is the Salk vaccine effective at preventing polio?
  • Is this new cancer treatment a significant improvement over an existing one?
  • Do voters approve of the US action in the Middle East?
  • Is a nurse accused of murdering patients on their shift guilty?
  • Does the data come from a Normal population centered at zero or centered somewhere else?
  • Is the rate of the exponential distribution that generated our sample equal to a specified value?

For example, we might toss a coin 100 times, and see how many times it lands heads. Would we be comfortable assuming that it is fair if it lands heads 55 times? What about 60 heads? Where would we draw the line? We need to quantify how unusual our sample is given our assumption of fairness.
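To make this concrete, here is a minimal Python sketch (standard library only; the helper name is ours) that computes how often a fair coin would land heads at least this many times in 100 tosses:

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# How surprising are 55 or 60 heads in 100 tosses of a fair coin?
for heads in (55, 60):
    print(f"P(X >= {heads}) = {binom_tail(100, heads, 0.5):.4f}")
```

Roughly 18% of experiments with a fair coin give 55 or more heads, so 55 is unremarkable, while 60 or more heads occurs only about 3% of the time, which starts to look suspicious.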

Example: What is the probability of heads for a selected coin?

I have two types of coins, some that have probability of heads \(p=0.5\), and others with \(p=0.7\). If I pick a coin at random, how can I decide if it is biased or not? What do I actually want to test?

We want to test the null hypothesis, denoted by \(H_0\), that the coin is fair: \[ H_0: p = 0.5 \] against the alternative hypothesis, denoted by \(H_1\), that the coin is biased with a particular probability of landing heads of \(0.7\): \[ H_1: p = 0.7 \] How would you do this? Well, you could toss the coin many times and count the fraction of heads. If it is too far from 50%, that would be evidence that it is not a fair coin. But how far is too far? How can we decide?

Suppose I toss the coin ten times and count the number of heads \(X\). The following table gives the probabilities of each outcome (see page 329 in Rice (2006)):

| # Heads | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| \(p=0.5\) | 0.0010 | 0.0098 | 0.0439 | 0.1172 | 0.2051 | 0.2461 | 0.2051 | 0.1172 | 0.0439 | 0.0098 | 0.0010 |
| \(p=0.7\) | 0.0000 | 0.0001 | 0.0014 | 0.0090 | 0.0368 | 0.1029 | 0.2001 | 0.2668 | 0.2335 | 0.1211 | 0.0282 |
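The entries in this table can be reproduced with a short Python snippet (standard library only; `binom_pmf` is our own helper, not from the text):

```python
from math import comb

def binom_pmf(n, k, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Reproduce both rows of the table for n = 10 tosses
print(" k   p=0.5   p=0.7")
for k in range(11):
    print(f"{k:2d}  {binom_pmf(10, k, 0.5):.4f}  {binom_pmf(10, k, 0.7):.4f}")
```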

What would you do if \(X=3\)? \(X=6\)? \(X=8\)?

Consider the likelihoods under each hypothesis: \[ P(X=3|H_0) \text{ vs. } P(X=3|H_1) \]

We can compare them using the Likelihood Ratio, which is \[ \frac{P(X=x|H_0)}{P(X=x|H_1)} \] This ratio compares how likely the observed data is under the null hypothesis (\(p=0.5\)) versus the alternative hypothesis (\(p=0.7\)). Let’s look at this ratio for the three values of \(X\) listed above:

For \(X = 3\): \[ \frac{P(X=3|H_0)}{P(X=3|H_1)} = \frac{0.1172}{0.0090} \approx 13.02 \]

We see that the data seems to be approximately 13 times more likely under the null than the alternative. This provides strong evidence in favor of \(H_0\).

For \(X = 6\): \[ \frac{P(X=6|H_0)}{P(X=6|H_1)} = \frac{0.2051}{0.2001} \approx 1.02 \] Here the likelihoods are nearly identical: it is hard to distinguish between a fair coin and this particular biased coin, so we might (rather arbitrarily) decide that the coin is fair.

For \(X = 8\): \[ \frac{P(X=8|H_0)}{P(X=8|H_1)} = \frac{0.0439}{0.2335} \approx 0.188 \] In this case, we see that the ratio is quite low, meaning the data is much more likely under the alternative hypothesis.
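The three ratios above can be checked numerically; here is a small Python sketch (standard library only; the helper names are ours):

```python
from math import comb

def binom_pmf(n, k, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def likelihood_ratio(x, n=10, p0=0.5, p1=0.7):
    """P(X = x | H0) / P(X = x | H1) for the two candidate coins."""
    return binom_pmf(n, x, p0) / binom_pmf(n, x, p1)

for x in (3, 6, 8):
    print(f"X = {x}: likelihood ratio = {likelihood_ratio(x):.3f}")
```

Working from exact binomial probabilities rather than the four-decimal table entries gives the same values up to rounding.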

Before we go further, we need to define some basic vocabulary, which we take from the text Rice (2006):

Terminology of Hypothesis testing

Null and alternative hypotheses: We single out one of the hypotheses and call it the null hypothesis (\(H_0\)). The other is called the alternative hypothesis (\(H_1\), sometimes \(H_A\)).

Type I Error: Rejecting the null hypothesis (\(H_0\)) when it is actually true.

Significance Level (\(\alpha\)): The probability of committing a Type I error (\(P(\text{Reject } H_0 \mid H_0)\)).

Type II Error: Accepting the null hypothesis when it is actually false. This probability is denoted by \(\beta\).

Power: You should think of this as the power of the test to reject a false null. It is the probability that the null hypothesis is rejected when it is false, and is therefore \(1 - \beta\).

Test Statistic: The specific function of the data that is used to make a decision. In our coin example above, the likelihood ratio served as the test statistic, but we could have equivalently used the number of heads.

Rejection Region: The set of values of the test statistic that leads to the rejection of the null hypothesis.

Acceptance Region: The set of values of the test statistic that leads to the acceptance of the null hypothesis.

Null Distribution: The probability distribution of the test statistic under the assumption that the null hypothesis is true.


Example of a decision rule

In our coin example above, the likelihood ratios are, with the three values we considered highlighted:

| Number of Heads (\(x\)) | Likelihood Ratio \(\frac{P(x \mid H_0)}{P(x \mid H_1)}\) | Evidence Direction |
|---|---|---|
| 0 | 165.4 | Strongly favors \(H_0\) |
| 1 | 70.88 | Strongly favors \(H_0\) |
| 2 | 30.38 | Favors \(H_0\) |
| **3** | **13.02** | **Favors \(H_0\)** |
| 4 | 5.579 | Favors \(H_0\) |
| 5 | 2.391 | Weakly favors \(H_0\) |
| **6** | **1.025** | **Nearly neutral (\(H_0 \approx H_1\))** |
| 7 | 0.4392 | Weakly favors \(H_1\) |
| **8** | **0.1882** | **Favors \(H_1\)** |
| 9 | 0.0807 | Strongly favors \(H_1\) |
| 10 | 0.0346 | Strongly favors \(H_1\) |
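The regular pattern in this table is no accident. Since the binomial coefficient cancels in the numerator and denominator, the ratio has a simple closed form in our example (a direct calculation, not given in the text): \[ \frac{P(X=x|H_0)}{P(X=x|H_1)} = \frac{\binom{10}{x}(0.5)^x(0.5)^{10-x}}{\binom{10}{x}(0.7)^x(0.3)^{10-x}} = \left(\frac{0.5}{0.7}\right)^{x}\left(\frac{0.5}{0.3}\right)^{10-x}, \] which shrinks by a factor of \(\frac{0.5}{0.7}\cdot\frac{0.3}{0.5} = \frac{3}{7}\) each time \(x\) increases by one, so the ratio is strictly decreasing in \(x\).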

Notice that the likelihood ratio decreases monotonically as the number of heads grows, so the strength of the evidence against the null is monotonically increasing in \(x\). Therefore, rejecting \(H_0\) when the likelihood ratio is less than a constant \(c\) is equivalent to rejecting when the number of heads is greater than some value \(x_0\). This suggests a decision rule in which we reject when the likelihood ratio is smaller than 1. That is, based on the table above, if we set our threshold for the likelihood ratio at \(c=1\) (rejecting when it is smaller than \(c\)), we would get:

  • Rejection Region: \(X \in \{7, 8, 9, 10\}\)
  • Acceptance Region: \(X \in \{0, 1, 2, 3, 4, 5, 6\}\)
  • Significance Level (\(\alpha\)): \(P(X > 6 \mid H_0) \approx 0.17\)

Note that the significance level is a conditional probability.
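Putting the pieces together, here is a Python sketch (standard library only; the names are ours) that derives the rejection region from the likelihood-ratio rule with \(c = 1\) and computes the resulting significance level and power:

```python
from math import comb

def binom_pmf(n, k, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p0, p1 = 10, 0.5, 0.7
# Reject H0 whenever the likelihood ratio falls below c = 1
rejection = [x for x in range(n + 1)
             if binom_pmf(n, x, p0) / binom_pmf(n, x, p1) < 1]
alpha = sum(binom_pmf(n, x, p0) for x in rejection)  # P(reject H0 | H0): Type I error
power = sum(binom_pmf(n, x, p1) for x in rejection)  # P(reject H0 | H1) = 1 - beta
print("Rejection region:", rejection)
print(f"alpha = {alpha:.4f}, power = {power:.4f}")
```

The exact significance level is \(176/1024 \approx 0.172\), and the power against \(p = 0.7\) is about \(0.65\): with only ten tosses, this test will miss the biased coin roughly a third of the time.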

(Rice 2006; Pimentel 2024; Hogg, McKean, and Craig 2005; Wasserman 2004; Lehmann 1992)

References

Hogg, Robert V., Joseph W. McKean, and Allen T. Craig. 2005. Introduction to Mathematical Statistics. 6th ed. Upper Saddle River, NJ: Pearson Prentice Hall.
Lehmann, E. L. 1992. “The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?” 333. Berkeley, CA: Department of Statistics, University of California.
Pimentel, Sam. 2024. “STAT 135 Lecture Slides.” Lecture slides (shared privately).
Rice, John A. 2006. Mathematical Statistics and Data Analysis. 3rd ed. Duxbury Press.
Wasserman, Larry. 2004. All of Statistics: A Concise Course in Statistical Inference. New York: Springer.