Two-Sample Tests: Nonparametric Tests

[DRAFT, IN PROGRESS]

Introduction

So far, we have considered parametric tests, that is, tests in which we make assumptions about the underlying population distribution or its parameters (for example, we might assume that the underlying distributions are normal). If our samples are large, the central limit theorem often makes procedures built on these assumptions approximately valid, but if our samples are small and we don’t know whether the populations are normal, these assumptions can lead to errors. We can avoid this by using nonparametric tests, which make no assumptions about the form of the underlying population distributions. Such tests may be based on resampling, or on the ordering of the observations, that is, the ranks of the sample values.

The advantage of nonparametric tests is that they are robust to violations of distributional assumptions, since they don’t make any. They are also usually unaffected by outliers. The drawback is that, because they don’t use any information about the data-generating process (the population distribution), they may have less power than a parametric test when the parametric assumptions actually hold.

Permutation Tests

In past lectures, we discussed permutation tests for testing the independence of categorical variables. Now we will use permutation tests, which rely on resampling, to test whether two samples come from the same underlying population distribution. Note that our hypotheses are now about the full distributions:

\[ H_0: F = G \qquad \text{vs} \qquad H_1: F \neq G \]

Setup: \(X_1, \ldots, X_n \overset{\text{IID}}{\sim} F\) and \(Y_1, \ldots, Y_m \overset{\text{IID}}{\sim} G\), where \(F\) and \(G\) are unknown.

Under \(H_0\), \(F = G\), so the combined sample \(X_1, \ldots, X_n, Y_1, \ldots, Y_m\) is also IID from \(F\).
Any shuffle of the combined sample (e.g. \(X_5, X_3, Y_2, Y_{12}, \ldots\)) is also IID from \(F\) under \(H_0\).
Under \(H_0\), the actual sample is therefore indistinguishable from a permuted sample; under \(H_1\), permuted samples will have a different distribution from the actual sample.

The null hypothesis implies that the group labels of \(X\) and \(Y\) are arbitrary.

Two-Sample Permutation Test: Steps

1. Pick a test statistic \(T\) (for example, the difference of means, or the t-statistic) and compute its value on the original sample: call this \(T_{\text{obs}}\).
2. Combine the samples into a single sample of size \(m+n\).
3. Draw \(m\) observations from the combined sample without replacement to form the resampled \(Y\) group; the remaining \(n\) observations form the resampled \(X\) group.
4. Compute the chosen test statistic on the resampled groups. Call this \(T^*_1\).
5. Repeat steps 3 and 4 many times, say \(B\) times, where \(B\) might be 5,000, to obtain \(T^*_1, \ldots, T^*_B\). These values approximate the null distribution of the test statistic, that is, its distribution assuming that there is no difference between the underlying distributions of the two samples.
6. Compute the \(p\)-value as the proportion of the \(T^*_b\) that are at least as extreme as \(T_{\text{obs}}\).
7. Reject \(H_0\) if \(T_{\text{obs}}\) is too extreme relative to this null distribution, that is, if the \(p\)-value is very small.

The great thing about the permutation test is that we can build a null distribution for any statistic of our choice, so we can pick a statistic that has high power against the alternatives we care about.
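
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `permutation_test`, the default statistic (difference of means), the two-sided notion of "extreme", and the data in the usage lines are all assumptions made for the example. Shuffling the combined sample and splitting it is used here as an equivalent way of drawing one group without replacement.

```python
import random
from statistics import mean


def permutation_test(x, y, stat=None, B=5000, seed=0):
    """Two-sided two-sample permutation test; returns (t_obs, p_value)."""
    if stat is None:
        # Assumed default statistic: difference of sample means.
        stat = lambda a, b: mean(a) - mean(b)
    rng = random.Random(seed)            # fixed seed so runs are reproducible
    t_obs = stat(x, y)                   # step 1: statistic on the original samples
    combined = list(x) + list(y)         # step 2: pool the two samples
    n = len(x)
    count = 0
    for _ in range(B):                   # step 5: repeat B times
        rng.shuffle(combined)            # step 3: random relabelling of the pool
        t_star = stat(combined[:n], combined[n:])   # step 4
        if abs(t_star) >= abs(t_obs):    # "at least as extreme", two-sided
            count += 1
    return t_obs, count / B              # step 6: proportion of extreme T*


# Hypothetical data, for illustration only.
x = [5.1, 6.3, 7.0, 8.2, 9.4]
y = [3.0, 4.1, 4.8, 5.5, 6.0]
t_obs, p_value = permutation_test(x, y)
print(t_obs, p_value)
```

Any other statistic can be passed as `stat`, as long as it takes the two groups as arguments; for a one-sided test the comparison inside the loop would be changed accordingly.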

Wilcoxon Rank-Sum Test (Mann-Whitney U Test)

We now define a different nonparametric test that is also, in essence, a permutation test. It uses a specific test statistic, the rank sum, to answer the question: are the distributions the same, or does one population tend to yield larger values?

Procedure:

Combine both samples and rank them in order of increasing size.
Compute the sum of ranks belonging to one of the groups.
If this rank sum is too large or too small, it suggests the populations differ.

The basic idea is that if \(X_i\) and \(Y_j\) have the same continuous distribution, then \[ P(X_i < Y_j) = \frac{1}{2} \]

for any \(X_i\) and \(Y_j\). This means that if the null is true, then the \(X_i\)’s should be scattered randomly among the \(Y_j\)’s, and the sum of their ranks should not be too small or too large. Let’s go over an example that will illustrate the idea.

Example 1 Survival times (years) after a coronary attack

A physician believes that a certain treatment can prolong the life of people who have suffered a coronary attack. He runs an experiment on 10 patients, randomly assigning 5 to the treatment group and 5 to the control group (no treatment), and makes sure to control for other factors such as age and general health. Five years later, 4 of the 5 patients in the treatment group are still alive, but only 2 of the 5 in the control group. Below are the survival times, in years, of all ten patients:

Group	Survival times (years)
Treatment	4.2	6.5	7.9	13.2	17.8
Control	0.4	1.2	2.9	5.6	6.7

Combined and ordered:

\[ \underbrace{0.4}_{1},\ \underbrace{1.2}_{2},\ \underbrace{2.9}_{3},\ \underbrace{4.2}_{4},\ \underbrace{5.6}_{5},\ \underbrace{6.5}_{6},\ \underbrace{6.7}_{7},\ \underbrace{7.9}_{8},\ \underbrace{13.2}_{9},\ \underbrace{17.8}_{10} \]

\[ \underbrace{C \quad C \quad C \quad T \quad C \quad T \quad C \quad T \quad T \quad T}_{\text{group labels}} \]

\[ W_T = \text{rank sum of treatment} = 4 + 6 + 8 + 9 + 10 = 37 \]

\[ W_C = \text{rank sum of control} = 1 + 2 + 3 + 5 + 7 = 18 \]

Note: \(W_T + W_C = \frac{10 \cdot 11}{2} = 55\) ✓
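
The ranking and the two rank sums can be checked with a short Python sketch. The helper name `rank_sums` is an assumption made for this illustration, and the code assumes there are no tied values (true for these data).

```python
treatment = [4.2, 6.5, 7.9, 13.2, 17.8]
control = [0.4, 1.2, 2.9, 5.6, 6.7]


def rank_sums(x, y):
    """Return (rank sum of x, rank sum of y) in the combined sample.

    Assumes all values are distinct, so every value has a unique rank.
    """
    combined = sorted(list(x) + list(y))
    rank = {value: i + 1 for i, value in enumerate(combined)}  # ranks start at 1
    return sum(rank[v] for v in x), sum(rank[v] for v in y)


w_t, w_c = rank_sums(treatment, control)
print(w_t, w_c, w_t + w_c)  # 37 18 55
```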

Null hypothesis: The treatment has no effect — each person just lives as many years as they would have anyway.

Computing the P-value

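Because the physician’s claim is one-sided (the treatment prolongs life), small values of \(W_C\) are evidence against \(H_0\), so we compute \(P(W_C \leq 18)\) under \(H_0\). There are \(\binom{10}{5} = 252\) equally likely assignments of ranks to the control group.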

Rank subsets of size 5 with sum \(\leq 18\):

\[ 1+2+3+4+5 = 15 \]

\[ 1+2+3+4+6 = 16 \]

\[ 1+2+3+4+7 = 17 \]

\[ 1+2+3+5+6 = 17 \]

\[ 1+2+3+4+8 = 18 \]

\[ 1+2+3+5+7 = 18 \]

\[ 1+2+4+5+6 = 18 \]

\[ P(W_C \leq 18) = \frac{7}{252} \approx 0.028 \]

Conclusion: Reject \(H_0\) at \(\alpha = 0.05\). The data suggest that the treatment increases survival time.
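
The count of 7 rank subsets can be verified by brute-force enumeration. The following is a small sketch; the function name `exact_lower_tail` is hypothetical, and the approach is only practical when \(\binom{m+n}{n}\) is small.

```python
from itertools import combinations
from math import comb


def exact_lower_tail(w_obs, group_size, total):
    """Count rank subsets of the given size whose sum is <= w_obs.

    Returns (number of such subsets, total number of subsets); under the
    null hypothesis every subset is equally likely.
    """
    ranks = range(1, total + 1)
    hits = sum(1 for s in combinations(ranks, group_size) if sum(s) <= w_obs)
    return hits, comb(total, group_size)


hits, n_assignments = exact_lower_tail(18, 5, 10)
p_value = hits / n_assignments
print(hits, n_assignments, round(p_value, 3))  # 7 252 0.028
```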

Note: Group sizes do not need to be equal. For large samples, use a normal approximation.
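
As a rough illustration of the normal approximation, the sketch below uses the standard null mean and variance of a rank sum, \(E(W) = k(N+1)/2\) and \(\mathrm{Var}(W) = k(N-k)(N+1)/12\) for a group of size \(k\) out of \(N\) observations with no ties. These formulas are not stated in the notes above, and the continuity correction and the function names are likewise choices made for this example. It is applied to Example 1 only for comparison with the exact value; with 5 observations per group the exact calculation is preferable.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def rank_sum_lower_tail_normal(w_obs, group_size, other_size, continuity=True):
    """Approximate P(W <= w_obs) under H0 with a normal distribution.

    Assumes no ties. Returns (null mean, null variance, approximate p-value).
    """
    total = group_size + other_size
    mean_w = group_size * (total + 1) / 2
    var_w = group_size * other_size * (total + 1) / 12
    shift = 0.5 if continuity else 0.0   # optional continuity correction
    z = (w_obs + shift - mean_w) / sqrt(var_w)
    return mean_w, var_w, normal_cdf(z)


mean_w, var_w, p_approx = rank_sum_lower_tail_normal(18, 5, 5)
print(mean_w, round(var_w, 2), round(p_approx, 2))  # 27.5 22.92 0.03
```

Even at this small sample size the approximation lands close to the exact value \(7/252 \approx 0.028\).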

Building the Null Distribution (Small Example)

With \(n = 2\) treatment and \(m = 3\) control subjects (5 total), there are \(\binom{5}{2} = 10\) rank assignments.

Ranks	(1,2)	(1,3)	(1,4)	(1,5)	(2,3)	(2,4)	(2,5)	(3,4)	(3,5)	(4,5)
\(W_T\)	3	4	5	6	5	6	7	7	8	9

Range: \(1+2 = 3 \leq W_T \leq 4+5 = 9\)

\[ P(W_T \leq w) = \frac{\#(W_T \leq w)}{10} \]

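Since \(W_C = 15 - W_T\) and the null distribution of \(W_T\) is symmetric about its mean, the null distribution of \(W_C\) has the same shape as that of \(W_T\), just shifted (here by 3, so that \(6 \leq W_C \leq 12\)).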
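
The table and the "same shape, shifted" remark can be reproduced by enumerating every rank assignment. This is a sketch only; `rank_sum_null` is a name introduced here for illustration.

```python
from collections import Counter
from itertools import combinations


def rank_sum_null(group_size, total):
    """Exact null distribution of a rank sum, as {value: probability}."""
    counts = Counter(
        sum(s) for s in combinations(range(1, total + 1), group_size)
    )
    n_assignments = sum(counts.values())
    return {w: counts[w] / n_assignments for w in sorted(counts)}


null_t = rank_sum_null(2, 5)  # treatment group: 2 of the 5 ranks
null_c = rank_sum_null(3, 5)  # control group: the other 3 ranks
print(null_t)  # {3: 0.1, 4: 0.1, 5: 0.2, 6: 0.2, 7: 0.2, 8: 0.1, 9: 0.1}
print(null_c)  # {6: 0.1, 7: 0.1, 8: 0.2, 9: 0.2, 10: 0.2, 11: 0.1, 12: 0.1}

# Same probabilities, with every value of W_T moved up by 3.
print({w + 3: p for w, p in null_t.items()} == null_c)  # True
```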

Mann-Whitney Test: Formal Setup

Given independent samples \(X_1, \ldots, X_n \sim F\) and \(Y_1, \ldots, Y_m \sim G\), let \(Z = (X, Y)\) be the combined sample.

\[ R_i(Z) = \text{rank of } X_i \text{ in } Z, \quad 1 \leq i \leq n \]

\[ R_{n+j}(Z) = \text{rank of } Y_j \text{ in } Z, \quad 1 \leq j \leq m \]

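Let \(R(X) = \sum_{i=1}^{n} R_i(Z)\) and \(R(Y) = \sum_{j=1}^{m} R_{n+j}(Z)\) denote the rank sums of the two samples. Since the ranks are \(1, \ldots, m+n\),

\[ R(X) + R(Y) = \frac{(m+n)(m+n+1)}{2} \]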

Under \(H_0\):

\[ P(X < Y) = \frac{1}{2} \]

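Every assignment of the \(m + n\) ranks to the observations is equally likely (there are \((m+n)!\) assignments).
Only the set of ranks received by the \(X\)’s matters for the rank sum, and each set arises from \(n!\,m!\) assignments, so there are \((m+n)!/(n!\,m!) = \binom{m+n}{n}\) equally likely rank sets for \(X\).
The null distribution of the rank sum \(R(X)\) is found by enumerating all \(\binom{m+n}{n}\) rank sets.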

Why Use Wilcoxon-Mann-Whitney?

If the actual data values have no inherent meaning beyond their relative ordering, ranks carry as much information as the raw numbers.
When distributions are non-normal and samples are small, parametric analysis is difficult.
The power of the Mann-Whitney test is comparable to that of the two-sample \(t\)-test, even when the data are normal. When normality is unknown and sample sizes are too small to check it, Mann-Whitney is the safer choice.

References

Rice, John A. 2006. Mathematical Statistics and Data Analysis. 3rd ed. Duxbury Press.