More about Estimators and Mathematical Digressions
Properties of Estimators
In this chapter, we discuss some desirable properties that an estimator should have. This is just the beginning of the discussion, and we will continue it when we study the method of maximum likelihood estimation. So far, we have discussed three criteria for estimators:
- Unbiasedness
- Low variance
- Low MSE (Mean Square Error)
Unbiasedness
Recall that we call an estimator \(\hat{\theta}\) unbiased if on average, the estimator is on target. That means that if we are trying to estimate \(\theta\), then \(E(\hat{\theta}) = \theta\). For example, since \(E(\overline{X}) = \mu\), the population mean, the sample mean \(\overline{X}\) is an unbiased estimator of the population mean, and \(\hat{p}\), the sample proportion, is an unbiased estimator of the population proportion.
Exercise Let \(X_1, \ldots, X_n\) be an IID sample from some distribution with mean \(\mu\) and variance \(\sigma^2\). Show that \(S^2\) is an unbiased estimator of \(\sigma^2\), where \(\displaystyle S^2 = \dfrac{1}{n-1}\sum_{i=1}^n \left(X_i - \overline{X}\right)^2\) is the sample variance of \(X_1, \ldots, X_n\).
Exercise Consider an IID random sample \(X_1, \ldots, X_n\) from a \(Unif(0,b)\) distribution. Find \(\hat{b}\), the method of moments estimator of \(b\). Is \(\hat{b}\) an unbiased estimator of \(b\)? What about \(\hat{b}^2\)? Is it an unbiased estimator of \(b^2\)?
As we saw in the previous chapters, if the expectation of the estimator is the true value multiplied by a known constant, then we can divide the estimator by that constant to make it unbiased. That is, if \(E(\hat{\theta}) = c\theta\), where \(c \ne 0\) is a known constant, then \(\dfrac{\hat{\theta}}{c}\) is an unbiased estimator of \(\theta\), since \(E\left(\dfrac{\hat{\theta}}{c}\right) = \dfrac{c\theta}{c} = \theta\).
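As a rough numerical illustration of this rescaling, here is a minimal simulation sketch. It assumes a specific example that is not taken from the text above: the sample maximum of a \(Unif(0,b)\) sample, whose expectation is \(\dfrac{n}{n+1}b\). The function name `average_max` and the values of `n`, `b`, `reps`, and `seed` are arbitrary choices made for this illustration.

```python
import random

def average_max(n, b, reps=20000, seed=0):
    """Monte Carlo average of max(X_1..X_n), X_i ~ Unif(0, b), before and after rescaling."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        total += max(rng.uniform(0, b) for _ in range(n))
    raw = total / reps      # approximates E(max) = n/(n+1) * b
    c = n / (n + 1)         # the known constant in E(theta_hat) = c * theta
    return raw, raw / c     # dividing by c removes the bias

raw, corrected = average_max(n=5, b=10.0)
print(raw, corrected)
```

The first printed value should sit below \(b\), while the second should be close to \(b\), up to simulation error.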
Low Variance
Low variance is important because it makes the estimator more precise: its value changes less from one sample to the next. We have just seen that we can remove a bias that is a known constant multiple, but if an estimator has high variability, there is not much we can do. A low variance means that the standard error of the estimator (its standard deviation) is low, which is a desirable property.
Mean Square Error
Recall that \(MSE = \mathrm{bias}^2 + \mathrm{Variance}.\) Since the MSE combines both bias and variance, it is very useful for comparing estimators that may be biased.
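For reference, here is a short derivation of this decomposition, writing \(\mathrm{MSE}(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right]\) and \(\mathrm{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta\). We add and subtract \(E(\hat{\theta})\) inside the square; the cross term vanishes because \(E\left[\hat{\theta} - E(\hat{\theta})\right] = 0\).
\[ E\left[(\hat{\theta} - \theta)^2\right] = E\left[\left(\hat{\theta} - E(\hat{\theta})\right)^2\right] + 2\left(E(\hat{\theta}) - \theta\right)E\left[\hat{\theta} - E(\hat{\theta})\right] + \left(E(\hat{\theta}) - \theta\right)^2 = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2. \]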
Exercise (Chihara and Hesterberg 2018) Let \(X \sim Bin(n,p)\), where \(n\) is known and \(p\) is unknown. Show that the sample proportion \(\hat{p}_1 = X/n\) is an unbiased estimator of \(p\), and that \(MSE\left(\hat{p}_1\right) = \dfrac{p(1-p)}{n}\).
Suppose we define an alternative estimator of \(p\), denoted by \(\hat{p}_2\), where \(\hat{p}_2 = \dfrac{X+1}{n+2}\), that is, we add one artificial success and one artificial failure to the data. What is \(E(\hat{p}_2)\)? Is \(\hat{p}_2\) unbiased? Compute \(\mathrm{Bias}(\hat{p}_2) = E(\hat{p}_2) -p\), the variance of \(\hat{p}_2\), and the MSE of \(\hat{p}_2\). Which estimator would you use, \(\hat{p}_1\) or \(\hat{p}_2\)?
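Once you have worked out the formulas, you may want to check them numerically. The following is a minimal Monte Carlo sketch, not a substitute for the derivation. The function name `simulate_mse` and the choices of `n`, `p`, `reps`, and `seed` are assumptions made for this illustration.

```python
import random

def simulate_mse(n, p, reps=20000, seed=0):
    """Estimate MSE of p1 = X/n and p2 = (X+1)/(n+2) by simulation, X ~ Bin(n, p)."""
    rng = random.Random(seed)
    se1 = se2 = 0.0
    for _ in range(reps):
        # One Binomial(n, p) draw as a sum of n Bernoulli(p) trials
        x = sum(rng.random() < p for _ in range(n))
        se1 += (x / n - p) ** 2
        se2 += ((x + 1) / (n + 2) - p) ** 2
    return se1 / reps, se2 / reps

for p in (0.1, 0.3, 0.5):
    mse1, mse2 = simulate_mse(n=20, p=p)
    print(p, mse1, mse2)
```

Comparing the two columns across several values of \(p\) can help you decide whether one estimator is better everywhere or only for some values of \(p\).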
Consistency
All these criteria are well and good, but one thing we would definitely like is for our estimator to become more accurate as the sample size gets larger and larger (as we get more data). That is, as \(n \rightarrow \infty\), we would like \(\hat{\theta} \equiv \hat{\theta}_n\) to converge to \(\theta\) (the \(n\) denotes the size of the sample that we are using), whatever “converge” might mean here. The convergence we use is called convergence in probability: for any fixed tolerance, the probability that \(\hat{\theta}_n\) is within that tolerance of \(\theta\) goes to 1 as \(n\) grows.
Consistent: Let \(\hat{\theta}_n\) be an estimator of a parameter \(\theta\), based on a sample of size \(n\). We say that \(\hat{\theta}_n\) is consistent in probability if \(\hat{\theta}_n\) converges in probability to \(\theta\) as \(n \rightarrow \infty\); that is, for any \(\varepsilon > 0\), we have that \(\displaystyle \lim_{n \rightarrow \infty} P\left(\lvert \hat{\theta}_n -\theta \rvert > \varepsilon \right)=0\). We write this as \(\hat{\theta}_n \overset{P}{\rightarrow} \theta\).
This is saying that for any acceptable amount of error epsilon, the probability of an actual error that is greater than epsilon goes to 0. This means that the error might be large on events of very low probability, and these probabilities get lower as \(n\) gets larger. If the sample is large enough, therefore, the estimator is very likely to be close to its target.
Types of Convergence
Before we go any further, let’s briefly talk about three different ways of interpreting the statement \(\hat{\theta}_n \rightarrow \theta\) that you might encounter while studying probability - well, you will mostly encounter the first two, but you might have seen the third in a course on real analysis or measure theory.
Let \(X_1, X_2, \ldots\) be a sequence of random variables, such that \(X_n\) has CDF \(F_n\), and let \(X\) be another random variable with CDF \(F\).
\(X_n\) converges in distribution to \(X\), written as \(X_n \overset{D}{\rightarrow} X\), if \(P(X_n \le x) \rightarrow P(X \le x)\) as \(n \rightarrow \infty\) at all points \(x\) at which the function \(F(x) = P(X \le x)\) is continuous. This means that the CDFs of the \(X_n\) converge to the CDF of \(X\) (\(F_n \rightarrow F\) as \(n\rightarrow \infty\)).
\(X_n\) converges in probability to \(X\), written as \(X_n \overset{P}{\rightarrow} X\), if, for all \(\varepsilon > 0\), we have that \(P(\lvert X_n - X \rvert > \varepsilon) \rightarrow 0\) as \(n \rightarrow \infty\). This isn’t saying that we guarantee that \(X_n\) will get very close to \(X\) for large \(n\). We are saying that it is very likely.
\(X_n\) converges almost surely to \(X\), written as \(X_n \overset{a.s.}{\rightarrow} X\), if \(\displaystyle P\left(\lim_{n\rightarrow \infty} X_n = X\right) =1\). That is, the event \(\{\omega \in \Omega: \displaystyle \lim_{n\rightarrow \infty} X_n(\omega) = X(\omega)\}\) has probability 1. This convergence guarantees that \(X_n\) will converge to \(X\) except on a set of measure 0 (for example, on finitely many points in a continuous interval).
You can see that the first type is the weakest, and the last is the strongest. \[ \left(X_n \overset{a.s.}{\rightarrow} X\right) \Rightarrow \left(X_n \overset{P}{\rightarrow} X\right) \Rightarrow \left(X_n \overset{D}{\rightarrow} X\right) \]
That is, if we have that \(X_1, \ldots\) converge to \(X\) a.s., then the other forms of convergence will also be true.
Here is an example (Grimmett and Stirzaker 2001): Suppose \(X\sim Bernoulli(1/2)\), and \(X_1, \ldots\) are identical random variables, with \(X_n = X\) for all \(n\). Now the \(X_n\) are not independent, but \(X_n \overset{D}{\rightarrow}X\). If \(Y = 1-X\), then \(X\) and \(Y\) have the same distribution, which means that \(X_n \overset{D}{\rightarrow} Y\). But \(X_n\) cannot converge to \(Y\) in any other sense because \(\lvert X_n - Y\rvert = 1\) always.
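The example can be made concrete with a small simulation. This is only a sketch: the variable names and the number of draws are arbitrary choices for this illustration.

```python
import random

rng = random.Random(2)

# Draws of X ~ Bernoulli(1/2); since X_n = X for all n, these are also draws of X_n
xs = [rng.randint(0, 1) for _ in range(10000)]
ys = [1 - x for x in xs]                      # Y = 1 - X

freq_x = sum(xs) / len(xs)                    # estimates P(X_n = 1)
freq_y = sum(ys) / len(ys)                    # estimates P(Y = 1)
gaps = {abs(x - y) for x, y in zip(xs, ys)}   # every value taken by |X_n - Y|

print(freq_x, freq_y, gaps)
```

Both frequencies should be near \(1/2\), so the distributions agree, while the set of observed gaps contains only the value 1, so \(X_n\) is never close to \(Y\).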
We will now revisit two famous limit theorems that use different kinds of convergence.
The Central Limit Theorem
The CLT states that for large \(n\), the sample mean, suitably standardized, will have a CDF that approaches the CDF of the standard Normal. That is, the standardized sample mean converges in distribution to the standard Normal.
If \(X_1, X_2, \ldots, X_n\) is an independent and identically distributed sample from a population with mean \(\mu\) and finite variance \(\sigma^2 > 0\), then:
\[ \left(\frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\right) = \sqrt{n}\left(\frac{\overline{X}-\mu}{\sigma}\right) \overset{D}{\longrightarrow}\mathcal{N}(0,1) \text{ as } n\longrightarrow \infty \]
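To see convergence in distribution at work, we can compare the empirical CDF of simulated standardized sample means with the standard Normal CDF at a few points. This is a minimal sketch that assumes an Exponential(1) population (mean 1, standard deviation 1), which is our choice and not part of the theorem. The helper names `standardized_means` and `ecdf` are hypothetical.

```python
import random
from statistics import NormalDist

def standardized_means(n, reps=5000, seed=3):
    """Simulate sqrt(n) * (Xbar - mu) / sigma for Exponential(1) samples of size n."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0   # mean and standard deviation of Exponential(1)
    out = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append((xbar - mu) / (sigma / n ** 0.5))
    return out

def ecdf(values, x):
    """Empirical CDF: the fraction of simulated values that are <= x."""
    return sum(v <= x for v in values) / len(values)

z = standardized_means(n=200)
for x in (-1.0, 0.0, 1.0):
    print(x, round(ecdf(z, x), 3), round(NormalDist().cdf(x), 3))
```

The second and third columns should be close to each other, and rerunning with a smaller `n` shows how the agreement degrades for a skewed population.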
The Weak Law of Large Numbers
Let \(X_1, X_2, \ldots\) be a sequence of iid random variables with mean \(\mu\) and variance \(\sigma^2\), and let \(\overline{X}_n = \dfrac{1}{n} \displaystyle\sum_{i=1}^{n}X_i\) be the sample mean of the first \(n\) of them. Then for every \(\varepsilon > 0\), we have that \(\displaystyle \lim_{n \rightarrow \infty} P(\lvert \overline{X}_n - \mu \rvert > \varepsilon )=0\).
That is, the sample mean converges in probability to the true mean.
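The probability in the statement of the WLLN can be estimated by simulation for increasing \(n\). The sketch below assumes an Exponential(1) population (so \(\mu = 1\)) and a tolerance \(\varepsilon = 0.1\); both are arbitrary choices for this illustration, as is the function name `exceed_prob`.

```python
import random

def exceed_prob(n, eps=0.1, reps=1000, seed=1):
    """Estimate P(|Xbar_n - mu| > eps) for Exponential(1) samples, where mu = 1."""
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        if abs(xbar - 1.0) > eps:
            count += 1
    return count / reps

for n in (10, 100, 1000):
    print(n, exceed_prob(n))
```

The estimated probabilities should shrink toward 0 as `n` increases, which is what convergence in probability asserts.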
Tail Inequalities
In order to prove the WLLN, we need Chebyshev’s inequality. In order to prove Chebyshev’s inequality, we need Markov’s inequality. These are both called tail inequalities because they bound the tail of the distribution. That is, they put bounds on the probability that the random variable takes very large values.
Markov’s inequality
If \(X\) is a nonnegative random variable (\(P(X\ge 0 ) = 1\)) such that \(E(X) < \infty\) (that is, the expectation exists), then for any \(c> 0\), \[ P(X \ge c) \le \dfrac{E(X)}{c}. \]
Notice that this is a very simple inequality, and requires nothing but that the random variable be nonnegative. The proof is very simple - we just have to write the expected value as a sum over the region where \(X < c\) and where \(X \ge c\). Try it.
Chebyshev’s inequality
If \(X\) is a random variable with mean \(\mu\) and variance \(\sigma^2\), then we have that for any \(c>0\),
\[ P(\lvert X-\mu\rvert> c)\le \dfrac{\sigma^2}{c^2}. \]
Exercise Apply Markov’s inequality to \(\lvert X - \mu\rvert^2\) to prove Chebyshev’s inequality.
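Both inequalities can be checked numerically, which also shows how loose the bounds can be. This is a minimal sketch assuming an Exponential(1) random variable (nonnegative, with mean 1 and variance 1); the function name `tail_check` and the values of `c` are arbitrary choices for this illustration.

```python
import random

rng = random.Random(4)
sample = [rng.expovariate(1.0) for _ in range(100000)]  # Exponential(1) draws
mu, var = 1.0, 1.0                                      # its true mean and variance

def tail_check(c):
    """Return (empirical P(X >= c), Markov bound, empirical P(|X - mu| > c), Chebyshev bound)."""
    markov_lhs = sum(x >= c for x in sample) / len(sample)
    cheb_lhs = sum(abs(x - mu) > c for x in sample) / len(sample)
    return markov_lhs, mu / c, cheb_lhs, var / c ** 2

for c in (2.0, 3.0, 4.0):
    print(c, tail_check(c))
```

In each row, the first number should not exceed the second, and the third should not exceed the fourth.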
Consequences of the WLLN
Please work through the following properties of convergence in probability, which we will use together with the WLLN:
- If \(X_n \xrightarrow{P} X\) and \(Y_n \xrightarrow{P} Y\), then \[ X_n + Y_n \xrightarrow{P} X + Y. \]
- If \(X_n \xrightarrow{P} X\) and \(c\) is a constant, then \[ c X_n \xrightarrow{P} c X. \]
- If \(X_n \xrightarrow{P} a\) and the real-valued function \(g\) is continuous at \(a\), then \[ g(X_n) \xrightarrow{P} g(a). \]
- If \(X_n \xrightarrow{P} X\) and \(Y_n \xrightarrow{P} Y\), then \[ X_n Y_n \xrightarrow{P} X Y. \]
Sample moments converge to the true moments of the distribution
This is another important consequence of the WLLN. Applying the WLLN to the iid sequence \(X_1^k, X_2^k, \ldots\) shows that the sample moments \(\hat{\mu}_k = \dfrac{1}{n}\displaystyle\sum_{i=1}^n X_i^k\) converge in probability to the population moments \(\mu_k = E(X^k)\), provided the required moments exist. If the functions relating the estimator \(\hat{\theta}\) to the sample moments are continuous, then the estimator will converge to the parameter as the sample moments converge to the population moments.
Therefore, the WLLN implies that the sample mean (\(\overline{X}_n\)) is a consistent estimator of \(\mu\).
Recall the sample variance \(\displaystyle S^2 = \dfrac{1}{n-1}\sum_{i=1}^n \left(X_i - \overline{X}_n\right)^2\). Is \(S^2\) a consistent estimator of \(\sigma^2\)?
What about \(\hat{\sigma}^2 = \dfrac{n-1}{n}S^2\)?
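To build some intuition before answering, we can watch both estimators as \(n\) grows. This is only a sketch, and a simulation is not a proof. It assumes a Normal population with \(\sigma = 2\) (so \(\sigma^2 = 4\)), and the function name `variance_estimates` is hypothetical. It uses `statistics.variance`, which divides by \(n-1\), and `statistics.pvariance`, which divides by \(n\).

```python
import random
from statistics import variance, pvariance

def variance_estimates(n, sigma=2.0, seed=5):
    """Return (S^2, sigma_hat^2) for one simulated Normal(0, sigma^2) sample of size n."""
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, sigma) for _ in range(n)]
    # variance() uses the n-1 denominator; pvariance() uses n, i.e. (n-1)/n * S^2
    return variance(sample), pvariance(sample)

for n in (10, 100, 1000, 10000):
    s2, sigma2_hat = variance_estimates(n)
    print(n, round(s2, 3), round(sigma2_hat, 3))
```

Compare how far each column is from 4 as `n` increases, and how the gap between the two columns behaves.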