We have covered three approaches to testing hypotheses: critical values, p-values, and confidence intervals. If you only want to decide on a hypothesis — to accept it or reject it — all three methods work equally well. For a critical value, a statistically significant result occurs when the observed statistic is more extreme than the critical value. For a p-value, a statistically significant result is when the p-value is smaller than a chosen significance level, often 0.05. A statistically significant result arises for a confidence interval when the null hypothesis lies outside of the confidence interval.
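To see this equivalence concretely, here is a minimal sketch in R using simulated data (the sample sizes, means, and standard deviations are hypothetical); for the same two-sample comparison, all three approaches lead to the same decision.

```r
# A minimal sketch with simulated (hypothetical) data: the same two-sample
# t-test evaluated by all three approaches, each giving the same decision.
set.seed(42)
x <- rnorm(25, mean = 10, sd = 2)
y <- rnorm(25, mean = 12, sd = 2)
result <- t.test(x, y, var.equal = TRUE)

# Critical value: is the observed statistic more extreme than the critical value?
criticalValue <- qt(0.975, df = result$parameter)   # two-tailed, alpha = 0.05
abs(result$statistic) > criticalValue

# p-value: is it smaller than the chosen significance level?
result$p.value < 0.05

# Confidence interval: does it exclude the null hypothesis of zero difference?
result$conf.int[1] > 0 | result$conf.int[2] < 0
```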
Only p-values and confidence intervals provide, in probabilistic terms, the degree of support by showing how rare a given result is. For this reason, both are preferred over critical values. You will likely use p-values and confidence intervals more often in your research and encounter them more often in your reading than critical values.
The drawback to the critical value and p-value approaches is that neither supplies an estimate of the quantity you are interested in (for example, a difference in means or a correlation coefficient), nor a measure of your uncertainty in that estimate. Confidence intervals supply both, and that is a strong reason for preferring them.
Of the three approaches, p-values are by far the most widely used. Even so, their use in data analysis is increasingly criticized. Moreover, statistical hypothesis testing, which results in a simple accept or reject decision, is also coming under fire. These criticisms are well-founded, and this course will follow a best-practices approach to data analysis: we will emphasize estimation and uncertainty (confidence intervals).
It is important to remember that the distributions upon which these three approaches are based differ. Critical values and p-values are based on the distribution of outcomes generated by the null hypothesis. In contrast, confidence intervals are derived from the data (sometimes with an appropriate theoretical distribution), making them useful when one wants to test multiple hypotheses or when there is no clear null hypothesis.
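One way to make this distinction concrete is a short sketch in R with simulated data (the permutation and bootstrap procedures below are only illustrations, not methods this course prescribes): the p-value comes from a distribution of outcomes generated under the null hypothesis, whereas the confidence interval is built from the data themselves.

```r
# A minimal sketch with simulated (hypothetical) data. The p-value is computed
# from a null distribution (here built by permuting group labels, i.e., assuming
# no difference), whereas the confidence interval is built from the data
# themselves (here by bootstrap resampling), with no null hypothesis required.
set.seed(1)
x <- rnorm(20, mean = 10, sd = 2)
y <- rnorm(20, mean = 11, sd = 2)
observedDiff <- mean(x) - mean(y)

# Null distribution: shuffle the group labels as if the groups were identical
pooled <- c(x, y)
nullDiffs <- replicate(9999, {
  shuffled <- sample(pooled)
  mean(shuffled[1:20]) - mean(shuffled[21:40])
})
mean(abs(nullDiffs) >= abs(observedDiff))   # permutation p-value

# Confidence interval: resample the data to describe uncertainty in the estimate
bootDiffs <- replicate(9999, {
  mean(sample(x, replace = TRUE)) - mean(sample(y, replace = TRUE))
})
quantile(bootDiffs, c(0.025, 0.975))        # bootstrap 95% confidence interval
```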
Two misconceptions surround p-values, and you are certain to encounter them: they are widespread in the scientific literature and at conferences.
The first misconception is that a p-value is the probability that the null hypothesis is correct. This is not true. Recall the definition of a p-value: if the null hypothesis is assumed to be correct (true), a p-value is the probability of observing a statistic or one more extreme. We can simplify this statement to the form “if X, then Y”. The misconception flips this argument to “if Y, then X”, a logical error called affirming the consequent. A simple example shows why you cannot reverse the definition of a p-value.
Suppose you study frogs and find that 80% are green. You could then say, “If it’s a frog, there’s an 80% chance it is green”. By following the misconception of p-values, we might try to flip this statement so that it becomes “If it is green, there’s an 80% chance that it is a frog”. This is obviously not true because if you find something green, it is far more likely that it is a plant!
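Returning to p-values, the forward definition (“if the null hypothesis is true, then…”) can be checked by brute force. The sketch below uses simulated data with hypothetical values.

```r
# A minimal sketch with simulated (hypothetical) data: assuming the null
# hypothesis is true, the p-value is the probability of a statistic at least
# as extreme as the one observed.
set.seed(1)
x <- rnorm(15, mean = 50, sd = 5)
y <- rnorm(15, mean = 53, sd = 5)
observedT <- t.test(x, y, var.equal = TRUE)$statistic

# Distribution of the t-statistic when the null hypothesis really is true
nullT <- replicate(9999, {
  a <- rnorm(15, mean = 50, sd = 5)
  b <- rnorm(15, mean = 50, sd = 5)
  t.test(a, b, var.equal = TRUE)$statistic
})

# Proportion of null outcomes at least as extreme as what we observed
mean(abs(nullT) >= abs(observedT))
t.test(x, y, var.equal = TRUE)$p.value   # closely matches the simulated value
```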
Second, scientists commonly (but incorrectly) use p-values to indicate how important a result is. It is common to say that a p-value less than α (often 0.05) is statistically significant and that values greater than α are not statistically significant. Both statements are fine. The problem comes from dropping the word “statistically” and adding adjectives that imply importance. For example, a p-value slightly smaller than 0.05 may be described as marginally significant, and a tiny p-value may be described as highly significant. Values greater than 0.05 are called insignificant. By dropping “statistically”, the word “significant” subtly changes its meaning: it comes to imply importance. Another way this is done is by attaching asterisks to p-values, with more asterisks for increasingly smaller p-values. Even R has this built into most results, as we will see when we get to ANOVA, regression, and other linear modeling techniques. Because of this misconception that p-values convey importance, scientists sometimes present tables of p-values as the sole results of their statistical analyses.
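If you would rather not have R imply importance in this way, the stars can be turned off for your session; a small sketch with simulated (hypothetical) data:

```r
# Turn off the significance stars that R attaches to p-values by default,
# so tables of results do not subtly imply importance.
options(show.signif.stars = FALSE)

# Example with simulated (hypothetical) data: the coefficient table of a
# regression now prints without asterisks.
x <- rnorm(30)
y <- 2 * x + rnorm(30)
summary(lm(y ~ x))
```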
Because p-values are controlled not only by effect size but also by the amount of variation and, especially, by sample size, they cannot be read as describing the importance of a result. For example, a large sample size may be all that is needed to detect a minor pattern and achieve statistical significance through a small p-value. As your sample size increases, you can detect ever smaller, ever less important phenomena.
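A short simulation makes this concrete; the difference in means, standard deviation, and sample sizes below are hypothetical.

```r
# A minimal sketch with simulated (hypothetical) data: the same trivial
# difference in means (1 unit against a standard deviation of 10) is not
# statistically significant with a small sample but becomes highly
# significant once the sample is large enough.
set.seed(3)
pForSampleSize <- function(n) {
  x <- rnorm(n, mean = 100, sd = 10)
  y <- rnorm(n, mean = 101, sd = 10)
  t.test(x, y)$p.value
}
sapply(c(50, 500, 50000), pForSampleSize)
```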
More broadly, there are several reasons to be concerned with the entire concept of statistical hypothesis testing, of using statistics to reach a simple dichotomous accept-or-reject decision. These criticisms apply not only to p-values but also to critical values and even to confidence limits when used simply to accept or reject a null hypothesis.
First, most significance tests serve no purpose, especially two-tailed tests. For example, one common test is to compare the means of two groups. Here, we would test whether the difference in their means is zero. In most cases, we know the answer to this question before we collect any data at all: we can state with complete certainty that the means are different. How do we know this? Take two populations of anything in nature, measure some quantity on them, and calculate the means. Those two means are almost assuredly different; the difference might be in the sixth decimal place, but the means differ. No statistical test is needed.
Second, significance tests ask the wrong question. Let’s continue with our example of comparing two means. We know that they are different, and therefore, the null hypothesis is false. Our significance test therefore has two outcomes. We might have a large enough sample size to achieve statistical significance and correctly reject the false null hypothesis. On the other hand, we might not have enough data to achieve statistical significance and we therefore have to accept the false null hypothesis as one possible reasonable explanation of the data, which would be a type 2 error. This is not a helpful set of alternatives because all we learn is whether we have collected enough data. Just asking whether the difference in means is statistically significant is the wrong question. We should instead ask what the difference in means is and what our uncertainty is. Similarly, we should not ask whether a null hypothesis can be rejected but which of multiple hypotheses is best supported by the data.
Third, non-significant results of tests are commonly misinterpreted. When scientists obtain a p-value greater than 0.05, they conclude that the result is not statistically significant, which is correct. If they are comparing two means, they may go further and describe their results as showing that there is no difference in the two means, which is not correct. This is not what the statistical test allows you to say, because it tested only one possibility: zero difference in the means. If one were to test other hypotheses in which the means differed by some amount, those hypotheses might also produce large p-values, meaning that they would be acceptable, just as the null hypothesis is. Similarly, in correlation tests, a p-value greater than 0.05 does not indicate a lack of a correlation, only that no correlation is one of several acceptable alternatives. All of this arises from misunderstanding what it means to accept a null hypothesis. Accepting the null hypothesis does not mean that it is correct or true; it means only that the null hypothesis is one of potentially many acceptable hypotheses.
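A short sketch in R illustrates this with simulated data (the values are hypothetical); in t.test(), the mu argument specifies the hypothesized difference in means.

```r
# A minimal sketch with simulated (hypothetical) data: when the test of zero
# difference gives p > 0.05, many non-zero differences give p > 0.05 as well,
# so "no difference" is not a justified conclusion.
set.seed(7)
x <- rnorm(12, mean = 20, sd = 4)
y <- rnorm(12, mean = 22, sd = 4)

# Test a range of hypothesized differences in means, not just zero
hypothesizedDiff <- seq(-6, 2, by = 1)
pValues <- sapply(hypothesizedDiff, function(d) t.test(x, y, mu = d)$p.value)
data.frame(hypothesizedDiff, p = round(pValues, 3))

# Every hypothesized difference with p > 0.05 lies inside the 95% confidence
# interval; all of them are as acceptable as the null hypothesis of zero.
t.test(x, y)$conf.int
```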
Although these examples were demonstrated with p-values, these criticisms also apply to critical values and confidence intervals when we use them to reach a dichotomous accept-or-reject decision on a null hypothesis.
The important question is not whether the means are different. The important questions are how different the means are and what our uncertainty is in that difference. If we had sufficient data and performed our statistical test, we would find that the difference is statistically significant, perhaps highly so (a tiny p-value). However, because our sample size is large, we could detect even a minuscule difference in means, a difference so small that although it is statistically significant, it may not be scientifically important. For example, the mean heights of two forests might differ by 1 mm, surely of no consequence when the trees are 50 m tall. Similarly, in a large enough study, we might find that the effectiveness of a drug is statistically significant when it raises survival by merely 0.01%, an effect so small as to be of little use.
Given the criticisms of p-values and those of significance testing, one might wonder what a scientist should do.
These criticisms do not mean that p-values have no place or should be banned, although some statisticians have argued this, and some journals have banned them. These criticisms also do not mean that we should stop using statistics altogether. What they do mean is that we have to stop using them in a dichotomous way and that we have to report our results fairly and without the misuses that abound.
Most importantly, we should place a greater emphasis on estimation and the uncertainty of our estimate. For example, when we compare sample means, we should not focus on the narrow and almost certainly false hypothesis of zero difference. Instead, we should focus on estimating that difference in sample means and measuring the uncertainty in that difference. This is what confidence intervals do, and we should use them. In short, we should embrace uncertainty. We should estimate the quantities that interest us and our uncertainty in them because that uncertainty will reveal the range of explanations compatible with our data, far more than a simple null hypothesis.
Even when using a confidence interval, it is important to realize what it represents. A confidence interval is the set of hypotheses consistent with the data at a given level of confidence (often 95%), but hypotheses towards the center of that interval (that is, closer to the estimate) are more probable than those near its edges. Moreover, hypotheses just beyond the confidence interval are still possible, although increasingly improbable. Remember that confidence intervals become broader as the level of confidence increases: values just outside a 95% confidence interval would still be within a 99% confidence interval.
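A small sketch with simulated (hypothetical) data shows this with the conf.level argument of t.test().

```r
# A minimal sketch with simulated (hypothetical) data: raising the level of
# confidence broadens the interval, so values just outside a 95% confidence
# interval are still inside a 99% confidence interval.
set.seed(11)
x <- rnorm(20, mean = 5, sd = 1.5)
y <- rnorm(20, mean = 6, sd = 1.5)
t.test(x, y, conf.level = 0.95)$conf.int
t.test(x, y, conf.level = 0.99)$conf.int   # broader than the 95% interval
```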
In this approach, we no longer make simple cut-and-dried decisions but instead emphasize how likely certain explanations (hypotheses) are relative to others. In short, we embrace the uncertainty.
For now, and likely for some time into the future, we will have to live with p-values. Advisors, editors, and reviewers may demand them, particularly when they do not understand the issues. You should try to raise awareness but realize that you may be forced to use p-values. If so, be clear about what the p-value represents: the probability of observing a statistic or one more extreme if the null hypothesis is true, not vice versa. Report p-values with reasonable precision (one or two significant figures, at most), and don’t report them with inequalities (e.g., p<0.05). Don’t tag p-values as significant or insignificant based on arbitrary criteria such as 0.05, and don’t flag p-values with stars to imply that some are more special than others. Even if you must use p-values, focus on the values of your estimates (and their uncertainty) to understand the scientific significance of your results and the hypotheses consistent with the data.
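As a sketch of this advice, using simulated data with hypothetical values:

```r
# A minimal sketch with simulated (hypothetical) data of one way to report a
# comparison: lead with the estimate and its uncertainty, and give the p-value
# (if required) to one or two significant figures, with no inequality or stars.
set.seed(5)
x <- rnorm(30, mean = 10, sd = 2)
y <- rnorm(30, mean = 11, sd = 2)
result <- t.test(x, y)

result$estimate               # the two sample means
result$conf.int               # 95% confidence interval on their difference
signif(result$p.value, 2)     # rounded p-value, reported as a number
```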