Probability describes the frequency of observing some outcome that is subject to chance. Probability may be expressed as a decimal or as a percentage, but it is always between 0 and 1 (0–100%) inclusive.
In a game of chance, probability is easy to imagine. For example, there is some probability that we could roll a 2 on a die, get heads in a coin flip, or draw a royal flush in a poker game.
For scientists, chance enters our world primarily through how we sample a population. For example, if we measured the heights of a dozen randomly selected trees and calculated their mean height, that mean would be subject to variation because of the particular trees we happened to measure. We would get a different mean height if we measured a different dozen trees. As such, all scientific data are subject to the effects of sampling and therefore to chance.
Sometimes the probabilities of different outcomes are the same (such as the faces of a die or heads vs. tails on a coin), and sometimes they are different (such as the various hands in poker). In our tree example, we are likely to get a mean of our sample that’s relatively similar to the population mean, and we are much less likely to get a mean based on the rare largest trees in the study area. Regardless of the individual probabilities, the sum of probabilities of all possible outcomes equals 100%. We can visualize this as the area under a probability distribution; in other words, we can integrate the probability distribution, and the total area is always 100%.
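As a quick check of this property, we can sum the probabilities of a die’s six faces and numerically integrate a continuous distribution (the standard normal here, chosen arbitrarily); both totals equal 1.

sum(rep(1/6, 6))               # the six equal face probabilities sum to 1
integrate(dnorm, -Inf, Inf)    # the area under a continuous distribution is also 1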
Probability has two meanings in the frequentist definition that we will use for most of this course. Probability for a frequentist is the chance of a given outcome in one trial, but it can also be interpreted as the proportion of times an outcome will occur over many trials. For example, the probability of getting heads in a fair coin flip is 50%, and if you flip a coin a great many times, close to 50% of the flips will be heads.
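Here is a minimal simulation of that long-run interpretation (the object names are mine): flip a fair coin many times, and the proportion of heads settles near 50%.

flips <- sample(c("heads", "tails"), size=100000, replace=TRUE)    # many fair coin flips
mean(flips == "heads")    # proportion of heads; very close to 0.5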
Later in the course, we will discuss another way to think about probability: Bayesian probability. For a Bayesian, probability is a value that describes the strength of belief in a statement. It is still a number between 0 and 1 (it cannot exceed 100%), but it is not defined as the frequency of an outcome over repeated trials. Probability in the Bayesian sense therefore does not have the same interpretation as in the frequentist sense.
One of the most common situations for a scientist is wanting to know whether a given statistic (an observation) is consistent with a hypothesis they have about a parameter of the population. For example, we might want to know whether a sample mean would have been expected if the population mean had some particular value. Alternatively, we might want to know whether the sample mean was an improbable outcome given that hypothesis. If it was an unlikely outcome, we might conclude that we could rule out our hypothesis about the population’s mean. This approach is called hypothesis testing; it was developed by Jerzy Neyman and Egon Pearson (not Karl Pearson, who developed the Pearson correlation coefficient).
Hypothesis testing follows six steps, which we will walk through in turn.
To test a hypothesis, we first state a null hypothesis, a statement we will use to evaluate the data. The null hypothesis is a definitive statement about the world; it is not a question. A null hypothesis is a special type of hypothesis, and it is generally a statement that nothing special is going on. For example, a null hypothesis about an experimental treatment could be that its effects are no different than applying no treatment at all. A null hypothesis about two means is that the means are the same and that their difference is zero. A null hypothesis about the correlation of two variables is that they are not correlated, that is, that the correlation coefficient is zero. A null hypothesis about the slope of a line is that the slope is zero.
Many students find two things counterintuitive about statistics. First, the scientific hypothesis is usually not the null hypothesis. For example, the scientific hypothesis may be that applying a fertilizer causes faster growth rates than not applying it. The null hypothesis would be that applying the fertilizer does not improve growth rates (in other words, that there is zero difference between applying a fertilizer and not applying it). Second, our aim is not to directly prove the scientific hypothesis or even to test it. Instead, we will test — attempt to disprove — the null hypothesis by showing that the data we observed would have been unlikely if the null hypothesis was true. If we can rule out the null hypothesis in this way, we will conclude that the alternative hypothesis, the scientific hypothesis, better explains the data.
Next, we need to generate a probability distribution of possible values of the statistic that we plan to measure. This distribution will be based on the sample size we plan to collect and on the assumption that the null hypothesis is true. For example, suppose we are interested in the mean of a variable. We would want to know the probability of observing any possible sample mean if the null hypothesis was true. Because we calculate this probability for all possible values of the mean, we can build a probability distribution of expected sample means.
This distribution will let us evaluate whether the observed statistic (the sample mean) was a likely or an unlikely result, again assuming that the null hypothesis is true. The farther out we are on the tails of this distribution, the less probable it is that the statistic was generated by sampling from a population described by the null hypothesis. We will learn several different ways to build these probability distributions in this course.
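Here is a minimal sketch of building such a distribution by simulation; the particular null hypothesis (a normal population with mean 10 and standard deviation 2) and the planned sample size (n=15) are invented purely for illustration.

nullMeans <- replicate(n=10000, mean(rnorm(15, mean=10, sd=2)))    # sample means expected if the null is true
hist(nullMeans, breaks=50, las=1, main="Sample means expected under the null", xlab="sample mean", ylab="frequency")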
Next, because our conclusion is based on probability, we need to state how unlikely the observed statistic would have to be before we will reject the null hypothesis and accept the alternative. This level of rarity is called the significance level, and it is symbolized by the Greek letter α. Significance is a probability: the probability that defines what we consider to be rare, unexpected, or unusual.
Despite what you may hear, there is no magical or universally accepted significance level, and significance varies by field. That said, significance is commonly 0.05 (5%) in the natural sciences because when Ronald Fisher (the statistical genius behind the Modern Synthesis of evolution) first proposed it, he suggested that a 1 in 20 event was unusual enough to warrant further investigation. We must realize that we can set significance to any value we wish. For example, we could set it to be very small (<0.001%) to be more certain about rejecting the null hypothesis. Some have proposed that significance generally be set to 0.005 rather than 0.05, in other words, that we adopt a more stringent standard of rarity. There are tradeoffs to setting significance to smaller values; we will explore those in the next lecture.
Fisher’s portrayal of what 1 in 20 means, that it is enough to warrant further investigation, is crucially important and is something many data analysts overlook. Fisher did not claim that anything rarer disproves the hypothesis, only that it makes us suspicious about the hypothesis and that we should investigate this finding further.
From the probability distribution for the statistic, the significance level, and the alternative hypothesis, we can find one or two critical values that define the limits of what would be considered unexpected or rare outcomes. Any value of the statistic equal to or more extreme than those critical values, that is, a statistic farther out on the tails of the probability distribution than these critical values, would be regarded as unexpected if the null hypothesis was true.
For example, one alternative hypothesis might be that the sample mean is larger than that of the null hypothesis. This would likely be the case in a test of a new fertilizer, where we would hope that mean growth is greater than when no fertilizer is applied. In this case, we are interested only in rare results on the right (large-value) side of the probability distribution. To find the critical value, we would add up the area under the probability distribution starting at the far right of the distribution and working towards the left (smaller values of the statistic) until we reached a cumulative probability that equals the significance level. When we have reached the significance level, that position marks the critical value: any observed statistic (sample mean in this example) beyond that lies in the realm of improbable results if the null hypothesis was true. This realm is called the critical zone. This would be a one-tailed test, more specifically, a right-tailed test because we are interested in extreme values on only that side of the distribution of outcomes.
In some cases, we might expect the statistic to be smaller than that of the null hypothesis. We would follow the same procedure, except we would start from the far left of the probability distribution and work towards the right until we hit a cumulative probability equal to the significance level. That would mark the critical value, and any outcome to the left of that would be considered improbable to have been generated by the null hypothesis. This is also called a one-tailed test, specifically a left-tailed test.
We often do not have any information to suggest that the statistic should be larger or smaller than the null hypothesis. In that case, we split the significance probability evenly into the two tails of the distribution and calculate two critical values, one for each tail. Anything beyond those critical values, that is, in either of the two critical zones, would be considered an improbable outcome. This is called a two-tailed test.
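As a concrete sketch, suppose (purely as an assumption for illustration) that the statistic follows a standard normal distribution when the null hypothesis is true. With significance set to 0.05, qnorm() gives the critical value(s) for each kind of test.

alpha <- 0.05
qnorm(1 - alpha)                    # right-tailed critical value (about +1.64)
qnorm(alpha)                        # left-tailed critical value (about -1.64)
qnorm(c(alpha/2, 1 - alpha/2))      # two-tailed critical values (about -1.96 and +1.96)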
Notice that the test setup is complete by this point: we know the critical values, so we know how to respond if the measured statistic is in the critical zone. Notice also that we have not yet seen the data and we therefore do not yet have the statistic. This is how it should be: none of what we have done so far should be influenced by the data. In particular, we do not decide to do a left-tailed, right-tailed, or two-tailed test based on the data. Modifying this process based on what is in the data invalidates the interpretation of the statistical analyses. Stated another way, we should not test a hypothesis that was created by examining the data. Once we have stated the null hypothesis, declared the significance level, generated a distribution of expected outcomes, and found the critical values, we can then collect the data and calculate the statistic.
With the critical values and the statistic, we can now make a decision about the null hypothesis. If the statistic falls in the critical zone, we reject the null hypothesis, and if it does not fall in the critical zone, we accept the null hypothesis. When we reject the null hypothesis, we implicitly accept the alternative hypothesis. In other words, we never test the scientific (alternative) hypothesis directly; we test the null hypothesis, which leads to a decision on the alternative hypothesis. When we accept the null hypothesis, we are saying that it is one reasonable explanation of the data, but it is not necessarily the only reasonable explanation.
It is important to realize that we did not determine whether the null hypothesis is true or false, that is, whether it is correct or not. Instead, we have only made a decision on how to act: we will act as if the null hypothesis is true or we will act as if it is false. In most cases, we will never know whether the null hypothesis was indeed true or false. Remember, we accept or reject hypotheses; we do not determine if they are true or false.
Some people use the more convoluted language of “fail to accept” and “fail to reject” instead of reject and accept. Avoid using this language, not only because the wording is cumbersome, but more importantly because it confuses the truth of the hypothesis with our decision, a distinction we will explore later.
We can explore hypothesis testing with a simulation. Suppose we are engaged in an exploration of a potential gold mine. We do not want to develop the mine if it is not economically viable, meaning the concentration of gold is less than the minimum economically viable concentration.
Our scientific hypothesis is that the potential mine is economically viable, but we need to state a null hypothesis, which we will state as “The mean concentration of gold in our potential mine is not greater than a mine that is at the financial break-even point”.
Solely for the purposes of illustration, let us suppose we knew the population of gold concentration in assays from a break-even mine, that is, one that is right at the boundary between making money and losing money. We will never know this, but imagining that we could helps us understand how hypothesis testing works. Plus, it is fun to pretend that you are omniscient and know things like populations.
First, suppose we know from experience that gold concentrations in a break-even orebody follow a lognormal distribution of known parameters (log mean and log standard deviation). We can use the rlnorm() function to simulate the population of assays from the break-even mine. We will display the frequency distribution of these concentrations, and we can also show the mean of the population, that is, the mean gold concentration of all rocks in a break-even mine.
population <- rlnorm(10000)    # simulated gold assays from a break-even mine
hist(population, breaks=50, las=1, main="Population", xlab="Gold concentration", ylab="frequency", col="pink")
mean(population)               # mean gold concentration of the population
abline(v=mean(population), lwd=2, col="red")
text(mean(population), 4500, "population mean", col="red", pos=4)
(Note that in the last line of this code, because I used magic numbers to specify the x and y coordinates of the label in text(), the label might not plot in the correct position. This is also true in some of the code below. Hard-coding magic numbers is not good practice, and I do it here solely to keep the code simple and not detract from the purpose of the simulation.)
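If you would rather avoid the magic numbers, one option (my suggestion, not part of the original code) is to read the current plot limits with par("usr") and position the label relative to them:

usr <- par("usr")    # current plot limits: x-min, x-max, y-min, y-max
text(mean(population), 0.9*usr[4], "population mean", col="red", pos=4)    # label at 90% of the plot height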
Because I plan to collect a small sample (n=20 rocks) from the potential mine, it is worth exploring what the mean of this sample might be if the mine was a break-even mine (that is, the null hypothesis). It is easy to simulate a sample of n=20 from the break-even mine and calculate the mean of that sample.
mean(sample(population, size=20))
Note that the mean of this sample is not the same as the mean of the population, although it is close. If you repeat these lines of code, you will see that sometimes the sample mean is smaller than the population mean, and sometimes it is larger. This is the effect of chance brought on by random sampling: if I measure different rocks, I will calculate a different mean. This is always true. The effects of chance underlie every measurement you make as a scientist.
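You can see this sample-to-sample variation compactly by generating several sample means at once:

replicate(n=5, mean(sample(population, size=20)))    # five samples, five different means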
Because of the effects of chance introduced in sampling, we need to simulate not just one sample mean but many of them so that we have the probability distribution of sample means (when n=20) when we sample from the break-even mine. To do this, we repeat the process of drawing a sample of size n=20 many times (say, 50,000) and plot a frequency distribution of those sample means. In this plot, we add the population mean so we can see how it compares with the distribution of sample means.
numTrials <- 50000
n20means <- replicate(n=numTrials, mean(sample(population, size=20)))    # 50,000 sample means, each from n=20 rocks
dev.new()
hist(n20means, breaks=50, las=1, main="Distribution of sample means when n=20", xlab="mean gold concentration", ylab="frequency", col="gray")
abline(v=mean(population), lwd=2, col="red")
text(mean(population), 4500, "population mean", col="red", pos=4)
This frequency distribution shows that some sample means are likely to occur, but others are not. For example, if we grabbed a random sample of size n=20 from a mine at the financial break-even point, it is quite probable that the sample would have a mean gold concentration between 1 and 2.
On the other hand, it is improbable that the sample would have a gold concentration of 2.5, or 3, or more if it came from a break-even mine. In other words, such large concentrations are unlikely if the null hypothesis (that we are sampling from a break-even mine) is true.
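We can back these statements with the simulated sample means themselves; the thresholds below are the ones quoted above.

mean(n20means >= 1 & n20means <= 2)    # proportion of sample means between 1 and 2
mean(n20means >= 2.5)                  # proportion of sample means of 2.5 or greater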
But what would we conclude about a mean gold concentration of 2.2 or 2.3? We need a standard for what would constitute a rare outcome; that standard of rarity is the significance level, and we choose it. It is entirely under our control. Feeling conventional, we follow the convention in the natural sciences and set the significance at 0.05 (5%). This is a good place to emphasize that we could have chosen any value for significance that we want.
We can find the critical value from the probability distribution and the significance value. In this case, we are interested only in unusually high gold concentrations (small ones would indicate an unprofitable mine), so we perform a right-tailed test. We integrate under the curve starting from the far right end, working leftward into the distribution until the cumulative probability equals the significance level. Where these match is the critical value.
significance <- 0.05
criticalValue <- quantile(n20means, 1-significance)    # right-tailed: top 5% of sample means
criticalValue
abline(v=criticalValue, lwd=3, col="blue")
text(criticalValue, 4000, "critical value", col="blue", pos=4)
text(criticalValue, 2500, "unlikely values", cex=2, col="blue", pos=4)
We can now measure the gold concentrations in the sample and find the mean of those 20 values. If the mean concentration is greater than the critical value, it lies in the critical zone, and we would reject the null hypothesis that we sampled from a break-even mine. Because we did a right-tailed test, we would therefore implicitly accept the alternative hypothesis that our potential mine will be profitable. If the sample mean was to the left of the critical value, outside the critical zone, we would accept the null hypothesis that we sampled from a break-even mine (or worse), and we would not develop this new mine as a result.
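In code, that final comparison might look like the sketch below. The vector ourAssays is a hypothetical stand-in for the 20 measured concentrations; here it is drawn from the simulated population only so that the example runs.

ourAssays <- sample(population, size=20)    # placeholder for real field measurements
mean(ourAssays) > criticalValue             # TRUE: reject the null hypothesis; FALSE: accept it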
This example shows you the logic behind hypothesis testing. Although the actual mechanics of how we obtain the distribution of expected outcomes for the statistic will differ depending on the situation, we will follow these steps when hypothesis testing:

1. State the null hypothesis.
2. Generate the probability distribution of the statistic, assuming the null hypothesis is true.
3. Declare the significance level (α).
4. Find the critical value(s) for a left-tailed, right-tailed, or two-tailed test.
5. Collect the data and calculate the statistic.
6. Compare the statistic with the critical value(s), then accept or reject the null hypothesis.
If you are interested, here is the code I used to simulate the probabilities of rolling six-sided dice.
First, let’s set up a fair six-sided die. This plot shows that there is an equal probability (1/6, or about 0.1667) for each face of the die.
die <- 1:6
numSides <- length(die)
fairProbabilities <- rep(1/numSides, numSides)
dev.new(height=4, width=7)
barplot(fairProbabilities, col="dodgerblue", las=1, space=0, names=die, ylim=c(0, 0.3), xlab="die result", ylab="probability", main="probabilities for rolling a six-sided die")
Next, we’ll simulate rolling a loaded die. On this one, the probability of rolling a six is 0.3, almost double what it should be. The probabilities of the other five faces are equal to one another, and the sum of the probabilities for all of the faces is 1.0.
weightedDie <- 0.3
loadedProbabilities <- c(rep((1-weightedDie)/(numSides-1), numSides-1), weightedDie)
dev.new(height=4, width=7)
barplot(loadedProbabilities, col="coral2", las=1, space=0, names=die, ylim=c(0, 0.3), xlab="die result", ylab="probability", main="probabilities for rolling a loaded die")
Imagine we are playing a game where we roll 5 dice and sum them to get our score. This plot shows the probability distribution of that score. For fair dice, the most probable scores are 17 and 18 (the distribution is symmetric about its mean of 17.5). Increasingly larger and increasingly smaller scores are progressively less probable. For example, getting a 30 (all five dice showing a 6) would be highly unusual, maybe even suspicious.
This distribution is calculated by simulating rolling five dice (sample), summing them, and repeating this process 10,000 times (replicate). When we use sample(), we want to ensure we can get a given die face more than once, so we set replace=TRUE. The default for sample() is that the probability of each outcome (face of a die here) is the same. We could get a more accurate frequency distribution by repeating the process more times (more than 10,000), with the only drawback being that the code would be slower.
fairTrials <- replicate(n=10000, sum(sample(die, 5, replace=TRUE)))
dev.new(height=4, width=7)
# breaks centered on the integers so each possible score gets its own bar
hist(fairTrials, xlab="sum of five fair dice", main="", breaks=seq(4.5, 30.5, by=1), las=1, col="dodgerblue")
If we do this with our loaded dice, we find that the most probable outcome shifts upward, to around 19 or 20. Moreover, the probabilities on the right tail (near 30) are substantially higher than they were for the fair dice. Note that we must specify the loaded probabilities (prob=loadedProbabilities) to simulate rolling loaded dice.
loadedTrials <- replicate(n=10000, sum(sample(die, 5, replace=TRUE, prob=loadedProbabilities)))
dev.new(height=4, width=7)
hist(loadedTrials, xlab="sum of five loaded dice", main="", breaks=seq(4.5, 30.5, by=1), las=1, col="coral2")