We frequently work with two variables simultaneously and want to understand their relationship. For this, there are two related yet distinct goals. The first is to understand the strength of the relationship between them, called correlation. The second is to know the mathematical function that describes the relationship, which could also be used to make predictions; finding this relationship is called regression. We will cover these two topics separately in the next two lectures.
There are two common types of correlation, one describing the linear relationship and one describing the monotonic relationship. As you might guess, a linear relationship is one described by a line. A monotonic relationship is where one variable consistently increases or decreases as the other increases, although the relationship might not be linear (e.g., exponential). To measure the linear correlation, we must introduce the concept of covariance.
Covariance is based on the sum of products, which is analogous to the sum of squares used to calculate variance:
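$$\mathrm{SP} = \sum_{i=1}^{n} \left(x_{ij} - \bar{x}_j\right)\left(x_{ik} - \bar{x}_k\right)$$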
where i is the observation number, n is the total number of observations, j is one variable, and k is the other variable. In the sum of products, the deviations of one variable from its mean are multiplied by the corresponding deviations of the other variable from its mean, and these products are summed over all observations in the data series.
The sum of products can also be calculated in a single pass, without first calculating the means:
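$$\mathrm{SP} = \sum_{i=1}^{n} x_{ij}\,x_{ik} - \frac{\left(\sum_{i=1}^{n} x_{ij}\right)\left(\sum_{i=1}^{n} x_{ik}\right)}{n}$$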
Here, the first term is the uncorrected sum of products, and the second is the correction term. It is corrected in the sense that it subtracts the mean from each of the observations. In other words, it is the sum of products relative to the means of each variable, just as we measured the sum of squares relative to the mean.
Covariance equals the sum of products divided by the n-1 degrees of freedom. Note the similarity in the formulas for variance and covariance.
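$$s_j^2 = \frac{\mathrm{SS}_j}{n-1} \qquad \mathrm{cov}_{jk} = \frac{\mathrm{SP}}{n-1}$$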
Unlike the sum of squares and variance, which can have only positive values (or zero), the sum of products and covariance can be positive or negative. Covariance has units equal to the product of the units of the two measurements, similar to variance, which has units that are the square of the measurement’s units. Also like variance, covariance is a function of the scale on which the data were measured, which is generally undesirable. Converting covariance into a dimensionless measure will give us Pearson’s correlation coefficient.
Pearson’s correlation coefficient is also called the product-moment correlation coefficient; often, it is just called the correlation coefficient. When someone doesn’t specify which type of correlation coefficient they mean, they usually mean Pearson’s.
To produce a dimensionless correlation measure, we need to divide covariance by something with the same units, namely the product of the standard deviations of the two variables. This standardizes the joint variation in the two variables by the product of the variation in each variable. The correlation coefficient therefore measures the strength of the relationship between two variables, but it doesn’t depend on the units of those variables.
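That is,
$$r = \frac{\mathrm{cov}_{jk}}{s_j\,s_k}$$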
Pearson’s correlation coefficient varies from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship). A value of 0 indicates no linear relationship, which occurs when covariance equals zero. If one variable does not vary, one of the standard deviations will be zero, causing the denominator to be zero and making the correlation coefficient undefined.
Pearson’s correlation coefficient can also be calculated as the sum of products divided by the square root of the product of the two sums of squares:
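$$r = \frac{\mathrm{SP}}{\sqrt{\mathrm{SS}_j\,\mathrm{SS}_k}}$$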
In R, covariance is calculated with cov(), and correlation is calculated with cor():
cov(x, y)
cor(x, y)
Both take two arguments, vectors corresponding to the two variables.
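To connect these functions with the definition above, here is a minimal check on simulated data (the x and y below are hypothetical examples):

# Dividing the covariance by the product of the two standard deviations
# gives the same value as Pearson's r
x <- rnorm(30)
y <- x + rnorm(30)
cov(x, y) / (sd(x) * sd(y))
cor(x, y)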
As always, we can generate estimates with confidence intervals and test hypotheses about a statistic if we know its expected distribution. If the data are normally distributed, and if the null hypothesis is that there is zero correlation, Pearson’s correlation coefficient follows a t-distribution, where the standard error of Pearson’s correlation coefficient is:
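$$s_r = \sqrt{\frac{1 - r^2}{n - 2}}$$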
The t-test for Pearson’s r has the standard setup, that is, the statistic minus the null hypothesis for the parameter, divided by the standard error of the statistic:
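$$t = \frac{r - \rho}{s_r}$$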
Since the null hypothesis is that the correlation is zero, the ρ term drops out, and the t-statistic is simply Pearson’s correlation coefficient divided by its standard error. The t-test of Pearson’s correlation coefficient has n-2 degrees of freedom because population variance is estimated for two variables instead of one.
All t-tests assume random sampling and that the statistic is normally distributed. In this case, Pearson’s r will be normally distributed when the null hypothesis is that ρ equals zero and both variables are normally distributed. Finally, the test assumes that the correlation coefficient is calculated on interval or ratio data. If you have ordinal data, you must use a non-parametric test.
The t-test for Pearson’s correlation coefficient is most easily performed in R with cor.test(x, y).
cor.test(x, y)
Like other parametric tests in R, the output provides an estimate of the statistic (the correlation coefficient), a confidence interval on the statistic, and a p-value (corresponding to a null hypothesis of zero correlation). Confidence levels other than 95% can be set with the conf.level argument, and one-sided tests can be specified with the alternative argument. As always, you must have external reasons for specifying a one-sided test: you cannot observe the correlation and decide on a one-sided test based on that observation. These confidence intervals assume normality of the data. If your data are not normally distributed, try a bootstrap or jackknife, or use one of the methods suggested by Bishara and Hittner (2019).
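For example, to request a 99% confidence interval, or a one-sided test of a positive correlation that was specified in advance:

cor.test(x, y, conf.level=0.99)
cor.test(x, y, alternative="greater")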
Although it is more complicated, you can manually perform a t-test on Pearson’s r, using the t-score and standard error formulas above. It will produce the same results as running cor.test(), so there is seldom a reason to perform it manually.
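As a sketch, the manual calculation on simulated data (the x and y here are hypothetical) reproduces the t-statistic and p-value reported by cor.test():

# Manual t-test of Pearson's r under the null hypothesis that rho = 0
x <- rnorm(20)
y <- x + rnorm(20)
n <- length(x)
r <- cor(x, y)
se <- sqrt((1 - r^2) / (n - 2))   # standard error of r
t <- r / se                       # t-statistic
2 * pt(-abs(t), df=n-2)           # two-sided p-value
cor.test(x, y)                    # same t-statistic and p-value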
The square of Pearson’s correlation coefficient, r2, is one form of a widely reported statistic called the Coefficient of Determination. This coefficient is extremely useful: it measures the proportion of variation explained by a linear relationship between two variables. As a proportion, r2 is always between zero and one.
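In R, it is just the square of the value returned by cor():

r <- cor(x, y)
r^2   # proportion of variation explained by the linear relationship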
Pearson’s correlation coefficient measures how well a line describes the relationship between two variables, but there are other ways that two variables can be correlated besides linearly. For example, two variables might have an exponential or logistic relationship. Both would be described as monotonic relationships: as one variable increases, the other consistently increases or consistently decreases. A sinusoidal relationship would be an example of a non-monotonic correlation.
Spearman’s rank correlation coefficient is a non-parametric measure of monotonic correlation. The Spearman correlation test assumes only that your data are a random sample. Like most non-parametric tests, Spearman’s is calculated on the ranks of the observations and will therefore work with any type of data that can be ranked, including ordinal, interval, or ratio data. Because it is based on ranks, it is less sensitive to outliers than the Pearson correlation test, and it is sometimes used to evaluate a correlation when outliers are present. Also like other non-parametric tests, the Spearman correlation test provides a p-value but not a confidence interval. A confidence interval could be calculated through bootstrapping; this post by Thom Baguley suggests other ways to calculate the confidence interval.
To understand how Spearman’s correlation works, imagine ranking all x observations, assigning 1 to the smallest value, 2 to the next smallest, and so on, and doing the same for the y-variable. If the two variables have a perfect monotonic relationship, the ranks for the two variables would be either identical (a positive correlation) or reversed (a negative correlation). Spearman’s rank correlation coefficient measures how well these two ranks agree.
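In fact, Spearman’s coefficient is Pearson’s coefficient calculated on those ranks, which you can verify on simulated data:

# Spearman's rank correlation equals Pearson's correlation of the ranks
x <- rnorm(15)
y <- x^3 + rnorm(15)
cor(rank(x), rank(y))
cor(x, y, method="spearman")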
In R, the Spearman rank correlation is performed by setting the method argument of cor() and cor.test() to "spearman":
cor(x, y, method="spearman")
cor.test(x, y, method="spearman")
Kendall’s tau is another non-parametric correlation coefficient available through the cor() function.
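It is requested the same way, by setting the method argument:

cor(x, y, method="kendall")
cor.test(x, y, method="kendall")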
The Pearson correlation coefficient measures the strength of a linear relationship, that is, how well a line describes the data. In contrast, the Spearman correlation coefficient measures the strength of a monotonic relationship, that is, as one value increases, whether the other consistently increases or decreases, regardless of the amount. Relationships can be monotonic and non-linear (for example, an exponential relationship), and as a result, they can have a strong Spearman correlation but a weak Pearson correlation. An example demonstrates this:
> x <- seq(1, 10, by=0.2)
> y <- exp(x)
> plot(x, y, las=1, pch=16)
> cor(x, y, method="pearson")
[1] 0.71554
> cor(x, y, method="spearman")
[1] 1
Data can also have strong relationships that are neither linear nor monotonic (for example, a sine wave). When this happens, the Pearson and the Spearman correlations may be near zero, even though a very simple relationship may underlie the data. Always plot your data first to understand what the relationship looks like, and always use the appropriate correlation measure to describe your data.
> x <- seq(1, 10, by=0.2)
> y <- cos(x)
> plot(x, y, las=1, pch=16)
> cor(x, y, method="pearson")
[1] -0.03977013
> cor(x, y, method="spearman")
[1] -0.0702436
Finally, remember that outliers affect a Pearson correlation coefficient more than a Spearman coefficient. One outlier (in red) is added to the data in the example below. Notice how much more the single outlier changes the Pearson correlation than the Spearman correlation.
> x <- rnorm(10)
> y <- rnorm(10)
> plot(x, y, pch=16, xlim=c(-3,5), ylim=c(-3,5))
> cor(x, y)
[1] -0.491695
> cor(x, y, method="spearman")
[1] -0.2121212
>
> # Add one outlier
> x <- c(x, 5)
> y <- c(y, 5)
> points(x[11], y[11], pch=16, col="red")
> cor(x, y)
[1] 0.7568571
> cor(x, y, method="spearman")
[1] 0.09090909
The insensitivity of Spearman to outliers comes from its use of ranks: no matter how large the largest value becomes, its rank never changes, and it’s still just the highest-ranked point. As always, plot your data first to know if outliers are an issue.
Random walks commonly show a strong and statistically significant correlation with one another. As you might expect, increasing the length of the time series (that is, the sample size, n) makes the correlation even more statistically significant, that is, it lowers the p-value. As a result, two unrelated time series are often strongly and significantly correlated with each other. This behavior often arises because both variables are correlated with a third variable, such as time.
For example, the number of ministers and the number of alcoholics are positively correlated through the 1900s. Both numbers largely reflect the growing population over that time, so the two display a correlation even though there is no direct link between them. There are plenty of other examples.
The spurious correlation of two random walks is easily simulated by taking the cumulative sum (also called the running sum) of two sets of random numbers:
x <- cumsum(rnorm(25))
y <- cumsum(rnorm(25))
cor(x, y)
Because random walks are unrelated, you might expect a correlation test of two random walks to produce a statistically significant result at a rate equal to α (e.g., 5%), that is, that 5% of pairs of random walks would show a statistically significant correlation. In fact, statistically significant correlations of random walks are far more common than that, meaning that the rate of making a type I error can be much larger than your chosen α value would suggest. You can see this by running the following code several times; p-values less than 0.05 occur far more often than the expected 1 time in 20.
x <- cumsum(rnorm(25))
y <- cumsum(rnorm(25))
cor.test(x, y)$p.value
A spurious correlation is a problem not only for time series but also for spatially correlated data.
You might think that you could address the problem of spurious correlation by collecting more data, but this worsens the problem because increasing the sample size will lower the p-value.
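A short simulation sketches this effect; the pvalue() helper below is just a hypothetical name used for this illustration:

# p-values for correlations of two independent random walks of length n
pvalue <- function(n) cor.test(cumsum(rnorm(n)), cumsum(rnorm(n)))$p.value
median(replicate(1000, pvalue(25)))    # short random walks
median(replicate(1000, pvalue(250)))   # longer random walks: typically even smaller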
Spurious correlation can be solved by differencing the data, that is, calculating the change from one observation to the next, which reduces the size of the data set by one. Differencing removes the time (or spatial) correlation, that is, the dependence of each value on the preceding value. Differenced data will not display spurious correlation because differencing removes the trends that cause it. Use the diff() function to calculate the differences quickly.
xdiff <- diff(x)
ydiff <- diff(y)
cor(xdiff, ydiff)
Whenever you encounter someone trying to correlate time series or spatial series, ask if the data were differenced. If they were not, be skeptical of any non-zero correlation and any claims of statistical significance, especially when the sample size is large.
Here is a simulation that illustrates the problem of spurious time-series correlation. Each plot shows 10,000 random simulations of the correlation coefficient of two data series. The upper plot shows the results for two uncorrelated variables, the middle plot shows two random walks (time series), and the bottom plot shows two differenced random walks (time series). Note how commonly two random walks can produce strong negative or positive correlations, and note how differencing returns the expectation to that of two uncorrelated variables.
par(mfrow=c(3,1))
trials <- 10000
sampleSize <- 25
r <- replicate(trials, cor(rnorm(sampleSize), rnorm(sampleSize)))
hist(r, xlim=c(-1, 1), breaks=50, col="green", main="Uncorrelated Variables", las=1)
r <- replicate(trials, cor(cumsum(rnorm(sampleSize)), cumsum(rnorm(sampleSize))))
hist(r, xlim=c(-1, 1), breaks=50, col="green", main="Random Walks", las=1)
# Note that the correlation coefficients are not clumped around zero as they would be for uncorrelated variables
r <- replicate(trials, cor(diff(cumsum(rnorm(sampleSize))), diff(cumsum(rnorm(sampleSize)))))
hist(r, xlim=c(-1, 1), breaks=50, col="green", main="First Differences of Random Walks", las=1)