We frequently work with two variables simultaneously and want to understand their relationship. In doing so, we have two related but distinct goals. The first is to measure the strength of the relationship between the two variables, which is called correlation. The second is to find the mathematical function that describes the relationship and can be used to make predictions; finding that function is called regression. We will cover these two topics separately in the next two lectures.
There are two common types of correlation, one describing the linear relationship and one describing the monotonic relationship. As you might guess, a linear relationship is one described by a line. A monotonic relationship is where one variable consistently increases or decreases as the other increases, although the relationship might not be linear (e.g., exponential). To measure the linear correlation, we start with the concept of covariance.
Covariance is based on the sum of products, which is analogous to the sum of squares used to calculate variance:
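$$\sum_{i=1}^{n} \left(x_{ij} - \bar{x}_j\right)\left(x_{ik} - \bar{x}_k\right)$$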
where i is the observation number, n is the total number of observations, j is one variable, and k is the other variable. The sum of products is the deviations from the mean of one variable, multiplied by the deviations from the mean for the other variable, summed over all members of the data series.
The sum of products can also be calculated in a single pass, that is, without first calculating the means:
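$$\sum_{i=1}^{n} x_{ij}\,x_{ik} \;-\; \frac{\left(\sum_{i=1}^{n} x_{ij}\right)\left(\sum_{i=1}^{n} x_{ik}\right)}{n}$$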
In this formulation, the first term is the uncorrected sum of products, and the second is the correction term. It is “corrected” in the sense that it subtracts the mean from each of the observations. In other words, it is the sum of products relative to the means of each variable, just as we measured the sum of squares relative to the mean.
Covariance is the sum of products divided by the n-1 degrees of freedom. Note the similarity in the formulas for variance and covariance.
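Written side by side, with $s_j^2$ denoting the variance of variable j and $\mathrm{cov}_{jk}$ the covariance of variables j and k:

$$s_j^2 = \frac{\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)^2}{n-1} \qquad\qquad \mathrm{cov}_{jk} = \frac{\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)\left(x_{ik} - \bar{x}_k\right)}{n-1}$$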
Unlike the sum of squares and variance, which can have only positive values (or zero), the sum of products and covariance can be positive or negative. Covariance has units equal to the product of the units of the two measurements, similar to variance, which has units that are the square of the measurement’s units. Also like variance, covariance is a function of the scale on which the data were measured, which is generally undesirable. Converting covariance into a dimensionless measure produces Pearson’s correlation coefficient.
Pearson’s correlation coefficient is also called the product-moment correlation coefficient or simply the correlation coefficient. When someone doesn’t specify the type of correlation coefficient, they usually mean Pearson’s.
To produce a dimensionless correlation measure, we need to divide covariance by something with the same units, namely the product of the standard deviations of the two variables. This standardizes the joint variation in the two variables by the product of the variation in each variable. The correlation coefficient therefore measures the strength of the relationship between two variables, but it doesn’t depend on the units of those variables.
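In symbols, with $s_j$ and $s_k$ as the standard deviations of the two variables:

$$r = \frac{\mathrm{cov}_{jk}}{s_j\, s_k}$$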
Pearson’s correlation coefficient varies from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship). A value of 0 indicates no linear relationship, which occurs when covariance equals zero. If one variable does not vary, one of the standard deviations will be zero, causing the denominator to be zero and making the correlation coefficient undefined.
Pearson’s correlation coefficient can also be calculated as the sum of products divided by the square root of the product of the two sums of squares:
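$$r = \frac{\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)\left(x_{ik} - \bar{x}_k\right)}{\sqrt{\left[\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)^2\right]\left[\sum_{i=1}^{n}\left(x_{ik} - \bar{x}_k\right)^2\right]}}$$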
In R, covariance is calculated with cov(), and correlation is calculated with cor():
cov(x, y)
cor(x, y)
Both take two arguments, each being a vector corresponding to a variable.
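To see how these functions connect to the formulas above, both can be reproduced by hand. Here is a minimal sketch, using two small made-up vectors for x and y:

x <- c(2.1, 3.4, 5.0, 6.2, 7.9)   # hypothetical data
y <- c(1.3, 2.9, 4.4, 5.1, 8.0)
n <- length(x)

# covariance: sum of products divided by n-1
sum((x - mean(x)) * (y - mean(y))) / (n - 1)
cov(x, y)

# Pearson's r: covariance divided by the product of the standard deviations
cov(x, y) / (sd(x) * sd(y))
cor(x, y)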
As always, we can generate estimates with confidence intervals and test hypotheses about a statistic if we know its expected distribution. In this case, if the data are normally distributed and the null hypothesis is that there is zero correlation, a t-statistic based on Pearson’s correlation coefficient follows a t-distribution, where the standard error of Pearson’s correlation coefficient (s_r) is:
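$$s_r = \sqrt{\frac{1 - r^2}{n - 2}}$$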
The t-test for Pearson’s r has the standard setup, that is, the statistic minus the null hypothesis for the parameter, divided by the standard error of the statistic:
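$$t = \frac{r - \rho}{s_r}$$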
Since the null hypothesis is that the correlation is zero, the ρ term drops out, and the t-statistic is simply Pearson’s correlation coefficient divided by its standard error. The t-test of Pearson’s correlation coefficient has n-2 degrees of freedom because population variance is estimated for two variables instead of one.
All t-tests assume random sampling and that the statistic is normally distributed. In this case, Pearson’s r will be normally distributed when the null hypothesis is that ρ equals zero and if both variables are normally distributed. Finally, the test assumes that the correlation coefficient is calculated on interval or ratio data. If you have ordinal data, you must use a non-parametric test. For nominal data, use a test of association, such as the chi-squared (χ²) test or the G-test.
The t-test for Pearson’s correlation coefficient is most easily performed in R with the cor.test() function.
cor.test(x, y)
Like other parametric tests in R, the output provides an estimate of the statistic (the correlation coefficient), a confidence interval on the statistic, and a p-value (corresponding to a null hypothesis of zero correlation). Confidence levels other than 95% can be set with the conf.level argument, and one-sided tests can be specified with the alternative argument. As always, you must have external reasons for specifying a one-sided test: you cannot observe the correlation and decide on a one-sided test based on that observation. These confidence intervals assume normality of the data. If your data are not normally distributed, try a bootstrap or jackknife, or use one of the methods suggested by Bishara and Hittner (2019).
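The pieces of this output can also be reproduced from the formulas above. A minimal sketch, assuming x and y are the same vectors passed to cor.test():

r <- cor(x, y)
n <- length(x)
se <- sqrt((1 - r^2) / (n - 2))   # standard error of r
tstat <- r / se                   # t-statistic for the null hypothesis of zero correlation
2 * pt(-abs(tstat), df=n - 2)     # two-sided p-value; should match cor.test(x, y)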
The square of Pearson’s correlation coefficient, r2, is one form of a widely reported statistic called the Coefficient of Determination. This coefficient is extremely useful: it measures the proportion of variation explained by a linear relationship between two variables, making it similar to η2 used in ANOVA. As a proportion, r2 is always between zero and one.
Pearson’s correlation coefficient measures how well a line describes the relationship between two variables, but two variables can be correlated non-linearly, such as if the relationship is exponential or logistic. Both would be described as monotonic relationships: as one variable increases, the other consistently increases or consistently decreases. A sinusoidal relationship would be an example of a non-monotonic relationship. In some cases, a data transformation can be applied to one or both variables to produce a linear relationship, allowing Pearson’s coefficient to be used. If that is not possible and the relationship is monotonic, a non-parametric measure of correlation is called for. There are two common choices.
Spearman’s rank correlation coefficient (ρ) is a non-parametric measure of monotonic correlation. The Spearman correlation test assumes only that your data are a random sample. Like most non-parametric tests, Spearman’s is calculated on the ranks of the observations and will therefore work with any type of data that can be ranked, including ordinal, interval, or ratio data. Because it is based on ranks, it is less sensitive to outliers than the Pearson correlation test, and it is sometimes used to evaluate a correlation when outliers are present. Also like other non-parametric tests, the Spearman correlation test provides a p-value but not a confidence interval. A confidence interval could be calculated through bootstrapping; this post by Thom Baguley suggests other ways to calculate the confidence interval.
To understand how Spearman’s correlation works, imagine ranking all x observations, assigning 1 to the smallest value, 2 to the next smallest, and so on, and doing the same for the y-variable. If the two variables have a perfect monotonic relationship, the ranks for the two variables would be either identical (a positive correlation) or reversed (a negative correlation). Spearman’s rank correlation coefficient measures how well these two sets of ranks agree.
In R, the Spearman rank correlation is performed by setting the method argument of cor() and cor.test() to "spearman" (in quotes):
cor(x, y, method="spearman")
cor.test(x, y, method="spearman")
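Because Spearman’s coefficient is simply Pearson’s coefficient calculated on the ranks, you can verify it yourself. A quick sketch, assuming x and y are two numeric vectors:

cor(rank(x), rank(y))          # Pearson's r of the ranks
cor(x, y, method="spearman")   # should give the same value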
Kendall’s tau (τ) is another non-parametric correlation coefficient available through the cor() function. Kendall’s is less commonly used than Spearman’s, and many statistical texts mention only Spearman’s. The two metrics tend to produce similar values when the correlation is weak (Sokal and Rohlf 1995). Some have noted that the value of Spearman’s coefficient tends to be larger, sometimes substantially so (Colwell and Gillett 1982), but this evidence seems to be anecdotal. Despite the limited comparative discussion, there are cases where one of the metrics may be more appropriate. Because Spearman’s places greater weight on dissimilar rankings, whereas Kendall’s weights all ranks equally, Spearman’s should be preferred when the validity of similar rankings is uncertain (Sokal and Rohlf 1995). Similarly, Kendall’s reportedly performs better when there are numerous tied ranks (Legendre and Legendre 1998).
Rank correlation methods are generally preferred when data have outliers, because outliers strongly affect the Pearson correlation coefficient. For example, one outlier (in red) is added to the data in the example below. Notice how much the outlier increases the Pearson correlation compared with the Spearman correlation.
x <- rnorm(10)
y <- rnorm(10)
plot(x, y, pch=16, xlim=c(-3,5), ylim=c(-3,5))
cor(x, y)                     # -0.49
cor(x, y, method="spearman")  # -0.21

# Add one outlier
x <- c(x, 5)
y <- c(y, 5)
points(x[11], y[11], pch=16, col="red")
cor(x, y)                     # 0.76
cor(x, y, method="spearman")  # 0.09
The insensitivity of Spearman to outliers comes from its use of ranks: no matter how large the largest value becomes, its rank never changes, and it’s still just the highest-ranked point. As always, plot your data first to know if outliers are an issue.
Data can also have strong relationships that are neither linear nor monotonic (for example, a sine wave). When this happens, the Pearson and the Spearman correlations may be near zero, even though a very simple relationship may underlie the data. Always plot your data first to understand what the relationship looks like, and always use the appropriate correlation measure to describe your data.
x <- seq(1, 10, by=0.2)
y <- cos(x)
plot(x, y, las=1, pch=16)
cor(x, y, method="pearson")   # -0.04
cor(x, y, method="spearman")  # -0.07
Random walks commonly show a strong and statistically significant correlation with one another. As you might expect, increasing the length of the time series (that is, the sample size, n) causes the correlation to have even greater statistical significance, that is, a lower p-value. As a result, two unrelated time series often have a statistically significant and potentially large correlation. This behavior often arises when the two variables are correlated with a third variable, such as time.
For example, the number of ministers and the number of alcoholics are positively correlated through the 1900s. Both numbers largely reflect the growing population during that time, so the two display a correlation even though there is no direct link between them. There are plenty of other examples.
The spurious correlation of two random walks is easily simulated by taking the cumulative sum (also called the running sum) of two sets of random numbers:
x <- cumsum(rnorm(25))
y <- cumsum(rnorm(25))
cor(x, y)
Because the random walks are unrelated, you might expect a correlation test of two random walks to produce a statistically significant result at a rate equal to α (e.g., 5%), that is, that 5% of pairs of random walks would have a statistically significant correlation. This is known as the Type I error rate, the proportion of times that a true null hypothesis (here, of no correlation) is rejected. Statistically significant correlations of random walks are far more common than that, meaning that the Type I error rate can be much larger than α, the significance level. You can see this by running the following code several times; p-values less than 0.05 occur far more frequently than the expected 1 out of 20 times.
x <- cumsum(rnorm(25))
y <- cumsum(rnorm(25))
cor.test(x, y)$p.value
A spurious correlation is a problem not only for time series but also for spatially correlated data.
You might think that you could address the problem of spurious correlation by collecting more data, but this worsens the problem because increasing the sample size will lower the p-value.
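Both problems, the inflated Type I error rate and its growth with sample size, can be seen by estimating the rejection rate directly. This is only an illustrative simulation; the function rejectionRate() is made up for this sketch:

rejectionRate <- function(n, trials=1000) {
  # proportion of unrelated random-walk pairs of length n that give p < 0.05
  p <- replicate(trials, cor.test(cumsum(rnorm(n)), cumsum(rnorm(n)))$p.value)
  mean(p < 0.05)
}
rejectionRate(25)    # far above the nominal 0.05
rejectionRate(100)   # typically even larger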
The problem of spurious correlation can be solved by differencing the data, that is, calculating the change from one observation to the next, which reduces the size of the data set by one. Differencing removes the time (or spatial) correlation, that is, the dependence of a value on the preceding value. Differenced data will not display spurious correlation because the trends are removed. Use the diff() function in R to calculate the differences quickly.
xdiff <- diff(x)
ydiff <- diff(y)
cor(xdiff, ydiff)
Whenever you encounter someone trying to correlate time series or spatial series, ask if the data were differenced. If they were not, be skeptical of any non-zero correlation and any claims of statistical significance, especially when the sample size is large.
Here is a simulation that illustrates the problem of spurious time-series correlation. Each plot shows 10,000 random simulations of the correlation coefficient of two data series. The upper plot shows the results for two uncorrelated variables, the middle plot shows two random walks (time series), and the bottom plot shows two differenced random walks (time series). Note how commonly two random walks can produce strong negative or positive correlations, and note how differencing returns the expectation to that of two uncorrelated variables.
# Distribution of correlation coefficients for 10,000 simulated pairs of series
par(mfrow=c(3,1))
trials <- 10000
n <- 25

# two uncorrelated normal variables
r <- replicate(trials, cor(rnorm(n), rnorm(n)))
hist(r, xlim=c(-1, 1), breaks=50, col="paleturquoise4", las=1,
  main="Uncorrelated Variables")

# two unrelated random walks
r <- replicate(trials, cor(cumsum(rnorm(n)), cumsum(rnorm(n))))
hist(r, xlim=c(-1, 1), breaks=50, col="paleturquoise4", las=1,
  main="Random Walks")

# first differences of two unrelated random walks
r <- replicate(trials, cor(diff(cumsum(rnorm(n))), diff(cumsum(rnorm(n)))))
hist(r, xlim=c(-1, 1), breaks=50, col="paleturquoise4", las=1,
  main="First Differences of Random Walks")
Colwell, D.J., and J.R. Gillett. 1982. Spearman versus Kendall. The Mathematical Gazette 66:307–309.
Legendre, P., and L. Legendre. 1998. Numerical Ecology, 2nd English edition. Developments in Environmental Modeling 20. Elsevier, Amsterdam, 853 p.
Sokal, R.R., and F.J. Rohlf. 1995. Biometry, 3rd edition. Freeman, New York, 887 p.