Problem Sets

Home

Contact

Steven Holland

Problem Set 7: Correlation and Regression

In your text, read pages 108–113, which cover covariance and correlation, and chapter 8, which covers regression.

Format your homework in the standard way, regarding parts to the problem set, answering questions (following the length constraints as before), and saving plots as pdf files.

Part 1

Download and import the data set nashvilleCarbonates.csv, naming the data frame nashville.

Use the appropriate command to view the structure of this data frame.

Use the appropriate command that will allow you to call the variables by name without using dollar-sign notation.

The Nashville carbonates dataset consists of stable isotopic and major element concentrations from a series of limestone beds in the Ordovician of central Tennessee. The meter position in the stratigraphic column (that is, the stratigraphic position) is stored in stratPosition.

These rocks record a major change in paleoceanographic conditions, which we will investigate. In addition, a major unconformity (prolonged halt in deposition) separates these distinct oceanographic regimes. Rocks below this unconformity are included in the Carters Formation, and rocks above are placed in the Hermitage Formation. Complicating matters is an unusual bed overlying the unconformity, which has a distinct geochemical signature; this bed is called a transgressive lag.

In this data set, we are interested in magnesium, as an indicator of dolomitization, and aluminum, as an indicator of clay minerals. Specifically, we are interested in whether there is a relationship between these two and whether this relationship differs between the two oceanographic regimes. As we are interested in the strength and form of a relationship between two variables, this is a problem of correlation and regression. Because we are interested in prediction, we will use Model 1 regression, even though there is error in both variables.

Part 2

We will frequently want to select just the Carters beds (those below a stratigraphic position of 34.2 m) and just the Hermitage beds (those above 34.6 m). Note that these cutoffs will exclude two beds (at 34.285 and 34.555 m) belonging to the transgressive lag. Embedding logical operations throughout our code makes it difficult to read and error-prone: having the wrong value or operator is too easy. To avoid this, we will save those logical tests to objects with an easy-to-understand name and then use these objects whenever we need that logical test.

Make two vectors called carters and hermitage, each built from the logical test for the stratigraphic positions that conform to that formation. Both should be vectors of logical (TRUE and FALSE) value, and each should have the same length as the number of rows in the nashville data frame.

Make a third vector called transgressive for the beds not covered. So that this will work if we collect more data, make a boolean vector that represents all stratigraphic positions above 34.2 and below 34.6.

Use these two objects anywhere you need to select data from the Carters or Hermitage formations.

Part 3

When we make our plots, we will want to consistently use a dark shade of blue ("dodgerblue4") for the Carters (as they are interpreted to be warm-water carbonates) and a light shade of blue ("lightskyblue2") for the Hermitage (thought to be cool-water carbonates). For the transgressive lag, we will use a shade of bluish-gray ("azure4"). Rather than embedding strings for these colors throughout our code, we will create descriptively named objects for these colors.

Create three objects named cartersColor, hermitageColor, and transgressiveColor, and assign the names (as strings) of the appropriate colors to them.

Use these named colors from here on, making your code easier to read and less error-prone. If this was in your own work and you wanted to change a color, you would need to do it in only one place.

Part 4

Elemental abundances must be positive numbers and often follow lognormal distributions (for example, we measure hydrogen as pH, which is a log scale). Such variables are commonly placed on a log scale through a log transformation, which we will do here.

Create two objects, logAl and logMg, that hold the base-10 logs of Al and Mg; use the simplest command for making a base-10 log. Use these from here on; don’t repeat yourself by taking the log of Al or Mg again.

Part 5

Create a bivariate plot of logAl vs. logMg. Rotate the y-axis labels, and give the x and y labels good names, not the names of your objects. Because we will separately add the points for the Carters and Hermitage, do not plot any points here. Do not save this plot as a pdf file yet; we will do that later.

Add the points for the Carters Formation as filled circles of the appropriate color. Remember to use your up arrow to repeat the previous command and make the appropriate changes. This can save you a lot of typing; look for opportunities to do this when you code.

Add the points for the Hermitage Formation as filled circles of the appropriate color.

Add the points for the transgressive lag as open circles of the appropriate color. Although an open circle is the default symbol, we will specify 1 as the pch so that our code matches the previous two lines. If we didn't do this, it might look like we had forgotten to set the symbol. We are making the transgressive lag symbols open circles so that they do not have as much visual weight as those of the Carters and the Hermitage, which are both filled circles.

These three lines of code should be relatively short and self-explanatory. Notice how easier it is to spot errors than if you had not used the objects you created in steps 2–4.

Examine the plot and imagine the placement of a line describing the relationship for the Carters (the dark blue). Now do the same for the Hermitage (light blue), and for the transgressive lag (open gray circles).

Question 1: Do any of these relationships look similar or not?

Part 6

Since all three units appear to form a linear trend, let’s evaluate the strength of those relationships with correlation.

In one simple command, test the strength of the linear correlation between log(Mg) and log(Al) for the Carters Formation. Your result should display the correlation, its confidence limits, and a p-value, plus some other information.

Do the same for the Hermitage Formation.

Do the same for the transgressive lag.

Question 2: Add a comment describing the results of this test for the Carters Formation, following the format on Stating Statistical Results. Your statement should focus on the correlation and the uncertainty in that estimate, not the p-value. Use an appropriate number of significant figures. Follow the examples carefully; don’t try to be creative with your phrasing.

Question 3: Add a similar comment describing the results of the statistical test for the Hermitage.

Question 4: Add a similar comment describing the results of the statistical test for the transgressive lag.

Question 5: Which stratigraphic unit has the strongest linear correlation, and which has the weakest? Your answer should indicate which part of the result tells you this. Your answer should specify that this is a linear correlation.

Question 6: Which stratigraphic unit has the most well-constrained correlation, and which has the most poorly constrained correlation? Your answer should indicate which part of the result tells you this.

Part 7

In three commands, separately test the strength of the monotonic correlation for the Carters Formation, the Hermitage Formation, and the transgressive lag. Notice that our code always follows these three in the same order; that consistency will make your code easier to read. Note but do not be concerned about any warnings about ties, as this is common with nonparametric methods. The main effect is that the p-value may be somewhat off, but this is of no concern if you are not overly focused on the exact value, as you shouldn’t be for reasons we’ve discussed.

Question 7: Which stratigraphic unit has the strongest monotonic correlation, and which has the weakest? Your answer should indicate which part of the result tells you this. Your answer should specify that this is a monotonic correlation.

Part 8

Now we will describe the linear relationships of log(Mg) and log(Al) in the three stratigraphic units; in other words, we will calculate the slope and y-intercepts.

Perform a linear regression of logAl and logMg for the Carters Formation. Treat logAl as the dependent variable and logMg as the independent variable. Assign these to objects named cartersRegression.

Display the results of the regression using the summary() function. This will display many things, including the estimates of slope and y-intercept, significance tests (p-values for them), and several measures and tests of goodness of fit of the regression line.

Calculate 95% confidence intervals on the regression (the slope and y-intercept) using the confint() function.

Question 8: Write the equation of each line in the form of Y = b1 X + b0, replacing X and Y with their correct object names, and replacing b1 and b0 with their correct values, to a reasonable number of significant figures. Write the equation simply; do use parentheses as none are needed. Your equation should look something like
numFrogs = 0.03 * numCrickets - 1.4.

Skip a line, then repeat the same four steps for Hermitage Formation. Use a similar name for the regression object. Your comment will be question #9. These blank lines help to group related code and separate it from other groups of code.

Skip a line, then repeat the same four steps for the transgressive lag. Use a similar name for the regression object. Your comment will be question #10.

Question 11: Skip a line. Which of the three stratigraphic units has the most tightly constrained slope, and which has the most weakly constrained slope? Your answer should indicate which part of the results tells you this.

Part 9

Using abline(), add the regression for the Carters Formation to your plot, using the line color for the formation. Use the simplest command possible for adding the regression lines.

Do the same for the Hermitage Formation.

Do the same for the transgressive lag.

Part 10

With three calls to the text() command, add the following labels to the plot: “Carters”, “Hermitage”, “transgressive lag”. Use the appropriate colors for each.

Place each label near the corresponding cloud of points on the plot, but the labels should not cover any data points. This will take some experimentation, and the locator() function can help. Hard code these coordinates in your code with a reasonable number of significant figures (think about how much a significant figure would move the label). Do not make the user (me!) use the locator() function because that would halt the code, and the user will not know what to do.

Part 11

Our plot is finished, so let’s create an exact copy of this plot as a pdf file as plot 1. Starting with the pdf(), repeat all the plotting calls used to generate your plot: all the calls to plot(), points(), abline(), and text(), with no blank lines in between. Remember to close the pdf plot with the appropriate command. As always, verify that you can open the pdf file.

Part 12

We need to evaluate the assumptions of the regressions. To not make the problem set too long, we will do this for only one of the regressions. If this was for your own analysis, you would do these steps on every regression.

This will be plot 2 (i.e., a pdf file). Create a new plot window, and set it up using par and mfrow to create a 2x2 arrangement of plots. We did something similar in problem set 2. To give the plots some room, make the plot 10" by 10".

Call plot() on cartersRegression to show all four diagnostic plots.

Question 12: Based on these plots, is there any systematic relationship of the residuals with the fitted values? Don’t be concerned with perfection; focus on large deviations on this and the next three questions.

Question 13: Are the residuals normally distributed?

Question 14: Does the size of the residuals change markedly with the position along the regression (i.e, are the residuals heteroscedastic or are they homocscedastic)?.

Question 15: Do any data points have an unusually large effect on the regression (in other words, are any residuals simultaneously large and have high leverage)?

Question 16: Considering your answer to these four questions, are the assumptions of the least-squares regression met for the Carters Formation? Explain, but be succinct.

Finally, undo the command you used in Part 1 that allowed you to use dollar-sign notation.

Submitting your problem set

Quit R and re-run everything in your commands file. Verify that when you paste the code into the terminal all of the code runs without any interaction from you. Verify that your code runs without errors or warnings other than those computing exact p-values with ties. Verify that you can open the two pdf files and they are correctly named. Always test your code this way. Future you (and your colleagues) will appreciate that your code is reliable.

Format your commands file following the standard instructions. E-mail your commands file to stratum@uga.edu. The subject of your email should be 8370 problem set 7. Do not send me the data file, as I have it already. This problem set is due on 26 October.