In your text, read pages 108–113, which cover covariance and correlation, and chapter 8, which covers regression.
Format your homework in the standard way, regarding formatting, answering questions (same length constraints as before), and saving plots as PDF files.
Download and import the data set nashvilleCarbonates.csv, naming the data frame nashville.
Use the appropriate command to view the structure of this data frame.
Use the appropriate command that will allow you to call the variables by name without using dollar-sign notation. Remember to undo this command as your final step to this problem set.
The Nashville carbonates dataset consists of stable isotopic and major element concentrations from a succession of limestone beds in the Ordovician of central Tennessee. The meter position in the stratigraphic column (that is, the stratigraphic position) is stored in stratPosition.
These rocks record a major change in paleoceanographic conditions, which we will investigate. In addition, a major unconformity (prolonged halt in deposition) separates these distinct oceanographic regimes. Rocks below this unconformity are included in the Carters Formation, and rocks above are placed in the Hermitage Formation. Complicating matters is an unusual bed overlying the unconformity, which has a distinct geochemical signature; this bed is called a transgressive lag.
In this data set, we are interested in magnesium, as an indicator of dolomitization, and aluminum, as an indicator of clay minerals. Specifically, we are interested in whether there is a relationship between these two and whether this relationship differs between the two oceanographic regimes. As we are interested in the strength and form of a relationship between two variables, this is a problem of correlation and regression. Because we are interested in prediction, we will use Model 1 regression, even though there is error in both variables (i.e., we did not fix one of these variables and measure the response in the other one).
We will frequently want to select just the Carters beds (those below a stratigraphic position of 34.2 m) and just the Hermitage beds (those above 34.6 m). Note that these cutoffs will exclude two beds (at 34.285 and 34.555 m) belonging to the transgressive lag. Embedding logical operations throughout our code makes it difficult to read and error-prone. Moreover, repeatedly coding these cutoffs makes errors more likely. To avoid this, we will save those logical tests to objects with an easy-to-understand name and then use these objects whenever we need that logical test.
Make two vectors called carters and hermitage, each built from the logical test for the stratigraphic positions that conform to that formation.
Make a third vector called transgressive for the beds that lie above the cutoff for the Carters and below the cutoff for the Hermitage.
Use these two objects anywhere you need to select data from the Carters or Hermitage formations.
When we make our plots, we will want to use consistent colors for these stratigraphic units. A common convention is to use blue for carbonate rocks, so we will use a darker shade of blue ("blue3") for the warm-water carbonates of the Carters and a lighter shade of blue for the cooler-water carbonates of the Hermitage ("deepskyblue"). To make the transgressive lag stand out, we will use "black". These colors will work well for those with color blindness, too. Rather than embedding strings for these colors throughout our code, we will create descriptively named objects for these colors.
Create three objects named cartersCol, hermitageCol, and transgressiveCol, and assign the names (as strings) of the appropriate colors to them. Because col is a common argument in plots, it is fine to use the shortened version of Col instead of the full name, Color.
Use these named colors from here on, making your code easier to read and less error-prone. If this were in your own work and you wanted to change a color, you would need to do it in only one place.
Elemental abundances must be positive numbers, and they commonly have lognormal distributions (for example, we measure hydrogen as pH, which is a log scale). To pull in the long right tail common to lognormally distributed data, we will apply a log transformation.
Create two objects, logAl and logMg, that hold the base-10 logs of Al and Mg; use the simplest command for making a base-10 log. Use these two objects from here on; do not repeat yourself by taking the log of Al or Mg again.
Create a bivariate plot of logAl vs. logMg. Rotate the y-axis labels, and give the x and y labels good names, not the names of your objects. Because we have taken the logs, it is okay to exclude the units in the axis labels. We will separately add the points for the Carters and Hermitage, so do not plot any points here. This will be plot 1.
Add the points for the Carters Formation as filled circles of the appropriate color.
Add the points for the Hermitage Formation as filled circles of the appropriate color. Remember to use your up arrow to repeat the previous command and make the appropriate changes. Not only can this save you a lot of typing, but it also makes your code more consistent and therefore easier to read. Look for opportunities to do this when you code.
To help the points for the transgressive lag stand out even more, we will use open circles of the appropriate color. Although an open circle is the default symbol, specify 1 as the pch so that our code matches the previous two lines. If we didn't do this, it might look like we had forgotten to set the symbol. By making the transgressive lag symbols open circles, we are also decreasing their visual weight to emphasize our primary narrative, which will be about contrasting the Carters and Hermitage. Plots are made to advance certain ideas, and that influences their design.
These four lines of code should be relatively short and self-explanatory. Notice how easier it is to spot errors than if you had not used the objects you created in steps 2–4.
Examine the plot and imagine the placement of a line describing the relationship for the Carters (the dark blue). Now do the same for the Hermitage (light blue), and for the transgressive lag (open gray circles).
Question 1: Do any of these relationships look similar or not?
Since all three units appear to form a linear trend, let’s evaluate the strength of those relationships with correlation.
In one simple command, test the strength of the linear correlation between log(Mg) and log(Al) for the Carters Formation. Your result should display the correlation, its confidence limits, and a p-value, plus some other information.
Do the same for the Hermitage Formation, then the transgressive lag.
Question 2: Add a comment describing the results of this test for the Carters Formation, following the format on Stating Statistical Results. Your statement should focus on the correlation and the uncertainty in that estimate, not the p-value. Use an appropriate number of significant figures. Follow the examples carefully; don’t try to be creative with your phrasing.
Question 3: Add a similar comment describing the results of the statistical test for the Hermitage.
Question 4: Add a similar comment describing the results of the statistical test for the transgressive lag.
Question 5: Which stratigraphic unit has the strongest linear correlation, and which has the weakest? Your answer should indicate which part of the result tells you this. Your answer should specify that this is a linear correlation. Because you reported the numerical relationships in questions 2–4, you should not repeat them here.
Question 6: Which stratigraphic unit has the most well-constrained correlation, and which has the most poorly constrained correlation? Your answer should indicate which part of the result tells you this. As in question 5, you should not repeat the numerical support behind your answer.
In three commands, separately test the strength of the monotonic correlation for the Carters Formation, the Hermitage Formation, and the transgressive lag. Notice that our code always follows these three in the same order; that consistency makes your code easier to read. Note the warnings about ties when you run these commands, but do not be concerned by them, as they are common with nonparametric methods based on ranks. The main effect is that the p-value may be somewhat off, but this is of no concern if you are not overly focused on the exact value, as you shouldn’t be for reasons we’ve discussed.
Question 7: Which stratigraphic unit has the strongest monotonic correlation, and which has the weakest? Your answer should indicate which part of the result tells you this. Your answer should specify that this is a monotonic correlation.
Now we will describe the linear relationships of log(Mg) and log(Al) in the three stratigraphic units; in other words, we will fit lines to the data and calculate the slope and y-intercepts that describe them.
Perform a linear regression of logAl and logMg for the Carters Formation. Treat logAl as the dependent variable and logMg as the independent variable. Assign these to objects named cartersRegression.
Display the results of the regression using the summary() function. This will display many things, including the estimates of slope and y-intercept, significance tests (p-values for them), and several measures and tests of goodness of fit of the regression line.
Calculate 95% confidence intervals on the regression (the slope and y-intercept) for the Caarters using the confint() function.
Question 8: Write the equation of each line in the form of Y = b1 X + b0, replacing X and Y with their correct object names, and replacing b1 and b0 with their correct values, to a reasonable number of significant figures, which you can determine from the data. Write the equation simply; no parentheses are needed. Your equation should look something like
numFrogs = 0.03 * numCrickets - 1.4.
Skip a line, then repeat the same steps for the Hermitage Formation. Use a similar name for the regression object. Your comment will be question #9. The blank lines help to group related code and separate it from other groups of code.
Skip a line, then repeat the same steps for the transgressive lag. Use a similar name for the regression object. Your comment will be question #10.
Question 11: Skip a line. Which of the three stratigraphic units has the most tightly constrained slope, and which has the most weakly constrained slope? Your answer should indicate which part of the results tells you this.
Using abline(), add the regression for the Carters Formation to your plot, using the line color for the formation. To give a bit more weight to the line, double its width. Use the simplest command possible for adding the regression lines.
Do the same for the Hermitage Formation.
Do the same for the transgressive lag, but to deemphasize the line, make it dashed and of the default normal width. There are a couple of ways to make a dashed line, but do it in the way that makes clear the intent of your code.
With three calls to the text() command, add the following labels to the plot: “Carters”, “Hermitage”, “transgressive lag”. Use the appropriate colors for each.
Place each label near the corresponding cloud of points on the plot, but the labels should not cover any data points. This will take some experimentation, and the locator() function can help. Hard-code these coordinates in your code with a reasonable number of significant figures (think about how much a significant figure would move the label). Do not make the user (me!) use the locator() function because that would halt the code, and the user will not know what to do.
Plot 1 is now finished.
We need to evaluate the assumptions of the regressions. To not make the problem set too long, we will do this for only one of the regressions. If this were for your own analysis, you would follow these steps for every regression.
This will be plot 2. Set this up with par and mfrow so that it will form a 2x2 grid of plots, similar to what you did in Problem Set 2. To give the plots some room, make the plot 10" by 10".
Call plot() on cartersRegression to show all four diagnostic plots.
Question 12: Based on these plots, is there any systematic relationship of the residuals with the fitted values? Don’t be concerned with perfection; focus on large deviations on this and the next three questions.
Question 13: Are the residuals normally distributed?
Question 14: Does the size of the residuals change markedly with the position along the regression (i.e, are the residuals heteroscedastic or are they homoscedastic)?.
Question 15: Do any data points have an unusually large effect on the regression (in other words, are any residuals simultaneously large and have high leverage)?
Question 16: Considering your answer to these four questions, are the assumptions of the least-squares regression met for the Carters Formation? Explain, but be succinct.
Finally, undo the command you used in Part 1 that allowed you to use dollar-sign notation.
Quit R and re-run everything in your commands file. Verify that when you paste the code into the terminal, all the code runs without any interaction from you. Verify that your code runs without errors or warnings other than those computing exact p-values with ties. Verify that you can open the two PDF files and that they are correctly named. Always test your code this way. Future you (and your colleagues) will appreciate that your code is reliable.
Format your commands file following the standard instructions. E-mail your commands file to stratum@uga.edu. The subject of your email should be 8370 problem set 7. Do not send me the data file, as I have it already. This problem set is due by 2:00 PM, Wednesday, 22 October. You get a couple extra days to do this one because I will be at the GSA meeting and unable to grade them at my usual time. Still, I am aiming to get this back to you before your final problem set is due.