Problem Set 6: ANOVA

Conrad Labandeira and Nigel Hughes independently studied the Late Cambrian trilobite Dikelocephalus for their master’s theses. Each came to the same conclusion: Ulrich and Resser, who described many species of that genus in the early 1900s, had vastly oversplit the genus, and many of the species could not be distinguished. For this week’s problem set, we will examine a small portion of the data they analyzed (Journal of Paleontology 68:492–517).

In the trilobites2 data set are two measurements of the free cheek (the side of the head) for four species of the trilobite genus Dikelocephalus. Labandeira and Hughes called these measurements omega and sigma; both are measured in millimeters (mm). They measured many other aspects of these trilobites and for many other species, but to keep this problem tractable, we will examine only sigma and only on four of the many species (edwardsi, gracilis, raaschi, and subplanus).

In this problem set, we want to evaluate whether these four species can be distinguished on the basis of this one measure. This is a problem of central tendency: how different are these species? If the central tendencies of these measures are very similar, it will be hard to distinguish the four species, but if the central tendencies are substantially different, we will be able to tell them apart. It is possible that some species are distinguishable, but others are not.

You will do a series of analyses on sigma, and you will use the results to evaluate whether these trilobites can be distinguished using sigma. When you are asked to state the results of a statistical test, be sure to follow standard practice, and state your answer as a comment, as on the previous problem set. Likewise, plots should be saved as pdf files, following our naming conventions, and answers to questions should follow the format and constraints established in the last problem set. Finally, remember always that if you are describing what can be concluded from a test, you must include the numerical test results following the examples on Stating statistical results.

Part 1

Import the data set trilobites2 and save it to an object named trilobites. One of the columns in these data contains species names saved as strings. When there is a small set of consistently used strings (like here), it is better to import them as a factor. When you import these data, do this by setting stringsAsFactors=TRUE.

Use the appropriate command to view the structure of this data frame.

Use the appropriate command that will allow you to call the variables by name without using dollar-sign notation.

Last, we want to see the names of all species; use levels() to see all the values of a factor.
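As a rough illustration of this kind of import-and-inspect workflow, here is a minimal sketch on a tiny made-up table. The inline text stands in for a data file, and the column names group and y are hypothetical. The file name, separator, and reader function needed for trilobites2 are not stated here, so read.table(text=...) is only an assumption made to keep the example self-contained.

```r
# Made-up data standing in for a real file; 'group' and 'y' are hypothetical names
toyText <- "group y
a 4.1
a 5.0
b 6.9
b 7.4
c 4.9
c 5.2"

# stringsAsFactors=TRUE converts the string column to a factor on import
toy <- read.table(text = toyText, header = TRUE, stringsAsFactors = TRUE)

str(toy)        # structure: one factor column and one numeric column
attach(toy)     # columns can now be called by name, without toy$
levels(group)   # every distinct value of the factor
mean(y)         # no dollar-sign notation needed while attached
detach(toy)     # undo attach() once finished
```

Because attach() changes the search path for the rest of the session, pairing it with detach() keeps later code from picking up the wrong columns.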

Part 2

Visualize how sigma is distributed for each species, using one plot constructed in one line of code. This is plot 1. Use stripchart(), with the data (sigma) grouped by species. Hint: stripchart(y~x) will plot the variable y grouped by the variable x (often a factor). Use solid black circles for the plotting symbol. Give an appropriate x-axis label following the convention for indicating the units.

You do not need to rotate the y-axis labels, but if you are able to do this without the labels being truncated, there is a +1 bonus. Hint: examine the entries for mar or mai on the par help page. If you do this, it will require an extra line of code before your stripchart() call.
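The formula interface and the margin settings can be tried on made-up data first. The sketch below uses a hypothetical data frame (columns group and y, with deliberately long group names), and the margin values and the axis label format are assumptions, not the required answer.

```r
# Made-up data; the long group names show why a wider left margin helps
toy <- data.frame(
  group = factor(rep(c("first group", "second group", "third group"), each = 5)),
  y = c(4.1, 5.0, 4.6, 5.3, 4.8,
        6.9, 7.4, 7.1, 6.5, 7.8,
        4.9, 5.2, 4.4, 5.6, 5.0)
)

# mar is c(bottom, left, top, right) in lines of text; las=1 makes axis labels horizontal
oldPar <- par(mar = c(5, 8, 2, 2), las = 1)

# y grouped by the factor; pch=16 is a solid circle, drawn in black by default
stripchart(y ~ group, data = toy, pch = 16, xlab = "y (mm)")

par(oldPar)   # restore the previous settings
```

Saving the value returned by par() makes it easy to restore the earlier margins afterwards.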

Question 1: Let’s think about the plot before doing our analyses. Does the mean value for all four species appear to be about the same, or do any of the species look different? Eyeball any estimates of the means.

Part 3

A common way to evaluate data like these is to test in one step whether the means of all of the species are statistically indistinguishable, and this is usually done with an ANOVA. Before you can run an ANOVA (or any statistical test), you must verify its assumptions. Remember, assumptions are requirements of the tests; they are things you must demonstrate, not things that you should assume.

Question 2: The first requirement of an ANOVA is that the data are normally distributed. Based on a visual examination of your stripchart, and considering the small sample size, is sigma roughly symmetrically distributed for each species or is it clearly asymmetrical for some species (if so, which ones)? Because the data set is small, we should be concerned only about strong departures from symmetry, not things like bimodality.

The second requirement of an ANOVA is that the variance for each species is the same, in other words, that the scatter for each species is approximately the same. Use the appropriate test that lets you compare the variances of sigma on more than two groups in a single and simple line of code.
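One base-R function that compares variances across several groups in a single call is bartlett.test(). Whether it is the test intended here is an assumption, and the data below are made up (hypothetical columns group and y).

```r
# Made-up data: three groups of five measurements each
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 5)),
  y = c(4.1, 5.0, 4.6, 5.3, 4.8,
        6.9, 7.4, 7.1, 6.5, 7.8,
        4.9, 5.2, 4.4, 5.6, 5.0)
)

# Null hypothesis: the variance of y is the same in every group
varTest <- bartlett.test(y ~ group, data = toy)
varTest            # test statistic, degrees of freedom, and p-value
varTest$p.value    # the p-value alone
```

Bartlett's test is itself sensitive to non-normality, so a small p-value may reflect skewed data as well as unequal variances.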

Question 3: Should you accept or reject the null hypothesis that the variances of all the species are the same?

Question 4: State whether the assumptions of normality and homoscedasticity (equal variances) have been met for an ANOVA on sigma.

Part 4

Normally, you would proceed to the ANOVA only if the assumptions of the test were met. In this case, I want you to perform the ANOVA regardless of whether you think the assumptions were met.

Run an ANOVA using aov() on sigma as a function of species. So that we can see the full ANOVA table, save the results to an object called sigmaANOVA, then display this ANOVA table with the appropriate command.
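The general pattern of fitting with aov() and then displaying the table can be sketched on made-up data. The data frame, its columns (group and y), and the object name toyANOVA are all hypothetical.

```r
# Made-up data: three groups of five measurements each
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 5)),
  y = c(4.1, 5.0, 4.6, 5.3, 4.8,
        6.9, 7.4, 7.1, 6.5, 7.8,
        4.9, 5.2, 4.4, 5.6, 5.0)
)

# Model y as a function of the grouping factor and keep the fitted object
toyANOVA <- aov(y ~ group, data = toy)

# summary() shows the ANOVA table: Df, Sum Sq, Mean Sq, F value, and Pr(>F)
summary(toyANOVA)
```

Printing the aov object directly gives only the sums of squares and degrees of freedom; summary() is what adds the F statistic and p-value.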

Question 5: Do the results indicate that all four species are indistinguishable with regard to sigma? Also, if you concluded in Question 4 that the assumptions of the ANOVA were invalid, explain how this affects the interpretation of the test results. Although you may not need it (be concise!), you can use up to double the normal length for an answer to a question.

Part 5

If there are significant differences between species, there are two ways we can determine which species are distinguishable. The first is to run TukeyHSD() on the ANOVA results. Do that.
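To see what TukeyHSD() returns before applying it to the trilobite results, here is a sketch on made-up data (hypothetical columns group and y, hypothetical object names).

```r
# Made-up data: three groups of five measurements each
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 5)),
  y = c(4.1, 5.0, 4.6, 5.3, 4.8,
        6.9, 7.4, 7.1, 6.5, 7.8,
        4.9, 5.2, 4.4, 5.6, 5.0)
)

toyANOVA <- aov(y ~ group, data = toy)

# One row per pair of groups: difference in means (diff), confidence
# limits on that difference (lwr, upr), and an adjusted p-value (p adj)
toyTukey <- TukeyHSD(toyANOVA)
toyTukey
```

A pair whose interval from lwr to upr excludes zero is one whose means differ at the chosen family-wise confidence level (95% by default).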

Question 6: Based on this test, which pairs of species can be distinguished using sigma, and which cannot? Given the number of test results here, do not provide the test results parenthetically as you would normally do. If this were for a manuscript, you would show the test results in a table, and cite the table once parenthetically, but we will not do that here. State these results as simply as possible; in particular, you may be able to simplify the answer while not explicitly listing every possible comparison of species.

The problem with the Tukey HSD approach is that it still does not tell us what we are most interested in, the mean sigma for each species and our uncertainty in those means. To generate these confidence intervals, we will use the t.test() function. Call this function on sigma for each species individually (so, in four lines of code), assigning the result for each to an object named for the species. Next, display the estimate and the confidence interval for one of the species, using two lines of code. Refer to the help page for t.test() to see how to extract (using $ notation) the estimate and the confidence interval from the result (called the value on the help page). Do the same for the remaining three species (in other words, display the estimate then the confidence interval for one species, then the same for the next species, and so on).
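The extraction step can be sketched for a single made-up group. The data and the object name groupA are hypothetical; the components estimate and conf.int are those listed under Value on the t.test() help page.

```r
# Made-up data: three groups of five measurements each
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 5)),
  y = c(4.1, 5.0, 4.6, 5.3, 4.8,
        6.9, 7.4, 7.1, 6.5, 7.8,
        4.9, 5.2, 4.4, 5.6, 5.0)
)

# One-sample t-test on the values of a single group
groupA <- t.test(toy$y[toy$group == "a"])

groupA$estimate   # the sample mean
groupA$conf.int   # lower and upper confidence limits (95% by default)
```

The confidence level is stored as an attribute of conf.int, and it can be changed with the conf.level argument of t.test().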

Question 7: Compare the four sets of confidence intervals on mean sigma you just calculated. Based on whether these confidence intervals overlap, which pairs of species can be distinguished based on sigma? I recommend sketching these to visualize how they compare. State your answer succinctly, similar to how you stated the answer to the Tukey HSD question.

Note that using confidence intervals like this is a conservative approach to testing for differences, that is, it is biased against finding a difference. Conservative tests are called that because their bias is against rejecting the null. If the confidence intervals do not overlap, then you can be confident that the values are distinguishable, and a t-test would give you a significant p-value. Similarly, if the confidence intervals overlap and the estimates fall within the other confidence intervals, then that almost always means that the values are not significantly different (again, via a p-value approach). However, the intermediate case is ambiguous: when confidence intervals overlap, but the estimates for individual species fall outside the confidence intervals of other species. In this case, it is not clear whether a t-test would show a significant difference, and if you really wanted to know, you would have to run that t-test. For this problem, do not consider these nuances. Instead, only ask whether the confidence intervals do not overlap: if they do not overlap, you have strong evidence that those species are distinguishable.
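The three cases described above can be made concrete with two small helper functions. Both helpers and all of the numbers below are hypothetical, written only to illustrate the logic; each interval is a vector c(lower, upper).

```r
# Hypothetical helper: TRUE when two intervals share at least one value
intervalsOverlap <- function(ci1, ci2) {
  ci1[1] <= ci2[2] && ci2[1] <= ci1[2]
}

# Hypothetical helper: TRUE when an estimate lies inside an interval
estimateInside <- function(estimate, ci) {
  estimate >= ci[1] && estimate <= ci[2]
}

# Case 1: no overlap, so the two means are distinguishable
intervalsOverlap(c(4.2, 5.3), c(6.5, 7.8))

# Case 2: overlap, and each estimate lies within the other interval,
# so the means are almost certainly not significantly different
intervalsOverlap(c(4.2, 5.3), c(4.4, 5.6))
estimateInside(4.75, c(4.4, 5.6))
estimateInside(5.00, c(4.2, 5.3))

# Case 3 (ambiguous): overlap, but an estimate falls outside the other interval
intervalsOverlap(c(4.2, 5.3), c(5.1, 6.3))
estimateInside(4.75, c(5.1, 6.3))
```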

Question 8: Do these two approaches (Tukey HSD vs. confidence intervals) lead you to the same conclusion about which species are distinguishable?

Bonus (+1 to +5 points)

It is tedious to compare confidence intervals by their numerical values, and a plot would make these comparisons much easier. If you were writing a manuscript, a plot would be the best way to show the results. For a bonus, construct a plot that shows the four confidence intervals and the estimates on a single plot, with the species for each indicated. This is plot 2.

Here are a few hints to get you started. Show the estimates and confidence intervals of sigma along the x-axis. Show the species along the y-axis, using a placeholder variable (a number from 1 to 4, with each value corresponding to one species). Use the points() command to show the estimate as a solid black circle; cex helps to make this point more obvious. Use the segments() command to add the confidence intervals as solid black lines. Species names can be shown along the y-axis or just above or below each estimate. Throughout, remember not to hard-code the values for the estimates and confidence limits; access them from the objects you made in Part 5.

To give you a sense of how much code is needed, I could make this plot in 13 steps, one for plot(), and four each for points(), segments(), and text(). You might be able to do this in more or fewer steps. My version would be fine for me, but I would need more code to make it publication-ready. I also made a publication-ready version, and it required a loop. Just some ideas for you to consider.
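One possible shape for such a plot, using a loop, is sketched below on made-up data. This is not the instructor's version: the data frame, the column names (group and y), the object names, and the axis label are hypothetical, and the t-test results are kept in a list rather than in separately named objects.

```r
# Made-up data: three groups of five measurements each
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 5)),
  y = c(4.1, 5.0, 4.6, 5.3, 4.8,
        6.9, 7.4, 7.1, 6.5, 7.8,
        4.9, 5.2, 4.4, 5.6, 5.0)
)

groups <- levels(toy$group)

# One t-test per group, stored in a list in the same order as 'groups'
tests <- lapply(groups, function(g) t.test(toy$y[toy$group == g]))

# x-axis limits taken from the confidence limits, not hard-coded
xRange <- range(sapply(tests, function(tt) tt$conf.int))

# Empty plot; the y-axis uses placeholder positions 1, 2, 3, ...
plot(0, 0, type = "n", xlim = xRange, ylim = c(0.5, length(groups) + 0.5),
     xlab = "y (mm)", ylab = "", yaxt = "n")
axis(2, at = seq_along(groups), labels = groups, las = 1)

for (i in seq_along(groups)) {
  ci <- tests[[i]]$conf.int
  segments(x0 = ci[1], y0 = i, x1 = ci[2], y1 = i)      # confidence interval
  points(tests[[i]]$estimate, i, pch = 16, cex = 1.5)   # estimate
}
```

The loop replaces the repeated calls to points() and segments(), so the same code works unchanged for any number of groups.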

Part 6

Undo the command you used in Part 1 that allowed you to avoid using dollar-sign notation.

Submitting your problem set

When you run all of your commands, you should have one or two windows open, depending on whether you did the bonus problem.

Format your commands file following the standard instructions. E-mail your commands file to stratum@uga.edu. The subject of your email should be 8370 problem set 6. Do not send me the data file, as I have it already. This problem set is due Thursday, 19 October. Note that this is two days later than normal, as I will be at GSA during 14–18 October.

Data Analysis in the Geosciences

GEOL 8370