Opening files and accessing portions of data frames are essential to working in R. You cannot use R effectively unless you master these skills. I recommend working in pairs on this problem set, at least initially, if you are struggling. After the homework, try to open other data files from the website to build your speed. By the exam, you should be able to open any file quickly and in a single command, and you should be able to recognize likely problems in commands.
From a web browser, download the following files listed under Data from the 8370 website without changing their file names:
Do not modify the contents of these files in any way or change their names. Doing so will guarantee errors when I run your code on the original files. In general, you should have one data file for each project, and any culling of that data should be done in R. Having multiple copies of the same data files, or similar versions, is a recipe for trouble. Remember, there should be a single source of truth.
In the order listed above, open each of these files using only read.table() or scan(), as appropriate. As you open each file, assign the results to an object using a name of your choice. Use a simple descriptive name that makes sense, not something like spongeBobSquarePants. Use head() to view the first few lines of that object to verify that it was imported correctly. Use str() to see the structure of your object. In the fourth line of code, delete the object you just created.
Consult the help pages for these two functions if you have any questions, rather than doing a web search or asking a GPT. The exam will target your ability to use help files.
Your goal is to learn how to open common types of files quickly. Opening a file should take you less than a minute.
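The four-line pattern described above (open, view the first lines, view the structure, delete) can be sketched as follows. This is only an illustration under stated assumptions: the two small files are made up and written to temporary locations so that the sketch runs on its own, and the object names toy and values are hypothetical. The arguments that the course files need may differ, so check the help pages for each one.

```r
# Hypothetical example files, written to temporary locations (not course data)
tablePath <- tempfile(fileext = ".csv")
writeLines(c("site,depth,tally", "A,1.5,3", "B,2.0,0", "C,3.5,7"), tablePath)
vectorPath <- tempfile(fileext = ".txt")
writeLines("4.1 3.9 5.2 4.7", vectorPath)

# A table with a header row and site names in the first column: read.table()
toy <- read.table(tablePath, header = TRUE, sep = ",", row.names = 1)
head(toy)
str(toy)
rm(toy)

# A single series of numbers with no header: scan()
values <- scan(vectorPath)
head(values)
str(values)
rm(values)
```

The choice between the two functions follows the shape of the file: read.table() returns a data frame for rows and columns, and scan() returns a vector for a single run of values.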
Import the Culled Dinosaurs of the Morrison Formation data (morrisonCulled.csv) and assign it to a data frame called morrison.
Use the appropriate command to display the structure of the data set. The first three columns should have sequence (a division of the stratigraphy), longitude, and latitude and the last three report the abundance (number of specimens) of Allosaurus, Stegosaurus, and Diplodocus. Some sites have none of these genera.
For each of these problems, verify for yourself that the result is correct, but do not submit any commands or comments that I don’t specifically ask for.
Solve each problem below with a single command, except where the instructions say that you will need multiple commands. For each problem, display the results, the data that correspond to what is asked; do not assign the results of your commands to objects.
Perform the following using $ notation to access columns; do not use column numbers, column names in quotes, or attach(). Examine the output to confirm for yourself that each command works as expected.
1. Show the data for all samples (rows) where the sequence is B3.
2. Show the data for all samples that are north of 44°.
3. Show the data for all samples that contain Stegosaurus (i.e., its abundance is greater than zero).
4. Show the data for all samples east of -105°. Consult an online reference if you are unsure what a negative longitude means.
5. Show the data for all samples in which Allosaurus has an abundance of at least 3.
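The general shape of these commands can be sketched on a small made-up data frame. Everything here is an assumption for illustration: the object toy and its columns zone, elevation, and tally are hypothetical and are not part of the Morrison data.

```r
# Hypothetical data frame for illustration only
toy <- data.frame(
  zone = c("X1", "X2", "X1", "X3"),
  elevation = c(120, 340, 275, 90),
  tally = c(0, 4, 2, 0),
  row.names = c("s1", "s2", "s3", "s4"),
  stringsAsFactors = FALSE
)

# The logical test goes before the comma, so it selects rows;
# leaving the space after the comma empty keeps every column
toy[toy$zone == "X1", ]
toy[toy$elevation > 200, ]
toy[toy$tally >= 2, ]
```

Forgetting the comma is a common mistake: without it, R treats the logical vector as a column selector rather than a row selector.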
Often we need to query our data to find samples that match certain criteria. Let’s explore how to do that.
6. Extract the row names of morrison and assign them to an object called samples. In a second line of code, use the appropriate command to examine the structure of samples. In a third line of code, delete only the samples object, as duplicating your data is generally not recommended. We do this just so that you can see what rownames() returns, which will help you do the next several problems.
For each of the following, your goal is to display the names of samples that match particular criteria, not all the data for each sample. We have seen how we can get the sample names with rownames(), now we will follow that with square brackets and a logical test to extract particular samples. Your commands for the following should look something like rownames(myData)[logicalTest].
7. Show the names of sites (row names) that contain Diplodocus.
8. Show the names of sites that are in the C5 or C6 sequence. Use | to handle the logical OR. On most keyboards, this is on the backslash key, just above the Return key.
9. Show the names of sites that are north of 42° and east of -105°. Logical AND is handled with &.
10. Show names of sites from the C5 sequence that are north of 43°.
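The rownames(myData)[logicalTest] pattern, including | and &, might look like this on a made-up data frame. The object toy and its columns are hypothetical and serve only to show the form of the command.

```r
# Hypothetical data frame for illustration only
toy <- data.frame(
  zone = c("X1", "X2", "X1", "X3"),
  elevation = c(120, 340, 275, 90),
  tally = c(0, 4, 2, 0),
  row.names = c("s1", "s2", "s3", "s4"),
  stringsAsFactors = FALSE
)

# rownames() returns a character vector, so it is indexed without a comma
rownames(toy)[toy$tally > 0]

# Logical OR: either condition may be true
rownames(toy)[toy$zone == "X1" | toy$zone == "X3"]

# Logical AND: both conditions must be true
rownames(toy)[toy$elevation > 100 & toy$tally == 0]
```

Note that each side of | or & must be a complete test; writing toy$zone == "X1" | "X3" does not do what it appears to.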
As you can see, typing morrison$ is laborious. Luckily, you didn’t name the object morrisonDinosaurFaunaUnitedStates, or you would have hand cramps. Run the appropriate command on morrison so that you can avoid $ notation.
11–15. Do problems 1–5 without $ notation or row numbers.
16–19. Do problems 7–10 without $ notation or row numbers.
20. Undo the command that you used before problem #11 that allowed you to avoid $ notation. We will start using $ notation again.
When you are done, you should have a sense of two things. First, you want object names to be descriptive and obvious, but not too long. Second, attach() can save you much typing.
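A minimal sketch of the attach() and detach() pair, again on a made-up data frame (toy and its column names are hypothetical). It assumes no other objects named zone, elevation, or tally exist in your workspace, since those would take precedence over the attached columns.

```r
# Hypothetical data frame for illustration only
toy <- data.frame(
  zone = c("X1", "X2", "X1", "X3"),
  elevation = c(120, 340, 275, 90),
  tally = c(0, 4, 2, 0),
  row.names = c("s1", "s2", "s3", "s4"),
  stringsAsFactors = FALSE
)

# After attach(), columns can be named directly
attach(toy)
toy[zone == "X1", ]
rownames(toy)[elevation > 200]

# Always pair attach() with detach() when you are finished
detach(toy)
```

The attached columns are a copy taken at the time of the attach() call, so later changes to toy are not reflected in them; this is one reason to detach promptly.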
We will frequently want to analyze a particular variable from a data frame, so extracting one is an essential skill.
21. Using $ notation, show the Allosaurus column.
22. Using $ notation, show the latitude column.
23. Using $ notation, show the sequence column.
24. In one line of code that uses $ notation, show the unique values in the sequence column. Conveniently, there is a command called unique().
25. The sequences in the previous problem are listed in an unhelpful order. Wrap that command with the sort() command to put the sequences in alphabetical and numeric order.
26–28. Do questions 21–23 with column-number notation instead of $ notation.
29. Show columns 4 through 6.
30. Show columns 4 and 6 (but not 5).
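The ways of extracting columns used above can be compared side by side on a made-up data frame. The object toy and its columns are hypothetical; only the notation carries over.

```r
# Hypothetical data frame for illustration only
toy <- data.frame(
  zone = c("X3", "X1", "X2", "X1"),
  elevation = c(90, 120, 340, 275),
  tally = c(0, 0, 4, 2),
  row.names = c("s1", "s2", "s3", "s4"),
  stringsAsFactors = FALSE
)

# One column by name and the same column by number
toy$zone
toy[, 1]

# A run of adjacent columns, and a set of non-adjacent columns
toy[, 2:3]
toy[, c(1, 3)]

# Distinct values, then the same values in sorted order
unique(toy$zone)
sort(unique(toy$zone))
```

A single column comes back as a vector, whereas two or more columns come back as a data frame, which str() will confirm.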
The following problems handle common cases where you want to analyze a certain matching subset of data: you want samples that meet certain criteria, but you also need only certain variables.
31. Using row and column numbers (not names, not $ symbols), show sequence, longitude, and latitude for rows 11 to 15.
32. In the same way, show those same columns for row 12 and rows 19–22.
33. Using $ notation for the column followed by row-number notation, show the sequence for rows 11–15.
34. Using $ notation and a logical test (no numbers, no names in strings), show the abundance of Allosaurus for samples from the C5 sequence.
35. In the same way, show the sequence for samples that contain Diplodocus.
36. Show the latitude and longitude of samples from the B4 sequence. Hint: specify the desired rows of the data frame with a logical test and the desired columns by number.
37. Show the abundances of all three dinosaurs at sites that are north of 40°. Your approach should be the same as #36.
38. Show the abundances of Stegosaurus and Diplodocus at sites north of 40° where Allosaurus is absent. This will be like 36 but your logical test will consist of two logical tests joined by the symbol for logical AND, and you are selecting different columns.
39. Whew! We are done, so clean up by deleting the morrison object.
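The combinations of row and column selection used in this last group can be sketched on a made-up data frame. As before, toy and its columns are hypothetical and stand in for whatever data you are working with.

```r
# Hypothetical data frame for illustration only
toy <- data.frame(
  zone = c("X1", "X2", "X1", "X3"),
  elevation = c(120, 340, 275, 90),
  tally = c(0, 4, 2, 0),
  row.names = c("s1", "s2", "s3", "s4"),
  stringsAsFactors = FALSE
)

# Rows and columns both by number
toy[2:3, 1:2]
toy[c(1, 3:4), 1:2]

# A column by $ is a vector, so it takes a single index with no comma
toy$zone[2:3]
toy$tally[toy$zone == "X1"]

# Rows by a compound logical test, columns by number
toy[toy$elevation > 100 & toy$tally == 0, 2:3]
```

The rule of thumb is that a data frame takes two indices separated by a comma (rows, then columns), while a vector extracted with $ takes only one.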
Carefully follow these instructions for formatting your commands file.
Put a comment like "# Part 1" before each of the two main sections of your code.
For Part 1, you should have ten blocks of code, each with four consecutive lines of code (opening the file, viewing the first few lines, viewing the structure, and deleting the object). Each block should be followed by one blank line to separate it from the following block.
For Part 2, each problem should consist of the code, followed by a # sign and the problem number. Each problem should be separated by a blank line. This format should look like this:
someCommand(...) # 1

anotherCommand(...) # 2

aCommand(...) # 3

anotherCommand(...)
aCommand(...) # 4
E-mail your commands file to stratum@uga.edu. The subject of your email should be 8370 problem set 3. Do not email the data files, as I have them already. This problem set is due 9 September.