Problem Set 3: Opening Files and Accessing Data Frames

Part 1: Opening files

Opening files and accessing portions of data frames are essential to working in R. You can’t use R effectively unless you master these skills. If you are struggling, I recommend working in pairs on this problem set, at least initially. After the homework, try to open other data files from Crawley or the website to build your speed. By the exam, you should be able to open any file quickly and in a single command.

From a web browser, download the following 12 files listed under Data from the 8370 website without changing their file names (note that the names listed below are not the file names):

Brines
Brines 2
Cadmium
Cadmium 2
Cadmium 3
Gasoline
Geophysics
Kentucky counts
Ozone
Paleosols
Temperature sensors
Trilobites 3

Do not modify the contents of these files in any way or change their names. Doing so will guarantee errors when I run your code on the original files.

In the order listed above, open each of these files using only read.table() or scan(), as appropriate. As you open each file, assign the results to an object using a name of your choice. Use a simple descriptive name that makes sense, not something like spongeBob. Use head() to view the first few lines of that object to verify that it was imported correctly. Use str() to see the structure of your object. In the fourth line of code, delete the object you just created.

For data frames, pay attention to whether there is a header line (a first row that lists variable names), the type of separator (usually a comma, semicolon, space, or tab), and whether there are row names (usually in the first column). Row names will be unique identifiers for each sample, typically a sample name or sample number. Set the arguments to read.table() as appropriate. A file may have initial lines that should not be read, such as an explanation of the contents. If you encounter these, check the help page for read.table() for how to skip these lines.
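
As a minimal sketch of how these arguments fit together, the example below writes a small made-up file and reads it back. The file contents, the object name water, and the column names are all hypothetical; none of them correspond to the course data files.

```r
# Hypothetical file: two explanatory lines, then a header row, then
# comma-separated values with sample names in the first column.
demoFile <- tempfile(fileext = ".csv")
writeLines(c(
  "Example measurements (made-up values)",
  "Units: mg/L",
  "sample,calcium,sodium",
  "A1,12.5,3.1",
  "A2,10.2,4.7",
  "A3,11.8,2.9"
), demoFile)

# skip = 2 ignores the two explanatory lines,
# header = TRUE reads the variable names from the next line,
# sep = "," sets the separator,
# row.names = 1 uses the first column as the row names.
water <- read.table(demoFile, skip = 2, header = TRUE, sep = ",", row.names = 1)
head(water)
str(water)

# Remove the temporary demonstration file.
unlink(demoFile)
```

In this problem set, a fourth line such as rm(water) would follow to delete the object. Which arguments a real file needs depends on what you see when you inspect it.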

For vectors, watch for cases in which the variable name is listed, which you should skip when reading. Also watch for cases in which the values are not on separate lines but are separated by some other character (usually a comma, semicolon, space, or tab); for these, you will need to specify the separator. Check the help page for scan() for details.
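
The same idea for a vector can be sketched as follows, again with a made-up file. The variable name thickness and the values are hypothetical.

```r
# Hypothetical file: a variable name on the first line, then values
# separated by commas on a single line.
demoVector <- tempfile(fileext = ".txt")
writeLines(c("thickness", "2.1,3.4,1.8,2.9"), demoVector)

# skip = 1 ignores the line holding the variable name,
# sep = "," tells scan() that the values are comma-separated.
thickness <- scan(demoVector, skip = 1, sep = ",")
head(thickness)
str(thickness)

# Remove the temporary demonstration file.
unlink(demoVector)
```

By default, scan() expects numeric values; the help page describes how to read other types.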

Your goal is to learn how to open common types of files quickly. Opening a file should take you less than a minute. You will also learn how to use the help pages for guidance on handling files with unusual features, such as lines that need to be skipped.

Part 2: Accessing data

Using read.table(), open Crawley’s worms dataset you used in problem set 2 and assign it to a data frame called worms. Unlike last week, this week we will specify that it does have row names (which Crawley should have done). If you do not specify that the data frame has row names, many of your answers to the following will be incorrect.

Solve each problem below with a single command, except for #3, which will require two commands. Do not assign the results to a new object; simply display the results. You will extract data from worms for each of these, so what you display in most cases (except 25–27, 31) will be a data frame. For all but problems 25–27 and 31, your command should be in the form of worms[rows, columns], where you substitute values for rows and columns as necessary. Your goal is to display the data corresponding to particular conditions, not the row or column indices matching those conditions. Use column names where I specify their name; use column numbers where I specify their number.

Particular rows

Perform the following using $ notation. Do not use attach().

1. Show all rows where the vegetation is Grassland.

2. Show all rows where the slope is 0.0.

3. Show the row named Oak.Mead. You will do this in two ways, in two lines of code. First, use a logical test that combines the rownames() command with the name of the row you are seeking. Second, use the row name "Oak.Mead" in row-column notation, without a logical test. In both cases, do not embed the row number as a magic number.

4. Show all rows where the area is greater than 3.

5. Show all rows that are damp.

6. Show all rows with a worm density greater than or equal to 3.

7. Show all rows where the slope is not equal to 0. Be sure to test for nonequality instead of positive values.

8. Show all rows where the vegetation is Grassland or Meadow. Hint: use | to handle the logical or. This is on the backslash key, just above the Return key on most keyboards.

9. Show all rows where the area is greater than or equal to 2 and the slope is greater than or equal to 3. Hint: use & for the logical and.

10. Show all rows where the vegetation is Grassland, and the soil is not damp.
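
The kinds of logical tests used in these problems can be sketched on a toy data frame. The data frame wells, its columns (depth, lithology, wet), and its row names are invented for illustration and are not part of the worms data.

```r
# Toy data frame with row names (all values are made up).
wells <- data.frame(
  depth     = c(12, 30, 0, 45),
  lithology = c("Shale", "Sandstone", "Shale", "Limestone"),
  wet       = c(TRUE, FALSE, FALSE, TRUE),
  row.names = c("North", "South", "East", "West")
)

# Rows matching one value.
wells[wells$lithology == "Shale", ]

# Rows where a value is not equal to zero.
wells[wells$depth != 0, ]

# Rows matching either of two values (logical or).
wells[wells$lithology == "Shale" | wells$lithology == "Limestone", ]

# Rows meeting two conditions at once (logical and), one of them negated.
wells[wells$depth >= 10 & !wells$wet, ]

# One row, found by a logical test on the row names.
wells[rownames(wells) == "East", ]

# The same row, using its name directly in row-column notation.
wells["East", ]
```

Leaving the column position empty after the comma returns every column, so each result is a data frame.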

For problems 11–20, repeat problems 1–10 after running the attach() command on worms. Do not use $ notation and do not use row numbers. When you have finished problems 11–20, detach the worms data frame before continuing to question 21.
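
A minimal sketch of the attach() and detach() pattern, using the same kind of invented toy data frame (wells and its columns are hypothetical):

```r
# Toy data frame (all values are made up).
wells <- data.frame(
  depth     = c(12, 30, 0, 45),
  lithology = c("Shale", "Sandstone", "Shale", "Limestone"),
  wet       = c(TRUE, FALSE, FALSE, TRUE),
  row.names = c("North", "South", "East", "West")
)

# After attach(), column names can be used without the wells$ prefix.
attach(wells)
wells[depth > 20, ]
wells[lithology == "Shale" & !wet, ]

# Detach when finished so the column names no longer sit on the search path.
detach(wells)
```

Forgetting detach() leaves the columns on the search path, where they can silently shadow or be shadowed by other objects of the same name.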

Use row-number notation for the next four questions (21–24). Do not use $ signs and do not use attach(). Use c() only if necessary.

21. Show rows 11 to 20.

22. Show row 3 and rows 7–9.

23. Show all rows except for the first one. Your answer should not presume knowledge of how many rows there are in the data frame. Hint: The answer is quite simple and involves a minus sign.

24. Show all rows except 8–12, with the same restrictions as problem 23.
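
Row-number notation, including negative indices, can be sketched on a toy data frame. The data frame wells and its contents are invented for illustration.

```r
# Toy data frame (all values are made up).
wells <- data.frame(
  depth     = c(12, 30, 0, 45),
  lithology = c("Shale", "Sandstone", "Shale", "Limestone"),
  wet       = c(TRUE, FALSE, FALSE, TRUE),
  row.names = c("North", "South", "East", "West")
)

# A consecutive range of rows needs only the colon operator.
wells[2:3, ]

# A non-consecutive set of rows needs c() to combine the pieces.
wells[c(1, 3:4), ]

# A negative index drops a row without knowing how many rows exist.
wells[-1, ]

# Parentheses are needed to negate a whole range: -(2:3) drops rows 2 and 3,
# whereas -2:3 would be the sequence -2, -1, 0, 1, 2, 3.
wells[-(2:3), ]
```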

Particular columns

Do questions 25–27 with $ notation. Do not use attach().

25. Show the Vegetation column.

26. Show the Area column.

27. Show the Damp column.

Do questions 28–30 with column-number notation, and question 31 without column numbers. Do not use $ signs or attach().

28. Show columns 3 through 5.

29. Show columns 3 and 5 (but not 4).

30. Show all columns except column 4. Your answer should not presume to know how many columns there are in this data frame. Hint: You solved a similar problem when accessing rows.

31. Show all the row names.
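
The column-access forms used in this section can be sketched on a toy data frame. The data frame wells and its columns are invented for illustration.

```r
# Toy data frame (all values are made up).
wells <- data.frame(
  depth     = c(12, 30, 0, 45),
  lithology = c("Shale", "Sandstone", "Shale", "Limestone"),
  wet       = c(TRUE, FALSE, FALSE, TRUE),
  row.names = c("North", "South", "East", "West")
)

# $ notation returns a single column as a vector, not a data frame.
wells$depth

# Column numbers go after the comma; an empty row position keeps every row.
wells[, 1:2]
wells[, c(1, 3)]

# A negative column number drops that column.
wells[, -2]

# Row names are retrieved with rownames() rather than by indexing.
rownames(wells)
```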

Combinations of specified rows and columns

For questions 32–36, use logical operations to find particular rows; do not specify hard-coded row numbers. Do not use attach().

32. Show columns 1 and 3 for the rows for which the slope is greater than 9.

33. Show columns 2 and 6 for the cases where the vegetation is Scrub.

34. Show columns 4 and 5 for cases where the area is greater than 3, and the ground is not damp.

35. Show columns 2 and 3 for which the vegetation is not Arable.

36. Show all columns except column 1 for which the vegetation is not Grassland.
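
Combining a logical test for rows with column numbers can be sketched as follows, again on an invented toy data frame (wells and its columns are hypothetical):

```r
# Toy data frame (all values are made up).
wells <- data.frame(
  depth     = c(12, 30, 0, 45),
  lithology = c("Shale", "Sandstone", "Shale", "Limestone"),
  wet       = c(TRUE, FALSE, FALSE, TRUE),
  row.names = c("North", "South", "East", "West")
)

# The logical test selects the rows; the column numbers select the columns.
wells[wells$depth > 20, c(1, 3)]

# A negative column number combines with a logical row test in the same way.
wells[wells$lithology != "Shale", -1]
```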

Delete the worms object.

Submitting your problem set

Carefully follow these instructions for formatting your commands file.

Put a comment like "# Part 1" before each of the two main sections of your code.

For Part 1, you should have 12 blocks of code, each with 4 consecutive lines of code (opening the file, viewing the first few lines, viewing the structure, and deleting the object). Each block should be followed by one blank line to separate it from the following block. Do not include a comment identifying each block.

For Part 2, each problem should consist of a blank line, followed by a comment with the problem number (e.g., “# 1”, “# 2”, etc.), followed by a line with the R command.

Email your commands file to stratum@uga.edu. The subject of your email should be 8370 problem set 3. Do not email the data files, as I have them already. This problem set is due 14 September.

Data Analysis in the Geosciences

GEOL 8370