Important note for RStudio users
RStudio currently has a bug/limitation that it will not allow you to open multiple plot windows, which is required in this problem set. If you use RStudio, you have two options:
Generate 10000 random numbers from an exponential distribution with a rate of 2.1, and assign them to an object with a name of your choosing. Hint: the tutorial describes how to make random numbers from a normal distribution, and the help pages often have a section called “See Also”, which provides links to related functions.
Plot a frequency distribution (histogram) of these data. Suggest 100 breaks, color the bars green, rotate the y-axis labels, and do not show a main title.
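As a minimal sketch of the histogram arguments mentioned above, the following uses normally distributed numbers rather than the exponential ones this problem asks for. The object name x, the sample size, the number of breaks, and the color are placeholders chosen only for illustration.

```r
# Illustration only: normal random numbers, not the exponential ones required above
x <- rnorm(1000, mean = 10, sd = 2)

# breaks is treated by hist() as a suggestion, not an exact count
# las = 1 draws the axis labels horizontally
# main = "" suppresses the main title
h <- hist(x, breaks = 50, col = "steelblue", las = 1, main = "", xlab = "Value")
```

The help page for hist() describes these arguments, and the help page for par() describes las.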
For the next three plots, we will use our textbook by Michael J. Crawley, “Statistics: An Introduction using R”, which I will just call Crawley from here on. Read Crawley’s Chapters 1, 2, and the Appendix (Essentials of the R Language).
Download the worms data set (the URL for the book website is in the preface), and leave the file name unchanged (that is, it should stay as worms.csv). Open the file in your text editor, something you should always do, but do not edit it in any way. In your text editor, you can see that the first column (Field.Name) contains sample names, so we will treat them as row names. Read the data into R using read.table() and assign it to an object named worms. Display the data frame; it should look similar to what is shown on the bottom of page 25 in Crawley (although note that he did not treat that first column as row names).
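As a sketch of how read.table() can treat a first column as row names, the following writes a small made-up comma-separated file to a temporary location and reads it back. The file contents and the names demoFile and demo are hypothetical and are not the worms data.

```r
# Hypothetical example file, written to a temporary location for illustration
demoFile <- tempfile(fileext = ".csv")
writeLines(c("Site,Depth,Wet",
             "North,2.5,TRUE",
             "South,4.1,FALSE"), demoFile)

# header = TRUE: the first line holds the column names
# sep = ",": values are separated by commas
# row.names = 1: use the first column as row names rather than as data
demo <- read.table(demoFile, header = TRUE, sep = ",", row.names = 1)
demo
```

Because the first column became the row names, the data frame has only two data columns, and column 1 is Depth.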
Let’s practice accessing particular rows and columns.
Using bracket notation, display columns 1 and 4 for all rows. Hint: use c() to specify the columns. If you’ve imported the data correctly, this will be the Area and Soil.pH columns.
In one command, display columns 1 and 2 for the samples for which the Damp field is TRUE. Because 1 and 2 are adjacent numbers, you could use a colon instead of c(), but either will work fine. When you have multiple options in R, the simplest way is often the clearest and easiest to understand.
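As a sketch of bracket notation, the following uses mtcars, a data frame that ships with R, rather than the worms data. The choice of columns and the am == 1 condition are only for illustration.

```r
# mtcars is a built-in data frame, used here only for illustration

# All rows (nothing before the comma), columns 1 and 4 (mpg and hp)
mtcars[ , c(1, 4)]

# Columns 1 and 2, only for the rows where a logical condition is TRUE
# (here, cars with a manual transmission, coded as am == 1)
mtcars[mtcars$am == 1, 1:2]
```

When a column already holds TRUE and FALSE values, the column itself can serve as the row selector, with no comparison needed.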
Make a new plot window, using a command that will work on all operating systems.
In that window, make a scatterplot of worm density versus slope using a single line of code, as follows:
• The plot will be viewed by a human, not a code compiler, so use normal language for the x and y axes, which are usually not the names of your R objects.
• To make them easier to read, rotate the y-axis values so that they are horizontal.
• Filled symbols are much easier to see, so we will use small solid circles for the data points, which is plotting character (symbol) 16. Symbol 19 is slightly larger, and symbol 20 is slightly smaller, but symbol 16 is a good starting point that works in most cases. Make these symbols black.
• Remember that “A vs. B” means that A is the dependent variable and is shown on the y axis, and that B is the independent variable and is displayed on the x axis; don’t reverse these.
• There is no reason here to make new objects for the two variables, or to use attach(), so just use $ notation to get the vectors you need by name.
• What is being shown on the plot is obvious, so do not add a main title to the plot.
We will follow these same conventions on every plot: normal-language labels, rotated y-axis labels, solid circles for plot symbols, and adding a title only when it conveys information not on the plot. Also, as we write code, we will avoid making objects unless they simplify the code, make it more readable, or reduce redundant work.
Next, add a regression line to the plot in one line of code, making sure that you get the dependent and independent variables correct. Hint: see the tutorial for how to add a regression line to a plot. The line should be red and dashed (see lty on the help page for par). Verify that the line makes sense for the data; if it does not, you likely have the dependent and independent variables mixed.
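The plotting conventions and the regression line described above can be sketched with cars, a data set built into R, rather than the worms data. The axis labels, line color, and line type here are illustrative choices, not the ones this problem requires.

```r
# cars is a built-in data set: speed (mph) and stopping distance (ft)
dev.new()   # opens a new plot window on any operating system

# "Stopping distance vs. speed": distance goes on the y axis, speed on the x axis
plot(cars$speed, cars$dist, xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     las = 1, pch = 16, col = "black")

# lm() takes dependent ~ independent; abline() draws the fitted line
abline(lm(cars$dist ~ cars$speed), col = "blue", lty = "dotted")
```

Note that plot() takes x first and then y, whereas the formula in lm() names the y variable first.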
Once you have made your plot, create a 7"x7" pdf file with the pdf() function, and recreate your plot. Save this plot as xxxxScatter.pdf, where xxxx is your last name, lowercase (e.g., hollandScatter.pdf).
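As a sketch of the pdf() workflow, the following saves a plot of the built-in cars data to a temporary file. The file name demoScatter.pdf and the object outFile are hypothetical; your own file should follow the naming rule given above.

```r
# Hypothetical output path; a temporary directory is used here for illustration
outFile <- file.path(tempdir(), "demoScatter.pdf")

pdf(file = outFile, width = 7, height = 7)   # width and height are in inches
plot(cars$speed, cars$dist, xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     las = 1, pch = 16)
dev.off()   # closes the file; the pdf is not complete until this runs
```

Everything drawn between pdf() and dev.off() goes to the file rather than to a plot window.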
Make a new plot window, again plotting worm density versus slope, following the conventions described above, but do not plot any points.
This time we will use colored symbols based on the vegetation type rather than black symbols, and we will add each set of colored points individually. The points() function is one good way to do this; there are others, but use points() here so that you know how to do this. Set the cex argument to 1.2; this will make the points 1.2x bigger, making it easier to distinguish the colors. Use the following named colors:
Arable: darkgoldenrod2
Grassland: darkcyan
Meadow: darkgoldenrod4
Orchard: darkolivegreen3
Scrub: darkolivegreen4
Add these points using logical operations, not with row numbers. You are going to want to find the rows that match particular vegetation types, so you will need to use logical operations like worms$Vegetation=="Orchard". You will then use those to select the appropriate rows for each set of x and y values, like worms$Slope[worms$Vegetation=="Orchard"]. You should be able to make this plot and add all the symbols in six lines of code: one line for the base plot and five lines to handle the five vegetation types.
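The same pattern of an empty base plot followed by one points() call per group can be sketched with iris, a data set built into R. The species, axis labels, and colors below are illustrative and are not the ones listed for the worms data.

```r
dev.new()

# type = "n" sets up the axes from the full data set but draws no points
plot(iris$Petal.Length, iris$Petal.Width, type = "n", las = 1,
     xlab = "Petal length (cm)", ylab = "Petal width (cm)")

# One points() call per group; rows are selected with a logical test
points(iris$Petal.Length[iris$Species == "setosa"],
       iris$Petal.Width[iris$Species == "setosa"], pch = 16, cex = 1.2, col = "tomato")
points(iris$Petal.Length[iris$Species == "versicolor"],
       iris$Petal.Width[iris$Species == "versicolor"], pch = 16, cex = 1.2, col = "steelblue")
points(iris$Petal.Length[iris$Species == "virginica"],
       iris$Petal.Width[iris$Species == "virginica"], pch = 16, cex = 1.2, col = "purple")
```

Each points() call here is wrapped over two lines only for readability; each is still a single command.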
Much of this code is repetitive, making typing errors likely. Remember that your up-arrow key, followed by selective editing of that line of code, can save you a lot of time and cut down on errors.
Bonus +2: Use legend() to add a key in one line of code. Be sure not to obscure any data points, which may take some experimentation.
Make a new plotting window, specifying the dimensions to be 4" wide and 7" tall. Use the mfrow argument for par() to create six plotting areas in this window, with the plots in two columns and three rows. You will fill those plotting areas with six plots of worm density versus pH, in this order:
• all of the data, in black
• only arable data, in darkgoldenrod2
• only grassland data, in darkcyan
• only meadow data, in darkgoldenrod4
• only orchard data, in darkolivegreen3
• only scrub data, in darkolivegreen4
Since these are small multiples, we need to consider three things:
1) When you have small multiples, the axis labels on the plots should be identical.
2) Because small multiples plot the same things, labels are needed to indicate what is shown on each plot. This is a good situation for adding main labels to each plot; use appropriate terms for each (All Types, Grassland, etc.).
3) Small multiples should have axes that span a consistent range, usually a range that spans the entire data set. To do this, we will need to supply arguments for xlim and ylim. The range() function is the simplest way to get the minimum and maximum value for each axis. Rather than make these same calls to range() for each plot, it is simpler to calculate these for each axis once, saving those values in an object, then using that object for setting xlim and ylim.
You should be able to create the entire set of plots in ten lines of code or fewer.
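The three considerations for small multiples can be sketched with the built-in iris data, using four panels rather than six. The window size, the object names xRange and yRange, the titles, and the colors are illustrative assumptions.

```r
dev.new(width = 5, height = 5)   # window size in inches
par(mfrow = c(2, 2))             # two rows, two columns, filled row by row

# Compute the full-data ranges once and reuse them in every panel
xRange <- range(iris$Petal.Length)
yRange <- range(iris$Petal.Width)

plot(iris$Petal.Length, iris$Petal.Width, main = "All Species",
     xlab = "Petal length (cm)", ylab = "Petal width (cm)",
     xlim = xRange, ylim = yRange, las = 1, pch = 16, col = "black")
plot(iris$Petal.Length[iris$Species == "setosa"], iris$Petal.Width[iris$Species == "setosa"],
     main = "Setosa", xlab = "Petal length (cm)", ylab = "Petal width (cm)",
     xlim = xRange, ylim = yRange, las = 1, pch = 16, col = "tomato")
plot(iris$Petal.Length[iris$Species == "versicolor"], iris$Petal.Width[iris$Species == "versicolor"],
     main = "Versicolor", xlab = "Petal length (cm)", ylab = "Petal width (cm)",
     xlim = xRange, ylim = yRange, las = 1, pch = 16, col = "steelblue")
plot(iris$Petal.Length[iris$Species == "virginica"], iris$Petal.Width[iris$Species == "virginica"],
     main = "Virginica", xlab = "Petal length (cm)", ylab = "Petal width (cm)",
     xlim = xRange, ylim = yRange, las = 1, pch = 16, col = "purple")
```

Saving the two ranges as objects is a case where making an object reduces redundant work, because the same values are used in every panel.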
Make sure that when your code runs, all four plots are open and in separate windows.
Format your commands file following the standard instructions. E-mail your commands file to stratum@uga.edu, following the standard instructions. The subject of your email should be 8370 problem set 2. This problem set is due 7 September at 2:00 PM.
Do not email the data file or the pdf file to me. I already have the data file, and your code will generate the pdf file.