Avoid testing a logical vector against TRUE, because the result of the test is identical to the original Boolean vector. For example, if isOldGrowth is a logical vector (that is, contains values of TRUE or FALSE), isOldGrowth and isOldGrowth==TRUE are exactly the same, so there is no reason to test a logical vector against TRUE. For the same reason, forests[isOldGrowth, ] and forests[isOldGrowth==TRUE, ] do exactly the same thing, so use the simpler code that omits the unnecessary logical test against TRUE.
Rather than test against FALSE, use the negation operator (!). Negation works by turning all FALSE values to TRUE, and vice versa. For example, forests[!isOldGrowth, ] and forests[isOldGrowth==FALSE, ] produce exactly the same result, so opt for the simpler code that uses the negation operator (!) rather than the one with logical test against FALSE.
Finally, TRUE and FALSE are Boolean (logical) values; they are not strings and should not be wrapped in quotes. Because TRUE and FALSE are easier to debug than T and F, write out the words rather than using just one letter.
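The three points above can be checked directly at the console. This is a minimal sketch that assumes a small made-up isOldGrowth vector; the values are hypothetical and only for illustration.

```r
# Hypothetical data: TRUE marks an old-growth stand
isOldGrowth <- c(TRUE, FALSE, TRUE, TRUE)

# Testing against TRUE returns the original vector unchanged
sameAsOriginal <- identical(isOldGrowth, isOldGrowth==TRUE)

# Negation gives the same result as testing against FALSE
sameAsNegation <- identical(!isOldGrowth, isOldGrowth==FALSE)

# T is an ordinary object that can be overwritten; TRUE is a reserved word
T <- 0
shadowed <- identical(T, TRUE)   # FALSE while T is overwritten
rm(T)                            # remove our copy so T means TRUE again
```

Because T and F can be silently reassigned like this, code that spells out TRUE and FALSE is easier to trust when debugging.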
When writing code, avoid doing the same operation more than once, particularly when the operation is computationally expensive. Save that operation as a named object, then use that object wherever you need it. Scour your code for anything repeated and restructure it so that you don’t repeat yourself. For example, suppose you are extracting data from a data frame based on some condition. Here, we are getting all the values above (greater than) a particular stratigraphic position:
plot(geochem[geochem$stratigraphicPosition>32.2, "Fe"], geochem[geochem$stratigraphicPosition>32.2, "Al"])
Our logical test appears twice, and it’s also quite long. We can extract the logical test into a named object that also conveys the purpose of the test, and use that value wherever we need it.
ohioShale <- geochem$stratigraphicPosition>32.2
plot(geochem[ohioShale, "Fe"], geochem[ohioShale, "Al"])
This accomplishes several good things at once. First, we no longer repeat the logical test. If we need to change the test in the future, we can change it once and everything will work. Second, we assigned the test to an object with a meaningful name; all the stratigraphic positions above meter 32.2 correspond to the Ohio Shale. Informative names like this explain our intent and remove the need for a comment. Third, our logical test is isolated, so it is more easily tested to make sure it works. Fourth, by using this logical test, our call to plot() becomes shorter and more easily understood. Finally, because logical tests can be complicated, it is easy to make an error in one test but not another. By extracting the test and writing that code once, it becomes less error-prone.
The DRY principle (don’t repeat yourself) also applies to the data. Suppose we wanted to add a rectangle to our plot with specified corners, and we wanted to add a text label in the middle of the rectangle. Beginners are often tempted to do something like this:
rect(xleft=2, ybottom=3, xright=7, ytop=8)
text(x=(2+7)/2, y=(3+8)/2, "safe zone")
There are three problems with this. First, it is unclear what these numbers represent. Second, the numbers are repeated in two places: they are used to specify the bounds of the rectangle, and they are used to calculate the center of the text label. Some might be tempted to avoid the repetition by calculating the arithmetic by hand and hard-coding two new constants as x=4.5, y=5.5, but that only hides where the numbers came from. Third, if the values change, you must remember to change every one. If you miss one, you will have an error, and it may be hard to detect. Named constants solve all these problems.
feMin <- 2
feMax <- 7
alMin <- 3
alMax <- 8
rect(xleft=feMin, xright=feMax, ybottom=alMin, ytop=alMax)
text(x=(feMin+feMax)/2, y=(alMin+alMax)/2, "safe zone")
Although the code is longer, it is now robust to changes, because the code can be updated in one place and the changes will be propagated everywhere that value is needed.
One important but often overlooked aspect of programming is giving objects meaningful names. This takes practice. One sound approach is to give a function a name that reflects what it produces. For example, plot() creates a plot, mean() returns the mean, and range() returns the range. These are all intuitive names. Note that these all begin with lowercase letters, a convention in R and many other languages.
If you make a function that plots Mg vs. Fe, a name like spongeBobSquarePants() might seem fun at the moment, but it will later be a problem because it doesn’t describe what the function does. The problem amplifies when you add this to your collection of other cryptically named functions achyBreakeyHeart(), itsMillerTime(), and badLifeChoices(). Instead, consider a name that indicates that it produces a plot, and include what is plotted, such as mgFePlot().
The same is true when naming constants: give them names that reflect what they are. For example, consider this code:
xleft <- 2
xright <- 7
ybottom <- 3
ytop <- 8
rect(xleft=xleft, xright=xright, ybottom=ybottom, ytop=ytop)
text(x=(xleft+xright)/2, y=(ybottom+ytop)/2, "safe zone")
What these numbers represent, and where they came from, is unclear. Give these objects names that reflect what they are, not what you intend to use them for. Although you intend to use them as coordinates, that becomes clear when you pass them to arguments that describe coordinates.
feMin <- 2
feMax <- 7
alMin <- 3
alMax <- 8
rect(xleft=feMin, xright=feMax, ybottom=alMin, ytop=alMax)
text(x=(feMin+feMax)/2, y=(alMin+alMax)/2, "safe zone")
In this example, the code is easy to read and your intent is clear. For example, you can read the last line as “Set the left of the rectangle to the minimum iron value, the right to the maximum iron value, the bottom to the minimum aluminum value”, and so on. Simply reading the code that way will make errors more obvious and your code will be more self-explanatory.
To sum up, object names generally start with lowercase letters. When you give objects meaningful names, the intent of your code will be clearer to you and anyone you share it with. Objects with meaningful names greatly help in detecting errors and in debugging.
Get in the habit of quoting only strings (words, etc.), and avoid quoting Boolean values (TRUE and FALSE) and numeric values. Quoting Boolean and numeric values sometimes produces code that works and sometimes does not; either way, the quotes are confusing and they needlessly complicate the code. Keep it simple.
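A quick way to see why quoting matters is to ask R what type each value has. This sketch uses only base R; the value 8.9 is arbitrary.

```r
# Quotes turn a value into a string, whatever it looks like
logicalClass <- class(TRUE)      # "logical"
quotedLogical <- class("TRUE")   # "character"
numericClass <- class(8.9)       # "numeric"
quotedNumeric <- class("8.9")    # "character"

# Arithmetic on a quoted number fails; catch the error so the script continues
result <- tryCatch("8.9" + 1, error=function(e) "error")
```

Here result ends up holding the string "error", because R refuses to add a number to a character string.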
Objects are best named with short descriptive names. Doing this not only makes your code easier to read, but it can also greatly help in debugging, especially as code becomes more complex. Names should be readable, so avoid tricks like removing vowels to make the name shorter: count is better than cnt. Avoid abbreviating part of a name to a single letter: instead of rGauge, riverGauge or even gauge would be better, especially if there is only one type of gauge. Try to trim unnecessary parts from names; instead of temperatureSensor, sensor would be fine, especially if you are dealing with only one type of sensor. Avoid adding unnecessary parts to the name like “data”, “vector”, “matrix”, etc.; brine is better than brineData, for example.
Remember, you have to type these, so long names take longer and create more opportunities for misspelling. Even so, there will be times that you need longer names, but generally try to keep them short but understandable to anyone.
In most cases, avoid setting function arguments to their default values when you call a function, as this needlessly complicates the function call. One case where you would include the default is when you call a function multiple times, sometimes with the default and sometimes not. If you include the default value for an argument, it makes it clear that this was intentional, rather than simply forgetting to include the argument name.
Another case where it is good practice to include the defaults is using read.table(). You want to be in the habit of always specifying the delimiter (sep) and header arguments explicitly, as well as the skip and row.names arguments when they apply. For row.names, there are cases where read.table() can infer whether row names are present, but you should always practice safe coding by setting this explicitly. Similarly, avoid using the default argument for sep, which works on any white space (space, tab, return). If you have strings that contain spaces, the default argument will cause you trouble.
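As a sketch of what an explicit read.table() call looks like, the example below writes a small made-up file to a temporary location and reads it back. The file contents and the names dataFile and rocks are hypothetical; in your own work the file would already exist.

```r
# Build a small hypothetical data file: one title line, a header, two samples
dataFile <- tempfile(fileext=".csv")
writeLines(c("Field season notes",
             "sample,locality,Fe",
             "A1,Bear Creek,2.3",
             "A2,Pine Ridge,4.1"), con=dataFile)

# State every relevant argument rather than relying on defaults or inference
rocks <- read.table(dataFile, header=TRUE, sep=",", skip=1, row.names=1)
```

Setting sep="," keeps “Bear Creek” together as one value, and row.names=1 states plainly that the first column holds sample names rather than data.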
Long lines of code can be hard to read, and one is tempted to force the line to break into several shorter lines. Although there are cases where this can help to see the structure of the code, embedding the line breaks will often cause problems. For example, you may modify the line of code later, but now the line breaks are in the wrong places, so you will want to clean it up and put in a fresh set of line breaks. If you change the code again, you will need to clean up those line breaks again. It’s just too much work; we want to aim for code that is easy to maintain.
The second thing that can happen is you might put the line return inside a string, such as for an axis label or a main label. Line returns and tabs in those labels are not easily interpreted by R, and R may put a small square as a placeholder for the symbol it does not recognize.
In general, avoid embedding line returns in long lines of code. Let your text editor handle the wrapping of those long lines.
Watch for row names in data files. These will almost always be in the first column, and they will typically be sample names or sample numbers. Sometimes, they will just be sequential numbers, but not a variable, like time, which might be listed in sequential order. They will also be unique identifiers, with no two being the same. If the first column looks like analyzable data, often with a variable name that suggests it is data, it is not a row name.
Avoid embedding magic numbers (hard-coded numeric constants) in your code. For example, if you want to plot a subset of your data, but you want the axis to span the maximum and minimum of your data, don’t find the maximum and minimum and embed those numbers in xlim, like this: xlim=c(2.3, 8.9). If the maximum and minimum change, your code will no longer work. Instead, call range() where you need it, for example:
plot(x[subset], y[subset], xlim=range(x))
If you need to use these repeatedly, save them as an object. This saves you from repeating yourself, decreases the likelihood of errors, and makes your code more self-explanatory. For example, if you wanted three plots to share the same axis, this is the safe and concise way to do it.
xlimits <- range(x)
plot(x[groupA], y[groupA], xlim=xlimits)
plot(x[groupB], y[groupB], xlim=xlimits)
plot(x[groupC], y[groupC], xlim=xlimits)
It is usually best to assign arguments to functions by name rather than by position. Calling by name is not only safer, it is self-commenting, with the only drawback being a little more typing. For example, rlnorm(n=1000, meanlog=5.1, sdlog=1.2) will make much more sense to you later than rlnorm(1000, 5.1, 1.2). There are some common exceptions, where you can leave off the name. For example, functions that use file names (read.table(), read.csv(), etc.) generally put the file name in the first position. Common calculations generally put the input data as the first parameter, such as mean(), median(), etc. In these and other similar cases where functions follow conventions, you can often skip naming the first parameter. plot() is another example, where the x and y vectors are the first two arguments.
There are some exceptions to this, particularly for commonly used functions. For example, the first argument to hist(), barplot(), and many other functions is commonly called x. In these cases, you would typically omit x= and call the argument by position. Another common example is plot(), where the x and y values are the first two arguments. Again, the convention is to call those by position rather than name. There is nothing wrong with calling by name per se; it is just unnecessarily verbose. Over time, you will become familiar with cases where you can safely call an argument by position.
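To illustrate calling by name versus by position, this short sketch draws the same lognormal values three ways. The seed and the sample size of 5 are arbitrary choices for the example.

```r
# Same call three ways; resetting the seed makes the random draws comparable
set.seed(42)
byName <- rlnorm(n=5, meanlog=5.1, sdlog=1.2)

set.seed(42)
byPosition <- rlnorm(5, 5.1, 1.2)

# Named arguments can be given in any order without changing the result
set.seed(42)
reordered <- rlnorm(sdlog=1.2, n=5, meanlog=5.1)

sameResult <- identical(byName, byPosition) && identical(byName, reordered)
```

All three calls return the same values, but only the named versions tell a reader which number is the mean and which is the standard deviation.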
Use informative labels such as for main and your axes. In particular, do not use names of objects, like granite.SiO2 or myRandomNumbers. Follow normal punctuation rules for spaces, etc. Also, do not use the main label to repeat what is on your axes. For example, if you plot pH vs. alkalinity, having a main title that says “pH vs. alkalinity” is redundant, and it should be removed.
If you need a legend for a plot, put it in an out of the way place so that it does not overlap your data. Keep the font size in the legend the same size as your axis labels, since it has the same level of importance. Text should be black in most cases. Points in the legend should match those in the plot in size, color, and shape.
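One possible way to put the labeling and legend advice into practice is sketched below. The data are simulated and the object names (alkalinity, pH, isSpring), the units, and the group labels are all hypothetical.

```r
# Simulated data for illustration only
set.seed(1)
alkalinity <- runif(20, min=50, max=200)
pH <- 6 + alkalinity/100 + rnorm(20, sd=0.2)
isSpring <- rep(c(TRUE, FALSE), times=10)

# Filled circles for springs, open circles for streams
symbols <- ifelse(isSpring, 16, 1)

# Informative axis labels with units; no main title repeating the axes
plot(alkalinity, pH, xlab="Alkalinity (mg/L)", ylab="pH", pch=symbols, las=1)

# Legend symbols match the plot; the lower right is away from these data
legend("bottomright", legend=c("spring", "stream"), pch=c(16, 1))
```

The legend position is a judgment call for each plot; here the points trend from lower left to upper right, so the lower right corner is likely to be free of data.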
Don’t display the values of vectors or data frames in your answers unless I specifically ask for them. This is particularly true for large data frames and long vectors. It is ok to show them for yourself as you work, but do not include them in the file you turn in. If I ask you to display a vector or data frame, though, you need to show them.
Don’t generate additional objects unless you need them or unless they clarify the work, such as in long function calls. For example, if you are using a logical test to retrieve a subset of a vector or data frame, your code will generally be easier to read if you save that logical test to an object with a short descriptive name, then use that object in selecting your data.
Although it is true that the compiler or interpreter doesn’t care about things like spacing, capitalization, or object names, following conventions on these will make your code easier to read and debug. For object names, you want a balance between a name that is informative but not too long. For example, in importing the Kahmann data set on paleosols, paleosols would be the clearest name to assign to the object, as it is short and unambiguous. kahmann would be another alternative, but using it would make the most sense if you were analyzing multiple paleosol data sets and wanted to keep them straight by making the author’s name the name of the object.
It would be hard to overstate the importance of good names for your objects. I have seen many cases where bugs were hard to detect because of a poorly chosen name.
Include only those steps that are necessary to generate the answers to the problems. You will need to edit your commands down to what is needed.
Don’t display the values of vectors or data frames in your answers unless I specifically ask for them. This is particularly true for large data frames and long vectors. It is ok to show them for yourself as you work — and this is an important part of learning R to convince yourself that your code is working — but do not include them in the file you turn in or files that you share with others. If I ask you to display a vector or data frame, though, you need to show them.
If a function returns a vector, don’t wrap the result in c(). Likewise, if you have just one object, do not wrap it with c(), as it is already a vector. For example, c() is unnecessary in cases like c(8:12), c(rnorm(25)), c(seq(from=2, to=20, by=2)), and c(rep(25, 2)). If you are unsure if c() is needed, delete it and see if the code works correctly.
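You can confirm that c() adds nothing in these cases by comparing the wrapped and unwrapped results with identical(). This is a small sketch using the examples from the text.

```r
# Each comparison checks that wrapping in c() changes nothing
colonSame <- identical(c(8:12), 8:12)
seqSame <- identical(c(seq(from=2, to=20, by=2)), seq(from=2, to=20, by=2))
repSame <- identical(c(rep(25, 2)), rep(25, 2))

# c() is needed when combining separate values into one vector
evenStart <- c(2, 4, 6)
```

All three comparisons are TRUE, so the shorter form without c() is the one to use.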
When accessing a subset of a data frame with a logical test, don’t wrap the logical test in a which() statement. The which() command is needed when you specifically need the index of the matching value. If you are unsure if which() is necessary, delete it and see if the code works correctly.
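The sketch below compares subsetting with and without which(), using a small made-up forests data frame; the column names and values are hypothetical.

```r
# Hypothetical data frame of forest stands
forests <- data.frame(stand=c("A", "B", "C", "D"), age=c(320, 45, 410, 80))
isOldGrowth <- forests$age>200

# The logical vector selects rows directly
direct <- forests[isOldGrowth, ]

# Wrapping it in which() selects the same rows with extra code
wrapped <- forests[which(isOldGrowth), ]
sameRows <- identical(direct, wrapped)

# which() earns its place when you need the positions themselves
oldGrowthRows <- which(isOldGrowth)
```

With no missing values in the test, both forms return the same rows, so the direct form is simpler; oldGrowthRows holds the row positions 1 and 3.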
Use parentheses in equations only when necessary. Remember and use the order of operations to simplify your code. If you are unsure if parentheses are needed, delete them and see if the code works correctly (you should be seeing a pattern here…).
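As a brief illustration of when parentheses matter, this sketch reuses the feMin and feMax values from the rectangle example; the third comparison is an added, hypothetical case.

```r
feMin <- 2
feMax <- 7

# Needed: the sum must be computed before dividing
midpoint <- (feMin+feMax)/2      # 4.5

# Without parentheses, division happens first and the answer changes
notMidpoint <- feMin+feMax/2     # 5.5

# Unnecessary: multiplication already happens before addition
sameValue <- (feMin*feMax)+1 == feMin*feMax+1
```

Only the first pair of parentheses changes the result; the pair in the last line can be deleted without affecting the answer.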
Include only the comments that I ask for. Novice coders tend to include too many comments, particularly ones that state what code does in cases where the operation is clear. In your work, good comments should indicate your intent, the why of code, not the what.
Do not include the command setwd(), even if you comment it out. You will likely use setwd() when you work, but just delete it before sharing your code.
Do not embed a path in your code, such as in calls to scan() and read.table(); it will automatically generate errors on anyone else’s computer.
We use spaces in our writing to make it more readable; good programmers do the same with their code. In particular, spaces help separate the elements of code, and a lack of spaces helps keep related elements adjacent. There are several places you should use spaces:
There are several places where you should not put a space:
Similarly, use blank lines to group related parts of your code. If several steps go together, do not separate them with blank lines. Instead, keep those lines of code together, but separate that block of code from preceding and following blocks of code by a single blank line. Multiple blank lines rarely help, and if you aren’t consistent about using them, they make your code look haphazard. One place to use multiple blank lines is if you are separating even larger-scale blocks of code (think of your code as being organized into sentences, paragraphs, chapters, and so on); just be consistent in the number of lines.
Use single quotes or use double quotes, but don’t switch between them in your code, because that makes the reader think that you are trying to convey something when you aren’t.
When writing equations, include only those parentheses that are necessary. Including too many overcomplicates your code and makes it more error-prone.
Misspelled words convey carelessness.
Similarly, check grammar and punctuation in your comments.
Use comments where necessary to identify the intent behind a block of code, or to explain a critical or confusing step. Avoid commenting every line of code, or even most lines of code. Also, avoid commenting when the purpose is obvious; for example, if you are importing data from a file, you do not need to say that in a comment. In these problem sets, include a comment signaling every labeled part of the assignment (e.g., # Part 1).
Use blank lines to separate groups of related commands. It is hard to read code that lacks blank lines, and too many blank lines is just as hard. Think of blank lines in code as the paragraph breaks in your writing; they are there to help you read. You wouldn’t make every sentence in an essay its own paragraph, so don’t do the equivalent in code by surrounding every statement with a blank line. For the problem sets, treat each numbered part (e.g., “Part 1”) in the assignment as a block of code, and put one blank line before that block of code and one blank line after it.
Following these principles, your code should look like this:
# Part 1
someCommand
anotherCommand
aComment

# Part 2
aCommand
aComment
anotherCommand

# Part 3
...
For your own work, you would use descriptive comments instead of Part 1, Part 2, etc., such as Read data sets, Cull the data, Fe vs. Mg plot, Fe vs. Mg regression analysis, etc.
The convention in R is to use <- for assignments at the beginning of a line rather than =, and we will adhere to that convention in this course. The only place you should use = for assignment is for assigning arguments in function calls. Here’s an example that illustrates both, as well as spacing around commas:
evenNumbers <- seq(from=0, to=100, by=2)