Steven Holland

Becoming a better coder

27 September

Handle Boolean values more simply

Avoid testing a logical vector against TRUE, because the result of the test is identical to the original Boolean vector. For example, logAl[carters, ] and logAl[carters==TRUE, ] give exactly the same result, so use the simpler code that omits the unnecessary logical test against TRUE.

A simpler way of testing for FALSE is to use the negation operator (!). Negation works by turning all FALSE values to TRUE, and vice versa. For example, logAl[!carters, ] and logAl[carters==FALSE, ] produce exactly the same result, so opt for the simpler code that uses the negation operator (!) rather than the logical test against FALSE.

Finally, TRUE and FALSE are Boolean (logical) values; they are not strings and should not be wrapped in quotes. Because TRUE and FALSE are easier to debug than T and F, write out the words rather than using just one letter. Of course, as we’ve just shown, you often do not need to use these values at all.
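
The points above can be checked with a short sketch. The logAl data frame and carters vector here are made-up stand-ins, not the course data, so only the comparisons matter.

```r
# Hypothetical stand-ins for the logAl data frame and the carters vector
logAl <- data.frame(depth=c(10, 20, 30, 40), Al=c(1.2, 3.4, 2.2, 0.9))
carters <- c(TRUE, FALSE, TRUE, FALSE)

# Testing against TRUE returns the vector we started with
sameAsOriginal <- identical(carters == TRUE, carters)

# Negation selects the same rows as testing against FALSE
inCarters <- logAl[carters, ]
notInCarters <- logAl[!carters, ]
sameRows <- identical(notInCarters, logAl[carters == FALSE, ])
```

Both sameAsOriginal and sameRows should be TRUE, which is why the shorter forms are preferred.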

22 September

DRY: Don’t Repeat Yourself

When writing code, avoid doing the same operation more than once, particularly when the operation is computationally expensive. Save that operation as a named object, then use that object wherever you need it. Scour your code for anything repeated and restructure it so that you don’t repeat yourself. For example, suppose you are extracting data from a data frame based on some condition. Here, we are getting all the values above (greater than) a particular stratigraphic position:

plot(geochem[geochem$stratigraphicPosition>32.2, "Fe"], geochem[geochem$stratigraphicPosition>32.2, "Al"])

Our logical test appears twice, and it’s also quite long. We can extract the logical test into a named object that also conveys the purpose of the test, and use that value wherever we need it.

ohioShale <- geochem$stratigraphicPosition>32.2
plot(geochem[ohioShale, "Fe"], geochem[ohioShale, "Al"])

This accomplishes several good things at once. First, we no longer repeat the logical test. If we change the test in the future, we need change it only once. Second, we assign the test to an object with a meaningful name; all the stratigraphic positions above meter 32.2 correspond to the Ohio Shale. Informative names like this explain our intent and remove the need for a comment. Third, by using this logical test, our call to plot() becomes shorter and more easily understood. Finally, because logical tests can be complicated, it is easy to make an error in one test but not another. By extracting the test and writing that code once, it is less error-prone, and our code is more easily testable.

DRY also applies to the data. Suppose we wanted to add a rectangle to our plot with specified corners, and we wanted to add a text label in the middle of the rectangle. Beginners are often tempted to do something like this:

rect(xleft=2, ybottom=3, xright=7, ytop=8)
text(x=(2+7)/2, y=(3+8)/2, "safe zone")

There are three problems with this. First, it is unclear what these numbers represent. Second, the numbers are repeated in two places: they are used to specify the bounds of the rectangle, and they are used to calculate the center of the text label. Some might be tempted to avoid the repetition by calculating the arithmetic by hand and hard-coding two new constants as x=4.5, y=5.5, but that only adds more unexplained numbers. Third, if the values change, you must remember to change every one. If you miss one, you will have an error, and it may be hard to detect. Named constants solve all these problems.

feMin <- 2
feMax <- 7
alMin <- 3
alMax <- 8
rect(xleft=feMin, xright=feMax, ybottom=alMin, ytop=alMax)
text(x=(feMin+feMax)/2, y=(alMin+alMax)/2, "safe zone")

Although the code is longer, it is now robust to changes, because the code can be updated in one place and the changes will be propagated everywhere that value is needed. By choosing object names wisely, our code also becomes self-commenting, and errors become easier to recognize.

Naming objects

One important but often overlooked aspect of programming is giving objects meaningful names. This takes practice. One sound approach for functions is to give a name that reflects what the function produces. For example, plot() creates a plot, mean() returns the mean, and range() returns the range. These are all intuitive names. Note that these all begin with lowercase letters, a convention in R and many other languages.

If you make a function that plots Mg vs. Fe, a name like spongeBobSquarePants() might seem fun at the moment, but it will later be a problem because it doesn’t describe what the function does. The problem amplifies when you add this to your collection of other cryptically named functions achyBreakeyHeart(), ginAndTonic(), and myPersonalMisery(). Instead, consider a name that indicates that it produces a plot, and add what is plotted, such as mgFePlot().

The same is true when naming constants: give them names that reflect what they are. For example, consider this code:

xleft <- 2
xright <- 7
ybottom <- 3
ytop <- 8
rect(xleft=xleft, xright=xright, ybottom=ybottom, ytop=ytop)
text(x=(xleft+xright)/2, y=(ybottom+ytop)/2, "safe zone")

What these numbers represent, and where they came from, is unclear. Give these objects names that reflect what they are, not what you intend to use them for. You do intend to use them as coordinates, but that becomes clear when you pass them as arguments that describe coordinates. You can read the code below as “Set the left of the rectangle to the minimum iron value, the right to the maximum iron value, the bottom to the minimum aluminum value,” and so on. Simply reading the code that way will make errors more obvious, and your code will be more self-explanatory.

feMin <- 2
feMax <- 7
alMin <- 3
alMax <- 8
rect(xleft=feMin, xright=feMax, ybottom=alMin, ytop=alMax)
text(x=(feMin+feMax)/2, y=(alMin+alMax)/2, "safe zone")

To sum up, objects generally start with lowercase letters. If you give objects meaningful names, it will be more obvious to you and anyone you share your code with what your intent is. Objects with meaningful names can also help you in debugging your code.

Avoid embedding returns in long lines of code

Long lines of code can be hard to read, and one is tempted to force the line to break into several shorter lines. Although there are cases where this can help to see the structure of the code, embedding the line breaks will often cause problems. For example, you may modify the line of code later, but now the line breaks are in the wrong places, so you will want to clean it up and put in a fresh set of line breaks. If you change the code again, you will need to clean up those line breaks again. It’s just too much work; we want to aim for code that is easy to maintain.

The second thing that can happen is you might put the line return inside a string, such as for an axis label or a main label. Line returns and tabs in those labels are not easily interpreted by R, and R may put a small square as a placeholder for the symbol it does not recognize.

In general, avoid embedding line returns in long lines of code. Let your text editor handle the wrapping of those long lines.

15 September

Quotes are for strings

Get in the habit of quoting only strings (words, etc.), and avoid quoting Boolean values (TRUE and FALSE) and numeric values. Although code with quoted Boolean and numeric values sometimes still works, the quotes are confusing and they needlessly complicate the code. Keep it simple.
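
A small sketch of why quoted values only work sometimes. The values are arbitrary; the point is that a quoted value is a string, and R compares strings alphabetically.

```r
# A quoted value is a string, whatever it looks like
class(TRUE)      # "logical"
class("TRUE")    # "character"
class(10)        # "numeric"
class("10")      # "character"

# Comparing a string with a number converts the number to a string,
# so the comparison is alphabetical rather than numerical
quotedComparison <- "10" > 9     # FALSE, because "10" sorts before "9"
numericComparison <- 10 > 9      # TRUE
```

The quoted version runs without any error or warning, which is what makes this kind of mistake hard to spot.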

Naming objects

Objects are best named with short descriptive names. Doing this not only makes your code easier to read, it can greatly help in debugging, especially as code becomes more complex. Names should be readable, so avoid tricks like removing vowels to make the name shorter: count is better than cnt. Similarly, avoid adding unnecessary parts to the name like “data”, “vector”, “matrix”, etc.: brine, for example, is better than brineData. Remember, you have to type these, so long names take longer and create more opportunities for misspelling. Even so, there will be times that you need longer names, but generally try to keep them short yet readable.

Default function arguments

In most cases, avoid setting function arguments to their default values when you call a function as this needlessly complicates the function call. There are uncommon cases when you would include the default, such as when you call a function multiple times, sometimes with the default and sometimes not. Including the default value in this case makes it clear that this was intentional, rather than simply forgetting to include the argument name.
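
As an illustration, here is a sketch using sort(), whose decreasing argument defaults to FALSE. The depths vector is made up for the example.

```r
depths <- c(32.2, 10.5, 47.1)   # hypothetical values

# decreasing=FALSE is the default for sort(), so naming it adds nothing
withDefault <- sort(depths, decreasing=FALSE)
withoutDefault <- sort(depths)
identical(withDefault, withoutDefault)   # TRUE

# Stating the default can help when it sits next to a non-default call
shallowFirst <- sort(depths, decreasing=FALSE)
deepFirst <- sort(depths, decreasing=TRUE)
```

In the last two lines, writing out the default makes the contrast between the two calls deliberate and easy to see.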

Debug commands by unpacking them

R code can quickly become complicated by parentheses and brackets. When those lines of code fail, it can be hard to understand why. The solution is to unpack this code, to examine everything contained within a set of brackets or parentheses to see its value. You can start at the outermost level and work inward, or start at the innermost level and work outwards, and both approaches can reveal the problem. Let’s start by looking at a line of code that works, and we’ll unpack it to see why it works.

> worms[worms$Slope>6, ]
            Area Slope Vegetation Soil.pH  Damp Worm.density
Nashs.Field  3.6    11  Grassland     4.1 FALSE            4
Garden.Wood  2.9    10      Scrub     5.2 FALSE            8
Cheapside    2.2     8      Scrub     4.7  TRUE            4
Farm.Wood    0.8    10      Scrub     5.1  TRUE            3

First, when we examine the worms$Slope object, we see that it gives us a set of numeric values. This tells us that it is valid to test whether they are larger than another number (6), so we know that the logical test ought to succeed.

> worms$Slope
[1] 11 2 3 5 0 2 3 0 0 4 10 1 2 6 0 0 8 2 1 10

If we step out one level and examine the logical test, we can see that it works. Now we know we can use it to find all the matching rows in worms.
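
The output of that logical test is not shown here, but it can be reproduced from the Slope values printed above. This is a sketch that copies those values into a stand-alone vector (slope and steep are names chosen for this example) rather than using the worms data frame itself.

```r
# Slope values copied from the worms$Slope output above
slope <- c(11, 2, 3, 5, 0, 2, 3, 0, 0, 4, 10, 1, 2, 6, 0, 0, 8, 2, 1, 10)

# The logical test: one TRUE or FALSE for every row
steep <- slope > 6

which(steep)   # positions 1, 11, 17, and 20
sum(steep)     # 4 rows pass the test
```

Four values pass the test, which agrees with the four rows returned by worms[worms$Slope>6, ].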


Now let’s look at a case where a line of code fails. The question is why.

> worms[rownames(worms$Ashurst == TRUE), ]
[1] Area Slope Vegetation Soil.pH Damp Worm.density
<0 rows> (or 0-length row.names)

To debug this, we’ll start at the outermost level and work inwards. Let’s look at the code for specifying the rows we want.

> rownames(worms$Ashurst == TRUE)
NULL

This returns NULL, when we should be expecting a vector of TRUE and FALSE values, so we know that the problem must lie inside of the rownames() call. Working inward, let’s examine that one argument to rownames().

> worms$Ashurst == TRUE
logical(0)

Again, we see a problem: a logical test ought to return a vector of TRUE and FALSE values. It is returning a logical vector, but it has a length of zero; in other words, it has no contents. That tells us that the problem must be with one side of the logical test, the side with worms$Ashurst, so let’s display that.

> worms$Ashurst
NULL

Now we recognize the problem. worms$Ashurst returns nothing (NULL), and that’s because Ashurst is not a column in the worms data set. We need to rewrite how we find the row names that match Ashurst. Rewriting it as follows gives us the correct result.

> worms[rownames(worms) == "Ashurst", ]
        Area Slope Vegetation Soil.pH  Damp Worm.density
Ashurst  2.1     0     Arable     4.8 FALSE            4

We can confirm that it works by first typing rownames(worms) to see if that successfully gives us the row names (it does). We could then see if the logical test successfully gives us a vector of TRUE and FALSE values (it does). Knowing that, we can use that logical test to retrieve the matching rows from worms. Building upwards like this is a nearly foolproof way of writing longer, more complicated lines of code.

> rownames(worms)
 [1] "Nashs.Field"       "Silwood.Bottom"    "Nursery.Field"
 [4] "Rush.Meadow"       "Gunness.Thicket"   "Oak.Mead"
 [7] "Church.Field"      "Ashurst"           "The.Orchard"
[10] "Rookery.Slope"     "Garden.Wood"       "North.Gravel"
[13] "South.Gravel"      "Observatory.Ridge" "Pond.Field"
[16] "Water.Meadow"      "Cheapside"         "Pound.Hill"
[19] "Gravel.Pit"        "Farm.Wood"
> rownames(worms) == "Ashurst"
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

8 September

Future-proof your code

Avoid embedding magic numbers (hard-coded numeric constants) in your code. For example, if you want your plot limits to be the maximum and minimum of your data, don't find the maximum and minimum and embed those numbers in xlim, like this: xlim=c(2.3, 8.9). If the maximum and minimum change, the hard-coded limits will no longer match your data. Instead, call max() and min() where you need them, for example:

plot(x, y, xlim=c(min(x), max(x)))

If you need to use these repeatedly, save them as an object. This saves you from repeating yourself, decreases the likelihood of errors, and makes your code more self-explanatory.

xlimits <- c(min(x), max(x))
plot(x, y, xlim=xlimits)
plot(x, w, xlim=xlimits)
plot(x, z, xlim=xlimits)

Practice safe coding

It is usually best to assign arguments to functions by name rather than by position. Calling by name is not only safer, it is self-commenting, with the only drawback being a little more typing. For example, rlnorm(n=1000, meanlog=5.1, sdlog=1.2) will make much more sense to you later than rlnorm(1000, 5.1, 1.2). There are some common exceptions, where you can leave off the name. For example, functions that use file names (read.table(), read.csv(), etc.) generally put the file name in the first position. Common calculations generally put the input data as the first parameter, such as mean(), median(), etc. In these and other similar cases where functions follow conventions, you can often skip naming the first parameter. plot() is another example, where the x and y vectors are the first two arguments.
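
A short sketch comparing the two calling styles for rlnorm(). The seed value of 42 is arbitrary and is used only so that the calls draw the same random numbers and can be compared.

```r
set.seed(42)
byName <- rlnorm(n=1000, meanlog=5.1, sdlog=1.2)

set.seed(42)
byPosition <- rlnorm(1000, 5.1, 1.2)

# Named arguments can be given in any order
set.seed(42)
reordered <- rlnorm(sdlog=1.2, meanlog=5.1, n=1000)

identical(byName, byPosition)   # TRUE
identical(byName, reordered)    # TRUE
```

All three calls give the same result, but only the named versions say what each number means, and only the named versions survive being reordered.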

It is best to be explicit when you use row-column notation by inserting a comma to make it clear you want all the rows. For example, if you want all rows for column 3, you should write the command as myDataFrame[ , 3], rather than myDataFrame[3]. The space before the comma is optional, but it helps to draw attention to the comma. As always, follow every comma with a space for clarity.
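
The two forms also differ in what they return, which is one more reason to be explicit. A minimal sketch, using a made-up myDataFrame:

```r
# Hypothetical data frame for illustration
myDataFrame <- data.frame(
    site=c("A", "B", "C"),
    depth=c(1.5, 2.0, 3.2),
    pH=c(4.1, 5.2, 4.7)
)

allRows <- myDataFrame[ , 3]   # a numeric vector of the pH values
noComma <- myDataFrame[3]      # a data frame with one column

is.numeric(allRows)       # TRUE
is.data.frame(noComma)    # TRUE
```

With the comma, you get the values themselves, ready to pass to functions such as mean(); without it, you get a one-column data frame.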

Improve your plots

Use informative labels for main and for your axes. In particular, do not use names of objects, like granite.SiO2 or myRandomNumbers. Also, do not use the main label to repeat what is on your axes. For example, if you plot pH vs. alkalinity, having a main title that says “pH vs. alkalinity” is redundant, and it should be removed.

If you need a legend for a plot (and we’ve talked about strategies for avoiding them), put the legend in an out-of-the-way place so that it doesn’t overlap your data. Keep the font size the same as that of your axis labels, since it has the same level of importance. Text should be black in most cases. Points in the legend should match the points on the plot in size, color, and shape.

Include only what is necessary

Don’t display the values of vectors or data frames in your answers unless I specifically ask for them. This is particularly true for large data frames and long vectors. It is ok to show them for yourself as you work, but do not include them in the file you turn in. If I ask you to display a vector or data frame, though, you need to show them.

Don’t generate additional objects unless you need them or unless they clarify the work, such as in long function calls.

When using a logical value, don’t put it in quotes: it is not a string. Use TRUE and FALSE, not T or F, as they make it much easier to detect mistakes.
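
One reason the full words are safer: T and F are ordinary objects that can be overwritten, whereas TRUE and FALSE are reserved words. A brief sketch, in which the assignment to T is a made-up example of an accidental overwrite:

```r
# T starts out equal to TRUE, but it can be reassigned,
# for example by someone using T as a name for temperature
T <- 0
overwritten <- isTRUE(T)   # FALSE: T no longer means TRUE

# TRUE and FALSE are reserved words; the next line would be an error if run
# TRUE <- 0

rm(T)   # remove our T so that the built-in value is visible again
```

After the overwrite, any code that relies on T quietly gets the wrong value, while code written with TRUE is unaffected.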

1 September

Include only what is necessary

Include only those steps that are necessary to generate the answers to the problems. You will need to edit your commands down to what is needed.

Don’t display the values of vectors or data frames in your answers unless I specifically ask for them. This is particularly true for large data frames and long vectors. It is ok to show them for yourself as you work, but do not include them in the file you turn in. If I ask you to display a vector or data frame, though, you need to show them.

If a function returns a vector, don’t wrap the result in c(). Likewise, if you have just one object, do not wrap it with c(), as it is already a vector.
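
A quick check that c() adds nothing in these cases. The values are arbitrary.

```r
evens <- seq(from=0, to=10, by=2)

# seq() already returns a vector, so wrapping it in c() changes nothing
wrappedSeq <- c(seq(from=0, to=10, by=2))
identical(wrappedSeq, evens)   # TRUE

# A single value is already a vector of length one
identical(c(32.2), 32.2)       # TRUE
is.vector(32.2)                # TRUE
```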

When accessing a subset of a data frame with a logical test, don’t wrap the logical test in a which() statement.
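
A sketch of why which() is unnecessary here, using a made-up brine data frame with no missing values:

```r
# Hypothetical data frame for illustration
brine <- data.frame(salinity=c(35, 120, 80, 210), depth=c(5, 40, 22, 61))
concentrated <- brine$salinity > 100

direct <- brine[concentrated, ]
wrapped <- brine[which(concentrated), ]

sameResult <- isTRUE(all.equal(direct, wrapped))   # TRUE
```

The two can differ only when the logical test contains NA values, which which() drops silently; that is something to handle deliberately rather than as a side effect.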

Use parentheses to simplify equations only when necessary. Remember and use the order of operations to simplify your code.

24 August

Avoid paths

Do not include the command setwd(), even if you comment it out. You may use setwd() when you work, although it is better and simpler to set your default working directory in R. If you call setwd(), be sure to delete it from the code that you send to me.

Do not embed a path in your code, such as in calls to scan() and read.table(); it will automatically generate errors on anyone else’s computer.

Make your code readable

We use spaces in our writing to make it more readable; good programmers do the same with their code. In particular, spaces help to separate the elements of code, and a lack of spaces helps to keep related elements adjacent. There are several places you should use spaces:

There are several places where you should not put a space:

Similarly, use blank lines to group related parts of your code. If several steps go together, do not separate them with blank lines. Instead, keep those lines of code together, but separate that block of code from preceding and following blocks of code by a single blank line. Multiple blank lines rarely help, and if you aren’t consistent about using them, they make your code look haphazard. One place to use multiple blank lines is if you are separating even larger-scale blocks of code (think of your code as being organized into sentences, paragraphs, chapters, and so on); just be consistent in the number of lines.

Use single quotes or use double quotes, but don’t switch between them in your code, because that makes the reader think that you are trying to convey something when you aren’t.

When writing equations, include only those parentheses that are necessary. Including too many overcomplicates your code and makes it more error-prone.
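
A small sketch of both situations, one where parentheses are redundant and one where they are required. The values of x and n are arbitrary.

```r
x <- 3
n <- 4

# Exponents and multiplication are evaluated before addition,
# so these parentheses are unnecessary
cluttered <- (x^2) + (2 * x)
simple <- x^2 + 2 * x
identical(cluttered, simple)   # TRUE

# Here parentheses are necessary, because : is evaluated before subtraction
shifted <- 1:n - 1       # 0 1 2 3
shortened <- 1:(n - 1)   # 1 2 3
```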

Have a sense of style

Use comments where necessary to identify the intent behind a block of code, or to explain a critical or confusing step. Avoid commenting every line of code, or even most lines of code. Also, avoid commenting when the purpose is obvious; for example, if you are importing data from a file, you do not need to say that in a comment. In these problem sets, include a comment signaling every labeled part of the assignment (e.g., # Part 1, # Part 2, etc.).

Use blank lines to separate groups of related commands. It is hard to read code that lacks blank lines, and code with too many blank lines is just as hard. Think of blank lines in code as the paragraph breaks in your writing; they are there to help you read. You wouldn’t make every sentence in an essay its own paragraph, so don’t do the equivalent in code by surrounding every statement with a blank line. For the problem sets, treat each numbered part (e.g., “Part 1”) in the assignment as a block of code, and put one blank line before that block of code and one blank line after it.

Following these principles, your code should look like this:

# Part 1
someCommand
anotherCommand
aComment

# Part 2
aCommand
aComment
anotherCommand

# Part 3
...

For your own work, you would use descriptive comments instead of Part 1, Part 2, etc., such as Read data sets, Cull the data, Fe vs. Mg plot, Fe vs. Mg regression analysis, etc.

Put a space before and after the assignment (<-) operator, but not around the assignment operator used in function calls (=). Follow every comma with a space, just as you would in normal writing. Avoid extraneous spaces, but use them when they clarify complex code by breaking it into logical sections.

Similarly, do not precede lines of code with spaces or tabs, unless you are inside of an if statement, a while statement, a for loop, or a function definition. Indents in these cases clarify the code, and adding them elsewhere is unnecessary and causes confusion over your intent.
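
A minimal sketch of indentation used only where it belongs, inside a for loop and an if statement. The wormDensity values are made up, and the four-space indent is one common choice rather than a requirement stated here.

```r
wormDensity <- c(4, 8, 4, 3)   # hypothetical values
total <- 0

# Top-level code starts at the left margin; each nested level is indented
for (density in wormDensity) {
    if (density > 3) {
        total <- total + density
    }
}
```

The indents show at a glance which lines belong to the loop and which belong to the if statement.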

The convention in R is to use <- for assignments at the beginning of a line rather than =, and we will adhere to that convention in this course. The only place you should use = for assignment is for assigning arguments in function calls. Here’s an example that illustrates both, as well as spacing around commas:

evenNumbers <- seq(from=0, to=100, by=2)