Get in the habit of quoting only strings (words, etc.), and avoid quoting boolean values (TRUE and FALSE) and numeric values. Quoting boolean and numeric values sometimes produces code that works and sometimes does not; either way, the quotes are confusing and they needlessly complicate the code. Keep it simple.
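As a minimal sketch of why the quotes matter, the lines below use made-up object names (unquoted, quoted, sumResult) that are for illustration only. A quoted value is a character string, so it no longer behaves as a logical value or a number.

```r
# Hypothetical values, for illustration only
unquoted <- TRUE      # a logical value
quoted <- "TRUE"      # a character string that merely looks like one

class(unquoted)       # "logical"
class(quoted)         # "character"

# Arithmetic works on a number but fails on a quoted number
5 + 1                                    # 6
sumResult <- try("5" + 1, silent=TRUE)   # error: non-numeric argument
class(sumResult)                         # "try-error"
```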
Objects are best named with short descriptive names. Doing this not only makes your code easier to read, but it can also greatly help in debugging, especially as code becomes more complex. Names should be readable, so avoid tricks like removing vowels to make the name shorter: count is better than cnt. Avoid adding a single letter to an object name: instead of rGauge, use riverGauge, or even better, gauge, especially if there is only one type of gauge. Try to trim unnecessary parts from names; instead of temperatureSensor, sensor would be fine, especially if you are dealing with only one type of sensor. Avoid adding unnecessary parts to the name like “data”, “vector”, “matrix”, etc.; brine is better than brineData, for example.
Remember, you have to type these, so long names take longer and create more opportunities for misspelling. Even so, there will be times that you need longer names, but generally try to keep them short but understandable to anyone.
In most cases, avoid setting function arguments to their default values when you call a function, as this needlessly complicates the function call. One case where you would include the default is when you call a function multiple times, sometimes with the default and sometimes not. If you include the default value for an argument, it makes it clear that this was intentional, rather than simply forgetting to include the argument name.
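Here is a sketch of that exception, using a made-up vector (depth). In sort(), decreasing=FALSE is the default, so it is left out of the single call and written out only where two calls are meant to contrast.

```r
# Hypothetical data, for illustration only
depth <- c(12.5, 3.1, 8.4, 20.2)

# A single call: leave the default out
shallowFirst <- sort(depth)

# Paired calls: stating the default shows that the contrast is intentional
ascending <- sort(depth, decreasing=FALSE)
descending <- sort(depth, decreasing=TRUE)
```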
Another case where it is good practice to include the defaults is using read.table(). You want to be in the habit of always specifying the delimiter (sep) and header arguments explicitly, as well as the skip and row.names arguments when they apply. For row.names, there are cases where read.table() can infer whether row names are present, but you should always practice safe coding by setting this explicitly. Similarly, avoid relying on the default value for sep, which splits on any white space (space, tab, return). If you have strings that contain spaces, the default will cause you trouble.
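A small sketch of an explicit read.table() call follows. The file contents, column names, and object names (fileLines, rocks) are invented for illustration, and the lines are passed through the text argument only so that the example runs without a file on disk. Because one column contains spaces, a comma is set as the delimiter.

```r
# Hypothetical file contents, one element per line of the file
fileLines <- c("sample,rock type,SiO2", "A1,quartz arenite,95.2", "A2,lithic arenite,71.8")

# Delimiter, header, and row names are all stated explicitly
rocks <- read.table(text=fileLines, header=TRUE, sep=",", row.names=1)

rownames(rocks)   # "A1" "A2"
```

With a real file, the file name would take the place of the text argument, and the other arguments would stay the same.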
Long lines of code can be hard to read, and one is tempted to force the line to break into several shorter lines. Although there are cases where this can help to see the structure of the code, embedding the line breaks will often cause problems. For example, you may modify the line of code later, but now the line breaks are in the wrong places, so you will want to clean it up and put in a fresh set of line breaks. If you change the code again, you will need to clean up those line breaks again. It’s just too much work; we want to aim for code that is easy to maintain.
The second thing that can happen is you might put the line return inside a string, such as for an axis label or a main label. Line returns and tabs in those labels are not easily interpreted by R, and R may put a small square as a placeholder for the symbol it does not recognize.
In general, avoid embedding line returns in long lines of code. Let your text editor handle the wrapping of those long lines.
Watch for row names in data files. These will almost always be in the first column, and they will typically be sample names or sample numbers. Sometimes, they will just be sequential numbers, but not a variable, like time, which might be listed in sequential order. They will also be unique identifiers, with no two being the same. If the first column looks like analyzable data, often with a variable name that suggests it is data, it is not a row name.
Avoid embedding magic numbers (hard-coded numeric constants) in your code. For example, if you want to plot a subset of your data, but you want the axis to span the maximum and minimum of your data, don't find the maximum and minimum and embed those numbers in xlim, like this: xlim=c(2.3, 8.9). If the maximum and minimum change, your code will no longer work. Instead, call range() where you need it, for example:
plot(x[subset], y[subset], xlim=range(x))
If you need to use these repeatedly, save them as an object. This saves you from repeating yourself, decreases the likelihood of errors, and makes your code more self-explanatory. For example, if you wanted three plots to share the same axis, this is the safe and concise way to do it.
xlimits <- range(x)
plot(x[groupA], y[groupA], xlim=xlimits)
plot(x[groupB], y[groupB], xlim=xlimits)
plot(x[groupC], y[groupC], xlim=xlimits)
It is usually best to assign arguments to functions by name rather than by position. Calling by name is not only safer, it is self-commenting, with the only drawback being a little more typing. For example, rlnorm(n=1000, meanlog=5.1, sdlog=1.2) will make much more sense to you later than rlnorm(1000, 5.1, 1.2). There are some common exceptions, where you can leave off the name. For example, functions that use file names (read.table(), read.csv(), etc.) generally put the file name in the first position. Common calculations generally put the input data as the first parameter, such as mean(), median(), etc. In these and other similar cases where functions follow conventions, you can often skip naming the first parameter. plot() is another example, where the x and y vectors are the first two arguments.
There are some exceptions to this, particularly for commonly used functions. For example, the first argument to hist(), barplot(), and many other functions is commonly called x. In these cases, you would typically omit x= and call the argument by position. Another common example is plot(), where the x and y values are the first two arguments. Again, the convention is to call those by position rather than name. There is nothing wrong with calling by name per se; it is just unnecessarily verbose. Over time, you will become familiar with cases where you can safely call an argument by position.
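To convince yourself that calling by name and calling by position are equivalent, here is a short sketch based on the rlnorm() call. The seed value and the object names are arbitrary choices for illustration; resetting the seed simply makes each call draw the same random values.

```r
set.seed(1)   # arbitrary seed, so that every call draws the same values
byPosition <- rlnorm(1000, 5.1, 1.2)

set.seed(1)
byName <- rlnorm(n=1000, meanlog=5.1, sdlog=1.2)

# Named arguments can be given in any order without changing the result
set.seed(1)
reordered <- rlnorm(sdlog=1.2, meanlog=5.1, n=1000)

identical(byPosition, byName)   # TRUE
identical(byName, reordered)    # TRUE
```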
Use informative labels for the main title and for your axes. In particular, do not use names of objects, like granite.SiO2 or myRandomNumbers. Follow the normal rules of punctuation, spacing, etc. Also, do not use the main label to repeat what is on your axes. For example, if you plot pH vs. alkalinity, having a main title that says “pH vs. alkalinity” is redundant, and it should be removed.
If you need a legend for a plot, put it in an out-of-the-way place so that it does not overlap your data. Keep the font size in the legend the same as that of your axis labels, since the legend has the same level of importance. Text should be black in most cases. Points in the legend should match those in the plot in size, color, and shape.
Don’t generate additional objects unless you need them or unless they clarify the work, such as in long function calls. For example, if you are using a logical test to retrieve a subset of a vector or data frame, your code will generally be easier to read if you save that logical test to an object with a short descriptive name, then use that object in selecting your data.
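As a sketch of this, suppose you have a small data frame of sensor readings. The data and the names sensor, midDepth, and midTemperature are invented for illustration.

```r
# Hypothetical data, for illustration only
sensor <- data.frame(depth=c(2, 15, 40, 75), temperature=c(18.2, 14.6, 9.1, 4.3))

# Harder to read: the logical test is buried inside the subsetting
mean(sensor$temperature[sensor$depth > 10 & sensor$depth < 50])

# Easier to read: save the test under a short descriptive name, then use it
midDepth <- sensor$depth > 10 & sensor$depth < 50
midTemperature <- mean(sensor$temperature[midDepth])
```

Both versions give the same mean; the second one tells the reader what the test is for.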
Although it is true that the compiler or interpreter doesn’t care about things like spacing, capitalization, or object names, following conventions on these will make your code easier to read and debug. For object names, you want a balance between a name that is informative but not too long. For example, in importing the Kahmann data set on paleosols, paleosols would be the clearest name to assign to the object, as it is short and unambiguous. kahmann would be another alternative, but using it would make the most sense if you were analyzing multiple paleosol data sets and wanted to keep them straight by making the author’s name the name of the object.
It would be hard to overstate the importance of good names for your objects. I have seen many cases where bugs were hard to detect because of a poorly chosen name.
Include only those steps that are necessary to generate the answers to the problems. You will need to edit your commands down to what is needed.
Don’t display the values of vectors or data frames in your answers unless I specifically ask for them. This is particularly true for large data frames and long vectors. It is ok to show them for yourself as you work — and this is an important part of learning R to convince yourself that your code is working — but do not include them in the file you turn in or files that you share with others. If I ask you to display a vector or data frame, though, you need to show them.
If a function returns a vector, don’t wrap the result in c(). Likewise, if you have just one object, do not wrap it with c(), as it is already a vector. For example, c() is unnecessary in cases like c(8:12), c(rnorm(25)), c(seq(from=2, to=20, by=2)), and c(rep(25, 2)). If you are unsure if c() is needed, delete it and see if the code works correctly.
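You can check this for yourself with identical(), as in the sketch below. The object names are made up for illustration.

```r
# Wrapping a vector in c() returns the same vector
sameRange <- identical(c(8:12), 8:12)
sameRep <- identical(c(rep(25, 2)), rep(25, 2))
sameSeq <- identical(c(seq(from=2, to=20, by=2)), seq(from=2, to=20, by=2))

# c() is needed when you combine several separate values into one vector
sizes <- c(4, 8, 15)
```

All three comparisons return TRUE, so the extra c() adds nothing.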
When accessing a subset of a data frame with a logical test, don’t wrap the logical test in a which() statement. The which() command is needed when you specifically need the index of the matching value. If you are unsure if which() is necessary, delete it and see if the code works correctly.
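The sketch below compares the two approaches on a made-up vector (pH), assuming there are no missing values in the data. The names pH and alkaline are for illustration only.

```r
# Hypothetical data with no missing values
pH <- c(6.8, 7.4, 8.1, 7.9)
alkaline <- pH > 7.5

# The logical test alone is enough to select the values
pH[alkaline]            # 8.1 7.9

# which() selects the same values here, with extra typing
pH[which(alkaline)]     # 8.1 7.9

# Use which() when you need the positions themselves
which(alkaline)         # 3 4
```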
Use parentheses to simplify equations only when necessary. Remember and use the order of operations to simplify your code. If you are unsure whether parentheses are needed, delete them and see if the code works correctly (you should be seeing a pattern here…).
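Here is a short sketch of parentheses that can go and parentheses that must stay. The values and object names are invented for illustration.

```r
# Hypothetical values, for illustration only
slope <- 2.5
intercept <- 1.2
x <- 4

# Multiplication happens before addition, so these parentheses add nothing
withExtra <- (slope * x) + intercept
withoutExtra <- slope * x + intercept
identical(withExtra, withoutExtra)   # TRUE

# These parentheses are needed: without them, only intercept is divided by 2
meanOfTwo <- (x + intercept) / 2
```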
Include only the comments that I ask for. Novice coders tend to include too many comments, particularly ones that state what code does in cases where the operation is clear. In your work, good comments should indicate your intent, the why of code, not the what.
Do not include the command setwd(), even if you comment it out. You will likely use setwd() when you work, but just delete it before sharing your code.
Do not embed a path in your code, such as in calls to scan() and read.table(); a path that works on your computer will generate errors on anyone else’s computer.
We use spaces in our writing to make it more readable; good programmers do the same with their code. In particular, spaces help separate the elements of code, and a lack of spaces helps keep related elements adjacent. There are several places you should use spaces:
There are several places where you should not put a space:
Similarly, use blank lines to group related parts of your code. If several steps go together, do not separate them with blank lines. Instead, keep those lines of code together, but separate that block of code from preceding and following blocks of code by a single blank line. Multiple blank lines rarely help, and if you aren’t consistent about using them, they make your code look haphazard. One place to use multiple blank lines is if you are separating even larger-scale blocks of code (think of your code as being organized into sentences, paragraphs, chapters, and so on); just be consistent in the number of lines.
Use single quotes or use double quotes, but don’t switch between them in your code, because that makes the reader think that you are trying to convey something when you aren’t.
When writing equations, include only those parentheses that are necessary. Including too many overcomplicates your code and makes it more error-prone.
Misspelled words convey carelessness.
Similarly, check grammar and punctuation in your comments.
Use comments where necessary to identify the intent behind a block of code, or to explain a critical or confusing step. Avoid commenting every line of code, or even most lines of code. Also, avoid commenting when the purpose is obvious; for example, if you are importing data from a file, you do not need to say that in a comment. In these problem sets, include a comment signaling every labeled part of the assignment (e.g., # Part 1).
Use blank lines to separate groups of related commands. Code that lacks blank lines is hard to read, and code with too many blank lines is just as hard. Think of blank lines in code as the paragraph breaks in your writing; they are there to help you read. You wouldn’t make every sentence in an essay its own paragraph, so don’t do the equivalent in code by surrounding every statement with a blank line. For the problem sets, treat each numbered part (e.g., “Part 1”) in the assignment as a block of code, and put one blank line before that block of code and one blank line after it.
Following these principles, your code should look like this:
# Part 1
someCommand
anotherCommand
aComment

# Part 2
aCommand
aComment
anotherCommand

# Part 3
...
For your own work, you would use descriptive comments instead of Part 1, Part 2, etc., such as Read data sets, Cull the data, Fe vs. Mg plot, Fe vs. Mg regression analysis, etc.
The convention in R is to use <- for assignments at the beginning of a line rather than =, and we will adhere to that convention in this course. The only place you should use = for assignment is for assigning arguments in function calls. Here’s an example that illustrates both, as well as spacing around commas:
evenNumbers <- seq(from=0, to=100, by=2)