27 October 2015

If you’ve come from a programming background, your instinct in R is likely to make loops. This instinct should be avoided, not only because loops are generally slower in R than other approaches (sometimes *much* slower), but because the code can also be expressed more clearly without loops.

For example, suppose you want to square a series of values stored in a vector. Coming from other procedural languages (like Fortran, BASIC, or C), you might accomplish this with a loop:

x <- c(12.2, 1.7, 53.3, 29.2, 6.1) squared <- vector(mode="numeric", length=length(x)) for (i in 1:length(x)) { squared[i] <- x[i]^2 }

That’s a lot of work for a simple operation. Not only is there a loop, it had to be preceded by explicitly declaring a vector to hold the results. R has vectorized math which let’s this be done much more simply:

x <- c(12.2, 1.7, 53.3, 29.2, 6.1) squared <- x^2

What’s nice about this is not just that it makes the code shorter, it also makes your intent clearer. Look for opportunities to take advantage of vectorized math, as the speed gains are impressive.

Not all loops can be eliminated with vectorized math. For example, suppose you wanted to simulate an F-distribution (the ratio of two variances, each drawn from a normal distribution). Using loops, you might write the following to build the distribution of F:

n1 <- 12 n2 <- 17 numTrials <- 10000 F <- vector(mode="numeric", length=numTrials) for (i in 1:numTrials) { sample1 <- rnorm(n1) sample2 <- rnorm(n2) F[i] <- var(sample1)/var(sample2) }

This has the same setup you saw above: creating a vector, then sequentially filling it with values in a loop. Although this problem cannot be solved with vectorized math, there is another solution: the replicate() function, which repeats a set of commands many times.

varianceRatio <- function(sampleSize1, sampleSize2) { data1 <- rnorm(sampleSize1) data2 <- rnorm(sampleSize2) var(data1)/var(data2) } n1 <- 12 n2 <- 17 numTrials <- 1000 F <- replicate(numTrials, varianceRatio(n1, n2)) # F could be plotted with hist() to show the distribution

In this case, the code is somewhat longer, but our intent is clear. The first step is to define a function that performs the operation that will be repeated; here, two random samples are drawn from a normal distribution, and the ratio of their variances is returned. The second step is to repeat that function many times with replicate(). Note that replicate() doesn’t iterate through an existing vector, list, matrix, or data frame.

In many cases, you may have an existing object (usually a vector, list, matrix, or data frame) that you want to iterate through, performing some calculation based on that object or parts of it. For example, you might want to calculate the median of every column of a matrix. For these types of problems, you should use apply(), lapply(), or sapply(), three incredibly powerful functions. Understanding these three functions is essential for knowing how to use R. Which one you should use will depend on the type of object you are working on and what you would like as output.

The apply() function is used when you want to work with a matrix and get a vector of results. For example, if you want the median for every column of a matrix, you would get back a vector of medians, and this vector would have the same length as the number of columns in the matrix. Approaching this with a loop, you might do the following.

# assume x is a matrix with our data n <- ncol(x) medians <- vector(mode="numeric", length=n) for (i in 1:n) { medians[i] <- median(x[ , i]) }

Using apply() is much simpler. The first argument is the matrix for the calculations. The second argument specifies whether the operation is to be applied row-by-row (MARGIN=1) or column-by-column (MARGIN=2). The third argument is the function to be applied; note that it does not include parentheses or arguments.

# assume x is a matrix with our data medians <- apply(x, MARGIN=2, median)

You aren't limited to the built-in functions of R. For example, if you wanted to calculated the coefficient of variation (the mean divided by the standard deviation), you could define it in a function and then apply it.

CV <- function(x) { mean(x)/sd(x) } cv <- apply(x, MARGIN=2, CV)

You can also do the same by defining an anonymous function (one without a name) within the call to apply():

cv <- apply(x, MARGIN=2, function(y) {mean(y)/sd(y)})

Notice how this is done. A function is defined as taking one argument called y, and the operations of the function are defined within the curly braces.

The argument y is what is extracted from the matrix. You can give this argument any name you wish. For example, since apply() is working on a column of data at a time, it would make the code clearer to call it column instead of the nondescript y.

cv <- apply(x, MARGIN=2, function(column) {mean(column)/sd(column)})

The curly braces can enclose multiple lines of code; that they are part of the anonymous function can be made clear by including returns and indents. By convention, the opening curly brace terminates the first line, and the closing curly brace and parenthesis (the one that ends the apply function) go on their own line at the end. This code is intentionally too verbose so that you see the principle that anonymous functions can have multiple lines of code.

cv <- apply(x, MARGIN=2, function(column) { theMean <- mean(column) theSd <- sd(column) cv <- theMean/theSd cv })

The sapply() function is used when you want to work with a vector and have a vector returned as a result. It can also be used when you want to work with a list and have a vector or matrix returned.

The jackknife is a resampling technique often applied to a vector of data, and it is a good candidate for sapply(). A jackknife works by removing one observation at a time from the data, calculating the statistic of interest, and from it, calculating something called a pseudovalue. The pseudovalues are used to place a confidence interval on the estimate. The details of the jackknife method aren’t critical here, but how to go about removing only one value at a time from a vector is. Using loops, and supposing you are performing a jackknife on the coefficient of variation (mean / standard deviation), you would likely to do something like this:

# assume x is a vector of data
allPseudovalues <- vector(length=length(x), mode="numeric")
theEstimate <- mean(x)/sd(x)
for (i in 1:length(x)) {
subset <- x[-i]

# value is removed here
subsetEstimate <- calculateStatistic(subset)
pseudovalues[i] <- theEstimate - (length(x) - 1)*(subsetEstimate - theEstimate)
}

With sapply(), you first define a function that calculates a pseudovalue by removing a single value from a vector. The function is set up so that the first parameter is the value to be removed, and the second is the vector. This is important, as the first value is what will be changed as sapply() iterates through the data. When sapply() is called, the first argument is the vector of data, and the second is the function to be called. Often, sapply() might be called with only these two arguments. In this case, our custom function requires multiple arguments, so these are added in the same order after the function argument to sapply(). In other words, our pseudovalue function requires a vector argument, so our vector of data (x) is added as a third argument to sapply().

pseudovalue <- function(value, vector) { theEstimate <- mean(vector)/sd(vector) index <- min(which(vector==value)) subset <- vector[-index] subsetEstimate <- mean(subset)/sd(subset) theEstimate - (length(subset))*(subsetEstimate - theEstimate) } # assume x is a vector of data allPseudovalues <- sapply(x, pseudovalue, x)

This setup of defining a function and then calling apply() or sapply() has two advantages over loops. First, it makes our intent clearer: you define what happens in one application and then call it over the entire data set. Second, this approach is more easily testable than the loop: you can invoke the custom function for a particular case and verify that it does what it ought to.

The lapply() function is used when you want to work with a list or vector and you want to have a list, matrix, or vector returned as a result.

For example, suppose you wanted to perform some analyses on a series of data sets. You might open these one at a time, then perform your operations.

c2 <- read.table("c2.txt", row.names=1, sep=",", header=TRUE)
# do some operations on c2
c3 <- read.table("c3.txt", row.names=1, sep=",", header=TRUE)
# do some operations on the c3 data
c4 <- read.table("c4.txt", row.names=1, sep=",", header=TRUE)
# do some operations on the c4 data

This would be laborious and error-prone. If any step was incorrect for one data set, you would need to check it for every other data set to make sure you caught every case. It is much better to do an operation once and not repeat code. You can start with reading the files, as all of these have the same format.

filePrefixes <- c("c2", "c3", "c4") fileNames <- paste(filePrefixes, ".txt", sep="") dataSets <- lapply(fileNames, read.table, row.names=1, sep=",", header=TRUE)

Because all of the file names have the same structure (or they should have), you should build the file names based on their prefixes. The alternative would be to repeatedly type ‘.txt’ for every file name and hope that you make no errors.

Once the file names have been constructed, lapply() can be used to open all of them with read.table(). Note that lapply() is set up like sapply(): the first argument is the object that will be iterated through (here, a list of file names), and the second argument (read.table) is the function that will be applied to each of our file names. The subsequent arguments (row.names, sep, and header) get passed to the read.table() function.

What lapply() returns is a list of all the data sets. You could loop though them and do the work on each one in turn.

for (i in 1:length(dataSets)) { # perform operations on the data set in dataSets[[i]] }

Better yet, you could handle this with lapply(). First, you define a function that will do the operations on one data set, and then you use lapply() to perform it on all the data sets. What you get back is a list with all of the results.

dataSetFunction <- function(dataSet) { # all the operations to be performed on a single data set } results <- lapply(dataSets, dataSetFunction)

Again, this has the advantage that the code is substantially shorter and it is is much more testable. It is also easily extended: if you want to add an additional data set - all you would have to add is file prefix to the filePrefixes vector.

Loops in R should be reserved for situations that require them, such as when one step depends on the previous iteration or when the loop must execute until some condition is met. Where the steps are independent, loops should be avoided. If you have code that requires loops, you should strongly consider writing compiled code in procedural language like C, and calling those functions from R. You will likely see substantial gains in speed.

The benefits to using vectorized math, replicate(), apply(), lapply(), and sapply() are many. Your code will generally be faster, sometimes dramatically so, especially if you tend to use loops within loops. You will not need to do housekeeping work, like creating vectors beforehand to hold data that you will generate. Your code will be clearer, because you will separate what needs to happen in one application of your methods from calling those operations repeatedly. This separation will also make your code more testable. Finally, you will be using R as the beautiful functional language that it was meant to be, not as a procedural language that strips it of its power and elegance.

Comments or questions? Contact me at stratum@uga.edu