R Tips

Steven Holland

Handling groups of data

28 November 2021

Sometimes we wish to calculate statistics on groups of rows in a data frame, but calculating and tabulating those statistics seems complicated. The problem becomes easy with a pair of functions, split() and sapply(). Let’s consider three increasingly complex examples.

Example 1

Imagine you have three replicate measures of gold (Au) at three locations, stored in a data frame called gold:

  location Au
1        1  2
2        1  3
3        1  4
4        2  1
5        2  3
6        2  8
7        3  1
8        3  4
9        3  7

You would like to calculate the average amount of gold at each location and have that nicely tabulated. At first, you might think that you could use the row indices for each location and calculate the average gold concentration for each site in turn, but that would be laborious and error-prone; it would also be hard to tabulate the results. You might also think of using a loop, but that would be clunky and not R-like. There is a simpler way, and it takes only three steps.

First, create a function that takes a data frame as an argument, calculates the statistic you are interested in, and returns the value (a mean in this case).

myFunc <- function(x) {
  AuMean <- mean(x$Au)
  names(AuMean) <- "mean"
  return(AuMean)
}

The way I have written this function might seem unnecessarily verbose, but I am doing this so the following examples follow a similar pattern. The line with the names() function will make the final output more readable, especially in the more complex examples. You can (should) test your function against the full data set to make sure it works: it should return the mean of all the gold values.
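As a quick check of this step, the sketch below rebuilds the gold data frame from the table above and runs the function on the whole data set. The data.frame() call is an assumed reconstruction of the data, since the original shows only the printed table.

```r
# Rebuild the example data (values copied from the printed table;
# the construction itself is an assumption)
gold <- data.frame(
  location = rep(1:3, each = 3),
  Au = c(2, 3, 4, 1, 3, 8, 1, 4, 7)
)

myFunc <- function(x) {
  AuMean <- mean(x$Au)
  names(AuMean) <- "mean"
  return(AuMean)
}

# Applied to the full data frame, the function returns the mean of
# all nine gold values as a named vector of length one
overall <- myFunc(gold)
print(overall)
```

If the function behaves on the full data frame, it will behave on each group, because each group is a data frame with the same columns.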

In step two, we will split the data frame into groups based on the values of location, using the split() function. The first argument is the data frame and the second argument is the column in the data frame that defines the groups.

goldSplit <- split(gold, gold$location)

If we view the object this produces, we can see that it is a list of three data frames, each one containing the data for one of the three locations. Notice that the structure of each of these data frames is identical to our original data frame. The names after the $ identify the groups, and we can use them to access individual locations. For example, if we wanted the data just from location 1, we could type goldSplit$"1".

$`1`
  location Au
1        1  2
2        1  3
3        1  4

$`2`
  location Au
4        2  1
5        2  3
6        2  8

$`3`
  location Au
7        3  1
8        3  4
9        3  7
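A short sketch of how the split object can be inspected, again assuming the gold data frame is rebuilt from the table in this example. The variable names loc1 and same are hypothetical, used only for illustration.

```r
# Assumed reconstruction of the example data
gold <- data.frame(
  location = rep(1:3, each = 3),
  Au = c(2, 3, 4, 1, 3, 8, 1, 4, 7)
)

# Split by the location column; the result is a named list
goldSplit <- split(gold, gold$location)
print(names(goldSplit))

# Two equivalent ways to pull out the data frame for location 1
loc1 <- goldSplit$`1`
same <- identical(loc1, goldSplit[["1"]])
print(loc1)
```

The list names are character strings even though location is numeric, which is why the name has to be quoted or wrapped in backticks.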

For the third and final step, we calculate our statistic on each of these data frames and tabulate the result using sapply(), which applies a function to a list (or vector) and simplifies (that’s the s before apply) the result into a vector or matrix. The first argument is our list of split data frames, and the second argument is our function.

goldMeans <- sapply(goldSplit, myFunc)

Viewing goldMeans shows the output as a vector. Each element is named by location and the name we added in our function, hence, 1.mean, 2.mean, etc.

1.mean 2.mean 3.mean
     3      4      4

To recap, all we have to do is create our function, split the data frame using split(), and apply our function to the split data frame using sapply().
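Put together, the three steps might look like the following sketch. The data.frame() call is an assumed reconstruction of the example data; the rest follows the steps described above.

```r
# Assumed reconstruction of the example data
gold <- data.frame(
  location = rep(1:3, each = 3),
  Au = c(2, 3, 4, 1, 3, 8, 1, 4, 7)
)

# Step 1: a function that takes a data frame and returns the statistic
myFunc <- function(x) {
  AuMean <- mean(x$Au)
  names(AuMean) <- "mean"
  return(AuMean)
}

# Step 2: split the data frame into a list of data frames by location
goldSplit <- split(gold, gold$location)

# Step 3: apply the function to each group and simplify to a vector
goldMeans <- sapply(goldSplit, myFunc)
print(goldMeans)
```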

Example 2

In this example, suppose we have more variables and that we want to calculate multiple statistics and assemble them into a single table. We now have silver and gold, and we would like to calculate mean gold and mean silver on the replicates for each location. Here’s our new data in a data frame called metals:

  location Au Ag
1        1  2  2
2        1  3  5
3        1  4  7
4        2  1  1
5        2  3  2
6        2  8  3
7        3  1  4
8        3  4  5
9        3  7  6

First, we create our function following the same approach, this time combining our statistics into a vector before assigning the names.

meanMetal <- function(x) {
  AuMean <- mean(x$Au)
  AgMean <- mean(x$Ag)
  results <- c(AuMean, AgMean)
  names(results) <- c("AuMean", "AgMean")
  return(results)
}

Next, we split the data frame by locations.

metalsSplit <- split(metals, metals$location)

Finally, we apply our function to our split data frame. The t() function is used to transpose the results to put mean gold and mean silver as columns, with rows corresponding to our locations.

means <- t(sapply(metalsSplit, meanMetal))

Viewing the means object shows a simple table of our results.

  AuMean   AgMean
1      3 4.666667
2      4 2.000000
3      4 5.000000

Again, there are just three steps: make a function to do the work, split the data frame by location, then apply the function to the split data frame.
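The whole of this example can be sketched as follows, with the data frame rebuilt from the table above (an assumed reconstruction). Note that t(sapply(...)) produces a matrix; the final as.data.frame() call is an optional extra step, not part of the original workflow, for cases where a true data frame is wanted.

```r
# Assumed reconstruction of the example data
metals <- data.frame(
  location = rep(1:3, each = 3),
  Au = c(2, 3, 4, 1, 3, 8, 1, 4, 7),
  Ag = c(2, 5, 7, 1, 2, 3, 4, 5, 6)
)

# Step 1: a function returning a named vector of statistics
meanMetal <- function(x) {
  results <- c(mean(x$Au), mean(x$Ag))
  names(results) <- c("AuMean", "AgMean")
  return(results)
}

# Step 2: split by location
metalsSplit <- split(metals, metals$location)

# Step 3: apply and transpose so rows are locations, columns are statistics
means <- t(sapply(metalsSplit, meanMetal))
print(means)

# Optional: convert the matrix to a data frame
meansDF <- as.data.frame(means)
```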

Example 3

Sometimes, we will want to split our data in two ways. For example, suppose we had two sampling areas, and we had two plots in each area, one treated and one control. We would like to split our data by both columns (area and type), not just one column. Here’s the data:

  area    type Au Ag
1    1 treated  3  5
2    1 treated  4  7
3    1 control  1  1
4    1 control  3  2
5    2 treated  8  3
6    2 treated  1  4
7    2 control  4  5
8    2 control  7  6

As before, we create our function. This is the same function used in example 2.

meanMetal <- function(x) {
  AuMean <- mean(x$Au)
  AgMean <- mean(x$Ag)
  results <- c(AuMean, AgMean)
  names(results) <- c("AuMean", "AgMean")
  return(results)
}

Next, we split the data, and this is the only new part of the process. When we want to split a data frame by more than one column, we use a formula (a model specification) for the second argument; this form of split() requires R 4.1.0 or later. The formula is similar to what is used in regression, ANOVA, and similar problems. Here, our formula will be ~ area + type, that is, as a function of area and type.

metalsSplit <- split(metals, ~ area + type)

Viewing the metalsSplit object shows the list of our split data, with the names reflecting the area and the type variables.

$`1.control`
  area    type Au Ag
3    1 control  1  1
4    1 control  3  2

$`2.control`
  area    type Au Ag
7    2 control  4  5
8    2 control  7  6

$`1.treated`
  area    type Au Ag
1    1 treated  3  5
2    1 treated  4  7

$`2.treated`
  area    type Au Ag
5    2 treated  8  3
6    2 treated  1  4

Finally, we apply our function to the split data, again using t() to transpose the results so that columns are variables and rows are cases.

means <- t(sapply(metalsSplit, meanMetal))

Viewing means shows a simple table of our results (a matrix), with columns corresponding to variables and rows corresponding to cases.

          AuMean AgMean
1.control    2.0    1.5
2.control    5.5    5.5
1.treated    3.5    6.0
2.treated    4.5    3.5
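A runnable sketch of this example follows, with the data rebuilt from the table above (an assumed reconstruction). Because the formula form of split() is only available in recent versions of R, the sketch uses a list of the two grouping columns as the second argument, which is a swapped-in alternative that yields the same groups, and it runs the formula version only when the installed R supports it.

```r
# Assumed reconstruction of the example data
metals <- data.frame(
  area = rep(1:2, each = 4),
  type = rep(c("treated", "treated", "control", "control"), times = 2),
  Au = c(3, 4, 1, 3, 8, 1, 4, 7),
  Ag = c(5, 7, 1, 2, 3, 4, 5, 6)
)

meanMetal <- function(x) {
  results <- c(mean(x$Au), mean(x$Ag))
  names(results) <- c("AuMean", "AgMean")
  return(results)
}

# Split by two columns using a list of grouping variables
# (alternative to the formula interface; works in older R as well)
metalsSplit <- split(metals, list(metals$area, metals$type))

# The formula interface gives the same groups where it is available
if (getRversion() >= "4.1.0") {
  byFormula <- split(metals, ~ area + type)
  stopifnot(identical(names(byFormula), names(metalsSplit)))
}

# Apply and transpose: rows are area.type groups, columns are statistics
means <- t(sapply(metalsSplit, meanMetal))
print(means)
```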

Conclusion

This approach can also be taken with multivariate data or more complex return types. For example, suppose you wanted to perform a regression or a principal components analysis on each group of data. First, you would create a function that would perform your analysis and return the results. Second, you would split the data frame by groups. Third and last, you would apply the function to your split data frame. Because the return types are complex, not just a vector of values, you would use the lapply() function rather than the sapply() function, so that the final result would be a list, with each list item corresponding to one group of the data. You would also skip the t() command.
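A minimal sketch of this idea, using lapply() so that the results stay in a list. It fits a separate linear regression of silver on gold for each location in the Example 2 data. The choice of model (Ag ~ Au) and the function name fitMetal are hypothetical, chosen only to illustrate the pattern, and the data frame is an assumed reconstruction of the table in Example 2.

```r
# Assumed reconstruction of the Example 2 data
metals <- data.frame(
  location = rep(1:3, each = 3),
  Au = c(2, 3, 4, 1, 3, 8, 1, 4, 7),
  Ag = c(2, 5, 7, 1, 2, 3, 4, 5, 6)
)

# Step 1: a function that performs the analysis on one group
# (hypothetical example: regress silver on gold)
fitMetal <- function(x) {
  lm(Ag ~ Au, data = x)
}

# Step 2: split by location
metalsSplit <- split(metals, metals$location)

# Step 3: apply with lapply(), which always returns a list;
# each element is the fitted model for one location
fits <- lapply(metalsSplit, fitMetal)

# Individual results are accessed by group name
print(coef(fits[["1"]]))
```

Simple summaries can still be tabulated afterwards, for example with t(sapply(fits, coef)), since the coefficients of each model form an ordinary named vector.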

Remember, function, split(), and sapply(): three easy steps to analyzing data frames by groups.