14 November 2015
We often want to know how many species are present, whether it is in a particular habitat, a province, or an interval of geological time. What sounds like a simple matter of counting species becomes complicated because the number of species increases with the number of individuals that are counted.
This becomes a problem when comparing the diversity (richness) of different samples and those samples have different numbers of counted individuals. A common approach is to use rarefaction (Hurlbert 1971) to subsample all of the samples and calculate their richness at the same number of individuals.
This also sounds straightforward, but it becomes complicated by evenness, that is, how abundance decreases from the most abundant species to the least abundant. When evenness is high, all species have relatively similar abundances, and when evenness is low, the abundances of species differs greatly. The problem for rarefaction is that rarefaction curves can cross when two samples differ in their evenness, and therefore the relative richness of samples will change at different sampling levels.
The second problem with rarefaction is what constitutes fair sampling. Although it may sound fair to count all samples to the same number of individuals, this can be misleading. An alternative approach is to count individuals until a given level of the relative abundance distribution has been sampled, for example, sampling might continue until the combined relative abundances of the sampled species equals some specified value. In effect, this means that samples with high evenness will need to be counted to a larger number of individuals to achieve the same coverage of relative abundance. John Alroy’s Shareholder Quorum Subsampling (SQS) performs this fair sampling (Alroy 2010).
Below, I give an implementation of SQS, based on John’s code. The implementation here removes the rarefaction option and the calculation of other metrics not needed for SQS. It also more closely follows R’s conventions of variable naming, returns and error handling, and avoidance of loops. I’ve also added more comments so that the steps in the calculations are clearer. The code retains one large loop, which is decidedly not R-like, but using this loop is simpler than the alternatives, given the large number of required parameters. The code is also available as a download.
sqs <-function(abundance, quota=0.9, trials=100, ignore.singletons=FALSE, exclude.dominant=FALSE) { # abundance is a vector of integers representing the abundance of every species if ((quota <= 0 || quota >= 1)) { stop('The SQS quota must be greater than 0.0 and less than 1.0') } # compute basic statistics specimens <- sum(abundance) numTaxa <- length(abundance) singletons <- sum(abundance==1) doubletons <- sum(abundance==2) highest <- max(abundance) mostFrequent <- which(abundance==highest)[1] if (exclude.dominant == FALSE) { highest <- 0 mostFrequent <- 0 } # compute Good's u u <- 0 if (exclude.dominant == TRUE) { u <- 1 - singletons / (specimens - highest) } else { u <- 1 - singletons / specimens } if (u == 0) { stop('Coverage is zero because all taxa are singletons') } # re-compute taxon frequencies for SQS frequencyInitial <- abundance - (singletons + doubletons / 2) / numTaxa frequency <- frequencyInitial / (specimens - highest) # return if the quorum target is higher than estimated coverage if ((quota > sum(frequency)) || (quota >= sum(abundance))) { stop('SQS quota is too large, relative to the estimated coverage') } # create a vector, length equal to total number of specimens, # each value is the index of that species in the abundance array ids <- unlist(mapply(rep, 1:numTaxa, abundance)) # subsampling trial loop richness <- rep(0, trials) # subsampled taxon richness for (trial in 1:trials) { pool <- ids # pool from which specimens will be sampled specimensRemaining <- length(pool) # number of specimens still to be sampled seen <- rep(0, numTaxa) # taxa that have been sampled subsampledFrequency <- rep(0, numTaxa) # subsampled frequencies of the taxa coverage <- 0 while (coverage < quota) { # draw a specimen drawnSpecimen <- sample(1:specimensRemaining, size=1) drawnTaxon <- pool[drawnSpecimen] # increment frequency for this taxon subsampledFrequency[drawnTaxon] <- subsampledFrequency[drawnTaxon] + 1 # if taxon has not yet been found, increment the coverage if (seen[drawnTaxon] == 0) { if (drawnTaxon != mostFrequent && (ignore.singletons == 0 || abundance[drawnTaxon] > 1)) { coverage <- coverage + frequency[drawnTaxon] } seen[drawnTaxon] <- 1 # increment the richness if the quota hasn't been exceeded, # and randomly throw back some draws that put the coverage over quota if (coverage < quota || runif(1) <= frequency[drawnTaxon]) { richness[trial] <- richness[trial] + 1 } else { subsampledFrequency[drawnTaxon] <- subsampledFrequency[drawnTaxon] - 1 } } # decrease pool of specimens not yet drawn pool[drawnSpecimen] <- pool[specimensRemaining] specimensRemaining <- specimensRemaining - 1 } } # compute subsampled richness s2 <- richness[richness>0] subsampledRichness <- exp(mean(log(s2))) * length(s2)/length(richness) return(round(subsampledRichness, 1)) }
Using the SQS function is simple: just supply it a vector of species abundances, in any order.
abundances <- c(2, 1, 12, 2, 5, 3, 25, 2, 1, 17, 1, 9) sqs(abundances)
>## [1] 8.5
The quota argument is set to 0.9 by default, which means that sampling will continue until taxa whose combined relative abundance exceeds 90%. This quota could be set to lower levels. Although John does not provide guidelines as to what the quota (quorum) should be, he does argue that the final results are not dependent on its value, unlike rarefaction (Alroy 2010, p. 1217). In his examples, John sets the quota to 40% and 55%.
The number of trials will control the precision of the answer, with the tradeoff being computation time. By default, 100 trials are performed. This value could be increased, but shouldn’t be decreased except when testing the method.
The remaining two parameters control whether the method includes singletons (species occurring only once in a sample) and dominants (the most abundant taxon). John argues (Alroy 2010, p. 1217) that these are special cases that should rarely effect the result. One might set ignore.singletons to true if one sample was unusually intensely sampled compared to other collections. One might set exclude.dominant to true to avoid having dominance affect the size of the species pool. John mentions that both corrections are relevant in idiosyncratic and small data sets and that they decrease sampling error in larger data sets.
Alroy, J. 2010. Geographical, environmental and intrinsic biotic controls on Phanerozoic marine diversification. Palaeontology 53:1211–1235.
Hurlbert, S.H. 1971. The nonconcept of species diversity: a critique and alternative parameters. Ecology 52:577–586.
Comments or questions? Contact me at stratum@uga.edu