R Tips

Steven Holland

Unusual distributions from the bootstrap

16 October 2015

In class, we demonstrated the bootstrap by using it to calculate a confidence interval on the standard deviation of a small sample. These are the data we used; note that there is an outlier at 3.24.

x <- c(0.01, 0.25, 0.35, 0.37, 0.68, 0.84, 1.13, 1.25, 1.61, 3.24)
stripchart(x, pch=1)

[Figure: strip chart of the data]

The bootstrap is set up in two steps. The first is to create a function that performs one bootstrap. It draws a sample of the same size as our data, with replacement, and calculates the standard deviation (the statistic we wish to bootstrap).

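sdOfOneBootstrap <- function(x) {
  # Resample the data with replacement, keeping the original sample size
  bootstrappedSample <- sample(x, size=length(x), replace=TRUE)
  # The statistic being bootstrapped
  theSd <- sd(bootstrappedSample)
  theSd
}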

The second step is to run the bootstrap function many times to produce a distribution of the statistic (the standard deviation).

nTrials <- 100000
standardDeviation <- replicate(nTrials, sdOfOneBootstrap(x))

Plotting this produces an unusual tri-modal distribution. Normally, this distribution would be used to generate confidence limits with the quantile() function. In this case, the unusual distribution warrants investigation.

hist(standardDeviation, breaks=50, col='gray', xlab='bootstrapped standard deviation', main='')

[Figure: frequency distribution of the bootstrapped standard deviation]
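For reference, here is a minimal sketch of the usual next step, computing percentile confidence limits with quantile(). The 95% level, the fixed seed, and the reduced number of trials are assumptions made for illustration; they are not taken from the class example.

```r
# Sketch: percentile confidence limits from a bootstrap distribution.
set.seed(1)  # assumption: fixed seed so the run is repeatable

x <- c(0.01, 0.25, 0.35, 0.37, 0.68, 0.84, 1.13, 1.25, 1.61, 3.24)

sdOfOneBootstrap <- function(x) {
  sd(sample(x, size=length(x), replace=TRUE))
}

# Fewer trials than in the post, only to keep the example quick
standardDeviation <- replicate(10000, sdOfOneBootstrap(x))

# The 2.5% and 97.5% quantiles bound a 95% percentile interval
confidenceLimits <- quantile(standardDeviation, probs=c(0.025, 0.975))
confidenceLimits
```

With a multi-modal distribution like this one, those limits depend heavily on how often the outlier is resampled, which is why the shape is worth understanding before trusting the interval.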

The first thing to check is where the standard deviation of our data falls, and I’ll add this as a vertical black line. As expected, it falls near the center of the distribution.

hist(standardDeviation, breaks=50, col='gray', xlab='bootstrapped standard deviation', main='')
abline(v=sd(x), col='black', lwd=3)

[Figure: frequency distribution with the standard deviation of all the data marked in black]

The second thing to check is the effect of the outlier at 3.24. One possibility is that the outlier might not get drawn in a bootstrap trial. This can be simulated by replacing the outlier value with any of the other values (here the fifth, 0.68), calculating the standard deviation of that sample, and adding it as a red line. Not drawing the outlier lands us squarely in the lower mode.

hist(standardDeviation, breaks=50, col='gray', xlab='bootstrapped standard deviation', main='')
noOutlier <- c(x[1:9], x[5])
abline(v=sd(noOutlier), col='red', lwd=3)

[Figure: frequency distribution with the effect of not sampling the outlier marked in red]

The other possibility is that the outlier might get drawn twice in a bootstrap trial. This can be simulated by replacing one of the other data values (here the ninth, 1.61) with the outlier, calculating the standard deviation of that sample, and plotting it as a blue line.

hist(standardDeviation, breaks=50, col='gray', xlab='bootstrapped standard deviation', main='')
outlierTwice <- c(x[1:8], x[10], x[10])
abline(v=sd(outlierTwice), col='blue', lwd=3)

[Figure: frequency distribution with the effect of sampling the outlier twice marked in blue]

That explains our unusual distribution. The left mode comes from bootstrap trials in which the outlier is never drawn. The middle mode results from drawing the outlier once, and the right mode forms from drawing it twice. Drawing it three times happens seldom enough that a distinct mode isn’t produced, but it does extend the right tail of the distribution.
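This explanation can be checked with a short sketch. Each of the ten draws in a trial picks the outlier with probability 1/10, so the number of times it is drawn should follow a binomial distribution. The helper oneTrial() and the variable names below are hypothetical, introduced only for this illustration, and the seed and number of trials are arbitrary.

```r
# Sketch: how often is the outlier drawn, and what does that do to the sd?
set.seed(1)  # assumption: fixed seed so the run is repeatable

x <- c(0.01, 0.25, 0.35, 0.37, 0.68, 0.84, 1.13, 1.25, 1.61, 3.24)
outlier <- max(x)

# Theoretical probability of drawing the outlier 0, 1, 2, or 3 times
# in length(x) draws with replacement
pDrawn <- dbinom(0:3, size=length(x), prob=1/length(x))
names(pDrawn) <- 0:3
round(pDrawn, 3)  # 0: 0.349, 1: 0.387, 2: 0.194, 3: 0.057

# Hypothetical helper: one bootstrap trial that also records
# how many times the outlier was drawn
oneTrial <- function(x, outlier) {
  bootstrappedSample <- sample(x, size=length(x), replace=TRUE)
  c(timesDrawn=sum(bootstrappedSample == outlier), theSd=sd(bootstrappedSample))
}

# replicate() returns a matrix with one column per trial
trials <- replicate(10000, oneTrial(x, outlier))

# Mean bootstrapped standard deviation, grouped by times the outlier was drawn
meanSdByCount <- tapply(trials["theSd", ], trials["timesDrawn", ], mean)
meanSdByCount
```

About 35% of trials miss the outlier entirely, 39% draw it once, and 19% draw it twice, which is consistent with three visible modes. Only about 6% draw it three times, consistent with a longer right tail rather than a fourth mode.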

These results emphasize an important point: outliers can have a strong effect on the results of a bootstrap, especially when sample size is small.

The sole assumption of a bootstrap is that the data are a good reflection of the parent distribution. In this case, one should question that assumption. The best solution might be to collect more data to produce a more representative sample. If that isn’t possible, the results of the bootstrap should be interpreted cautiously, as they are quite sensitive to the outlier.
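One way to gauge that sensitivity is to compare the bootstrap limits with and without the outlier. The sketch below assumes a 95% percentile interval and uses a hypothetical helper, bootSdLimits(), that is not part of the class example. It is meant as a diagnostic only, not as a suggestion that the outlier should be discarded.

```r
# Sketch: sensitivity of the bootstrap limits to the outlier.
set.seed(1)  # assumption: fixed seed so the run is repeatable

x <- c(0.01, 0.25, 0.35, 0.37, 0.68, 0.84, 1.13, 1.25, 1.61, 3.24)

# Hypothetical helper: 95% percentile limits on the standard deviation
bootSdLimits <- function(x, nTrials=10000) {
  bootSd <- replicate(nTrials, sd(sample(x, size=length(x), replace=TRUE)))
  quantile(bootSd, probs=c(0.025, 0.975))
}

withOutlier <- bootSdLimits(x)
withoutOutlier <- bootSdLimits(x[x < 3])  # drops only the value 3.24

# Side-by-side comparison of the two intervals
rbind(withOutlier, withoutOutlier)
```

If the two intervals differ substantially, a single observation is driving the result, which is a strong hint that more data are needed.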

A tip o’ the hat to Kelly Cronin and her persistent curiosity.

Comments

Comments or questions? Contact me at stratum@uga.edu