Steven Holland

Problem Set 8: Bootstrap

This problem set is optional and will be treated as up to 10 points of extra credit. It is due by 2:00 PM, Tuesday, 28 October.

When we studied central tendency, we saw two common limitations. The first is that the t-test needs the statistic (the mean) to be normally distributed, but the central limit theorem may not guarantee this when the sample size is small and the parent distribution is non-normal. The second is that the Mann-Whitney U-test on the median does not generate a confidence interval.

Resampling solves both problems. With the bootstrap and the jackknife, we can create confidence intervals on the mean and the median that reflect how the data are actually distributed, which is what you will do in this problem set. You will also build functions that perform a bootstrap or jackknife on any statistic calculated from a single variable, and you will learn a common technique for reusing functions you create, one you might apply in your course project.

Part 1: Write a one-variable bootstrap function

Building on the code given in the resampling lecture for the generalized one-variable bootstrap, write a function that performs all of these steps, so that you can create a bootstrapped estimate and confidence interval on a statistic in one line of code. Here are the constraints:

Test your function to make sure it works before proceeding. One way to do this is to make a vector of normally distributed values and bootstrap its mean. The estimate and confidence limits should be similar to what t.test() produces. In addition, try to bootstrap a different statistic (median, min, max, sd, var, etc.) to see if you get reasonable values. Try a different number of iterations and try a different significance level. Do not turn in these tests; they are solely for you to convince yourself that your function works.
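
The check described above can be sketched as follows. This is a minimal percentile-bootstrap sketch, not the code from the lecture: the function name bootstrapSketch, its argument names and defaults, and the shape of the returned list are all assumptions made for illustration, and the simulated data are arbitrary.

```r
# Hypothetical percentile bootstrap. The name, arguments, defaults, and
# return value are assumptions, not the function specified in the lecture.
bootstrapSketch <- function(x, FUN = mean, iterations = 10000, alpha = 0.05) {
  # Recompute the statistic on samples drawn with replacement,
  # each the same size as the original data
  replicates <- replicate(iterations,
                          FUN(sample(x, size = length(x), replace = TRUE)))
  # Percentile confidence limits taken from the bootstrap distribution
  limits <- quantile(replicates, probs = c(alpha / 2, 1 - alpha / 2),
                     names = FALSE)
  list(estimate = FUN(x), lowerCL = limits[1], upperCL = limits[2])
}

set.seed(1)                          # makes the simulated example repeatable
x <- rnorm(50, mean = 10, sd = 2)    # arbitrary normally distributed test data
bootResult <- bootstrapSketch(x)
tResult <- t.test(x)

bootResult        # bootstrapped estimate and confidence limits
tResult$conf.int  # should be similar for normally distributed data
```

Calling bootstrapSketch(x, FUN = median, iterations = 5000, alpha = 0.01) would exercise a different statistic, number of iterations, and significance level in the same way.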

Part 2: Write a one-variable jackknife function

Building on the code given in the resampling lecture for the jackknife, write a function that performs all of these steps, so that you can create a jackknifed estimate and confidence interval on a statistic in one line of code. Follow the same constraints as you did for the bootstrap, except that:

Test your function, but don’t turn in your test code.
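
One way to test a jackknife function is sketched below. It assumes the pseudovalue form of the jackknife with a t-based confidence interval, which may differ from the code in the lecture; the function name jackknifeSketch, its arguments, its defaults, and the test values are hypothetical.

```r
# Hypothetical pseudovalue jackknife. The name, arguments, defaults, and
# return value are assumptions, not the function specified in the lecture.
jackknifeSketch <- function(x, FUN = mean, alpha = 0.05) {
  n <- length(x)
  observed <- FUN(x)
  # Statistic recomputed with each observation left out in turn
  leaveOneOut <- sapply(seq_len(n), function(i) FUN(x[-i]))
  # Pseudovalues: one bias-corrected value of the statistic per observation
  pseudovalues <- n * observed - (n - 1) * leaveOneOut
  estimate <- mean(pseudovalues)
  standardError <- sd(pseudovalues) / sqrt(n)
  tCritical <- qt(1 - alpha / 2, df = n - 1)
  list(estimate = estimate,
       lowerCL = estimate - tCritical * standardError,
       upperCL = estimate + tCritical * standardError)
}

x <- c(2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8)   # made-up test values
jackResult <- jackknifeSketch(x)
tResult <- t.test(x)

jackResult
tResult$conf.int
```

Under this formulation, the pseudovalues for the mean reduce to the original observations, so the jackknifed mean and its confidence limits should agree with t.test() to within rounding error, which makes the mean a convenient test case.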

Part 3: Create an .R file for your functions

Create a text file named “hollandResampling.R”, substituting your last name for mine. Paste in your bootstrap function, then your jackknife function.

Always document your functions with comments. Add the following comments before each of your functions, one line per item listed below:

If you use code that you did not write, you must attribute it; otherwise, you are committing plagiarism. Put the following comment at the top of the file, followed by a blank line: Code for the jackknife() and bootstrap() functions is used with permission from Steven Holland, http://stratigrafia.org/8370/lecturenotes/resampling.html

Putting your functions into a source file like this is good practice because it promotes code reuse and simplifies your code. When you need to use these functions, put the file in your working directory, then call source("myFunctions.R"), substituting the name of your file. If you create many different types of functions, group them into files by purpose, such as “plots.R”, “regressionUtilities.R”, etc.
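
The workflow can be sketched with a throwaway file. The file name exampleFunctions.R and the function rangeWidth() are hypothetical, and the file is written to a temporary directory only so that the example is self-contained; in practice the file would sit in your working directory.

```r
# Write a small function file (hypothetical name and contents)
functionFile <- file.path(tempdir(), "exampleFunctions.R")
writeLines(c(
  "# Returns the difference between the largest and smallest values of x",
  "rangeWidth <- function(x) {",
  "  max(x) - min(x)",
  "}"
), functionFile)

# source() runs the file, which defines rangeWidth() in the current session
source(functionFile)
rangeWidth(c(2, 5, 11))
```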

Part 4: Use your functions

What follows is the only code that will go in your commands file. In other words, your commands file will not include anything from Parts 1–3.

Import the trilobites4 data set, and assign it to an object called trilobites. Note that this data set has a column called species, which should be imported as a factor. Next, use the appropriate command for viewing the structure, then run the appropriate command for letting us bypass dollar-sign notation. Remember to undo this command as the final step of your problem set. Last, run the command that lets you see the unique values of species. Skip a line in your code. We will judiciously add extra lines as we go to help delineate related blocks of code.
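
As a rough sketch of this sequence, the example below builds a made-up stand-in file, because the name and format of the real data file are not given here. The file name, the comma-separated format, the second species name, and every value are assumptions; only the column names species and sigma and the name D. gracilis come from this problem set.

```r
# Made-up stand-in for the data file; all values are hypothetical
exampleFile <- file.path(tempdir(), "exampleTrilobites.csv")
write.csv(data.frame(species = c("D. gracilis", "D. gracilis", "Species B"),
                     sigma = c(1.2, 3.4, 2.1)),
          exampleFile, row.names = FALSE)

# stringsAsFactors = TRUE imports the text column species as a factor
trilobites <- read.csv(exampleFile, stringsAsFactors = TRUE)
str(trilobites)       # view the structure
attach(trilobites)    # columns can now be used without dollar-sign notation
levels(species)       # the unique values of the factor

detach(trilobites)    # undo attach() as the final step
```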

Visualize the data by plotting sigma as a function of species. Use the plus sign as your plotting symbol (better than a filled circle when some points are so similar), rotate the y-axis labels, and do not plot the frame. Use par() and its argument mar to increase the size of the left margin so that the labels are not clipped. The units of sigma are centimeters (cm). The plot should be 4" tall and 6" wide. This is plot 1.
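
One plausible reading of these plot instructions is sketched below with stripchart(), which puts the species names along the y-axis. The choice of stripchart(), the margin values, the output file name, and the made-up data are all assumptions; the plotting function expected in the course may differ.

```r
# Made-up stand-in data; the second species name and all values are hypothetical
trilobites <- data.frame(
  species = factor(c("D. gracilis", "D. gracilis", "D. gracilis",
                     "Species B", "Species B")),
  sigma = c(1.1, 1.3, 2.9, 2.0, 2.2)
)

plotFile <- file.path(tempdir(), "plot1Sketch.pdf")   # hypothetical file name
pdf(plotFile, width = 6, height = 4)    # dimensions are in inches
par(mar = c(5, 8, 2, 2), las = 1)       # wider left margin; horizontal axis labels
stripchart(sigma ~ species, data = trilobites,
           pch = 3,                     # plus sign as the plotting symbol
           frame.plot = FALSE,          # no box around the plot
           xlab = "sigma (cm)")
dev.off()
```

The second element of mar is the left margin, in lines of text; the default is about 4, so a value such as 8 leaves room for long species names.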

D. gracilis has a right-skewed distribution, fails a Shapiro-Wilk test for normality, and has such a small sample size that the normality assumption of a t-test is likely not valid, so we will focus on it. Make a logical vector called gracilis that tests every value of species for whether it matches “D. gracilis” so that we can easily retrieve any values for this species.
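
A logical vector of this kind can be sketched on made-up values as follows; the second species name and the sigma values are hypothetical.

```r
# Made-up stand-in data; only the name "D. gracilis" comes from the problem set
species <- factor(c("D. gracilis", "D. gracilis", "Species B", "D. gracilis"))
sigma <- c(1.1, 1.3, 2.0, 2.9)

# TRUE wherever species matches, FALSE elsewhere
gracilis <- species == "D. gracilis"

# Indexing with the logical vector retrieves only the matching values
sigma[gracilis]
```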

Use source() to import your resampling functions.

Perform a t-test on sigma of D. gracilis to create an estimate of the sample mean and its confidence interval. Even though the t-test is likely invalid here because its normality assumption is not met, you will use its estimate and confidence interval as a baseline for evaluating your resampling-based confidence intervals.
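
The estimate and confidence interval can be read directly from the object that t.test() returns, as in this sketch on made-up values (the vector gracilisSigma is hypothetical and stands in for the real measurements).

```r
# Made-up stand-in for the D. gracilis sigma values
gracilisSigma <- c(1.1, 1.3, 1.2, 2.9, 1.6, 1.4)

# One-sample t-test; the default confidence level is 95%
tResult <- t.test(gracilisSigma)

tResult$estimate   # the sample mean
tResult$conf.int   # lower and upper confidence limits
```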

Use your function to calculate a bootstrapped estimate and confidence interval on the mean value of sigma for D. gracilis. Your bootstrap should be based on 10,000 iterations and a 95% confidence level, but only set the arguments that are not the defaults.

Use your function to calculate a jackknifed estimate and confidence interval on the mean value of sigma for D. gracilis. Your jackknife should be at the 95% confidence level, but only set an argument if it is not the default.

Question 1: How do the three estimates of the mean compare?

Question 2: How do the three sets of confidence intervals compare?

Question 3: Given the assumptions of the three approaches, which confidence interval(s) should be used?

Use your function to calculate a bootstrapped estimate and confidence interval on the median value of sigma for D. gracilis, following the same guidelines you used for the mean.

Use your function to calculate a jackknifed estimate and confidence interval on the median value of sigma for D. gracilis, following the same guidelines you used for the mean.

Question 4: How do these estimates of the median compare to those for the mean? What is the very simple explanation for this?

Question 5: What do you notice about the widths of the confidence intervals on the median versus the ones for the mean?

Submitting your problem set

Quit R. Run your commands file in a fresh R session. Inspect the results and your PDF plot to make sure everything is correct. I have structured this problem set in a way that it will likely produce errors if you skip this important step; checking your code like this should always be done before you stop work for the day, and especially before you share code with anyone.

I will also run your resampling functions on a data set and a statistic that are unknown to you to verify that they work correctly in any situation.

Format your commands file and function file following the standard instructions. Email both .R files to stratum@uga.edu. The subject of your email should be 8370 problem set 8. Do not send me the data file or your plot file. This optional problem set is due by 2:00 PM, Tuesday, 28 October. Note the earlier time.