Problem Set 5: Central Tendency

Two temperature sensors were connected to a Raspberry Pi, and the temperature value (in degrees Centigrade) was read every ten minutes for a little over 6 days. The sensors are of the same type and from the same manufacturer. They were located a few inches apart, so they should have produced similar readings. Although individual readings might disagree by a small amount owing to small-scale temperature variations and instrument-related factors, there is no reason to expect one sensor to give generally higher readings than the other. In this problem set, you will evaluate whether the two sensors give consistently similar values. You will do this graphically and statistically, showing how you can approach a problem in a couple of ways.

Starting with this problem set, I will ask you questions about what the plots and statistical analyses indicate. Answer each question on its own line as a comment, beginning with its label (e.g. # Question 1:, # Question 2:, etc.). This format will also apply in all remaining problem sets, although I will not repeat these instructions. Here’s an example of how an answer should be formatted:

# Question 1: It’s not who I am underneath, but what I do that defines me.

For every answer you give, your answer should be able to stand on its own and not need the question to understand the answer. For example, if I asked, “What color is the sky?”, you would not say, “It’s blue.” because that answer makes sense only if you know the question. Instead, you would say, “The sky is blue.” That answer stands on its own.

You are aiming for specificity and conciseness in your answers: be clear but not wordy. Each answer should be one sentence of no more than 200 characters (counting spaces) unless I specify otherwise.

Remember that all plots should be saved as PDF files, following the naming conventions established in problem set 4.

Part 1

First, read the data file “temperatureSensors.csv” and assign it to an object called sensors. It is often helpful to give the object a name similar to the data file like we’re doing here.

Use str() to verify that the data were successfully imported and to see the structure of the data.

Because this is the only data set you will work with, it is safe and simpler to use the variables directly without dollar-sign notation. Use attach() to allow this.
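As a rough sketch of these three steps, the example below writes a small made-up data frame to a temporary CSV file and reads it back, because the real data file is not reproduced here. The column names decimalDay, sensorA, and sensorB are the variable names used in this problem set; every value, the object name fake, and the temporary path are assumptions for illustration only.

```r
# Synthetic stand-in for temperatureSensors.csv; every value is made up.
fake <- data.frame(
  decimalDay = c(0.000, 0.007, 0.014, 0.021),
  sensorA    = c(21.2, 21.3, 21.3, 21.4),
  sensorB    = c(21.3, 21.3, 21.5, 21.4)
)
csvPath <- tempfile(fileext = ".csv")   # hypothetical location
write.csv(fake, csvPath, row.names = FALSE)

# The steps described above, applied to the stand-in file.
sensors <- read.csv(csvPath)
str(sensors)        # shows the number of observations and each column's type
attach(sensors)     # columns can now be used without sensors$ in front
meanA <- mean(sensorA)
detach(sensors)     # always undo attach() when finished
```

With your own data, only the read.csv(), str(), and attach() lines are needed, using the name of the data file in place of csvPath.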

Part 2

The first rule of data analysis is to visualize the data, and that is where we will start. Seeing the data will nearly always help you understand the task better. You will want to plot the readings from each sensor through time (decimalDay), that is, with time on the x-axis. For time series that are relatively short (few observations), it is common to plot the data points along with a line connecting those symbols (type="b" or type="o"). For time series like this that are relatively long, the symbols make the plot cluttered and are commonly omitted. You will plot only a line connecting the observations (type="l"). This will be plot 1.

Plots of time series often look better if they are wider than tall. Make the plot 4 inches tall and 8 inches wide. Refer to the R Tutorial for how to do this, as the help page is not very helpful in this case.

Our approach will be to plot one sensor against time, then use the points() command to add the other data series. To ensure that all points are plotted, you need to find the total range of data in the combined data series. So, in one line of code, combine the two sets of sensor readings into a single vector with c(), calculate the range of the data, and assign it to an object called limits.
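To see why the two series must be combined before taking the range, here is a minimal illustration with two tiny made-up vectors (the numbers are invented and are not the sensor data).

```r
# Two short, invented series standing in for the sensor readings.
sensorA <- c(21.4, 22.0, 23.1)
sensorB <- c(21.9, 22.6, 23.5)

# c() joins the vectors end to end; range() returns the minimum and maximum.
limits <- range(c(sensorA, sensorB))
limits   # 21.4 23.5
```

The minimum comes from one vector and the maximum from the other, so using the range of either vector alone would cut off part of the other series.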

Now, create the plot in two lines of code, using those limits for the y-axis. Use a “blue” line for sensorA and a “dodgerblue” line for sensorB. For the two axes, give simple meaningful labels that include the units in parentheses. Rotate the y-axis labels. Hint: create one time series in the plot() command and add the other with the points() command.
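The sketch below shows one possible arrangement of these pieces on synthetic data. It assumes the plot is sized by passing height and width to pdf(), which may differ from the method in the R Tutorial; the file name, the axis labels, and all of the data values are invented for illustration.

```r
# Synthetic stand-in for the sensor data; every number here is made up.
set.seed(1)
decimalDay <- seq(0, 6, by = 1 / 144)         # one reading every ten minutes
daily <- 20 + 3 * sin(2 * pi * decimalDay)    # invented daily cycle
sensorA <- daily + rnorm(length(decimalDay), sd = 0.1)
sensorB <- daily + 0.5 + rnorm(length(decimalDay), sd = 0.1)  # arbitrary offset
limits <- range(c(sensorA, sensorB))

plotFile <- file.path(tempdir(), "examplePlot1.pdf")  # hypothetical file name
pdf(plotFile, height = 4, width = 8)          # dimensions are in inches

# The two plotting lines: plot() draws the first series and sets up the axes,
# and points() with type = "l" adds the second series as a line.
plot(decimalDay, sensorA, type = "l", col = "blue", ylim = limits,
     xlab = "Time (days)", ylab = "Temperature (deg C)", las = 1)
points(decimalDay, sensorB, type = "l", col = "dodgerblue")

invisible(dev.off())                          # close the file so it is written
```

Here las = 1 draws the y-axis tick labels horizontally, and ylim = limits keeps both series fully inside the plot.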

Use the text() function to place two labels, “sensor A” and “sensor B”. The color of each label should match the color of its data series; otherwise, use the default values for text(). We will use direct labeling, so find a spot for each label near its data series. Although we generally want to avoid hardcoded values and prefer named constants, it is okay to use hardcoded values here for the x and y coordinates. One way to find the coordinates of a point is to use locator(1) and click on a point on the graph. If you do this, be sure to use a reasonable number of significant figures. You want these labels to be close to the time series but avoid crowding it.
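As an illustration of direct labeling, the sketch below uses two invented straight-line series so that the label positions are easy to check by eye. The coordinates passed to text() suit only this made-up data; for the real plot you would choose your own, for example with locator(1). The file name is hypothetical.

```r
# Invented data: two parallel straight lines, purely for illustration.
decimalDay <- seq(0, 6, by = 1 / 144)
sensorA <- 20 + decimalDay
sensorB <- sensorA + 2                        # arbitrary offset
limits <- range(c(sensorA, sensorB))

plotFile <- file.path(tempdir(), "exampleLabels.pdf")  # hypothetical file name
pdf(plotFile, height = 4, width = 8)
plot(decimalDay, sensorA, type = "l", col = "blue", ylim = limits,
     xlab = "Time (days)", ylab = "Temperature (deg C)", las = 1)
points(decimalDay, sensorB, type = "l", col = "dodgerblue")

# text(x, y, label): hardcoded coordinates, each label colored like its line.
text(4, 23.0, "sensor A", col = "blue")         # just below the sensorA line
text(2, 25.0, "sensor B", col = "dodgerblue")   # just above the sensorB line
invisible(dev.off())
```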

Question 1: It is clear that each sensor shows temperature changes over time, but at any given time, do the sensors report the same temperature or different temperatures? State what you observe in the plot. For example, if one sensor gives higher readings than the other, specify the sensor that gives those higher readings and the approximate amount (eyeball this).

Part 3

You would likely want to statistically evaluate any differences in the mean temperature of the two sensors. This calls for a t-test because the sample size is large enough to invoke the central limit theorem, which assures that sample means will follow an approximately normal distribution, an assumption of the t-test.

There are two approaches you could use for applying a t-test. The first approach takes all values from one sensor and compares them to all values from the other sensor, effectively scrambling any order to the readings. This is called an unpaired t-test, which is the default behavior of the t.test() command. The second approach compares the values at each point in time, preserving the pairs of observations rather than scrambling them. This is called a paired t-test, and to use it, you set the paired argument of t.test() to TRUE. Use the appropriate command and test for differences in the mean value of the two sensors.
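For reference, the two forms of the call look like this on a pair of generic, randomly generated vectors. The names x and y, the seed, and the distribution parameters are arbitrary choices for illustration and say nothing about the sensor data.

```r
# Generic made-up vectors of equal length; not the sensor data.
set.seed(42)
x <- rnorm(30, mean = 10, sd = 2)
y <- x + rnorm(30, mean = 0.5, sd = 0.3)

unpaired <- t.test(x, y)                 # default: paired = FALSE
paired   <- t.test(x, y, paired = TRUE)  # compares x[i] with y[i]

unpaired$method   # "Welch Two Sample t-test"
paired$method     # "Paired t-test"
paired$parameter  # degrees of freedom: number of pairs minus one
```

The first line of the printed output names the test that was run, which is a quick way to confirm that you called the one you intended.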

Examine the output. It is simpler to think about a positive difference in temperatures rather than a negative difference. In other words, it is clearer to say that one sensor gives values X° warmer (not colder) than the other sensor. You would therefore like to have the results of t.test() follow that same convention. If the sign of the “mean difference” in the last line of the output is negative, switch the order of sensorA and sensorB so that this value is positive. Notice that this flips the signs of t and the confidence limits but does not otherwise change the results of the test. If you have to make this change, do not show both calls to t.test(); include only the one that produces a positive mean of the differences. If your t.test() call produced a positive mean of the differences on your first try, you do not need to change anything. Finally, note that the “mean difference” that R reports is exactly the same as the difference of the means; we can use either phrase.
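The effect of switching the order of the arguments can be checked on generic made-up vectors, as in this small sketch. The names x and y and all of the values are arbitrary and are not the sensor data.

```r
# Generic made-up paired data; y is built to be larger than x on average.
set.seed(42)
x <- rnorm(30, mean = 10, sd = 2)
y <- x + rnorm(30, mean = 0.5, sd = 0.3)

xy <- t.test(x, y, paired = TRUE)   # differences are x - y (negative here)
yx <- t.test(y, x, paired = TRUE)   # differences are y - x (positive here)

xy$estimate                          # negative mean difference
yx$estimate                          # same size, positive sign
all.equal(xy$p.value, yx$p.value)    # the p-value does not depend on the order

# The mean of the differences equals the difference of the means.
all.equal(unname(yx$estimate), mean(y) - mean(x))
```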

Question 2: You can think about the output in two ways. The first is to use the p-value to say whether the difference in means is statistically significant. Using that criterion, state whether the means of the two sensors are significantly different, supporting your answer with the appropriate numerical results of the test. Be sure to refer to Stating statistical results before composing your answer.

Question 3: Considering what the null hypothesis is, what does this approach actually let us say? Say this as simply as possible without using jargon. Your answer should be in the form of “The data are (or are not) consistent with <null hypothesis>”, replacing <null hypothesis> with a statement of what the null hypothesis actually is.

Question 4: The second way to look at the output is to focus on the difference in the means and your uncertainty in that difference. State the results of the test this way using the appropriate numerical results of the test. Again, be sure to refer to Stating statistical results before composing your answer, and use a reasonable number of significant figures (based on the number of significant figures in the data).

Question 5: Based on the confidence interval, state whether the results are statistically significant. Your answer should succinctly explain why you came to this conclusion.

Question 6: Which of these two approaches tells you more about the behavior of these two sensors, and so is the better one to include in a report? Explain in 1–3 sentences, using fewer than 450 characters total.

Part 4

Let’s try another approach to this problem. This time, you will focus explicitly on the differences in the sensor readings.

First, let’s visualize the difference between the two sensors. Create a vector called difference, which is the difference in the readings of the two sensors. To be consistent with how we have set up this problem so far, calculate difference such that the values are generally positive numbers. Do this simply; do not just use a negative sign to change the sign. Your vector should have the same length as sensorA and sensorB.

Make a new plot with the same dimensions as your first plot; this will be plot 2. In it, plot difference against decimalDay. As before, use only a line to connect the values, and do not show the individual points. Use the default color for the line. Give meaningful names and units for your axes, and be consistent with plot 1 where possible.

To help us see the mean value, use abline() to add a horizontal red line of three times the normal width, with the line drawn at the mean value of difference.
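A minimal sketch of this kind of plot, on synthetic data, might look like the following. The data are invented so that sensorB reads higher, which is an arbitrary choice and not a statement about the real sensors; the file name and axis labels are also assumptions.

```r
# Synthetic stand-in for the sensor data; every number here is made up.
set.seed(2)
decimalDay <- seq(0, 6, by = 1 / 144)
daily <- 20 + 3 * sin(2 * pi * decimalDay)
sensorA <- daily + rnorm(length(decimalDay), sd = 0.1)
sensorB <- daily + 0.5 + rnorm(length(decimalDay), sd = 0.1)  # arbitrary offset

# Subtract in the order that makes the values generally positive.
difference <- sensorB - sensorA

plotFile <- file.path(tempdir(), "examplePlot2.pdf")  # hypothetical file name
pdf(plotFile, height = 4, width = 8)
plot(decimalDay, difference, type = "l",
     xlab = "Time (days)", ylab = "Temperature difference (deg C)", las = 1)

# h = draws a horizontal line at that y value; the default lwd is 1.
abline(h = mean(difference), col = "red", lwd = 3)
invisible(dev.off())
```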

Question 7: Does plot 2 appear to agree with the results of your t-test? Explain using the appropriate numerical values from plot 2 and from your t.test() command.

Part 5

The t.test() function can also be used to give a confidence interval on the mean of a single vector. Doing this will also compare the mean to a null hypothesis of zero. Your goal is to calculate the average difference in the two time series plus a confidence interval on the difference to express the uncertainty. Run the t.test() command on difference.
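As a sketch of what a one-sample call returns, the example below runs t.test() on a single made-up vector and pulls out the pieces discussed above. The vector d, the seed, and its distribution are arbitrary and are not the sensor differences.

```r
# One made-up vector; not the sensor differences.
set.seed(7)
d <- rnorm(50, mean = 0.3, sd = 0.2)

result <- t.test(d)    # one-sample t-test; the default null hypothesis is mu = 0

result$null.value      # the hypothesized mean being tested (0)
result$estimate        # the sample mean of d
result$conf.int        # 95% confidence interval on the mean (the default level)
result$p.value         # p-value for the test against the null value
```

Printing result shows the same quantities in the usual formatted output; extracting them by name is convenient when you need to quote specific values.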

Question 8: How do the mean value, the output line with the p-value, and the output line with the 95% confidence interval, compare with the t-test you ran in Part 3? Be specific, but state your answer simply.

Question 9: What do you conclude about these two approaches to problems like this?

Part 6

The statistical approach you have used addresses statistical significance, but there is also another meaning of significance to consider, which is scientific importance. Importance is always relative to the application of the data, for example, what the temperature sensors are to be used for. Always remember that when someone says results are “significant”, they usually mean in the statistical sense, and they may or may not realize that this has nothing to do with importance. You should always ask whether the results are important for the problem at hand.

Question 10: In two sentences (350 characters maximum), give an example where the difference in mean temperature of these two sensors would be scientifically significant (important or substantial) and another where the difference would be scientifically insignificant (unimportant or inconsequential).

Finally, always remember that if you have called attach(), you must call detach() when you finish your work with that data. In the future, I may not explicitly ask you to call detach(), but you must remember to do so. Failing to do this can create a confusing raft of problems.

Submitting your problem set

Format your commands file following the standard instructions. E-mail your commands file to stratum@uga.edu. The subject of your email should be 8370 problem set 5. Do not send me the data file, as I have it already. This problem set is due 10 October.

Data Analysis in the Geosciences

GEOL 8370