Two temperature sensors were connected to a Raspberry Pi, and the temperature value (in degrees Celsius) was read every ten minutes for about 6 days. The sensors are the same type from the same manufacturer. They were located a few inches apart, so they should have produced similar readings. Although individual readings might disagree by a small amount owing to small-scale temperature variations and instrument-related factors, there is no reason to expect one sensor to give generally higher readings than the other. In this problem set, you will evaluate whether the two sensors give consistently similar values. You will do this graphically and statistically, showing how you can approach a problem in a couple of ways.
Plots in this and subsequent problem sets should be saved as PDF files, following the naming conventions introduced in problem set 4.
Starting with this problem set, I will ask you questions about what the plots and statistical analyses indicate. Answer each question on its own line as a comment, beginning with its label (e.g. # Question 1:, # Question 2:, etc.). This format will also apply in all remaining problem sets, although I will not repeat these instructions. Here’s an example of how an answer should be formatted:
# Question 1: It’s not who I am underneath, but what I do that defines me.
For every answer you give, your answer should be able to stand on its own and not need the question to understand the answer. For example, if I asked, “What color is the sky?”, you would not say, “It’s blue.” because that answer makes sense only if you know the question. Instead, you would say, “The sky is blue.” That answer stands on its own.
You are aiming for specificity and concision in your answers: be clear but not wordy. Each answer should be one sentence of no more than 200 characters (counting spaces) unless I specify otherwise.
Last, your code should work on any pair of temperature sensors spanning any amount of time, so be sure that you do not hard-code any numerical constants, except where I explicitly state that you can use hard-coded values.
First, import the data file “temperatureSensors.csv” and assign it to an object called sensors. It is helpful to give the object a name that succinctly captures what is in the file; often, this might be similar to the name of the data file, assuming that the file name was well chosen.
Use the best command to verify that the data were imported correctly and to see the structure of the data.
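As a rough illustration of the import-and-check step, the sketch below writes a tiny made-up table to a temporary file, reads it back, and inspects it. The toy values, the object names toy, toyFile, and sensorsToy, and the use of read.csv() and str() are assumptions for illustration only; the column names are the ones used in this problem set.

```r
# Toy stand-in for the real file; these values are made up for illustration.
toy <- data.frame(decimalDay = c(0.000, 0.007, 0.014),
                  sensorA    = c(21.4, 21.5, 21.3),
                  sensorB    = c(21.1, 21.2, 21.0))
toyFile <- tempfile(fileext = ".csv")
write.csv(toy, toyFile, row.names = FALSE)

# Import the comma-delimited file, then check the column names, types,
# and the first few values of each column.
sensorsToy <- read.csv(toyFile)
str(sensorsToy)
```

Checking the structure right after import catches common problems early, such as a numeric column read as text or a header row read as data.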
Because this is the only data set you will work with, it is safe and simpler to use the variables directly without dollar-sign notation. Use attach() to allow this, and be sure to undo this at the end of your code.
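A minimal sketch of the attach() and detach() pattern, using a hypothetical toy data frame (toy and meanA are illustrative names, and the values are made up):

```r
# Hypothetical toy data frame; values are made up for illustration.
toy <- data.frame(decimalDay = c(0, 0.5, 1),
                  sensorA    = c(20.1, 22.3, 19.8))

attach(toy)               # columns can now be used by name
meanA <- mean(sensorA)    # no toy$sensorA needed
detach(toy)               # undo the attach when finished

meanA
```

Pairing every attach() with a detach() keeps column names from lingering on the search path and masking other objects later in the session.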
The first rule of data analysis is to visualize the data, and that is where we will start. Seeing the data will nearly always help you understand the task better. The following will be plot 1.
To compare the sensors, it will help to see their values through time (decimalDay), that is, with time on the x-axis. For time series that are relatively short (few observations), it is common to plot the data points along with a line connecting those symbols (type="b" or type="o"). For longer time series like these data, symbols are usually omitted because they clutter the plot; do this by specifying type="l".
Plots of time series often look better if they are wider than tall. In your code, make the plot 4 inches tall and 8 inches wide.
Because we want to compare two sensors, it will be helpful to show them on the same plot. The easiest way to do this is to use plot() to show one sensor and add the second sensor with points(). However, we need to be sure that the y-axis will encompass the values for both sensors. So, in one line of code, combine the two sets of sensor readings into a single vector with c(), calculate the range of the data, and assign it to an object called tempRange.
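To see why the combined range matters, here is a small sketch with made-up readings (toyA, toyB, and toyRange are hypothetical names). The range of toyA alone would run from 20.9 to 22.0 and would cut off the lowest value of toyB.

```r
# Made-up readings for two hypothetical series
toyA <- c(21.4, 22.0, 20.9)
toyB <- c(21.1, 21.7, 20.5)

# Concatenate both series, then take the minimum and maximum of the result
toyRange <- range(c(toyA, toyB))
toyRange
```

The result is a two-element vector (minimum, maximum), which is the form that the ylim argument of plot() expects.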
Now, create the plot showing both sensors in two lines of code, using those limits for the y-axis. Use a “darkorange1” line for sensorA and a “darkorange3” line for sensorB. For the two axes, give simple meaningful labels that include the units in parentheses. Rotate the y-axis labels.
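The general pattern of sizing a PDF, drawing one series with plot(), and adding a second with points() might look like the sketch below. It uses simulated data, arbitrary colors, and a temporary file name, all of which are assumptions for illustration; your own plot must follow the colors, labels, and file-naming conventions specified in this problem set.

```r
# Simulated stand-in data: two series that share a daily cycle
set.seed(1)
toyDay <- seq(0, 2, length.out = 200)
toyA <- 20 + 3 * sin(2 * pi * toyDay) + rnorm(200, sd = 0.1)
toyB <- toyA - 0.3 + rnorm(200, sd = 0.1)

# y-axis limits that cover both series
toyRange <- range(c(toyA, toyB))

# Open a PDF device that is wider than tall (dimensions are in inches)
toyPlotFile <- tempfile(fileext = ".pdf")
pdf(toyPlotFile, width = 8, height = 4)

# First series sets up the axes; las = 1 rotates the y-axis tick labels
plot(toyDay, toyA, type = "l", ylim = toyRange, col = "steelblue",
     xlab = "time (days)", ylab = "temperature (degrees C)", las = 1)

# Second series is added to the existing plot as a line
points(toyDay, toyB, type = "l", col = "gray40")

dev.off()
```

Passing ylim to plot() is what keeps the second series from being clipped, because points() never rescales the axes.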
Use the text() function to place separate labels that read “sensor A” and “sensor B”. The color of each label should match the color of its data series; otherwise, use the default values for text(). We will use direct labeling, so find a spot for each label near its data series. Although we generally want to avoid hardcoded values and prefer named constants, it is okay to use hardcoded values here for the x and y coordinates. One way to find the coordinates of a point is to use locator(1) and click on a point on the graph. If you do this, be sure to use a reasonable number of significant figures. Your goal is to place each label close to its time series but not so close that it crowds the line; this may take some trial and error until you find a placement that looks good to your eye. Once you’ve done this, your plot is complete.
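Direct labeling with text() can be sketched as follows. The curves, colors, label wording, and coordinates here are all made up for illustration; the coordinates were chosen by eye to sit just above one curve's peak and just below the other's trough.

```r
# Two made-up curves to label
x <- seq(0, 2, length.out = 100)
y1 <- sin(2 * pi * x)
y2 <- y1 - 0.4

labelFile <- tempfile(fileext = ".pdf")
pdf(labelFile, width = 8, height = 4)
plot(x, y1, type = "l", col = "steelblue", ylim = c(-1.8, 1.4),
     xlab = "x", ylab = "y", las = 1)
points(x, y2, type = "l", col = "gray40")

# text(x, y, label): hardcoded coordinates, label color matches its curve
text(0.25, 1.2, "series 1", col = "steelblue")
text(0.75, -1.6, "series 2", col = "gray40")
dev.off()
```

Because the labels sit next to their curves and share their colors, no separate legend is needed.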
Question 1: The temperatures reported by each sensor change over time, but at any given time, do the sensors report the same temperature or different temperatures? State what you observe in the plot. Be specific; do not just say that they are different. For example, if one sensor gives higher readings than the other, specify the sensor that gives those higher readings and the approximate amount (eyeball this).
You should statistically evaluate any differences in the mean temperature of the two sensors, and there are a couple of ways of accomplishing this. Because we are interested in the mean values of the sensors, the t-test is the most common approach. It is a valid approach because the sample sizes are large enough to invoke the central limit theorem, which assures that sample means will follow a normal distribution, the assumption of the t-test. (People commonly think, incorrectly, that the t-test requires the data to be normally distributed. It does not.)
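The central limit theorem can be checked with a quick simulation. The sketch below is an illustration only, not part of the assignment: it draws many samples from a strongly right-skewed exponential distribution and shows that the sample means are nearly symmetric and have the spread the theorem predicts. The sample sizes, the seed, and the helper function skewness() are assumptions chosen for this example.

```r
set.seed(8370)
nPerSample <- 500    # observations in each sample
nSamples   <- 2000   # number of repeated samples

# Each column is one sample from an exponential distribution (mean 1, sd 1)
draws <- matrix(rexp(nPerSample * nSamples, rate = 1), nrow = nPerSample)
sampleMeans <- colMeans(draws)

# Simple moment-based skewness; 0 indicates a symmetric distribution
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skewRaw   <- skewness(draws[, 1])    # one raw sample: strongly skewed
skewMeans <- skewness(sampleMeans)   # sample means: close to symmetric

sdMeans    <- sd(sampleMeans)        # observed spread of the sample means
expectedSd <- 1 / sqrt(nPerSample)   # spread predicted by the theorem

c(skewRaw = skewRaw, skewMeans = skewMeans,
  sdMeans = sdMeans, expectedSd = expectedSd)
```

Even though the individual values are far from normal, the means of large samples behave approximately normally, which is what the t-test relies on.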
There are two approaches you could use for applying a t-test. The first takes all values from one sensor and compares them to all values from the other sensor, effectively scrambling any order to the readings. This is called an unpaired t-test, which is the default behavior of the t.test() command. The second approach compares the values at each point in time, preserving the pairs of observations rather than scrambling them. This is called a paired t-test, and to use it, you set the paired argument of t.test() to TRUE. Choose the correct approach to test for differences in the mean value of the two sensors.
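To show how the two calls differ in form, here is a sketch on simulated data (the series, seed, and object names are assumptions for illustration). Both series share a large cycle through time, and one is offset from the other by a small constant plus noise.

```r
set.seed(42)
toyTime  <- seq(0, 3, length.out = 300)
toyCycle <- 20 + 4 * sin(2 * pi * toyTime)       # shared variation through time
toyA <- toyCycle + rnorm(300, sd = 0.1)
toyB <- toyCycle - 0.3 + rnorm(300, sd = 0.1)    # small constant offset

# Unpaired (default): compares all values of toyA with all values of toyB
unpairedTest <- t.test(toyA, toyB)

# Paired: compares toyA[i] with toyB[i] for every i
pairedTest <- t.test(toyA, toyB, paired = TRUE)

unpairedTest$conf.int
pairedTest$conf.int
```

Printing the two confidence intervals side by side shows how much the choice of test can matter when the variation through time is large relative to the difference between the series.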
Examine the output of your test. It is simpler to think about a positive difference in temperatures rather than a negative difference. In other words, it is clearer to say that one sensor gives values X° warmer than the other sensor than it would be to say the readings from one sensor are X° colder than the other. You should set up your t.test() to follow that convention: if the sign of the “mean difference” in the last line of the output is negative, switch the order of sensorA and sensorB so that it is positive. Notice that this has no effect on the other results of the test. If you have to make this change, do not show both calls to t.test(); include only the one that produces a positive mean of the differences. Finally, note that the “mean difference” that R reports is exactly the same as the difference of the means; we can use either phrase.
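The effect of argument order can be seen in a short sketch with simulated values (the data and object names are made up for illustration; a paired test is used here only as an example). Swapping the two vectors flips the sign of the estimated difference and leaves the p-value unchanged.

```r
set.seed(7)
toyA <- rnorm(50, mean = 20.0, sd = 1)
toyB <- toyA + 0.5 + rnorm(50, sd = 0.2)   # toyB reads about 0.5 higher

aFirst <- t.test(toyA, toyB, paired = TRUE)   # estimate is toyA minus toyB
bFirst <- t.test(toyB, toyA, paired = TRUE)   # estimate is toyB minus toyA

# unname() drops the label so only the numbers are compared
unname(aFirst$estimate)
unname(bFirst$estimate)

# The p-value does not depend on the order
c(aFirst$p.value, bFirst$p.value)
```

The bounds of the confidence interval also swap places and change sign, so the interval describes the same uncertainty either way.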
Question 2: You can think about the t-test in two ways. The first is to use the p-value to say whether the difference in means is statistically significant. Using that criterion, state whether the means of the two sensors are significantly different, supporting your answer with the appropriate numerical results of the test. Be sure to refer to Stating statistical results before composing your answer.
Question 3: Considering what the null hypothesis is, what does this approach actually let us say? Say this as simply as possible without using jargon. Your answer should be in the form of “The data are (or are not) consistent with <null hypothesis>”, replacing <null hypothesis> with a statement of what the null hypothesis actually is.
Question 4: The second way to think about the t-test is to focus on the difference in the means and your uncertainty in that difference. State the results of the test this way using the appropriate numerical results of the test. Again, be sure to refer to Stating statistical results before composing your answer, and use a reasonable number of significant figures, that is, based on the number of significant figures in the data.
Question 5: Based on the confidence interval, state whether the results are statistically significant. Your answer should succinctly explain why you came to this conclusion.
Question 6: Which of these two approaches tells you more about the behavior of these two sensors, and so is the better one to include in a report? Explain in 1–3 sentences, using fewer than 450 characters total.
Let’s try another approach to evaluating these two sensors. This time, you will focus explicitly on the differences in the sensor readings.
First, let’s visualize the difference between the two sensors. Create a vector called tempDifference, which is the difference in the readings of two sensors. Because it is comparing every pair of readings, tempDifference should have the same length as the sensorA and sensorB vectors. To be consistent with how we have set up this problem so far, calculate this such that the values are generally positive numbers. Do this simply; do not use a negative sign to reverse the sign.
Make a new plot with the same dimensions as your first plot; this will be plot 2. In it, plot the temperature difference against time. As before, use a line to connect the values, but do not show the individual points. Use the default color for the line. Give meaningful names and units for your axes, and be consistent with plot 1 where possible.
To help visualize the mean value, use abline() to add a horizontal line at the mean value of the temperature difference. The line should be “darkorange2” and three times as wide as normal.
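A sketch of this kind of difference plot, using simulated series, an arbitrary line color, and a temporary file name (all assumptions for illustration; use the color and file name specified for this problem set in your own code):

```r
set.seed(3)
toyDay <- seq(0, 2, length.out = 200)
toyA <- 20 + 3 * sin(2 * pi * toyDay) + rnorm(200, sd = 0.1)
toyB <- toyA - 0.3 + rnorm(200, sd = 0.1)

# Element-wise subtraction: one difference per pair of readings
toyDifference <- toyA - toyB

diffFile <- tempfile(fileext = ".pdf")
pdf(diffFile, width = 8, height = 4)
plot(toyDay, toyDifference, type = "l",
     xlab = "time (days)", ylab = "temperature difference (degrees C)",
     las = 1)

# Horizontal reference line at the mean difference; lwd = 3 triples the width
abline(h = mean(toyDifference), col = "steelblue", lwd = 3)
dev.off()
```

Because the subtraction is element-wise, toyDifference has exactly as many values as toyA and toyB, and the order of subtraction alone determines the sign.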
Question 7: Does plot 2 appear to agree with the results of your t-test? Explain using the appropriate numerical values from plot 2 and from your t.test() command.
The t.test() command can also be used to give a confidence interval on the mean of a single vector. Doing this will also produce a p-value that corresponds to a null hypothesis of the mean being zero. In one line of code, calculate the confidence interval on the mean difference in temperature and the corresponding p-value.
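For reference, a one-sample call to t.test() on a single vector might look like this sketch, which uses simulated differences (toyDifference, oneSample, and the simulated values are assumptions for illustration):

```r
set.seed(11)
toyDifference <- rnorm(100, mean = 0.3, sd = 0.2)   # made-up differences

# With one vector, t.test() tests the null hypothesis that the mean is 0
# (the default value of the mu argument) and reports a 95% confidence interval
oneSample <- t.test(toyDifference)

oneSample$estimate   # sample mean
oneSample$conf.int   # 95% confidence interval on the mean
oneSample$p.value    # p-value for the null hypothesis of a zero mean
```

Printing oneSample itself shows all of these pieces together in the usual t.test() output.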
Question 8: How do the mean value, the output line with the p-value, and the output line with the 95% confidence interval compare with the t-test you ran in Part 3? Be specific, but state your answer simply.
Question 9: What do you conclude about these two approaches to problems like this?
The statistical approach you have used addresses statistical significance, but there is also another meaning of significance to consider, which is scientific importance. Importance is always relative to the application of the data. Always remember that when someone says results are “significant”, they usually mean in the statistical sense, and they may or may not realize that this has nothing to do with importance. You should always ask whether the results are important for the problem at hand.
Question 10: In two sentences (350 characters maximum), give a context in which the difference in mean temperature of these two sensors would be scientifically significant (important or substantial) and another where the difference would be scientifically insignificant (unimportant or inconsequential).
Format your commands file following the standard instructions. Email your commands file to stratum@uga.edu. The subject of your email should be 8370 problem set 5. Do not send me the data file, as I have it already. This problem set is due Monday, 6 October.