Lecture Notes

Steven Holland

Statistics fundamentals

Kinds of data

There are four types of data, listed here in order of increasing usefulness:

Nominal data consist of mutually exclusive and unranked categories, like granite and basalt. Color names like red, blue, and green are another example. Nominal data are also called categorical data or attribute data. On a plot, nominal data can be ordered any way you wish.

Ordinal data are like nominal data, but they are ordered or ranked. The sizes of the steps in the ranking are unequal. Mohs hardness scale is an example of ordinal data, as are metamorphic grade and Mary Droser’s scale of ichnofabric or burrowing intensity. On a plot, ordinal data must be displayed in order.

Interval data are like ordinal data, but the steps or increments are equally sized. Interval data lack a true natural zero, that is, their zero is set as an arbitrary point rather than a fundamental point. For example, the zero point for Celsius is defined as the freezing point of water, but this is arbitrary in that the freezing point of alcohol or mercury could have been chosen as the zero point. Likewise, the zero point in isotopic ratios is based on an arbitrarily selected isotopic standard. Interval data measured on different scales can be converted through multiplication and addition, such as converting between Fahrenheit and Celsius.

Ratio data are like interval data, but they have a natural or fundamental zero point that is not arbitrarily chosen. For example, 0 K (kelvin) and 0°R (Rankine) are based on a fundamental zero point, the cessation of molecular motion. This is a fundamental zero point because temperature cannot be colder than absolute zero. Zero length is likewise a fundamental quantity: the absence of any length. Conversion between ratio data on different scales (e.g., length in cm vs. length in inches) requires only multiplication.
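
A quick sketch of the difference, written in Python (the function names are just for illustration): converting between two interval scales takes a multiplication and an addition, whereas converting between two ratio scales takes only a multiplication.

def celsius_to_fahrenheit(c):
    # interval data: the two scales have different, arbitrary zero points,
    # so conversion requires both multiplication and addition
    return c * 9/5 + 32

def cm_to_inches(cm):
    # ratio data: both scales share the same natural zero,
    # so conversion requires only multiplication
    return cm / 2.54

celsius_to_fahrenheit(100)   # 212.0
cm_to_inches(2.54)           # 1.0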

Interval and ratio data are sometimes called measurement variables. They may be open or closed. Open data are not constrained by the measurement system and are free to vary; length, concentration, and isotopic ratios are examples. Closed data are not free to vary; for example, percentage measurements are constrained to sum to 100%.

Measurement variables may also be described as continuous, where all intermediate values are possible, or discrete, where only certain values (typically integers) are possible. Temperature is an example of a continuous measurement variable, and the number of individuals is an example of a discrete ratio variable (counts have a natural zero).

Populations versus samples

A population is a well-defined set of elements, which may be finite or infinite in size. It is the total collection of objects to be studied, from which a subset is usually selected for study. You will rarely work with the full population.

A sample is a subset of elements drawn from a population. Note that this is not the same as a geological or tissue sample, so be aware that people use the term “sample” in different contexts. Samples may be random, that is, collected without systematic inclusion or exclusion of any part of the population, or they may be biased, that is, collected in a way that systematically includes or excludes part of the population. All statistical analyses assume that samples are random. You must ensure this through your sampling design; if you do not, your conclusions will likely be invalid.

The general problem in statistics is that we are nearly always more interested in the population, but populations are so immense that we can examine only a sample because we are limited by time or money. As a result, our strategy is to collect an unbiased sample and use it to make inferences about the population.

Statistics versus parameters

A parameter is a measurement that describes some aspect of a population, such as the mean or variance. A statistic is the corresponding measurement that describes a sample. An easy way to keep these straight is that parameter and population begin with a p, and statistic and sample begin with an s. Greek letters are generally used for parameters (e.g., σ for population standard deviation), and Roman letters are typically used for statistics (e.g., s for sample standard deviation).

Rephrasing the general problem in statistics, we measure statistics on samples, but we’re interested in the parameters of the population. For example, we may have measured the means of two samples (statistics), but we really want to know how the means of their populations (the parameters) compare. Similarly, we may measure the mean of a sample (the statistic) and use it to estimate the mean of the population (the parameter) and our uncertainty in that estimate. In both cases, the quality of our comparison or our uncertainty estimate will be controlled by the amount of replication. Replication decreases uncertainty and is therefore always a good thing.

Distributions

A frequency distribution describes the frequency with which observations occur across the range of possible values. Often, a frequency distribution will be rescaled to probability by dividing the number of observations in each class by the total number of observations. Frequency distributions are visualized with histograms. Cumulative frequency distributions show the cumulative frequency or probability, beginning at one edge of the distribution and progressing to the other. Expressed as percentages, such cumulative distributions must begin at 0 and end at 100.
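
As an illustrative sketch in Python (using numpy and matplotlib, with simulated data rather than real measurements), the left panel below is a frequency distribution rescaled to probability, and the right panel is the corresponding cumulative distribution climbing from 0 to 100 percent.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)
x = rng.normal(loc=10, scale=2, size=200)   # simulated measurements

fig, (ax1, ax2) = plt.subplots(1, 2)

# frequency distribution rescaled to probability: density=True divides
# each bin count by the total number of observations (and the bin width)
ax1.hist(x, bins=15, density=True)
ax1.set_xlabel("value")
ax1.set_ylabel("probability density")

# cumulative frequency distribution, from 0 to 100 percent
xs = np.sort(x)
cumulative = 100 * np.arange(1, len(xs) + 1) / len(xs)
ax2.plot(xs, cumulative)
ax2.set_xlabel("value")
ax2.set_ylabel("cumulative percent")

plt.show()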

Some common theoretical distributions

The normal or Gaussian distribution is a symmetrical continuous distribution with the familiar bell shape. This distribution arises when a variable is affected by many independent factors whose contributions are additive (not multiplicative). For a normal distribution, 68.3% of observations will fall within one standard deviation of the mean, 95.4% within two standard deviations of the mean, and 99.7% within three standard deviations of the mean.
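
These percentages are easy to verify numerically; a minimal sketch using scipy’s cumulative distribution function for the normal:

from scipy.stats import norm

# probability of falling within k standard deviations of the mean
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sd: {100 * p:.1f}%")   # 68.3%, 95.4%, 99.7%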

The lognormal distribution is an asymmetrical single-humped continuous distribution with a long right tail. Taking the logarithm of the measured variable produces a normal distribution, hence the name. A lognormal distribution arises like a normal distribution, except that the effects of the underlying factors are multiplicative. Lognormal distributions are widespread (e.g., grain size, concentration, body size) and should often be suspected when values below zero are impossible.
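
A short simulation in Python (the parameters are made up) shows both properties: lognormal values are never at or below zero, and their logarithms are normally distributed.

import numpy as np

rng = np.random.default_rng(seed=1)
x = rng.lognormal(mean=0, sigma=0.5, size=10000)   # simulated lognormal data

print(x.min() > 0)   # True: values at or below zero are impossible

# taking logarithms recovers a normal distribution with the parameters
# used above (mean near 0, standard deviation near 0.5)
logs = np.log(x)
print(logs.mean(), logs.std())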

The binomial distribution is a discrete distribution that describes the number of successes in a series of trials, where the probability of success is fixed, for example, the number of heads when you flip a coin a given number of times. When the number of trials is large, the binomial distribution approximates the shape of a normal distribution. Binomial distributions are symmetrical when the probability of success is 0.5; they are asymmetrical when the probability is greater than 0.5 (left-tailed) or less than 0.5 (right-tailed).
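
For example, the coin-flipping case can be sketched with scipy’s binomial functions, here for ten flips of a fair coin:

from scipy.stats import binom

n, p = 10, 0.5   # ten flips of a fair coin

print(binom.pmf(5, n, p))   # probability of exactly 5 heads, about 0.246
print(binom.cdf(3, n, p))   # probability of 3 or fewer heads, about 0.172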

The Poisson distribution is a discrete distribution that describes the number of events occurring in a fixed period of time, where the events occur at a fixed average rate and are independent of one another, that is, not dependent on the time since the last event. Poisson distributions also describe the number of objects in a fixed area or volume when the occurrence of each object is independent of any other object. The number of chocolate chips per cookie in a batch of cookies follows a Poisson distribution, as do the number of crabs in a square meter of salt marsh and the number of meteorites per square mile of the planet. The Poisson distribution is right-tailed when the rate of events or occurrences is small, but it becomes increasingly symmetrical as the rate increases.
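
A minimal sketch of the chocolate-chip example, assuming a hypothetical average of 6 chips per cookie:

from scipy.stats import poisson

rate = 6   # hypothetical average number of chips per cookie

print(poisson.pmf(0, rate))       # probability of a chipless cookie, about 0.0025
print(1 - poisson.cdf(9, rate))   # probability of 10 or more chips, about 0.084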

The exponential distribution is related to the Poisson distribution, but it is a continuous distribution of the time between events when those events occur at a fixed average rate, or equivalently, the distance between objects. For example, the waiting times between shooting stars in an evening and clicks on a Geiger counter are exponentially distributed, whereas the number of those events over a fixed time span follows a Poisson distribution. Exponential distributions are asymmetrical, with a long right tail.
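
A short simulation (assuming a made-up rate of 3 events per hour) illustrates the connection: the waiting times are exponentially distributed with a mean of 1/rate, and the count of events in any one hour is a single Poisson-distributed draw.

import numpy as np

rng = np.random.default_rng(seed=1)
rate = 3   # hypothetical events per hour

# exponentially distributed waiting times between events
waits = rng.exponential(scale=1/rate, size=100000)
print(waits.mean())   # close to 1/3 hour between events

# the number of events in the first hour is one Poisson-distributed
# count, with a long-run average equal to the rate
print(np.sum(np.cumsum(waits) < 1))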

The even or uniform distribution is a flat, continuous distribution with an equal probability of all outcomes. Few things in nature follow such a distribution, but it is the most common distribution made by random-number generators.
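
A one-line check with numpy’s random-number generator, which draws uniform values over a specified interval:

import numpy as np

rng = np.random.default_rng(seed=1)
u = rng.uniform(low=0, high=1, size=100000)   # flat over [0, 1)
print(u.mean(), u.min(), u.max())             # near 0.5, 0, and 1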

Descriptive statistics

Central tendency or location is commonly described with the mean, which is the sum of the measurements divided by the sample size, symbolized with a lowercase “n”. A sample mean is indicated by an x with a bar over it (x̄), and a population mean (a parameter) is indicated by the Greek letter mu (μ). The mean is an unbiased estimator, meaning that the sample mean will not tend to be either larger or smaller than the population mean; in other words, sample means do not systematically overestimate or underestimate the population mean.

The median can also measure central tendency; the median is the value for which half of the sample is smaller and half is larger. With an even number of observations, the median is the average of the two middle values. The mode, the most frequently occurring value and the highest peak on a histogram, can also describe central tendency. Last, central tendency can be measured by the geometric mean, the nth root of the product of all n observations, although this is more easily calculated as e raised to the mean of the natural logarithms of the observations.
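
A minimal sketch of these four measures in Python, using a made-up set of measurements:

import numpy as np
from scipy import stats

x = np.array([2, 3, 3, 5, 8, 13, 21])   # hypothetical measurements

print(np.mean(x))            # arithmetic mean: sum divided by n
print(np.median(x))          # middle value (here, 5)
print(stats.mode(x).mode)    # most frequent value (here, 3)

# geometric mean: e raised to the mean of the natural logs,
# equivalent to the nth root of the product of the observations
print(np.exp(np.mean(np.log(x))))
print(stats.gmean(x))        # same result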

Because there are different measures of central tendency, one might ask which is the best. There is no simple answer, but one criterion is efficiency. A statistic is said to be more efficient than another if it is more likely to lie closer to the population parameter. For normally distributed data, the mean is a more efficient statistic of central tendency than the median.
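
Efficiency can be seen in a simulation: draw many samples from a normal population with a known mean, and compare how tightly the sample means and sample medians cluster around it. The sample size and number of samples below are arbitrary.

import numpy as np

rng = np.random.default_rng(seed=1)

# 10,000 samples of size 30 from a normal population with mean 0
samples = rng.normal(loc=0, scale=1, size=(10000, 30))
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

print(means.std())     # about 0.18
print(medians.std())   # about 0.23: medians scatter more widely,
                       # so the median is the less efficient statistic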

Be aware that certain measures of central tendency can be used with only certain types of data. Mode can be calculated on any data type: nominal, ordinal, interval, or ratio. The median cannot be calculated on nominal data. The mean cannot be calculated on nominal or ordinal data, only interval and ratio data.

A distribution's variation, spread, or dispersion can also be measured in several ways. The range is the difference between the largest and smallest values. The range tends to increase with sample size, a sensitivity that is not desirable. The interquartile range (also called the interquartile deviation) is the difference between the largest and smallest values after the largest 25% and smallest 25% of the distribution have been removed; equivalently, it is the difference between the third and first quartiles. It is somewhat less sensitive to sample size than the range.

Variance is the average squared deviation of all possible observations from the population mean, and it plays a crucial role in statistics. Variance is symbolized as s² for samples and σ² for populations; the sample variance is calculated with the sum of squared deviations from the sample mean divided by n − 1 rather than n, which makes it an unbiased estimate of the population variance. Standard deviation is the square root of the variance, which places it on the same measurement scale as the mean. The coefficient of variation is the standard deviation divided by the mean. This dimensionless number is used to compare standard deviations of samples with different means, such as when measurements are made on different scales (e.g., centimeters vs. inches).
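
A sketch of these measures of dispersion on the same hypothetical data as above; note ddof=1, which gives the n − 1 sample versions of variance and standard deviation:

import numpy as np

x = np.array([2, 3, 3, 5, 8, 13, 21], dtype=float)   # hypothetical measurements

print(np.ptp(x))   # range: largest minus smallest (here, 19)

q1, q3 = np.percentile(x, [25, 75])
print(q3 - q1)     # interquartile range

print(np.var(x, ddof=1))                 # sample variance, s²
print(np.std(x, ddof=1))                 # sample standard deviation, s
print(np.std(x, ddof=1) / np.mean(x))    # coefficient of variation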

Range can be calculated on ordinal, interval, and ratio data but not nominal data. Variance and standard deviation can be calculated only on interval or ratio data.

The shape of a distribution can be further described in terms of the number of peaks (or modes), the symmetry or asymmetry of the distribution, and how the data are distributed relative to the center or the tails of the distribution.

The number of modes is easily seen by the number of peaks on a histogram (a frequency distribution). Unimodal distributions have one peak, bimodal distributions have two, and multimodal distributions have more than two peaks.

Skewness describes the asymmetry of a distribution. Right-skewed distributions have a positive skew and a longer right tail. Left-skewed distributions have a negative skew and a longer left tail. Skewed distributions often cause problems for statistical analysis and require special treatment, such as data transformations or the use of non-parametric statistics.

Kurtosis is less commonly used, but it describes the amount of data in the peak and tails of the distribution relative to the shoulders. Distributions with more data in the center and tails than a normal distribution are called leptokurtic and have positive excess kurtosis (excess kurtosis is the measured kurtosis minus 3, the kurtosis of a normal distribution). Distributions with more data on the shoulders than in the tails and peak of a normal distribution are called platykurtic and have negative excess kurtosis.
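
Both can be computed with scipy; a sketch with simulated data (scipy’s kurtosis() reports excess kurtosis by default):

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

x = rng.normal(size=100000)    # symmetrical, bell-shaped data
print(stats.skew(x))           # near 0: no skew
print(stats.kurtosis(x))       # excess kurtosis, near 0 for a normal

y = rng.lognormal(sigma=0.5, size=100000)   # long right tail
print(stats.skew(y))           # positive: right-skewed
print(stats.kurtosis(y))       # positive: leptokurtic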

Skewness and kurtosis can be calculated only on interval or ratio data.

Lecture slides

Download a pdf of the images used in today’s lecture.