Steven Holland

Nonlinear Regression

What makes an equation nonlinear?

A linear equation is any equation of the form y = ax + bz, that is, y equals the sum of a set of one or more coefficients, each multiplied by a variable. Such an equation is said to be linear in its parameters (the coefficients a and b in this example) and its variables (here, x and z). Any equation that cannot be put into this form is non-linear.

Some equations may not appear to be linear at first glance, but with a simple substitution, they can be placed in linear form. For example, the equation y = ax + bx^2 is linear because one could define a new variable z = x^2, substitute it, and convert the equation to y = ax + bz. Likewise, y = ax^2 + ce^x is also linear because, with two substitutions, w = x^2 and z = e^x, the equation becomes y = aw + cz and is now in linear form.
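To see this in practice, here is a minimal sketch, using made-up data, showing that such an equation can be fitted with ordinary least squares in lm(). The I() function protects x^2 so that it keeps its arithmetic meaning inside the model formula.

x <- 1:20
y <- 2*x + 0.5*x^2 + rnorm(20)     # made-up data following y = ax + bx^2, plus noise
linearized <- lm(y ~ x + I(x^2))   # I() keeps x^2 as an arithmetic operation
coef(linearized)                   # estimates of the intercept, a, and b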

Some equations cannot be recast into linear form by substitutions. In situations like this, we must perform nonlinear least-squares regression, and this is easily done in R with the nls() function (for nls, think non-linear least squares).

Chapter 20 of The R Book by Michael J. Crawley is an excellent treatment of non-linear regression, and it includes a helpful table (20.1) of twelve common non-linear functions. If you need to fit a non-linear function, start with this chapter.

Explained variance

The nls() function does not produce a coefficient of determination (R²), but the following r2nls() function will calculate one for any model produced by nls(). This function uses the definition of the coefficient of determination: one minus the sum of squares of the residuals divided by the total sum of squares.

r2nls <- function(model) {
	residuals <- residuals(model)
	y <- fitted(model) + residuals      # reconstruct the observed y values
	ssResiduals <- sum(residuals^2)     # residual sum of squares
	ssTotal <- sum((y - mean(y))^2)     # total sum of squares
	rsq <- 1 - ssResiduals/ssTotal
	return(rsq)
}

Fitting a nonlinear regression in R

Michaelis-Menten kinetics

One common non-linear equation is the Michaelis-Menten function, which is used in chemistry to describe reaction rate as a function of concentration. The Michaelis-Menten function is characterized by a steep rise that gradually flattens into a plateau. Initially, small increases in concentration cause a substantial increase in reaction rate, but additional increases in concentration cause progressively smaller increases in reaction rate. Eventually reaction rate plateaus and does not respond to increases in concentration. The Michaelis-Menten function has this form:

r = Vm C / (K + C)

where r is the reaction rate (the dependent variable), C is the concentration (the independent variable), Vm is the maximum rate in the system, and K is the concentration at which the reaction rate is half of the maximum (Vm/2). Thus, given a series of specified concentrations and the rates measured at each, the regression will find the best-fit coefficients Vm and K. We want to estimate these coefficients and the uncertainty in those estimates (i.e., confidence intervals).
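To build intuition for the two coefficients, here is a short sketch that plots the function for hypothetical values of Vm and K; the parameter values are illustrative and not part of the data set used below.

Vm <- 30                           # hypothetical maximum rate
K <- 25                            # hypothetical half-saturation concentration
curve(Vm * x / (K + x), from=0, to=200, las=1, xlab="concentration", ylab="rate")
abline(h=Vm, lty="dashed")         # the plateau approaches Vm
abline(v=K, lty="dashed")          # rate equals Vm/2 at concentration K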

To fit this model, begin by entering the data (kinetics.txt) and plotting it. The data frame contains two variables, concentration and rate.

kinetics <- read.table("kinetics.txt", header=TRUE, sep=",")
attach(kinetics)
plot(concentration, rate, las=1, pch=16)

The shape of this relationship suggests that the Michaelis-Menten function might be a good model.

Michaelis-Menten data

To fit the function, use nls(), which takes two arguments. The first argument is the model formula, specifying that rate is a function of Vm multiplied by concentration, divided by the sum of K and concentration. Notice that unlike lm(), nls() does not need the I() function to specify that arithmetic operators (*, /, +, etc.) should keep their algebraic interpretations. The second argument, start, provides the fitting process with initial guesses for Vm and K, supplied as a list.

mmModel <- nls(rate ~ Vm * concentration / (K + concentration), start=list(Vm=30, K=25))

After fitting the model, its output is viewed with summary(). The fitted values of Vm and K can be extracted with the coef() function and their confidence intervals with the confint() function.

summary(mmModel)
coef(mmModel)
confint(mmModel)
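The r2nls() function defined earlier can now report the explained variance for this fit:

r2nls(mmModel)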

Adding the fitted function to our data plot takes a couple of steps. First, create a range of concentrations over which to evaluate the function, then evaluate the function over that range (that is, calculate the predicted rates at those concentrations), and finally add the points to the plot with the points() function. One might be tempted to evaluate the function by extracting the coefficients for the model and writing out the model, but it is simpler and less error-prone to use predict().

x <- seq(min(concentration), max(concentration), length=100)
y <- predict(mmModel, list(concentration=x))
points(x, y, type="l", col="blue")
detach()

Fitted Michaelis-Menten curve

Logistic growth

Another common nonlinear model, used in biology to describe the logistic growth of a population, is the logistic model. Logistic growth is characterized by slow initial growth, followed by a period of rapid growth, and ending with a period of slowing growth. This history produces the characteristic sigmoidal (s-shaped) curve of logistic growth. The model can be written as

P = K / (1 + e^(Po - r t))

where P is population size (the dependent variable) and t is time (the independent variable). The three coefficients of the logistic model, Po (initial population size), r (growth rate), and K (carrying capacity, or final population size), are fitted by the regression.

Start as always by plotting the data (popGrowth.csv). The data frame contains two columns, population and obsTime.

popGrowth <- read.table("popGrowth.csv", header=TRUE, sep=",")
attach(popGrowth)
plot(obsTime, population, las=1, pch=16, xlab="time")

Use the nls() function to fit the model, and supply initial guesses for the logistic parameters. These initial guesses might be made from another study that has fitted a logistic model to similar data.

logisticModel <- nls(population ~ K / (1 + exp(Po - r * obsTime)), start=list(Po=5, r=0.211, K=5))

Non-linear fits can be sensitive to the initial guesses. If these initial guesses are poorly chosen, you may get a singular-gradient error.

For these data, the initial population size appears to be close to zero, so zero is a reasonable initial guess for Po. The plot appears to plateau near 30, making that a good initial guess for K. To get an initial guess for r, select a point in the middle of the data set, substitute its values of P and t into the equation along with the initial guesses for Po and K, and solve for r (see the sketch below). The model can then be refitted with these better-informed guesses, likely with better results.
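As a minimal sketch of that algebra, assume a hypothetical mid-series point of obsTime = 5 and population = 22 read off the plot (the point you pick from your own plot will differ):

# Rearranging P = K / (1 + e^(Po - r t)) gives r = (Po - log(K/P - 1)) / t
Po <- 0; K <- 30                   # initial guesses read from the plot
t <- 5; P <- 22                    # hypothetical mid-series point
(Po - log(K/P - 1)) / t            # about 0.2, close to the guess used below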

logisticModel <- nls(population ~ K / (1 + exp(Po - r * obsTime)), start=list(Po=0, r=0.211, K=30))

As usual, view the results of the regression with summary(), the fitted parameters with coef(), and the confidence intervals with confint().

summary(logisticModel)
coef(logisticModel)
confint(logisticModel)

As in the Michaelis-Menten example, adding the fitted curve requires evaluating the function over the range of the data and adding it with the points() function. The easiest way to evaluate the fitted function is with predict().

x <- seq(min(obsTime), max(obsTime), length=100)
y <- predict(logisticModel, list(obsTime=x))
points(x, y, type="l", col="blue")

Fitted logistic curve

The added step of finding good starting parameters can be bypassed with a self-starting logistic model, the SSlogis() function. This useful function creates its own initial estimates of the parameters, which are then fed to nls() to find the best-fit logistic equation. The SSlogis() function fits a slightly different parameterization of the logistic function:

P = Asym / (1 + e^((xmid - t) / scal))

where Asym is the carrying capacity, xmid is the x value at the inflection point of the logistic curve, and scal is a scaling parameter for the x-axis. These parameters are easily converted to the first form of the logistic model described above.

K = Asym, r = 1 / scal, Po = xmid / scal

logisticModelSS <- nls(population ~ SSlogis(obsTime, Asym, xmid, scal))
summary(logisticModelSS)
coef(logisticModelSS)
detach()
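As a check, the fitted parameters can be converted back to the first parameterization with the conversions above; this sketch uses the coefficient names that SSlogis() assigns.

p <- coef(logisticModelSS)
K <- p["Asym"]                     # carrying capacity
r <- 1 / p["scal"]                 # growth rate
Po <- p["xmid"] / p["scal"]        # compare these with coef(logisticModel)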

Similar self-starting models are available for the Michaelis-Menten model (SSmicmen()) and several other functions, and they are all part of the built-in stats package.

References

Crawley, M.J., 2013. The R Book. Wiley: Chichester, 1051 p.

Generalized Linear Models

Linear models, as produced by the lm() function, make several critical assumptions: normally distributed errors, constant variance, and no trends in the residuals relative to the fitted values. When these assumptions are violated, you must use a different modeling procedure, such as a generalized linear model. Generalized linear models are implemented in R with the glm() function, which uses the same model syntax as lm() and aov(). Generalized linear models are fitted with likelihood methods, so they fundamentally differ in their approach from least-squares regression.

Generalized linear models have three important components: the error structure, the linear predictor, and the link function.

The error structure describes how the errors are distributed. Count data, represented as integers with a minimum value of zero, have Poisson-distributed errors. Proportions, which range from zero to one, have binomial-distributed errors. Data with a constant coefficient of variation, where the standard deviation is a multiple of the mean, have gamma-distributed errors. Survival data, also called time-to-death data, have errors that follow an exponential distribution. Any time you deal with counts, proportions, or survival data, use a generalized linear model rather than a simple least-squares regression.

The linear predictor is the sum of the parameters (slopes), each multiplied by the values of the explanatory variables. It is formed by the right side of the linear model formula.

The link function relates the mean value of y to the linear predictor and may be the most confusing aspect of a generalized linear model. Every error structure has an associated link function. The normal distribution has a link function called the identity function, which means that the y-value equals the linear predictor: nothing needs to be done to relate the two. In glm(), the family argument specifies the error distribution and link function. For example, to set a Gaussian error model and an identity link function, just set the family to gaussian:

glm(y ~ x, family=gaussian)

This will produce the same results as if you had called lm(), although glm() estimates the regression coefficients with likelihood methods, not by minimizing the sum of squares as lm() does.
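A quick way to convince yourself, using made-up data: the coefficient estimates from the two functions agree to within numerical precision.

x <- 1:20
y <- 3 + 2*x + rnorm(20)           # made-up data with normal errors
coef(lm(y ~ x))                    # least-squares estimates
coef(glm(y ~ x, family=gaussian))  # the same estimates, fitted by likelihood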

Here are the most common settings for family in glm() and their associated link functions (these are R's default links for each family). Notice that only Gamma is capitalized.

family      link function
gaussian    identity
poisson     log
binomial    logit
Gamma       reciprocal

Likelihood, deviance, and AIC

Because generalized linear models are based on likelihood, their output differs somewhat from that of common non-likelihood methods like t-tests, F-tests, ANOVA, correlation, and least-squares regression. In particular, generalized linear models report null deviance, residual deviance, and AIC (the Akaike Information Criterion). Deviance equals twice the log of the likelihood ratio of a saturated model (one that fits every data point perfectly) to the model in question; smaller values indicate a better fit. Null deviance compares a single-parameter model (the mean of the response variable) to the saturated model. Usually such a fit will be quite poor, making the null deviance large. Residual deviance compares the fit of the generalized linear model to the saturated model. If the generalized linear model fits well, the residual deviance will be markedly smaller than the null deviance.

Whether this drop in deviance is statistically significant can be evaluated much as an F-test is used to assess the statistical significance of a least-squares regression. The p-value is calculated from the difference in the null and residual deviances and the difference in the null and residual degrees of freedom, using a chi-squared distribution.

The output of a generalized linear model also reports the AIC. This value is of little use by itself, but it can be used to compare different models fitted to the same data, especially when they differ in their number of parameters. The model with the lowest AIC is preferred (see the sketch below).
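For example, if model1 and model2 stand for two hypothetical glm() fits to the same data, the AIC() function tabulates their scores for comparison:

AIC(model1, model2)                # the row with the smaller AIC is preferred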

Explained variance

The glm() function does not produce a coefficient of determination (R²), but one can be calculated easily from the model (see stats.stackexchange). Just as the coefficient of determination is generally defined as one minus the sum of squares of the residuals divided by the total sum of squares, the deviance-based approach uses one minus the model deviance divided by the null deviance. The following r2glm() function calculates this coefficient of determination; its only argument is a model fitted by glm().

r2glm <- function(model) {
	rsq <- 1 - model$deviance / model$null.deviance
	return(rsq)
}
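For example, once the Poisson model in the next section has been fitted, its deviance-based coefficient of determination is:

r2glm(model)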

Count data

Count data have several important properties. First, count data have a minimum value of zero and are therefore said to be bounded below. Count data are integers, which affects their error distribution; in other words, their errors are discrete, not continuous. Last, count data follow a Poisson distribution in which variance is not constant and increases with the mean.

A generalized linear model for count data requires the Poisson family and would therefore be written as

glm(y ~ x, family=poisson)

This model is equivalent to the equation

log(y) = a + bx

which can be rewritten as

y = e^(a + bx)

where y is the counts and x is an independent variable thought to control the counts; a and b are the regression coefficients to be fitted to the data.

To fit a generalized linear model to count data, put the counts in one vector and the explanatory variable in another. For this example, we will use the flora.csv data set, which contains two vectors of interest, elevation and maple. Our goal is to model maple tree abundance as a function of elevation. As always, plot the data before fitting a model, then use glm() to fit the model with a Poisson family.

flora <- read.table("flora.csv", header=TRUE, sep=",")
attach(flora)
plot(elevation, maple, las=1, pch=16, col="green")
model <- glm(maple ~ elevation, family=poisson)

View the results of the model with summary(), extract the fitted coefficients a and b with coef(), and generate the confidence intervals on those coefficients with confint().

summary(model)
coef(model)
confint(model)

To add the fitted regression curve to the plot, use the predict() function to generate the points along the curve at specified x values. The predict() function can be used with all types of regression to display the regression curve on a plot.

elevationPoints <- seq(min(elevation), max(elevation), by=0.01)
fittedValues <- predict(model, list(elevation=elevationPoints), type="response")
lines(elevationPoints, fittedValues, lwd=3, col="blue")

Because glm() models are based on likelihood methods, the output of summary() differs from that of lm() models.

Call:
glm(formula = maple ~ elevation, family = poisson)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.2948  -1.1941  -0.5495   1.0081   2.1554

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  7.58051    0.50482   15.02   <2e-16 ***
elevation   -0.28749    0.02794  -10.29   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 143.656  on 19  degrees of freedom
Residual deviance:  35.216  on 18  degrees of freedom
AIC: 105.5

Number of Fisher Scoring iterations: 5

Deviance Residuals should be checked for symmetry: the median value should be close to zero, and the 1Q and 3Q values should be roughly equidistant from the median. Max and Min should also be roughly equidistant from the median, but as endpoints, they are likely to deviate more than 1Q and 3Q.

The table also provides z-scores and their associated p-values to test whether the coefficients are significantly different from zero.

The markedly lower residual deviance compared with the null deviance indicates a good model fit. The corresponding p-value is calculated from the difference in the null and residual deviances and the difference in the null and residual degrees of freedom, using a chi-squared distribution. For this model, the p-value can be calculated as follows:

pchisq(model$null.deviance - model$deviance, model$df.null - model$df.residual, lower.tail=FALSE)

Because we are only interested in a drop in deviance, we want to focus on the upper tail of the distribution and therefore set lower.tail=FALSE.

Proportion and binary data

Proportion data have three important properties. First, values are constrained to lie between 0 and 1, inclusive, and are therefore said to be strictly bounded. Second, the errors are non-normally distributed, following a binomial distribution. Third, variance is not constant: it reaches a minimum as the proportion approaches zero or one and a maximum when the proportion is 0.5.

Proportion and binary data require the binomial family, which uses a logit link function. The logit function equals log(p/(1-p)), also called the log-odds, where p is the proportion. The model for proportion or binary data is stated as

glm(p ~ x, family=binomial)

This model specifies the relationship

log(p / (1 - p)) = a + bx

which can be rewritten in terms of p as:

p = e^(a + bx) / (1 + e^(a + bx))

where p is the proportion and x is the independent variable thought to control the proportion; a and b are the model coefficients to be fitted.
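This back-transform from the logit scale is what predict(..., type="response") applies for a binomial model. A minimal sketch, with hypothetical coefficients a and b and a hypothetical x value, shows that it matches R's built-in inverse logit, plogis():

a <- -2; b <- 0.5; x <- 3          # hypothetical coefficients and x value
exp(a + b*x) / (1 + exp(a + b*x))  # the equation above, evaluated directly
plogis(a + b*x)                    # R's inverse logit gives the same value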

Example 1: proportions

The first example is for proportion data: we will model the proportional abundance of maple trees in forest communities as a function of elevation. The data come from the flora.csv data set, which contains three columns: maple, otherTrees, and elevation.

flora <- read.table("flora.csv", header=TRUE, sep=",")
attach(flora)

To plot the data, we will calculate the proportion of the community represented by this species as a function of elevation.

proportion <- maple / (otherTrees + maple)
plot(elevation, proportion, las=1, pch=16)

To fit the model and preserve information on sample size, combine the abundances of maples and the other trees into a single matrix (abundances) using the cbind() function (think "column bind"). The model is fitted with glm(), specifying the community data as a function of elevation and using the binomial family because these are proportion data.

abundances <- cbind(maple, otherTrees)
model <- glm(abundances ~ elevation, family=binomial)

As usual, view the results of the model with summary(), use coef() to extract the fitted coefficients a and b, and use confint() to generate the confidence intervals.

summary(model)
coef(model)
confint(model)

To add the fitted relationship to the plot, create a vector of elevation values along which the fitted relationship will be calculated. Use predict() to generate the corresponding proportions, then use lines() to add the curve to the plot with the data.

elevationValues <- seq(min(elevation), max(elevation), by=0.01)
fittedValues <- predict(model, list(elevation=elevationValues), type="response")
lines(elevationValues, fittedValues, lwd=3, col="forestgreen")

Alternatively, you could put the proportions in one vector (done above) and the sample sizes in another. As before, glm() fits the model, specifying a binomial family and setting the weights argument equal to the sample sizes. This produces the same model as the first approach.

sampleSize <- maple + otherTrees
model <- glm(proportion ~ elevation, family=binomial, weights=sampleSize)
detach()

Example 2: binary data

This same model can also be used for binary data, such as the occurrence of a species along a habitat transect, recorded as 1's (present) and 0's (absent). For this we will use the floraPA.csv data set, which records the presence (1) or absence (0) of pine trees (pine) as a function of temperature. The first step is to plot the data.

floraPA <- read.table("floraPA.csv", header=TRUE, sep=",")
attach(floraPA)
plot(temperature, pine, las=1, pch=16)

The model is fitted with glm(), setting the family argument to binomial.

model <- glm(pine ~ temperature, family=binomial)

Model results, coefficients, and confidence intervals are extracted in the usual way.

summary(model) coef(model) confint(model)

The fitted relationship is added to the plot following the same approach.

temperatureValues <- seq(min(temperature), max(temperature), by=0.01)
fittedValues <- predict(model, list(temperature=temperatureValues), type="response")
lines(temperatureValues, fittedValues, lwd=3, col="forestgreen")

Survival data

Survival data, also called time-to-death or time-to-failure data, nearly always have non-constant variance. Where these data have errors that follow a gamma distribution, variance is proportional to the square of the mean, so variance increases more rapidly than the mean. Gamma errors use a reciprocal link function (1/y), and the model is written as:

glm(y ~ x, family=Gamma)

This model specifies the relationship

1/y = a + bx

or, alternatively

y = 1 / (a + bx)

where y is the survival time (or time to death or failure), x is the predictor variable, and a and b are the coefficients to be fitted.

Survival data can be fitted with the time-to-death data in one vector and the explanatory variable in another. The function calls and plotting are similar to those above, but set the family argument to Gamma (remember that Gamma is capitalized, unlike all the other families).
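As a sketch of what that might look like, assuming a hypothetical data set survival.csv with columns deathTime (time to death) and dose (the explanatory variable); neither file nor column names come from this course's data sets:

survival <- read.table("survival.csv", header=TRUE, sep=",")
attach(survival)
plot(dose, deathTime, las=1, pch=16)
model <- glm(deathTime ~ dose, family=Gamma)
summary(model)

# add the fitted curve, following the same approach as the earlier examples
doseValues <- seq(min(dose), max(dose), length=100)
fittedValues <- predict(model, list(dose=doseValues), type="response")
lines(doseValues, fittedValues, lwd=3, col="blue")
detach()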
