Multivariate data should be set up in a data frame with variables in columns and cases in rows. There should be no missing values: every variable should have a value for every case.
As with any data set, the first step in analyzing multivariate data is to plot the data. Strip charts and histograms are the standard tools for plotting distributions. Bivariate plots are useful for assessing correlations among variables, but the number of such comparisons grows rapidly with the number of variables. The pairs() command is an efficient way to generate all possible pairwise plots. The data points may be small and compressed, but the plots are enough to assess the forms of the relationships among the variables as well as the presence of outliers. The plots can be enlarged somewhat by setting gap=0, which removes the space between plots. Shrinking the size of the labels by setting cex.labels to a value less than 1 can also help with crowding.
pairs(myDataFrame, gap=0, cex.labels=0.9)
Where multivariate data consist of one dependent and many independent variables, developing a linear model should be a priority. Linear models may include continuous numerical variables and categorical variables (factors) in any combination. When numerical and categorical variables are combined, the analysis is called an ANCOVA, an analysis of covariance. The interpretation of the slopes for continuous variables is intuitive, and the coefficients for factors correspond to differences between group means. Many statistical tests can be expressed as linear models.
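As a minimal sketch, an ANCOVA-style model combining a continuous predictor and a factor might be fitted as shown below; myDataFrame, y, x, and group are placeholder names, not objects defined elsewhere in these notes.

# Hypothetical ANCOVA-style fit: y modeled from a continuous predictor (x) and a factor (group)
ancovaModel <- lm(y ~ x + group, data = myDataFrame)
summary(ancovaModel)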
Despite the name, linear models do not necessarily produce a straight line. For example, the equation y = ax + bx^2 + ce^x is a linear model even though it is not a line, because new variables could be substituted for x^2 and e^x such that y = ax + bw + cz, where w = x^2 and z = e^x. Such a model is said to be “linear in its parameters and linear in its random variables”.
The highly flexible lm() function is the standard tool in R for linear modeling. The first argument to the lm() function is the model, and the simplest model is y ~ x. In this notation, the dependent variable is placed on the left side of the tilde (~), with the independent variable on the right side. The model y ~ x is read “y as a function of x”.
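For example, the simplest model could be fitted and summarized like this, again using the placeholder data frame myDataFrame:

simpleModel <- lm(y ~ x, data = myDataFrame)  # fit y as a function of x
summary(simpleModel)                          # coefficients, R-squared, and tests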
More complicated models can be made by including additional independent variables on the right side. The syntax of the model specification can initially be confusing because it uses the symbols of arithmetic, but the symbols do not have their arithmetic meanings. For example, an additional variable is added to the model using the + operator: y ~ x + z. In this context, + does not indicate addition but means include an explanatory variable. This model would be read “y as a function of x and z”. With the + operator, you can include many explanatory variables in your model: y ~ x + z + w + q, which would be read as “y as a function of x and z and w and q”.
In some situations, there may be interactions between explanatory variables. For example, the effect of z may be different at high values of x than at low values. Such interactions can be included in two ways. The simplest is to specify only the interactions of interest using the : operator. For example, an interaction between x and z could be specified with y ~ x + z + x:z, read as “y as a function of x and z and the interaction of x and z”. Alternatively, you could use the * operator: y ~ x * z, which has the same meaning in this example because there is only one possible interaction. The * operator does not mean multiplication in this context but means to include an explanatory variable and all of its interactions.
When there are more than two explanatory variables, the combination of all possible interactions may be quite large. For example, y ~ x * z * w is the same as y ~ x + z + w + x:z + x:w + z:w + x:z:w. The ^ operator can be used to find all interactions up to a given level of complexity. For example, y ~ (x + z + w)^2 will include all interactions involving up to two explanatory variables at a time. That model could also be specified by writing y ~ x + z + w + x:z + x:w + z:w. To include interactions with up to three explanatory variables, you would write y ~ (x + z + w)^3, which would expand to y ~ x + z + w + x:z + x:w + z:w + x:z:w. That model could be written more simply as y ~ x * z * w, which would be y as a function of x, z, and w, and all of their interactions.
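One way to check how such a formula expands, shown here as a sketch with placeholder variable names, is to ask R for the term labels of the formula; terms() works on the formula alone, without any data:

attr(terms(y ~ (x + z + w)^2), "term.labels")
# returns "x" "z" "w" "x:z" "x:w" "z:w"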
Models can be nested. If you want to model y as a function of x and z within x, use the / operator: y ~ x / z. In this context, / indicates nesting, not division. This is used to include a higher-order term but not a lower-order term. In this example, y ~ x / z is the same as y ~ x * z - z, where the - operator is used to remove a term.
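The same trick confirms the equivalence of the two nested specifications (placeholder names again):

attr(terms(y ~ x / z), "term.labels")      # "x" "x:z"
attr(terms(y ~ x * z - z), "term.labels")  # also "x" "x:z"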
The - operator is used to identify a term that should be removed, an important step in model simplification (explained below). In this context, the symbol - does not mean subtraction, but to remove an explanatory variable.
Regressions may be forced through the origin by adding -1 to the model. The model y ~ 1 finds the grand mean of y, so appending -1 to a model removes this term and forces the regression through the origin.
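A sketch of a regression forced through the origin, using the placeholder names from above; y ~ x + 0 is an equivalent spelling:

originModel <- lm(y ~ x - 1, data = myDataFrame)  # no intercept term
summary(originModel)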
Polynomial regression can be accomplished through the poly() function. For example, y ~ poly(x, 3) will fit a cubic polynomial in x, which would be y = b0 + b1x + b2x^2 + b3x^3.
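Note that poly() fits orthogonal polynomials by default; adding raw = TRUE returns coefficients that correspond directly to b0, b1, b2, and b3 in the equation above. A sketch with the same placeholder names:

cubicModel <- lm(y ~ poly(x, 3, raw = TRUE), data = myDataFrame)
summary(cubicModel)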
Because the arithmetic operators have special meaning when used to specify a linear model, the identity function I() must be used when these operators need to be used in their arithmetic sense. For example, if you wanted to find the parameters for the equation y = a + bx + cx^2, you would write the linear model as y ~ x + I(x^2). The I() function tells R to treat what is in the parentheses (x^2) as the arithmetic statement x^2. Similarly, if you wanted to fit the equation y = a + b/x, you would state the linear model as y ~ I(1/x).
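These two models might be fitted like this, with the same placeholder names:

quadraticModel <- lm(y ~ x + I(x^2), data = myDataFrame)  # y = a + bx + cx^2
inverseModel <- lm(y ~ I(1/x), data = myDataFrame)        # y = a + b/x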
So many options can seem overwhelming at first, but the point you should take from this is that R is flexible in how a linear model can be specified. If you can think of a model, there is likely a way to specify that model in R. These model choices can be used elsewhere as well, such as in ANOVAs.
As the number of explanatory variables increases, the number of possible linear models rises dramatically, particularly when you consider interactions and higher-order terms. The goal of linear modeling is to find a model that is relatively simple yet explains as much variation as possible.
Finding an appropriate model for a given data set takes time, and several things must be kept in mind. First, there is no single model for explaining a given data set. Second, different strategies for choosing a model may lead to different results. Third, the order in which variables are added or removed can affect the final result. In all these cases, your understanding of the science must guide your decisions. For example, if two approaches lead to different models, choose the one that makes more sense regarding the underlying processes.
The simplest approach to finding the best model is to try all possible regressions and compare them with some metric of how well the models perform. There are several options for such a metric. For example, you could choose the model with the largest coefficient of multiple determination (R²). You could pick the model with the smallest mean square error term. You could also use the model with the largest adjusted R², which adds a penalty for the number of parameters in the model to help prevent you from overfitting a model.
More commonly, the best model is found by either starting with a simple model and adding more predictor variables, called forward selection, or starting with the most complicated model and removing predictor variables, called backward elimination.
In forward selection, predictor variables are added until there is no substantial increase in the coefficient of determination (R²). The principal problem with this approach is determining the order in which to add variables. One intuitive method is to add the variable that produces the greatest increase in R², which is usually determined by calculating partial correlation coefficients. A second approach is to calculate the F-statistic corresponding to the addition of a new variable (called F-in or F-to-enter) and add the variable that produces the largest F-statistic, provided the F-statistic meets some minimum requirement of significance. Variables are no longer added when none of the remaining variables produces a statistically significant F-statistic, in other words, when adding another explanatory variable does not substantially increase explained variance.
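In R, one forward-selection step can be sketched with the add1() function, which reports an F-statistic for each candidate variable not yet in the model; the model and scope formula below use placeholder names:

currentModel <- lm(y ~ x, data = myDataFrame)
add1(currentModel, scope = ~ x + z + w, test = "F")  # F-to-enter for z and w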
In backward elimination, all predictor variables are initially included in the model. Predictor variables are dropped sequentially, provided they do not substantially lower the coefficient of determination (R²). This is typically done by removing the variable with the smallest partial correlation coefficient or by removing the variable with the smallest F-statistic (called F-out or F-to-remove), provided that F is not statistically significant. The idea is that you are removing the explanatory variable that makes the smallest contribution to explained variance, hence an F-statistic (a ratio of variances) is used. A general approach is to remove the highest-order interaction terms first, usually those that are the least statistically significant (i.e., have the highest p-values).
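The corresponding backward step can be sketched with drop1(), which reports an F-statistic (F-to-remove) for each term that could be dropped; placeholder names again:

fullModel <- lm(y ~ x * z, data = myDataFrame)
drop1(fullModel, test = "F")  # while the interaction is present, only x:z is eligible for removal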
The challenge with both approaches is that subsequent choices necessarily build on earlier choices. For example, variables that have been added via forward selection are not reconsidered even though adding subsequent variables may mean that they no longer explain much variance. Likewise, a variable is not reconsidered once it has been removed in backward elimination even though a subsequent removal may make that variable a potentially important source of explained variance. Stepwise regression addresses this problem. In stepwise forward selection, the list of included variables is re-evaluated after each predictor variable is added to see whether any of the included variables no longer adds substantially to R². In stepwise backward elimination, after each variable is removed, the list of removed variables is scanned to see whether any of those variables would now add substantially to R².
In both forward selection and backward elimination, the models produced are nested in that the simpler model always contains a subset of the parameters of the more complicated model. Nested models can be compared in R with the anova() function. The idea is that the difference between the two models is measured by a quantity called the extra sum of squares, equal to the reduction in the unexplained sum of squares produced by the additional model terms. This can be converted to a variance, which can be scaled against the unexplained variance term for the full model. A ratio of variances can be tested with an F-test; hence this problem can be solved with an ANOVA. Testing for the significance of additional variables between two models is done like this:
model1 <- lm(y ~ x)      # simpler model
model2 <- lm(y ~ x + z)  # more complicated model with one additional term
anova(model1, model2)    # F-test on the extra sum of squares
A statistically significant result from this ANOVA indicates that including the additional parameter produces a statistically significant decrease in the unexplained variation and that the added parameter should therefore be included in the model. In forward selection, you would include the more complicated model’s parameters if they produced a statistically significant result in the ANOVA. In backward elimination, you would remove the parameters of the more complicated model if they did not generate a statistically significant result in the ANOVA, that is, if the p-value is not smaller than the significance level (alpha, usually but not necessarily 0.05).
Model simplification can be time-consuming, with difficult and subjective choices along the way. The step() function can be used to automate the process based on Akaike’s Information Criterion (AIC).
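A sketch of automated simplification with step(), again using placeholder names; when no scope is supplied, step() performs backward elimination from the fitted model it is given:

startingModel <- lm(y ~ x * z * w, data = myDataFrame)
simplifiedModel <- step(startingModel)  # drops terms whenever doing so lowers AIC
summary(simplifiedModel)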
Non-nested models can also be compared with AIC, using the AIC() and stepAIC() functions. The latter is particularly useful because it can automate the entire process. Start by constructing the most complicated model, then call stepAIC() on the results of that model to see how it could be simplified by using AIC. The AIC() function is part of base R, but stepAIC() requires the MASS package.
library(MASS)                              # provides stepAIC()
mostComplicatedModel <- lm(y ~ x * z * w)
stepAIC(mostComplicatedModel)              # simplify the model by minimizing AIC
When fitting linear models, watch for strong correlations among the independent variables, called multicollinearity. Multicollinearity may indicate that two or more variables measure the same quantity. Including highly correlated independent variables can make model selection more difficult and complicate regression coefficient interpretations. When you have multiple highly correlated variables, it is often best to use only the one that is most strongly correlated with the dependent variable.
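A quick, hedged check for multicollinearity is to examine the correlation matrix of the candidate predictors; the data frame and column names here are placeholders:

# pairwise correlations among predictors; values near 1 or -1 flag possible multicollinearity
round(cor(myDataFrame[, c("x", "z", "w")]), 2)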
Once a model has been selected, it must be evaluated through the various diagnostic plots available in plot.lm(). Four aspects of the regression must be checked. First, verify that the residuals do not change systematically with the fitted values, which would suggest that the wrong model is being fitted to the data; the red trend line on this plot should be roughly horizontal. Second, verify that the errors are normally distributed using the normal quantile (qqnorm()) plot of the residuals. Third, verify that the residuals are homoscedastic: the size of the residuals should not change systematically with the fitted values; the red trend line on this plot should be roughly horizontal. Fourth, verify that no points have an unduly large influence over the regression, as measured by Cook’s distance, which reflects how much the fitted values (and regression coefficients) change when the point in question is excluded from the fit. Points with large values of Cook’s distance may indicate measurement errors or other problems.
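The four diagnostic plots can be generated, as a sketch, directly from any fitted model object; selectedModel is a placeholder for the model chosen above:

par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
plot(selectedModel)   # residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage with Cook's distance
par(mfrow = c(1, 1))  # restore the default single-plot layout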
Demonstration code used in class: multipleRegression.R