Lecture Notes

Home

Contact

Steven Holland

PCA: Matrix Operations

R’s prcomp() function, and others like it, are front ends that perform the matrix operations in a principal components analysis. These matrix operations are not complex, and it is straightforward to compare their results to those of prcomp() to show that they produce equivalent results.

First, we will create a two-dimensional data set for our analysis. This could be done for data sets with more variables, but using just two variables makes it easier to inspect the results.

rawX <- rnorm(50, sd=3) errors <- rnorm(50, sd=3) slope <- 0.8 rawY <- slope * rawX + errors raw <- cbind(rawX, rawY) remove(rawX, rawY, slope, errors)

Next, we will run prcomp() and extract the loadings, singular values, and scores. The center argument moves the centroid of the data to the origin, and retx argument ensures that the scores are returned. Recall that the loadings are the eigenvector and singular values are the standard deviation along each principal component.

pca <- prcomp(raw, center=TRUE, retx=TRUE) loadings <- pca$rotation singular <- pca$sdev scores <- pca$x

For comparison, we will calculate all of this using the underlying matrix operations. To do this, we need to center the data, that is put the centroid of the cloud of data at the origin, and this is done by subtracting the column mean from every value in each column in the raw data. From this, we calculate a covariance matrix, and from that, we calculate the eigenvalues and eigenvectors using singular value decomposition.

centered <- apply(raw, MARGIN=2, FUN=function(x){x-mean(x)}) covMatrix <- cov(centered) eigen <- eigen(covMatrix)

From this, we can calculate the singular values, the loadings, and the scores. Because the eigenvalues are the variance along each principal component, the singular values are the square root of the eigenvalues. Thus, the singular values are the standard deviation along each principal component.

singularLong <- sqrt(eigen$values)

The loadings are the eigenvectors.

loadingsLong <- eigen$vectors rownames(loadingsLong) <- colnames(raw)

The scores are the centered values of the raw data, post-multiplied by the loadings.

scoresLong <- centered %*% loadingsLong

Comparison of results

The singular values for both approaches are identical.

> singular [1] 4.714315 2.268171 > singularLong [1] 4.714315 2.268171

The absolute values of the loadings are also identical. Note that the orientation of the PCs is arbitrary so the signs of the loadings for each PC may be reversed from those produced by prcomp() and those produced by the matrix operations, as has happened here.

> loadings PC1 PC2 rawX -0.5369201 0.8436331 rawY -0.8436331 -0.5369201 > loadingsLong [,1] [,2] rawX 0.5369201 -0.8436331 rawY 0.8436331 0.5369201

Last, the absolute values of the scores are also identical. If the signs for the loadings are reversed, the signs for the scores will also be reversed, as they are here. Remember that in any PCA, you can reverse the signs for the loadings and the scores for any of the PCs, such as if you are comparing PCAs of different data sets and want the patterns to go in the same direction. Just be sure that if you change the signs, you change the signs for all loadings on a given PC and for all the scores on that same PC.

> head(scores) PC1 PC2 [1,] -3.4573188 1.9034691 [2,] -3.7136078 0.2395753 [3,] 6.7840044 -0.8680275 [4,] -7.0073627 -2.9162746 [5,] 0.2528793 0.8882884 [6,] 1.5870900 -1.6455650 > head(scoresLong) [,1] [,2] [1,] 3.4573188 -1.9034691 [2,] 3.7136078 -0.2395753 [3,] -6.7840044 0.8680275 [4,] 7.0073627 2.9162746 [5,] -0.2528793 -0.8882884 [6,] -1.5870900 1.6455650