How R Has Changed How We Look at Principal Component Analyses

A typical table reporting PCA results that I randomly trawled from the internet

This post isn’t really very anole-specific, but because lots of studies of anoles use principal component analyses, I think it’s at least tangentially relevant.

PCA is a way to reduce the variation in a data set to a few dimensions by constructing new variables that combine the original variables that are highly correlated with each other. I won’t go into the details of the method here, because Ambika Kamath explains it all in a post she wrote on her blog a while back.

What I want to mention here is how we interpret these new statistical axes. Back in my day, computer programs spit out a matrix of numbers like the one above, which we called “loadings.” These values represent how strongly each variable is correlated with each of the new axes. So, for example, in the table above, values on PCA axis 1 correlate most strongly with “fats and oil” and “animal protein” content and most weakly with Vitamin B1. In other words, the line defining the PCA axis (a PCA axis is a linear combination of all the variables) was determined the most by fats and oils and animal protein and the least by Vitamin B1.

Now, everyone uses the computer program R to conduct PCAs, and R, too, spits out “loadings.” But those are not your father’s loadings (or my loadings). Rather, those values are the coefficients of the new equation that defines the PCA axis. Thus, in the example above, hot dogs that score high on PCA 1 would have the highest (or lowest, depending on the sign of the coefficient) fat and oil and animal protein content. Back in the day, we could also access those values, but we called them “coefficients.”

Does this really matter? Only to the extent that what much of the literature used to call “coefficients” is now called “loadings” and what used to be called “loadings” apparently isn’t routinely spit out by R. And, more importantly, most R users are completely unaware of the switcheroo.

Ambika did a very preliminary analysis to see whether the values of coefficients (new “loadings”) and correlations (old “loadings”) are very different. Her tentative conclusion is that they aren’t, so maybe this doesn’t matter much, but it might be worth looking into more.
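To make the distinction concrete, here is a minimal numerical sketch. It is in Python/numpy rather than R, and the toy data are invented for illustration, but it computes both kinds of values from the same data set: the eigenvector coefficients (what R’s prcomp() reports as rotation, and princomp() as loadings) and the old-style loadings (the correlation of each original variable with each PC).

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: 3 variables, 200 observations, with some induced correlation
X = rng.standard_normal((200, 3))
X[:, 1] += 0.8 * X[:, 0]

# correlation-matrix PCA: standardize, then eigendecompose the correlation matrix
Z = (X - X.mean(0)) / X.std(0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

coefficients = eigvecs                     # what R reports as "loadings"/rotation
scores = Z @ coefficients                  # PC scores for each observation

# old-style "loadings": correlation of each original variable with each PC
old_loadings = np.array([[np.corrcoef(Z[:, j], scores[:, k])[0, 1]
                          for k in range(3)] for j in range(3)])
# the two matrices generally differ: they agree only up to a per-axis
# scaling factor (the square root of that axis's eigenvalue)
```

Under these assumptions the two sets of numbers are proportional within each axis but not equal across axes, which is consistent with Ambika’s tentative finding that they often tell a similar story.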

About Jonathan Losos

Professor and Curator of Herpetology at the Museum of Comparative Zoology at Harvard University. I've spent my entire professional career studying anoles and have discovered that the more I learn about anoles, the more I realize I don't know.

4 thoughts on “How R Has Changed How We Look at Principal Component Analyses”

  1. One of my issues with PCA is more about why it is being used than about this particular issue. I would agree that the results being spat out by R and those reported in the past are both useful statistics. Personally, I use SAS as my preferred statistical tool, but that is just preference.

    My issue stems from whether or not one should be using PCA in a given scenario at all; in particular, for questions of species divergence, DFA (Discriminant Function Analysis) may be the better tool of choice. This comes down to the assumptions of the statistical test, which must be met for an analysis to hold.

  2. A typical output from a Principal Components Analysis in SAS back in the 1980s contained the elements of the eigenvectors. These were untransformed coefficients (also called direction cosines) that gave the contribution of each variable to the rotation of each axis in the n-dimensional phenotype space. Loadings were introduced by psychometricians to aid in the interpretation of PCAs and as an approximation of the solution to Factor Analysis. PCA and Factor Analysis were often used to reduce the dimensionality of a large number of variables.

    The “loadings” produced by the R PCA functions are the elements of each eigenvector from the decomposition of the covariance (or correlation) matrix. That is, PCA in R is more like the original version of PCA.

    Also, major-axis regression is more similar to PCA than to linear regression. The deviations from the fitted line are measured perpendicular (orthogonal) to the axis, whereas in linear regression the deviations are measured as the vertical distance of the response variable from the fitted line.
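The contrast between the two kinds of line fitting can be shown numerically. This is a numpy sketch on simulated data (Python rather than SAS or R, and the data are made up): the major-axis slope comes from the first eigenvector of the covariance matrix, i.e. from PC1, while OLS minimizes vertical distances only.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(500)
y = 2.0 * x + 0.5 * rng.standard_normal(500)   # noisy linear relation, true slope 2

# OLS: minimizes vertical distances of y from the fitted line
ols_slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# Major-axis regression: minimizes perpendicular distances;
# its slope is the direction of the leading eigenvector (PC1) of the covariance matrix
cov = np.cov(np.vstack([x, y]), ddof=1)
eigvals, eigvecs = np.linalg.eigh(cov)
v1 = eigvecs[:, np.argmax(eigvals)]
ma_slope = v1[1] / v1[0]

# the two slopes differ: OLS attributes all scatter to y, so its slope
# is pulled below the major-axis slope when there is scatter around the line
```

On this simulated data the OLS slope sits near the true value of 2, while the major-axis slope is slightly steeper, because the two methods measure deviations in different directions.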

  3. The uncertainty about what the column values (“loadings” in a vague sense) are in a PC analysis has been around ever since I was in grad school. The rule was that one should always look at the documentation for the software (a similar warning applies for determining whether a correlation or covariance matrix was used for the calculation of the PCs).

    As I remember, the “coefficients” for a column (PC) are part of the equation specifying the axis. They can be used to convert the original data to the scores on PC1, PC2, etc. that are plotted on the biplots.

    The “correlations” for a particular column are the product of the coefficient and the square root of the eigenvalue (variance) for that column. The latter term is equivalent to the standard deviation, so the correlations are “de-standardized” and their magnitudes have meaning.

    So, the coefficients and correlations are related, but the scaling factor (eigenvalue) varies among PCs, so it’s not 1:1 across all PCs.

    Folks who do morphometrics usually find the “correlation” type to be more useful.

    At least that’s how I remember it; someone correct me if I’m wrong.
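The commenter’s recollection can be checked numerically. The following numpy sketch (Python, not R; made-up data; assuming a correlation-matrix PCA, where each standardized variable has variance 1) verifies that the correlation of each variable with each PC equals the coefficient times the square root of that PC’s eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 4))
X[:, 2] += X[:, 0]                        # add some correlation structure

Z = (X - X.mean(0)) / X.std(0, ddof=1)    # standardize: correlation-matrix PCA
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, coeffs = eigvals[order], eigvecs[:, order]
scores = Z @ coeffs                       # coefficients convert data to PC scores

# claimed identity: correlation(variable j, PC k) = coeffs[j, k] * sqrt(eigvals[k])
predicted = coeffs * np.sqrt(eigvals)
empirical = np.array([[np.corrcoef(Z[:, j], scores[:, k])[0, 1]
                       for k in range(4)] for j in range(4)])
assert np.allclose(empirical, predicted)
```

Because the scaling factor sqrt(eigenvalue) differs from axis to axis, the coefficients and correlations are proportional within a PC but not interchangeable across PCs, just as the comment says.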
