For a thorough background on PCA in general and its use in the biological sciences, the reader is referred to the appropriate chapter in Sokal & Rohlf (1985) for a basic description, or to Jongman (1987) for a more advanced treatment. Morrison (1990) gives a highly theoretical and technical description of PCA and related multivariate statistical methods. Examples of the application of PCA in chemistry and environmental science can be found in, e.g., Cash & Breen (1992), Del Valls et al. (1997), Poissant et al. (1997), Stone & Brooks (1990), or Wold et al. (1987).
Briefly, PCA is a projection method in which a dataset consisting of several, more or less correlated descriptor variables is 'plotted' in the multivariate descriptor space. This space is then 'rotated' such that the first dimension (the first principal component) is parallel to the direction of the largest variance in the data. The next dimension is chosen so as to be parallel to the direction of the next largest variance in the data, subject to the constraint that it must be perpendicular to the first dimension, and so on. An example of this rotation, in a two-dimensional space, using hypothetical data, can be seen in figure 1.
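The rotation described above can be illustrated for the two-dimensional case with a minimal sketch. The data values and the function names (`covariance`, `first_pc_angle`, `rotate`) are hypothetical and chosen only for illustration; the closed-form angle applies to two variables only, and general-purpose software would use an eigenvalue or singular value decomposition instead.

```python
import math

def covariance(us, vs):
    """Sample covariance of two equally long lists (n - 1 in the denominator)."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    return sum((u - mu) * (v - mv) for u, v in zip(us, vs)) / (n - 1)

def first_pc_angle(xs, ys):
    """Angle (radians) between the x-axis and the direction of largest variance."""
    sxx, syy, sxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    # Closed-form orientation of the major axis of a 2 x 2 covariance matrix.
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)

def rotate(xs, ys, theta):
    """Centre the data and express it in axes rotated by theta."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [c * (x - mx) + s * (y - my) for x, y in zip(xs, ys)]
    pc2 = [-s * (x - mx) + c * (y - my) for x, y in zip(xs, ys)]
    return pc1, pc2

# Hypothetical, strongly correlated two-variable dataset.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]

pc1, pc2 = rotate(xs, ys, first_pc_angle(xs, ys))
print("variance along PC1:", covariance(pc1, pc1))
print("variance along PC2:", covariance(pc2, pc2))
```

After the rotation, the two new coordinates are uncorrelated, the first carries most of the variance, and the total variance is unchanged.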
Usually, original descriptors in datasets are correlated. Therefore, only a few 'dimensions' are needed to capture most of the variance in a dataset. The remaining dimensions can be discarded as noise, resulting in a substantial reduction in dataset complexity. An example of this phenomenon, using hypothetical data, can be seen in figure 2.
In this particular example, a dataset consisting of 14 original descriptors was subjected to a principal components analysis. The first principal component extracted from this dataset alone captured slightly more than 50% of the entire variance in the dataset. Each subsequent component describes only a small portion of the remaining variance.
Since PCA can be described geometrically as a rotation of the dimensions in the multivariate space spanned by the data, the new dimensions (principal components) can be described as linear combinations of the original variables, and may be interpreted as 'latent variables'. The loadings for each component indicate how strongly each original variable contributes to that component, and may be used to interpret the underlying 'meaning' of the 'latent variables'.
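As a rough illustration of components as linear combinations of the original variables, the sketch below computes loadings and scores with NumPy. The data are simulated and hypothetical, and the loadings are taken here to be the unit-length eigenvectors of the covariance matrix; some software instead reports loadings scaled by the square root of the eigenvalue, so this convention is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 observations of 3 variables driven by one hidden factor.
latent = rng.normal(size=(50, 1))
X = latent @ np.array([[1.0, 0.8, -0.6]]) + 0.3 * rng.normal(size=(50, 3))

Xc = X - X.mean(axis=0)                      # centre each variable
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]            # reorder: largest component first
eigvals, loadings = eigvals[order], eigvecs[:, order]

# Each column of `scores` is a linear combination of the centred variables,
# with the weights given by the corresponding column of `loadings`.
scores = Xc @ loadings

print("loadings of the first component:", loadings[:, 0])
```

Variables with a large absolute loading on a component contribute strongly to it; the sign of a whole loading vector is arbitrary.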
Usually, principal components analysis is done on pretreated data. Normally two assumptions are made when a dataset is reduced to principal components, viz.:
To emphasize these assumptions, data are 'autoscaled', i.e. centered on zero (by subtracting the mean of each variable from the individual observations of that variable), and scaled to unit variance (by dividing each centered observation of a variable by the standard deviation of that variable). See figure 3 for an illustration of this process.
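A minimal sketch of autoscaling a single variable follows. The values and the name `autoscale` are hypothetical. Note that unit variance is obtained by dividing by the standard deviation; the sample standard deviation (n - 1 in the denominator) is assumed here.

```python
from statistics import mean, stdev

def autoscale(column):
    """Centre a variable on zero and scale it to unit variance."""
    m = mean(column)
    s = stdev(column)          # sample standard deviation (n - 1 denominator)
    return [(value - m) / s for value in column]

# Hypothetical measurements of one variable at five locations.
salinity = [120.0, 340.0, 95.0, 410.0, 230.0]
scaled = autoscale(salinity)
print(scaled)
```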
If data are not autoscaled, the first principal component will be determined completely, or at least to a very large extent, by the variable with the largest (average or absolute) magnitude. In a dataset with, e.g., several variables containing measured values for microcontaminants in surface water (in µg L⁻¹ or lower) at several locations, and a variable for the salinity of the water (concentration of Cl⁻ in mg L⁻¹), a PCA decomposition of the raw values would yield a first principal component that basically only reflects the salinity at each location. If we are interested in patterns of variability between observations, such as different patterns in the presence and absence of certain contaminants in different locations, we therefore need to autoscale our variables before extracting principal components.
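The effect of scale can be made concrete with simulated data. In this sketch all numbers are hypothetical: two contaminant concentrations that vary together on a small numerical scale, and a chloride concentration that varies independently on a much larger one. The helper `first_loading` is an illustrative name, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40

# Hypothetical data: two contaminants share a common 'pollution' pattern,
# chloride varies independently and on a far larger numerical scale.
pollution = rng.normal(size=n)
contaminant_1 = 0.5 + 0.10 * pollution + 0.02 * rng.normal(size=n)
contaminant_2 = 0.2 + 0.05 * pollution + 0.01 * rng.normal(size=n)
chloride = 300.0 + 80.0 * rng.normal(size=n)
X = np.column_stack([contaminant_1, contaminant_2, chloride])

def first_loading(data):
    """Unit-length loading vector of the first principal component."""
    cov = np.cov(data - data.mean(axis=0), rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
    return eigvecs[:, -1]                 # eigenvector of the largest eigenvalue

raw = first_loading(X)
autoscaled = first_loading((X - X.mean(axis=0)) / X.std(axis=0, ddof=1))

print("first loading, raw data:       ", raw)
print("first loading, autoscaled data:", autoscaled)
```

On the raw values the first component is almost purely the chloride axis; after autoscaling it is carried mainly by the two co-varying contaminants.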
However, in certain circumstances, we may be interested not in patterns of variability but in patterns in the absolute magnitude of variables, e.g. to determine which locations are the more heavily contaminated ones in an absolute sense, and which compound or compounds are responsible for these high levels of contamination. In such a case, we can use PCA on the raw values to find the appropriate patterns. For the same reason, it is usually more informative not to log-transform the data when we are interested in magnitudes of contamination.
Principal components analysis is a method to deal with collinearities in datasets, to eliminate noise from datasets, and to investigate the underlying factors that determine the variability in the data. Collinearity is dealt with because the principal components are orthogonal linear combinations of the original data that are ordered according to their 'size', large to small. The 'size' of the principal components (i.e. the magnitude of their eigenvalue) is directly related to their importance in capturing the variability in a dataset. Highly collinear datasets are characterized by a few very large principal components followed by many small to very small components. These small components can usually be regarded as the 'noise' in the dataset. Conversely, in a dataset in which (practically) no correlation between the individual variables exists, the principal components will be (nearly) equal in size. The factors underlying the variability in a dataset can be studied by looking at the 'loadings' of the largest, or most important, principal components. The loadings are the coefficients (or weights) that determine the linear combinations of original variables that constitute a component. Usually, the significant components of a dataset are loaded by different selections of, usually similar, variables.
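The contrast between collinear and uncorrelated datasets can be sketched by comparing the 'sizes' (eigenvalues) of their components. Both datasets below are simulated and hypothetical, with 14 variables each; `explained_fractions` is an illustrative helper working on the correlation matrix, which corresponds to PCA on autoscaled data.

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs, n_vars = 200, 14

def explained_fractions(X):
    """Eigenvalues of the correlation matrix as fractions of the total, largest first."""
    corr = np.corrcoef(X, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)[::-1]   # eigvalsh returns ascending order
    return eigvals / eigvals.sum()

# Highly collinear case: every variable is the same hidden factor plus noise.
factor = rng.normal(size=(n_obs, 1))
collinear = factor @ np.ones((1, n_vars)) + 0.5 * rng.normal(size=(n_obs, n_vars))

# Uncorrelated case: every variable is independent noise.
independent = rng.normal(size=(n_obs, n_vars))

frac_collinear = explained_fractions(collinear)
frac_independent = explained_fractions(independent)

print("collinear, first three fractions:  ", frac_collinear[:3])
print("independent, first three fractions:", frac_independent[:3])
```

In the collinear case one component dominates and the rest are small; in the uncorrelated case the fractions are all of similar size, close to 1/14 each.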
Statistical pattern recognition is a field of quantitative statistics encompassing techniques for the development of decision rules for the classification of objects into one (or more) distinct classes. Objects are represented as vectors in the feature space. The feature space is the multidimensional space of measured variables that is used to describe the status of the individual objects. Since many observable variables for real world systems are (highly) correlated, the dimensionality of a feature space is usually reduced before running a pattern recognition analysis. Principal components analysis is usually the method of choice for feature space reduction.
Pattern recognition can be divided into supervised and unsupervised pattern recognition. In unsupervised pattern recognition, groups of observations are defined based on the presence or absence of clusterings of objects in the feature space. In supervised pattern recognition, objects are classified into a priori groups, based on knowledge about the objects (e.g. measurements for a location on a small inland river and measurements for a location near an off-shore oil rig are placed in different groups), and discriminant functions are sought that maximize the separation between these a priori groups based on the groups' vectors in feature space. Figure 4 gives a hypothetical example of objects in a two-dimensional reduced feature space.
Based on the location of the individual objects, an unsupervised pattern recognition method may find three clusters, designated by the blue ellipses, whereas a supervised pattern recognition method may find the discriminant functions depicted as red curves, based on an a priori classification of objects into three 'catchment' clusters that do not coincide with the blue, unsupervised clusters.
There are two main approaches in statistical pattern recognition: the classical approach and the SIMCA approach. In the classical approach, which is used in most statistical literature, a single model, based on a single (reduced) feature space, is created, and one or more linear or nonlinear discriminant functions are optimized so as to maximize the separation between the groups of objects in this feature space. In the SIMCA approach, local models are developed for each group separately, based on reduced feature spaces for the individual groups. For each group a discriminant function is then created in the appropriate feature space, designating the separation between members and non-members for that particular group. A SIMCA approach usually offers somewhat better classification of members vs. non-members than 'single feature space' methods, at the price of less generality and interpretability.
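The idea of one local model per group can be sketched as follows. This is a strongly simplified illustration, not the SIMCA procedure itself: each group gets its own principal components model, and a new object is compared to each model through its residual distance. Published SIMCA implementations use statistical criteria on these residuals to decide membership; that step is omitted here. All data and names (`fit_class_model`, `residual_distance`) are hypothetical.

```python
import numpy as np

def fit_class_model(X, n_components):
    """Local PCA model for one group: its mean and its first principal axes."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def residual_distance(x, model):
    """Distance from an object to the subspace spanned by a group's model."""
    mean, axes = model
    d = x - mean
    return float(np.linalg.norm(d - axes.T @ (axes @ d)))

rng = np.random.default_rng(5)
t = rng.normal(size=(30, 1))

# Two hypothetical groups, each spread along a different direction in 3-D space.
group_a = t @ np.array([[1.0, 0.0, 0.0]]) + 0.05 * rng.normal(size=(30, 3))
group_b = (np.array([5.0, 5.0, 5.0])
           + t @ np.array([[0.0, 1.0, 0.0]])
           + 0.05 * rng.normal(size=(30, 3)))

models = {"a": fit_class_model(group_a, 1), "b": fit_class_model(group_b, 1)}

new_object = np.array([2.0, 0.0, 0.0])
distances = {name: residual_distance(new_object, m) for name, m in models.items()}
print(distances)
```

The new object lies close to the model of group 'a' and far from that of group 'b', so it would be regarded as a member of 'a' and a non-member of 'b'.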
For background information on pattern recognition techniques, the reader is referred to Wold et al. (1983), Dunn & Wold (1990), Morrison (1990), Wang & Milne (1993), or Jain et al. (2000).
PLS regression can be loosely regarded as a regression in which the descriptors (the x-vector or x-matrix) are not the original variables but one or more latent variables. In that respect, PLS regression resembles principal components regression or canonical correlation analysis. The difference between these PCA-based methods and PLS regression is that in PLS regression the latent variables are extracted under the additional constraint of maximal covariance between the corresponding x- and y-axes. The first x- and y-latent variables are those dimensions that maximize the compound constraint of maximum variance in x (or y) and maximum covariance between x and y. The next latent variables maximize the residual variance in x (or y) for each additional step, together with the covariance between x and y.
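For the special case of a single response variable, the difference between the first PLS direction and the first PCA direction can be sketched as below. The data are simulated and hypothetical. With centred data, the unit-length weight vector proportional to Xᵀy gives the largest possible covariance between the x-scores and y, whereas the first PCA direction maximizes the variance of the x-scores without regard to y.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: 30 observations, 4 predictors, one response.
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, -0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=30)

Xc = X - X.mean(axis=0)
yc = y - y.mean()

# First PLS weight vector: direction of maximal covariance with y.
w_pls = Xc.T @ yc
w_pls = w_pls / np.linalg.norm(w_pls)

# First PCA direction: direction of maximal variance in X only.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
w_pca = Vt[0]

def covariance_with_y(w):
    """Sample covariance between the scores Xc @ w and the centred response."""
    return float((Xc @ w) @ yc / (len(yc) - 1))

print("covariance with y, PLS direction:", covariance_with_y(w_pls))
print("covariance with y, PCA direction:", covariance_with_y(w_pca))
```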
The optimum number of latent variables to be included in the PLS regression model (comparable to the number of significant principal components in a principal components decomposition) is generally determined through cross-validation, using a leave-x-out procedure. In a leave-x-out procedure, regression models are calculated for subsets of the data. These regression models are then used to predict the y values for the observations that were excluded from the model. This process is iterated so that each observation is left out and predicted exactly once. An r²-like statistic, usually called Q², is then calculated based on the sum of the squared differences between the observed y values and the predictions calculated with the models from which the observations were excluded. This procedure gives a robust indication of the predictive properties of a model. If this process is repeated for models with increasing numbers of latent variables, there usually is a model with an optimum number of latent variables, for which the Q² value is at a maximum. This number of latent variables, under a few additional constraints, is chosen as the best number of latent variables for a particular model. The 'significance' of Q² is not formally codified, but as a rule of thumb, for a good model, Q² should be less than 0.2 lower than r², i.e. if r² = 0.88, then Q² should be > 0.68 for a good model. Additionally, Q² should be higher than ca. 0.4 to be meaningful. A Q² of < 0 indicates a model that predicts worse than chance, i.e. worse than simply using the mean of the observed y values as the prediction.
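The leave-one-out variant of this procedure can be sketched in a few lines. To keep the example short, an ordinary least squares line on a single predictor stands in for the PLS model; this substitution, the data, and the function names are assumptions made for illustration. Q² is computed here as 1 - PRESS / SS_total, which is one common definition.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least squares slope and intercept for one predictor."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def r2_and_q2(xs, ys):
    """Fitted r2 and leave-one-out cross-validated Q2."""
    my = mean(ys)
    ss_total = sum((y - my) ** 2 for y in ys)

    slope, intercept = fit_line(xs, ys)
    ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

    press = 0.0                       # predictive residual sum of squares
    for i in range(len(xs)):
        # Refit the model without observation i, then predict observation i.
        s, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (s * xs[i] + b)) ** 2

    return 1.0 - ss_residual / ss_total, 1.0 - press / ss_total

# Hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]

r2, q2 = r2_and_q2(xs, ys)
print("r2 =", r2, " Q2 =", q2)
```

For a PLS model the same loop would be run once for each candidate number of latent variables, and the number giving the highest Q² would be a candidate for the final model.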
The results of a PLS regression are usually expressed as graphs of predicted vs. observed dependent values, the correlation coefficient r² for these predicted vs. observed dependent values, and the cross-validated (or leave-one-out) r², i.e. Q². The r² indicates the goodness of fit, while the Q², as said, indicates the predictive power of the model.
The relation between the independent, or predictor variables and the dependent variables, or effects, is usually presented as scores and loading plots for the latent variables. However, in order to interpret the results in terms of the original variables, the PLS solution can be transformed into an ordinary least squares form, which is expressed as a set of pseudo-regression coefficients in the original variables. These coefficients are 'pseudo'-regression coefficients since they do not denote the true OLS solution but the PLS solution.
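For a single response variable, the transformation into coefficients in the original variables can be sketched with the NIPALS form of the PLS algorithm. This is a minimal sketch on simulated, hypothetical data, not a reproduction of any particular software package; the function name `pls1_coefficients` is illustrative. The coefficients refer to centred X and y.

```python
import numpy as np

def pls1_coefficients(X, y, n_components):
    """Pseudo-regression coefficients of a one-response PLS model (NIPALS)."""
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)      # weight vector
        t = Xk @ w                     # x-scores (latent variable)
        tt = t @ t
        p = Xk.T @ t / tt              # x-loadings
        c = (yk @ t) / tt              # regression of y on the scores
        Xk = Xk - np.outer(t, p)       # remove what this component explains
        yk = yk - c * t
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    # Express the latent-variable model in terms of the original variables.
    return W @ np.linalg.solve(P.T @ W, q)

rng = np.random.default_rng(4)
shared = rng.normal(size=(40, 1))
X = 0.7 * shared + rng.normal(size=(40, 3))          # correlated predictors
y = X @ np.array([1.0, 0.5, -0.5]) + 0.2 * rng.normal(size=40)

b_one = pls1_coefficients(X, y, 1)     # model with one latent variable
b_full = pls1_coefficients(X, y, 3)    # as many latent variables as predictors

Xc, yc = X - X.mean(axis=0), y - y.mean()
b_ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]

print("PLS, 1 latent variable: ", b_one)
print("PLS, 3 latent variables:", b_full)
print("ordinary least squares: ", b_ols)
```

With fewer latent variables than predictors, the coefficients generally differ from the ordinary least squares solution, which is why they are called 'pseudo'-regression coefficients; when all latent variables are retained and X has full column rank, the two solutions coincide.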
For a more thorough theoretical discourse on PLS analysis, the reader is referred to Höskuldsson (1988), Stone & Brooks (1990), Wold (1975), or Wold et al. (1984), and for some examples of applications of PLS regression to Eide & Johansson (1994), Eriksson et al. (1995), Sjöström & Eriksson (1995), or Verhaar et al. (1994).
Ordinary statistical routines usually treat missing values by deleting any objects or variables with missing values. Unfortunately, with real world data, this usually results in the deletion of a major part of the data and a good deal of the information contained in it. Multivariate methods, including PCA and PLS, can cope with limited amounts of missing data in parameter matrices. In general, PCA and PLS results do not severely degrade with amounts of missing data up to 20%, provided that these data gaps are distributed randomly over the data matrix (see e.g. Eriksson, 1995). Qualitatively speaking, missing values in a predictor or multivariate response matrix are replaced with zeros. This has the added benefit that with matrices that are centered, or centered and scaled to unit variance, which is the default approach in PLS modelling, the missing values are set to the variable mean as a first estimate. Then the appropriate PCA or PLS model is calculated iteratively: in each iteration the missing values are replaced with the values that are most consistent with the model, while incurring zero leverage on the model, until the estimates of the missing data converge.
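The iterative idea can be sketched for PCA as follows. This is a simplified variant and an assumption on my part rather than the exact scheme used in PLS software: missing cells start at the variable mean, a low-rank model is fitted by singular value decomposition, the missing cells are replaced by the model's reconstruction, and the cycle repeats until the estimates stop changing. Unlike the scheme described above, the filled-in values do take part in each fit. Data and names are hypothetical.

```python
import numpy as np

def impute_with_pca(X, n_components, n_iter=500, tol=1e-10):
    """Fill NaN cells by alternating between a PCA fit and re-estimation."""
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    # First estimate: the mean of the observed values of each variable.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        k = n_components
        reconstruction = (U[:, :k] * s[:k]) @ Vt[:k] + mean
        # Only the missing cells are updated; observed values are kept as they are.
        updated = np.where(missing, reconstruction, X)
        change = np.max(np.abs(updated - filled))
        filled = updated
        if change < tol:
            break
    return filled

# Hypothetical dataset with exact one-factor structure and two missing cells.
a = np.linspace(1.0, 3.0, 20)
b = np.array([1.0, 2.0, 0.5, 1.5, 1.0])
truth = np.outer(a, b)

X = truth.copy()
X[4, 2] = np.nan
X[15, 0] = np.nan

filled = impute_with_pca(X, n_components=1)
print("estimated:", filled[4, 2], filled[15, 0])
print("true:     ", truth[4, 2], truth[15, 0])
```

Because the example data follow a one-component model exactly, the estimates converge to the true values; with noisy real data they converge to the values most consistent with the fitted model.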
©2004 Henk Verhaar