Aquatic Sciences 57/3, 1995. © 1995 Birkhäuser Verlag, Basel

Multivariate analysis of aquatic toxicity data with PLS

Lennart Eriksson 1, Joop L.M. Hermens 2, Erik Johansson 1, Henk J.M. Verhaar 2 and Svante Wold 3

1 Umetri AB, P.O. Box 7960, 90719 Umeå, Sweden
2 Research Institute of Toxicology, Utrecht University, P.O. Box 80176, 3508 TD Utrecht, The Netherlands
3 Research Group for Chemometrics, Department of Organic Chemistry, Umeå University, 90187 Umeå, Sweden

Key words: PLS, multivariate analysis, experimental design, aquatic toxicity, QSAR

ABSTRACT

A common task in data analysis is to model the relationships between two sets of variables, the descriptor matrix X and the response matrix Y. A typical example in aquatic science concerns the relationships between the chemical composition of a number of samples (X) and their toxicity to a number of different aquatic species (Y). This modelling is done in order to understand the variation of Y in terms of the variation of X, but also to lay the groundwork for predicting Y of unknown observations based on their known X-data. Correlations of this type are usually expressed as regression models, and are rather common in aquatic science. Often, however, the multivariate X and Y matrices invalidate the use of multiple linear regression (MLR) and call for methods which are better suited for collinear data. In this context, multivariate projection methods represent a highly useful alternative, in particular partial least squares projections to latent structures (PLS). This paper introduces PLS, highlights its strengths and presents applications of PLS to modelling aquatic toxicity data. A general discussion of regression, comparing MLR and PLS, is provided.

1 Introduction

1.1 The evolution of data matrices

In the early days of this century it was difficult to make extensive measurements on a series of investigated samples. Thus, data tables usually had many more observations (rows) than variables (columns), see for instance Fisher (1936). This type of data arrangement, with more observations than variables, is representative of most data tables that arose in scientific applications at that time (Wold et al., 1984). We refer to such matrices as "long and lean" (Fig. 1). Today, however, the reality for experimentalists has changed. It is no longer difficult and time-consuming to measure variables.
Due to the introduction of modern electronics, a vast array of technical instruments (spectrophotometers, chromatographs, etc.) has been devised, capable of outputting hundreds or thousands of variables within a short period of time, reflecting the characteristics of a sample. The number of observations in a data table, on the other hand, is comparatively difficult to increase, because regulations nowadays apply regarding costs, time and ethics (individuals, animal testing, etc.), constraining the number of samples. The practical consequence of this is that data matrices are no longer typically "long and lean" but rather "short and fat" (Fig. 1). This fact places new demands on data analytical techniques (Wold, 1995).

Figure 1. Two shapes of data matrices, long and lean, and short and fat, and some assumptions of common data analytical methods. Classical methods of statistics (multiple linear regression, canonical correlation, linear discriminant analysis, analysis of variance, maximum likelihood methods) assume that the X-variables are independent and exact and that the residuals are randomly distributed. Chemometric projection methods (PCA, PLS, PCR, PLS-DA) allow X-variables that are not independent and may have errors, and residuals that may be structured. Some methods, such as ridge regression, have an intermediate position between the classical and chemometric techniques.

Figure 2. The three problem types of data analysis (overview, classification, and quantification/prediction) and some pertinent uni-, bi- and multivariate methods.

1.2 Three types of data analytical problems

Once experimental data have been acquired they must be analysed to separate information from noise. In principle, data analytical problems can be divided into three major types, regardless of application. These are: (i) summarizing a data set, (ii) comparing groups, aiming at classification of unknown samples, and (iii) modelling of relationships between variables or sets of variables for quantification and prediction purposes (Fig. 2). In the univariate, bivariate and few-variate (less than, say, five variables) cases, (i) and (ii) can be accomplished by calculating variable averages, standard deviations and covariances and evaluating these. In the multivariate case, however, this approach becomes tedious and inefficient, and other alternatives must be sought. Moreover, with case (iii), one can plot and compare one pair of variables at a time, or try to find a mathematical expression linking a predictor variable to a response variable using multiple linear regression. Again, the analyst runs into difficulties in the multivariate case, because not only are there many pairwise variable comparisons to make, but the risk of coincidental correlations increases quadratically with increasing number of variables (Topliss and Edwards, 1979). In science in general, and certainly in aquatic science, many applications are of type (iii). For instance, numerous examples are found in the literature regarding the determination of the toxicity of chemicals to aquatic species, in which quantitative relationships are explored between chemical properties of compounds and toxicological responses (Blum and Speece, 1990).
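The chance-correlation problem mentioned above can be made concrete with a small simulation. The sketch below is our own illustration (synthetic data, Python/numpy only) and is not taken from the studies cited: it draws a response that is pure noise and reports the largest absolute correlation found among an increasing number of random predictors.

    import numpy as np

    # With few observations and many candidate predictors, the best-looking
    # univariate correlation with a purely random response can be deceptively large.
    rng = np.random.default_rng(1)
    n_obs = 15
    for n_vars in (2, 10, 50, 200):
        X = rng.normal(size=(n_obs, n_vars))
        y = rng.normal(size=n_obs)                      # response unrelated to X
        r = [abs(np.corrcoef(X[:, k], y)[0, 1]) for k in range(n_vars)]
        print(f"{n_vars:4d} random predictors: max |r| with y = {max(r):.2f}")
    # Typically the maximum |r| climbs from roughly 0.3 towards 0.7-0.8 as the
    # number of predictors grows, even though no real relationship exists.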
In the present contribution we are concerned with data analytical methods suitable for finding and probing such quantitative relationships, and in particular our aim is to introduce partial least squares (PLS) regression analysis and illuminate its utility in aquatic toxicity research. But before we enter this discussion, we shall briefly review some general points regarding regression analysis - case (iii) - and point out some of the problems that may occur when applying multiple linear regression (MLR) to the "short and fat" data structures that are predominant today.

1.3 Consequences of "short and fat" data structures on regression

The classical approach to regression problems is MLR. For MLR to work properly, however, the experimental data must fulfil certain statistical conditions, which reflect the basic assumptions underlying the technique (Draper and Smith, 1981; Wold, 1995) (Fig. 1). Notably, the predictor variables, normally called X, are assumed to be mathematically independent, implying that a change in one X-variable is not strongly coupled to a change in another X-variable. Mathematical independence means that the rank of X is K (i.e. equals the number of X-variables). This assumption may be reasonable for "long and lean" data matrices, but it is violated as soon as matrices are multivariate and include collinear variables (Wold, 1995; Topliss and Edwards, 1979). Such multicollinearity occurs whenever some predictor variables are linear functions of other predictor variables, a feature which is typical, for instance, of spectroscopic data. It appears automatically with short and fat data (few observations, many variables), regardless of origin and regardless of moderate pairwise correlations. If MLR is applied to data sets exhibiting collinearities, the calculated regression coefficients become unstable and their interpretability breaks down (Draper and Smith, 1981; Topliss and Edwards, 1979; Lindgren, 1994). For example, certain coefficients may be much larger than expected, or they may even have the wrong sign (Lindgren, 1994; Mullet, 1976). In fact, the problem of sign inversion with respect to an anticipated correlation structure is not uncommon, and will here be exemplified with a small data set from the literature (see below). Furthermore, stepwise multiple linear regression (SMLR) with variable deletion is sensitive to collinear data structures. SMLR models give rise to misleading interpretations and poor predictions (Frank and Friedman, 1993; Topliss and Edwards, 1979).

1.4 The need for multivariate projection methods

One way to circumvent the dilemma of multicollinearity is to turn it to advantage, by employing multivariate projection methods, such as partial least squares projections to latent structures, PLS. This method is particularly adept at handling the situation when the number of variables exceeds the number of observations. This is because projections to latent variables in multivariate space tend to become more distinct and stable the more variables are involved (Wold, 1995; Lindgren, 1994). PLS is a recently developed generalization of regression and gives results identical to MLR in situations where X has full rank. In most other cases, PLS gives a solution that is reminiscent of that of MLR. In addition, however, PLS provides a set of score and loading plots that inform about the correlation structure between predictor and response variables, and about similarities among the observations. Model interpretation is also facilitated by these plots.
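Before turning to the real data sets, the sign-inversion problem described above can be made concrete with a minimal sketch. The construction is our own (synthetic data, Python/numpy only): two predictors are nearly collinear and both relate positively to the response, yet the MLR coefficients are unstable and can come out with opposite signs from run to run.

    import numpy as np

    rng = np.random.default_rng(0)
    for run in range(3):
        n = 20
        x1 = rng.normal(size=n)
        x2 = x1 + 0.01 * rng.normal(size=n)         # nearly a copy of x1
        y = x1 + x2 + 0.5 * rng.normal(size=n)      # true coefficients: +1 and +1
        X = np.column_stack([x1, x2])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)   # MLR (least squares) fit
        print(f"run {run}: b1 = {b[0]:+.1f}, b2 = {b[1]:+.1f}")
    # The individual coefficients are far from (+1, +1) and one of them is often
    # negative, although their sum (the stable direction) stays close to +2.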
PLS will be described in detail below. We note that there exist alternatives to PLS for the multivariate analysis of aquatic science data. Some of these methods are principal components regression (PCR), canonical correspondence analysis (CCA), correspondence analysis scaling (CAS, for discrete data), redundancy analysis (RA) and ridge regression (RR) (Jackson, 1991; Jongman et al., 1987). In fact, the CCA of ter Braak (Jongman et al., 1987) is similar to PLS in the same way as correspondence analysis is similar to principal component analysis of appropriately scaled data. We do not give a detailed account of these alternatives to PLS; the interested reader is referred to the appropriate references.

2 Examples

In order to introduce PLS to aquatic science and highlight some of its useful features, we shall consider three data sets from the literature. Two of the data sets are directly connected with aquatic toxicology, whereas one is not related to aquatic science but is included because it is simple and yet illustrative of the typical problems associated with MLR when applied to data sets containing collinear variables.

2.1 Energy of protein unfolding (Data Set I)

The first data set concerns a series of 19 proteins (tryptophan synthase alpha unit of bacteriophage T4 lysozyme, modified in position 49, Table 1). The altered amino acids are described by four predictor variables, namely lipophilicity (PIF, x1), polarity (DGR, x2), molecular surface area (SAC, x3) and molecular refractivity (MR, x4). The response variable of interest is the energy of unfolding of these modified proteins. It should be noted that two pairs of variables, x1/x2 and x3/x4, are highly correlated, with r2 > 0.9. For more details, reference is made to the literature (Wold, 1995; El Tayar et al., 1992).

Table 1. Chemical descriptor data and biological response for data set I

Protein no  Amino acid   PIF (x1)  DGR (x2)  SAC (x3)  MR (x4)  DDGTS (y1)
 1          Ala           0.31     -0.55     254.2     2.126     8.5
 2          Asn          -0.60      0.51     303.6     2.994     8.2
 3          Asp          -0.77      1.20     287.9     2.994     8.5
 4          Cys           1.54     -1.40     282.9     2.933    11.0
 5          Gln          -0.22      0.29     335.0     3.458     6.3
 6          Glu          -0.64      0.76     311.6     3.243     8.8
 7          Gly           0.00      0.00     224.9     1.662     7.1
 8          His           0.13     -0.25     337.2     3.856    10.1
 9          Ile           1.80     -2.10     322.6     3.350    16.8
10          Leu           1.70     -2.00     324.0     3.518    15.0
11          Lys          -0.99      0.78     336.6     2.933     7.9
12          Met           1.23     -1.60     336.3     3.860    13.3
13          Phe           1.79     -2.60     366.1     4.638    11.2
14          Pro           0.49     -1.50     288.5     2.876     8.2
15          Ser          -0.04      0.09     266.7     2.279     7.4
16          Thr           0.26     -0.58     283.9     2.743     8.8
17          Trp           2.25     -2.70     401.8     5.755     9.9
18          Tyr           0.96     -1.70     377.8     4.791     8.8
19          Val           1.22     -1.60     295.1     3.054    12.0

Correlation matrix:

          PIF       DGR       SAC       MR        DDGTS
PIF       1        -0.96832   0.416383  0.555481  0.711445
DGR                 1        -0.46264  -0.58201  -0.64764
SAC                           1         0.955283  0.267735
MR                                      1         0.290469
DDGTS                                             1

x1 (PIF) = lipophilicity constant; x2 (DGR) = polarity measure; x3 (SAC) = accessible surface area [Å]; x4 (MR) = molecular refractivity; y1 (DDGTS) = the unfolding free energy in water of the 19 modified proteins [kcal/mol]. All variables are taken from Wold (1995) and El Tayar et al. (1992).
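As a numerical check on the correlations quoted above and listed in the correlation matrix of Table 1, the following sketch (Python/numpy only) assembles the five Table 1 columns and recomputes the pairwise correlation coefficients.

    import numpy as np

    # Data set I, columns as in Table 1
    PIF   = [0.31, -0.60, -0.77, 1.54, -0.22, -0.64, 0.00, 0.13, 1.80, 1.70,
             -0.99, 1.23, 1.79, 0.49, -0.04, 0.26, 2.25, 0.96, 1.22]
    DGR   = [-0.55, 0.51, 1.20, -1.40, 0.29, 0.76, 0.00, -0.25, -2.10, -2.00,
             0.78, -1.60, -2.60, -1.50, 0.09, -0.58, -2.70, -1.70, -1.60]
    SAC   = [254.2, 303.6, 287.9, 282.9, 335.0, 311.6, 224.9, 337.2, 322.6, 324.0,
             336.6, 336.3, 366.1, 288.5, 266.7, 283.9, 401.8, 377.8, 295.1]
    MR    = [2.126, 2.994, 2.994, 2.933, 3.458, 3.243, 1.662, 3.856, 3.350, 3.518,
             2.933, 3.860, 4.638, 2.876, 2.279, 2.743, 5.755, 4.791, 3.054]
    DDGTS = [8.5, 8.2, 8.5, 11.0, 6.3, 8.8, 7.1, 10.1, 16.8, 15.0,
             7.9, 13.3, 11.2, 8.2, 7.4, 8.8, 9.9, 8.8, 12.0]

    names = ["PIF", "DGR", "SAC", "MR", "DDGTS"]
    R = np.corrcoef(np.array([PIF, DGR, SAC, MR, DDGTS]))
    for i in range(5):
        for j in range(i + 1, 5):
            print(f"r({names[i]}, {names[j]}) = {R[i, j]:+.3f}")
    # e.g. r(PIF, DGR) = -0.968 and r(SAC, MR) = +0.955, the two collinear pairs,
    # and r(DGR, DDGTS) = -0.648, r(MR, DDGTS) = +0.290 (cf. section 4.1).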
2.2 Aquatic toxicity of mono-nitrobenzene derivatives (Data Set II)

In the second example, quantitative structure-activity relationship (QSAR) modelling is attempted for a set of 15 mono-nitrobenzene derivatives (Table 2). The goal in this study is to be able to model and predict the aquatic toxicity profiles of the 15 chemicals based on information concerning their chemical properties. The 15 compounds were multivariately characterized using an ensemble of eight experimental and quantum-chemically derived descriptor variables (Table 2). Variables like boiling point (Bp), melting point (Mp) and density (D) were taken from standard reference compilations (Weast, 1987), whereas log P and sigma minus were obtained from the original work (Deneer et al., 1987; Deneer et al., 1989), and the three theoretical descriptors (HOMO, LUMO, hardness (eta)) from semi-empirical molecular orbital calculations (Stewart, 1990). In total, eight biological responses were available for this set of compounds. These are primarily related to toxicity towards the four aquatic species Poecilia reticulata, Daphnia magna, Chlorella pyrenoidosa and Photobacterium phosphoreum (Table 2). With the exception of the response BCF, lower measured response values imply higher toxicity (Table 2).

Table 2. The eight chemical descriptors (x1-x8) and the eight aquatic toxicity responses (y1-y8) for data set II. The 15 compounds are: 1 nitrobenzene, 2 1-chloro-2-nitrobenzene, 3 1-chloro-3-nitrobenzene, 4 1-chloro-4-nitrobenzene, 5 1,2-dichloro-3-nitrobenzene, 6 1,3-dichloro-4-nitrobenzene, 7 1,4-dichloro-2-nitrobenzene, 8 1,3-dichloro-5-nitrobenzene, 9 2-nitrotoluene, 10 3-nitrotoluene, 11 4-nitrotoluene, 12 4-chloro-2-nitrotoluene, 13 2-chloro-6-nitrotoluene, 14 2,3-dimethylnitrobenzene, 15 3,4-dimethylnitrobenzene.

Descriptor values x1-x6 are listed by variable; values of Bp and D are missing for some compounds (cf. section 3.2 on missing data):
Bp: 210.8, 246, 235, 242, 257, 258, 267, 221.7, 232.6, 238.3, 240, 238, 240, 254 (14 values given)
Mp: 5.7, 34.5, 43, 83.6, 62, 30, 56, 65.4, -9.5, 16, 54.5, 38, 35, 15, 30
D: 1.2037, 1.348, 1.534, 1.298, 1.449, 1.669, 1.692, 1.1629, 1.1571, 1.392, 1.1402, 1.112 (12 values given)
log P: 1.89, 2.26, 2.49, 2.35, 3.01, 2.9, 2.9, 3.13, 2.3, 2.4, 2.34, 3.05, 3.09, 2.83, 2.91
sigma: 0, 0.27, 0.37, 0.27, 0.64, 0.54, 0.64, 0.74, -0.15, -0.07, -0.15, 0.22, 0.22, -0.22, -0.22
HOMO: -10.5615, -10.3348, -10.3668, -10.474, -10.2826, -10.4768, -10.2177, -10.4143, -10.1716, -10.1972, -10.3039, -10.0528, -10.1267, -9.94105, -10.0749

x1 (Bp) = boiling point [°C]; x2 (Mp) = melting point [°C]; x3 (D) = density; x4 (log P) = log octanol/water partition coefficient; x5 (sigma) = sigma minus of Hansch and Leo; x6 (HOMO) = energy of the highest occupied molecular orbital [eV]; x7 (LUMO) = energy of the lowest unoccupied molecular orbital [eV]; x8 (eta) = hardness, (LUMO - HOMO)/2 [eV]. y1 (DMI48h) = log conc. causing immobilization of 50% of D. magna after 48 h [µmol/l]; y2 (DMI21d) = log conc. causing immobilization of 50% of D. magna after 21 days [µmol/l]; y3 (DMRm) = log lowest conc. causing significantly lowered population growth of D. magna after 21 days [µmol/l]; y4 (DMle) = log lowest conc. causing significantly lowered mean length of D. magna after 21 days [µmol/l]; y5 (CPEC50) = log conc. causing 50% decrease in population density of C. pyrenoidosa after 96 h [µmol/l]; y6 (PHEC50) = log conc. causing 50% decrease in bioluminescence of Ph. phosphoreum after 15 min [µmol/l]; y7 (PoeLC50) = log conc. causing 50% lethality of P. reticulata after 14 days [µmol/l]; y8 (BCF) = log bioconcentration factor for P. reticulata. For more details on these data, reference is made to the original literature (Deneer et al., 1987; Deneer et al., 1989).
Table 2 (continued)

No   LUMO      eta       DMI48h  DMI21d  DMRm  DMle  CPEC50  PHEC50  PoeLC50  BCF
 1   -1.06761  4.74693   2.43    2.29    2.16  2.16  2.16    2.16    2.70     1.47
 2   -1.08170  4.626555  2.18    1.83    1.80  1.80  1.64    1.46    2.28     2.29
 3   -1.28490  4.540965  2.10    1.77    1.05  1.80  1.08    1.92    1.99     2.42
 4   -1.34278  4.565585  1.63    1.46    1.05  1.31  1.49    2.33    1.58     2.46
 5   -1.21595  4.533345  1.34    1.26    0.97  0.72  1.18    0.89    1.34     3.01
 6   -1.51518  4.480800  1.34    1.36    0.72  1.22  1.10    0.95    1.54     3.02
 7   -1.29702  4.460335  1.76    1.30    0.97  1.22  1.04    1.64    1.41     2.92
 8   -1.48489  4.464725  1.59    1.15    0.46  0.72  0.49    1.97    1.47     3.01
 9   -1.01153  4.580055  1.90    1.73    1.86  1.86  2.54    1.13    2.38     2.28
10   -1.01392  4.591625  1.74    1.78    1.37  1.12  2.01    1.46    2.34     2.31
11   -1.04420  4.629870  2.14    1.71    1.61  1.61  2.21    1.90    2.43     2.37
12   -1.22550  4.413635  1.73    1.60    1.02  1.27  1.54    1.45    1.56     3.02
13   -1.20624  4.460220  1.39    1.30    1.02  1.27  1.60    0.71    1.48     3.09
14   -0.96153  4.489760  1.44    1.40    1.33  1.33  1.62    0.55    1.61     2.86
15   -0.99881  4.538065  2.02    1.59    1.33  1.33  1.77    1.15    1.79     2.84

2.3 Identification of sources of acute toxicity in produced water (Data Set III)

On offshore oil production platforms, large volumes of so-called produced water are discharged into the sea. In addition to dispersed oil, such produced water contains dissolved hydrocarbons, organic acids, phenols, salts of heavy metals and traces of chemicals added along the process line (Johnsen et al., 1994). In order to understand the possible environmental impact of produced water emissions, it is important to uncover the causes of observed toxicological effects. In this example, artificial water samples were tested for their acute toxicity using the Microtox test. The produced water samples were made artificially based on detailed knowledge of the composition of "real" produced water. In these samples, the influence of five constituents (chemical factors) on aquatic toxicity was studied using statistical experimental design (Box et al., 1978). In total, 24 mixtures of produced water were blended according to a statistical experimental design in the five chemical factors (Table 3). The five factors were x1 (Aro), representing the dissolved crude oil aromatic fraction, x2 (Phen), corresponding to the phenolic fraction, x3 (HM), representing a mixture of water-soluble salts of various heavy metals, x4 (Inhib), the most toxic corrosion inhibitor employed at the oil fields, and x5 (Flocc), the most toxic flocculant added along the production line. Among these five factors, x1-x3 are unavoidable constituents when producing oil, whereas x4 and x5 represent contaminants from artificial additives. For each one of these 24 water samples, four toxicity responses were registered in the Microtox assay, and the endpoints were expressed as the absolute reduction in light emission at four different dilutions of the water mixtures (Johnsen et al., 1994). In summary, the objective was to investigate whether the artificial constituents (x4 and x5) caused significant aquatic toxicity. A sketch of the underlying design is given below.
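The statistical experimental design behind Table 3 (a fractional factorial in the five factors plus centre points, cf. section 4.3) can be sketched in a few lines of Python. The defining relation x5 = x1*x2*x3*x4 is our reading of the level pattern in Table 3 and is not stated explicitly in the text; the run order also differs from the table.

    import itertools
    import numpy as np

    # 2^(5-1) fractional factorial in coded units (-1 = low, +1 = high)
    base = np.array(list(itertools.product([-1, 1], repeat=4)))   # 16 runs in x1..x4
    x5 = base.prod(axis=1, keepdims=True)                         # generator: x5 = x1*x2*x3*x4 (assumed)
    factorial = np.hstack([base, x5])
    centres = np.zeros((3, 5))                                    # three centre points (runs 17-19)
    design = np.vstack([factorial, centres])
    print(design.shape)                                           # (19, 5): the initial design of data set III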
Table 3. Factors and responses of the experimental design underlying data set III

Run   Aro   Phen   HM      Inhib  Flocc   V18    V14    V12    V11
 1    0.1   0.05   0.0064   0.5   15       0.5    1      2      9
 2   10     0.05   0.0064   0.5    0.5    44.75  55.5   64.5   77
 3    0.1   5      0.0064   0.5    0.5    30     38.5   44.5   48.75
 4   10     5      0.0064   0.5   15      51.5   58.75  65.75  70.75
 5    0.1   0.05   0.32     0.5    0.5     1.5    2.25   6.5   11.25
 6   10     0.05   0.32     0.5   15      34.25  46     56     61.5
 7    0.1   5      0.32     0.5   15      31     40.25  48.75  50.5
 8   10     5      0.32     0.5    0.5    40     50     61.5   67
 9    0.1   0.05   0.0064  15      0.5     0      1     10.25  34.25
10   10     0.05   0.0064  15     15      44.25  58.25  63.5   74.25
11    0.1   5      0.0064  15     15      26.25  36     47.25  59.75
12   10     5      0.0064  15      0.5    50     58.5   66.25  74.5
13    0.1   0.05   0.32    15     15       0      0      4.5   14
14   10     0.05   0.32    15      0.5    39.5   54.5   62.5   70.75
15    0.1   5      0.32    15      0.5    25.5   36.5   50.25  62.25
16   10     5      0.32    15     15      51.25  62     68.5   72.75
17    5     2.5    0.16     7.5    7.5    38.5   48.5   57.25  65.5
18    5     2.5    0.16     7.5    7.5    38.5   47     56     62
19    5     2.5    0.16     7.5    7.5    36.5   46.75  55.75  61.25
20    5     2.5    0.16     7.5    7.5    37     46.75  55.75  62.25
21    0.1   2.5    0.16     7.5    7.5    18.25  27     35.25  43.75
22   10     2.5    0.16     7.5    7.5    45.5   57     66.5   69.75
23    5     0.05   0.16     7.5    7.5    30     41.25  49.5   60
24    5     5      0.16     7.5    7.5    44     53.5   58.5   65.25

x1 (Aro) = aromatics representing the dissolved crude oil fraction [ppm]; x2 (Phen) = mixture of phenols and C1-C4 alkylated phenols [ppm]; x3 (HM) = mixture of water-soluble salts of the heavy metals Cr, Mn, Cu, Zn, Hg and Pb [ppm]; x4 (Inhib) = corrosion inhibitor [ppm]; x5 (Flocc) = flocculant used in oil production [ppm]. Microtox responses are given as the absolute reduction in light emission: y1 (V11) = undiluted sample; y2 (V12) = sample diluted 1:2 with sea water; y3 (V14) = sample diluted 1:4 with sea water; y4 (V18) = sample diluted 1:8 with sea water. For more details, see Johnsen et al. (1994).

3 The linear PLS model

The development of a PLS model can be described as follows. For a certain set of observations - compounds in data sets I and II and mixed water samples in data set III - appropriate response variables are monitored. These form the N x M response data matrix Y, where N and M are the number of observations and responses, respectively.
Moreover, for the same set of observations, relevant predictor variables are gathered to constitute the N x K predictor matrix X, where N is the same as above and K is the number of predictor variables. The Y-data are then modelled by the X-data using PLS. A geometric representation of PLS is given in Fig. 3. The observations can be seen as points in two spaces, that of X with K dimensions and that of Y with M dimensions. PLS finds lines, planes or hyperplanes in X and Y that map the shapes of the point-swarms as well as possible. PLS has two primary objectives, namely to approximate X and Y well and to model the relationship between X and Y. This is accomplished by making the bilinear projections

X = TP' + E    (1)
Y = UC' + G    (2)

and connecting X and Y through the inner relation

U = T + H    (3)

where E, G and H are residual matrices. Here T is N x A, P is K x A and C is M x A, where A is the number of PLS components. A more detailed account of the PLS algorithm is given in the appendix. PLS simultaneously projects the X- and Y-variables onto the same subspace, T, in such a manner that there is a good relation between the position of an observation on the X-plane and its corresponding position on the Y-plane (Fig. 3). Moreover, this relation is asymmetric (X -> Y), which follows from equation (3). In this respect, PLS differs from, e.g., canonical correlation, where the relation is symmetric. In essence, each PLS model dimension consists of the X-score vector t, the Y-score vector u, the X-loading vector p, the X-weight vector w and the Y-weight vector c (see appendix). The weight vectors w and c are used for interpreting which X-variables are influential for modelling the Y-variables. Another way to see PLS is that it forms "new X-variables", t, as linear combinations of the old ones, and thereafter uses these new t's as predictors of Y. Only as many new t's are formed as are needed, and this is assessed from their predictive power (see below).

Figure 3. A geometrical representation of PLS

3.1 Interpretation

Once a PLS model has been derived, it is important to construe its meaning. For this, the scores t and u are considered. They contain information about the observations and their similarities/dissimilarities in X- and Y-space with respect to the given problem and model. The X-weights w and the Y-weights c provide information about how the variables combine to form t and u, which in turn express the quantitative relation between X and Y. Hence, these weights are essential for understanding which X-variables are important for modelling Y (numerically large w-values), which X-variables provide common information (similar profiles of w-values), and for the interpretation of the scores t. Sometimes it may be quite taxing to get an overview of the PLS weights, especially if the number of latent variables to consider is larger than about three. In such circumstances, PLS provides a powerful alternative, the VIP (variable influence on projection) parameter, which informs about the relevance of each X-variable pooled over all dimensions and Y-variables (Wold, 1995). In principle, the squared VIP value is a weighted sum of squares of the PLS weights w, taking into account also the amount of Y-variance explained by each latent variable. We note that for a one-dimensional PLS model, the VIP values are proportional to the values of w.
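The VIP parameter can be sketched numerically. The formula below is our reading of the verbal definition above (squared, normalized weights pooled over components, weighted by the Y-variance each component explains); scikit-learn's PLSRegression is used as a generic PLS implementation in place of the packages named in section 3.5, and the data are synthetic.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    def vip(pls):
        T = pls.x_scores_                   # N x A score matrix
        W = pls.x_weights_                  # K x A weight matrix
        Q = pls.y_loadings_                 # M x A Y-weight matrix
        K = W.shape[0]
        ssy = np.sum(T ** 2, axis=0) * np.sum(Q ** 2, axis=0)   # Y-variance explained per component
        Wn = W / np.linalg.norm(W, axis=0)                      # normalize each weight vector
        return np.sqrt(K * (Wn ** 2) @ ssy / ssy.sum())         # one VIP value per X-variable

    rng = np.random.default_rng(0)
    X = rng.normal(size=(24, 8))
    Y = X[:, :2] @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(24, 4))
    model = PLSRegression(n_components=2).fit(X, Y)
    print(vip(model))   # average of the squared VIPs is 1; values well above 1 flag influential X-variables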
Alternatively, the PLS solution may be recast as a regression-like model:

Y = X BPLS + F    (4)

Here BPLS corresponds to the regression coefficients. Thus, these coefficients are determined from the underlying PLS model and can be used for interpretation in the same way as coefficients originating from MLR. However, with collinear variables we must remember that these coefficients are not independent. The parts of the data that are not explained by the model, the residuals, are of diagnostic interest. Large Y-residuals indicate that the model is inadequate, and a normal probability plot of the residuals of a single Y-variable is useful for identifying outliers. In PLS we also get residuals for X, the part not used in the modelling of Y. Such X-residuals are useful for identifying outliers in the X-space, i.e., observations which do not conform to the model.

3.2 Incomplete X and Y matrices (missing data)

PLS tolerates moderate amounts of missing data both in X and Y. With missing data in Y, Y must be multivariate, i.e. have at least two columns. The larger the matrices X and Y are, the higher the proportion of missing data that may be tolerated. Here, the three examples have 19, 15 and 24 observations, and for such ordinary-sized matrices around 10 to 20% missing data elements can be handled, provided that they are not missing according to some systematic pattern. The PLS algorithm accounts for the missing values, in principle by iteratively substituting them with predictions from the model. This corresponds to giving the missing data values that have zero residuals and thus no influence on the model parameters.

3.3 One Y-variable at a time, or all in the same model?

PLS has the ability to model and analyze several Y-variables together. This is favorable when the Y-variables are correlated, because the analyst then obtains only one model to interpret and not one model for each single variable. If the Y's really measure different things, however, and are fairly independent, one gains little by analyzing them in the same model. On the contrary, with fairly independent Y-variables the PLS model tends to have many components and hence be difficult to interpret. The separate modelling of the Y's then gives a set of simpler models with fewer dimensions, which are easier to interpret. To judge whether the Y-variables are correlated or not, it is recommended to precede the PLS analysis with a principal component analysis (PCA) of the Y-matrix. This will inform about the practical rank of Y, A, i.e., the number of components of the PC model. If A is small compared to the number of Y-variables (M), and if we can understand the resulting components, we can conclude that the Y's are correlated, and a PLS model of all responses together is warranted. Often, however, one finds from the PCA that the Y's cluster in two or three groups according to the nature of the activity they measure. This is then an indication that one separate PLS model should be built for each such group of Y-variables.

3.4 The number of PLS components, A

It is essential to determine the correct complexity of a PLS model. With many X-variables there is a substantial risk of "overfitting", i.e., obtaining a well-fitting model with little or no predictive power. Hence a strict test of the statistical significance of each consecutive PLS component is necessary; this test is used to determine where to stop, when components start to be non-significant.
Cross-validation (CV) is a practical and reliable way to test this significance (Wold, 1995; Lindgren, 1994; Wold, 1978), and one that has become standard in PLS analysis. A good discussion of the subject was recently given by Wakeling and Morris (1993). Basically, CV is performed by dividing the data into a number of groups, say seven, and then developing a number of parallel models from the reduced data with one of the groups deleted. It should be noted that setting the number of CV groups equal to N, i.e., the so-called leave-one-out approach, is debatable (Shao, 1993; Wold and Eriksson, 1995). In practice, between five and ten groups work well. After developing a model, the deleted data are used as a test set, and the differences between actual and predicted Y-values are calculated for the test set. The sums of squares of these differences are computed and collected from all the parallel models to form PRESS (predictive residual sum of squares), which is a measure of the predictive ability of the model. Usually, PRESS is re-expressed as Q2 (the "cross-validated R2"), which is (1 - PRESS/SS), where SS is the sum of squares of Y corrected for the mean. This can be compared with R2 = (1 - RSS/SS), where RSS is the residual sum of squares. In models with several Y's, one also obtains R2m and Q2m for each Y-variable. The explained variance, R2, or more strictly R2adj, adjusted for degrees of freedom, varies between 0 and 1, where 1 means a perfect model and 0 a model of no relevance at all. Normally, the predicted variance, Q2, varies between 0 and 1 as well, but negative values, indicating nonsense models, may occasionally be obtained. As a rule of thumb, R2 is normally 5-20% higher than Q2, and a substantially larger difference is a warning of overfitting, or of many irrelevant X-variables.
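The cross-validation scheme just described can be sketched as follows (seven CV groups, PRESS accumulated over the deleted groups, Q2 = 1 - PRESS/SS). scikit-learn's PLSRegression again stands in for the packages of section 3.5, and the data are synthetic; the sketch illustrates the procedure and is not the exact computation used by those packages.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import KFold

    def q2(X, Y, n_components, n_groups=7):
        press = 0.0
        ss = np.sum((Y - Y.mean(axis=0)) ** 2)          # SS of Y, corrected for the mean
        cv = KFold(n_splits=n_groups, shuffle=True, random_state=0)
        for train, test in cv.split(X):
            m = PLSRegression(n_components=n_components).fit(X[train], Y[train])
            press += np.sum((Y[test] - m.predict(X[test])) ** 2)
        return 1.0 - press / ss

    rng = np.random.default_rng(0)
    X = rng.normal(size=(24, 8))
    Y = X[:, :2] @ rng.normal(size=(2, 4)) + 0.2 * rng.normal(size=(24, 4))
    for a in (1, 2, 3, 4):
        print(f"A = {a}: Q2 = {q2(X, Y, a):.2f}")       # Q2 stops improving beyond the true complexity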
3.5 Software

PLS is incorporated into many commercially available statistical packages, including SIRIUS, SCAN, UNSCRAMBLER, PIROUETTE, MODDE and SIMCA. We use MODDE 2.1 for Windows (MODDE manual, 1994) for examples 1 and 3, and SIMCA P 2.1 for Windows (SIMCA P manual, 1994) for example 2. MODDE contains an MLR option (example 1) and statistical experimental design support (example 3), whereas SIMCA offers multivariate projection methods (example 2).

4 Results

4.1 Data set I

As said above, the aim of the first example is not so much to deal with PLS and aquatic data, but rather to highlight some advantages of PLS compared to MLR when dealing with collinear variables. These advantages include the stability of the model and the believability of its results. This data set has four X-variables and one response. When applying MLR to the data, a model with R2 = 0.66, R2adj = 0.57 and Q2 = 0.29 resulted, while the corresponding values for PLS were R2 = 0.48, R2adj = 0.34 and Q2 = 0.30. At first sight MLR seems to outperform PLS. However, since the Q2's are of similar size, the tentative conclusion is that MLR shows overfit. Next, we examine the regression coefficients of both models (Figs. 4 and 5). Interestingly, there are some discrepancies, both with regard to the size of the coefficients and to their sign.

Figure 4. Regression coefficients of scaled and centered variables for data set I (MLR)

Figure 5. Regression coefficients of scaled and centered variables for data set I (PLS)

Let us take a closer look at the variables x2 (DGR) and x4 (MR). According to the MLR model (Fig. 4), DGR has a positive and MR a negative relation to the response variable. As opposed to this result, the PLS model (Fig. 5) suggests that DGR is negatively related to the response variable, and that MR has no modelling influence whatsoever. Furthermore, x1 and x2 are strongly negatively correlated (Table 1) and thus their coefficients ought to have opposite signs in the MLR model. Astonishingly, however, these two variables are estimated by MLR to relate positively to this response. A similar contradiction can also be traced for the variable pair x3/x4. It should be noted, though, that - due to their collinearity - the coefficients of x3 and x4 are statistically insignificant. All these MLR results are puzzling, and we must try to find out what is real and what is misleading. The easiest way to elucidate this is to look at scatter plots of raw data and scrutinize how, for instance, DGR and MR correlate with the response variable in reality.

Figure 6. Raw data plot of the response DDGTS versus the predictor DGR. Notation as in Table 1

Figure 7. Raw data plot of the response DDGTS versus the predictor MR. Notation as in Table 1

Figure 6 shows that DGR has a negative correlation (r = -0.65) and Fig. 7 that MR has, if anything, a weak positive correlation (r = 0.29) with respect to the endpoint. Thus, the raw data support the PLS model and contradict the MLR output. The reason why MLR produces erroneous coefficients is that the X-variables are strongly correlated. Not only are the coefficients of the wrong sign, but the model is likely overfitted as well. Hence, PLS is preferred to MLR for this and other data sets with correlated predictor variables.

4.2 Data set II

The PLS analysis of the second data set, with eight X-variables and eight Y-variables, resulted in a two-component model with R2 = 0.76, R2adj = 0.72 and Q2 = 0.67, which is an excellent result taking into account the biological variability and the fact that eight responses are handled simultaneously. It is also possible to extract similar information for each Y-variable separately, which is presented in Fig. 8. Evidently, the individual R2's vary between 0.53 and 0.95 and the Q2's range from 0.38 to 0.92, and it is clear that five endpoints (DMI21d, DMRm, CPEC50, PoeLC50 and BCF) are well modelled by this battery of eight predictor variables, and that some improvement would be desirable for three responses (DMI48h, DMle and PHEC50).

Figure 8. Individual R2Y (explained sum of squares) and Q2 (predicted variance) for each response

Figure 9. First pair of latent variables for the PLS model of data set II. Notation as in Table 2

Figure 10. Second pair of latent variables for the PLS model of data set II. Notation as in Table 2

Figure 11. Observed versus calculated values for the response DMRm (data set II).
Notation as in Table 2

Figure 12. Observed versus calculated values for the response CPEC50 (data set II). Notation as in Table 2

Figure 13. Normal probability plot of the residuals of DMRm after two PLS components (data set II). Notation as in Table 2

Figure 14. Normal probability plot of the residuals of CPEC50 after two PLS components (data set II). Notation as in Table 2

Figure 15. The second PLS weight vector plotted against the first for the PLS model of data set II. Notation as in Table 2. The eight responses are boxed

Figure 16. Variable influence on projection (VIP) for the predictor variables of data set II. The higher the value, the more influential the variable. Notation as in Table 2

However, the general relationship between X and Y, as expressed by the latent variables t and u (Figs. 9 and 10), is stable, and the conclusion is that the multivariate QSAR is well founded and warranted, and has good predictive power. To explore the fit of the QSAR, we consider Figs. 11 and 12, in which good relationships between observed and calculated responses for DMRm and CPEC50 are displayed. The normal probability plots of the Y-residuals (Figs. 13 and 14) corroborate this model, since no strong deviants are found (the residuals lie well within ±2 SD). For the interpretation of this QSAR model we consider the PLS weights to see how the X- and Y-variables are interrelated (Fig. 15). Figure 15 indicates that all X-variables load strongly in the two model dimensions, and that D, Mp, sigma and LUMO are closely related. A second group is formed by log P, Bp and eta, whereas HOMO provides information different from these two groups. The VIP plot is displayed in Fig. 16, and this column plot reveals that log P is the most important variable, followed by eta, Bp, and so on. This may be interpreted as saying that the hydrophobic properties of the nitrobenzene derivatives are of crucial importance for the toxic effects they elicit. Now that we have tentatively interpreted the QSAR, we may attempt to get some feedback from the PLS score plot in Fig. 9.
(Fig. 10 accounts for a minor portion of the Y-variance and is neglected for this purpose.) This graph summarizes well the distribution of the nitrobenzene derivatives along the various toxicity scales. Altogether, nitrobenzene (no. 1) is the least toxic compound to these aquatic organisms and at the same time exhibits the lowest bioconcentration factor, whereas 1,3-dichloro-5-nitrobenzene (no. 8) is the most potent chemical in the same test systems. Actually, nitrobenzene is the least hydrophobic compound (lowest value of log P) and 1,3-dichloro-5-nitrobenzene the most hydrophobic, which corroborates the previous interpretation concerning the significance of log P. For a deeper toxicological account of these endpoints, we refer to the original literature (Deneer et al., 1987; Deneer et al., 1989). In summary, this example underlines the suitability of PLS for modelling highly correlated X- and Y-variables, and at the same time points out the possibilities of predicting the aquatic toxicity of environmental pollutants based on knowledge of their chemical and structural characteristics.

4.3 Data set III

The third example is different from the two preceding ones in that statistical experimental design was used to plan the experiments. One great advantage of designing the X-matrix is the possibility of evaluating model inadequacies in terms of a lack of fit estimate. In addition, with designed data, the X-variables are independent, which justifies the use of MLR, although we prefer PLS because it can accommodate all four responses in the same model. The current investigation was done in a two-step procedure. From the beginning, 19 experiments (Table 3) were set up according to a 2^(5-1) fractional factorial design with three center-points. This design supports the estimation of linear and interaction terms. The first PLS analysis of the resulting data suggested that only the interaction effect Aro*Phen was meaningful besides the five linear terms. Thereafter PLS was applied to these six X-variables and the four Y-responses, yielding a two-component model with R2 = 0.95, R2adj = 0.92 and Q2 = 0.80. The relationship between X and Y for this model is linear with respect to the first pair of latent variables, but non-linear in the second (Fig. 17). This model deficiency is also seen in the lack of fit of 26.2, which is a large value indicating model imperfections. The interpretation of this model revealed that x1 (Aro) and x2 (Phen) and their joint interaction are the only significant variables, and that the influence of x3-x5 is negligible. This was uncovered by inspecting the resulting regression coefficients and their confidence limits (no plots provided). The investigators wished to better account for the observed non-linearity (cf. Fig. 17). Hence, some experimental trials were added to enable estimation of the squared terms of the two primary variables, Aro and Phen. Thus, the original design was augmented with five trials (runs 20-24 in Table 3) laid out so as to allow Aro^2 and Phen^2 to be estimated.

Figure 17. PLS t2/u2 score plot for model one (19 observations) of data set III. Notation as in Table 3. Note the non-linear relationship between t and u (and hence X and Y)

Figure 18. PLS t2/u2 score plot for model two (24 observations) of data set III.
Notation as in Table 3. Note that the curvature has diminished (cf. Fig. 17)

Figure 19. PLS regression coefficients of the response V18 (with 95% confidence bars) for model two of data set III. Notation as in Table 3

In the PLS analysis of this data set, it was found that of these only Aro^2 was significant. The "final" model thus included seven terms - five linear, one interaction and one square term - with overall statistics of R2 = 0.97, R2adj = 0.95 and Q2 = 0.82. We see in Fig. 18 that the non-linearity in t2/u2 has been eliminated due to the inclusion of Aro^2. The lack of fit is reduced to 13.2. Figure 19 shows the significance of Aro^2 for the response V18, and Fig. 20 displays the response surface for V18 in the two significant factors. In summary, this investigation revealed that only the two factors x1 and x2 critically influence the toxicity responses, and that they do so in a non-additive manner. This is modelled by incorporating one interaction and one square term. It is also evident that this kind of information would have been difficult to acquire without the statistical experimental design of the data.

Figure 20. Response surface plot of the second PLS model of data set III, showing how the response V18 changes as a function of x1 (Aro) and x2 (Phen), with x3-x5 held constant at their center levels (HM = 0.163, Inhib = 7.760, Flocc = 7.750)

5 Discussion

Data analysis carried out with the intention of linking a set of predictor variables, matrix X, to a set of response variables, matrix Y, for quantification and prediction purposes is a common task in scientific and industrial research and development. In chemistry, aquatic science, and related fields, these data tables X and Y are often multicollinear, because they are not generated in adherence to a statistical experimental design protocol, and because the manner in which these tables are produced means that they have many more variables than observations. The classical approach to a regression-type problem formulation relies on methods such as MLR or canonical correlation. As discussed and shown above, however, MLR (and the like) will not work properly when applied to multivariate collinear data. Inevitably, this will only yield models of low relevance and poor reliability, because the derived regression coefficients are highly uncertain. Being a bilinear projection method, PLS provides a rational methodology for modelling the quantitative and often complex relationships between the multivariate matrices X and Y. The assumptions underlying PLS - correlated X-variables, X-variables with errors, residuals that may be structured (cf. Fig. 1) - are more realistic than those underlying MLR. Hence, models developed with PLS will generally have greater practical applicability and be more realistic. In addition, the diagnostics of PLS and similar methods (PCR, CCA, etc.) - the scores, loadings, coefficients and VIP plots, and cross-validation - supply information about the data structure and model complexity that is not attainable in traditional MLR. This facilitates model interpretation and aids the detection of inhomogeneities and inconsistencies in data (Verhaar et al., 1994). The connection between PLS weights, coefficients and VIPs is overviewed in Fig. 21.
Figure 21. An overview of PLS weights, coefficients and VIPs. This may be considered as a two-way phenomenon. The calculation of weights is a pooling over Y-variables, and the computation of regression coefficients is a pooling over components. VIPs, however, are obtained by pooling both ways. Thus, the VIP parameter is the most condensed way of expressing the variable information

The aim of the three examples has been to introduce the concept of multivariate projections and the method of PLS, to illustrate the utility of PLS in aquatic science (as well as many other fields), and to convince present and prospective users of PLS of its analytical power. The first case addressed was an example in which the PLS solution cast some doubt on the MLR model, by indicating that the upper (and more reliable) limits of R2 and R2adj were lower than what could be expected judging from MLR. Also, the regression coefficients of PLS were more in line with reality than those of MLR (cf. Figs. 4-7). Secondly, PLS was used for QSAR analysis, with the final goal of being capable of modelling and predicting the toxicity profiles of mono-nitrobenzene derivatives from their chemical properties. This data set exemplified two features of PLS that are lacking in MLR, viz. the ability to treat several responses (here eight) in one single model, and the capability to cope with incomplete data matrices (missing data). Besides PLS working well in this application, we note that multivariate QSAR modelling is very useful in aquatic science (Hermens, 1989; Blum and Speece, 1990). Finally, PLS was used to explore which among five chemical factors significantly influenced the aquatic toxicity of produced water discharged from oil production. With the statistical experimental design used as a foundation, this example underlines how a complicated system - here the relationship between mixture composition and aquatic toxicity - can be mapped efficiently and intelligently. It is indisputable that only two of the chemical factors considered were influencing the responses. The question of revealing which mixture constituents adversely affect certain species or environmental compartments cannot be adequately resolved and quantified unless such experimental planning is utilized.

6 Concluding remarks

PLS is a rational data analytical tool for gaining insight into the complex systems encountered in aquatic science. The graphical representation of PLS parameters and residuals enables evaluation of the developed models and facilitates their interpretation. Since the assumptions of PLS are more realistic than those of MLR, it is our belief that PLS will be increasingly used in all kinds of complicated scientific applications.

Appendix

In the outline of PLS below, the index a (a = 1, 2, ..., A) runs over the PLS components, the index i (i = 1, 2, ..., N) over the observations, the index k (k = 1, 2, ..., K) over the X-variables, and the index m (m = 1, 2, ..., M) over the Y-variables.
The linear PLS model finds A "new" variables, latent variables, denoted by t_a. These scores are linear combinations of the original variables x_k with the coefficients, "weights", w*_ka:

t_ia = Σ_k w*_ka x_ik    (1)

PLS computes the X-scores (the t_a's) to have certain advantageous properties. First of all, they are good predictors of Y, so that

y_im = Σ_a c_ma t_ia + f_im    (2)
Y = TC' + F    (2a)

in which the c_ma are the PLS Y-weights and the f_im the Y-residuals. The latter formulation (2a) expresses the model in matrix form. The residuals f_im express the deviations between the observed and modelled data, and comprise the elements of the Y-residual matrix F in (2a). Because of (1) and (2), the model can be rewritten as a regression model:

y_im = Σ_a c_ma Σ_k w*_ka x_ik + f_im = Σ_k b_mk x_ik + f_im    (3)

The PLS regression coefficients, b_mk, can be written as:

b_mk = Σ_a c_ma w*_ka    (4)

Secondly, the X-scores are few (A in number) and orthogonal, and are good summaries of X, so that the X-residuals e_ik in (5) are "small":

x_ik = Σ_a t_ia p_ka + e_ik    (5)
X = TP' + E    (5a)

In (5) above, the p_ka are the X-variable loadings. The latter equation (5a) is the X-model in matrix form. With multivariate Y (when M > 1), the Y-scores are good summaries of Y, so that the residuals g_im in (6) are "small":

y_im = Σ_a u_ia c_ma + g_im    (6)
Y = UC' + G    (6a)

Here, u_ia denotes the Y-scores and c_ma the Y-weights. The latter (6a) is the Y-model in matrix form. After each dimension a, the X-matrix is updated by subtracting t_ia p_ka from the elements x_ik. This means that the PLS model can alternatively be expressed in weights w_a referring to the residuals after the previous dimension, E_(a-1), instead of relating to the X-variables themselves (the weights w* in eq. 1). Thus, instead of (1), we have (7):

t_ia = Σ_k w_ka e_ik,a-1    (7)
e_ik,a-1 = e_ik,a-2 - t_i,a-1 p_k,a-1
e_ik,0 = x_ik

However, the weights w can be transformed to w*, which directly relate to X, giving (1) above. The relation between the two is given by:

W* = W (P'W)^-1    (8)
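A compact numerical reading of the algorithm above, for a single PLS dimension on centred data, is sketched below (Python/numpy). It follows equations (1)-(8) as we read them and is an illustration, not the authors' reference implementation.

    import numpy as np

    def pls_component(X, Y, tol=1e-10, max_iter=500):
        # One PLS dimension by the NIPALS-type iteration outlined in the appendix
        u = Y[:, [0]]                             # start u as one Y-column
        for _ in range(max_iter):
            w = X.T @ u / (u.T @ u)               # X-weights
            w /= np.linalg.norm(w)
            t = X @ w                             # X-scores
            c = Y.T @ t / (t.T @ t)               # Y-weights
            u_new = Y @ c / (c.T @ c)             # Y-scores
            if np.linalg.norm(u_new - u) < tol:
                u = u_new
                break
            u = u_new
        p = X.T @ t / (t.T @ t)                   # X-loadings
        return t, u, w, c, p

    rng = np.random.default_rng(0)
    X = rng.normal(size=(19, 4)); X -= X.mean(axis=0)
    Y = X @ rng.normal(size=(4, 2)) + 0.3 * rng.normal(size=(19, 2)); Y -= Y.mean(axis=0)

    t, u, w, c, p = pls_component(X, Y)
    X1 = X - t @ p.T                              # deflation of X before the next dimension (eq. 7)
    B = w @ np.linalg.inv(p.T @ w) @ c.T          # one-dimensional regression form, B = W (P'W)^-1 C' (eqs. 4, 8)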
REFERENCES

Blum, D.J.W. and R.E. Speece, 1990. Determining chemical toxicity to aquatic species. Environ. Sci. Technol. 24:284-293.
Box, G.E.P., W.G. Hunter and J.S. Hunter, 1978. Statistics for experimenters, J. Wiley and Sons, N.Y.
Deneer, J.W., T.L. Sinnige, W. Seinen and J.L.M. Hermens, 1987. Quantitative structure-activity relationships for the toxicity and bioconcentration factor of nitrobenzene derivatives towards the guppy (Poecilia reticulata). Aquatic Toxicol. 10:115-129.
Deneer, J.W., C.J. van Leeuwen, W. Seinen, J.L. Maas-Diepeveen and J.L.M. Hermens, 1989. QSAR study of the toxicity of nitrobenzene derivatives towards Daphnia magna, Chlorella pyrenoidosa and Photobacterium phosphoreum. Aquatic Toxicol. 15:83-98.
Draper, N.R. and H. Smith, 1981. Applied regression analysis, J. Wiley and Sons, N.Y.
El Tayar, N., R.S. Tsai, P.A. Carrupt and B. Testa, 1992. Octan-1-ol water partition coefficients of zwitterionic amino acids. Determination by centrifugal partition chromatography and factorization into steric/hydrophobic and polar components. J. Chem. Soc. Perkin Trans. 2:79-84.
Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Ann. Eugenics 7:179-188.
Frank, I.E. and J.H. Friedman, 1993. A statistical view of some chemometric regression tools. Technometrics 35:109-148.
Hermens, J.L.M., 1989. Quantitative structure-activity relationships of environmental pollutants. In: O. Hutzinger (ed.), The handbook of environmental chemistry, Vol. 2E, Springer Verlag, Berlin, Germany, pp. 111-162.
Jackson, J.E., 1991. A user's guide to principal components, Wiley-Interscience.
Jongman, R.G.H., C.J.F. ter Braak and O.F.R. van Tongeren, 1987. Data analysis in community and landscape ecology, Pudoc, Wageningen, The Netherlands.
Johnsen, S., A.T. Smith, J. Brendenhaug, H. Riksheim and A.L. Gjose, 1994. Identification of sources of acute toxicity in produced water. SPE 27138, pp. 1-8.
Lindgren, F., 1994. Third generation PLS - Some elements and applications. Ph.D. Thesis, Umeå University, Umeå, Sweden.
MODDE 2.1 manual, 1994. Umetri AB, P.O. Box 7960, 90719 Umeå, Sweden.
Mullet, G.M., 1976. Why regression coefficients have the wrong sign. J. Qual. Technol. 8:121-126.
Shao, J., 1993. Linear model selection by cross-validation. J. Amer. Stat. Assoc. 88:486-494.
Stewart, J.J.P., 1990. MOPAC manual, version 6.0. Frank J. Seiler Research Laboratory, U.S. Air Force Academy, CO.
SIMCA P 2.1 manual, 1994. Umetri AB, P.O. Box 7960, 90719 Umeå, Sweden.
Topliss, J.G. and R.P. Edwards, 1979. Chance factors in studies of quantitative structure-activity relationships. J. Med. Chem. 22:1238-1244.
Verhaar, H.J.M., L. Eriksson, M. Sjöström, G. Schüürmann, W. Seinen and J.L.M. Hermens, 1994. Modelling the toxicity of organophosphates: A comparison of the multiple linear regression and PLS regression methods. Quant. Struct.-Act. Relat. 13:133-143.
Wakeling, I.N. and J.J. Morris, 1993. A test of significance for partial least squares (PLS). J. Chemometrics 7:281-304.
Weast, R.C., 1987. Handbook of chemistry and physics, 67th ed., CRC Press, Boca Raton, FL.
Wold, S., 1978. Cross-validatory estimation of the number of components in factor and principal component models. Technometrics 20:387-405.
Wold, S., C. Albano, W.J. Dunn III, U. Edlund, K. Esbensen, P. Geladi, S. Hellberg, E. Johansson, W. Lindberg and M. Sjöström, 1984. Multivariate data analysis in chemistry. In: B.R. Kowalski (ed.), Chemometrics - Mathematics and statistics in chemistry, D. Reidel Publishing Company, Dordrecht, Holland, pp. 1-79.
Wold, S., 1995. PLS for multivariate linear modelling. In: H. van de Waterbeemd (ed.), QSAR: Chemometric methods in molecular design, Methods and principles in medicinal chemistry, Vol. 2, Verlag Chemie, Weinheim, Germany, pp. 195-218.
Wold, S. and L. Eriksson, 1995. Validation tools. In: H. van de Waterbeemd (ed.), QSAR: Chemometric methods in molecular design, Methods and principles in medicinal chemistry, Vol. 2, Verlag Chemie, Weinheim, Germany, pp. 309-318.

Received 23 January 1995; revised manuscript accepted 30 May 1995.